XML - Preventing the UTF-8 Parser from converting an entity?

 
Old 09-18-2006, 05:25 PM   #1


Hello all,

I'm having a little problem: the UTF-8 parser we are using converts the
newline entity ( ) within an attribute that we are using to palliate
CSS limitations.

After the parser has gone through the document, the entity is converted
to \n, which then effectively tosses out the window the behavior we are
getting by keeping the entity AS IS within the document.

Is there a clean and easy way around this?

Any help will be greatly appreciated.

Regards
Jean-Francois Michaud



Jean-François Michaud

Old 09-18-2006, 05:33 PM   #2
Bjoern Hoehrmann

* Jean-François Michaud wrote in comp.text.xml:
>I'm having a little problem, The UTF-8 parser we are using converts the
>newline entity ( ) within an attribute that we are using to palliate
>CSS limitations.


I don't understand your question. First, that is not an entity but a
numeric character reference. Second, processing those is independent of
character encodings like UTF-8. Third, I don't see what CSS limitation
you might be referring to here.

>After the parser has gone through the document, the entity is converted
>to \n, which then effectively tosses out the window the behavior we are
>getting by keeping the entity AS IS within the document.


What is "\n" here? What do you mean by "converted"? What do you mean by
keeping it? Processing white-space characters and character references
to them in attribute values is explained in the XML specification. XML
processors keep them to the extent that they are significant. If you
connect the processor to a serializer, the input and output documents
will be canonically equivalent unless one of them has a bug. So there
should be no issue here.
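
To illustrate the point that character references are processed independently of the document encoding, here is a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` as the parser (not the tool discussed in this thread), and the `item` element, `label` attribute, and `read_label` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are invented.
# The same markup is encoded twice: once as UTF-8, once as ISO-8859-1.
TEMPLATE = '<?xml version="1.0" encoding="{enc}"?><item label="Café&#10;"/>'


def read_label(encoding: str) -> str:
    """Encode the sample document with the given encoding, then parse it."""
    raw = TEMPLATE.format(enc=encoding).encode(encoding)
    return ET.fromstring(raw).get("label")


utf8_label = read_label("UTF-8")
latin1_label = read_label("ISO-8859-1")

# Both parses should yield the same string, ending in a real
# line feed character (U+000A), whatever the byte encoding was.
print(repr(utf8_label), repr(latin1_label))
```

The character reference is resolved by the XML parser itself, so the choice of byte encoding for the file does not change the attribute value the application sees.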
--
Björn Höhrmann · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Old 09-18-2006, 05:37 PM   #3
Martin Honnen



Jean-François Michaud wrote:


> I'm having a little problem, The UTF-8 parser we are using converts the
> newline entity ( ) within an attribute that we are using to palliate
> CSS limitations.


What you have there is neither an entity nor an entity reference but a
numeric character reference.
What is a "UTF-8 parser"?

> After the parser has gone through the document, the entity is converted
> to \n, which then effectively tosses out the window the behavior we are
> getting by keeping the entity AS IS within the document.


It is not clear what kind of tool you use and what you produce finally
but if you want to serialize a DOM or an XSLT result tree to XML markup
and want that newline character to be escaped as a numeric
character reference then you need an XML serializer that does that. If
you want to serialize such a tree to HTML markup then you need a HTML
serializer that does that.
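
As a rough sketch of what "a serializer that does that" can look like, the snippet below uses Python's standard `xml.etree.ElementTree`, whose serializer escapes a line feed inside an attribute value as `&#10;`. This is only an illustration with made-up names (`item`, `label`); it says nothing about how the tool in this thread serializes.

```python
import xml.etree.ElementTree as ET

# Build a tiny tree in memory; the attribute value ends with a real
# line feed character, which is what a parser would hand to the application.
item = ET.Element("item")
item.set("label", "Title\n")

# Serialize back to markup. ElementTree writes the line feed in the
# attribute as a numeric character reference instead of a raw newline.
serialized = ET.tostring(item, encoding="unicode")
print(serialized)
```

Whether a given serializer behaves this way has to be checked per tool; the in-memory value is the same either way.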

--

Martin Honnen
http://JavaScript.FAQTs.com/

Old 09-18-2006, 06:17 PM   #4
Richard Tobin

Jean-François Michaud wrote:

>After the parser has gone through the document, the entity is converted
>to \n, which then effectively tosses out the window the behavior we are
>getting by keeping the entity AS IS within the document.


>Is there a clean and easy way around this?


Not using XML. XML applications are effectively required to treat
character references in content the same way that they treat the
characters referred to. A conforming XML parser will convert it in
the way you describe.

If you want to have something that's like a newline but is treated
differently, then a character reference is not the right approach.
That's not what they're for. Using an element such as <nl/> might be
a better solution.

-- Richard

Old 09-18-2006, 07:35 PM   #5
Jean-François Michaud


Richard Tobin wrote:
> Jean-François Michaud wrote:
>
> >After the parser has gone through the document, the entity is converted
> >to \n, which then effectively tosses out the window the behavior we are
> >getting by keeping the entity AS IS within the document.

>
> >Is there a clean and easy way around this?

>
> Not using XML. XML applications are effectively required to treat
> character references in content the same way that they treat the
> characters referred to. A conforming XML parser will convert it in
> the way you describe.
>
> If you want to have something that's like a newline but is treated
> differently, then a character reference is not the right approach.
> That's not what they're for. Using an element such as <nl/> might be
> a better solution.


Understood, but we are using a strange combination of XML + CSS under
the VEX XML editor.

We are displaying the attribute before a bit of text, but because of a
silly CSS limitation (not being able to test for a condition in a
:before pseudo-element), we thought that appending the
character at the end of the string would do the trick. It does indeed
work, but as soon as we save the document, the character gets converted
to UTF-8 encoding. We HAVE to use this character because VEX doesn't
deal with UTF-8 encoding directly to format its output. Using an <nl/>
element is simply not an option.

Regards
Jean-Francois Michaud


Old 09-18-2006, 07:44 PM   #6
Jean-François Michaud


Bjoern Hoehrmann wrote:
> * Jean-François Michaud wrote in comp.text.xml:
> >I'm having a little problem, The UTF-8 parser we are using converts the
> >newline entity ( ) within an attribute that we are using to palliate
> >CSS limitations.

>
> I don't understand your question. First, that is not an entity but a
> numeric character reference. Second, processing those is independent of
> character encodings like UTF-8. Third, I don't see what CSS limitation
> you might be referring to here.


Alright, let me clarify. We allow numeric character references to be
included in our XML document so that special characters can be included
in the output. These numeric sequences get converted to UTF-8 encoding
for proper transformation into yet another XML document, which is then
transformed into PDF using XSLT/XSL-FO. All the way through, the encoding
has to be UTF-8, hence the reason why the numeric sequences have
to be converted to meet this restriction. The problem is that the XML
editor that we use to display the XML content (using XML + CSS) doesn't
use UTF-8 encoded characters when dealing with formatting. It
recognizes the character, but not the UTF-8 version of it.

The problem all stems from CSS not letting me test a
condition while displaying with a :before pseudo-element (I can either
display using :before, or I can test for a condition, but I can't do
both at the same time. Yay for CSS!).

The solution was to append the character at the end of the string
attribute that we want to display so that the carriage return only
occurs when the string is non-empty. This works splendidly, but as soon
as we save the document, the engine converts everything to UTF-8
encoding (booo!).

[snip]

Regards
Jean-Francois Michaud


Old 09-18-2006, 08:57 PM   #7
Joseph Kesselman

>The solution was to append the character at the end of the string
>attribute


If you mean inside the attribute value... A properly functioning XML
serializer should recognize line breaks within attribute values as a
special case and escape them as necessary to write them back out,
typically as a numeric character reference.

However, the distinction between the character reference, CR, LF, and
CRLF will not be preserved elsewhere. The only place where XML cares
about the difference between these is in the details of attribute value
normalization and serialization.

And while looking at the parsed version of the data (as output from the
parser but not run back through a serializer), you will always see these
as the newline character.

I'm still not sure from your description which of these applies to your
particular problem. You might want to post a very explicit description
of what your source XML looks like, how you're viewing the result of the
parse, and what you're seeing.

In any case, UTF-8 has nothing to do with any of the above; it's
strictly XML behaviors.
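
A small round-trip sketch may make this concrete. It assumes Python's standard `xml.etree.ElementTree` for both parsing and serializing, and the sample markup and the `round_trip` helper are invented for the example. The decimal and hexadecimal spellings of the reference parse to the same character, so the original spelling cannot survive a parse followed by a serialize.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same line feed reference in an attribute value.
decimal_form = '<item label="Title&#10;"/>'
hex_form = '<item label="Title&#xA;"/>'


def round_trip(markup: str) -> str:
    """Parse the markup, then serialize the resulting tree again."""
    return ET.tostring(ET.fromstring(markup), encoding="unicode")


decimal_out = round_trip(decimal_form)
hex_out = round_trip(hex_form)

# After parsing, both documents hold the same character, so the
# serializer has no way to tell which spelling the input used.
print(decimal_out)
print(hex_out)
```

The attribute value itself is preserved across the round trip; only the surface spelling of the reference is up to the serializer.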

--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden

Old 09-18-2006, 08:58 PM   #8
Joseph Kesselman

Personally, I'd recommend you discard CSS and switch to XSLT. CSS was
not designed for XML processing; XSLT was (and is more powerful than CSS).

Old 09-18-2006, 09:19 PM   #9
Jean-François Michaud


Joseph Kesselman wrote:
> Personally, I'd recommend you discard CSS and switch to XSLT. CSS was
> not designed for XML processing; XSLT was (and is more powerful than CSS).


I know, that would have been my take also. The technology that we are
using is the VEX XML editor. It allows users to update XML content as
if they were in Word, which is not entirely uninteresting, but CSS is
not advanced enough for this XML + CSS combo to work perfectly when
more demanding formatting is necessary. VEX unfortunately uses CSS to
render the output on display. No way around this short of throwing
everything in the garbage altogether, and that's just not gonna happen.

Regards
Jeff


Old 09-18-2006, 09:20 PM   #10
Richard Tobin

Jean-François Michaud wrote:
>We are displaying the attribute before a bit of text


If the character is in an attribute, rather than content, it should be
output as a numeric character reference (any equivalent form will do). This is because an
ordinary linefeed would be normalised to a space character when the
file is read in again.
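
The normalisation mentioned here can be seen with a short sketch, assuming Python's standard `xml.etree.ElementTree` (which parses with expat); the `item` element and `label` attribute are made up. A literal line feed in the attribute markup comes back as a space, while a character reference comes back as a line feed.

```python
import xml.etree.ElementTree as ET

# The "\n" in this Python string is a literal line feed inside the
# attribute value of the XML markup.
literal = ET.fromstring('<item label="Title\n"/>').get("label")

# Here the markup contains the character reference instead.
referenced = ET.fromstring('<item label="Title&#10;"/>').get("label")

# Attribute-value normalization turns the literal line feed into a
# space, but leaves the referenced one as a line feed character.
print(repr(literal))
print(repr(referenced))
```

This is why a serializer that writes the raw character into an attribute loses the line break the next time the file is parsed.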

>It does indeed
>work, but as soon as we save the document, the character gets converted
>to UTF-8 encoding.


Just to be clear about this: linefeed is an ASCII character, and is the
same in UTF-8 as in ASCII.
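
A quick check of this in Python (an illustration only, with invented variable names): the line feed encodes to the same single byte in ASCII and UTF-8, unlike a non-ASCII character such as é.

```python
linefeed = "\n"

# U+000A is in the ASCII range, so both encodings give the byte 0x0A.
ascii_bytes = linefeed.encode("ascii")
utf8_bytes = linefeed.encode("utf-8")

# For contrast, a non-ASCII character takes two bytes in UTF-8.
e_acute_utf8 = "\u00e9".encode("utf-8")

print(ascii_bytes, utf8_bytes, e_acute_utf8)
```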

>We HAVE to use this character because VEX doesn't
>deal with UTF-8 encoding directly to format its output.


I really don't understand this at all. The encoding is not relevant
here. In your input file, you will have a character reference. A program
that reads (parses) this will have a linefeed character in its data, using
whatever internal encoding it happens to use. UTF-8 only becomes
relevant when you output the file, and as I said a linefeed in an
attribute should be output as a character reference rather than a
linefeed character.


-- Richard