XML - Preventing the UTF-8 Parser from converting an entity? |
|
|
|
|
#1 |
|
Hello all,

I'm having a little problem. The UTF-8 parser we are using converts the newline entity (&#10;) within an attribute that we are using to palliate CSS limitations. After the parser has gone through the document, the entity is converted to \n, which effectively tosses out the window the behavior we get by keeping the entity AS IS within the document.

Is there a clean and easy way around this? Any help will be greatly appreciated.

Regards
Jean-Francois Michaud |
|
|
|
|
#2 |
|
Posts: n/a
|
* Jean-François Michaud wrote in comp.text.xml:
>I'm having a little problem, The UTF-8 parser we are using converts the
>newline entity (&#10;) within an attribute that we are using to palliate
>CSS limitations.

I don't understand your question. First, &#10; is not an entity but a numeric character reference. Second, processing those is independent of character encodings like UTF-8. Third, I don't see what CSS limitation you might be referring to here.

>After the parser has gone through the document, the entity is converted
>to \n, which then effectively tosses out the window the behavior we are
>getting by keeping the entity AS IS within the document.

What is "\n" here? What do you mean by "converted"? What do you mean by keeping it? Processing white-space characters and character references to them in attribute values is explained in the XML specification. XML processors keep them to the extent that they are significant. If you connect the processor to a serializer, the input and output documents will be canonically equivalent unless one of them has a bug. So there should be no issue here.

-- 
Björn Höhrmann · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ |
|
|
|
#3 |
|
Posts: n/a
|
Jean-François Michaud wrote:
> I'm having a little problem, The UTF-8 parser we are using converts the
> newline entity (&#10;) within an attribute that we are using to palliate
> CSS limitations.

&#10; is neither an entity nor an entity reference; it is a numeric character reference. What is a "UTF-8 parser"?

> After the parser has gone through the document, the entity is converted
> to \n, which then effectively tosses out the window the behavior we are
> getting by keeping the entity AS IS within the document.

It is not clear what kind of tool you use or what you finally produce. If you want to serialize a DOM or an XSLT result tree to XML markup, and want that newline character to be escaped as &#10; (a numeric character reference), then you need an XML serializer that does that. If you want to serialize such a tree to HTML markup, then you need an HTML serializer that does that.

-- 
Martin Honnen
http://JavaScript.FAQTs.com/ |
|
|
|
#4 |
|
Posts: n/a
|
In article <. com>,
Jean-François Michaud <> wrote:
>After the parser has gone through the document, the entity is converted
>to \n, which then effectively tosses out the window the behavior we are
>getting by keeping the entity AS IS within the document.

>Is there a clean and easy way around this?

Not using XML. XML applications are effectively required to treat character references in content the same way that they treat the characters referred to. A conforming XML parser will convert it in the way you describe.

If you want to have something that's like a newline but is treated differently, then a character reference is not the right approach. That's not what they're for. Using an element such as <nl/> might be a better solution.

-- 
Richard |
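A minimal sketch of the behaviour described in this reply, assuming Python's standard `xml.etree.ElementTree` module rather than the parser used by the original poster. The element name `item` and the attribute name `label` are made up for illustration. It shows that once a document is parsed, the application sees an ordinary linefeed and has no way to tell that the source spelled it as a character reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: a character reference appears both in an
# attribute value and in element content.
SOURCE = '<item label="Heading&#10;">first&#10;second</item>'


def parse_item(xml_text: str) -> tuple[str, str]:
    """Return (attribute value, text content) as reported by the parser."""
    element = ET.fromstring(xml_text)
    return element.get("label"), element.text


label, text = parse_item(SOURCE)

# Both values now contain a real linefeed character (U+000A); the "&#10;"
# spelling from the source is gone.
print(repr(label))
print(repr(text))
```

Any tool that works on the parsed data therefore cannot key its behaviour on the reference itself, which is why a dedicated element is suggested instead.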
|
|
|
#5 |
|
Posts: n/a
|
Richard Tobin wrote:
> In article <. com>,
> Jean-François Michaud <> wrote:
>
> >After the parser has gone through the document, the entity is converted
> >to \n, which then effectively tosses out the window the behavior we are
> >getting by keeping the entity AS IS within the document.
>
> >Is there a clean and easy way around this?
>
> Not using XML. XML applications are effectively required to treat
> character references in content the same way that they treat the
> characters referred to. A conforming XML parser will convert it in
> the way you describe.
>
> If you want to have something that's like a newline but is treated
> differently, then a character reference is not the right approach.
> That's not what they're for. Using an element such as <nl/> might be
> a better solution.

Understood, but we are using a strange combination of XML + CSS under the VEX XML editor. We are displaying the attribute before a bit of text, but because of a silly CSS limitation (not being able to test for a condition in a :before pseudo-element), we thought that appending the &#10; character at the end of the string would do the trick. It does indeed work, but as soon as we save the document, the character gets converted to UTF-8 encoding. We HAVE to use this character because VEX doesn't deal with UTF-8 encoding directly to format its output.

Using an <nl/> element is simply not an option.

Regards
Jean-Francois Michaud |
|
|
|
#6 |
|
Posts: n/a
|
Bjoern Hoehrmann wrote:
> * Jean-François Michaud wrote in comp.text.xml:
> >I'm having a little problem, The UTF-8 parser we are using converts the
> >newline entity (&#10;) within an attribute that we are using to palliate
> >CSS limitations.
>
> I don't understand your question. First, &#10; is not an entity but a
> numeric character reference. Second, processing those is independent of
> character encodings like UTF-8. Third, I don't see what CSS limitation
> you might be referring to here.

Alright, let me clarify.

We allow numeric character references in our XML document so that special characters can be included in the output. These numeric sequences get converted to UTF-8 encoding for proper transformation into yet another XML document, which is then transformed into PDF using XSLT/XSL-FO. All the way through, the encoding has to be UTF-8, which is why the numeric sequences have to be converted.

The problem is that the XML editor we use to display the XML content (using XML + CSS) doesn't use UTF-8 encoded characters when dealing with formatting. It recognizes the &#10; character, but not the UTF-8 version of it.

The problem all stems from CSS not allowing me to test a condition while displaying with a :before pseudo-element (I can either display using :before, or I can test for a condition, but I can't do both at the same time. Yay for CSS!). The solution was to append the &#10; character at the end of the string attribute that we want to display, so that the carriage return only occurs when the string is non-empty. This works splendidly, but as soon as we save the document, the engine converts everything to UTF-8 encoding (booo!).

[snip]

Regards
Jean-Francois Michaud |
|
|
|
#7 |
|
Posts: n/a
|
>The solution was to append the &#10; character at the end of the string
>attribute

If you mean inside the attribute value... A properly functioning XML serializer should recognize line breaks within attribute values as a special case and escape them as necessary to write them back out, typically as &#10;.

However, the distinction between &#10;, CR, LF, and CRLF will not be preserved elsewhere. The only place where XML cares about the difference between these is in the details of attribute value normalization and serialization.

And while looking at the parsed version of the data (as output from the parser but not run back through a serializer), you will always see these as the newline character. I'm still not sure from your description which of these applies to your particular problem. You might want to post a very explicit description of what your source XML looks like, how you're viewing the result of the parse, and what you're seeing.

In any case, UTF-8 has nothing to do with any of the above; it's strictly XML behavior.

-- 
Joe Kesselman / Beware the fury of a patient man. -- John Dryden |
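As a rough illustration of the serializer behaviour described here, the sketch below parses a document and writes it straight back out. It assumes Python's standard `xml.etree.ElementTree` module as the parser and serializer (not the tool chain used in this thread), and the `item` and `label` names are hypothetical.

```python
import xml.etree.ElementTree as ET


def round_trip(xml_text: str) -> str:
    """Parse a document and serialize it back to a Unicode string."""
    return ET.tostring(ET.fromstring(xml_text), encoding="unicode")


# The same linefeed, spelled two different ways in the source.
decimal_form = round_trip('<item label="Heading&#10;"/>')
hex_form = round_trip('<item label="Heading&#xA;"/>')

# This serializer escapes the linefeed inside the attribute value, so it
# survives a later re-parse, but the original spelling is not remembered:
# both inputs come out with the same escaped form.
print(decimal_form)
print(hex_form)
```

Whether a given editor or XSLT engine escapes the character on output depends on its own serializer, so this only shows what a serializer of this kind can do, not what VEX does.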
|
|
|
#8 |
|
Posts: n/a
|
Personally, I'd recommend you discard CSS and switch to XSLT. CSS was
not designed for XML processing; XSLT was (and is more powerful than CSS). |
|
|
|
#9 |
|
Posts: n/a
|
Joseph Kesselman wrote:
> Personally, I'd recommend you discard CSS and switch to XSLT. CSS was
> not designed for XML processing; XSLT was (and is more powerful than CSS).

I know, that would have been my take also. The technology that we are using is the VEX XML editor. It allows users to update XML content as if they were in Word, which is not entirely uninteresting, but CSS is not advanced enough for this XML + CSS combo to work perfectly when more demanding formatting is necessary.

VEX unfortunately uses CSS to render the output on display. There is no way around this short of throwing everything in the garbage altogether, and that's just not gonna happen.

Regards
Jeff |
|
|
|
#10 |
|
Posts: n/a
|
In article < .com>,
Jean-François Michaud <> wrote:
>We are displaying the attribute before a bit of text

If the character is in an attribute, rather than content, it should be output as &#10; or an equivalent reference. This is because an ordinary linefeed would be normalised to a space character when the file is read in again.

>It does indeed
>work, but as soon as we save the document, the character gets converted
>to UTF-8 encoding.

Just to be clear about this: linefeed is an ASCII character, and is the same in UTF-8 as in ASCII.

>We HAVE to use this character because VEX doesn't
>deal with UTF-8 encoding directly to format its output.

I really don't understand this at all. The encoding is not relevant here. In your input file, you will have &#10;. A program that reads (parses) this will have a linefeed character in its data, using whatever internal encoding it happens to use. UTF-8 only becomes relevant when you output the file, and as I said, a linefeed in an attribute should be output as &#10; rather than a linefeed character.

-- 
Richard |
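The two points above (attribute value normalisation, and the encoding being irrelevant) can be sketched as follows. This assumes Python's standard `xml.etree.ElementTree` module, and the `item` and `label` names are invented for the example.

```python
import xml.etree.ElementTree as ET


def label_of(xml_text: str) -> str:
    """Return the 'label' attribute value as reported by the parser."""
    return ET.fromstring(xml_text).get("label")


# A linefeed written as a character reference survives parsing ...
with_reference = label_of('<item label="Heading&#10;"/>')

# ... whereas a literal linefeed typed into the attribute value is
# normalised to a space when the document is read.
with_literal = label_of('<item label="Heading\n"/>')

print(repr(with_reference))
print(repr(with_literal))

# The linefeed itself is the same single byte in ASCII and in UTF-8, so
# "converting to UTF-8" does not change the character.
ascii_bytes = "\n".encode("ascii")
utf8_bytes = "\n".encode("utf-8")
print(ascii_bytes == utf8_bytes)
```

So what is lost on save is the escaped spelling in the attribute, not anything about the encoding.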
|