
Re: Putting a "<" in an attribute value (was about CDATA sections)

 
 
Jon Noring
 
      11-15-2005
This is an addendum to my prior message, where I asked whether there
is an absolute ban on using the "<" character in attribute values (for
well-formed XML documents) no matter how the "<" is represented.

Googling around at various "authorities" on this topic I get different
answers. I suppose this is to be expected. <laugh/>

To summarize, there are four mechanisms by which the "<" character may
be included in an attribute value, some or all of which are illegal
per XML well-formedness rules:

1) <foo bar="is x < y ?">

2) <foo bar="is x &lt; y ?">

3) <foo bar="is x &#x003C; y ?">

4) <foo bar="is x &lessthan; y ?">

a) where in the DTD we have <!ENTITY lessthan "<">

b) where in the DTD we have <!ENTITY lessthan "&lt;">

c) where in the DTD we have <!ENTITY lessthan "&#x003C;">


From the latest XML spec (section 3.1, rule 41 and associated WFC),
see http://www.w3.org/TR/REC-xml/#NT-AttValue , it says

"No < in Attribute Values.
"The replacement text of any entity referred to directly or
indirectly in an attribute value MUST NOT contain a <."

So it is clear from this that #1 and #4a are illegal. But the others
are ambiguous (section 2.4 essentially says numeric character
references are equivalent to the escape strings). It seems to hinge
partly on the definition of an "entity".

The plot thickens when looking at the 1998 first edition of the XML
spec, http://www.w3.org/TR/1998/REC-xml-19...#sec-starttags .
It says:

"No < in Attribute Values.
"The replacement text of any entity referred to directly or
indirectly in an attribute value (other than "&lt;") must not
contain a <."


The difference between the current XML spec and the first 1998 spec
is that in the 1998 spec it clearly says "&lt;" may be used to
represent the literal "<" character in an attribute value (and I
would assume, by extension in section 2.4, so would be &#x003C; or
&#60;). So in the 1998 spec, #2 and #4b appear legal, and likely #3
and #4c.

So what does the removal of the phrase '(other than "&lt;")' mean
in the current XML spec edition? Was it removed because it is
superfluous (that is, &lt; and &#x003C; are not considered "any
entity"; this is supported by section 2.4, where XML calls &lt; a
"string", not an "entity")? Or was it a change to impose a total,
absolute ban on using that character no matter how it is represented?

An inquiring mind wants to know.

Jon

 
 
 
 
 
Peter Flynn
 
      11-15-2005
Jon Noring wrote:

> As an addendum to my prior message where I asked if there is an
> absolute ban on using the "<" character in attribute values (for
> well-formed XML documents) no matter how the "<" is represented.
>
> Googling around at various "authorities" on this topic I get different
> answers. I suppose this is to be expected. <laugh/>


Yes. Google is a fine thing, but the pages it indexes are not subjected
to any form of authority.

> To summarize, there are four mechanisms by which the "<" character may
> be included in an attribute value, some or all of which are illegal
> per XML well-formedness rules:
>
> 1) <foo bar="is x < y ?">


No.

> 2) <foo bar="is x &lt; y ?">


Yes.

> 3) <foo bar="is x &#x003C; y ?">


Yes.

> 4) <foo bar="is x &lessthan; y ?">


That is well-formed.

> a) where in the DTD we have <!ENTITY lessthan "<">


No, that's an invalid declaration.

> b) where in the DTD we have <!ENTITY lessthan "&lt;">


That's OK.

> c) where in the DTD we have <!ENTITY lessthan "&#x003C;">


So is that.

> From the latest XML spec (section 3.1, rule 41 and associated WFC),
> see http://www.w3.org/TR/REC-xml/#NT-AttValue , it says
>
> "No < in Attribute Values.
> "The replacement text of any entity referred to directly or
> indirectly in an attribute value MUST NOT contain a <."
>
> So it is clear from this that #1 and #4a are illegal. But the others
> are ambiguous (section 2.4 essentially says numeric character
> references are equivalent to the escape strings.) It partly seems to
> hinge around the definition of an "entity".


All these terms have their formal definition in SGML (ISO 8879:1986).
You may want to borrow a copy of Goldfarb, C, "The SGML Handbook" (OUP)
to check them out, but beware the formal standards-ese language (Charles
is a lawyer). XML has inherited these definitions with very few
changes.

It may help to understand what happens: well-formedness (and with it
validity) attaches to the state of the characters making up the file at
the time of parsing, without any form of interpretation (i.e. no
substitution of entity values for entity references...yet). So a < in a
CDATA attribute value is an error, but a &lt; or &#x3c; is fine because
neither of them contains a literal <
character. Once validity is established, an application will receive
a data representation of the document from the parser, which includes
both the structural information (where the markup nodes were) and the
character data content information (where the document text is). This
is variously known in assorted circles as "the grove", "the
post-schema-validation infoset" and other terms. How it is presented
to the application varies, but at this stage all physical markup has
disappeared (or rather, been turned into pointers of some kind) and
all entity references and character references have been resolved.
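As a rough check of cases 1 to 3, here is a minimal Python sketch. It assumes the standard-library xml.etree.ElementTree module (a non-validating, expat-based parser) is an acceptable stand-in for whatever parser you actually use. The tags are made self-closing so that each snippet is a complete document, and attribute_or_error is a name invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Cases 1-3 from the question, made self-closing so each is a full document.
CANDIDATES = {
    "literal <": '<foo bar="is x < y ?"/>',
    "&lt;": '<foo bar="is x &lt; y ?"/>',
    "&#x003C;": '<foo bar="is x &#x003C; y ?"/>',
}


def attribute_or_error(document):
    """Return the parsed value of bar, or None if the parser rejects the document."""
    try:
        return ET.fromstring(document).get("bar")
    except ET.ParseError:
        return None


results = {label: attribute_or_error(doc) for label, doc in CANDIDATES.items()}
for label, value in results.items():
    print(f"{label:10} -> {value!r}")
```

The literal form should be rejected, while the other two should hand the application the same resolved string, with the reference already replaced by the character.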

One way to get a handle on this (and to settle any other questions of
validity or invalidity) is to install a validating parser like onsgmls
or rxp, which run from the command line. onsgmls in particular is
useful (despite its now having some small areas of non-conformance)
in that it can output a format called ESIS, which is a line-by-line
echo of the markup interpretation. As an example, here is your XML
file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE header [
<!ELEMENT header (#PCDATA)>
<!ATTLIST header title CDATA #REQUIRED>
<!ENTITY lessthan "&lt;">
]>
<header title="Is A &lessthan; B?"> ... </header>

and here is onsgmls's unsuppressed output (there's a -s option to turn
this off and simply report validity or not):

$ onsgmls -wxml /usr/share/sgml/xml.dcl test.xml
onsgmls:/usr/share/sgml/xml.dcl:1:W: SGML declaration was not implied
?xml version="1.0" encoding="ISO-8859-1"
Atitle CDATA Is A < B?
(header
- ...
)header
C
$

Ignore the warning about the SGML declaration for the moment. The ESIS
output clearly shows the data and markup being dissected and exposed
for processing. Lines beginning with A are attribute values, ( is the
start-tag of an element type, ) is the end-tag, - is character data,
and C, always the last line, reports that the document was conforming.
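If onsgmls is not to hand, something loosely similar can be sketched with Python's standard xml.parsers.expat module. This is only an ESIS-like imitation, not the real format: expat is non-validating, the attribute type (CDATA) is not reported to the start-element handler and so is omitted, and no final C line is produced. The function name esis_like is made up for the example.

```python
import xml.parsers.expat

# The same test document as above, as bytes so the encoding declaration applies.
DOCUMENT = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE header [
<!ELEMENT header (#PCDATA)>
<!ATTLIST header title CDATA #REQUIRED>
<!ENTITY lessthan "&lt;">
]>
<header title="Is A &lessthan; B?"> ... </header>
"""


def esis_like(document):
    """Return ESIS-style lines (A, (, -, )) for the parsed document."""
    lines = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver character data in one piece

    def start(name, attrs):
        # Attribute values arrive with all references already resolved.
        for attr_name, value in attrs.items():
            lines.append(f"A{attr_name} {value}")
        lines.append(f"({name}")

    def chars(data):
        lines.append(f"-{data}")

    def end(name):
        lines.append(f"){name}")

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(document, True)
    return lines


print("\n".join(esis_like(DOCUMENT)))
```

The point of interest is the A line: the handler sees "Is A < B?", with &lessthan; resolved through &lt; to the character itself.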

> The plot thickens when looking at the 1998 first edition of the XML
> spec, http://www.w3.org/TR/1998/REC-xml-19...#sec-starttags .
> It says:
>
> "No < in Attribute Values.
> "The replacement text of any entity referred to directly or
> indirectly in an attribute value (other than "&lt;") must not
> contain a <."


Right. That means it mustn't contain a literal < sign like 4(a)
above. It may well resolve to a < sign at the end of the day, but
for the purposes of document validity we're only concerned with
the actual characters in the file, not what they represent.

> The difference between the current XML spec and the first 1998 spec
> is that in the 1998 spec it clearly says "&lt;" may be used to
> represent the literal "<" character in an attribute value (and I
> would assume, by extension in section 2.4, so would be &#x003C; or
> &#60;). So in the 1998 spec, #2 and #4b appear legal, and likely #3
> and #4c.


Yes, exactly correct.

> So what does the removal of the phrase '(other than "&lt;")' mean
> in the current XML spec edition? Was it removed because it is
> superfluous


Yes.

> (that is, &lt;, and &#x003C; are not considered "any
> entity" -- this is supported in that in section 2.4 XML calls &lt; a
> "string", not an "entity".) Or was it a change to have a total,
> absolute ban on using that character no matter how it is represented?


It was just to avoid clouding the issue, so far as I know.

///Peter
--
XML FAQ: http://xml.silmaril.ie/

 
 
 
 
 
David Håsäther
 
      11-15-2005
Peter Flynn wrote:

>> a) where in the DTD we have <!ENTITY lessthan "<">

>
> No, that's an invalid declaration.


It's not. The specification says this though:

Although the EntityValue production allows the definition of a
general entity consisting of a single explicit < in the literal
(e.g., <!ENTITY mylt "<">), it is strongly advised to avoid this
practice since any reference to that entity will cause a well-
formedness error.

-- http://www.w3.org/TR/REC-xml/#IDA2S1S
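The distinction between declaring such an entity and referring to it can be tried out with a short Python sketch. It assumes the standard-library xml.etree.ElementTree (expat underneath, non-validating) behaves like other conforming parsers here; is_well_formed is a helper name invented for the illustration, and the foo elements are self-closing so each string is a complete document.

```python
import xml.etree.ElementTree as ET

# Internal subset declaring an entity whose literal is a single explicit <.
DTD = '<!DOCTYPE foo [<!ENTITY mylt "<">]>'

declared_only = DTD + '<foo bar="no reference here"/>'
referenced = DTD + '<foo bar="is x &mylt; y ?"/>'


def is_well_formed(document):
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print("declared only:", is_well_formed(declared_only))
print("referenced:   ", is_well_formed(referenced))
```

The declaration on its own should be accepted; only the document that actually refers to &mylt; in the attribute value should be rejected.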

--
David Håsäther
 
 
Richard Tobin
 
      11-16-2005
Peter Flynn wrote:

>> c) where in the DTD we have <!ENTITY lessthan "&#x003C;">

>
>So is that.


No, see my other message, and also the comment in the spec about the
definition of the built-in entities. Both amp and lt need double
escaping in their definitions.
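A small Python sketch of the difference, assuming the standard-library xml.etree.ElementTree as the parser. The helper parse_bar is hypothetical, and the element is self-closing so the string is a complete document. The double-escaped value "&#38;#60;" is the form the spec uses when it writes out the declaration of lt.

```python
import xml.etree.ElementTree as ET


def parse_bar(entity_value):
    """Declare lessthan with the given literal, reference it in bar, return the result.

    Returns the resolved attribute value, or None if the document is rejected.
    """
    document = (
        f'<!DOCTYPE foo [<!ENTITY lessthan "{entity_value}">]>'
        '<foo bar="is x &lessthan; y ?"/>'
    )
    try:
        return ET.fromstring(document).get("bar")
    except ET.ParseError:
        return None


# 4(c): the character reference is expanded when the declaration is read,
# so the replacement text is a bare < and the reference is an error.
single = parse_bar("&#x003C;")

# Double escaping: the replacement text is the string &#60;, which is only
# turned into < when the attribute value itself is processed.
double = parse_bar("&#38;#60;")

# 4(b): entity references are bypassed in the literal, so the replacement
# text is &lt; and the reference is fine.
via_lt = parse_bar("&lt;")

print(single, double, via_lt, sep="\n")
```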

-- Richard
 
 
Richard Tobin
 
      11-16-2005
Jon Noring wrote:

>So what does the removal of the phrase '(other than "&lt;")' mean
>in the current XML spec edition?


In early drafts of the XML spec, the built-in entities - in particular
amp and lt - were "magic". It was pointed out that they could be
defined non-magically by use of double escaping, and the final spec
used this. The phrase you quote was probably a hangover from the
earlier version that was removed in an erratum when it was noticed
that it wasn't needed any more.

-- Richard
 
 
Peter Flynn
 
      11-16-2005
David Håsäther wrote:

> Peter Flynn wrote:
>
>>> a) where in the DTD we have <!ENTITY lessthan "<">

>>
>> No, that's an invalid declaration.

>
> It's not.


I'm sorry, you're quite right. I'm not sure what my brain was doing when
I wrote that. Making a reference to &lessthan; certainly would cause an
error, as you point out.

///Peter

 
 
 
 