Jon Noring wrote:
> As an addendum to my prior message where I asked if there is an
> absolute ban on using the "<" character in attribute values (for
> well-formed XML documents) no matter how the "<" is represented.
>
> Googling around at various "authorities" on this topic I get different
> answers. I suppose this is to be expected. <laugh/>
Yes. Google is a fine thing, but the pages it indexes are not subjected
to any form of authority.
> To summarize, there are four mechanisms by which the "<" character may
> be included in an attribute value, some or all of which are illegal
> per XML well-formedness rules:
>
> 1) <foo bar="is x < y ?">
No.
> 2) <foo bar="is x &lt; y ?">
Yes.
> 3) <foo bar="is x &#60; y ?">
Yes.
> 4) <foo bar="is x &lessthan; y ?"
That is well-formed.
> a) where in the DTD we have <!ENTITY lessthan "<">
No, that's an invalid declaration.
> b) where in the DTD we have <!ENTITY lessthan "&lt;">
That's OK.
> c) where in the DTD we have <!ENTITY lessthan "&#60;">
So is that.
> From the latest XML spec (section 3.1, rule 41 and associated WFC),
> see http://www.w3.org/TR/REC-xml/#NT-AttValue , it says
>
> "No < in Attribute Values.
> "The replacement text of any entity referred to directly or
> indirectly in an attribute value MUST NOT contain a <."
>
> So it is clear from this that #1 and #4a are illegal. But the others
> are ambiguous (section 2.4 essentially says numeric character
> references are equivalent to the escape strings.) It partly seems to
> hinge around the definition of an "entity".
All these terms have their formal definition in SGML (ISO 8879:1986).
You may want to borrow a copy of Goldfarb, C, "The SGML Handbook" (OUP)
to check them out, but beware the formal standards-ese language (Charles
is a lawyer). XML has inherited these definitions with very few
changes.
To understand what happens may help: validity attaches to the state of
the characters making up the file at the time of parsing, without any
form of interpretation (i.e. no substitution of entity values for entity
references...yet). So a < in a CDATA attribute value is invalid, but
a &lt; or &#60; is valid because neither of them contains a literal <
character. Once validity is established, an application will receive
a data representation of the document from the parser, which includes
both the structural information (where the markup nodes were) and the
character data content information (where the document text is). This
is variously known in assorted circles as "the grove", "the
post-schema-validation infoset" and other terms. How it is presented
to the application varies, but at this stage all physical markup has
disappeared (or rather, been turned into pointers of some kind) and
all entity references and character references have been resolved.
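
To see both halves of this in one place, here is a minimal sketch using
Python's standard xml.etree.ElementTree (a non-validating, expat-based
parser, so it only checks well-formedness). The helper name
attribute_value and the four sample documents are my own illustration,
not anything from the spec:

```python
import xml.etree.ElementTree as ET


def attribute_value(document):
    """Return the parsed value of the 'bar' attribute,
    or None if the document is not well-formed."""
    try:
        return ET.fromstring(document).get("bar")
    except ET.ParseError:
        return None


# A literal "<" in the attribute value: rejected at parse time.
literal = attribute_value('<foo bar="is x < y ?"/>')

# The predefined entity and the two numeric character references:
# accepted, and all resolved to the same character for the application.
entity = attribute_value('<foo bar="is x &lt; y ?"/>')
decimal = attribute_value('<foo bar="is x &#60; y ?"/>')
hexadecimal = attribute_value('<foo bar="is x &#x3C; y ?"/>')

print(literal, entity, decimal, hexadecimal, sep="\n")
```

By the time the application sees the attribute, the three escaped
spellings are indistinguishable: each one has become a plain < in the
value.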
One way to get a handle on this (and to settle any other questions of
validity or invalidity) is to install a validating parser like onsgmls
or rxp, which run from the command line. onsgmls in particular is
useful (despite its now having some small areas of non-conformance)
in that it can output a format called ESIS, which is a line-by-line
echo of the markup interpretation. As an example, here is your XML
file:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE header [
<!ELEMENT header (#PCDATA)>
<!ATTLIST header title CDATA #REQUIRED>
<!ENTITY lessthan "&lt;">
]>
<header title="Is A &lessthan; B?"> ... </header>
and here is onsgmls's unsuppressed output (there's a -s option to turn
this off and simply report validity or not):
$ onsgmls -wxml /usr/share/sgml/xml.dcl test.xml
onsgmls:/usr/share/sgml/xml.dcl:1:W: SGML declaration was not implied
?xml version="1.0" encoding="ISO-8859-1"
Atitle CDATA Is A < B?
(header
- ...
)header
C
$
Ignore the warning about the SGML declaration for the moment. The ESIS
output clearly shows the data and markup being dissected and exposed
for processing. Lines beginning with A are attribute values, ( is the
start-tag of an element type, ) is the end-tag, - is character data,
and the final C reports that the document was found to be conforming.
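
If you don't have onsgmls to hand, something roughly similar can be
sketched with Python's standard xml.parsers.expat module. This is only
an approximation under several assumptions of mine: expat does not
validate, so the final C here means no more than "parsed without a
well-formedness error"; the attribute type is hard-coded as CDATA
because expat does not report it to the start-tag handler; and the
entity is declared as "&lt;", one of the permitted spellings. The name
esis_like is made up for the example.

```python
from xml.parsers import expat

DOCUMENT = """<!DOCTYPE header [
<!ELEMENT header (#PCDATA)>
<!ATTLIST header title CDATA #REQUIRED>
<!ENTITY lessthan "&lt;">
]>
<header title="Is A &lessthan; B?"> ... </header>"""


def esis_like(document):
    """Return ESIS-style lines for a document, using expat callbacks."""
    lines = []
    text = []  # character data may arrive in several chunks

    def flush_text():
        if text:
            lines.append("-" + "".join(text))
            text.clear()

    def start(name, attrs):
        flush_text()
        for attr_name, value in attrs.items():
            # Assumption: every attribute is reported as CDATA.
            lines.append(f"A{attr_name} CDATA {value}")
        lines.append("(" + name)

    def end(name):
        flush_text()
        lines.append(")" + name)

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text.append
    parser.Parse(document, True)  # raises ExpatError if not well-formed
    lines.append("C")  # reached only if parsing succeeded
    return lines


esis_lines = esis_like(DOCUMENT)
print("\n".join(esis_lines))
```

The attribute line shows a plain < even though the file itself never
contains one inside the attribute value, which is the point of the
exercise.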
> The plot thickens when looking at the 1998 first edition of the XML
> spec, http://www.w3.org/TR/1998/REC-xml-19...#sec-starttags .
> It says:
>
> "No < in Attribute Values.
> "The replacement text of any entity referred to directly or
> indirectly in an attribute value (other than "&lt;") must not
> contain a <."
Right. That means it mustn't contain a literal < sign like 4(a)
above. It may well resolve to a < sign at the end of the day, but
for the purposes of document validity we're only concerned with
the actual characters in the file, not what they represent.
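
A quick way to try the difference between 4(a) and 4(b) is sketched
below, again with Python's xml.etree.ElementTree. The TEMPLATE document
and the helper bar_with_entity are hypothetical names of mine; the
parser underneath is expat, which checks well-formedness only.

```python
import xml.etree.ElementTree as ET

# The entity value is filled in per case; the reference in the
# attribute value stays the same.
TEMPLATE = """<!DOCTYPE foo [
<!ENTITY lessthan "{value}">
]>
<foo bar="is x &lessthan; y ?"/>"""


def bar_with_entity(value):
    """Parse the template with the given entity value and return the
    resulting 'bar' attribute, or None if the parser rejects it."""
    try:
        return ET.fromstring(TEMPLATE.format(value=value)).get("bar")
    except ET.ParseError:
        return None


case_4a = bar_with_entity("<")      # literal < in the replacement text
case_4b = bar_with_entity("&lt;")   # escaped in the declaration

print("4a:", case_4a)
print("4b:", case_4b)
```

The literal declaration is refused as soon as the reference is expanded
inside the attribute value, while the escaped one goes through and ends
up as a < in the value handed to the application.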
> The difference between the current XML spec and the first 1998 spec
> is that in the 1998 spec it clearly says "&lt;" may be used to
> represent the literal "<" character in an attribute value (and I
> would assume, by extension in section 2.4, so would be &#60; or
> &#x3C;). So in the 1998 spec, #2 and #4b appear legal, and likely #3
> and #4c.
Yes, exactly correct.
> So what does the removal of the phrase '(other than "&lt;")' mean
> in the current XML spec edition? Was it removed because it is
> superfluous
Yes.
> (that is, &lt; and &#60; are not considered "any
> entity" -- this is supported in that in section 2.4 XML calls &lt; a
> "string", not an "entity".) Or was it a change to have a total,
> absolute ban on using that character no matter how it is represented?
It was just to avoid clouding the issue, so far as I know.
///Peter
--
XML FAQ:
http://xml.silmaril.ie/