XML entity parsing question

 
 
Tuomas Rannikko
      05-30-2006

Hello,

I'm currently writing an XML processor for the fun of it. There is
something I don't understand in the spec though. I'm obviously missing
something important.

The spec states that both Internal General and Character references are
included when referenced in content. And "included" means:

<quote>
4.4.2 Included

[Definition: An entity is included when its replacement text is
retrieved and processed, in place of the reference itself, as though it
were part of the document at the location the reference was recognized.]
The replacement text MAY contain both character data and (except for
parameter entities) markup, which MUST be recognized in the usual way.
(The string "AT&amp;T;" expands to "AT&T;" and the remaining ampersand
is not recognized as an entity-reference delimiter.) A character
reference is included when the indicated character is processed in place
of the reference itself.
</quote>

If I understand correctly, the specification contradicts itself when it
says the replacement text is processed in place of the reference itself
and markup MUST be recognized. Shouldn't the "&T;" in "AT&T;" then
actually BE recognized? I understand that if it actually were recognized
then the character '&' could not be expressed in XML (nor '<' for that
matter). The question is then, when should the markup in the replacement
text be recognized and when shouldn't it?

Thank you in advance for your reply.

- Tuomas
 
Philippe Poulard
      05-30-2006

hi,

read more here :
http://www.w3.org/TR/2004/REC-xml-20...predefined-ent
--
Cordialement,

///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !
 
Tuomas Rannikko
      05-30-2006
Philippe Poulard wrote:
>
> hi,
>
> read more here :
> http://www.w3.org/TR/2004/REC-xml-20...predefined-ent


Ah, yes.

But I still think the spec contradicts itself, or is at least somewhat
ambiguous on what the "Character" column means in the table in
http://www.w3.org/TR/2004/REC-xml-20040204/#entproc

I thought it meant character references:

Here is the definition for character reference
http://www.w3.org/TR/2004/REC-xml-20040204/#dt-charref
which is of course a numeric character reference.

And then, in the link you sent, it says character references are meant
to be considered character data, rather than being included as I thought
while looking at the table.

Actually, what does the Character column mean in the table?


- Tuomas


 
Philippe Poulard
      05-30-2006
Tuomas Rannikko wrote:
>
> Ah, yes.
>
> But I still think the spec contradicts itself,


the parser works like this:

"AT&amp;T;"
&amp; is an entity reference: let's replace it
"AT&#38;T;"
the spec says that we must process the replacement text
&#38; is a character reference: let's replace it
"AT&T;"
the character has been replaced, but not yet processed
"AT&T;"
now the character is said to be "included": stop processing it
& doesn't stand for an entity reference
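
The steps above can be sketched as a toy expander. This is only an illustration under stated assumptions, not a conforming parser: the names ENTITIES, REFERENCE and expand are made up for the example, only amp and lt are modelled, and the table holds their replacement texts as they stand once the declarations given in section 4.6 of the spec have been read.

```python
import re

# Replacement texts as they stand after the declarations in section 4.6
# of the spec have been read (assumption: only amp and lt are modelled).
ENTITIES = {"amp": "&#38;", "lt": "&#60;"}

# Decimal character reference, hex character reference, or entity reference.
REFERENCE = re.compile(r"&(?:#(\d+)|#x([0-9A-Fa-f]+)|([A-Za-z_:][\w.:-]*));")


def expand(text, entities=ENTITIES):
    """Expand references in content, scanning the input left to right once."""
    out = []
    pos = 0
    for match in REFERENCE.finditer(text):
        out.append(text[pos:match.start()])
        dec, hexa, name = match.groups()
        if dec is not None:
            # Character reference: the character goes straight to the output
            # as character data and is never looked at again.
            out.append(chr(int(dec)))
        elif hexa is not None:
            out.append(chr(int(hexa, 16)))
        else:
            # Entity reference: the replacement text is itself scanned for
            # references, which is the "processed" part of "included".
            out.append(expand(entities[name], entities))
        pos = match.end()
    out.append(text[pos:])
    return "".join(out)


print(expand("AT&amp;T;"))
print(expand("&#38;amp;"))
```

The '&' produced by a character reference lands in the output list and is never fed back to the scanner, so the "T;" or "amp;" that follows it cannot turn it into a new reference.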

 
Richard Tobin
      05-30-2006
Tuomas Rannikko wrote:

>But I still think the spec contradicts itself, or is at least somewhat
>ambiguous on what the "Character" column means in the table in
>http://www.w3.org/TR/2004/REC-xml-20040204/#entproc
>
>I thought it meant character references:


It does.

>And then, in the link you sent, it says character references are meant
>to be considered character data, rather than being included as I thought
>while looking at the table.


I think the definition of "Included" in 4.4.2 is unclear; it says

A character reference is included when the indicated character is
processed in place of the reference itself.

and "processed" does not mean that it is reparsed as is the case when
the replacement text of an entity is "processed". It's just, well,
included. "Processed as character data" might be better I suppose.

-- Richard
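
One way to check this reading is to ask an existing parser. The sketch below uses Python's xml.etree.ElementTree (which relies on expat); the entity names markup and escaped and the element names are invented for the test document. It compares an entity whose replacement text contains real markup with one whose replacement text contains a character reference for '<'.

```python
import xml.etree.ElementTree as ET

# "markup" has the replacement text <b>bold</b>, which is reparsed as markup.
# "escaped" is declared the way the spec declares lt: the literal
# "&#38;#60;b>" gives the replacement text "&#60;b>", and the character
# reference in it yields a '<' that is plain character data.
DOC = """<!DOCTYPE r [
  <!ENTITY markup "<b>bold</b>">
  <!ENTITY escaped "&#38;#60;b>">
]>
<r><one>&markup;</one><two>&escaped;</two><three>AT&amp;T;</three><four>AT&#38;T;</four></r>"""

root = ET.fromstring(DOC)

# Number of child elements created by each reference.
print(len(root.find("one")), len(root.find("two")))
# Character data seen by the application.
print(repr(root.find("two").text))
print(repr(root.find("three").text))
print(repr(root.find("four").text))
```

If the parser follows the interpretation discussed here, the first entity produces a child element while the second produces only text, and neither "AT&amp;T;" nor "AT&#38;T;" is reported as an undefined reference to T.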
 
Tuomas Rannikko
      05-30-2006
Philippe Poulard wrote:
> Tuomas Rannikko wrote:
>>
>> Ah, yes.
>>
>> But I still think the spec contradicts itself,

>
> the parser works like this:
>
> "AT&amp;T;"
> &amp; is an entity reference: let's replace it
> "AT&#38;T;"
> the spec says that we must process the replacement text
> &#38; is a character reference: let's replace it
> "AT&T;"
> the character has been replaced, but not yet processed
> "AT&T;"
> now the character is said to be "included": stop processing it
> & doesn't stand for an entity reference
>


Thanks for the answer, but this doesn't answer the question of what the
Character column means in the table.

I'm sorry for pushing on with this, but I can't get the meaning of the
column...

The spec says entities such as &amp; should be declared like this:

<!ENTITY amp "&#38;#38;">

Once this declaration is read, the "&#38;" is recognized, and the
replacement text of &amp; therefore becomes "&#38;", not "&#38;#38;".

The process you put forward is then slightly simpler:

"AT&amp;T;" in content --> "AT&#38;T;" --> "AT&T;"

The problem is, however, determining when to stop re-parsing the data,
and the same applies to the actual entity declaration; once "&#38;#38;"
is parsed to be "&#38;", if the '&' is "included" (as I read from the
table) then it is recognized as markup and "&#38;" becomes '&', which is
in turn recognized as markup...
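
The two stages can be sketched separately to show where the re-parsing stops. This is a rough model under stated assumptions: expand_char_refs and CHAR_REF are hypothetical names, only character references are handled, and the second call merely stands in for what happens when the entity is later included in content (a full processor would also recognize entity references and tags there). The literal "&#38;#38;" is the amp declaration from section 4.6 of the spec.

```python
import re

# Decimal or hexadecimal character reference.
CHAR_REF = re.compile(r"&#(?:(\d+)|x([0-9A-Fa-f]+));")


def expand_char_refs(text):
    """Expand character references in one left-to-right pass.

    re.sub never rescans the characters it has just substituted, which
    mirrors the rule that the indicated character is not reparsed.
    """
    def repl(match):
        dec, hexa = match.groups()
        return chr(int(dec)) if dec is not None else chr(int(hexa, 16))
    return CHAR_REF.sub(repl, text)


literal = "&#38;#38;"                      # as written in the declaration
replacement = expand_char_refs(literal)    # stage 1: declaration is read
included = expand_char_refs(replacement)   # stage 2: reference in content

print(literal, "->", replacement, "->", included)
```

Each stage expands a given character reference exactly once, so the chain ends after two steps instead of looping: the '&' produced in stage 2 is character data and no third pass is ever made over it.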

How I see it, character references are indeed supposed to be expanded
and then considered character data, not markup. Then if character
references are NOT to be "included", rather expanded and then "bypassed"
why doesn't the spec say so?

I quote the same bit of the spec again:

<quote>
4.4.2 Included

[Definition: An entity is included when its replacement text is
retrieved and processed, in place of the reference itself, as though it
were part of the document at the location the reference was recognized.]
The replacement text MAY contain both character data and (except for
parameter entities) markup, which MUST be recognized in the usual way.
(The string "AT&amp;T;" expands to "AT&T;" and the remaining ampersand
is not recognized as an entity-reference delimiter.) A character
reference is included when the indicated character is processed in place
of the reference itself.
</quote>

If nothing else is wrong with the spec, then the word "processed" has
multiple meanings within the same paragraph. The character references
are not to be "processed" in the same way as entity references, because
markup in the entity references' replacement text MUST be recognized and
parsed, tags, references and all.

"A character reference is included when the indicated character is
processed in place of the reference itself"... Now if I process the
indicated character, then in the case of "&#38;", it "indicates" the
character '&', which IS markup IF processed!?! The spec is in error when
stating that the "character is processed in place of the reference
itself." The character is expanded and then bypassed, not processed.

It is obvious the "included" rule, or the "processed" part of the rule,
does not apply to character references, otherwise escaping '&' and '<'
characters would be impossible.

The table still baffles me. Either the Character column means something
other than character references (which is unlikely), the spec is plainly
in error, or it is just too damn ambiguous for my "taste".

- Tuomas
 
Tuomas Rannikko
      05-30-2006
Richard Tobin wrote:
> I think the definition of "Included" in 4.4.2 is unclear; it says
>
> A character reference is included when the indicated character is
> processed in place of the reference itself.
>
> and "processed" does not mean that it is reparsed as is the case when
> the replacement text of an entity is "processed". It's just, well,
> included. "Processed as character data" might be better I suppose.
>


I agree. I put it in eh, a few, more words in my reply to Philippe.
Thanks for confirming I'm not missing the point. I started to get a bit
worried about my logic there

--

- Tuomas
 