Why treat text nodes as nodes?

 
 
Xamle Eng
 
      05-13-2005
One of the things I find most unnatural about most XML APIs is that
they try to abstract both elements and text into some kind of "node"
object when they have virtually nothing in common. The reason these
APIs do it is to make it possible for both text and elements to be
children of elements.

But there is another way.

The XPath/XQuery data model does not allow two consecutive text nodes.
As far as I can tell, most XML processing software automatically merges
consecutive text nodes. This means that the number of text segments
directly under an element is bounded by the number of sub-elements plus 1
(PIs and comments may be treated as "pseudo-elements" for this
purpose). As a result, it is always possible to associate each text
segment with the element immediately preceding it within the parent and
associate the first text segment with the parent itself.

No more text nodes.

The only API I know that uses this trick is the ElementTree API for
Python by Fredrik Lundh (http://effbot.org/zone/element-index.htm).
Each Element object has text and tail properties for the text
immediately inside the element and the text following it within its
parent element. Elements always have a tag, attributes and zero or more
children - which are always other elements. No mixed types. The text
and tail attributes are always strings (or None when there is no text).
This model should be very convenient for statically-typed languages like
Java or C++. I find it ironic that this idea is probably used only in
Python, a dynamically typed language that is much more comfortable with
mixed data types.
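
A minimal sketch of the text/tail model, assuming the version of
ElementTree that now ships in Python's standard library as
xml.etree.ElementTree. The sample document is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: text before, between and after two child elements.
root = ET.fromstring("<doc>alpha<b>bravo</b>charlie<c/>delta</doc>")

# Text before the first child lives on the parent itself.
print(repr(root.text))

# Every other text segment hangs off the element that precedes it.
rows = [(child.tag, child.text, child.tail) for child in root]
for tag, text, tail in rows:
    print(tag, repr(text), repr(tail))
```

Note that the empty element `<c/>` reports its text as None rather than
an empty string, so callers that want plain strings need `text or ""`.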

This form of API is very suitable for data-oriented XML applications
that don't use mixed elements: for leaf elements just use the .text
attribute and ignore everything else. Container elements use the
element's children, which are always other elements. The text attribute
of an element can be ignored if it has children. No need to explicitly
skip it. Tails are always ignored, unless used to indent the output,
which can be done easily without disturbing the rest of the data.
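
The data-oriented case can be sketched roughly as follows, again
assuming the standard-library xml.etree.ElementTree. The sample document
and the to_data helper are hypothetical, and the dict-based result
assumes sibling tags are unique (repeated tags would overwrite each
other):

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<order>
  <id>42</id>
  <customer>
    <name>Ada</name>
    <city>London</city>
  </customer>
</order>
""".strip()

def to_data(elem):
    """Convert a data-oriented element into nested dicts and strings."""
    children = list(elem)
    if not children:
        # Leaf element: its value is just .text (None when empty).
        return elem.text
    # Container: only the child elements matter. The indentation
    # whitespace sits in .text and .tail, which are never looked at.
    return {child.tag: to_data(child) for child in children}

order = to_data(ET.fromstring(SAMPLE))
print(order)
```

No code is needed to skip whitespace between elements, because it never
shows up in the list of children.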

For document-oriented XML it may be slightly awkward to look at both
the text and tail but I don't think it should be any more difficult
than dealing with mixed data types.
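
For the document-oriented case, a small sketch of what "looking at both
the text and tail" involves, assuming the standard-library
xml.etree.ElementTree. The flatten helper and the sample paragraph are
hypothetical:

```python
import xml.etree.ElementTree as ET

def flatten(elem):
    """Return all text inside elem, in document order."""
    parts = [elem.text or ""]          # text before the first child
    for child in elem:
        parts.append(flatten(child))   # text inside the child
        parts.append(child.tail or "") # text after the child
    return "".join(parts)

para = ET.fromstring(
    "<p>Use <code>text</code> and <code>tail</code> together.</p>"
)
flat = flatten(para)
print(flat)
```

Current versions of the library offer the same traversal as
`"".join(para.itertext())`.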

The only real downside seems to be that this API is non-standard. But
the advantages can easily compensate for that.

Would you like to see an API like this in Java? Do you know of any
implementations of this idea in any language other than Python?

XE

 
 
 
 
 
Richard Tobin
 
      05-13-2005
Xamle Eng wrote:

>For document-oriented XML it may be slightly awkward to look at both
>the text and tail but I don't think it should be any more difficult
>than dealing with mixed data types.


It seems very unnatural to me. If you have

<p>See <a href="...">my page</a> for more details</p>

why on earth would you want to associate the text " for more details"
with the <a> element preceding it? The usual way of handling it -
some text, followed by an <a> element, followed by some more text - is
exactly right.

There are some applications where whitespace can usefully be
associated with the preceding element, but a general-purpose API
should not assume even that.

-- Richard
 
 
 
 
 
Xamle Eng
 
      05-14-2005
Richard Tobin wrote:
> Xamle Eng wrote:
>
> >For document-oriented XML it may be slightly awkward to look at both
> >the text and tail but I don't think it should be any more difficult
> >than dealing with mixed data types.

>
> It seems very unnatural to me. If you have
>
> <p>See <a href="...">my page</a> for more details</p>
>
> why on earth would you want to associate the text " for more details"
> with the <a> element preceding it?


As I said, this model is probably more natural for data-oriented XML,
but I think it's perfectly usable for document-oriented XML, too. It
preserves the structural information and makes it accessible to your
code in a form where everything has exactly one type, known in advance
at compile time. The tail association is totally arbitrary but it works
very well in practice. Try it. Write some code. Don't always trust your
initial gut reaction. I find that code using the ElementTree API is far
shorter and easier to read than with DOM or DOM-like APIs.

> There are some applications where whitespace can usefully be
> associated with the preceding element, but a general-purpose API
> should not assume even that.


It doesn't assume that. And it isn't "usefully" associated - it's
just a place to put it that is consistent, easy to access when you need
it and easier to ignore when you don't.

XE

 
 
Richard Tobin
 
      05-15-2005
Xamle Eng wrote:
>Try it. Write some code.


I don't think so. I have perfectly good interfaces already, I'm not going
to switch to an obviously silly interface because someone says "try it".

>It doesn't assume that. And it isn't "usefully" associated - it's
>just a place to put it that is consistent, easy to access when you need
>it and easier to ignore when you don't.


How is it "easy to access" when I have to keep hold of the previous item
to access it? And I have to do something different for the first text node
than for all the others.

-- Richard
 
 
Soren Kuula
 
      05-15-2005
Xamle Eng wrote:
> One of the things I find most unnatural about most XML APIs is that
> they try to abstract both elements and text into some kind of "node"
> object when they have virtually nothing in common. The reason these
> APIs do it is to make it possible for both text and elements to be
> children of elements.


With seven node types (element, attribute, text, NS node, comment, PI
and document/root), it won't be that much of a cleanup to remove one?

> But there is another way.
>
> The XPath/XQuery data model does not allow two consecutive text nodes.
> As far as I can tell, most XML processing software automatically merges
> consecutive text nodes. This means that the number of text segments
> directly under an element is bounded by the number of sub-elements plus 1
> (PIs and comments may be treated as "pseudo-elements" for this
> purpose). As a result, it is always possible to associate each text
> segment with the element immediately preceding it within the parent and
> associate the first text segment with the parent itself.


...then the first text segment is sort of semantically different from
the rest? It will be found on the parent -- the rest on its children?

> This model should be very
> convenient for statically-typed languages like Java or C++. I find it
> ironic that this idea is probably used only in Python, a dynamically
> typed language that is much more comfortable with mixed data types.


Yes the general Node type can make things look clumsy sometimes.
Polymorphism is for solving that ..., or generics:

Iterator<Element> children()
Iterator<Text> textNodes()
...etc. are no problem to implement efficiently.
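
A rough Python equivalent of such typed views over a DOM-style tree,
assuming the standard-library xml.dom.minidom. The names child_elements
and text_nodes are hypothetical helpers, not part of any DOM API:

```python
from xml.dom import Node, minidom

def child_elements(node):
    """Yield only the element children of a DOM node."""
    for child in node.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            yield child

def text_nodes(node):
    """Yield only the text children of a DOM node."""
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            yield child

doc = minidom.parseString(
    '<p>See <a href="...">my page</a> for more details</p>'
)
p = doc.documentElement
element_tags = [e.tagName for e in child_elements(p)]
text_data = [t.data for t in text_nodes(p)]
print(element_tags)
print(text_data)
```

Each view is uniform in type, but the relative order of text and
elements is lost unless the caller goes back to childNodes.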

> For document-oriented XML it may be slightly awkward to look at both
> the text and tail but I don't think it should be any more difficult
> than dealing with mixed data types.


It could get confusing that the first text segment under a parent is
treated differently from the rest -- you have to look it up on the parent.

> The only real downside seems to be that this API is non-standard. But
> the advantages can easily compensate for that.


Instead of mixed representation types in mixed contents, don't you just
get a pile of .tail references that you have to check for nullity as you
iterate over element contents? Not all that much better, I think (and
harder to describe).
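
What that iteration looks like can be sketched as below, assuming the
standard-library xml.etree.ElementTree. The content generator is a
hypothetical helper that hides the special case for the first text
segment and the checks for missing text:

```python
import xml.etree.ElementTree as ET

def content(elem):
    """Yield elem's content in document order: str for text, Element for children."""
    if elem.text:            # first segment is stored on the parent
        yield elem.text
    for child in elem:
        yield child
        if child.tail:       # later segments are stored on the preceding child
            yield child.tail

p = ET.fromstring('<p>See <a href="...">my page</a> for more details</p>')
items = [
    item if isinstance(item, str) else "<%s>" % item.tag
    for item in content(p)
]
print(items)
```

The result is the usual text, element, text sequence, at the cost of
putting mixed types back into a single stream.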

> Would you like to see an API like this in Java? Do you know of any
> implementations of this idea in any language other than Python?


No, I don't know of any. But the idea of replacing some parent-to-child
relationships in trees with sibling-to-sibling relationships is not at all
new.

Soren

 
 
Andy Dingley
 
      05-17-2005
On 13 May 2005 11:33:10 -0700, "Xamle Eng" wrote:

>As a result, it is always possible to associate each text
>segment with the element immediately preceding it within the parent and
>associate the first text segment with the parent itself.


I'll hold him down, someone else can break his fingers.

That's the most ****wittedly stupid idea I've read on the whole of
usenet in the last week.

The web is a great thing. Even "internet time" is quite fun, when it's
all rolling along nicely. But can we _please_ do without the clueless
muppet teenage genius code-jockeys who don't have the first bloody clue
about what's a good design and what's blecherous. Back in the day you'd
have written maybe 100k+ lines of something before you even got near
writing anything as fun as DOM-walking code. You might not be an expert
yet, but you gained some sense of smell for stinking bad designs.

Now any bloody idiot thinks they can re-invent important back-end
components, IE can't work out how to render a simple rectangular box and
my credit card gets pwned by Ukrainians because some muppet thought that
raw PHP made for a k00l file include mechanism.


--
Cats have nine lives, which is why they rarely post to Usenet.
 
 
Peter Flynn
 
      05-27-2005
Xamle Eng wrote:

> One of the things I find most unnatural about most XML APIs is that
> they try to abstract both elements and text into some kind of "node"
> object when they have virtually nothing in common. The reason these
> APIs do it is to make it possible for both text and elements to be
> children of elements.


It's because computer scientists feel compelled to treat the world as
tree-shaped. I agree it's wholly unnatural if you consider the
classical text document (a book) but XML -- unlike SGML -- isn't just
for text documents any more. This has had the unfortunate effect that
many otherwise level-headed people find it fashionable now to pretend
that XML isn't used for text documents at all any more, so they need
not be taken into consideration. You will even find programmers being
shocked to discover XML can be used for text documents.

> But there is another way.
>
> The XPath/XQuery data model does not allow two consecutive text nodes.


Worse, the wholly extraordinary decision in XSLT to elide white-space
nodes between adjacent element nodes *in mixed content* as part of the
"strip-space" feature is very strongly to be deprecated, as it breaks
the model of almost any heavily-marked text document.

[...]
> No more text nodes.
>
> The only API I know that uses this trick is the ElementTree API for
> Python by Fredrik Lundh (http://effbot.org/zone/element-index.htm).
> Each Element object has a text and tail property for the text
> immediately inside the element and text following it within its parent
> element. Elements always have a tag, attributes and zero or more
> children - which are always other elements. No mixed types.


This has been tried many times and found wanting. The most notorious
was perhaps the EuroMath DTD, which was possibly the only project to
implement it successfully!

[...]
> Would you like to see an API like this in Java? Do you know of any
> implementations of this idea in any language other than Python?


I think there are many other things I'd rather see first. YMMV.

///Peter
--
sudo sh -c "cd /;/bin/rm -rf `which killall kill ps shutdown mount gdb` *
&;top"
 
 
Fredrik Lundh
 
      05-28-2005
> clueless muppet teenage genius code-jockeys

lovely

mind if I quote you on the elementtree page?

</F>

 
 
Fredrik Lundh
 
      05-28-2005
> How is it "easy to access" when I have to keep hold of the previous item
> to access it? And I have to do something different for the first text node
> than for all the others.


if you don't understand how it works, how can you be so sure that it's
"obviously silly".

</F>

 