Information on XML overhead analysis

 
 
Rui Maciel
 
      03-01-2011
BGB wrote:

> I think it depends somewhat on the type of data.
>
> in my own binary XML format (SBXE), which is mostly used for compiler
> ASTs (for C and several other languages), I am often seeing an approx 6x
> to 9x size difference.
>
> most of the difference is likely that of eliminating redundant strings
> and tag names (SBXE handles both via MRU lists).
>
>
> grabbing a few samples (ASTs in both formats), and running them through
> gzip:
> textual XML compresses by around 29x;
> SBXE compresses by around 3.7x.
>
> the gzip'ed text XML is 1.1x (approx 10%) larger than the gzip'ed SBXE.
>
> so, purely for the sake of size (if GZIP can be reasonably used in a
> given context), binary XML is not really needed.
>
>
> the binary format is likely a little faster to decode, though, and as
> typically used, I don't use deflate.
>
> it is mostly used within the same program, and also for stuffing XML
> data into a few other misc binary formats.
>
>
> however, it can be noted that most common uses of XML don't involve the
> corresponding use of deflate, so a format which is partly compressed by
> default will still save much over one which is not compressed at all.
>
> so, one would still likely need a "special" file format (let's just call
> it ".xml.gz" or maybe ".xgz" for the moment...).


The problem with this concept is that if someone really needs a data-
interchange format which is lean and doesn't need to be human-readable
then that person is better off adopting (or even implementing) a format
which is lean and doesn't need to be human-readable. Once we start off by
picking a human-readable format and then mangling it to make it leaner,
we simply abandon the single most important justification (and maybe the
only one) for adopting that specific format.

Adding to that, if we adopt a human-readable format and are then forced
to implement some compression scheme so that we can use it for its
intended purpose, then we are needlessly complicating things, and even
adding yet another point of failure to our code. After all, if we are
forced to compress our human-readable format before we can use it, we
are basically adopting two different parsers to handle a single document
format: a decompressor and a parser which must be applied to the same
data stream in succession, all of that only to be able to encode/decode
and use the information.

Instead, if someone develops a binary format from the start and relies on
a single parser to encode and decode any data described through this
format then that person not only gets exactly what he needs but also ends
up with a lean format which requires a fraction of both resources and code
to be used.



Rui Maciel
 
BGB
 
      03-01-2011
On 3/1/2011 3:31 AM, Rui Maciel wrote:
> BGB wrote:
>
>> [snip]
>
> The problem with this concept is that if someone really needs a data-
> interchange format which is lean and doesn't need to be human-readable
> then that person is better off adopting (or even implementing) a format
> which is lean and doesn't need to be human-readable. Once we start off by
> picking a human-readable format and then mangling it to make it leaner,
> we simply abandon the single most important justification (and maybe the
> only one) for adopting that specific format.
>
> Adding to that, if we adopt a human-readable format and are then forced
> to implement some compression scheme so that we can use it for its
> intended purpose, then we are needlessly complicating things, and even
> adding yet another point of failure to our code. After all, if we are
> forced to compress our human-readable format before we can use it, we
> are basically adopting two different parsers to handle a single document
> format: a decompressor and a parser which must be applied to the same
> data stream in succession, all of that only to be able to encode/decode
> and use the information.
>
> Instead, if someone develops a binary format from the start and relies on
> a single parser to encode and decode any data described through this
> format then that person not only gets exactly what he needs but also ends
> up with a lean format which requires a fraction of both resources and code
> to be used.
>


well, for compiler ASTs, basically, one needs a tree-structured format,
and human readability is very helpful for debugging the thing (so one can
see more of what is going on inside the compiler).


now, there are many options here.
some compilers use raw structs;
some use S-Expressions;
...

my current compiler internally uses XML (mostly in the front-end),
largely because it tends to be a reasonably flexible way to represent
tree-structured data (more flexible than S-Expressions).
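
[For illustration, a hypothetical fragment (not necessarily this
compiler's actual schema) for the statement "x=y+1;" in both notations:

    <assign line="42">
      <ref name="x"/>
      <binary op="+">
        <ref name="y"/>
        <int value="1"/>
      </binary>
    </assign>

    (assign (ref x) (binary + (ref y) (int 1)))

The XML attributes let per-node annotations (here, a line number) be
attached to any node without disturbing the positions of its children,
which is awkward to retrofit onto positional S-Expressions.]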

however, yes, the current implementation does have some memory-footprint
issues, along with the data storage issues (using a DOM-like system eats
memory, and XML notation eats space).

a binary encoding can at least allow storing and decoding the trees more
quickly, using a little less space; what's more, my SBXE decoder is much
simpler than a full XML parser (and SBXE is the defined format for
representing these ASTs).
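
[For comparison with a full XML parser, the decode side of such a scheme
really is tiny; again a sketch matching the hypothetical encoder shown
earlier in the thread, not SBXE itself:

    #include <stdio.h>
    #include <string.h>

    #define MRU_MAX  64
    #define NAME_MAX 32

    static char mru[MRU_MAX][NAME_MAX];
    static int  mru_n = 0;

    /* same table maintenance as the encoder, kept in lockstep */
    static void mru_touch(const char *name, int i)
    {
        memmove(mru[1], mru[0], (size_t)i * sizeof(mru[0]));
        strcpy(mru[0], name);
    }

    /* read one tag name into buf; returns buf, or NULL at end of input
       (no validation; a real decoder would range-check the index) */
    static char *decode_tag(FILE *in, char buf[NAME_MAX])
    {
        int c = fgetc(in), i = 0;
        if (c == EOF)
            return NULL;
        if (c != 0xFF) {            /* index byte: an MRU hit */
            strcpy(buf, mru[c]);
            mru_touch(buf, c);
            return buf;
        }
        while ((c = fgetc(in)) > 0 && i < NAME_MAX - 1)
            buf[i++] = (char)c;     /* NUL-terminated string follows */
        buf[i] = '\0';
        if (mru_n < MRU_MAX)
            mru_n++;
        mru_touch(buf, mru_n - 1);
        return buf;
    }

No recursion, no character classes, no entity handling: just a byte
switch and a string table.]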


however, in some ways, XML is overkill for compiler ASTs, and possibly a
few features could be eliminated (to reduce memory footprint, creating a
subset):
raw text globs and CDATA;
namespaces;
...

so, the subset would only support tags and attributes.
however, as of yet, I have not adopted such a restrictive subset (text
globs, CDATA, namespaces, ... continue to be supported even if not
really used by the compiler).
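
[A sketch of an in-memory node for such a tags-and-attributes-only
subset; hypothetical C with made-up field names, not the compiler's
actual structures:

    typedef struct Attr Attr;
    typedef struct Node Node;

    struct Attr {
        char *name;     /* attribute name */
        char *value;    /* attribute value (always a string) */
        Attr *next;     /* singly-linked attribute list */
    };

    struct Node {
        char *tag;      /* element name; no namespace field needed */
        Attr *attrs;    /* attribute list, or NULL */
        Node *kids;     /* first child, or NULL */
        Node *next;     /* next sibling, or NULL */
    };

With text globs, CDATA, and namespaces gone, every node has the same
small shape, which is where the memory-footprint savings would come
from.]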

even a few extensions are supported, such as "BDATA" globs (basically,
for raw globs of binary data, although if printed textually, BDATA is
written out in hex). but, these are also not used for ASTs.
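
[The textual fallback for such a glob can be as simple as a hex dump; a
sketch, with the "<!BDATA ...>" spelling being a guess rather than the
actual SBXE notation:

    #include <stdio.h>

    static void print_bdata(FILE *out, const unsigned char *buf, size_t n)
    {
        size_t i;
        fputs("<!BDATA ", out);
        for (i = 0; i < n; i++)
            fprintf(out, "%02X", buf[i]);   /* two hex digits per byte */
        fputs(">", out);
    }
]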

although, a compromise is possible:
the in-memory nodes could eliminate raw text globs and CDATA as distinct
node types, yet still support them by internally moving the text into an
attribute on a special tag (such as "!TEXT").
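
[Concretely, a fragment with mixed content such as:

    <doc>some text<b>bold</b></doc>

would then be held in memory as if it had been written:

    <doc><!TEXT value="some text"/><b><!TEXT value="bold"/></b></doc>

so the node structure never needs a distinct text-node type, at the cost
of a rewrite step when reading and writing full XML.]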


or such...
 
Peter Flynn
 
      03-01-2011
On 01/03/11 10:11, Rui Maciel wrote:
> Roberto Waltman wrote:
>
>> I personally find that markup/data overheads of several hundred
>> percent are difficult to justify.
>>
>> Somehow related, see "Why the Air Force needs binary XML"
>> http://www.mitre.org/news/events/xml...an_keynote.pdf

>
> At first glance, that presentation is yet another example of how XML is
> inexplicably forced into inappropriate uses. The presentation basically
> states that the US air force needs to implement "seamless interoperability
> between the warfighting elements", which means adopting a protocol to
> handle communications, and then out of nowhere XML is presented as a
> given, without giving any justification why it is any good, let alone why
> it should be used. As if that wasn't enough, half of the presentation is
> then spent suggesting ways to try to mitigate one of XML's many problems,
> which incidentally consists of simply eliminating XML's main (and single?)
> selling point: being a human-readable format.
>
> So, it appears it's yet another example of XML fever, where people
> involved in decision-making are attracted to a technology due to marketing
> buzzwords instead of its technological merits.


[followups reset to c.t.x]

Which is why we don't hear a lot about it now. The interoperability
features of XML (plain text, robust structure, common syntax, etc.) are
ideal for open interop between multiple disparate systems, which is why
it works so well for applications like TEI. In the case of milcom, they
have the capacity to ensure that all stations are identical, not
disparate, and they also have absolute control over all the other stages
of messaging (capture, formation, passage, reception, and consumption),
so the argument for openness and disparity falls.

There is a well-intentioned tendency for milstd systems to be heavily
over-engineered. While redundancy, error-correction, encryption, and
other protective techniques are essential to message survival and
reconstruction in a low-bandwidth environment, XML precisely does *not*
address these aspects _per se_. Adding these to the design (at the
schema stage) adds significantly to the markup overhead, which is
typically already swollen by "design features" like unnecessarily long
names.

I see zero merit in using XML for realtime secure battle-condition
military messaging. Perhaps some potential enemies do.

///Peter
--
XML FAQ: http://xml.silmaril.ie/
 