Velocity Reviews


Malcolm McLean 08-23-2012 08:45 PM

Vanilla XML parser
 
As part of the binary image processing library work I had to load some XML
files. There doesn't seem to be a lightweight XML parser available on the web.
Plenty of bloated ones that require full-fledged installs. But nothing you
can just grab and compile.

So I decided to write a vanilla one myself. It did the job, and loaded my
data files. But it only weighs in as a single average-length source file.
That's partly because it only does ascii, doesn't handle defined entities
or special tags, and so on.

But is there the potential for this to be developed into a lightweight, single
file parser? There's also a question for Jacob here. The structure is simply
a tree. How would the container library map on to XML?

--
Vanilla XML Parser
http://www.malcolmmclean.site11.com/www

Les Cargill 08-23-2012 11:00 PM

Re: Vanilla XML parser
 
Malcolm McLean wrote:
> As part of the binary image processing library work I had to load some XML
> files. There doesn't seem to be a lightweight XML parser available on the web.
> Plenty of bloated ones that require full-fledged installs. But nothing you
> can just grab and compile.
>


If expat doesn't cut it, try ezxml.

http://ezxml.sourceforge.net/

> So I decided to write a vanilla one myself. It did the job, and loaded my
> data files. But it only weighs in as a single average-length source file.
> That's partly because it only does ascii, doesn't handle defined entities
> or special tags, and so on.
>
> But is there the potential for this to be developed into a lightweight, single
> file parser? There's also a question for Jacob here. The structure is simply
> a tree. How would the container library map on to XML?
>



--
Les Cargill

BGB 08-24-2012 04:33 PM

Re: Vanilla XML parser
 
On 8/23/2012 3:45 PM, Malcolm McLean wrote:
> As part of the binary image processing library work I had to load some XML
> files. There doesn't seem to be a lightweight XML parser available on the web.
> Plenty of bloated ones that require full-fledged installs. But nothing you
> can just grab and compile.
>
> So I decided to write a vanilla one myself. It did the job, and loaded my
> data files. But it only weighs in as a single average-length source file.
> That's partly because it only does ascii, doesn't handle defined entities
> or special tags, and so on.
>
> But is there the potential for this to be developed into a lightweight, single
> file parser? There's also a question for Jacob here. The structure is simply
> a tree. How would the container library map on to XML?
>


I did similar as well.

wrote a simple lightweight parser/printer and basic tree-manipulation
code (partly similar to DOM).


IIRC, I initially wrote it to support XML-RPC.
as-such, it uses a similar subset to that used by both XML-RPC and XMPP
(although it does support namespaces).


later it was used as the AST format for my first BGBScript VM
interpreter (later versions used S-Expression ASTs). (actually, the
first interpreter directly walked/interpreted these ASTs, but was soon
changed to "word-code", and later interpreters switched to bytecode with
a variable-length coding for many values, and more recently use threaded
code rather than directly interpreting the bytecode).

it was later utilized as the core of my C compiler project, where
basically XML trees were used as the main AST structure, and the API was
tweaked some to be better suited to compiler-related tasks.
(of course, the C compiler wasn't very good and subsequently "decayed"
mostly into a code-processing and metadata mining tool). sadly, I have
been unable to really justify the effort that would be required to "revive"
it as a full C compiler (probably using bytecode which would run in a
VM, and most likely executed as threaded-code).


or such...


Rui Maciel 08-26-2012 09:55 AM

Re: Vanilla XML parser
 
Malcolm McLean wrote:

> So I decided to write a vanilla one myself. It did the job, and loaded my
> data files. But it only weighs in as a single average-length source file.
> That's partly because it only does ascii, doesn't handle defined entities
> or special tags, and so on.


If the parser fails to parse valid XML then it isn't exactly an XML parser.
This isn't necessarily good or bad, much less a problem. Nevertheless,
there is a reason why XML parsers tend not to be tiny.


> But is there the potential for this to be developed into a lightweight,
> single file parser? There's also a question for Jacob here.


I suspect that the question you need to answer first is the following: do
you really need XML to begin with? In other words, isn't there any other
data format that fits your needs, is easier to parse and you are able to
adopt? JSON springs to mind, for example.

Following that, do you really need a parser that supports an entire generic
format in its full glory, or do you only need to parse a language which is a
subset of that format? In your post you mentioned that you developed your
parser as part of an image processing library. This leads me to suspect that
you might not really need to support every single feature of XML, or any
other generic data format. That being the case then your job is made a bit
simpler: you would only need to specify your data format and write a parser
for it. As a consequence, your parser will be significantly lighter and
more efficient.


Rui Maciel

Malcolm McLean 08-26-2012 05:49 PM

Re: Vanilla XML parser
 
On Sunday, 26 August 2012 10:55:10 UTC+1, Rui Maciel wrote:
> Malcolm McLean wrote:
>
>
> > So I decided to write a vanilla one myself. It did the job, and loaded my
> > data files. But it only weighs in as a single average-length source file.
> > That's partly because it only does ascii, doesn't handle defined entities
> > or special tags, and so on.

>
> If the parser fails to parse valid XML then it isn't exactly an XML parser.
> This isn't necessarily good or bad, much less a problem. Nevertheless,
> there is a reason why XML parsers tend not to be tiny.
>

The data has to be in XML format, to interchange with other programs.
But it's very simple - a few optional text fields, a few compulsory text
fields, width and height and an M x N variable list of cells. Then you
can have a list of any number of images in the file.
But it seemed a generic parser was the way to go, not to hardcode the fields
in the low level code. But I didn't want to throw a 5 MB executable at it.
But it seems to me that the majority of XML files are like this - you've
got tags, attributes, and text in your leaf tags. Recursively defined
"entities" and CDATA elements and all the other niggles are rare.
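
To make the "tags, attributes, and text in your leaf tags" shape concrete, here is a minimal sketch of the kind of tree that covers such files. Every name in it (xml_node, xml_attr, xml_find_child, xml_get_attr) is hypothetical and chosen for illustration; it is not the layout used by the parser linked in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; an illustration only. */
typedef struct xml_attr {
    const char *name;
    const char *value;
    struct xml_attr *next;      /* next attribute on the same tag */
} xml_attr;

typedef struct xml_node {
    const char *tag;            /* element name */
    const char *text;           /* text of a leaf tag, or NULL */
    xml_attr *attrs;            /* linked list of attributes */
    struct xml_node *child;     /* first child element */
    struct xml_node *next;      /* next sibling element */
} xml_node;

/* Return the first direct child of `parent` named `tag`, or NULL. */
const xml_node *xml_find_child(const xml_node *parent, const char *tag)
{
    const xml_node *n;
    if (parent == NULL || tag == NULL)
        return NULL;
    for (n = parent->child; n != NULL; n = n->next)
        if (n->tag != NULL && strcmp(n->tag, tag) == 0)
            return n;
    return NULL;
}

/* Return the value of attribute `name` on `node`, or NULL if absent. */
const char *xml_get_attr(const xml_node *node, const char *name)
{
    const xml_attr *a;
    if (node == NULL || name == NULL)
        return NULL;
    for (a = node->attrs; a != NULL; a = a->next)
        if (strcmp(a->name, name) == 0)
            return a->value;
    return NULL;
}
```

With a tree like this, the image loader only asks for the fields it knows about by name, so none of them need to be hardcoded in the low level parsing code.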




BGB 08-28-2012 03:52 AM

Re: Vanilla XML parser
 
On 8/26/2012 12:49 PM, Malcolm McLean wrote:
> On Sunday, 26 August 2012 10:55:10 UTC+1, Rui Maciel wrote:
>> Malcolm McLean wrote:
>>
>>
>>> So I decided to write a vanilla one myself. It did the job, and loaded my
>>> data files. But it only weighs in as a single average-length source file.
>>> That's partly because it only does ascii, doesn't handle defined entities
>>> or special tags, and so on.

>>
>> If the parser fails to parse valid XML then it isn't exactly an XML parser.
>> This isn't necessarily good or bad, much less a problem. Nevertheless,
>> there is a reason why XML parsers tend not to be tiny.
>>

> The data has to be in XML format, to interchange with other programs.
> But it's very simple - a few optional text fields, a few compulsory text
> fields, width and height and an M x N variable list of cells. Then you
> can have a list of any number of images in the file.
> But it seemed a generic parser was the way to go, not to hardcode the fields
> in the low level code. But I didn't want to throw a 5 MB executable at it.
> But it seems to me that the majority of XML files are like this - you've
> got tags, attributes, and text in your leaf tags. Recursively defined
> "entities" and CDATA elements and all the other niggles are rare.


yeah.

if the parser can parse the basic tag syntax (and, maybe, namespace
syntax, and maybe CDATA), and the "?xml" and "!DOCTYPE" tags, then this
is pretty much the entirety of XML that most programs need to support
for most documents.

in many cases, given "?xml" and "!DOCTYPE" are mostly just formalities
anyways, many documents omit them (either not identifying the document
type at all, or identifying it via a namespace).
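
as a rough sketch of treating "?xml" and "!DOCTYPE" as optional formalities: the helper below steps over whatever prolog is present and returns the start of the root element. the name xml_skip_prolog is made up for this example and is not from any parser mentioned in this thread. it assumes a NUL-terminated buffer in an ASCII-compatible encoding, and it does not try to cope with comments inside a DOCTYPE internal subset.

```c
#include <stddef.h>
#include <string.h>
#include <ctype.h>

/* Hypothetical helper: advance past an optional XML declaration, any
   comments or processing instructions, and an optional DOCTYPE.
   Returns a pointer to the '<' of the root element, or NULL if the
   prolog is unterminated or no element follows. */
const char *xml_skip_prolog(const char *p)
{
    if (p == NULL)
        return NULL;
    for (;;) {
        while (isspace((unsigned char)*p))
            p++;
        if (strncmp(p, "<?", 2) == 0) {              /* <?xml ...?> or other PI */
            p = strstr(p + 2, "?>");
            if (p == NULL) return NULL;
            p += 2;
        } else if (strncmp(p, "<!--", 4) == 0) {     /* comment */
            p = strstr(p + 4, "-->");
            if (p == NULL) return NULL;
            p += 3;
        } else if (strncmp(p, "<!DOCTYPE", 9) == 0) {
            int depth = 0;      /* nesting of the [ ... ] internal subset */
            char quote = 0;     /* active quote character, if any */
            for (p += 9; *p != '\0'; p++) {
                if (quote) { if (*p == quote) quote = 0; }
                else if (*p == '"' || *p == '\'') quote = *p;
                else if (*p == '[') depth++;
                else if (*p == ']') depth--;
                else if (*p == '>' && depth <= 0) break;
            }
            if (*p == '\0') return NULL;
            p++;
        } else {
            return *p == '<' ? p : NULL;             /* root element or junk */
        }
    }
}
```

a document that omits both tags just falls straight through to the last branch, so the same entry point works either way.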


so, a lot depends...


Malcolm McLean 08-28-2012 11:13 AM

Re: Vanilla XML parser
 
On Tuesday, 28 August 2012 04:55:13 UTC+1, BGB wrote:
> On 8/26/2012 12:49 PM, Malcolm McLean wrote:
>
> if the parser can parse the basic tag syntax (and, maybe, namespace
> syntax, and maybe CDATA), and the "?xml" and "!DOCTYPE" tags, then this
> is pretty much the entirety of XML that most programs need to support
> for most documents.
>

That was my thinking. Allowing recursive definition of "entities" complicates
things considerably. Maybe it should have a patch to support CDATA.
>
> in many cases, given "?xml" and "!DOCTYPE" are mostly just formalities
> anyways, many documents omit them (either not identifying the document
> type at all, or identifying it via a namespace).
>

It's always an issue, what to do with badly formatted input. The idea behind
the XML spec is that you can open the file in binary, then work out whether it
is ascii, big-endian unicode or little-endian unicode, by examining the first
few bytes. But I'm not currently supporting unicode, and the second file I
had to parse didn't have the ?xml tag.
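
For reference, a minimal sketch of that first-few-bytes check, assuming the file does start with "<?". The enum and the function name xml_sniff_encoding are hypothetical, and the sketch deliberately ignores byte order marks and UTF-32. A file with no ?xml tag comes back as ENC_UNKNOWN, and the caller has to fall back on a default.

```c
#include <stddef.h>

typedef enum {
    ENC_UNKNOWN,      /* no recognisable "<?" at the start */
    ENC_ASCII_LIKE,   /* ASCII, UTF-8, or another ASCII-compatible encoding */
    ENC_UTF16_BE,
    ENC_UTF16_LE
} xml_encoding;

/* Guess the encoding family from the first bytes of a file opened in
   binary mode, by looking at how "<?" is laid out in memory.
   Hypothetical helper; no BOM or UTF-32 handling. */
xml_encoding xml_sniff_encoding(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len < 2)
        return ENC_UNKNOWN;
    if (len >= 4) {
        /* 00 3C 00 3F: '<' '?' as big-endian 16-bit units */
        if (buf[0] == 0x00 && buf[1] == 0x3C && buf[2] == 0x00 && buf[3] == 0x3F)
            return ENC_UTF16_BE;
        /* 3C 00 3F 00: '<' '?' as little-endian 16-bit units */
        if (buf[0] == 0x3C && buf[1] == 0x00 && buf[2] == 0x3F && buf[3] == 0x00)
            return ENC_UTF16_LE;
    }
    /* 3C 3F: '<' '?' as single bytes */
    if (buf[0] == 0x3C && buf[1] == 0x3F)
        return ENC_ASCII_LIKE;
    return ENC_UNKNOWN;
}
```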

--
Check out the vanilla XML parser
http://www.malcolmmclean.site11.com/www

BGB 08-28-2012 05:12 PM

Re: Vanilla XML parser
 
On 8/28/2012 6:13 AM, Malcolm McLean wrote:
> On Tuesday, 28 August 2012 04:55:13 UTC+1, BGB wrote:
>> On 8/26/2012 12:49 PM, Malcolm McLean wrote:
>>
>> if the parser can parse the basic tag syntax (and, maybe, namespace
>> syntax, and maybe CDATA), and the "?xml" and "!DOCTYPE" tags, then this
>> is pretty much the entirety of XML that most programs need to support
>> for most documents.
>>

> That was my thinking. Allowing recursive definition of "entities" complicates
> things considerably. Maybe it should have a patch to support CDATA.


my parser ignores user-defined entities (all others are hard-coded), and
basically hard-codes CDATA.
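
to illustrate the hard-coded entity idea (this is a sketch, not the actual code from my parser): the five entities XML predefines get decoded in place, and anything else starting with '&' is passed through untouched. the name xml_decode_entities is made up for the example, and leaving numeric references alone is an assumption to keep it short.

```c
#include <stddef.h>
#include <string.h>

/* Decode the five predefined XML entities in place. Anything else that
   starts with '&' (user-defined entities, numeric references) is copied
   through unchanged. Hypothetical helper for illustration. */
void xml_decode_entities(char *s)
{
    static const struct { const char *name; char ch; } table[] = {
        { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' },
        { "&quot;", '"' }, { "&apos;", '\'' }
    };
    const size_t n = sizeof table / sizeof table[0];
    char *out = s;

    if (s == NULL)
        return;
    while (*s != '\0') {
        size_t i;
        for (i = 0; i < n; i++) {
            size_t len = strlen(table[i].name);
            if (strncmp(s, table[i].name, len) == 0) {
                *out++ = table[i].ch;   /* emit the decoded character */
                s += len;
                break;
            }
        }
        if (i == n)                     /* no entity matched: copy one byte */
            *out++ = *s++;
    }
    *out = '\0';
}
```

since the output is never longer than the input, decoding in place is safe, and a decoded '&' is never rescanned, so "&amp;lt;" comes out as "&lt;" rather than "<".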


>>
>> in many cases, given "?xml" and "!DOCTYPE" are mostly just formalities
>> anyways, many documents omit them (either not identifying the document
>> type at all, or identifying it via a namespace).
>>

> It's always an issue, what to do with badly formatted input. The idea behind
> the XML spec is that you can open the file in binary, then work out whether it
> is ascii, big-endian unicode or little-endian unicode, by examining the first
> few bytes. But I'm not currently supporting unicode, and the second file I
> had to parse didn't have the ?xml tag.
>


well, as noted: many files omit them.


my code generally assumes UTF-8 unless stated otherwise.

it is possible to detect the BOM in the case of Unicode, and this much
may be required for UTF-16 files.

so, text loading could look like:
BOM detected? read as UTF-16 or UTF-32 (maybe just repack as UTF-8);
looks like valid UTF-8? parse as UTF-8;
otherwise? guess (probably ASCII + codepages).
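
the three steps above could be sketched roughly as follows. the names (text_kind, looks_like_utf8, classify_text) are made up for this example and are not from my code; the sketch only classifies the buffer and leaves the repacking to UTF-8 to the caller.

```c
#include <stddef.h>

typedef enum {
    TEXT_UTF8, TEXT_UTF16_LE, TEXT_UTF16_BE,
    TEXT_UTF32_LE, TEXT_UTF32_BE, TEXT_GUESS
} text_kind;

/* Return 1 if buf[0..len) is well-formed UTF-8 (pure ASCII included). */
static int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra, k;
        unsigned long cp;
        if (c < 0x80) { i++; continue; }
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; cp = c & 0x1F; }
        else if (c >= 0xE0 && c <= 0xEF) { extra = 2; cp = c & 0x0F; }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; cp = c & 0x07; }
        else return 0;                      /* stray continuation or bad lead */
        if (len - i <= extra) return 0;     /* truncated sequence */
        for (k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        /* reject overlong forms, surrogates, and values past U+10FFFF */
        if ((extra == 2 && cp < 0x800) || (extra == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0;
        i += extra + 1;
    }
    return 1;
}

/* BOM detected? UTF-16/32. Valid UTF-8? UTF-8. Otherwise: guess. */
text_kind classify_text(const unsigned char *buf, size_t len)
{
    /* the UTF-32 LE BOM begins with the UTF-16 LE BOM, so test it first;
       FF FE 00 00 is treated as UTF-32 here, which is itself a guess */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0 && buf[3] == 0)
        return TEXT_UTF32_LE;
    if (len >= 4 && buf[0] == 0 && buf[1] == 0 && buf[2] == 0xFE && buf[3] == 0xFF)
        return TEXT_UTF32_BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return TEXT_UTF16_LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return TEXT_UTF16_BE;
    return looks_like_utf8(buf, len) ? TEXT_UTF8 : TEXT_GUESS;
}
```

plain ASCII passes the UTF-8 check, so it needs no separate case; TEXT_GUESS is where any codepage handling would go.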

my code largely ignores the existence of codepages, and even if I did
use them it is not clear I would go much beyond "Extended ASCII" / CP437
and/or CP1252 anyways (I was once tempted by CP437 for sake of
more-readily-addressable box-drawing characters, but ended up opting
with plain ASCII characters instead). these would just follow the CP ->
UTF-8 route anyways.

although the BOM is not strictly required for UTF-16 or 32, it is
usually present (text editors tend to emit it and often depend on its
presence).


in the situations I use my stuff for, it would be fairly unlikely to
encounter anything outside of ASCII range, and even then, something not
UTF-8 encoded.

the text editors I have also only really give a few options for saving:
ASCII, UTF-8, and UTF-16 (LE or BE).

another supports saving using codepages, but not readily (it involves a
sub-menu and going through a dialog box to enable these options for
"Save As"), with ASCII, UTF-8, and UTF-16 as the only "readily
available" options.

yeah, I think there is a pattern here...


jennywilkinson96@yahoo.co.uk 09-17-2012 02:48 PM

Re: Vanilla XML parser
 
On Thursday, August 23, 2012 9:45:16 PM UTC+1, Malcolm McLean wrote:
> As part of the binary image processing library work I had to load some XML
> files. There doesn't seem to be a lightweight XML parser available on the web.
> Plenty of bloated ones that require full-fledged installs. But nothing you
> can just grab and compile.
>
> So I decided to write a vanilla one myself. It did the job, and loaded my
> data files. But it only weighs in as a single average-length source file.
> That's partly because it only does ascii, doesn't handle defined entities
> or special tags, and so on.
>
> But is there the potential for this to be developed into a lightweight, single
> file parser? There's also a question for Jacob here. The structure is simply
> a tree. How would the container library map on to XML?
>
> --
> Vanilla XML Parser
> http://www.malcolmmclean.site11.com/www


I thought Notepad++ was pretty bland and basic. I have used Liquid Studio in comparison, and that is deliberately not vanilla: http://www.liquid-technologies.com/xml-editor.aspx

John Bode 09-19-2012 07:07 PM

Re: Vanilla XML parser
 
On Thursday, August 23, 2012 3:45:16 PM UTC-5, Malcolm McLean wrote:
> As part of the binary image processing library work I had to load some XML
> files. There doesn't seem to be a lightweight XML parser available on the web.
> Plenty of bloated ones that require full-fledged installs. But nothing you
> can just grab and compile.
>
> So I decided to write a vanilla one myself. It did the job, and loaded my
> data files. But it only weighs in as a single average-length source file.
> That's partly because it only does ascii, doesn't handle defined entities
> or special tags, and so on.
>
> But is there the potential for this to be developed into a lightweight, single
> file parser? There's also a question for Jacob here. The structure is simply
> a tree. How would the container library map on to XML?
>


I wrote my own XML parser for a project some years ago. It even
worked...mostly...after a couple of iterations.

If I had it to do over again I'd just go with expat and be done with it.
I'll take a little code bloat if it saves me some headaches in the end.


All times are GMT.
