parsing to XML

 
 
steeve_dun@SoftHome.net
      10-06-2005
Hi everybody,
I have a document that includes definitions.
I want to parse the document and save these definitions in an
XML document.
Is there a simple way to do so?
Thank you!

Example:
#### beginning of document ####
\glossary{HTML} {HyperText Markup Language} is the lingua franca for
publishing hypertext on the \glossary {WWW}{World Wide Web}
#### end of document ####

What I want is something like:

#### beginning of xml output ####
<glossary>
<definition>
<word> HTML </word>
<meaning> HyperText Markup Language </meaning>
</definition>
<definition>
<word> WWW </word>
<meaning> World Wide Web </meaning>
</definition>
</glossary>
#### end of xml output ####


---steeve

 
Anno Siegel
      10-06-2005
<(E-Mail Removed)> wrote in comp.lang.perl.misc:
> Hi everybody,
> I have a document that includes definitions.
> I want to parse the document and save these definitions in an
> XML document.
> Is there a simple way to do so?
> Thank you!
>
> Example:
> #### beginning of document ####
> \glossary{HTML} {HyperText Markup Language} is the lingua franca for
> publishing hypertext on the \glossary {WWW}{World Wide Web}
> #### end of document ####


Your example doesn't show the variability of the data. Examples never
do, they only ever give a lower bound. There can always be a variant
that doesn't happen to appear in the example.

Can a "definition" span lines? Assuming that it can, you can't process
the text line-wise without major trickery. You'll need all of it in
memory. Here is a method that extracts the definitions from the text
and puts them in a hash:

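# The heredoc is single-quoted, so each backslash is taken literally.
my $text = <<'END_TEXT';
\glossary{HTML} {HyperText Markup Language} is the lingua franca for
publishing hypertext on the \glossary {WWW}{World Wide Web}
END_TEXT

# In list context, /g returns every captured (word, meaning) pair;
# the hash assignment turns that flat list into key/value entries.
my %definition_for = $text =~ /\\glossary\s*\{([^}]*)\}\s*\{([^}]*)\}/g;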
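
One caveat with the hash assignment above: a hash keeps neither document order nor repeated terms. If either matters, the same regex can be applied in a loop that collects pairs in an array. This is a minimal sketch; the helper name extract_definitions is hypothetical and not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [word, meaning] array refs,
# in the order the \glossary entries appear in the text.
sub extract_definitions {
    my ($text) = @_;
    my @pairs;
    # In scalar context, /g resumes after the previous match on each pass.
    while ( $text =~ /\\glossary\s*\{([^}]*)\}\s*\{([^}]*)\}/g ) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

my $text = <<'END_TEXT';
\glossary{HTML} {HyperText Markup Language} is the lingua franca for
publishing hypertext on the \glossary {WWW}{World Wide Web}
END_TEXT

my @definitions = extract_definitions($text);
print "$_->[0] => $_->[1]\n" for @definitions;
```

To get a whole file into memory rather than a heredoc, a common idiom is to undefine the input record separator locally (local $/;) before reading from the filehandle.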

Generating XML from the hash is probably a job for one of the XML modules.
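
For output as simple as the requested glossary, a dependency-free sketch with hand-written escaping may be enough to see the shape of the result. This swaps in manual string building for an XML module; a CPAN module such as XML::Writer would be the more robust choice. The helper names xml_escape and glossary_xml are hypothetical, and the sketch omits the spaces around the element text shown in the requested output.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML character data.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # must come first, or later entities get re-escaped
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Hypothetical helper: turn a list of [word, meaning] pairs into glossary XML.
sub glossary_xml {
    my (@pairs) = @_;
    my $xml = "<glossary>\n";
    for my $pair (@pairs) {
        my ( $word, $meaning ) = map { xml_escape($_) } @$pair;
        $xml .= "<definition>\n"
              . "<word>$word</word>\n"
              . "<meaning>$meaning</meaning>\n"
              . "</definition>\n";
    }
    return $xml . "</glossary>\n";
}

my %definition_for = (
    HTML => 'HyperText Markup Language',
    WWW  => 'World Wide Web',
);

# Sort the keys so the output order is stable from run to run.
print glossary_xml( map { [ $_, $definition_for{$_} ] } sort keys %definition_for );
```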

Anno
--
If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers.
 
steeve_dun@SoftHome.net
      10-07-2005
Thank you very much, that was very helpful!


 
robic0@yahoo.com
      10-09-2005
On 6 Oct 2005 02:42:04 -0700, (E-Mail Removed) wrote:

>Hi everybody,
>I have a document that includes definitions.
>I want to parse the document and save these definitions in an
>XML document.
>Is there a simple way to do so?
>Thank you!
>
>Example:
>#### beginning of document ####
>\glossary{HTML} {HyperText Markup Language} is the lingua franca for
>publishing hypertext on the \glossary {WWW}{World Wide Web}
>#### end of document ####
>
>What I want is something like:
>
>#### beginning of xml output ####
><glossary>
><definition>
> <word> HTML </word>
> <meaning> HyperText Markup Language </meaning>
></definition>
><definition>
> <word> WWW </word>
> <meaning> World Wide Web </meaning>
></definition>
></glossary>
>#### end of xml output ####
>
>
>---steeve

I don't understand.
If you have a "structure" in mind, you don't show it.
XML is purely "structure" driven...
input "to" a structure, output "from" a structure.
That's the definition of "simple" XML. If you want to
get into "complex" (nested) XML (and I don't know anybody who
does), that is beyond the scope of a question here, it seems.

So if you have a simple XML "structure" in mind (with attributes),
what would that be? You have to separate populating that
structure from generating the "simple" XML output.
A schema can then be generated once you know what you want to do.
The largest software houses, including M$****, use simple XML
because the XML is just a medium to transport structured data.
There should be none of the ambiguity that "nested" HTML could have.
You don't want to go down that road.

Also, I don't know what you mean by "definitions" in that (HTML?)
document. Just what is it you're trying to accomplish?


 