Help parsing a text file

 
 
William Gill
Guest
Posts: n/a
 
      08-29-2011
I haven't done much with Python for a couple years, bouncing around
between other languages and scripts as needs suggest, so I have some
minor difficulty keeping Python functionality in my
head, but I can overcome that as the cobwebs clear. Though I do seem to
keep tripping over the same Py2 -> Py3 syntax changes (old habits die hard).

I have a text file with XML-like records that I need to parse. By
XML-like I mean records have proper opening and closing tags, but fields
don't have closing tags (they rely on line ends). Not all fields appear
in all records, but they do adhere to a defined sequence.

My initial passes into Python have been very unfocused (a scatter gun of
too many possible directions, yielding very messy results), so I'm
asking for some suggestions, or algorithms (possibly even examples) that
may help me focus.

I'm not asking anyone to write my code, just to nudge me toward a more
disciplined approach to a common task, and I promise to put in the
effort to understand the underlying fundamentals.
 
 
 
 
 
Philip Semanchuk
Guest
Posts: n/a
 
      08-29-2011

On Aug 29, 2011, at 2:21 PM, William Gill wrote:

> I haven't done much with Python for a couple years, bouncing around between other languages and scripts as needs suggest, so I have some minor difficulty keeping Python functionality in my head, but I can overcome that as the cobwebs clear. Though I do seem to keep tripping over the same Py2 -> Py3 syntax changes (old habits die hard).
>
> I have a text file with XML-like records that I need to parse. By XML-like I mean records have proper opening and closing tags, but fields don't have closing tags (they rely on line ends). Not all fields appear in all records, but they do adhere to a defined sequence.
>
> My initial passes into Python have been very unfocused (a scatter gun of too many possible directions, yielding very messy results), so I'm asking for some suggestions, or algorithms (possibly even examples) that may help me focus.
>
> I'm not asking anyone to write my code, just to nudge me toward a more disciplined approach to a common task, and I promise to put in the effort to understand the underlying fundamentals.


If the syntax really is close to XML, would it be all that difficult to convert it to proper XML? Then you have nice libraries like ElementTree to use for parsing.
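
A rough sketch of that idea, under stated assumptions: the thread never shows the actual file, so the sample layout, the "record" tag name, the field names, and the to_xml helper below are all hypothetical. Each field line gets a matching closing tag, the result is wrapped in a single root element, and ElementTree does the rest.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical sample; the real file layout is not shown in the thread.
SAMPLE = """\
<record>
<name>Alice
<city>Springfield
</record>
<record>
<name>Bob
</record>
"""

RECORD_TAGS = {"record"}  # assumed: the only tag that has a real closing tag
FIELD_RE = re.compile(r"^<(\w+)>(.*)$")  # "<tag>value" up to the line end


def to_xml(text, root="records"):
    """Close every field tag and wrap all records in one root element."""
    out = ["<%s>" % root]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = FIELD_RE.match(line)
        if match and match.group(1) not in RECORD_TAGS:
            tag, value = match.groups()
            # Escape &, < and > so the field value is valid XML text.
            out.append("<%s>%s</%s>" % (tag, escape(value), tag))
        else:
            # Record open/close tags are already well formed.
            out.append(line)
    out.append("</%s>" % root)
    return "\n".join(out)


tree = ET.fromstring(to_xml(SAMPLE))
records = [{field.tag: field.text for field in rec} for rec in tree]
```

With the sample above, records ends up as one dict per record, holding only the fields that record actually contains.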


Cheers
Philip
 
 
 
 
 
William Gill
Guest
Posts: n/a
 
      08-29-2011
On 8/29/2011 2:31 PM, Philip Semanchuk wrote:
>
> If the syntax really is close to XML, would it be all that difficult to convert it to proper XML? Then you have nice libraries like ElementTree to use for parsing.
>


Possibly, but I would still need the same search algorithms to find the
opening tag for the field, then find and replace the next line end with
a matching closing tag. So it seems to me that the starting point is
the same, and then it's my choice to either process the substrings
myself or employ something like ElementTree.
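
For comparison, a minimal sketch of the "process the substrings myself" route, with no XML library involved. The tag name "record", the FIELD_ORDER list and the sample lines are assumptions made up for illustration, since the real format is not shown here. Fields that a record omits are left as None, which reflects the statement that not every field appears in every record.

```python
# Hypothetical field names, listed in their defined sequence.
FIELD_ORDER = ["name", "city", "phone"]

# Hypothetical sample lines; any iterable of lines (such as an open file) works.
SAMPLE_LINES = [
    "<record>",
    "<name>Alice",
    "<city>Springfield",
    "</record>",
    "<record>",
    "<name>Bob",
    "</record>",
]


def parse_records(lines, record_tag="record"):
    """Yield one dict per record; fields missing from a record stay None."""
    open_tag = "<%s>" % record_tag
    close_tag = "</%s>" % record_tag
    current = None
    for raw in lines:
        line = raw.strip()
        if line == open_tag:
            # Start a fresh record with every known field preset to None.
            current = dict.fromkeys(FIELD_ORDER)
        elif line == close_tag:
            if current is not None:
                yield current
            current = None
        elif current is not None and line.startswith("<"):
            # "<tag>value": the tag ends at the first ">", the value at the line end.
            tag, sep, value = line[1:].partition(">")
            if sep and tag in current:
                current[tag] = value


parsed = list(parse_records(SAMPLE_LINES))
```

Because parse_records is a generator, it could be fed an open file object directly and would hold only one record in memory at a time.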
 
 
Thomas Jollans
Guest
Posts: n/a
 
      08-29-2011
On 29/08/11 20:21, William Gill wrote:
> I haven't done much with Python for a couple years, bouncing around
> between other languages and scripts as needs suggest, so I have some
> minor difficulty keeping Python functionality in my
> head, but I can overcome that as the cobwebs clear. Though I do seem to
> keep tripping over the same Py2 -> Py3 syntax changes (old habits die
> hard).
>
> I have a text file with XML-like records that I need to parse. By
> XML-like I mean records have proper opening and closing tags, but fields
> don't have closing tags (they rely on line ends). Not all fields appear
> in all records, but they do adhere to a defined sequence.
>
> My initial passes into Python have been very unfocused (a scatter gun of
> too many possible directions, yielding very messy results), so I'm
> asking for some suggestions, or algorithms (possibly even examples) that
> may help me focus.
>
> I'm not asking anyone to write my code, just to nudge me toward a more
> disciplined approach to a common task, and I promise to put in the
> effort to understand the underlying fundamentals.


A name that is often thrown around on this list for this kind of
question is pyparsing. Now, I don't know anything about it myself, but
it may be worth looking into.

Otherwise, if you say it's similar to XML, you might want to take a cue
from XML processing when it comes to dealing with the file. You could
emulate the stream-based approach taken by SAX or Expat: have methods
that handle the different events that can occur. For XML these are "start
tag", "end tag", "text node", "processing instruction", etc.; in your
case, they might be "start/end record", "field data", etc. That way, you
could separate the code that keeps track of the current record and how
the data fits together into an object structure from the parsing
code that knows how to convert a line of data into something meaningful.
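
To make that separation concrete, here is a small sketch under assumptions: the file format is not shown in this thread, so the "record" tag, the sample lines, and the names RecordCollector and feed are invented for illustration. The feed function only knows how to turn lines into events; the handler only knows how to assemble events into records.

```python
class RecordCollector:
    """Event handler: builds one dict per record from the events it receives."""

    def __init__(self):
        self.records = []
        self._current = None

    def start_record(self):
        self._current = {}

    def field(self, name, value):
        # Field data outside of a record is ignored.
        if self._current is not None:
            self._current[name] = value

    def end_record(self):
        if self._current is not None:
            self.records.append(self._current)
        self._current = None


def feed(lines, handler, record_tag="record"):
    """Parsing side: classify each line and fire the matching event."""
    open_tag = "<%s>" % record_tag
    close_tag = "</%s>" % record_tag
    for raw in lines:
        line = raw.strip()
        if line == open_tag:
            handler.start_record()
        elif line == close_tag:
            handler.end_record()
        elif line.startswith("<") and ">" in line:
            # "<tag>value": the value runs to the end of the line.
            name, _, value = line[1:].partition(">")
            handler.field(name, value)


# Hypothetical sample input.
SAMPLE = "<record>\n<name>Alice\n<city>Springfield\n</record>\n<record>\n<name>Bob\n</record>\n"

collector = RecordCollector()
feed(SAMPLE.splitlines(), collector)
```

A different handler with the same three methods (say, one that writes rows to a database) could be swapped in without touching feed.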

Thomas
 
 
Waldek M.
Guest
Posts: n/a
 
      08-30-2011
On Mon, 29 Aug 2011 23:05:23 +0200, Thomas Jollans wrote:
> A name that is often thrown around on this list for this kind of
> question is pyparsing. Now, I don't know anything about it myself, but
> it may be worth looking into.


Definitely. I have used it, and even though it's not perfect, it's very
useful indeed. Due to its nature it is not a demon of speed when parsing
complex and big structures, so you might want to keep that in mind.
But I whole-heartedly recommend it.

Br.
Waldek
 
 
JT
Guest
Posts: n/a
 
      09-01-2011
On Monday, August 29, 2011 1:21:48 PM UTC-5, William Gill wrote:
>
> I have a text file with XML-like records that I need to parse. By
> XML-like I mean records have proper opening and closing tags, but fields
> don't have closing tags (they rely on line ends). Not all fields appear
> in all records, but they do adhere to a defined sequence.


lxml can parse XML and broken HTML (see http://lxml.de/parsing.html).

- James

--
Bulbflow: A Python framework for graph databases (http://bulbflow.com)
 
 
William Gill
Guest
Posts: n/a
 
      09-01-2011
On 9/1/2011 1:58 PM, JT wrote:
> On Monday, August 29, 2011 1:21:48 PM UTC-5, William Gill wrote:
>>
>> I have a text file with XML-like records that I need to parse. By
>> XML-like I mean records have proper opening and closing tags, but fields
>> don't have closing tags (they rely on line ends). Not all fields appear
>> in all records, but they do adhere to a defined sequence.

>
> lxml can parse XML and broken HTML (see http://lxml.de/parsing.html).
>
> - James
>

Thanks to everyone.

Though I didn't get what I expected, it made me think more about the
reason I need to parse these files to begin with. So I'm going to do
some more homework on the overall business application and work backward
from there. Once I know how the data fits in the scheme of things, I
will create an appropriate abstraction layer, either from scratch, or
using one of the existing parsers mentioned, but I won't really know
that until I have finished modeling.




 
 
 
 