File parser

 
 
Angelic Devil
      08-29-2005

I'm building a file parser but I have a problem I'm not sure how to
solve. The files this will parse have the potential to be huge
(multiple GBs). There are distinct sections of the file that I
want to read into separate dictionaries to perform different
operations on. Each section has specific begin and end statements
like the following:

KEYWORD
..
..
..
END KEYWORD

The very first thing I do is read the entire file contents into a
string. I then store the contents in a list, splitting on line ends
as follows:


file_lines = file_contents.split('\n')


Next, I build smaller lists from the different sections using the
begin and end keywords:


begin_index = file_lines.index(begin_keyword)
end_index = file_lines.index(end_keyword)
small_list = file_lines[begin_index + 1:end_index]


I then plan on parsing each list to build the different dictionaries.
The problem is that one begin statement is a substring of another
begin statement as in the following example:


BAR
END BAR

FOOBAR
END FOOBAR


I can't just look for the line in the list that contains BAR because
FOOBAR might come first in the list. My list would then look like

[foobar_1, foobar_2, ..., foobar_n, ..., bar_1, bar_2, ..., bar_m]

I don't really want to use regular expressions, but I don't see a way
to get around this without doing so. Does anyone have any suggestions
on how to accomplish this? If regexps are the way to go, is there an
efficient way to parse the contents of a potentially large list using
regular expressions?
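
A side note on the substring worry, offered as a sketch rather than a definitive answer: assuming each keyword sits alone on its own line, list.index() compares whole elements for equality, so looking up 'BAR' never matches a 'FOOBAR' line. Only a substring test such as `'BAR' in line` would. The sample lines below are made up for illustration, and real input may need trailing whitespace or '\r' stripped before an exact match works.

```python
# Hypothetical sample data; the real file contents are not shown in the post.
file_lines = ['FOOBAR', 'f1', 'f2', 'END FOOBAR', 'BAR', 'b1', 'END BAR']

# list.index() uses ==, so 'BAR' only matches a line that is exactly 'BAR'.
begin_index = file_lines.index('BAR')
end_index = file_lines.index('END BAR')

# Slice ends are exclusive, so this keeps only the lines between the tags.
bar_section = file_lines[begin_index + 1:end_index]

# A substring test is what would wrongly hit the FOOBAR line first.
first_substring_hit = next(
    i for i, line in enumerate(file_lines) if 'BAR' in line
)
```

With exact comparison the order of the sections in the file does not matter, so regular expressions are not required for this part.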

Any help is appreciated!

Thanks,
Aaron

--
"Tis better to be silent and be thought a fool, than to speak and
remove all doubt."
-- Abraham Lincoln
 
William Park
      08-30-2005
Angelic Devil wrote:
> BAR
> END BAR
>
> FOOBAR
> END FOOBAR


man csplit

--
William Park, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
 
Rune Strand
      08-30-2005
It's not clear to me from your posting what possible order the tags may
be in. Assuming you will always END a section before beginning a new one,
e.g.

it's always:

A
some A-section lines.
END A

B
some B-section lines.
END B

etc.

And never:

A
some A-section lines.
B
some B-section lines.
END B
END A

etc.

it should be fairly simple. And if the file is several GB, you ought
to use a generator in order to overcome the memory problem.

Something like this:


def make_tag_lookup(begin_tags):
    # create a dict mapping each begin tag to its end tag: {begin_tag: end_tag}
    end_tags = ['END ' + begin_tag for begin_tag in begin_tags]
    return dict(zip(begin_tags, end_tags))


def return_sections(filepath, lookup):
    # Generator yielding each section as a list of lines

    inside_section = False

    # iterate over the file object itself so only one line is read at a
    # time; readlines() would load the whole file into memory
    with open(filepath, 'r') as f:
        for line in f:
            line = line.strip()
            if not inside_section:
                if line in lookup:  # whole-line match, so BAR != FOOBAR
                    inside_section = True
                    data_section = []
                    section_end_tag = lookup[line]
                    data_section.append(line)  # store section start tag
            else:
                if line == section_end_tag:
                    data_section.append(line)  # store section end tag
                    inside_section = False
                    yield data_section  # yield entire section
                else:
                    data_section.append(line)  # store each line within section


# create the generator yielding each section
# (datafile and list_of_begin_tags are placeholders for your own values)
sections = return_sections(datafile,
                           make_tag_lookup(list_of_begin_tags))

for section in sections:
    for line in section:
        print(line)
    print('\n')
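
To get from sections to the dictionaries the original question asks about, one possible follow-up is sketched below. It is self-contained and only an illustration: the names iter_sections and collect_sections, and the sample text, are hypothetical and not from the thread. It assumes non-nested sections and tags that occupy a whole line.

```python
import io
from collections import defaultdict


def iter_sections(lines, begin_tags):
    """Yield (begin_tag, body_lines) for each non-nested section."""
    end_tags = {tag: 'END ' + tag for tag in begin_tags}
    current = None  # begin tag of the section being read, if any
    body = []
    for raw in lines:
        line = raw.strip()
        if current is None:
            if line in end_tags:  # whole-line match, so BAR != FOOBAR
                current, body = line, []
        elif line == end_tags[current]:
            yield current, body
            current = None
        else:
            body.append(line)


def collect_sections(lines, begin_tags):
    """Group section bodies by begin tag; a tag may occur more than once."""
    grouped = defaultdict(list)
    for tag, body in iter_sections(lines, begin_tags):
        grouped[tag].append(body)
    return dict(grouped)


# Made-up sample input standing in for a real (much larger) file.
sample = io.StringIO(
    "FOOBAR\nf1\nf2\nEND FOOBAR\n\nBAR\nb1\nEND BAR\n"
)
result = collect_sections(sample, ['BAR', 'FOOBAR'])
```

For a multi-GB file, collect_sections would hold every section in memory at once, so it is probably better to process each (tag, body) pair as iter_sections yields it and keep only the derived data.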

 
MrJean1
      08-30-2005
Take a closer look at SimpleParse/mxTextTools

<http://www.python.org/pypi/SimpleParse/2.0.1a3>

We have used these to parse log files of several hundred MB with simple and
complex grammars of up to 250+ productions. Highly recommended.

/Jean Brouwers

PS) For an introduction see also this story
<http://www-128.ibm.com/developerworks/linux/library/l-simple.html>

 
infidel
      08-30-2005

Angelic Devil wrote:
> I'm building a file parser but I have a problem I'm not sure how to
> solve. The files this will parse have the potential to be huge
> (multiple GBs).
> [...]


Some time ago I was toying around with writing a tool in python to
parse our VB6 code (the original idea was to write our own .NET
conversion tool because the Wizard that comes with VS.NET sucks hard on
some things). I tried various parsing tools and EBNF grammars but VB6
isn't really an EBNF-esque syntax in all cases, so I needed something
else. VB6 syntax is similar to what you have, with all kinds of
different "Begin/End" blocks, and some files can be rather big. Also,
when you get to conditionals and looping constructs you can have
seriously nested logic, so the approach I took was to imitate a SAX
parser. I created a class that reads VB6 source line by line, and
calls empty "event handler" methods (just like SAX) such as
self.begin_type or self.begin_procedure and self.end_type or
self.end_procedure. Then I created a subclass that actually
implemented those event handlers by building a sort of tree that
represents the program in a more abstract fashion. I never got to the
point of writing the tree out in a new language, but I had fun hacking
on the project for a while. I think a similar approach could work for
you here.
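
A minimal sketch of what such a SAX-style, line-driven parser might look like for the KEYWORD / END KEYWORD format in this thread. This is not the VB6 tool described above: the class names SectionParser and DictBuilder and the handler names begin_section, section_line and end_section are hypothetical, chosen for illustration. A stack of open tags is assumed so that nested sections are handled too.

```python
class SectionParser(object):
    """Reads lines and fires begin/line/end events; handlers do nothing."""

    def __init__(self, begin_tags):
        self.end_tags = {tag: 'END ' + tag for tag in begin_tags}
        self.stack = []  # currently open begin tags, innermost last

    def parse(self, lines):
        for raw in lines:
            line = raw.strip()
            if self.stack and line == self.end_tags[self.stack[-1]]:
                self.end_section(self.stack.pop())
            elif line in self.end_tags:  # whole-line match on a begin tag
                self.stack.append(line)
                self.begin_section(line)
            elif self.stack:
                self.section_line(self.stack[-1], line)

    # Empty event handlers, meant to be overridden in a subclass.
    def begin_section(self, tag):
        pass

    def section_line(self, tag, line):
        pass

    def end_section(self, tag):
        pass


class DictBuilder(SectionParser):
    """Example subclass: collects each section's lines under its tag."""

    def __init__(self, begin_tags):
        SectionParser.__init__(self, begin_tags)
        self.sections = {}

    def begin_section(self, tag):
        self.sections.setdefault(tag, [])

    def section_line(self, tag, line):
        self.sections[tag].append(line)


# Made-up input with BAR nested inside FOOBAR.
builder = DictBuilder(['BAR', 'FOOBAR'])
builder.parse(['FOOBAR', 'f1', 'BAR', 'b1', 'END BAR', 'f2', 'END FOOBAR'])
```

Because parse() accepts any iterable of lines, an open file object can be passed in directly, which keeps memory use to one line at a time plus whatever the handlers choose to store.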

 
Mike C. Fletcher
      08-30-2005
infidel wrote:

>Angelic Devil wrote:
>
>

....

>Some time ago I was toying around with writing a tool in python to
>parse our VB6 code (the original idea was to write our own .NET
>conversion tool because the Wizard that comes with VS.NET sucks hard on
>some things). I tried various parsing tools and EBNF grammars but VB6
>isn't really an EBNF-esque syntax in all cases, so I needed something
>else.
>

....

You may find this project interesting to play with:
http://vb2py.sourceforge.net/index.html

Have fun,
Mike

--
________________________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://www.vrplumber.com
http://blog.vrplumber.com

 
Angelic Devil
      08-30-2005
"Rune Strand" <(E-Mail Removed)> writes:


Thanks. This shows definite promise. I've already tailored it for
what I need, and it appears to be working.


--
"Society in every state is a blessing, but Government, even in its best
state, is but a necessary evil; in its worst state, an intolerable one."
-- Thomas Paine
 