Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Working with Huge Text Files

Reply
Thread Tools

Working with Huge Text Files

 
 
Lorn Davies
Guest
Posts: n/a
 
      03-19-2005
Hi there, I'm a Python newbie hoping for some direction in working with
text files that range from 100MB to 1G in size. Basically certain rows,
sorted by the first (primary) field maybe second (date), need to be
copied and written to their own file, and some string manipulations
need to happen as well. An example of the current format:

XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
|
| followed by like a million rows similar to the above, with
| incrementing date and time, and then on to next primary field
|
ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
|
| etc., there are usually 10-20 of the first field per file
| so there's a lot of repetition going on
|

The export would ideally look like this where the first field would be
written as the name of the file (XYZ.txt):

19930104, 93027, 2887, 7600, 40, 0, Z, N

Pretty ambitious for a newbie? I really hope not. I've been looking at
simpleParse, but it's a bit intense at first glance... not sure where
to start, or even if I need to go that route. Any help from you guys in
what direction to go or how to approach this would be hugely
appreciated.

Best regards,
Lorn

 
Reply With Quote
 
 
 
 
mensanator@aol.com
Guest
Posts: n/a
 
      03-19-2005

Lorn Davies wrote:
> Hi there, I'm a Python newbie hoping for some direction in working

with
> text files that range from 100MB to 1G in size. Basically certain

rows,
> sorted by the first (primary) field maybe second (date), need to be
> copied and written to their own file, and some string manipulations
> need to happen as well. An example of the current format:
>
> XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
> XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
> |
> | followed by like a million rows similar to the above, with
> | incrementing date and time, and then on to next primary field
> |
> ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
> |
> | etc., there are usually 10-20 of the first field per file
> | so there's a lot of repetition going on
> |
>
> The export would ideally look like this where the first field would

be
> written as the name of the file (XYZ.txt):
>
> 19930104, 93027, 2887, 7600, 40, 0, Z, N
>
> Pretty ambitious for a newbie? I really hope not. I've been looking

at
> simpleParse, but it's a bit intense at first glance... not sure where
> to start, or even if I need to go that route. Any help from you guys

in
> what direction to go or how to approach this would be hugely
> appreciated.
>
> Best regards,
> Lorn


You could use the csv module.

Here's the example from the manual with your sample data in a file
named simple.csv:

import csv
reader = csv.reader(file("some.csv"))
for row in reader:
print row

"""
['XYZ', '04JAN1993', '9:30:27', '28.87', '7600', '40', '0', 'Z', 'N ']
['XYZ', '04JAN1993', '9:30:28', '28.87', '1600', '40', '0', 'Z', 'N ']
['ABC', '04JAN1993', '9:30:27', '28.875', '7600', '40', '0', 'Z', 'N ']
"""

The csv module while bring each line in as a list of strings.
Of course, you want to process each line before printing it.
And you don't just want to print it, you want to write it to a file.

So after reading the first line, open a file for writing with the
first field (row[0]) as the file name. Then you want to process
fields row[1], row[2] and row[3] to get them in the right format
and then write all the row fields except row[0] to the file that's
open for writing.

On every subsequent line you must check to see if row[0] has changed,
so you'll have to store row[0] in a variable. If it's changed, close
the file you've been writing to and open a new file with the new
row[0]. Then continue processing lines as before.

It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.

 
Reply With Quote
 
 
 
 
Michael Hoffman
Guest
Posts: n/a
 
      03-19-2005
wrote:

> It will only be this simple if you can guarantee that the original
> file is actually sorted by the first field.


And if not you can either sort the file ahead of time, or just keep
reopening the files in append mode when necessary. You could sort them
in memory in your Python program but given the size of these files I
think one of the other alternatives would be simpler.
--
Michael Hoffman
 
Reply With Quote
 
mensanator@aol.com
Guest
Posts: n/a
 
      03-19-2005

wrote:
> Lorn Davies wrote:
> > Hi there, I'm a Python newbie hoping for some direction in working

> with
> > text files that range from 100MB to 1G in size. Basically certain

> rows,
> > sorted by the first (primary) field maybe second (date), need to be
> > copied and written to their own file, and some string manipulations
> > need to happen as well. An example of the current format:
> >
> > XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
> > XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
> > |
> > | followed by like a million rows similar to the above, with
> > | incrementing date and time, and then on to next primary field
> > |
> > ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
> > |
> > | etc., there are usually 10-20 of the first field per file
> > | so there's a lot of repetition going on
> > |
> >
> > The export would ideally look like this where the first field would

> be
> > written as the name of the file (XYZ.txt):
> >
> > 19930104, 93027, 2887, 7600, 40, 0, Z, N
> >
> > Pretty ambitious for a newbie? I really hope not. I've been looking

> at
> > simpleParse, but it's a bit intense at first glance... not sure

where
> > to start, or even if I need to go that route. Any help from you

guys
> in
> > what direction to go or how to approach this would be hugely
> > appreciated.
> >
> > Best regards,
> > Lorn

>
> You could use the csv module.
>
> Here's the example from the manual with your sample data in a file
> named simple.csv:


Obviously, I meant "some.csv". Make sure the name in the program
matches the file you want to process, or pass the input file name
to the program as an argument.

>
> import csv
> reader = csv.reader(file("some.csv"))
> for row in reader:
> print row
>
> """
> ['XYZ', '04JAN1993', '9:30:27', '28.87', '7600', '40', '0', 'Z', 'N

']
> ['XYZ', '04JAN1993', '9:30:28', '28.87', '1600', '40', '0', 'Z', 'N

']
> ['ABC', '04JAN1993', '9:30:27', '28.875', '7600', '40', '0', 'Z', 'N

']
> """
>
> The csv module while bring each line in as a list of strings.
> Of course, you want to process each line before printing it.
> And you don't just want to print it, you want to write it to a file.
>
> So after reading the first line, open a file for writing with the
> first field (row[0]) as the file name. Then you want to process
> fields row[1], row[2] and row[3] to get them in the right format
> and then write all the row fields except row[0] to the file that's
> open for writing.
>
> On every subsequent line you must check to see if row[0] has changed,
> so you'll have to store row[0] in a variable. If it's changed, close
> the file you've been writing to and open a new file with the new
> row[0]. Then continue processing lines as before.
>
> It will only be this simple if you can guarantee that the original
> file is actually sorted by the first field.


 
Reply With Quote
 
cwazir@yahoo.com
Guest
Posts: n/a
 
      03-19-2005
Hi,

Lorn Davies wrote:

> ..... working with text files that range from 100MB to 1G in size.
> .....
> XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
> XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
> .....


I've found that for working with simple large text files like this,
nothing beats the plain old built-in string operations. Using a parsing
library is convenient if the data format is complex, but otherwise it's
overkill.
In this particular case, even the csv module isn't much of an
advantage. I'd just use split.

The following code should do the job:

data_file = open('data.txt', 'r')
months = {'JAN':'01', 'FEB':'02', 'MAR':'03', 'APR':'04', 'MAY':'05',
'JUN':'06', 'JUL':'07', 'AUG':'08', 'SEP':'09', 'OCT':'10', 'NOV':'11',
'DEC':'12'}
output_files = {}
for line in data_file:
fields = line.strip().split(',')
filename = fields[0]
if filename not in output_files:
output_files[filename] = open(filename+'.txt', 'w')
fields[1] = fields[1][5:] + months[fields[1][2:5]] + fields[1][:2]
fields[2] = fields[2].replace(':', '')
fields[3] = fields[3].replace('.', '')
print >>output_files[filename], ', '.join(fields[1:])
for filename in output_files:
output_files[filename].close()
data_file.close()

Note that it does work with unsorted data - at the minor cost of
keeping all output files open till the end of the entire process.

Chirag Wazir
http://chirag.freeshell.org

 
Reply With Quote
 
Al Christians
Guest
Posts: n/a
 
      03-19-2005
I did some similar stuff way back about 12-15 years ago -- in 640k
MS-DOS with gigabyte files on 33 MHz machines. I got good performance,
able to bring up any record out of 10 million or so on the screen in a
couple of seconds (not using Python, but that should not make much
difference, maybe even some things in Python would make it work better.)

Even though my files were text, I read them as random-access binary
files. You need to be able to dive in at an arbitrary point in the
file, read a chunk of data, split it up into lines, discarding any
partial lines at the beginning and end, pull out the keys and see where
you are. Even with a gigabyte of file, if you are reading a decent size
chunk, you can binary search down to the spot you want in 15-20 tries or
so. That's the first time, but after that you've got a better idea
where to look. Use a dictionary to save the information from each chunk
to give you an index to get a headstart on the next search. If you can
keep 10k to 100k entries in your index, you can do 1000's of searches or
so before you even have to worry about having too many index entries.

I did learn that on 32-bit hardware, doing a binary search of a file
over a gigabyte will fail if you calculate the next place to look as
(a+b)/2, because a+b can be more than 2GB and overflow. You gotta do
(a + (b-a)/2)


Al
 
Reply With Quote
 
Al Christians
Guest
Posts: n/a
 
      03-19-2005
Michael Hoffman wrote:

> wrote:
>
>> It will only be this simple if you can guarantee that the original
>> file is actually sorted by the first field.

>
>
> And if not you can either sort the file ahead of time, or just keep
> reopening the files in append mode when necessary. You could sort them
> in memory in your Python program but given the size of these files I
> think one of the other alternatives would be simpler.


There used to be a very nice sort program for PC's that came from
someplace in Nevada. It cost less than $100 and could sort files
faster than most programming languages could read or write them. For
linux, you've gotta figure out the posix sort. If you do, please splain
it to me.

Al
 
Reply With Quote
 
Lorn Davies
Guest
Posts: n/a
 
      03-19-2005
Thank you all very much for your suggestions and input... they've been
very helpful. I found the easiest apporach, as a beginner to this, was
working with Chirag's code. Thanks Chirag, I was actually able to read
and make some edit's to the code and then use it... woohooo!

My changes are annotated with ##:

data_file = open('G:\pythonRead.txt', 'r')
data_file.readline() ## this was to skip the first line
months = {'JAN':'01', 'FEB':'02', 'MAR':'03', 'APR':'04', 'MAY':'05',
'JUN':'06', 'JUL':'07', 'AUG':'08', 'SEP':'09', 'OCT':'10', 'NOV':'11',
'DEC':'12'}
output_files = {}
for line in data_file:
fields = line.strip().split(',')
length = len(fields[3]) ## check how long the field is
N = 'P','N'
filename = fields[0]
if filename not in output_files:
output_files[filename] = open(filename+'.txt', 'w')
if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
## This line above doesn't work, can't figure out how to struct?
fields[1] = fields[1][5:] + months[fields[1][2:5]] +
fields[1][:2]
fields[2] = fields[2].replace(':', '')
if length == 6: ## check for 6 if not add a 0
fields[3] = fields[3].replace('.', '')
else:
fields[3] = fields[3].replace('.', '') + '0'
print >>output_files[filename], ', '.join(fields[1:5])
for filename in output_files:
output_files[filename].close()
data_file.close()

The main changes were to create a check for the length of fields[3], I
wanted to normalize it at 6 digits... the problem I can seee with it
potentially is if I come across lengths < 5, but I have some ideas to
fix that. The other change I attempted was a criteria for what to print
based on the value of fields[8] and fields[6]. It didn't work so well.
I'm a little confused at how to structure booleans like that... I come
from a little experience in a Pascal type scripting language where "x
and y" would entail both having to be true before continuing and "x or
y" would mean either could be true before continuing. Python, unless
I'm misunderstanding (very possible), doesn't organize it as such. I
thought of perhaps using a set of if, elif, else statements for
processing the fileds, but didn't think that would be the most
elegant/efficient solution.

Anyway, any critiques/ideas are welcome... they'll most definitely help
me understand this language a bit better. Thank you all again for your
great replies and thank you Chirag for getting me up and going.

Lorn

 
Reply With Quote
 
cwazir@yahoo.com
Guest
Posts: n/a
 
      03-20-2005
Lorn Davies wrote:

> if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
> ## This line above doesn't work, can't figure out how to struct?


In Python you would need to phrase that as follows:
if (fields[8] == 'N' or fields[8] == 'P') and (fields[6] == '0'
or fields[6] == '1'):
or alternatively:
if (fields[8] in ['N', 'P']) and (fields[6] in ['0', '1']):

> The main changes were to create a check for the length of fields[3],
> I wanted to normalize it at 6 digits...


Well, you needn't really check the length - you could directly do this:
fields[3] = (fields[3].replace('.', '') + '000000')[:6]
(of course if there are more than 6 digits originally, they'd get
truncated in this case)

Chirag Wazir
http://chirag.freeshell.org

 
Reply With Quote
 
John Machin
Guest
Posts: n/a
 
      03-20-2005

wrote:
> Lorn Davies wrote:
>
> > if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
> > ## This line above doesn't work, can't figure out how to struct?

>
> In Python you would need to phrase that as follows:
> if (fields[8] == 'N' or fields[8] == 'P') and (fields[6] == '0'
> or fields[6] == '1'):
> or alternatively:
> if (fields[8] in ['N', 'P']) and (fields[6] in ['0', '1']):
>


and given that the files are huge, a little bit of preprocessing
wouldn't go astray:

initially:

valid_8 = set(['N', 'P'])
valid_6 = set(['0', '1'])

then for each record:

if fields[8] in valid_8 and fields[6] in valid_6:

More meaningful names wouldn't go astray either

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Memory error due to the huge/huge input file size tejsupra@gmail.com Python 3 11-20-2008 07:21 PM
Replacing the specific pattern of text in a huge text string.. Brown Smith ASP .Net 1 06-25-2005 05:34 AM
text view for huge files? Jan XML 6 12-10-2004 11:43 PM
Reading huge text files one line at a time.... Brock Heinz Java 8 11-23-2004 08:52 PM
Reading text in HUGE file-?? Jim Beaver Computer Support 0 06-24-2003 04:04 PM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57