list comprehension help

 
 
rkmr.em@gmail.com
Guest
Posts: n/a
 
      03-18-2007
Hi,
I need to process a really huge text file (4GB), and this is what I
need to do. It takes forever to complete. I read somewhere that a
"list comprehension" can speed things up. Can you point out how to do
it in this case?
Thanks a lot!


f = open('file.txt','r')
for line in f:
    db[line.split(' ')[0]] = line.split(' ')[-1]
    db.sync()
 
 
 
 
 
Marc 'BlackJack' Rintsch
Guest
Posts: n/a
 
      03-18-2007
In <(E-Mail Removed)>, (E-Mail Removed) wrote:

> I need to process a really huge text file (4GB), and this is what I
> need to do. It takes forever to complete. I read somewhere that a
> "list comprehension" can speed things up. Can you point out how to do
> it in this case?


No way I can see here.

> f = open('file.txt','r')
> for line in f:
>     db[line.split(' ')[0]] = line.split(' ')[-1]
>     db.sync()


You can get rid of splitting the same line twice, or use `split()` and
`rsplit()` with the `maxsplit` argument to avoid splitting the line at
*every* space character.

And if the names give the right hints, `db.sync()` may be an expensive
operation, so try to call it less frequently if possible.
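
Something like this minimal sketch, for instance (assuming the key is the
first space-separated field and the value is the last, as in your snippet,
and that `db` stays whatever mapping-like object it already is):

f = open('file.txt', 'r')
for line in f:
    # split at most once from each end instead of at every space
    key = line.split(' ', 1)[0]
    value = line.rsplit(' ', 1)[-1]
    db[key] = value
f.close()
db.sync()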

Ciao,
Marc 'BlackJack' Rintsch
 
 
 
 
 
George Sakkis
Guest
Posts: n/a
 
      03-19-2007
On Mar 18, 12:11 pm, "(E-Mail Removed)" <(E-Mail Removed)> wrote:

> Hi,
> I need to process a really huge text file (4GB), and this is what I
> need to do. It takes forever to complete. I read somewhere that a
> "list comprehension" can speed things up. Can you point out how to do
> it in this case?
> Thanks a lot!
>
> f = open('file.txt','r')
> for line in f:
>     db[line.split(' ')[0]] = line.split(' ')[-1]
>     db.sync()


You got several good suggestions; one that has not been mentioned but
makes a big (or even the biggest) difference for large/huge files is
the buffering parameter of open(). Set it to the largest value you can
afford to keep the I/O as low as possible. I'm processing 15-25 GB
files (you see, "huge" is really relative) on 2-4GB RAM boxes, and
setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
compared to the default value. BerkeleyDB should have a buffering
option too; make sure you use it, and don't synchronize on every line.
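
As a rough sketch of what I mean (the 512 MB figure is only an
illustrative number, and 'file.txt' stands in for the real file; tune
the buffer to what your box can spare):

buf_size = 512 * 1024 * 1024   # third argument of open() is the buffer size in bytes
f = open('file.txt', 'r', buf_size)
for line in f:
    db[line.split(' ', 1)[0]] = line.rsplit(' ', 1)[-1]
f.close()
db.sync()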

Best,
George

 
 
rkmr.em@gmail.com
Guest
Posts: n/a
 
      03-19-2007
On 18 Mar 2007 19:01:27 -0700, George Sakkis <(E-Mail Removed)> wrote:
> On Mar 18, 12:11 pm, "(E-Mail Removed)" <(E-Mail Removed)> wrote:
> > I need to process a really huge text file (4GB), and this is what I
> > need to do. It takes forever to complete. I read somewhere that a
> > "list comprehension" can speed things up. Can you point out how to do
> > it in this case?
> > Thanks a lot!
> >
> > f = open('file.txt','r')
> > for line in f:
> >     db[line.split(' ')[0]] = line.split(' ')[-1]
> >     db.sync()

> You got several good suggestions; one that has not been mentioned but
> makes a big (or even the biggest) difference for large/huge files is
> the buffering parameter of open(). Set it to the largest value you can
> afford to keep the I/O as low as possible. I'm processing 15-25 GB


Can you give an example of how you process the 15-25GB files with the
buffering parameter?
It would be educational for everyone, I think.

> files (you see, "huge" is really relative) on 2-4GB RAM boxes, and
> setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
> compared to the default value. BerkeleyDB should have a buffering
> option too; make sure you use it, and don't synchronize on every line.


I changed the sync to once every 100,000 lines, roughly like the sketch
below. Thanks a lot, everyone!
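
A minimal sketch of that change (assuming `db` is a bsddb-style object
with a `sync()` method, which is my guess from the original snippet):

f = open('file.txt', 'r')
for count, line in enumerate(f):
    db[line.split(' ', 1)[0]] = line.rsplit(' ', 1)[-1]
    if (count + 1) % 100000 == 0:
        # flush to disk once every 100,000 lines instead of on every line
        db.sync()
db.sync()   # final flush for any remaining lines
f.close()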
 
 
Alex Martelli
Guest
Posts: n/a
 
      03-19-2007
George Sakkis <(E-Mail Removed)> wrote:

> On Mar 18, 12:11 pm, "(E-Mail Removed)" <(E-Mail Removed)> wrote:
>
> > Hi,
> > I need to process a really huge text file (4GB), and this is what I
> > need to do. It takes forever to complete. I read somewhere that a
> > "list comprehension" can speed things up. Can you point out how to do
> > it in this case?
> > Thanks a lot!
> >
> > f = open('file.txt','r')
> > for line in f:
> >     db[line.split(' ')[0]] = line.split(' ')[-1]
> >     db.sync()

>
> You got several good suggestions; one that has not been mentioned but
> makes a big (or even the biggest) difference for large/huge files is
> the buffering parameter of open(). Set it to the largest value you can
> afford to keep the I/O as low as possible. I'm processing 15-25 GB
> files (you see, "huge" is really relative) on 2-4GB RAM boxes, and
> setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
> compared to the default value. BerkeleyDB should have a buffering
> option too; make sure you use it, and don't synchronize on every line.


Out of curiosity, what OS and FS are you using? On a well-tuned FS and
OS combo that does "read-ahead" properly, I would not expect such
improvements for moving from large to huge buffering (unless some other
pesky process is perking up once in a while and sending the disk heads
on a quest to never-never land). IOW, if I observed this performance
behavior on a server machine I'm responsible for, I'd look for
system-level optimizations (unless I know I'm being forced by myopic
beancounters to run inappropriate OSs/FSs, in which case I'd spend the
time polishing my resume instead) - maybe tuning the OS (or mount?)
parameters, maybe finding a way to satisfy the "other pesky process"
without flapping disk heads all over the prairie, etc, etc.

The delay of filling a "1 GB or more" buffer before actual processing
can begin _should_ defeat any gains over, say, a 1 MB buffer -- unless,
that is, something bad is seriously interfering with the normal
read-ahead system level optimization... and in that case I'd normally be
more interested in finding and squashing the "something bad", than in
trying to work around it by overprovisioning application bufferspace!-)


Alex
 
 
rkmr.em@gmail.com
Guest
Posts: n/a
 
      03-19-2007
On 3/18/07, Alex Martelli <(E-Mail Removed)> wrote:
> George Sakkis <(E-Mail Removed)> wrote:
> > On Mar 18, 12:11 pm, "(E-Mail Removed)" <(E-Mail Removed)> wrote:
> > > I need to process a really huge text file (4GB), and this is what I
> > > need to do. It takes forever to complete. I read somewhere that a
> > > "list comprehension" can speed things up. Can you point out how to do
> > > f = open('file.txt','r')
> > > for line in f:
> > >     db[line.split(' ')[0]] = line.split(' ')[-1]
> > >     db.sync()

> > You got several good suggestions; one that has not been mentioned but
> > makes a big (or even the biggest) difference for large/huge files is
> > the buffering parameter of open(). Set it to the largest value you can
> > afford to keep the I/O as low as possible. I'm processing 15-25 GB
> > files (you see, "huge" is really relative) on 2-4GB RAM boxes, and
> > setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
> > compared to the default value. BerkeleyDB should have a buffering

> Out of curiosity, what OS and FS are you using? On a well-tuned FS and


Fedora Core 4 and ext3. Is there something I should do to the FS?

> OS combo that does "read-ahead" properly, I would not expect such
> improvements for moving from large to huge buffering (unless some other
> pesky process is perking up once in a while and sending the disk heads
> on a quest to never-never land). IOW, if I observed this performance
> behavior on a server machine I'm responsible for, I'd look for
> system-level optimizations (unless I know I'm being forced by myopic
> beancounters to run inappropriate OSs/FSs, in which case I'd spend the
> time polishing my resume instead) - maybe tuning the OS (or mount?)
> parameters, maybe finding a way to satisfy the "other pesky process"
> without flapping disk heads all over the prairie, etc, etc.
>
> The delay of filling a "1 GB or more" buffer before actual processing
> can begin _should_ defeat any gains over, say, a 1 MB buffer -- unless,
> that is, something bad is seriously interfering with the normal
> read-ahead system level optimization... and in that case I'd normally be
> more interested in finding and squashing the "something bad", than in
> trying to work around it by overprovisioning application bufferspace!-)



Which should I do? How much buffer should I allocate? I have a box
with 2GB of memory.
Thanks!
 
 
Alex Martelli
Guest
Posts: n/a
 
      03-19-2007
(E-Mail Removed) <(E-Mail Removed)> wrote:
...
> > > files (you see, "huge" is really relative) on 2-4GB RAM boxes, and
> > > setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
> > > compared to the default value. BerkeleyDB should have a buffering

> > Out of curiosity, what OS and FS are you using? On a well-tuned FS and

>
> Fedora Core 4 and ext3. Is there something I should do to the FS?


In theory, nothing. In practice, this is strange.

> Which should I do? How much buffer should I allocate? I have a box
> with 2GB of memory.


I'd be curious to see a read-only loop on the file, opened with (say)
1MB of buffer vs 30MB vs 1GB -- just loop on the lines, do a .split() on
each, and do nothing with the results. What elapsed times do you
measure with each buffersize...?
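
For instance, a minimal sketch of such a benchmark (the file name and the
buffer sizes are just placeholders):

import time

def timed_pass(path, buf_size):
    # read-only pass: iterate the lines, split each one, discard the result
    start = time.time()
    f = open(path, 'r', buf_size)
    for line in f:
        line.split()
    f.close()
    return time.time() - start

for buf_size in (1024 * 1024, 30 * 1024 * 1024, 1024 * 1024 * 1024):
    print buf_size, timed_pass('file.txt', buf_size)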

If the huge buffers confirm their worth, it's time to take a nice
critical look at what other processes you're running and what all are
they doing to your disk -- maybe some daemon (or frequently-run cron
entry, etc) is out of control...? You could try running the benchmark
again in single-user mode (with essentially nothing else running) and
see how the elapsed-time measurements change...


Alex
 