Processing Multiple Large Files

 
 
friend.05@gmail.com
 
      12-11-2008
Hi,

I am analyzing some network log files. There are around 200-300 files and
each file has more than 2 million entries in it.

Currently my script is reading each file line by line. So it will take
a lot of time to process all the files.

Is there any efficient way to do it?

Maybe multiprocessing or multitasking?


Thanks.

 
Tim Greer
      12-11-2008
friend.05@gmail.com wrote:

> Hi,
>
> I am analyzing some network log files. There are around 200-300 files and
> each file has more than 2 million entries in it.
>
> Currently my script is reading each file line by line. So it will take
> a lot of time to process all the files.


When dealing with a lot of data, you usually want to read line by line,
if you can help it. That's the most efficient way when dealing with
large text files. If you have a ton of memory to play with, you can
try other solutions, but even reading line by line, there might be ways
to speed that up, too, depending on a few variables and your needs.

No matter how you go about it, if you have to look at every line in the
file (to use, process, skip, whatever), you're still going to have to
do that and it will have the smaller memory footprint. Maybe it's how
you're going about the task that can be improved? Do you have any
relevant code snippets?
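
A minimal sketch of the line-by-line approach described above, for reference. It holds only the current line in memory, so the footprint stays small however large the file is. The sub name scan_file and the DENY pattern are hypothetical placeholders, not taken from the original poster's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count total lines and lines matching $pattern in one file.
# Only the current line is held in memory at any time.
sub scan_file {
    my ($path, $pattern) = @_;

    open my $in, '<', $path
        or die "Cannot open '$path': $!";

    my ($total, $hits) = (0, 0);
    while ( my $line = <$in> ) {
        $total++;
        $hits++ if $line =~ $pattern;   # stand-in for real processing
    }

    close $in
        or die "Cannot close '$path': $!";

    return ($total, $hits);
}

# Usage: perl scan.pl file1.log file2.log ...
for my $path (@ARGV) {
    my ($total, $hits) = scan_file($path, qr/\bDENY\b/);
    print "$path: $hits of $total lines matched\n";
}
```

Anything that has to survive past the current line (counters, a small hash of totals) should stay small; accumulating every line in an array would give up the memory advantage.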
--
Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
Industry's most experienced staff! -- Web Hosting With Muscle!
 
friend.05@gmail.com
      12-11-2008
On Dec 11, 3:32 pm, Tim Greer <t...@burlyhost.com> wrote:
> friend...@gmail.com wrote:
> > Hi,
> >
> > I am analyzing some network log files. There are around 200-300 files and
> > each file has more than 2 million entries in it.
> >
> > Currently my script is reading each file line by line. So it will take
> > a lot of time to process all the files.
>
> When dealing with a lot of data, you usually want to read line by line,
> if you can help it. That's the most efficient way when dealing with
> large text files. If you have a ton of memory to play with, you can
> try other solutions, but even reading line by line, there might be ways
> to speed that up, too, depending on a few variables and your needs.
>
> No matter how you go about it, if you have to look at every line in the
> file (to use, process, skip, whatever), you're still going to have to
> do that and it will have the smaller memory footprint. Maybe it's how
> you're going about the task that can be improved? Do you have any
> relevant code snippets?


Yes, I am reading each file line by line.

But there are more than 200 files. Is there a way to process
some of the files in parallel?

Or is there any other way to speed up my task?

 
Reply With Quote

xhoster@gmail.com
      12-11-2008
"" <> wrote:
> Hi,
>
> I am analyzing some network log files. There are around 200-300 files and
> each file has more than 2 million entries in it.
>
> Currently my script is reading each file line by line.


Perl makes it look like you are reading the files line by line.
But really it is using internal buffering to read the files
in larger chunks. (Well, if the lines are short. If the lines
are long, the chunks may actually be shorter than the lines.)

> So it will take
> a lot of time to process all the files.
>
> Is there any efficient way to do it?


Figure out which parts are inefficient, and improve them.

> Maybe multiprocessing or multitasking?


Do you have several CPUs? Can your I/O system keep up with them?

There are all kinds of ways to do parallel processing in Perl.
In this case, maybe Parallel::ForkManager would be best.
Each process can be assigned a specific one of the 300
files to work on.

See the docs for Parallel::ForkManager.
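
To illustrate the "one file per process" idea without any CPAN dependency, here is a rough sketch that uses only Perl's built-in fork and wait. It is not Parallel::ForkManager itself, just the same pattern written out by hand. The names process_in_parallel and count_lines are made up for this example, and it assumes a platform where fork behaves in the usual Unix way.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run $worker->($file) once per file, keeping at most $max_procs
# child processes alive at a time. Returns the number of children
# that exited with a non-zero status.
sub process_in_parallel {
    my ($max_procs, $worker, @files) = @_;
    $max_procs = 1 if $max_procs < 1;

    my ($running, $failed) = (0, 0);

    for my $file (@files) {
        # At the limit: block until one child exits before forking again.
        if ($running >= $max_procs) {
            my $pid = wait();
            $failed++ if $pid > 0 && $? != 0;
            $running--;
        }

        my $pid = fork();
        die "Cannot fork: $!" unless defined $pid;

        if ($pid == 0) {
            # Child: handle exactly one file, then exit so it never
            # falls through into the parent's loop.
            my $ok = eval { $worker->($file); 1 };
            warn $@ unless $ok;
            exit($ok ? 0 : 1);
        }
        $running++;
    }

    # Reap the children that are still running.
    while ($running > 0) {
        my $pid = wait();
        $failed++ if $pid > 0 && $? != 0;
        $running--;
    }
    return $failed;
}

# Hypothetical worker: count the lines of one file.
sub count_lines {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open '$path': $!";
    my $n = 0;
    $n++ while <$in>;
    close $in;
    print "$path: $n lines\n";
}

# Usage: perl parallel.pl file1.log file2.log ...
my $failed = process_in_parallel(4, \&count_lines, @ARGV);
print "$failed worker(s) failed\n";
```

Each child has its own memory, so results have to come back through files, pipes, or exit codes. The limit of 4 is arbitrary; the right number depends on how many cores you have and whether the disk can keep up.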

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
Reply With Quote

Peter Makholm
      12-11-2008
"" <> writes:

> I am analyzing some network log files. There are around 200-300 files and
> each file has more than 2 million entries in it.
>
> Currently my script is reading each file line by line. So it will take
> a lot of time to process all the files.


It depends on what kind of processing you're doing.

If you don't need to process lines in order, you might get a speedup by
starting a couple of processes, each processing its own files. Again,
depending on the kind of processing, the optimal number of processes
may vary from the number of CPUs to a couple of times the number of
CPUs.

If part of your processing consists of doing DNS lookups, you might be
able to get a speedup by reading a few lines at a time and using asynchronous
DNS requests (Net::DNS::Async seems to do it) instead of blocking on each
and every request.

Other optimizations might be possible, but almost everything depends
on the kind of processing you have to do and whether you have to process
lines in some predetermined order.

//Makholm
 
Reply With Quote

RedGrittyBrick
      12-11-2008

friend.05@gmail.com wrote:
> Hi,
>
> I am analyzing some network log files. There are around 200-300 files and
> each file has more than 2 million entries in it.
>
> Currently my script is reading each file line by line. So it will take
> a lot of time to process all the files.
>
> Is there any efficient way to do it?
>
> Maybe multiprocessing or multitasking?
>


If the 200-300 files are on the same disk, are not especially fragmented
and your program is already IO-bound, parallel processing might
conceivably slow things down by increasing the number of head-seeks needed.

Just a thought.
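
One rough way to check whether a script is already I/O-bound is to compare the CPU time it uses with the wall-clock time it takes. The sketch below does that with the built-in times function and the core Time::HiRes module. The helper name cpu_share and the tab-splitting loop are illustrative assumptions only. A share close to 100% suggests the work is CPU-bound and extra processes may help; a much lower share suggests the process spends most of its time waiting, typically on the disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run $code and return wall-clock seconds, CPU seconds (user + system,
# this process only) and the ratio of CPU time to wall-clock time.
sub cpu_share {
    my ($code) = @_;

    my $wall0 = time;
    my ($user0, $sys0) = times;

    $code->();

    my ($user1, $sys1) = times;
    my $wall  = time - $wall0;
    my $cpu   = ($user1 - $user0) + ($sys1 - $sys0);
    my $share = $wall > 0 ? $cpu / $wall : 0;

    return ($wall, $cpu, $share);
}

# Usage: perl cpushare.pl file1.log file2.log ...
my ($wall, $cpu, $share) = cpu_share(sub {
    for my $path (@ARGV) {
        open my $in, '<', $path or die "Cannot open '$path': $!";
        while ( my $line = <$in> ) {
            my @data = split /\t/, $line;   # stand-in for real processing
        }
        close $in;
    }
});
printf "wall %.3fs, cpu %.3fs, cpu share %.0f%%\n", $wall, $cpu, 100 * $share;
```

Run it on files that have not been read recently; otherwise the operating system's file cache can hide the disk cost.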

--
RGB
 
Reply With Quote

A. Sinan Unur
      12-11-2008
"" <> wrote in news:5f1e2237-
b3f6-409c-aa95-:

> I am analyzing some network log files. There are around 200-300 files and
> each file has more than 2 million entries in it.
>
> Currently my script is reading each file line by line. So it will take
> a lot of time to process all the files.
>
> Is there any efficient way to do it?
>
> Maybe multiprocessing or multitasking?


Here is one way to do it using Parallel::ForkManager.

If your system is somewhat typical, you'll probably run into an IO
bottleneck before you run into a CPU bottleneck.

For example:

C:\DOCUME~1\asu1\LOCALS~1\Temp\large> cat create.pl
#!/usr/bin/perl

use strict;
use warnings;

my $line = join("\t", qw( 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 )) . "\n";

my $fn_tmpl = 'data_%2.2d.txt';
my $fn = sprintf $fn_tmpl, 0;

open my $out, '>', $fn
    or die "Cannot open '$fn': $!";

for (1 .. 100_000) {
    print $out $line
        or die "Cannot write to '$fn': $!";
}

close $out
    or die "Cannot close: '$fn': $!";

for (1 .. 19) {
    system copy => $fn, sprintf($fn_tmpl, $_);
}

C:\DOCUME~1\asu1\LOCALS~1\Temp\large> timethis create
....
TimeThis : Command Line : create
TimeThis : Start Time : Thu Dec 11 18:14:12 2008
TimeThis : End Time : Thu Dec 11 18:14:16 2008
TimeThis : Elapsed Time : 00:00:03.468

Now, you have 20 input files with 100_000 lines each:

C:\DOCUME~1\asu1\LOCALS~1\Temp\large> dir
....
2008/12/11 06:14 PM 4,100,000 data_00.txt
2008/12/11 06:14 PM 4,100,000 data_01.txt
2008/12/11 06:14 PM 4,100,000 data_02.txt
2008/12/11 06:14 PM 4,100,000 data_03.txt
2008/12/11 06:14 PM 4,100,000 data_04.txt
2008/12/11 06:14 PM 4,100,000 data_05.txt
2008/12/11 06:14 PM 4,100,000 data_06.txt
2008/12/11 06:14 PM 4,100,000 data_07.txt
2008/12/11 06:14 PM 4,100,000 data_08.txt
2008/12/11 06:14 PM 4,100,000 data_09.txt
2008/12/11 06:14 PM 4,100,000 data_10.txt
2008/12/11 06:14 PM 4,100,000 data_11.txt
2008/12/11 06:14 PM 4,100,000 data_12.txt
2008/12/11 06:14 PM 4,100,000 data_13.txt
2008/12/11 06:14 PM 4,100,000 data_14.txt
2008/12/11 06:14 PM 4,100,000 data_15.txt
2008/12/11 06:14 PM 4,100,000 data_16.txt
2008/12/11 06:14 PM 4,100,000 data_17.txt
2008/12/11 06:14 PM 4,100,000 data_18.txt
2008/12/11 06:14 PM 4,100,000 data_19.txt

Here is a simple program to process the data:

C:\DOCUME~1\asu1\LOCALS~1\Temp\large> cat process.pl
#!/usr/bin/perl

use strict;
use warnings;

use Parallel::ForkManager;

my ($instances) = @ARGV;

my $fn_tmpl = 'data_%2.2d.txt';

my $pm = Parallel::ForkManager->new($instances);

for my $i (0 .. 19 ) {
    $pm->start and next;

    my $input = sprintf $fn_tmpl, $i;

    eval {
        open my $in, '<', $input
            or die "Cannot open '$input': $!";

        while ( my $line = <$in> ) {
            my @data = split /\t/, $line;

            # replace with your own processing code
            # don't try to keep all your data in memory
        }
        close $in
            or die "Cannot close '$input': $!";
    };

    warn $@ if $@;

    $pm->finish;
}

$pm->wait_all_children;

__END__

First, try without forking to establish a baseline:

C:\DOCUME~1\asu1\LOCALS~1\Temp\large> timethis process 0

TimeThis : Command Line : process 0
TimeThis : Start Time : Thu Dec 11 18:31:50 2008
TimeThis : End Time : Thu Dec 11 18:32:41 2008
TimeThis : Elapsed Time : 00:00:51.156

Let's try a few more:

TimeThis : Command Line : process 2
TimeThis : Start Time : Thu Dec 11 18:35:15 2008
TimeThis : End Time : Thu Dec 11 18:35:58 2008
TimeThis : Elapsed Time : 00:00:43.578

TimeThis : Command Line : process 4
TimeThis : Start Time : Thu Dec 11 18:36:17 2008
TimeThis : End Time : Thu Dec 11 18:36:59 2008
TimeThis : Elapsed Time : 00:00:41.921

TimeThis : Command Line : process 8
TimeThis : Start Time : Thu Dec 11 18:37:18 2008
TimeThis : End Time : Thu Dec 11 18:38:00 2008
TimeThis : Elapsed Time : 00:00:41.328

TimeThis : Command Line : process 16
TimeThis : Start Time : Thu Dec 11 18:38:18 2008
TimeThis : End Time : Thu Dec 11 18:38:58 2008
TimeThis : Elapsed Time : 00:00:40.734

TimeThis : Command Line : process 20
TimeThis : Start Time : Thu Dec 11 18:39:17 2008
TimeThis : End Time : Thu Dec 11 18:39:58 2008
TimeThis : Elapsed Time : 00:00:40.578

Not very impressive. Between no forking vs max 20 instances, time
required to process was reduced by 20% with most of the gains coming
from running 2. That probably has more to do with the implementation of
fork on Windows than anything else.

In fact, I should probably have used threads on Windows. Anyway, I'll
boot into Linux and see if the returns there are greater.

Try this simple experiment on your system. See how many instances give
you the best bang for the buck.

Sinan

--
A. Sinan Unur <>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
Reply With Quote

A. Sinan Unur
      12-12-2008
"A. Sinan Unur" <> wrote in
news:Xns9B71C0CDC9E12asu1cornelledu@127.0.0.1:

> "" <> wrote in news:5f1e2237-
> b3f6-409c-aa95-:
>
>> I am analyzing some network log files. There are around 200-300 files
>> and each file has more than 2 million entries in it.
>>
>> Currently my script is reading each file line by line. So it will
>> take a lot of time to process all the files.
>>
>> Is there any efficient way to do it?
>>
>> Maybe multiprocessing or multitasking?
>
> Here is one way to do it using Parallel::ForkManager.
>

....

> Not very impressive. Between no forking vs max 20 instances, time
> required to process was reduced by 20% with most of the gains coming
> from running 2. That probably has more to do with the implementation
> of fork on Windows than anything else.
>
> In fact, I should probably have used threads on Windows. Anyway, I'll
> boot into Linux and see if the returns there are greater.


Hmmm ... I tried it on ArchLinux using perl from the repository on the
exact same hardware as the Windows tests:

[sinan@archardy large]$ time perl process.pl 0

real 0m29.983s
user 0m29.848s
sys 0m0.073s

[sinan@archardy large]$ time perl process.pl 2

real 0m15.281s
user 0m29.865s
sys 0m0.077s

with no changes going to 4, 8, 16 or 20 max instances. Exact same
program and data on the same hardware, yet the no fork version was 40%
faster. Running it in a shell window in xfce4 versus at boot-up on the
console and running it in an ntfs filesystem versus ext3 file system did
not make any meaningful difference.

The wireless connection was up but inactive in all scenarios.

-- Sinan

Reply With Quote

cartercc
      12-12-2008
On Dec 11, 3:27 pm, "friend...@gmail.com" <hirenshah...@gmail.com>
wrote:
> I am analyzing some network log files. There are around 200-300 files and
> each file has more than 2 million entries in it.
> Currently my script is reading each file line by line. So it will take
> a lot of time to process all the files.


Your question is really about data. The fact that your data is
contained in files which have rows and columns is totally irrelevant.
You would have the same problem if all the data were contained in just
one file. If you have 200,000,000 items of data, you have that much
data, and there's absolutely nothing you can do about it.

> Is there any efficient way to do it?


This is a good question, and the answer is, 'Maybe.' If you want to
generate reports from the data, you might want to look into putting it
into a database and writing queries against the database. That's what
companies like Wal-mart, Amazon.com, and eBay do. Write a script that
runs as a cron job at 2:00 am and reads all the data into a database.
Then write another script that queries the database at 4:00 am and
spits out the reports you want.

>
> Maybe multiprocessing or multitasking?


If you are using an Intel-like processor, it multi processes, anyway.
There are only two ways to increase speed: increase the clocks of the
processor or increase the number of processors. With respect to the
latter, take a look at Erlang. I'd bet a lot of money that you could
write an Erlang script that would increase the speed by several orders
of magnitude. (On my machine, Erlang generates about 60,000 threads in
several milliseconds, and I have an old, slow machine.)

CC
 
Reply With Quote

Peter J. Holzer
      12-13-2008
On 2008-12-12 13:09, A. Sinan Unur <> wrote:
> "A. Sinan Unur" <> wrote in
> news:Xns9B71C0CDC9E12asu1cornelledu@127.0.0.1:
>> "" <> wrote in news:5f1e2237-
>> b3f6-409c-aa95-:
>>
>>> I am analyzing some network log files. There are around 200-300 files
>>> and each file has more than 2 million entries in it.
[...]
>>> Is there any efficient way to do it?
>>>
>>> Maybe multiprocessing or multitasking?
>>
>> Here is one way to do it using Parallel::ForkManager.
>>

> ...
>
>> Not very impressive. Between no forking vs max 20 instances, time
>> required to process was reduced by 20% with most of the gains coming
>> from running 2. That probably has more to do with the implementation
>> of fork on Windows than anything else.
>>
>> In fact, I should probably have used threads on Windows. Anyway, I'll
>> boot into Linux and see if the returns there are greater.

>
> Hmmm ... I tried it on ArchLinux using perl from the repository on the
> exact same hardware as the Windows tests:
>
> [sinan@archardy large]$ time perl process.pl 0
>
> real 0m29.983s
> user 0m29.848s
> sys 0m0.073s
>
> [sinan@archardy large]$ time perl process.pl 2
>
> real 0m15.281s
> user 0m29.865s
> sys 0m0.077s
>
> with no changes going to 4, 8, 16 or 20 max instances. Exact same
> program and data on the same hardware, yet the no fork version was 40%
> faster.


Where do you get this 40% figure from? As far as I can see the forking
version is almost exactly 100% faster (0m15.281s instead of 0m29.983s)
than the non-forking version.

This is to be expected. Your small test files fit completely into memory
even on rather small systems and if you ran process.pl directly after
create.pl, they almost certainly were. So the task is completely
CPU-bound, and if you have at least two cores (most current computers
have) two processes should be twice as fast as one.
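
The percentages in this exchange depend on which two timings are compared and on whether "faster" means time saved or throughput gained. The short sketch below recomputes both from the elapsed times quoted earlier in the thread. Reading the 40% figure as a Windows versus Linux comparison of the no-fork runs is only a guess; the thread does not confirm it. The helper names are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percentage of elapsed time saved when going from $before to $after.
sub time_saved_pct {
    my ($before, $after) = @_;
    return 100 * ($before - $after) / $before;
}

# How many times as fast the $after run is compared with $before.
sub speedup {
    my ($before, $after) = @_;
    return $before / $after;
}

# Elapsed seconds as reported in the thread.
my %t = (
    win_0   => 51.156,   # Windows, no forking
    win_20  => 40.578,   # Windows, 20 instances
    linux_0 => 29.983,   # Linux, no forking
    linux_2 => 15.281,   # Linux, 2 instances
);

printf "Windows, 0 -> 20 instances: %.1f%% less time\n",
    time_saved_pct($t{win_0}, $t{win_20});
printf "Linux, 0 -> 2 instances:    %.1f%% less time, %.2fx as fast\n",
    time_saved_pct($t{linux_0}, $t{linux_2}), speedup($t{linux_0}, $t{linux_2});
printf "No fork, Windows -> Linux:  %.1f%% less time\n",
    time_saved_pct($t{win_0}, $t{linux_0});
```

On Linux the two-instance run takes roughly half the time, which is the same thing as being roughly twice as fast, or "almost exactly 100% faster" in the sense used above.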

Here is what I get for

for i in `seq 0 25`
do
    echo -n "$i "
    time ./process $i
done

on a dual-core system:

0 ./process $i 20.85s user 0.10s system 99% cpu 21.024 total
1 ./process $i 22.03s user 0.06s system 99% cpu 22.146 total
2 ./process $i 21.86s user 0.04s system 197% cpu 11.093 total
3 ./process $i 22.63s user 0.09s system 197% cpu 11.505 total
[...]
23 ./process $i 21.67s user 0.15s system 199% cpu 10.956 total
24 ./process $i 22.91s user 0.10s system 199% cpu 11.553 total
25 ./process $i 22.05s user 0.08s system 199% cpu 11.124 total


Two processes are twice as fast as one, but adding more processes
doesn't help (but doesn't hurt either).

And here's the output for an 8-core system:

0 ./process $i 10.22s user 0.05s system 99% cpu 10.275 total
1 ./process $i 10.13s user 0.07s system 100% cpu 10.196 total
2 ./process $i 10.19s user 0.06s system 199% cpu 5.138 total
3 ./process $i 10.19s user 0.06s system 284% cpu 3.606 total
4 ./process $i 10.19s user 0.06s system 395% cpu 2.589 total
5 ./process $i 10.18s user 0.06s system 472% cpu 2.167 total
6 ./process $i 10.20s user 0.05s system 495% cpu 2.069 total
7 ./process $i 10.20s user 0.07s system 650% cpu 1.580 total
8 ./process $i 10.18s user 0.06s system 652% cpu 1.571 total
9 ./process $i 10.19s user 0.05s system 659% cpu 1.553 total
10 ./process $i 10.20s user 0.06s system 667% cpu 1.538 total
11 ./process $i 10.19s user 0.06s system 666% cpu 1.538 total
12 ./process $i 10.19s user 0.06s system 706% cpu 1.451 total
13 ./process $i 10.19s user 0.05s system 662% cpu 1.545 total
14 ./process $i 10.19s user 0.06s system 689% cpu 1.486 total
15 ./process $i 10.19s user 0.05s system 708% cpu 1.446 total
16 ./process $i 10.20s user 0.06s system 755% cpu 1.357 total
17 ./process $i 10.22s user 0.06s system 756% cpu 1.360 total
18 ./process $i 10.20s user 0.05s system 741% cpu 1.383 total
19 ./process $i 10.21s user 0.06s system 729% cpu 1.407 total
20 ./process $i 10.23s user 0.05s system 726% cpu 1.415 total
21 ./process $i 10.20s user 0.06s system 749% cpu 1.368 total
22 ./process $i 10.21s user 0.05s system 726% cpu 1.411 total
23 ./process $i 10.23s user 0.06s system 739% cpu 1.392 total
24 ./process $i 10.21s user 0.04s system 712% cpu 1.440 total
25 ./process $i 10.20s user 0.05s system 739% cpu 1.386 total

Speed rises almost linearly until 7 processes (which manage to use 6.5
cores). It then still gets a bit faster until 16 to 17 processes (using
7.5 cores), and after that it levels off. Not quite what I expected, but
close enough.
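
The "cores used" figures can be derived from the timing lines themselves: divide the CPU seconds (user plus system) by the elapsed seconds. Here is a small sketch that does this for a few rows copied from the 8-core table and also reports the speedup relative to the non-forking run. The sub name parse_timing is hypothetical, and the regular expression assumes exactly the output format shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one line of the timing output quoted above, for example
#   7 ./process $i 10.20s user 0.07s system 650% cpu 1.580 total
# Returns nothing if the line does not have that shape.
sub parse_timing {
    my ($line) = @_;
    my ($procs, $user, $sys, $total) = $line =~
        /^(\d+)\s.*?([\d.]+)s user\s+([\d.]+)s system\s+\d+% cpu\s+([\d.]+) total/
        or return;
    return {
        procs      => $procs,
        total      => $total,
        cores_used => ($user + $sys) / $total,   # CPU seconds per wall second
    };
}

# A few rows from the 8-core run. The heredoc is single-quoted so the
# literal "$i" in each row is not interpolated.
my @rows = map { parse_timing($_) } split /\n/, <<'END';
0 ./process $i 10.22s user 0.05s system 99% cpu 10.275 total
2 ./process $i 10.19s user 0.06s system 199% cpu 5.138 total
4 ./process $i 10.19s user 0.06s system 395% cpu 2.589 total
7 ./process $i 10.20s user 0.07s system 650% cpu 1.580 total
16 ./process $i 10.20s user 0.06s system 755% cpu 1.357 total
25 ./process $i 10.20s user 0.05s system 739% cpu 1.386 total
END

my $baseline = $rows[0]{total};   # the run with 0 instances (no forking)
for my $r (@rows) {
    printf "%2d procs: %6.3fs total, speedup %.2f, cores used %.2f\n",
        $r->{procs}, $r->{total}, $baseline / $r->{total}, $r->{cores_used};
}
```

Because total CPU time stays at about 10 seconds in every run, speedup and cores used track each other closely here. That would not hold for a workload that spends much of its time waiting on the disk.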

For the OP's problem, this test is most likely not representative: He
has a lot more files and each is larger. So they may not fit into the
cache, and even if they do, they probably aren't in the cache when his
script runs (depends on how long ago they were last read/written and how
busy the system is).

hp
 