best way to make a few changes in a large data file

 
 
Ted Zlatanov
      01-09-2013
On Tue, 8 Jan 2013 10:51:11 -0800 (PST) ccc31807 <> wrote:

c> You would think so, anyway. This was the first thing I tried, and it
c> turns out (on my setup at least) that printing the outfile line by
c> line takes a lot longer than dumping the whole thing into memory then
c> printing the DS once.

I have never experienced this. Could you, for instance, be reopening
the change file repeatedly? Any chance you could post that slow version
of the code?

I would recommend, if you are stuck on the text-based data files, to
use perl -p -e 'BEGIN { ... load your change file ... } ... process ...'

This doesn't have to be a one-liner, but it's a good way to test quickly
the "slow performance" issue. e.g.

perl -p -e 's/^5,.*/5,edward/' myfile > myrewrite

If that's slow, something's up.
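
Spelled out as a script, the BEGIN idea above might look roughly like the following. This is only a sketch under assumptions: both files are taken to hold `id,value` lines (as in the d0/d1 examples elsewhere in this thread), and the names `load_changes` and `apply_changes` are made up for illustration.

```perl
use strict;
use warnings;

# Read the (small) change file once into a hash: id => replacement value.
# Assumes "id,value" lines; everything after the first comma is the value.
sub load_changes {
    my ($path) = @_;
    my %chg;
    open(my $fh, '<', $path) or die "$path: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($id, $value) = split /,/, $line, 2;
        $chg{$id} = $value if defined $value;   # skips blank/malformed lines
    }
    close $fh;
    return \%chg;
}

# Stream the big file line by line, writing either the original line or
# its replacement, so memory use does not grow with the data file.
sub apply_changes {
    my ($in_path, $out_path, $chg) = @_;
    open(my $in,  '<', $in_path)  or die "$in_path: $!";
    open(my $out, '>', $out_path) or die "$out_path: $!";
    while (my $line = <$in>) {
        my ($id) = $line =~ /^([^,]*),/;
        if (defined $id && exists $chg->{$id}) {
            print {$out} "$id,$chg->{$id}\n";
        } else {
            print {$out} $line;
        }
    }
    close $in;
    close $out or die "$out_path: $!";
}

# Hypothetical usage: perl rewrite.pl changes data rewritten
if (@ARGV == 3) {
    my ($changes, $data, $rewritten) = @ARGV;
    apply_changes($data, $rewritten, load_changes($changes));
}
```

The change file is opened exactly once, which rules out the "reopening the change file repeatedly" suspicion raised above.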

Ted
 
Rainer Weikusat
      01-11-2013
Rainer Weikusat <> writes:
> "C.DeRykus" <> writes:
>
> [...]
>
>> Since speed isn't critical, the Tie::File suggestion
>> would simplify the code considerably.

>
> [...]
>
>> use Tie::File;
>>
>> tie my @array, 'Tie::File', 'd0' or die $!;
>>
>> open(my $fh, '<', 'd1') or die $!;
>> while (<$fh>) {
>>     chomp;
>>     my($id, $value) = split /,/;
>>     $array[$id-1] = "$id,$value";
>> }


[...]

> Assuming that speed doesn't matter, a simple implementation could
> look like this
>
> sub small
> {
>     my ($fh, %chgs);
>
>     open($fh, '<', 'd1');
>     %chgs = map { split /,/ } <$fh>;
>
>     open($fh, '<', 'd0');
>     /(.*),(.*)/s, print ($1, ',', $chgs{$1} // $2) while <$fh>;
> }


As an afterthought: Instead of guessing at what's taking the time when
executing the code above, I've tested it. The 'small_hash'
implementation below (with data files constructed in the way I
described in an earlier posting) is either faster than big_hash or
runs at comparable speeds (tested with files up to 1004K in size). It
can also process a 251M file which the big_hash one can't do within a
reasonable amount of time because it first causes perl to eat all RAM
available on the system where I tested this and then makes that go into
'heavy thrashing' mode because 'all available RAM' is - by far - not
enough.

----------------
use Benchmark;

# Discard the output so only the processing cost is measured.
open(my $out, '>', '/dev/null') or die $!;

timethese(-5,
{
    # Load the whole data file into a hash, apply the changes, print it all.
    big_hash => sub {
        my ($fh, %data, $k, $d);

        open($fh, '<', 'd0');
        %data = map { split /,/ } <$fh>;

        open($fh, '<', 'd1');
        while (<$fh>) {
            ($k, $d) = split /,/;
            $data{$k} = $d;
        }

        print $out ($_, ',', $data{$_}) for keys(%data);
    },

    # Load only the change file, then stream the data file line by line.
    small_hash => sub {
        my ($fh, %chgs, $k, $d);

        open($fh, '<', 'd1');
        %chgs = map { split /,/ } <$fh>;

        open($fh, '<', 'd0');
        while (<$fh>) {
            ($k, $d) = split /,/;
            print $out ($k, ',', $chgs{$k} // $d);
        }
    }
});
 
BobMCT
      01-12-2013
On Fri, 11 Jan 2013 13:56:32 +0000, Rainer Weikusat
<> wrote:

>Rainer Weikusat <> writes:
>> "C.DeRykus" <> writes:

Just a thought, but did you ever consider loading the data into a
temporary indexed database table and 'batch' updating it using the
indexing keys? Then you could dump the table to a flat file when
done. You should be able to use shell commands to load the data, run the php
script, then dump the table to a file.

Just my $.02 worth
 
 
Rainer Weikusat
      01-12-2013
BobMCT <> writes:
> On Fri, 11 Jan 2013 13:56:32 +0000, Rainer Weikusat
> <> wrote:
>
>>Rainer Weikusat <> writes:
>>> "C.DeRykus" <> writes:

> Just a thought, but did you ever consider loading the data into a
> temporary indexed database table and 'batch' updating it using the
> indexing keys?


As I wrote in a reply to an earlier posting: This would be a
perfect job for one of the available 'flat file' database packages,
eg, DB_File. But unless the same 'base data' file is processed more
than once, this means 'read the big file', 'write a big file', 'read
this big file', 'write another big file' and the replacement step
would turn into 'modify the big file'. I doubt that this would be
worth the effort.
 
 
Xho Jingleheimerschmidt
      01-15-2013
On 01/09/2013 06:10 AM, C.DeRykus wrote:
>
> Since speed isn't critical, the Tie::File suggestion would simplify
> the code considerably. Since the whole file isn't loaded, big files
> won't be problematic


I haven't used it in a while, but if I recall correctly Tie::File stores
the entire table of line-number/byte-offset in RAM, and that can often
be about as large as storing the entire file if the lines are fairly short.

Xho
 
 
C.DeRykus
      01-15-2013
On Monday, January 14, 2013 7:24:45 PM UTC-8, Xho Jingleheimerschmidt wrote:
> On 01/09/2013 06:10 AM, C.DeRykus wrote:
> >
> > Since speed isn't critical, the Tie::File suggestion would simplify
> > the code considerably. Since the whole file isn't loaded, big files
> > won't be problematic
>
> I haven't used it in a while, but if I recall correctly Tie::File stores
> the entire table of line-number/byte-offset in RAM, and that can often
> be about as large as storing the entire file if the lines are fairly short.


Actually IIUC, Tie::File is more parsimonious of memory than even DB_File for instance and employs a
"lazy cache" whose size can be user-specified.

See: http://perl.plover.com/TieFile/why-not-DB_File

So, even with overhead of 310 bytes per record, that
would get slow only if the file gets really huge and
least-recently read records start to get tossed.
But the stated aim was accuracy rather than speed.

And, since there's a 10Mb record limit with only 200-300K records, that's unlikely to be a show-stopper. It took only a couple of seconds to read a comparably sized file in my simple test.
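
For reference, the cache size mentioned above is set with Tie::File's `memory` option when tying the array. A minimal sketch, assuming `id,value` records where id N sits on line N (the same assumption the earlier Tie::File snippet makes); the sub name `tie_update` and the 2 MB figure in the usage line are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use Tie::File;

# Apply "id,value" changes to a data file in place via Tie::File.
# Assumes record N lives on line N, so an id maps to array index id-1.
sub tie_update {
    my ($data_path, $changes_path, $cache_bytes) = @_;

    # 'memory' caps the record cache (in bytes). As far as I understand the
    # module documentation, bookkeeping such as the line-offset table is
    # not counted against this limit.
    tie my @lines, 'Tie::File', $data_path, memory => $cache_bytes
        or die "$data_path: $!";

    open(my $fh, '<', $changes_path) or die "$changes_path: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($id, $value) = split /,/, $line, 2;
        # Skip blank or malformed lines and non-positive ids.
        next unless defined $value && $id =~ /^[1-9]\d*$/;
        $lines[$id - 1] = "$id,$value";
    }
    close $fh;

    untie @lines;   # flush pending writes and release the file
}

# Hypothetical usage: perl tie_update.pl d0 d1
tie_update($ARGV[0], $ARGV[1], 2_000_000) if @ARGV == 2;
```

Note that an id larger than the current number of lines makes Tie::File extend the file, so a stray id in the change file will silently add records.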

--
Charles DeRykus
 
 
Rainer Weikusat
      01-15-2013
"C.DeRykus" <> writes:

[...]

> Tie::File is more parsimonious of memory than even DB_File for instance and employs a
> "lazy cache" whose size can be user-specified.
>
> See: http://perl.plover.com/TieFile/why-not-DB_File
>
> So, even with overhead of 310 bytes per record, that
> would get slow only if the file gets really huge and
> least-recently read records start to get tossed.
> But the stated aim was accuracy rather than speed.


Nevertheless, Tie::File not only needs *much* more memory than a
line-by-line processing loop (~5000 bytes vs 138M for a 63M file) but
is also atrociously slow: Replacing 10 randomly selected lines in a
53,248 lines file with a total size of 251K needs (on the system
where I tested this) about 0.02s when reading but about 0.51s when
using Tie::File (and it is probably still completely unsuitable to
solve the original problem to begin with).
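
Figures like these are easy to re-check locally. The harness below is a rough sketch, not the code used for the numbers above: the sample data format (`id,valueN`, with id N on line N), the file names and all sub names are assumptions made for illustration, and the timings it prints will differ from system to system.

```perl
use strict;
use warnings;
use Tie::File;
use Time::HiRes ();

# Write $n lines of the form "id,valueN" (made-up sample data).
sub make_sample {
    my ($path, $n) = @_;
    open(my $fh, '>', $path) or die "$path: $!";
    print {$fh} "$_,value$_\n" for 1 .. $n;
    close $fh or die "$path: $!";
}

# Line-by-line copy, substituting values for ids found in %$chg.
sub stream_update {
    my ($in_path, $out_path, $chg) = @_;
    open(my $in,  '<', $in_path)  or die "$in_path: $!";
    open(my $out, '>', $out_path) or die "$out_path: $!";
    while (my $line = <$in>) {
        my ($id) = split /,/, $line, 2;
        print {$out} exists $chg->{$id} ? "$id,$chg->{$id}\n" : $line;
    }
    close $in;
    close $out or die "$out_path: $!";
}

# In-place update through Tie::File; assumes id N is on line N.
sub tie_update {
    my ($path, $chg) = @_;
    tie my @lines, 'Tie::File', $path or die "$path: $!";
    $lines[$_ - 1] = "$_,$chg->{$_}" for keys %$chg;
    untie @lines;
}

# Return elapsed wall-clock seconds for one call of $code.
sub timed {
    my ($code) = @_;
    my $t0 = Time::HiRes::time();
    $code->();
    return Time::HiRes::time() - $t0;
}

# Hypothetical driver: perl compare.pl <workdir>
if (@ARGV == 1) {
    my $dir = $ARGV[0];
    make_sample("$dir/d0", 50_000);
    # Pick (up to) 10 random ids to change.
    my %chg = map { my $id = 1 + int(rand(50_000)); ($id => 'changed') } 1 .. 10;
    my $t_stream = timed(sub { stream_update("$dir/d0", "$dir/d0.new", \%chg) });
    my $t_tie    = timed(sub { tie_update("$dir/d0", \%chg) });
    printf "stream: %.3fs  Tie::File: %.3fs\n", $t_stream, $t_tie;
}
```

The streaming version writes a new file while the Tie::File version edits in place, so the two are not doing identical I/O; the comparison only gives an order of magnitude.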

 
 
C.DeRykus
      01-15-2013
On Tuesday, January 15, 2013 12:40:32 PM UTC-8, Rainer Weikusat wrote:
> "C.DeRykus" <> writes:
>
> [...]
>
> > Tie::File is more parsimonious of memory than even DB_File for instance and employs a
> > "lazy cache" whose size can be user-specified.
> >
> > See: http://perl.plover.com/TieFile/why-not-DB_File
> >
> > So, even with overhead of 310 bytes per record, that
> > would get slow only if the file gets really huge and
> > least-recently read records start to get tossed.
> > But the stated aim was accuracy rather than speed.
>
> Nevertheless, Tie::File not only needs *much* more memory than a
> line-by-line processing loop (~5000 bytes vs 138M for a 63M file) but
> is also atrociously slow: Replacing 10 randomly selected lines in a
> 53,248 lines file with a total size of 251K needs (on the system
> where I tested this) about 0.02s when reading but about 0.51s when
> using Tie::File (and it is probably still completely unsuitable to
> solve the original problem to begin with).


In general I'd agree. But there's an upper bound of 10M records. If that scenario changed or some threshold was impacted, you could re-design. But, who cares here if you lose a second of runtime... or memory bumps during that short window. The OP said accuracy - not speed - was the objective: "it wouldn't matter if it took 5 seconds to run or 5 minutes to run, as long as it produces the correct results."

The code becomes simpler, more intuitive, timelier. You can quickly move on... to more pressing/interesting/challenging issues.

--
Charles DeRykus
 
 
Rainer Weikusat
      01-15-2013
"C.DeRykus" <> writes:

[...]

>> Nevertheless, Tie::File not only needs *much* more memory than a
>> line-by-line processing loop (~5000 bytes vs 138M for a 63M file) but
>> is also atrociously slow: Replacing 10 randomly selected lines in a
>> 53,248 lines file with a total size of 251K needs (on the system
>> where I tested this) about 0.02s when reading but about 0.51s when
>> using Tie::File (and it is probably still completely unsuitable to
>> solve the original problem to begin with).

>
> In general I'd agree. But there's an upper bound of 10M records. If
> that scenario changed or some threshold was impacted, you could
> re-design. But, who cares here if you lose a second of runtime... or
> memory bumps during that short window. The OP said accuracy - not
> speed - was the objective: "it wouldn't matter if it took 5 seconds
> to run or 5 minutes to run, as long as it produces the correct
> results."
>
> The code becomes simpler, more intuitive, timelier. You can quickly
> move on... to more pressing/interesting/challenging issues.


The code does not 'become simpler', it becomes a lot more complicated.
Not even the 'front-end code' which needs to be written specifically for
this is shorter than a sensible (meaning, it performs well)
implementation without Tie::File since it was 8 lines of code in both
cases. A 'performance doesn't matter' implementation can be shorter
than that, as demonstrated. IMO, this is really an example of using a
module because it exists, even though it isn't suitable for solving the
described problem, is a lot more complicated than just using the
facilities already provided by perl, and is vastly technically inferior
to these as well.
 