Velocity Reviews - Computer Hardware Reviews

Best way to replace a set of strings in large files?

 
 
Ryan Chan
 
      12-10-2009
Hello,

Consider the case:

You have 200 lines of mapping to replace, in a csv format, e.g.

apple,orange
boy,girl
....

You have a 500MB file, you want to replace all 200 lines of mapping,
what would be the most efficient way to do it?

Thanks.
 
 
 
 
 
cvhLE
 
      12-11-2009
On Dec 10, 3:21 pm, Ryan Chan <ryanchan...@gmail.com> wrote:
> Hello,
>
> Consider the case:
>
> You have 200 lines of mapping to replace, in a csv format, e.g.
>
> apple,orange
> boy,girl
> ...
>
> You have a 500MB file, you want to replace all 200 lines of mapping,
> what would be the most efficient way to do it?
>
> Thanks.


If you want to replace the whole line, or you know the column where the
replacement is needed and the line has clear separators, you may be a
lot faster doing it with awk:

cat csv | awk -F"," '$2~/apple/ {$2="orange"; print $1,$2}' ...

otherwise I don't see a reason not to use the most obvious way:
starting from line 1 and running until the end ... especially if you don't
know *where* the 200 lines are ...

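#!/usr/bin/perl
use strict;
use warnings;

my %replace = ('apple' => 'orange', 'boy' => 'girl');
# Longest keys first so a longer key wins over its own prefix;
# quotemeta keeps regex metacharacters in a key from being interpreted.
my $alt = join '|', map { quotemeta } sort { length($b) <=> length($a) } keys %replace;
my $r = qr/($alt)/;
while (<>) {
    s/$r/$replace{$1}/g;
    print;
}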




[08:07:43] cvh@lenny:~$ echo "a boy named sue sings a song for apple jack" | perl repl.pl
a girl named sue sings a song for orange jack
[08:07:45] cvh@lenny:~$ echo "a boy named sue sings a song for apple jack" > test.txt
[08:07:59] cvh@lenny:~$ perl repl.pl test.txt
a girl named sue sings a song for orange jack
[08:08:11] cvh@lenny:~$ perl repl.pl test.txt >test_replace.txt
[08:08:24] cvh@lenny:~$ cat test_replace.txt
a girl named sue sings a song for orange jack
[08:08:40] cvh@lenny:~$
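
The script above hardcodes two pairs, while the question has 200 of them in a CSV file. A minimal sketch of reading the pairs from such a file might look like the following. It assumes plain `from,to` lines with no quoting or embedded commas; the helper names `load_mapping`, `build_regex` and `replace_line` are made up for this illustration, and the mapping is read from an in-memory string only to keep the demo self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read "from,to" pairs, one per line, into a hash.
# Assumption: simple fields, no quoting, no commas inside a field.
sub load_mapping {
    my ($fh) = @_;
    my %map;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;              # skip empty lines
        my ($from, $to) = split /,/, $line, 2;
        next unless defined $to;               # skip lines without a comma
        $map{$from} = $to;
    }
    return \%map;
}

# Build a single alternation: longest keys first, metacharacters escaped.
sub build_regex {
    my ($map) = @_;
    die "empty mapping\n" unless %$map;
    my $alt = join '|', map { quotemeta }
              sort { length($b) <=> length($a) || $a cmp $b } keys %$map;
    return qr/($alt)/;
}

# Apply every mapping to one line in a single pass.
sub replace_line {
    my ($line, $re, $map) = @_;
    $line =~ s/$re/$map->{$1}/g;
    return $line;
}

# Demo: the mapping "file" is an in-memory string here.
my $csv = "apple,orange\nboy,girl\n";
open my $fh, '<', \$csv or die "cannot open in-memory mapping: $!";
my $map = load_mapping($fh);
close $fh;
my $re = build_regex($map);
print replace_line("a boy named sue sings a song for apple jack\n", $re, $map);
# prints: a girl named sue sings a song for orange jack
```

For the real job, the same functions could be fed a filehandle opened on the mapping file, with the large file streamed line by line through `replace_line`, so the 500MB never has to sit in memory at once.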


 
 
 
 
 
sln@netherlands.com
 
      12-11-2009
On Thu, 10 Dec 2009 23:09:28 -0800 (PST), cvhLE wrote:

>On Dec 10, 3:21 pm, Ryan Chan <ryanchan...@gmail.com> wrote:
>> Hello,
>>
>> Consider the case:
>>
>> You have 200 lines of mapping to replace, in a csv format, e.g.
>>
>> apple,orange
>> boy,girl
>> ...
>>
>> You have a 500MB file, you want to replace all 200 lines of mapping,
>> what would be the most efficient way to do it?
>>
>> Thanks.

>
>If you want to replace the whole line, or you know the column where the
>replacement is needed and the line has clear separators, you may be a
>lot faster doing it with awk:
>
>cat csv | awk -F"," '$2~/apple/ {$2="orange"; print $1,$2}' ...
>
>otherwise I don't see a reason not to use the most obvious way:
>starting from line 1 and running until the end ... especially if you don't
>know *where* the 200 lines are ...
>
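>#!/usr/bin/perl
>use strict;
>use warnings;
>
>my %replace = ('apple' => 'orange', 'boy' => 'girl');
># Longest keys first so a longer key wins over its own prefix;
># quotemeta keeps regex metacharacters in a key from being interpreted.
>my $alt = join '|', map { quotemeta } sort { length($b) <=> length($a) } keys %replace;
>my $r = qr/($alt)/;
>while (<>) {
>    s/$r/$replace{$1}/g;
>    print;
>}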
>


I would assume this process would take
a long time.

At a minimum, it would take

500,000,000
x
200
-----------------
100,000,000,000

100 billion character comparisons
if nothing ever matched.
Still with no word matching, but with the first character
matching before backtracking,

100,000,000,000
x
2
----------------
200,000,000,000

brings the total up to 200 billion character
comparisons.

Since all of this is a conservative estimate,
I would assume an average (again conservatively) of 4 character
comparisons per mapping per byte in the file, i.e. 4 x 200 = 800
per byte, and say

500,000,000
x
800
-----------------
400,000,000,000

400 billion comparisons.
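
The three figures above can be reproduced with a few lines of Perl. This is only a sketch of the arithmetic in this post, under its own model (every byte of the file is tested against every mapping); it says nothing about how many comparisons Perl's regex engine really performs. The names `estimate` and `commify` are made up for the example.

```perl
use strict;
use warnings;

my $file_bytes = 500_000_000;   # 500MB file, as in the question
my $mappings   = 200;           # number of mapping lines

# Comparisons under the model: each byte is tested against each mapping,
# costing $per_map character comparisons per test.
sub estimate {
    my ($bytes, $maps, $per_map) = @_;
    return $bytes * $maps * $per_map;
}

# Insert thousands separators (a common Perl idiom).
sub commify {
    my ($n) = @_;
    1 while $n =~ s/^(\d+)(\d{3})/$1,$2/;
    return $n;
}

for my $per_map (1, 2, 4) {
    print "$per_map per mapping per byte: ",
          commify(estimate($file_bytes, $mappings, $per_map)), "\n";
}
# prints:
# 1 per mapping per byte: 100,000,000,000
# 2 per mapping per byte: 200,000,000,000
# 4 per mapping per byte: 400,000,000,000
```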
Add to that the minutiae of backtracking, loading
buffers, writing to disk, and the underpinning layers
Perl goes through to execute C code, and I would go out
for coffee or take a nap.
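
An estimate like this can be checked against a measurement on a small sample before committing to the full run. The sketch below times the single-alternation replacement on generated data and extrapolates linearly to 500MB. The sample text, the two-entry mapping and the helper name `time_replacement` are assumptions made for the illustration; a real test would use all 200 mappings and a slice of the actual file, and the linear extrapolation ignores disk I/O.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

my %replace = (apple => 'orange', boy => 'girl');   # stand-in for the 200 mappings
my $alt = join '|', map { quotemeta }
          sort { length($b) <=> length($a) } keys %replace;
my $re = qr/($alt)/;

# Run the replacement over every line; return elapsed seconds and bytes seen.
sub time_replacement {
    my ($lines) = @_;
    my $bytes = 0;
    my $start = time();
    for my $line (@$lines) {
        $bytes += length $line;
        (my $copy = $line) =~ s/$re/$replace{$1}/g;   # work on a copy, keep the sample intact
    }
    return (time() - $start, $bytes);
}

my @sample = ("a boy named sue sings a song for apple jack\n") x 100_000;
my ($elapsed, $bytes) = time_replacement(\@sample);

my $target = 500_000_000;   # size of the file in the question
if ($elapsed > 0) {
    printf "%d bytes in %.3f s; extrapolated to %d bytes: about %.1f s\n",
        $bytes, $elapsed, $target, $elapsed * $target / $bytes;
}
```

The number it prints depends entirely on the machine and the Perl build, so it is a sanity check on the estimate rather than a prediction.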

-sln



 