Faster file iteration
use strict;

my $file_1 = '1.txt'; # File 1
my $file_2 = '2.txt'; # File 2

if(open(FH1 , $file_1)){
    print "File $file_1 Opened\n";
}else{
    print "Failed to Open file $file_1\n";
    exit;
}

if(open(FH2 , $file_2)){
    print "File $file_2 Opened\n";
}else{
    print "Failed to Open file $file_2\n";
    close FH1;
    exit;
}

while(chomp(my $line_2 = <FH2>)){
    my($dummy21,$file21_no,$file21_date) = split(/\s+/,$line_2);
    next if($file21_no !~ /\d+/);
    my $counter1 = 0;
    my $least_date1 = 0;
    seek(FH1,0,0);
    $least_date1 = date_compare($file21_date);
    while(chomp(my $line_1 = <FH1>)){
        my($d,$file1_no,$file1_date) = split(/;/,$line_1);
        if($file1_no == $file21_no){
            $file1_date =~/(\d\d\d\d)(\d\d)(\d\d)/;
            my $yr1 = $1;
            $file21_date =~/(\d\d\d\d)(\d\d)(\d\d)/;
            if(($yr1 - $1) < 5){
                $counter1++;
            }
        }
    }
    $least_date1 = 0 if($counter1 == 0);
    print "$dummy21\t$file21_no\t$file21_date\t$counter1 \t $least_date1\n";
    print FH3 "$dummy21\t$file21_no\t$file21_date\t$counter1 \t $least_date1\n";
}

Here $file_1 has around 12,000,000 records, and it takes 2 minutes to process a single record from $file_2.

Any suggestions to make it faster?
Re: Faster file iteration
On Thu, 13 Mar 2008 06:41:59 -0700, vijay@iavian.com wrote:
> Here $file_1 has around 12000000 records , it takes 2 mins to go for a
> single record in $file_2.
>
> Any suggestion to make it fast ?

Read file_1 once and store it in an appropriate data structure (a hash comes to mind). It may still take two minutes to read, but after that searching is fast. It does take some memory, but 12 million records should take less than 100 Megs.

M4
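The read-once approach suggested above can be sketched roughly as follows. This is a minimal illustration, not the original poster's code: the sample data, the build_index helper and the %dates_for hash are all assumptions made for the example, and an in-memory filehandle stands in for the real 1.txt. Hash keys are compared as strings, which also suits file numbers that contain letters.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the large file 1
# (semicolon-separated: id;fileno;date).
my $file1_data = <<'END';
1234567;7654321;20080225
1234765;5464354;19821111
342312A;5464354;19990101
ABC12;9876544;0
END

# Read file 1 exactly once and group its dates by file number.
sub build_index {
    my ($fh) = @_;
    my %dates_for;
    while (my $line = <$fh>) {
        chomp $line;
        my (undef, $fileno, $date) = split /;/, $line;
        next unless defined $fileno && length $fileno;
        push @{ $dates_for{$fileno} }, $date;
    }
    return \%dates_for;
}

# An in-memory filehandle is used here; a real run would open '1.txt' instead.
open my $fh, '<', \$file1_data or die "Cannot open in-memory file: $!";
my $index = build_index($fh);
close $fh;

# Each lookup is now a hash access instead of a full pass over file 1.
for my $fileno (qw(5464354 2233442)) {
    my @dates = @{ $index->{$fileno} || [] };
    print "$fileno: ", scalar(@dates), " record(s)\n";
}
```

With the index built, the inner while loop over FH1 and the seek(FH1,0,0) call are no longer needed for each line of file 2.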
Re: Faster file iteration
On Mar 13, 7:52 pm, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
> vi...@iavian.com wrote:
>
> > Here $file_1 has around 12000000 records , it takes 2 mins to go for a
> > single record in $file_2.
>
> > Any suggestion to make it fast ?
>
> Are the two files in date-sorted order?
>
> BugBear

No, they are not sorted by date, and there is no unique key.
Re: Faster file iteration
"vijay@iavian.com" <vijay@iavian.com> wrote:
> use strict;
>
> my $file_1 = '1.txt'; # File 1
> my $file_2 = '2.txt'; # File 2
>
> if(open(FH1 , $file_1)){
>     print "File $file_1 Opened\n";
> }else{
>     print "Failed to Open file $file_1\n";
>     exit;
> }
>
> if(open(FH2 , $file_2)){
>     print "File $file_2 Opened\n";
> }else{
>     print "Failed to Open file $file_2\n";
>     close FH1;
>     exit;
> }
>
> while(chomp(my $line_2 = <FH2>)){
>     my($dummy21,$file21_no,$file21_date) = split(/\s+/,$line_2);
>     next if($file21_no !~ /\d+/);
>     my $counter1 = 0;
>     my $least_date1 = 0;
>     seek(FH1,0,0);
>     $least_date1 = date_compare($file21_date);
>     while(chomp(my $line_1 = <FH1>)){
>         my($d,$file1_no,$file1_date) = split(/;/,$line_1);
>         if($file1_no == $file21_no){

You could pre-load file1 into a hash (keyed by $file1_no) of lists of the lines that have that $file1_no. That way, for each line in file2, you only need to go through those lines of file1 that already meet the above condition. This by itself should greatly improve things, unless most of the data falls under the same one or just a few $file1_no values.

>             $file1_date =~/(\d\d\d\d)(\d\d)(\d\d)/;
>             my $yr1 = $1;
>             $file21_date =~/(\d\d\d\d)(\d\d)(\d\d)/;
>             if(($yr1 - $1) < 5){
>                 $counter1++;
>             }

And within a given $file1_no hashed list, you could sort by file1_date. That way, once you meet a non-qualifying date, you could abort the loop early rather than testing all the rest. (This improvement would probably be quite small compared to the previous one.)

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
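The second suggestion (sort each per-number list by date, then stop at the first date that fails the test) might look something like the sketch below. It assumes the hash of lists has already been loaded; %dates_for and count_recent are hypothetical names, the sample dates are made up, and the test is the same year-only comparison as in the posted code. Placeholder dates such as '0' are not handled here.

```perl
use strict;
use warnings;

# Hypothetical pre-loaded index: file number => list of YYYYMMDD dates from file 1.
my %dates_for = (
    '5464354' => [ '19990101', '19821111', '19850630' ],
    '7654321' => [ '20080225' ],
);

# Sort each list once, oldest first. Eight-digit YYYYMMDD values sort
# chronologically when compared as numbers.
for my $fileno (keys %dates_for) {
    $dates_for{$fileno} = [ sort { $a <=> $b } @{ $dates_for{$fileno} } ];
}

# Count dates whose year is less than 5 after the file 2 year,
# leaving the loop at the first date that fails the test.
sub count_recent {
    my ($sorted_dates, $file2_date) = @_;
    my $year2 = substr $file2_date, 0, 4;
    my $count = 0;
    for my $date (@$sorted_dates) {
        my $year1 = substr $date, 0, 4;
        last if $year1 - $year2 >= 5;   # the list is sorted, so every later date fails too
        $count++;
    }
    return $count;
}

print count_recent($dates_for{'5464354'}, '19811111'), "\n";
```

The early exit only pays off for file numbers with long lists; the hash lookup itself is what removes the full scan of file 1.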
Re: Faster file iteration
"vijay@iavian.com" <vijay@iavian.com> wrote:
[code snipped]

Thank you for posting the code. But what is it _supposed_ to do? What are the requirements? Unless you tell us, we can't know whether you are doing something unnecessary in your code.

>Here $file_1 has around 12000000 records , it takes 2 mins to go for a
>single record in $file_2.
>
>Any suggestion to make it fast ?

Give us a spec and maybe someone will be able to come up with a better algorithm.

jue
Re: Faster file iteration
<vijay@iavian.com> wrote in message news:26fd1101-cf35-4b41-9cbb-945127088f84@h11g2000prf.googlegroups.com...
....
> Here $file_1 has around 12000000 records , it takes 2 mins to go for a
> single record in $file_2.
>
> Any suggestion to make it fast ?

Obvious answer: if you have the memory, read file1 into memory and process it from there.

Mario
Re: Faster file iteration
On Mar 13, 9:23 pm, Jürgen Exner <jurge...@hotmail.com> wrote:
> Give us a spec and maybe someone will be able to come up with a better
> algorithm.

The specs:

We have two files. The first file, say 'one.txt', has data arranged in three columns separated by semicolons, something like this:

1234567;7654321;20080225
1234765;5464354;19821111
342312A;5464354;19990101
ABC12;9876544;0
I002222;ACD222;19991130
.........

Note that the three columns are not of fixed length. The first two columns have a maximum length of 7 and can contain alphanumeric characters. The third column is the date column (in YYYYMMDD format). It can also contain '0' or be empty.

The second file, say 'two.txt', also has three columns, separated by spaces, something like:

serialno fileno date
123 1234567 20080315
2 2233442 20081130
311 1232231 20031221
44 1232123 19990831
23 2131312 20000101
132 5464354 19811111
......

The entire file contains only numerals from the second line onwards. The first column is 1 to 3 digits long. The second column is always exactly 7 digits long. The third column is the date column, strictly in YYYYMMDD format.

Now, the requirement is to add two additional columns to 'two.txt'. The fourth and fifth columns will be tab separated and labeled 'label4' and 'label5' respectively.

The values to be populated under 'label4' should be computed as follows:

Read the 7-digit number in the second column (under fileno) of 'two.txt'. Compare the number with the alphanumeric value in the second column of the 'one.txt' file. On finding a perfect match, trigger a counter. Repeat the previous procedure for subsequent lines and increment the counter each time you find a match. The fourth column should then be populated with the final value of the counter against the fileno, which is the number of exact matches you've found. If you've found no match, just populate the entry with a '0' (zero).

But there is one condition you need to take care of before populating: the date difference in each row should be less than or equal to 5 years. To check this, pick up the date next to that fileno in 'two.txt' and the date from the third column in 'one.txt', and take the difference. If the difference is more than 5 years, do not increment the counter.

*NOTE: the date in file 'one.txt' is always greater (or later) than the corresponding date in 'two.txt'. The date ranges from 19900101 to 20041231 in file 'two.txt' and from 19750101 to 20011225 in file 'one.txt'.

In the above example, the new 'two.txt' will look something like:

serialnumber fileno date label4
123 1234567 20080315 0
2 2233442 20081130 0
311 1232231 20031221 0
44 1232123 19990831 0
23 2131312 20000101 0
132 5464354 19811111 1

*Label5:

We know that the dates in 'one.txt' end on 12/25/2001. For every matched file number in 'two.txt', please do the following:

1. If the date in 'two.txt' is earlier than 12/25/2001 by 5 years or more, mark it as 5 yrs.
2. If it is between 12/25/1996 and 12/25/2001, mark the exact difference in years, months and days.
3. If it is later than 12/25/2001 and up to 31/12/2004, mark the exact difference in years, months and days, but put a '-' (minus sign) in front of it.
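As a rough illustration of the label4 rule in this spec, the sketch below runs it over the sample rows given above. Everything beyond the sample data is an assumption made for the example: the names within_5_years, label4 and %dates_for are hypothetical, a date of '0' or an empty date in 'one.txt' is treated as "no match", and label5 is left out entirely. The 5-year test relies on the fact that adding 50000 to a YYYYMMDD number moves it forward by exactly five calendar years.

```perl
use strict;
use warnings;

# Sample rows copied from the spec.
my $one_txt = <<'END';
1234567;7654321;20080225
1234765;5464354;19821111
342312A;5464354;19990101
ABC12;9876544;0
I002222;ACD222;19991130
END

my $two_txt = <<'END';
serialno fileno date
123 1234567 20080315
2 2233442 20081130
311 1232231 20031221
44 1232123 19990831
23 2131312 20000101
132 5464354 19811111
END

# True when $date1 is no more than 5 calendar years after $date2.
# Assumption: '0' and empty dates never count as a match.
sub within_5_years {
    my ($date1, $date2) = @_;
    return 0 unless defined $date1 && $date1 =~ /^\d{8}$/;
    return $date1 <= $date2 + 50000 ? 1 : 0;
}

# Index one.txt by its second column; keys are compared as strings,
# so alphanumeric file numbers such as ACD222 are handled correctly.
my %dates_for;
for my $line (split /\n/, $one_txt) {
    my (undef, $fileno, $date) = split /;/, $line;
    push @{ $dates_for{$fileno} }, $date;
}

# label4: number of one.txt rows with the same fileno and a qualifying date.
sub label4 {
    my ($fileno, $date2) = @_;
    my $count = grep { within_5_years($_, $date2) } @{ $dates_for{$fileno} || [] };
    return $count;
}

for my $line (split /\n/, $two_txt) {
    my ($serial, $fileno, $date) = split ' ', $line;
    if ($fileno !~ /^\d{7}$/) {   # header line
        print join("\t", $serial, $fileno, $date, 'label4'), "\n";
        next;
    }
    print join("\t", $serial, $fileno, $date, label4($fileno, $date)), "\n";
}
```

On the sample data only the last row (5464354, 19811111) gets a count of 1, because 19821111 falls inside the 5-year window and 19990101 does not, which agrees with the expected output listed in the spec.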
Re: Faster file iteration
On Mar 14, 4:33 pm, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
> Overall performance is O(NlogN) + O(N) + O(NlogN) which is O(NlogN)
> which is rather better than your present O(N^2)
>
> BugBear

Any suggestions on using Thread?

#!/usr/bin/perl

use strict;
#use Data::Dumper;
#use CGI;
use Date::Calc qw(Delta_YMD);
use Thread;

my $file_1 = '1.txt'; # File 1
my $file_2 = '2.txt'; # File 2
my $file_3 = 'f.txt'; # Final output file

if(open(FH1 , $file_1)){
    print "File $file_1 Opened\n";
    close(FH1);
}else{
    print "Failed to Open file $file_1\n";
    exit;
}

if(open(FH2 , $file_2)){
    print "File $file_2 Opened\n";
}else{
    print "Failed to Open file $file_2\n";
    close FH1;
    exit;
}

if(open(FH3,">$file_3")){
    print "File $file_3 Opened\n";
    print FH3 "serialno\tfileno\tdate\tlabel4\tlabel5\n";
}else{
    print "Failed to Open file $file_3\n";
    close FH1;close FH2;
    exit;
}

while(my $line_2 = <FH2>){
    chomp($line_2);print $line_2."\n";
    my($dummy,$file2_no,$file2_date) = split(/\s+/,$line_2);
    next if($file2_no !~ /\d+/);
    my $counter = 0;
    my $least_date = date_compare($file2_date);
    my $thr = new Thread \&traverse, $dummy,$file2_no,$file2_date, $counter,$least_date;
    #$counter = traverse($file2_no,$file2_date);
}
sleep(500);
close FH1;
close FH2;
close FH3;

sub traverse{
    my($dummy,$file2_no,$file2_date,$counter,$least_date) = @_;
    my $counter = 0;
    open(FHT , $file_1);
    seek(FHT,0,0);
    while(my $line_1 = <FHT>){
        chomp($line_1);
        my ($d,$file1_no,$file1_date) = split(/;/,$line_1);
        if($file1_no == $file2_no){
            #print $file1_date."=".$file2_date."\n";
            if((date_compare5($file1_date,$file2_date)) == 1){
                $counter++;
            }
        }
    }
    close(FHT);
    $least_date = 0 if($counter == 0);
    print "$dummy\t$file2_no\t$file2_date\t$counter\t$least_date\n";
    print FH3 "$dummy\t$file2_no\t$file2_date\t$counter\t$least_date\n";
    return $counter;
}

sub date_compare5{ # Comparison for 5 Years
    my($date_1,$date_2) = @_;
    $date_1 =~/(\d\d\d\d)(\d\d)(\d\d)/;
    my $yr1 = $1;
    $date_2 =~/(\d\d\d\d)(\d\d)(\d\d)/;
    my $yr2 = $1;
    #print "$yr1=$mn1=$dt1: ";print "$yr2=$mn2=$dt2\n";
    if(($yr1 - $yr2) < 5){
        #print "$yr1=$mn1=$dt1: ";print "$yr2=$mn2=$dt2\n";
        return 1;
    }
    return -1;
}

sub date_compare{ # Comparison for actual date , return 1 if date1 is big otherwise -1 , if equal then 0
    my($date_1) = @_;
    $date_1 =~/(\d\d\d\d)(\d\d)(\d\d)/;
    my($yr1,$mn1,$dt1) = ($1,$2,$3);
    if($yr1 < 1996){
        return "5 Yrs";
    }elsif($yr1 == 1996 && $mn1 < 12){
        return "5 Yrs";
    }elsif($yr1 == 1996 && $mn1 == 12 && $dt1 <= 25 ){
        return "5 Yrs";
    }elsif($yr1 < 2001 && $yr1 > 1996){
        return delta($yr1,$mn1,$dt1);
    }elsif($yr1 == 1996 && $mn1 == 12 && $dt1 >=25){
        return delta($yr1,$mn1,$dt1);
    }elsif($yr1 == 2001 && $mn1 < 12 ){
        return delta($yr1,$mn1,$dt1);
    }elsif($yr1 == 2001 && $mn1 == 12 && $dt1 <=24){
        return delta($yr1,$mn1,$dt1);
    }elsif($yr1 > 2001){
        return delta($yr1,$mn1,$dt1);
    }elsif($yr1 == 2001 && $mn1 == 12 && $dt1 > 24 ){
        return delta($yr1,$mn1,$dt1);
    }else{
        return "No case ".$date_1;
    }
}

sub delta{
    my $yr = shift;my $mn = shift; my $dt= shift;
    ($yr,$mn,$dt) = Delta_YMD($yr,$mn,$dt,2001,12,25);
    return "$yr-$mn-$dt";
}
Re: Faster file iteration
"vijay@iavian.com" <vijay@iavian.com> wrote:
> On Mar 14, 4:33 pm, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
> >
> > Overall performance is O(NlogN) + O(N) + O(NlogN) which is O(NlogN)
> > which is rather better than your present O(N^2)
> >
> > BugBear
>
> Any suggestions on using Thread?

God, I hope not. It seems like you want to try every bad way to solve this problem. What about the suggestions you already received, the ones that would actually work and make things fast?

Xho
Re: Faster file iteration
Quoth "vijay@iavian.com" <vijay@iavian.com>:
> On Mar 14, 4:33 pm, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
> >
> > Overall performance is O(NlogN) + O(N) + O(NlogN) which is O(NlogN)
> > which is rather better than your present O(N^2)
>
> Any suggestions on using Thread?

Thread.pm is deprecated: it supported the old 5005threads threading model, which never worked right and was removed in perl 5.10. Thread.pm is now just a passthrough to threads.pm; new code should use threads.pm directly.

Ben