Possibly useful perl script to filter lines in one file out of another.
Hi!
I needed to take the email addresses that bounced out of an original
mailing list. grep -v -f was far too slow, and comm produced
unexpected results, and so I just wrote something to do it in perl.

Thought this might be useful to somebody else:

    #!/usr/bin/perl
    #
    # filter $file1 $file2
    #
    # Filters all lines in file1 against lines in file2, copying only lines
    # from file 1 not found in file2 to STDOUT
    #

    # get arguments
    my $file1 = shift;
    my $file2 = shift;

    if(!defined($file1) || !defined($file2))
    {
        print "\nError, must have two arguments.\n";
        print "filter <masterfile> <excludefile>\n";
        exit 1;
    }

    # Copy all lines from file2 into a hash

    open (EXCLUDE, $file2);
    my %exclude = ();
    while ($line = <EXCLUDE>)
    {
        chomp($line);
        $exclude{$line} = 1;
    }
    close EXCLUDE;

    # Now go through input line-by-line comparing to hash and only
    # printing lines that do not match

    open (DATA, $file1);
    while ($line = <DATA>)
    {
        chomp($line);
        if(!exists($exclude{$line}))
        {
            print "$line\n";
        }
    }
    close DATA;

    exit 0;
Re: Possibly useful perl script to filter lines in one file out of another.
In article <230820091346263431%benburch@pobox.com>,
Ben Burch <benburch@pobox.com> wrote:
>I needed to take the email addresses that bounced out of an original
>mailing list. grep -v -f was far too slow, and comm produced
>unexpected results, and so I just wrote something to do it in perl.

comm requires that both input files be sorted -- presumably in byte
value order rather than by dictionary order. When comm bites me, it's
because I've forgotten that.

--
Tim McDaniel, tmcd@panix.com
Re: Possibly useful perl script to filter lines in one file out of another.
>>>>> "BB" == Ben Burch <benburch@pobox.com> writes:
  BB> I needed to take the email addresses that bounced out of an original
  BB> mailing list. grep -v -f was far too slow, and comm produced unexpected
  BB> results, and so I just wrote something to do it in perl. Thought this
  BB> might be useful to somebody else;

i find it hard to believe that grep -v -f is slower than perl. did you
benchmark the final results?

  BB> #!/usr/bin/perl
  BB> #

no warnings or strict. use them.

  BB> # get arguments
  BB> my $file1 = shift;
  BB> my $file2 = shift;

  BB> if(!defined($file1) || !defined($file2))
  BB> {
  BB>     print "\nError, must have two arguments.\n";
  BB>     print "filter <masterfile> <excludefile>\n";
  BB>     exit 1;
  BB> }

much simpler and slightly more accurate is to check @ARGV if it has 2
elements:

    unless( @ARGV == 2 ) {
        die 'blah' ;
    }

and use better names than file1 and file2. they are files of different
data. keep the same argument order as your usage line:

    my( $data_file, $exc_file ) = @ARGV ;

  BB> # Copy all lines from file2 into a hash

  BB> open (EXCLUDE, $file2);
  BB> my %exclude = ();
  BB> while ($line = <EXCLUDE>)
  BB> {
  BB>     chomp($line);
  BB>     $exclude{$line} = 1;
  BB> }
  BB> close EXCLUDE;

    use File::Slurp ;
    my %exclude = map { chomp; $_ => 1 } read_file( $exc_file ) ;

  BB> # Now go through input line-by-line comparing to hash and only
  BB> # printing lines that do not match

  BB> open (DATA, $file1);

don't use DATA for a file handle as it is the handle name for data in
the source file after the __END__ marker

  BB> while ($line = <DATA>)
  BB> {
  BB>     chomp($line);
  BB>     if(!exists($exclude{$line}))
  BB>     {
  BB>         print "$line\n";
  BB>     }

invert that logic for simpler code:

    next if $exclude{ $line } ;
    print "$line\n" ;

and if your master list file isn't that large (for some definition of
large) you can also slurp and filter it out too. and since your data
and exclude lines are all ending in newline (provided the last line of
each file has one too), there is no need to chomp in either case. it
makes this much easier.

<untested entire main code>

    my %exclude = map { $_ => 1 } read_file( $exc_file ) ;
    print grep { !$exclude{ $_ } } read_file( $data_file ) ;

ain't perl cool! :)

uri

--
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------
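For anyone who would rather not install File::Slurp, the suggestions so far (strict and warnings, lexical filehandles, a checked three-argument open, the inverted test) can be combined using core Perl only. This is a minimal sketch, not code posted by anyone in the thread: the helper names load_exclude and filter_lines are made up for illustration, and the argument order follows the original usage line (master file first, exclude file second).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read every line of the
# exclude file into a hash keyed by the chomped line.
sub load_exclude {
    my ($path) = @_;
    open my $fh, '<', $path or die "could not open '$path': $!";
    my %exclude;
    while ( my $line = <$fh> ) {
        chomp $line;
        $exclude{$line} = 1;
    }
    close $fh;
    return \%exclude;
}

# Hypothetical helper: return the lines of the master file that are
# not keys of the exclude hash, in their original order.
sub filter_lines {
    my ( $path, $exclude ) = @_;
    open my $fh, '<', $path or die "could not open '$path': $!";
    my @kept;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $exclude->{$line};
        push @kept, $line;
    }
    close $fh;
    return @kept;
}

# Do the filtering only when exactly two file names were given:
#   filter <masterfile> <excludefile>
if ( @ARGV == 2 ) {
    my ( $data_file, $exc_file ) = @ARGV;
    print "$_\n" for filter_lines( $data_file, load_exclude($exc_file) );
}
```

Chomping both sides means a final line without a trailing newline still matches, at the cost of a little extra work per line.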
Re: Possibly useful perl script to filter lines in one file out of another.
On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan <tadmc@seesig.invalid> wrote:
>Ben Burch <benburch@pobox.com> wrote:
>
>> #!/usr/bin/perl
>
>    use warnings;
>    use strict;
>
>> open (EXCLUDE, $file2);
>
>You should always, yes *always*, check the return value from open():
>
>    open my $EXCLUDE, '<', $file2 or die "could not open '$file2' $!";
>
>> while ($line = <EXCLUDE>)
>
>    while ($line = <$EXCLUDE>)
>
>> if(!exists($exclude{$line}))
>
>    unless ( exists $exclude{$line} )
>
>or at least make wise use of whitespace and punctuation:
>
>    if ( ! exists $exclude{$line} )

Hi Tad.

I've seen that "always check the return value of open" advice
here on this NG, and then die if it is not true?

Why die if open didn't die? What's the worst thing that can happen?
I think the worst thing is that a read or write doesn't happen.
It won't crash the system or mess up the file allocation tables.

It's funny, if you pass a failed open filehandle like

    open my $fh, 'non-existant-file.txt'

to a read $fh,... the read passively fails. There is no
fatal error.

But if you pass an undefined filehandle to read, it
dies.

Something to consider, since a failed open does not really
cause problems, and apparently an undefined handle is
enough to cause a die from Perl's I/O functions (well, at least read).

So, why is it always, yes always, necessary to check the return
value from open() ?

-sln
Re: Possibly useful perl script to filter lines in one file out of another.
sln@netherlands.com wrote:
> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan
> <tadmc@seesig.invalid> wrote:
>
>>Ben Burch <benburch@pobox.com> wrote:
>>
>>> #!/usr/bin/perl
>>
>>    use warnings;
>>    use strict;
>>
>>> open (EXCLUDE, $file2);
>>
>>You should always, yes *always*, check the return value from open():
>>
>>    open my $EXCLUDE, '<', $file2 or die "could not open '$file2' $!";
>>
>>> while ($line = <EXCLUDE>)
>>
>>    while ($line = <$EXCLUDE>)
>>
>>> if(!exists($exclude{$line}))
>>
>>    unless ( exists $exclude{$line} )
>>
>>or at least make wise use of whitespace and punctuation:
>>
>>    if ( ! exists $exclude{$line} )
>
> Hi Tad.
>
> I've seen that "always check the return value of open" advice
> here on this NG, and then die if it is not true?
>
> Why die if open didn't die? What's the worst thing that can happen?
> I think the worst thing is that a read or write doesn't happen.
> It won't crash the system or mess up the file allocation tables.
>
> It's funny, if you pass a failed open filehandle like
>     open my $fh, 'non-existant-file.txt'
> to a read $fh,... the read passively fails. There is no
> fatal error.
>
> But if you pass an undefined filehandle to read, it
> dies.
>
> Something to consider, since a failed open does not really
> cause problems, and apparently an undefined handle is
> enough to cause a die from Perl's I/O functions (well, at least read).
>
> So, why is it always, yes always, necessary to check the return
> value from open() ?
>
> -sln

If you want to open/read/write to a file, there's an intended reason.
It doesn't have to be a die, the point is to be aware of the problem
and have it output or log the problem, which helps troubleshoot
problems (and unintended bugs). He said to always check the return
value, he didn't say to always die. If you have a script that doesn't
need to open a file you told it to, why are you opening it?
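The "check, but don't necessarily die" idea can be sketched in a few lines. This is only an illustration under stated assumptions: open_or_log is a hypothetical helper name, and logging with warn (which writes to STDERR) stands in for whatever logging a real automation setup would use.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): try to open a file for
# reading. On failure, log the reason with warn() and return undef
# instead of dying, so the caller decides what happens next.
sub open_or_log {
    my ($path) = @_;
    if ( open my $fh, '<', $path ) {
        return $fh;
    }
    warn "could not open '$path': $!\n";
    return;
}

# The caller can skip a file that could not be opened and carry on
# with the remaining ones.
for my $path (@ARGV) {
    my $fh = open_or_log($path) or next;
    print while <$fh>;
    close $fh;
}
```

The return value of open is still checked on every call; the only thing that changes is the reaction to a failure.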
Re: Possibly useful perl script to filter lines in one file out of another.
On Sun, 23 Aug 2009 17:27:13 -0700, Nathan Keel <nat.k@gm.ml> wrote:
>sln@netherlands.com wrote:
>
>> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan
>> <tadmc@seesig.invalid> wrote:
>>
>>>Ben Burch <benburch@pobox.com> wrote:
>>>
>>>> #!/usr/bin/perl
>>>
>>>    use warnings;
>>>    use strict;
>>>
>>>> open (EXCLUDE, $file2);
>>>
>>>You should always, yes *always*, check the return value from open():
>>>
>>>    open my $EXCLUDE, '<', $file2 or die "could not open '$file2' $!";
>>>
>>>> while ($line = <EXCLUDE>)
>>>
>>>    while ($line = <$EXCLUDE>)
>>>
>>>> if(!exists($exclude{$line}))
>>>
>>>    unless ( exists $exclude{$line} )
>>>
>>>or at least make wise use of whitespace and punctuation:
>>>
>>>    if ( ! exists $exclude{$line} )
>>
>> Hi Tad.
>>
>> I've seen that "always check the return value of open" advice
>> here on this NG, and then die if it is not true?
>>
>> Why die if open didn't die? What's the worst thing that can happen?
>> I think the worst thing is that a read or write doesn't happen.
>> It won't crash the system or mess up the file allocation tables.
>>
>> It's funny, if you pass a failed open filehandle like
>>     open my $fh, 'non-existant-file.txt'
>> to a read $fh,... the read passively fails. There is no
>> fatal error.
>>
>> But if you pass an undefined filehandle to read, it
>> dies.
>>
>> Something to consider, since a failed open does not really
>> cause problems, and apparently an undefined handle is
>> enough to cause a die from Perl's I/O functions (well, at least read).
>>
>> So, why is it always, yes always, necessary to check the return
>> value from open() ?
>>
>> -sln
>
>If you want to open/read/write to a file, there's an intended reason.
>It doesn't have to be a die, the point is to be aware of the problem
>and have it output or log the problem, which helps troubleshoot
>problems (and unintended bugs). He said to always check the return
>value, he didn't say to always die. If you have a script that doesn't
>need to open a file you told it to, why are you opening it?
I used to work for a company that did a lot of automation using Perl.
I was new to Perl, but was hired because of my C++ background, and
ended up having to do everything in Perl.
Looking back on it, their motto was: don't die on anything, do not stop
the automation.
The entire environment was dynamically generated. There was not a die
anywhere in any line of code. The check for existence is fine, but you
can't wrap all your other code in ifs all the time. Definitely logs
though, lots of them, on the chance something didn't work.

They could have used something like this, though they didn't have it.

    use strict;
    use warnings;

    my ($buf,$length) = ('',5);

    # Invoke error #1, NON - FATAL error on read.
    # File doesn't exist, however, $fh is valid
    open my $fh, '<', 'notexists.txt';

    # Invoke error #2, FATAL error on read
    #my $fh;

    open STDERR, '>errors.txt';

    {
        local $!;
        my $status = eval { read ($fh, $buf, $length) };
        $@ =~ s/\s+$//;
        if ($@ || (!$status && $!)) {
            print "Error in read: ". ($@ ? $@ : $! ). "\n";
        }
    }
    print "More code ...\n";

    exit;
    __END__

-sln
Re: Possibly useful perl script to filter lines in one file out of another.
In article <slrnh94199.d8a.tadmc@tadmc30.sbcglobal.net>,
Tad J McClellan <tadmc@seesig.invalid> wrote:
>sln@netherlands.com <sln@netherlands.com> wrote:
>> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan <tadmc@seesig.invalid> wrote:
>
>>>You should always, yes *always*, check the return value from open():
>
>> I've seen that "always check the return value of open" advice
>> here on this NG, ....
>> What's the worst thing that can happen?
>
>You silently get the wrong output and wonder what went wrong.

That's bad, but I think the worst is that you get the wrong output and
DON'T notice (and therefore don't wonder). Instead, you work with bad
or missing data.

--
Tim McDaniel, tmcd@panix.com
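The "wrong output you don't notice" case is easy to reproduce with the filter from the start of the thread. The sketch below is illustrative only: load_exclude_unchecked is a hypothetical helper, the path and addresses are invented, and I/O warnings are switched off inside the helper to mimic the original script, which ran without warnings.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build the exclude hash
# WITHOUT checking open(), as the original script does.
sub load_exclude_unchecked {
    my ($path) = @_;
    no warnings 'io';    # mimic the original script, which had no warnings
    open my $fh, '<', $path;
    my %exclude;
    while ( my $line = <$fh> ) {
        chomp $line;
        $exclude{$line} = 1;
    }
    return \%exclude;
}

# Invented sample data; the bounce file path is assumed not to exist.
my @master  = ( 'alice@example.com', 'bob@example.com' );
my $exclude = load_exclude_unchecked('/no/such/dir/bounces.txt');

# The open failed, so the hash is empty and nothing gets filtered out.
my @kept = grep { !$exclude->{$_} } @master;
print scalar(@kept), " of ", scalar(@master), " addresses kept\n";
```

With a mistyped bounce file name the script exits normally and reports every address as kept, so the bounced addresses stay on the mailing list and nothing indicates that a file was never read.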
Re: Possibly useful perl script to filter lines in one file out of another.
On Sun, 23 Aug 2009 22:22:19 -0500, Tad J McClellan <tadmc@seesig.invalid> wrote:
>sln@netherlands.com <sln@netherlands.com> wrote:
>> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan <tadmc@seesig.invalid> wrote:
>
>>>You should always, yes *always*, check the return value from open():
>
>> I've seen that "always check the return value of open" advice
>> here on this NG,
>
>It's been said often enough.
>
>> then die if not true?
>
>No, then do "whatever is appropriate for your situation".
>
>The admonition is to check the return value.
>
>It is not to take any particular action if the check fails, though
>die is often used, as it is most often the appropriate action.
>
>(the purpose of most programs is to process a file's
> contents, so there is no point in continuing if such a program
> cannot read the file's contents.
>)
>
>> Why die if open didn't die?
>
>open() does not die, if it fails it fails silently (which is why
>you should always, yes *always*, check its return value).
>
>So I don't know what you mean.
>
>Show me some code where an open() dies...
>

Yeah, show me, so why die?

>
>> What's the worst thing that can happen?
>
>You silently get the wrong output and wonder what went wrong.
>
>> But if you pass an undefined filehandle to read, it
>> dies.
>
>No it doesn't.
>
>Show me a program where you pass an undefined filehandle to read
>and it dies...
>

    use strict;
    use warnings;

    my ($buf,$length) = ('',5);

    # Invoke error #2, FATAL error on read
    my $fh;

    {
        local $!;
        my $status = eval { read ($fh, $buf, $length) };
        $@ =~ s/\s+$//;
        if ($@ || (!$status && $!)) {
            print "Error in read: ". ($@ ? $@ : $! ). "\n";
        }
    }
    print "More code ...\n";

    __END__

    c:\temp>perl ss.pl
    Error in read: Can't use an undefined value as a symbol reference at ss.pl line 13.
    More code ...

    c:\temp>

>
>> Something to consider since a failed open does not really
>> cause problems
>
>It does if the purpose of the program is to process that file.
>

Not if it's a juggernaut program that isn't allowed to die.
Aka automation.

>
>> So, why is it always, yes always, necessary to check the return
>> value from open() ?
>
>So that it will fail noisily rather than fail silently!

Fail all you want, but please don't die...

-sln
Re: Possibly useful perl script to filter lines in one file out of another.
sln@netherlands.com wrote:
>
> I used to work for a company that did a lot of automation using Perl.
> I was new to Perl, but was hired because of my C++ background, and
> ended up having to do everything in Perl.
> Looking back on it, their motto was: don't die on anything, do not stop
> the automation.
> The entire environment was dynamically generated. There was not a die
> anywhere in any line of code. The check for existence is fine, but you
> can't wrap all your other code in ifs all the time. Definitely logs
> though, lots of them, on the chance something didn't work.

I'm not suggesting any code needs to die. I'm also not suggesting every
read is vital and can't be ignored. Just for the record.
Re: Possibly useful perl script to filter lines in one file out of another.
On Mon, 24 Aug 2009 03:41:34 +0000 (UTC), tmcd@panix.com (Tim McDaniel) wrote:
>In article <slrnh94199.d8a.tadmc@tadmc30.sbcglobal.net>,
>Tad J McClellan <tadmc@seesig.invalid> wrote:
>>sln@netherlands.com <sln@netherlands.com> wrote:
>>> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan <tadmc@seesig.invalid> wrote:
>>
>>>>You should always, yes *always*, check the return value from open():
>>
>>> I've seen that "always check the return value of open" advice
>>> here on this NG,
>...
>>> What's the worst thing that can happen?
>>
>>You silently get the wrong output and wonder what went wrong.
>
>That's bad, but I think the worst is that you get the wrong output and
>DON'T notice (and therefore don't wonder). Instead, you work with bad
>or missing data.

In reality, you should never need to check the return value from open().
If you can't program to that spec, you haven't been paid to program.

-sln