Velocity Reviews


Ben Burch 08-23-2009 06:46 PM

Possibly useful perl script to filter lines in one file out of another.
 
Hi!

I needed to take the email addresses that bounced out of an original
mailing list. grep -v -f was far too slow, and comm produced unexpected
results, so I just wrote something to do it in perl. Thought this
might be useful to somebody else:

#!/usr/bin/perl
#
# filter $file1 $file2
#
# Filters all lines in file1 against lines in file2, copying only lines
# from file 1 not found in file2 to STDOUT
#

# get arguments

my $file1 = shift;
my $file2 = shift;

if(!defined($file1) || !defined($file2))
{
    print "\nError, must have two arguments.\n";
    print "filter <masterfile> <excludefile>\n";
    exit 1;
}

# Copy all lines from file2 into a hash

open (EXCLUDE, $file2);

my %exclude = ();

while ($line = <EXCLUDE>)
{
    chomp($line);
    $exclude{$line} = 1;
}

close EXCLUDE;

# Now go through input line-by-line comparing to hash and only
# printing lines that do not match

open (DATA, $file1);

while ($line = <DATA>)
{
    chomp($line);
    if(!exists($exclude{$line}))
    {
        print "$line\n";
    }
}

close DATA;

exit 0;

Tim McDaniel 08-23-2009 07:48 PM

Re: Possibly useful perl script to filter lines in one file out of another.
 
In article <230820091346263431%benburch@pobox.com>,
Ben Burch <benburch@pobox.com> wrote:
>I needed to take the email addresses that bounced out of an original
>mailing list. grep -v -f was far to slow, and comm produced
>unexpected results, and so I just wrote something to do it in perl.


comm requires that both input files be sorted -- presumably in byte
value order rather than by dictionary order. When comm bites me, it's
because I've forgotten that.
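
To make the sorting requirement concrete, here is a minimal Perl sketch of a comm-style walk over two lists. The helper name only_in_first and the sample addresses are made up for illustration. It assumes both lists were sorted with the same comparison the walk uses (Perl's cmp, which without "use locale" does not fold case, so 'A' sorts before 'b').

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of @$first that are absent from
# @$second by walking both lists in step, as a comm-style merge does.
# Both lists must already be sorted with the comparison used here (cmp).
sub only_in_first {
    my ($first, $second) = @_;
    my @out;
    my $j = 0;
    for my $line (@$first) {
        # Skip entries in the second list that sort before this line.
        $j++ while $j < @$second && $second->[$j] lt $line;
        push @out, $line unless $j < @$second && $second->[$j] eq $line;
    }
    return @out;
}

my @list    = qw(bob@example.com Alice@example.com carol@example.com);
my @bounced = qw(carol@example.com Alice@example.com);

# Sorted inputs: only bob@example.com survives.
my @kept = only_in_first([sort @list], [sort @bounced]);
print "$_\n" for @kept;

# Unsorted inputs: the walk never sees the Alice entry in @bounced,
# so Alice@example.com is wrongly kept alongside bob@example.com.
my @wrong = only_in_first(\@list, \@bounced);
print "unsorted result: @wrong\n";
```

The hash approach in the original script does not care about order at all, which is why it sidesteps this problem.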

--
Tim McDaniel, tmcd@panix.com

Uri Guttman 08-23-2009 09:16 PM

Re: Possibly useful perl script to filter lines in one file out of another.
 
>>>>> "BB" == Ben Burch <benburch@pobox.com> writes:

BB> I needed to take the email addresses that bounced out of an original
BB> mailing list. grep -v -f was far to slow, and comm produced unexpected
BB> results, and so I just wrote something to do it in perl. Thought this
BB> might be useful to somebody else;

i find it hard to believe that grep -v -f is slower than perl. did you
benchmark the final results?

BB> #!/usr/bin/perl
BB> #

no warnings or strict. use them.

BB> # get arguments

BB> my $file1 = shift;
BB> my $file2 = shift;

BB> if(!defined($file1) || !defined($file2))
BB> {
BB> print "\nError, must have two arguments.\n";
BB> print "filter <masterfile> <excludefile>\n";
BB> exit 1;
BB> }

much simpler and slightly more accurate is to check @ARGV if it has 2
elements:

unless( @ARGV == 2 ) {
    die 'blah' ;
}

and use better names than file1 and file2. they are files of different data

my( $data_file, $exc_file ) = @ARGV ;

BB> # Copy all lines from file2 into a hash

BB> open (EXCLUDE, $file2);

BB> my %exclude = ();

BB> while ($line = <EXCLUDE>)
BB> {
BB> chomp($line);
BB> $exclude{$line} = 1;
BB> }

BB> close EXCLUDE;

use File::Slurp ;

my %exclude = map { chomp; $_ => 1 } read_file( $exc_file ) ;

BB> # Now go through input line-by-line comparing to hash and only
BB> # printing lines that do not match

BB> open (DATA, $file1);

don't use DATA for a file handle as it is the handle name for data in
the source file after the __END__ marker

BB> while ($line = <DATA>)
BB> {
BB> chomp($line);
BB> if(!exists($exclude{$line}))
BB> {
BB> print "$line\n";
BB> }

invert that logic for simpler code:

next if $exclude{ $line } ;
print "$line\n" ;

and if your bounce line file isn't that large (for some definition of
large) you can also slurp and filter it out too.

and since your bounce and exclude lines are all ending in newline, there
is no need to chomp in either case. it makes this much easier.

<untested entire main code>

my %exclude = map { $_ => 1 } read_file( $exc_file ) ;
print grep { !$exclude{ $_ } } read_file( $data_file ) ;

ain't perl cool! :)
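
Skipping chomp works only when every line, including the last one in each file, ends in a newline. The following is a sketch of the same hash-and-grep idea using core Perl only, with chomp kept so a final line without a newline still matches. The helper names read_lines and filter_lines are hypothetical, and the argument order follows the original usage line (master file first, exclude file second).

```perl
use strict;
use warnings;

# Hypothetical helper: read a file into a list of chomped lines,
# dying with the OS error if the file cannot be opened.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "could not open '$path': $!";
    chomp( my @lines = <$fh> );
    close $fh;
    return @lines;
}

# Hypothetical helper: return the lines of the data file that do not
# appear in the exclude file, preserving the data file's order.
sub filter_lines {
    my ($data_file, $exc_file) = @_;
    my %exclude = map { $_ => 1 } read_lines($exc_file);
    return grep { !$exclude{$_} } read_lines($data_file);
}

# Usage: filter <masterfile> <excludefile>
if (@ARGV == 2) {
    print "$_\n" for filter_lines(@ARGV);
}
```

Both files are held in memory here, which is fine for mailing lists of modest size; for a very large master file, reading it line by line as in the original script would be the safer choice.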

uri

--
Uri Guttman ------ uri@stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

sln@netherlands.com 08-23-2009 11:50 PM

Re: Possibly useful perl script to filter lines in one file out of another.
 
On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan <tadmc@seesig.invalid> wrote:

>Ben Burch <benburch@pobox.com> wrote:
>
>> #!/usr/bin/perl

>
> use warnings;
> use strict;
>
>> open (EXCLUDE, $file2);

>
>You should always, yes *always*, check the return value from open():
>
> open my $EXCLUDE, '<', $file2 or die "could not open '$file2' $!";
>
>> while ($line = <EXCLUDE>)

>
> while ($line = <$EXCLUDE>)
>
>> if(!exists($exclude{$line}))

>
> unless ( exists $exclude{$line} )
>
>or at least make wise use of whitespace and punctuation:
>
> if ( ! exists $exclude{$line} )


Hi Tad.

I've seen the advice to always check the return value of open
here on this NG, then die if not true?

Why die if open didn't die? What's the worst thing that can happen?
I think the worst thing is that a read or write doesn't happen.
It won't crash the system or mess up the file allocation tables.

It's funny, if you pass a failed open filehandle like
open my $fh, 'non-existant-file.txt'
to a read $fh,... the read passively fails. There is no
fatal error.

But if you pass an undefined filehandle to read, it
dies.

Something to consider, since a failed open does not really
cause problems, and apparently an undefined handle is
enough to cause a die from Perl's I/O functions (well, at least read).

So, why is it always, yes always, necessary to check the return
value from open() ?

-sln

Nathan Keel 08-24-2009 12:27 AM

Re: Possibly useful perl script to filter lines in one file out of another.
 
sln@netherlands.com wrote:

> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan
> <tadmc@seesig.invalid> wrote:
>
>>Ben Burch <benburch@pobox.com> wrote:
>>
>>> #!/usr/bin/perl

>>
>> use warnings;
>> use strict;
>>
>>> open (EXCLUDE, $file2);

>>
>>You should always, yes *always*, check the return value from open():
>>
>> open my $EXCLUDE, '<', $file2 or die "could not open '$file2' $!";
>>
>>> while ($line = <EXCLUDE>)

>>
>> while ($line = <$EXCLUDE>)
>>
>>> if(!exists($exclude{$line}))

>>
>> unless ( exists $exclude{$line} )
>>
>>or at least make wise use of whitespace and punctuation:
>>
>> if ( ! exists $exclude{$line} )

>
> Hi Tad.
>
> I've seen the advice to always check the return value of open
> here on this NG, then die if not true?
>
> Why die if open didn't die? What's the worst thing that can happen?
> I think the worst thing is that a read or write doesn't happen.
> It won't crash the system or mess up the file allocation tables.
>
> It's funny, if you pass a failed open filehandle like
> open my $fh, 'non-existant-file.txt'
> to a read $fh,... the read passively fails. There is no
> fatal error.
>
> But if you pass an undefined filehandle to read, it
> dies.
>
> Something to consider, since a failed open does not really
> cause problems, and apparently an undefined handle is
> enough to cause a die from Perl's I/O functions (well, at least read).
>
> So, why is it always, yes always, necessary to check the return
> value from open() ?
>
> -sln


If you want to open/read/write to a file, there's an intended reason.
It doesn't have to be a die, the point is to be aware of the problem
and have it output or log the problem, which helps troubleshoot
problems (and unintended bugs). He said to always check the return
value, he didn't say to always die. If you have a script that doesn't
need to open a file you told it to, why are you opening it?
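
As a rough illustration of checking without dying, the sketch below wraps open in a helper that logs the failure and lets the caller decide what to do next. The name try_open is hypothetical, and the line-counting loop is only a stand-in for whatever the script would really do with each file.

```perl
use strict;
use warnings;

# Hypothetical helper: try to open a file for reading. On failure it
# logs the reason with warn and returns nothing, leaving the caller
# to decide whether to skip the file, retry, or give up.
sub try_open {
    my ($path) = @_;
    if ( open my $fh, '<', $path ) {
        return $fh;
    }
    warn "could not open '$path': $!\n";
    return;
}

# Process every file named on the command line, skipping the ones
# that cannot be opened instead of stopping the whole run.
for my $path (@ARGV) {
    my $fh = try_open($path) or next;    # failure already logged
    my $count = 0;
    while ( my $line = <$fh> ) {
        $count++;
    }
    close $fh;
    print "$path: $count lines\n";
}
```

The return value is still checked on every open; only the reaction to a failure changes.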

sln@netherlands.com 08-24-2009 02:41 AM

Re: Possibly useful perl script to filter lines in one file out of another.
 
On Sun, 23 Aug 2009 17:27:13 -0700, Nathan Keel <nat.k@gm.ml> wrote:

>
>If you want to open/read/write to a file, there's an intended reason.
>It doesn't have to be a die, the point is to be aware of the problem
>and have it output or log the problem, which helps troubleshoot
>problems (and unintended bugs). He said to always check the return
>value, he didn't say to always die. If you have a script that doesn't
>need to open a file you told it to, why are you opening it?


I used to work for a company that did a lot of automation using Perl.
I was new to Perl and was hired for my C++ background, but
ended up having to do everything in Perl.
Looking back on it, their motto was: don't die on anything, do not stop
the automation.
The entire environment was dynamically generated. There was not a die
anywhere in any line of code. The check for existence is fine, but you
can't wrap all your other code in ifs all the time. Definitely logs though,
lots of them, on the chance something didn't work.

They could have used something like this, though they didn't have it.

use strict;
use warnings;

my ($buf,$length) = ('',5);

# Invoke error #1, NON-FATAL error on read.
# File doesn't exist; however, $fh is valid
open my $fh, '<', 'notexists.txt';

# Invoke error #2, FATAL error on read
#my $fh;

open STDERR, '>errors.txt';

{
    local $!;
    my $status = eval { read ($fh, $buf, $length) };
    $@ =~ s/\s+$//;
    if ($@ || (!$status && $!)) {
        print "Error in read: ". ($@ ? $@ : $! ). "\n";
    }
}

print "More code ...\n";

exit;

__END__


-sln

Tim McDaniel 08-24-2009 03:41 AM

Re: Possibly useful perl script to filter lines in one file out of another.
 
In article <slrnh94199.d8a.tadmc@tadmc30.sbcglobal.net>,
Tad J McClellan <tadmc@seesig.invalid> wrote:
>sln@netherlands.com <sln@netherlands.com> wrote:
>> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan <tadmc@seesig.invalid> wrote:

>
>>>You should always, yes *always*, check the return value from open():

>
>> I've seen the advice to always check the return value of open
>> here on this NG,

....
>> What's the worst thing that can happen?

>
>You silently get the wrong output and wonder what went wrong.


That's bad, but I think the worst is that you get the wrong output and
DON'T notice (and therefore don't wonder). Instead, you work with bad
or missing data.
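
For the mailing-list case in this thread, that failure mode can be sketched as follows. The helper filter_unchecked is hypothetical; it mirrors the original script's logic with the exclude-file open deliberately left unchecked and warnings switched off, as in the original. With warnings enabled, perl would at least report a readline() on a closed filehandle.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a three-address master list to a temporary file.
my ($out, $master) = tempfile( UNLINK => 1 );
print {$out} "alice\@example.com\nbob\@example.com\ncarol\@example.com\n";
close $out;

# Hypothetical helper mirroring the original script's logic, except
# that it returns the kept lines instead of printing them.
sub filter_unchecked {
    my ($data_file, $exc_file) = @_;
    no warnings;    # the original script ran without warnings enabled
    my %exclude;
    open my $exc, '<', $exc_file;    # return value deliberately ignored
    while ( my $line = <$exc> ) {
        chomp $line;
        $exclude{$line} = 1;
    }
    open my $data, '<', $data_file or die "could not open '$data_file': $!";
    my @kept;
    while ( my $line = <$data> ) {
        chomp $line;
        push @kept, $line unless exists $exclude{$line};
    }
    return @kept;
}

# A typo in the exclude file's name: the open fails, the hash stays
# empty, and every address, bounced or not, comes back as "kept".
my @kept = filter_unchecked($master, "$master.no-such-file");
print scalar(@kept), " addresses kept\n";    # prints "3 addresses kept"
```

Nothing in the output hints that the exclude list was never read, so the next mailing would go to every bounced address again.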

--
Tim McDaniel, tmcd@panix.com

sln@netherlands.com 08-24-2009 04:30 AM

Re: Possibly useful perl script to filter lines in one file out of another.
 
On Sun, 23 Aug 2009 22:22:19 -0500, Tad J McClellan <tadmc@seesig.invalid> wrote:

>sln@netherlands.com <sln@netherlands.com> wrote:
>> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan <tadmc@seesig.invalid> wrote:

>
>>>You should always, yes *always*, check the return value from open():

>
>> I've seen the advice to always check the return value of open
>> here on this NG,

>
>
>It's been said often enough.
>
>
>> then die if not true?

>
>
>No, then do "whatever is appropriate for your situation".
>
>The admonition is to check the return value.
>
>It is not to take any particular action if the check fails, though
>die is often used, as it is most often the appropriate action.
>
>(the purpose of most programs is to process a file's
> contents, so there is no point in continuing if such a program
> cannot read the file's contents.
>)
>
>> Why die if open didn't die?

>
>
>open() does not die, if it fails it fails silently (which is why
>you should always, yes *always*, check its return value).
>
>So I don't know what you mean.
>
>Show me some code where an open() dies...
>

Yeah, show me, so why die?

>
>> What's the worst thing that can happen?

>
>
>You silently get the wrong output and wonder what went wrong.
>
>
>> But if you pass an undefined filehandle to read, it
>> dies.

>
>
>No it doesn't.
>
>Show me a program where you pass an undefined filehandle to read
>and it dies...
>

use strict;
use warnings;

my ($buf,$length) = ('',5);

# Invoke error #2, FATAL error on read
my $fh;

{
    local $!;
    my $status = eval { read ($fh, $buf, $length) };
    $@ =~ s/\s+$//;
    if ($@ || (!$status && $!)) {
        print "Error in read: ". ($@ ? $@ : $! ). "\n";
    }
}

print "More code ...\n";

__END__
c:\temp>perl ss.pl
Error in read: Can't use an undefined value as a symbol reference at ss.pl line 13.
More code ...

c:\temp>

>
>> Something to consider since a failed open does not really
>> cause problems

>
>
>It does if the purpose of the program is to process that file.
>

Not if it's a juggernaut program that isn't allowed to die.
Aka automation.

>
>> So, why is it always, yes always, necessary to check the return
>> value from open() ?

>
>
>So that it will fail noisily rather than fail silently!


Fail all you want, but please don't die...
-sln

Nathan Keel 08-24-2009 04:36 AM

Re: Possibly useful perl script to filter lines in one file out of another.
 
sln@netherlands.com wrote:

>
> I used to work for a company that did a lot of automation using Perl.
> I was new to Perl and was hired for my C++ background, but
> ended up having to do everything in Perl.
> Looking back on it, their motto was: don't die on anything, do not stop
> the automation.
> The entire environment was dynamically generated. There was not a die
> anywhere in any line of code. The check for existence is fine, but you
> can't wrap all your other code in ifs all the time. Definitely logs
> though, lots of them, on the chance something didn't work.


I'm not suggesting any code needs to die. I'm also not suggesting every
read is vital and can't be ignored. Just for the record.

sln@netherlands.com 08-24-2009 06:11 AM

Re: Possibly useful perl script to filter lines in one file out of another.
 
On Mon, 24 Aug 2009 03:41:34 +0000 (UTC), tmcd@panix.com (Tim McDaniel) wrote:

>In article <slrnh94199.d8a.tadmc@tadmc30.sbcglobal.net>,
>Tad J McClellan <tadmc@seesig.invalid> wrote:
>>sln@netherlands.com <sln@netherlands.com> wrote:
>>> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan <tadmc@seesig.invalid> wrote:

>>
>>>>You should always, yes *always*, check the return value from open():

>>
>>> I've seen the advice to always check the return value of open
>>> here on this NG,

>...
>>> What's the worst thing that can happen?

>>
>>You silently get the wrong output and wonder what went wrong.

>
>That's bad, but I think the worst is that you get the wrong output and
>DON'T notice (and therefore don't wonder). Instead, you work with bad
>or missing data.


In reality, you should never need to check the return value from open().
If you can't program to that spec, you haven't been paid to program.
-sln

