Possibly useful perl script to filter lines in one file out of another.

 
 
Ben Burch
08-23-2009
Hi!

I needed to take the email addresses that bounced out of an original
mailing list. grep -v -f was far too slow, and comm produced unexpected
results, and so I just wrote something to do it in perl. Thought this
might be useful to somebody else:

#!/usr/bin/perl
#
# filter $file1 $file2
#
# Filters all lines in file1 against lines in file2, copying only lines
# from file 1 not found in file2 to STDOUT
#

# get arguments

my $file1 = shift;
my $file2 = shift;

if(!defined($file1) || !defined($file2))
{
    print "\nError, must have two arguments.\n";
    print "filter <masterfile> <excludefile>\n";
    exit 1;
}

# Copy all lines from file2 into a hash

open (EXCLUDE, $file2);

my %exclude = ();

while ($line = <EXCLUDE>)
{
    chomp($line);
    $exclude{$line} = 1;
}

close EXCLUDE;

# Now go through input line-by-line comparing to hash and only
# printing lines that do not match

open (DATA, $file1);

while ($line = <DATA>)
{
    chomp($line);
    if(!exists($exclude{$line}))
    {
        print "$line\n";
    }
}

close DATA;

exit 0;
 
Tim McDaniel
08-23-2009
In article <230820091346263431%(E-Mail Removed)>,
Ben Burch <(E-Mail Removed)> wrote:
>I needed to take the email addresses that bounced out of an original
>mailing list. grep -v -f was far too slow, and comm produced
>unexpected results, and so I just wrote something to do it in perl.


comm requires that both input files be sorted -- presumably in byte
value order rather than by dictionary order. When comm bites me, it's
because I've forgotten that.
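
To see why unsorted input trips comm up, here is a minimal Perl sketch of the lock-step walk that a comm-style comparison performs. The helper name only_in_first and the sample addresses are made up for illustration. It assumes neither list contains duplicate lines, and it uses Perl's default string comparison, which may not match the collation order comm expects in a given locale.

```perl
use strict;
use warnings;

# Hypothetical helper: roughly what `comm -23` reports (lines only in the
# first list). Both array refs must already be sorted in the order that
# `lt` and `eq` use, and are assumed to contain no duplicate lines.
sub only_in_first {
    my ( $first, $second ) = @_;
    my @out;
    my $j = 0;
    for my $line (@$first) {
        # Advance past entries in the second list that sort before this line.
        $j++ while $j < @$second && $second->[$j] lt $line;
        # Keep the line unless the second list has the same line here.
        push @out, $line unless $j < @$second && $second->[$j] eq $line;
    }
    return @out;
}

# Sample data (made up). Plain sort() compares strings with cmp.
my @master  = sort qw(carol@example.com alice@example.com bob@example.com);
my @bounced = sort qw(bob@example.com);
print "$_\n" for only_in_first( \@master, \@bounced );
```

The walk never looks backwards. If the second list were the unsorted pair (zed@example.com, bob@example.com), it would stall on zed and pass bob through, which is the kind of unexpected result comm gives on unsorted files.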

--
Tim McDaniel, (E-Mail Removed)
 
Uri Guttman
08-23-2009
>>>>> "BB" == Ben Burch <(E-Mail Removed)> writes:

BB> I needed to take the email addresses that bounced out of an original
BB> mailing list. grep -v -f was far too slow, and comm produced unexpected
BB> results, and so I just wrote something to do it in perl. Thought this
BB> might be useful to somebody else;

i find it hard to believe that grep -v -f is slower than perl. did you
benchmark the final results?

BB> #!/usr/bin/perl
BB> #

no warnings or strict. use them.
BB> # get arguments

BB> my $file1 = shift;
BB> my $file2 = shift;

BB> if(!defined($file1) || !defined($file2))
BB> {
BB> print "\nError, must have two arguments.\n";
BB> print "filter <masterfile> <excludefile>\n";
BB> exit 1;
BB> }

much simpler and slightly more accurate is to check @ARGV if it has 2
elements:

unless( @ARGV == 2 ) {
    die 'blah' ;
}

and use better names than file1 and file2. they are files of different data

my( $exc_file, $data_file ) = @ARGV ;
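
A minimal sketch of that argument check with a real usage message in place of 'blah'. The helper name parse_args and the sample file names are hypothetical, and the argument order here follows the usage line in the original script (master file first, exclude file second).

```perl
use strict;
use warnings;

# Hypothetical helper: insist on exactly two arguments and give them
# descriptive names. Unlike the defined() checks, this also rejects
# a third, stray argument.
sub parse_args {
    my @args = @_;
    die "usage: filter <masterfile> <excludefile>\n" unless @args == 2;
    my ( $data_file, $exc_file ) = @args;
    return ( $data_file, $exc_file );
}

# In a real script this would be parse_args(@ARGV); the file names
# below are made up for the example.
my ( $data_file, $exc_file ) = parse_args( 'list.txt', 'bounced.txt' );
print "filtering $data_file against $exc_file\n";
```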

BB> # Copy all lines from file2 into a hash

BB> open (EXCLUDE, $file2);

BB> my %exclude = ();

BB> while ($line = <EXCLUDE>)
BB> {
BB> chomp($line);
BB> $exclude{$line} = 1;
BB> }

BB> close EXCLUDE;

use File::Slurp ;

my %exclude = map { chomp; $_ => 1 } read_file( $exc_file ) ;

BB> # Now go through input line-by-line comparing to hash and only
BB> # printing lines that do not match

BB> open (DATA, $file1);

don't use DATA for a file handle as it is the handle name for data in
the source file after the __END__ or __DATA__ marker

BB> while ($line = <DATA>)
BB> {
BB> chomp($line);
BB> if(!exists($exclude{$line}))
BB> {
BB> print "$line\n";
BB> }

invert that logic for simpler code:

next if $exclude{ $line } ;
print "$line\n" ;

and if your bounce line file isn't that large (for some definition of
large) you can also slurp and filter it out too.

and since your bounce and exclude lines are all ending in newline, there
is no need to chomp in either case. it makes this much easier.

<untested entire main code>

my %exclude = map { $_ => 1 } read_file( $exc_file ) ;
print grep { !$exclude{ $_ } } read_file( $data_file ) ;

ain't perl cool!
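
For anyone without File::Slurp installed, here is a core-only sketch of the same filter. The helper name filter_lines is made up. It keeps the chomp, because dropping it relies on every line ending in a newline, including the last line of each file, and that is an assumption about the data rather than something Perl guarantees.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of $data_file that do not appear
# in $exc_file. Core Perl only, with both open() calls checked.
sub filter_lines {
    my ( $data_file, $exc_file ) = @_;

    open my $exc_fh, '<', $exc_file
        or die "could not open '$exc_file': $!\n";
    my %exclude;
    while ( my $line = <$exc_fh> ) {
        chomp $line;    # so a final line without "\n" still matches
        $exclude{$line} = 1;
    }
    close $exc_fh;

    open my $data_fh, '<', $data_file
        or die "could not open '$data_file': $!\n";
    my @kept;
    while ( my $line = <$data_fh> ) {
        chomp $line;
        push @kept, $line unless $exclude{$line};
    }
    close $data_fh;

    return @kept;
}

# Run as: perl filter.pl <masterfile> <excludefile>
if ( @ARGV == 2 ) {
    print "$_\n" for filter_lines(@ARGV);
}
```

Collecting the kept lines in an array makes the helper easy to test, at the cost of holding them in memory. For a very large master file, printing inside the second loop, as the original script does, avoids that.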

uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
 
sln@netherlands.com
08-23-2009
On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan <(E-Mail Removed)> wrote:

>Ben Burch <(E-Mail Removed)> wrote:
>
>> #!/usr/bin/perl

>
> use warnings;
> use strict;
>
>> open (EXCLUDE, $file2);

>
>You should always, yes *always*, check the return value from open():
>
> open my $EXCLUDE, '<', $file2 or die "could not open '$file2' $!";
>
>> while ($line = <EXCLUDE>)

>
> while ($line = <$EXCLUDE>)
>
>> if(!exists($exclude{$line}))

>
> unless ( exists $exclude{$line} )
>
>or at least make wise use of whitespace and punctuation:
>
> if ( ! exists $exclude{$line} )


Hi Tad.

I've seen that always check the return value of open
here on this NG, then die if not true?

Why die if open didn't die? What's the worst thing that can happen?
I think the worst thing is that a read or write doesn't happen.
It won't crash the system or mess up the file allocation tables.

It's funny, if you pass a filehandle from a failed open, like
open my $fh, 'non-existant-file.txt'
to a read $fh,... the read passively fails. There is no
fatal error.

But if you pass an undefined filehandle to read, it
dies.

Something to consider since a failed open does not really
cause problems, and apparently an undefined handle is
enough to cause a die from Perl's I/O functions (well, at least read).

So, why is it always, yes always, necessary to check the return
value from open() ?

-sln
 
Nathan Keel
08-24-2009
(E-Mail Removed) wrote:

> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan
> <(E-Mail Removed)> wrote:
>
>>Ben Burch <(E-Mail Removed)> wrote:
>>
>>> #!/usr/bin/perl

>>
>> use warnings;
>> use strict;
>>
>>> open (EXCLUDE, $file2);

>>
>>You should always, yes *always*, check the return value from open():
>>
>> open my $EXCLUDE, '<', $file2 or die "could not open '$file2' $!";
>>
>>> while ($line = <EXCLUDE>)

>>
>> while ($line = <$EXCLUDE>)
>>
>>> if(!exists($exclude{$line}))

>>
>> unless ( exists $exclude{$line} )
>>
>>or at least make wise use of whitespace and punctuation:
>>
>> if ( ! exists $exclude{$line} )

>
> Hi Tad.
>
> I've seen that always check the return value of open
> here on this NG, then die if not true?
>
> Why die if open didn't die? What's the worst thing that can happen?
> I think the worst thing is that a read or write doesn't happen.
> It won't crash the system or mess up the file allocation tables.
>
> It's funny, if you pass a filehandle from a failed open, like
> open my $fh, 'non-existant-file.txt'
> to a read $fh,... the read passively fails. There is no
> fatal error.
>
> But if you pass an undefined filehandle to read, it
> dies.
>
> Something to consider since a failed open does not really
> cause problems, and apparently an undefined handle is
> enough to cause a die from Perl's I/O functions (well, at least read).
>
> So, why is it always, yes always, necessary to check the return
> value from open() ?
>
> -sln


If you want to open/read/write to a file, there's an intended reason.
It doesn't have to be a die, the point is to be aware of the problem
and have it output or log the problem, which helps troubleshoot
problems (and unintended bugs). He said to always check the return
value, he didn't say to always die. If you have a script that doesn't
need to open a file you told it to, why are you opening it?
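
A minimal sketch of checking the return value without dying, assuming the caller is happy to skip a file it cannot read. The helper name read_lines_or_log is hypothetical; notexists.txt is just a file name that is assumed not to exist.

```perl
use strict;
use warnings;

# Hypothetical helper: return an array ref of chomped lines, or undef
# (after logging to STDERR) when the file cannot be opened.
sub read_lines_or_log {
    my ($path) = @_;
    open my $fh, '<', $path or do {
        warn "could not open '$path': $!\n";    # log it, do not die
        return;
    };
    chomp( my @lines = <$fh> );
    close $fh;
    return \@lines;
}

my $lines = read_lines_or_log('notexists.txt');
if ( defined $lines ) {
    printf "read %d lines\n", scalar @$lines;
}
else {
    print "skipping this file, carrying on with the rest of the job\n";
}
```

Returning undef rather than an empty list lets the caller tell "file missing" apart from "file empty", which matters if an empty exclude list is a legitimate input.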
 
sln@netherlands.com
08-24-2009
On Sun, 23 Aug 2009 17:27:13 -0700, Nathan Keel <(E-Mail Removed)> wrote:

>(E-Mail Removed) wrote:
>
>> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan
>> <(E-Mail Removed)> wrote:
>>
>>>Ben Burch <(E-Mail Removed)> wrote:
>>>
>>>> #!/usr/bin/perl
>>>
>>> use warnings;
>>> use strict;
>>>
>>>> open (EXCLUDE, $file2);
>>>
>>>You should always, yes *always*, check the return value from open():
>>>
>>> open my $EXCLUDE, '<', $file2 or die "could not open '$file2' $!";
>>>
>>>> while ($line = <EXCLUDE>)
>>>
>>> while ($line = <$EXCLUDE>)
>>>
>>>> if(!exists($exclude{$line}))
>>>
>>> unless ( exists $exclude{$line} )
>>>
>>>or at least make wise use of whitespace and punctuation:
>>>
>>> if ( ! exists $exclude{$line} )

>>
>> Hi Tad.
>>
>> I've seen that always check the return value of open
>> here on this NG, then die if not true?
>>
>> Why die if open didn't die? What's the worst thing that can happen?
>> I think the worst thing is that a read or write doesn't happen.
>> It won't crash the system or mess up the file allocation tables.
>>
>> It's funny, if you pass a filehandle from a failed open, like
>> open my $fh, 'non-existant-file.txt'
>> to a read $fh,... the read passively fails. There is no
>> fatal error.
>>
>> But if you pass an undefined filehandle to read, it
>> dies.
>>
>> Something to consider since a failed open does not really
>> cause problems, and apparently an undefined handle is
>> enough to cause a die from Perl's I/O functions (well, at least read).
>>
>> So, why is it always, yes always, necessary to check the return
>> value from open() ?
>>
>> -sln

>
>If you want to open/read/write to a file, there's an intended reason.
>It doesn't have to be a die, the point is to be aware of the problem
>and have it output or log the problem, which helps troubleshoot
>problems (and unintended bugs). He said to always check the return
>value, he didn't say to always die. If you have a script that doesn't
>need to open a file you told it to, why are you opening it?


I used to work for a company that did a lot of automation using Perl.
I was new to Perl and was hired because of my C++ background, but
ended up having to do everything in Perl.
Looking back on it, their motto was: don't die on anything, do not stop
the automation.
The entire environment was dynamically generated. There was not a die
anywhere in any line of code. The check for existence is fine, but you
can't wrap all your other code in ifs all the time. Definitely logs though,
lots of them, on the chance something didn't work.

They could have used something like this, though they didn't have it.

use strict;
use warnings;

my ($buf,$length) = ('',5);

# Invoke error #1, NON-FATAL error on read.
# File doesn't exist; however, $fh is valid
open my $fh, '<', 'notexists.txt';

# Invoke error #2, FATAL error on read
#my $fh;

open STDERR, '>errors.txt';

{
    local $!;
    my $status = eval { read ($fh, $buf, $length) };
    $@ =~ s/\s+$//;
    if ($@ || (!$status && $!)) {
        print "Error in read: ". ($@ ? $@ : $! ). "\n";
    }
}

print "More code ...\n";

exit;

__END__


-sln
 
Tim McDaniel
08-24-2009
In article <(E-Mail Removed)>,
Tad J McClellan <(E-Mail Removed)> wrote:
>(E-Mail Removed) <(E-Mail Removed)> wrote:
>> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan <(E-Mail Removed)> wrote:

>
>>>You should always, yes *always*, check the return value from open():

>
>> I've seen that always check the return value of open
>> here on this NG,

....
>> What's the worst thing that can happen?

>
>You silently get the wrong output and wonder what went wrong.


That's bad, but I think the worst is that you get the wrong output and
DON'T notice (and therefore don't wonder). Instead, you work with bad
or missing data.
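
Applied to the filter script in this thread, the silent failure looks something like the sketch below. The helper name load_exclude_unchecked, the file name and the addresses are made up, and the io warnings are switched off inside the helper on purpose, to mimic the original script, which ran without warnings. With them on, Perl should at least complain about reading from a closed filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: build the exclude hash the way the original script
# does, WITHOUT checking whether open() succeeded.
sub load_exclude_unchecked {
    my ($path) = @_;
    no warnings 'io';    # mimic a script that runs without warnings
    my %exclude;
    open my $fh, '<', $path;    # return value deliberately ignored
    while ( my $line = <$fh> ) {    # reads nothing if the open failed
        chomp $line;
        $exclude{$line} = 1;
    }
    return %exclude;
}

# The exclude file name is assumed not to exist.
my %exclude = load_exclude_unchecked('no-such-exclude-file.txt');
my @master  = qw(alice@example.com bob@example.com);
my @kept    = grep { !$exclude{$_} } @master;
print scalar(@kept), " of ", scalar(@master), " addresses kept\n";
```

With a mistyped exclude file name the hash stays empty, so every address is kept, bounced ones included, and the exit status still says success. Nothing in the output hints that the filtering never happened.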

--
Tim McDaniel, (E-Mail Removed)
 
sln@netherlands.com
08-24-2009
On Sun, 23 Aug 2009 22:22:19 -0500, Tad J McClellan <(E-Mail Removed)> wrote:

>(E-Mail Removed) <(E-Mail Removed)> wrote:
>> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan <(E-Mail Removed)> wrote:

>
>>>You should always, yes *always*, check the return value from open():

>
>> I've seen that always check the return value of open
>> here on this NG,

>
>
>It's been said often enough.
>
>
>> then die if not true?

>
>
>No, then do "whatever is appropriate for your situation".
>
>The admonition is to check the return value.
>
>It is not to take any particular action if the check fails, though
>die is often used, as it is most often the appropriate action.
>
>(the purpose of most programs is to process a file's
> contents, so there is no point in continuing if such a program
> cannot read the file's contents.
>)
>
>> Why die if open didn't die?

>
>
>open() does not die, if it fails it fails silently (which is why
>you should always, yes *always*, check its return value).
>
>So I don't know what you mean.
>
>Show me some code where an open() dies...
>

Yeah, show me, so why die?

>
>> What's the worst thing that can happen?

>
>
>You silently get the wrong output and wonder what went wrong.
>
>
>> But if you pass an undefined filehandle to read, it
>> dies.

>
>
>No it doesn't.
>
>Show me a program where you pass an undefined filehandle to read
>and it dies...
>

use strict;
use warnings;

my ($buf,$length) = ('',5);

# Invoke error #2, FATAL error on read
my $fh;

{
    local $!;
    my $status = eval { read ($fh, $buf, $length) };
    $@ =~ s/\s+$//;
    if ($@ || (!$status && $!)) {
        print "Error in read: ". ($@ ? $@ : $! ). "\n";
    }
}

print "More code ...\n";

__END__
c:\temp>perl ss.pl
Error in read: Can't use an undefined value as a symbol reference at ss.pl line 13.
More code ...

c:\temp>

>
>> Something to consider since a failed open does not really
>> cause problems

>
>
>It does if the purpose of the program is to process that file.
>

Not if it's a juggernaut program that isn't allowed to die.
AKA automation.

>
>> So, why is it always, yes always, necessary to check the return
>> value from open() ?

>
>
>So that it will fail noisily rather than fail silently!


Fail all you want, but please don't die...
-sln
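
The two positions in this exchange can coexist. One possible sketch, not anything posted in the thread, uses the autodie pragma (shipped with Perl since 5.10.1) to make open() die on failure, and an eval block to trap the exception so the automation keeps running. The helper name try_count_lines is hypothetical, and notexists.txt is assumed not to exist.

```perl
use strict;
use warnings;

# Hypothetical helper: run one step of a job, trapping any exception so
# the caller keeps going. Returns 1 on success, 0 on failure.
sub try_count_lines {
    my ($path) = @_;
    my $ok = eval {
        use autodie qw(open close);    # lexically scoped to this block
        open my $fh, '<', $path;       # dies here if the file is missing
        my $count = () = <$fh>;        # count lines via list assignment
        close $fh;
        print "$path: $count lines\n";
        1;
    };
    warn "step failed, continuing: $@\n" unless $ok;
    return $ok ? 1 : 0;
}

try_count_lines('notexists.txt');
print "More code ...\n";
```

The failure is noisy, because it lands on STDERR or in whatever log STDERR is redirected to, but it is not fatal to the program as a whole.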
 
Nathan Keel
08-24-2009
(E-Mail Removed) wrote:

>
> I used to work for a company that did a lot of automation using Perl.
> I was new to Perl and was hired because of my C++ background, but
> ended up having to do everything in Perl.
> Looking back on it, their motto was: don't die on anything, do not stop
> the automation.
> The entire environment was dynamically generated. There was not a die
> anywhere in any line of code. The check for existence is fine, but you
> can't wrap all your other code in ifs all the time. Definitely logs
> though, lots of them, on the chance something didn't work.


I'm not suggesting any code needs to die. I'm also not suggesting every
read is vital and can't be ignored. Just for the record.
 
sln@netherlands.com
08-24-2009
On Mon, 24 Aug 2009 03:41:34 +0000 (UTC), (E-Mail Removed) (Tim McDaniel) wrote:

>In article <(E-Mail Removed)>,
>Tad J McClellan <(E-Mail Removed)> wrote:
>>(E-Mail Removed) <(E-Mail Removed)> wrote:
>>> On Sun, 23 Aug 2009 14:37:45 -0500, Tad J McClellan <(E-Mail Removed)> wrote:

>>
>>>>You should always, yes *always*, check the return value from open():

>>
>>> I've seen that always check the return value of open
>>> here on this NG,

>...
>>> What's the worst thing that can happen?

>>
>>You silently get the wrong output and wonder what went wrong.

>
>That's bad, but I think the worst is that you get the wrong output and
>DON'T notice (and therefore don't wonder). Instead, you work with bad
>or missing data.


In reality, you should never need to check the return value from open().
If you can't program to that spec, you haven't been paid to program.
-sln
 