dup remove - why/how does this work - NEWBIE

 
 
jason@cyberpine.com
 
      02-16-2004
The below simple code works at removing dups from a 20k record file.
Looking for somebody to explain how/why.

$db = "workb.txt";
open (FILE,"$db");
@lines=<FILE>;
close(FILE);
foreach $key (@lines){
$lines{$key} = 1;
}
@lines = keys(%lines);
print @lines;


I understand I am adding a key = 1 to every line (is it to every
line?), but when we recreate @lines what exactly is keys(%lines)
doing/saying? (I see that %lines contains 1+unique records in the
file).

Thanks.
 
 
Tony Curtis
 
      02-16-2004
>> On 16 Feb 2004 13:44:10 -0800,
>> said:


> The below simple code works at removing dups from a 20k
> record file. Looking for somebody to explain how/why.


It's not even close, I'm afraid.

No strict, warnings.

> $db = "workb.txt";
> open (FILE,"$db");


open() untested. Unnecessary quotes around variable.

> @lines=<FILE>;
> close(FILE);


This slurps all lines into memory, and the loop below then makes a
second pass over them. This is wasteful; you only need to see each
line once.

You'll probably want to chomp() the lines too, since the
trailing newline sequence is usually part of the file
representation, not part of the data content per se.

> foreach $key (@lines){
> $lines{$key} = 1;
> }
> @lines = keys(%lines);
> print @lines;


> I understand I am adding a key = 1 to every line (is it to
> every line?), but when we recreate @lines what exactly is


"Adding" is a misleading word here, implying that the value of
the line is being changed. "Associating" would be closer.

> keys(%lines) doing/saying? I see that %lines contains
> 1+unique records in the file).


Using a hash is the right choice here, but see

perldoc -q duplicate

Essentially you want to, for each line, output the line only
if you haven't seen that same line before (i.e. it's not yet a
key of the hash). Output means either print() or save into an
array for later processing, judging from your code.
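
A minimal sketch of that single-pass approach, with strict and
warnings enabled and open() tested. The sub name unique_lines and the
sample data are made up for illustration, and an in-memory string
stands in for workb.txt so the example runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the lines read from $fh, keeping only the
# first occurrence of each one, in their original order.
sub unique_lines {
    my ($fh) = @_;
    my (%seen, @unique);
    while (my $line = <$fh>) {
        next if $seen{$line};     # already a key of the hash: skip it
        $seen{$line} = 1;         # remember this line
        push @unique, $line;      # or print it straight away
    }
    return @unique;
}

# Sample data standing in for workb.txt (an assumption for the demo).
my $data = "apple\nbanana\napple\ncherry\nbanana\n";

# Always test open(); here it opens an in-memory string instead of a file.
open(my $fh, '<', \$data) or die "Cannot open sample data: $!";
my @unique = unique_lines($fh);
close($fh);

print @unique;    # prints apple, banana, cherry, one per line
```

With a real file, the open line would become something like
open(my $fh, '<', $db) or die "Cannot open $db: $!";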

hth
t
 
 
gnari
 
      02-16-2004
<> wrote in message
news: om...
> The below simple code works at removing dups from a 20k record file.
> Looking for somebody to explain how/why.
>
> $db = "workb.txt";
> open (FILE,"$db");
> @lines=<FILE>;
> close(FILE);
> foreach $key (@lines){
> $lines{$key} = 1;
> }
> @lines = keys(%lines);
> print @lines;
>
>
> I understand I am adding a key = 1 to every line (is it to every
> line?), but when we recreate @lines what exactly is keys(%lines)
> doing/saying? I see that %lines contains 1+unique records in the
> file).


this is a common technique using a hash.

a hash is a data structure that maps a set of 'keys' to their
respective 'values'. each key has one value.

in this case the hash is %lines (totally unrelated to the array @lines).
each line of the input file is in turn added as a key to the hash, with
an arbitrary value, in this case 1. as each key can only have 1 value,
when a duplicate is encountered, the value is simply replaced with
the new value, in this case the same value 1.

the function keys() returns a list of the keys of a hash in an
undefined order. in this case, the lines of the input file, with
duplicates removed.
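
a small sketch of the same idea on made-up data (the colour names are
only an illustration, not the poster's file), showing that repeated
keys collapse into one hash entry:

```perl
use strict;
use warnings;

my @lines = ("red\n", "green\n", "red\n", "blue\n", "green\n");

my %lines;                      # a separate variable from @lines
foreach my $key (@lines) {
    $lines{$key} = 1;           # a repeated key just overwrites its value
}

my @unique = keys(%lines);      # one entry per distinct line, in no particular order

print scalar(@lines), " lines in, ", scalar(@unique), " unique lines out\n";
print sort(@unique);            # sorted only to make the output predictable
```

the sort is there because keys() makes no promise about order; the
original script prints the lines in whatever order the hash returns.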

the nice integration of hashes into the language is one of the
distinctive features of Perl, and they are, along with regexes,
usually the key to solving most perl problems.

perldoc perldata

gnari
 
Ben Morrow
 
      02-16-2004

Tony Curtis <tony_curtis32@_SPAMTRAP_yahoo.com> wrote:
> >> On 16 Feb 2004 13:44:10 -0800,
> >> said:

>
> > The below simple code works at removing dups from a 20k
> > record file. Looking for somebody to explain how/why.

>
> It's not even close, I'm afraid.


Well, it solves the problem asked. Yes, it has problems, but...

> You'll probably want to chomp() the lines too, since the
> trailing newline sequence is usually part of the file
> representation, not part of the data content per se.


In this case it isn't necessary: the lines are being compared for
uniqueness, so the line with the $/ on the end is just as good as
without. Think before you say things like this.

> > foreach $key (@lines){
> > $lines{$key} = 1;
> > }
> > @lines = keys(%lines);
> > print @lines;

>
> > I understand I am adding a key = 1 to every line (is it to
> > every line?), but when we recreate @lines what exactly is

>
> "Adding" is a misleading word here, implying that the value of
> the line is being changed. "Associating" would be closer.


Indeed. The important point, though, is that each key can only go into
the hash once.

> > keys(%lines) doing/saying? I see that %lines contains
> > 1+unique records in the file).

>
> Using a hash is the right choice here, but see
>
> perldoc -q duplicate
>
> Essentially you want to, for each line, output the line only
> if you haven't seen that same line before (i.e. it's not the
> key of a hash).


Yes, another WTDI would be to print the lines as you go along: this is
more parsimonious, and outputs the lines in the original order.

while (<F>) {
    print unless $lines{$_};    # print a line only the first time it is seen
    $lines{$_} = 1;             # record the line as seen
}

This doesn't mean that the script as given is wrong, however.

Ben

--
$.=1;*g=sub{print@_};sub r($$\$){my($w,$x,$y)=@_;for(keys%$x){/main/&&next;*p=$
$x{$_};/(\w)::$/&&(r($w.$1,$x.$_,$y),next);$y eq\$p&&&g("$w$_")}};sub t{for(@_)
{$f&&($_||&g(" "));$f=1;r"","::",$_;$_&&&g(chr(0012))}};t #
$J::u::t, $a::n:::t::h::e::r, $P::e::r::l, $h::a::c::k::e::r, $.
 
 
Tony Curtis
 
      02-16-2004
>> On Mon, 16 Feb 2004 23:49:10 +0000 (UTC),
>> Ben Morrow <> said:


>> Me:
>> You'll probably want to chomp() the lines too, since the
>> trailing newline sequence is usually part of the file
>> representation, not part of the data content per se.


> In this case it isn't necessary: the lines are being
> compared for uniqueness, so the line with the $/ on the end
> is just as good as without. Think before you say things like
> this.


Oh, I thought about it.

The OP posted similar code before that did something slightly
different. It all depends on what is meant to happen later;
this small example is almost certainly not the full story.
Which is why I qualified the suggestion ("probably").

For myself, I'd rather lose the newline as it's read; this way
I have a canonicalised internal representation of my data
immediately. The newline is a sequence that serves to
separate individual data units in a serialisation of the data,
so away it goes.
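
A rough sketch of that habit, assuming the data is only printed back
out afterwards. The sample string is invented and deliberately ends
without a trailing newline, which is one case where chomp() changes
which lines count as duplicates.

```perl
use strict;
use warnings;

# Sample input whose last line has no trailing newline (an assumption
# chosen to show a case where chomp() changes the result).
my $data = "alpha\nbeta\nalpha";

open(my $fh, '<', \$data) or die "Cannot open sample data: $!";

my (%seen, @records);
while (my $line = <$fh>) {
    chomp $line;                         # drop the record separator on input
    push @records, $line unless $seen{$line}++;
}
close($fh);

print "$_\n" for @records;               # put the separator back on output
```

Without the chomp, "alpha\n" and the final "alpha" would be two
different hash keys, so both would be kept.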



 
 
Eric Bohlman
 
      02-17-2004
Tony Curtis <tony_curtis32@_SPAMTRAP_yahoo.com> wrote in
news::

> For myself, I'd rather lose the newline as it's read; this way
> I have a canonicalised internal representation of my data
> immediately. The newline is a sequence that serves to
> separate individual data units in a serialisation of the data,
> so away it goes.


Except the only thing the OP needed to do with the data was print (part of)
it out again, which means he'd just have to put the newlines back anyway.
IOW, he's not working with his lines as abstract data, just as pure
representations of the serialized form.
 
 
Tony Curtis
 
      02-17-2004
>> On 17 Feb 2004 00:33:45 GMT,
>> Eric Bohlman <> said:


> Tony Curtis <tony_curtis32@_SPAMTRAP_yahoo.com> wrote in
> news::


>> For myself, I'd rather lose the newline as it's read; this
>> way I have a canonicalised internal representation of my
>> data immediately. The newline is a sequence that serves to
>> separate individual data units in a serialisation of the
>> data, so away it goes.


> Except the only thing the OP needed to do with the data was
> print (part of) it out again, which means he'd just have to
> put the newlines back anyway. IOW, he's not working with
> his lines as abstract data, just as pure representations of
> the serialized form.


Possibly. But we don't know for sure, do we?

Do it or don't do it; whichever is best for the situation...
 
 
Mina Naguib
 
      02-17-2004
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Ben Morrow wrote:
> while (<F>) {
> print unless $lines{$_};
> $lines{$_} = 1;
> }


This is not for clarity-seekers (or for learning good coding
standards), but the whole script can be reduced to:

#!/usr/bin/perl -n

print unless $seen{$_}++;

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFAMZ4ueS99pGMif6wRAk7AAKD0qZKmQLr0/9ovvsXFG9YQRU2iNwCghRBg
X7eM2zh8SnOjedrZd/7erIE=
=zdHW
-----END PGP SIGNATURE-----
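
For readers wondering why the one-liner above works, here is a
spelled-out sketch of the same test on made-up lines. The variable
names $count_before and @out are only for illustration; the point is
that post-increment hands back the old count, which is false only the
first time a line is met.

```perl
use strict;
use warnings;

my %seen;
my @out;
for my $line ("x\n", "y\n", "x\n", "x\n") {
    my $count_before = $seen{$line}++;      # post-increment returns the old value
    push @out, $line unless $count_before;  # undef (false) the first time only
}

print @out;                                 # x and y, in first-seen order

my $times = $seen{"x\n"};
print "x was seen $times times\n";          # the hash ends up holding counts
```

The same thing can be run from the shell as
perl -ne 'print unless $seen{$_}++' workb.txt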
 