Remove duplicate lines from array - Yes I checked before posting

 
 
phillyfan
Guest
Posts: n/a
 
      09-09-2005
I have a .csv file that I have pulled into an array. I have searched for a
way to remove duplicate lines from the array. I have tried a couple of
different coding techniques, but because they use the hash key/value
technique I end up removing lines I need. Here is a sample of my file.
The fields are class code, start time, end time, building number, days
of week, class title, professor ID, and professor name. They are comma
delimited in the .csv file.

ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
ACCT2102TS1 1040 1150 172 222 MWF Accounting
II 901063085 Arnold Schneider
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn

If I use:

#! /perl/bin/perl
use strict;
use warnings;
$| = 1;


my @bannerfile = ();
open(INTO, 'data-banner.csv') or die "Can't open data-banner.csv for reading: $!\n";
chomp(@bannerfile = <INTO>);
close(INTO) or die "Can't close data-banner.csv: $!\n";

my %seen = ();
my $item;


my @uniq = @bannerfile;
@uniq = do { my %seen; grep !$seen{$_}++, @uniq };

or

foreach $item(@bannerfile){
push(@uniq, $item) unless exists $seen{$item};}

What happens, as I am sure you already know, is that because the same
class code is found, the line is removed regardless of whether the
information after it is different. My goal is to strip the duplicate
records from the file. For example:
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
shows up twice. I want to keep just one instance of this record and also
keep
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
because it is a different record.
I hope I have made clear what I am trying to achieve. Thank you
for your help and tutelage.

 
 
 
 
 
John Bokma
Guest
Posts: n/a
 
      09-09-2005
"phillyfan" <(E-Mail Removed)> wrote:

> What happens, I am sure you already know is because the same classcode
> is found it is removed regardless if the information after itis
> different. My goal is to strip off the duplicate records that exist


Did you really test your code? It is *line* based, so

A12 foo
A12 bar

are seen as two different lines, *not* as duplicates, and both are kept.

#!/usr/bin/perl

use strict;
use warnings;

my $filename = 'data-banner.csv';

# Three-argument open, read-only
open my $fh, '<', $filename or
    die "Can't open '$filename' for reading: $!";

my %check;
my @lines;

while ( my $line = <$fh> ) {

    exists $check{ $line } and next;   # skip lines already seen

    $check{ $line } = 1;
    push @lines, $line; # keep original order
}

close $fh or die "Can't close '$filename' after reading: $!";

print @lines;

(untested)
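
The point about line-based comparison can be checked without the real file. The following is only a minimal sketch: the three input lines are made up for illustration (they reuse the "A12" example above), and the loop is the same exists-then-mark pattern as in the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: two lines share the code "A12", and one line repeats.
my @lines = ( "A12 foo\n", "A12 bar\n", "A12 foo\n" );

my %check;
my @kept;

for my $line (@lines) {
    next if exists $check{$line};   # the hash key is the entire line
    $check{$line} = 1;
    push @kept, $line;              # keep original order
}

print @kept;    # prints "A12 foo" and "A12 bar", one per line
```

Only the exact repeat of "A12 foo" is dropped. "A12 bar" survives because its full text differs, even though it starts with the same code.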


--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html

 
 
 
 
 
phillyfan
Guest
Posts: n/a
 
      09-09-2005
Yes, I did check the code, but I did not check my results thoroughly.
All three variations of the code worked. A sort helped me see the error
of my ways. Thank you for waking me up.

 
 
axel@white-eagle.invalid.uk
Guest
Posts: n/a
 
      09-09-2005
phillyfan <(E-Mail Removed)> wrote:
> I have an .csv file I have pulled into an array. I have searched for a
> way to remove duplicate lines from the array. I have used a couple of
> different coding techques but because they are use the hash key value
> technique I end up removing lines I need. Here is a sample of my file:
> The fields are Classcode, start time, end time, building number, days
> of week, class title, proff id, and professor name. They are comma
> delimited in the .csv file.
>
> ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
> ACCT2101TS1 920 1030 172 222 MWF Accounting
> I 901063085 Arnold Schneider
> ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
> ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
> ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
> ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
> ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
> ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
> ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
> ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
> ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
> ACCT2102TS1 1040 1150 172 222 MWF Accounting
> II 901063085 Arnold Schneider
> ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn


> If I use:


> #! /perl/bin/perl
> use strict;
> use warnings;
> $| = 1;



> my @bannerfile = ();
> open(INTO, 'data-banner.csv') or die "Can't open data-banner.csv for
> reading: $!\n";
> chomp(@bannerfile = <INTO>);
> close(INTO) or die "Can't close data-banner.csv: $!\n";


It would be better to read in the data line by line, for scalability.
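
One way to read line by line is to write each new line out as soon as it is seen, so the file is never held in an array. This is only a sketch under assumptions: `dedupe_stream` is a name invented here, and the input is an in-memory string standing in for data-banner.csv so that the example runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy lines from $in to $out, skipping any line already written.
# Memory use grows with the number of distinct lines, not with file size.
sub dedupe_stream {
    my ( $in, $out ) = @_;
    my %seen;
    while ( my $line = <$in> ) {
        print {$out} $line unless $seen{$line}++;
    }
    return;
}

# Hypothetical input held in a string instead of a real file.
my $data = "ACCT1 0900 Ely\nACCT1 1000 Dunn\nACCT1 0900 Ely\n";
open my $in, '<', \$data or die "Can't open in-memory handle: $!";
dedupe_stream( $in, \*STDOUT );
close $in or die "Can't close in-memory handle: $!";
```

The hash still has to remember every distinct line, so this helps most when the file contains many repeats.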

> my %seen = ();
> my $item;


$item should only be introduced when it is actually needed.

> [snip]


> or
>
> foreach $item(@bannerfile){
> push(@uniq, $item) unless exists $seen{$item};}


It will never be 'seen'... as you never mark it that way.

foreach my $item (@bannerfile) {
    push(@uniq, $item) unless exists $seen{$item};
    print "Yes" if $seen{$item};   # Diagnostic so you can see what happened
    print "No" if ! $seen{$item};  # Remove these after testing
    $seen{$item} = 1;
}

> What happens, I am sure you already know is because the same classcode
> is found it is removed regardless if the information after itis
> different. My goal is to strip off the duplicate records that exist


No, that is not what happened at all.
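
For contrast, the loss described in the quoted paragraph would only happen if the hash were keyed on the class code alone. The sketch below compares the two keys on three records from the posted sample. It makes two assumptions: the wrapped Schneider line is rejoined into one record, and the fields are split on whitespace as they appear in the post rather than on commas.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Three records from the sample (the wrapped line is rejoined here).
my @records = (
    'ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely',
    'ACCT2101TS1 920 1030 172 222 MWF Accounting I 901063085 Arnold Schneider',
    'ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely',
);

# Keyed on the whole line: only the exact repeat is dropped.
my %seen_line;
my @by_line = grep { !$seen_line{$_}++ } @records;

# Keyed on the first field only (hypothetical): every later record with
# the same class code is dropped, including the Schneider one.
my %seen_code;
my @by_code = grep {
    my ($code) = split ' ', $_;
    !$seen_code{$code}++;
} @records;

printf "whole line: %d kept\n", scalar @by_line;   # whole line: 2 kept
printf "class code: %d kept\n", scalar @by_code;   # class code: 1 kept
```

None of the code posted in this thread keys on the class code, which is why the different records were never at risk.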

Axel
 
 
Joe Smith
Guest
Posts: n/a
 
      09-10-2005
phillyfan wrote:
> my %seen = ();
> push(@uniq, $item) unless exists $seen{$item};}


push(@uniq, $item) unless $seen{$item}++;
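
The one-liner works because postfix `++` returns the value from before the increment, so the test and the marking happen in one expression. A small sketch on made-up items (the list is an assumption for illustration only):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical items; any list of strings would do.
my @items = qw(a b a c b);

my %seen;
my @uniq;

for my $item (@items) {
    # The first time, $seen{$item} is undef (false), so the item is pushed;
    # the ++ then sets the count to 1, which is true on every later visit.
    push( @uniq, $item ) unless $seen{$item}++;
}

print "@uniq\n";    # a b c
print "$_ seen $seen{$_} time(s)\n" for sort keys %seen;
```

As a side effect, %seen ends up holding a count of how often each item occurred.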

-Joe
 
 
William James
Guest
Posts: n/a
 
      09-10-2005
phillyfan wrote:
> I have an .csv file I have pulled into an array. I have searched for a
> way to remove duplicate lines from the array. I have used a couple of
> different coding techques but because they are use the hash key value
> technique I end up removing lines I need. Here is a sample of my file:
> The fields are Classcode, start time, end time, building number, days
> of week, class title, proff id, and professor name. They are comma
> delimited in the .csv file.
>
> ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
> ACCT2101TS1 920 1030 172 222 MWF Accounting
> I 901063085 Arnold Schneider
> ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
> ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
> ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
> ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
> ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
> ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
> ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
> ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
> ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
> ACCT2102TS1 1040 1150 172 222 MWF Accounting
> II 901063085 Arnold Schneider
> ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn


In Ruby:

array = DATA.read.split("\n")
puts array.size
puts array.uniq.size
puts array.uniq

__END__
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
ACCT2102TS1 1040 1150 172 222 MWF Accounting
II 901063085 Arnold Schneider
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn

Output:

15
11
ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely
ACCT2101TS1 920 1030 172 222 MWF Accounting
I 901063085 Arnold Schneider
ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn
ACCT2101TS3 1635 1755 172 300 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSA 1635 1755 172 200 TR Accounting I 900255352 Michael Kilgore
ACCT2101TSB 1105 1155 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2101TSC 1205 1255 172 200 MWF Accounting I 901063046 Deborah Turner
ACCT2102TS1 1305 1355 172 201 MWF Accounting II 901790899 Robert Dunn
ACCT2102TS1 1040 1150 172 222 MWF Accounting
II 901063085 Arnold Schneider
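
A rough Perl counterpart to the Ruby `uniq` call, for comparison. This sketch assumes List::Util 1.45 or later, which is where `uniq` was added as far as I know (it is bundled with Perl 5.26 and newer). On older installations the `grep !$seen{$_}++` idiom from earlier in the thread does the same job. The four input lines are taken from the sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(uniq);    # assumes List::Util >= 1.45

# Four lines from the sample, two of them repeats.
my @array = (
    'ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely',
    'ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn',
    'ACCT2101TS2 1005 1055 172 300 MWF Accounting I 901790899 Robert Dunn',
    'ACCT2101TS1 1305 1355 172 103 MWF Accounting I 901463900 Michael Ely',
);

# uniq compares elements as strings and keeps the first occurrence of each.
my @unique = uniq @array;

print scalar(@array),  "\n";    # 4
print scalar(@unique), "\n";    # 2
print "$_\n" for @unique;
```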

 