Velocity Reviews > Newsgroups > Programming > Ruby
[Q] removing array duplicates where a subset is unique

 
 
Chuck Remes      07-17-2009
I need to remove duplicates from an array of arrays. I can't use
Array#uniq because some fields are different and not part of the
"key." Here's an example where the first 3 elements of each sub array
are the "key" and determine uniqueness. I want to keep only the first
one I get.

>> a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

=> [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

The return value of deduplicating this array should be: [[1, 2, 3, 4,
5]]

Here is my first attempt at solving the problem:


>> def dedup ary
>>   ary.map do |line|
?>     dupes = ary.select { |row| row[0..2] == line[0..2] }
?>     dupes.first
>>   end.uniq
>> end
=> nil
>> dedup a
=> [[1, 2, 3, 4, 5]]

This works. However, it is *super slow* when operating on my dataset.
My arrays contain hundreds of thousands of sub arrays. The unique key
for each sub array is the first 12 elements. It is taking many
seconds to produce each intermediate array ("dupes" in the example
above), so deduping the entire thing would likely take days.

Anyone have a superior and faster solution?

cr


 
Joel VanderWerf      07-17-2009
Chuck Remes wrote:
> I need to remove duplicates from an array of arrays. I can't use
> Array#uniq because some fields are different and not part of the "key."
> Here's an example where the first 3 elements of each sub array are the
> "key" and determine uniqueness. I want to keep only the first one I get.
>
> >> a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

> => [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]
>
> The return value of deduplicating this array should be: [[1, 2, 3, 4, 5]]


Might be faster if your intermediate is a hash, so instead of N**2 time
it's N.

a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

h = {}
a.each do |row|
  h[row[0..2]] ||= row # record the first match
end

p h.values # ==> [[1, 2, 3, 4, 5]]


Note that the output may come out in a different order. Does that matter?

--
vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

 
David A. Black      07-17-2009
Hi --

On Sat, 18 Jul 2009, Chuck Remes wrote:

> I need to remove duplicates from an array of arrays. I can't use Array#uniq
> because some fields are different and not part of the "key." Here's an
> example where the first 3 elements of each sub array are the "key" and
> determine uniqueness. I want to keep only the first one I get.
>
>>> a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

> => [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]
>
> The return value of deduplicating this array should be: [[1, 2, 3, 4, 5]]
>
> Here is my first attempt at solving the problem:
>
>
>>> def dedup ary
>>> ary.map do |line|

> ?> dupes = ary.select { |row| row[0..2] == line[0..2] }
> ?> dupes.first
>>> end.uniq
>>> end

> => nil
>>>

> ?> dedup a
> => [[1, 2, 3, 4, 5]]
>
> This works. However, it is *super slow* when operating on my dataset. My
> arrays contain hundreds of thousands of sub arrays. The unique key for
> each sub array is the first 12 elements. It is taking many seconds to
> produce each intermediate array ("dupes" in the example above), so
> deduping the entire thing would likely take days.
>
> Anyone have a superior and faster solution?


See if this speeds it up meaningfully (and make sure I've got the
logic right):

def dedup(ary)
  uniq = {}
  ary.each do |line|
    uniq[line[0..2]] ||= line
  end
  uniq.values
end


David

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Now available: The Well-Grounded Rubyist (http://manning.com/black2)
Training! Intro to Ruby, with Black & Kastner, September 14-17
(More info: http://rubyurl.com/vmzN)

 
Bil Kleb      07-17-2009
David A. Black wrote:
> def dedup(ary)
>   uniq = {}
>   ary.each do |line|
>     uniq[line[0..2]] ||= line
>   end
>   uniq.values
> end


Sweet! (I love Ruby; thanks Matz.)

Regards,
--
Bil
 
Chuck Remes      07-17-2009

On Jul 17, 2009, at 5:39 PM, David A. Black wrote:

> Hi --
>
> On Sat, 18 Jul 2009, Chuck Remes wrote:
>
>> [...]
>>
>> Anyone have a superior and faster solution?
>
> See if this speeds it up meaningfully (and make sure I've got the
> logic right):
>
> def dedup(ary)
>   uniq = {}
>   ary.each do |line|
>     uniq[line[0..2]] ||= line
>   end
>   uniq.values
> end


David and Joel,

you both provided the same solution. I will test this to see what kind
of performance I get. It will be hell on memory, but I assumed any
solution likely would be. (And Joel, I have presorted the array prior
to removing the dupes so I have already taken care of the ordering
issue.)

Thanks for your suggestions.

cr


 
brabuhr@gmail.com      07-18-2009
On Fri, Jul 17, 2009 at 7:51 PM, Chuck Remes<(E-Mail Removed)> wrote:
> (And Joel, I have presorted the array prior to removing the
> dupes so I have already taken care of the ordering issue.)


I think what Joel was referring to was that in Ruby 1.8 a Hash doesn't
maintain insertion order when traversed (Ruby 1.9 does maintain
insertion order):

ruby 1.8.2 (2004-12-25) [powerpc-darwin8.0]:
irb(main):001:0> h = {}
=> {}
irb(main):002:0> 5.times{|n| h[n] = n}
=> 5
irb(main):003:0> h
=> {0=>0, 1=>1, 2=>2, 3=>3, 4=>4}
irb(main):004:0> h["sadf"] = 3
=> 3
irb(main):005:0> h
=> {0=>0, 1=>1, "sadf"=>3, 2=>2, 3=>3, 4=>4}

 
David A. Black      07-18-2009
Hi --

On Sat, 18 Jul 2009, Chuck Remes wrote:

>
> On Jul 17, 2009, at 5:39 PM, David A. Black wrote:
>> [...]
>
> David and Joel,
>
> you both provided the same solution. I will test this to see what kind of
> performance I get. It will be hell on memory, but I assumed any solution
> likely would be. (And Joel, I have presorted the array prior to removing the
> dupes so I have already taken care of the ordering issue.)


I believe the version you had originally, where you do a mapping of
the whole array, will typically use much more memory than the hash
version. Let's say your original array has 1000 inner arrays, with 10
that are considered unique. The mapping will be a new array, also of
1000 elements. The hash will have 10 key/value pairs -- thus much
smaller.
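To make that concrete, here is a small synthetic example (the 1000/10
numbers are just the hypothetical sizes from the paragraph above):

```ruby
# 1000 rows sharing only 10 distinct 3-element keys.
ary = Array.new(1000) { |i| [i % 10, 0, 0, i] }

# map-based version: the intermediate array has one entry per input row
mapped = ary.map do |line|
  ary.select { |row| row[0..2] == line[0..2] }.first
end

# hash-based version: one entry per distinct key
uniq = {}
ary.each { |line| uniq[line[0..2]] ||= line }

p mapped.length # 1000 elements before the final uniq
p uniq.size     # only 10 entries
```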


David

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Now available: The Well-Grounded Rubyist (http://manning.com/black2)
Training! Intro to Ruby, with Black & Kastner, September 14-17
(More info: http://rubyurl.com/vmzN)

 
David A. Black      07-18-2009
Hi --

On Sat, 18 Jul 2009, (E-Mail Removed) wrote:

> On Fri, Jul 17, 2009 at 7:51 PM, Chuck Remes<(E-Mail Removed)> wrote:
>> (And Joel, I have presorted the array prior to removing the
>> dupes so I have already taken care of the ordering issue.)

>
> I think what Joel was referring to was that in Ruby 1.8 a Hash doesn't
> maintain insertion order when traversed (Ruby 1.9 does maintain
> insertion order):
>
> ruby 1.8.2 (2004-12-25) [powerpc-darwin8.0]:
> irb(main):001:0> h = {}
> => {}
> irb(main):002:0> 5.times{|n| h[n] = n}
> => 5
> irb(main):003:0> h
> => {0=>0, 1=>1, 2=>2, 3=>3, 4=>4}
> irb(main):004:0> h["sadf"] = 3
> => 3
> irb(main):005:0> h
> => {0=>0, 1=>1, "sadf"=>3, 2=>2, 3=>3, 4=>4}


If you wanted to maintain order you could (for some but probably not
much performance penalty) do something like:

def dedup(ary)
  uniq = {}
  res = []
  ary.each do |line|
    key = line[0..2]
    next if uniq[key]
    res << (uniq[key] = line)
  end
  res
end
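For example, run against the sample array from the top of the thread,
this keeps first-wins semantics and the original input order:

```ruby
def dedup(ary)
  uniq = {}          # tracks which 3-element keys have been seen
  res = []           # accumulates rows in input order
  ary.each do |line|
    key = line[0..2]
    next if uniq[key]           # skip later rows with a seen key
    res << (uniq[key] = line)   # record first row for this key
  end
  res
end

a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]
p dedup(a) # => [[1, 2, 3, 4, 5]]
```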


David

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Now available: The Well-Grounded Rubyist (http://manning.com/black2)
Training! Intro to Ruby, with Black & Kastner, September 14-17
(More info: http://rubyurl.com/vmzN)

 
Chuck Remes      07-18-2009

On Jul 17, 2009, at 7:55 PM, David A. Black wrote:

> Hi --
>
> On Sat, 18 Jul 2009, Chuck Remes wrote:
>> David and Joel,
>>
>> you both provided the same solution. I will test this to see what
>> kind of performance I get. It will be hell on memory, but I assumed
>> any solution likely would be. (And Joel, I have presorted the array
>> prior to removing the dupes so I have already taken care of the
>> ordering issue.)

>
> I believe the version you had originally, where you do a mapping of
> the whole array, will typically use much more memory than the hash
> version. Let's say your original array has 1000 inner arrays, with 10
> that are considered unique. The mapping will be a new array, also of
> 1000 elements. The hash will have 10 key/value pairs -- thus much
> smaller.


Oh yes, my version had terrible execution and memory performance. I
was trying to figure out how to use a hash but did not make the leap
to the ||= construction on my own. I knew I was missing something
obvious... all of your rapid responses proved it.

FYI, the dedup code you provided performs quite admirably. I'll take a
look at its memory footprint when I get in the office Monday and
report back.

cr


 
s.ross      07-19-2009
Hi---

On Jul 17, 2009, at 6:14 PM, David A. Black wrote:

> [...]
>
> If you wanted to maintain order you could (for some but probably not
> much performance penalty) do something like:
>
> def dedup(ary)
>   uniq = {}
>   res = []
>   ary.each do |line|
>     key = line[0..2]
>     next if uniq[key]
>     res << (uniq[key] = line)
>   end
>   res
> end
>
>
> David


I missed the beginning of this thread, but here is an implementation
I've used successfully:

def uniq_by(subject, &block)
  h = {}
  a = []
  subject.each do |s|
    comparator = yield(s)
    unless h[comparator]
      a.push(s)
      h[comparator] = s
    end
  end
  a
end

Usage:

u = uniq_by(ary) { |item| item.element }

Basically, what this allows you to do is specify what exactly about an
array item must be unique. It also preserves the original array order,
with a "first entry wins" approach to duplicate elimination.
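With the sample data from this thread, where the key is the first
three elements, that looks like:

```ruby
# Keep the first element seen for each value returned by the block.
def uniq_by(subject)
  h = {}   # comparator => first matching element
  a = []   # results, in original order
  subject.each do |s|
    comparator = yield(s)
    unless h[comparator]
      a.push(s)
      h[comparator] = s
    end
  end
  a
end

rows = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]
p uniq_by(rows) { |item| item[0..2] } # => [[1, 2, 3, 4, 5]]
```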

Hope this is useful.

 