Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > sorting just the largest values

Reply
Thread Tools

sorting just the largest values

 
 
Alex Hart
Guest
Posts: n/a
 
      01-27-2005
I have a very large list of data (up to 20,000 lines), and I just want
to print out the largest 50 values. What is the quickest way to sort
this using perl?

Thanks in advance.

- Alex Hart

 
Reply With Quote
 
 
 
 
Arndt Jonasson
Guest
Posts: n/a
 
      01-27-2005

"Alex Hart" <(E-Mail Removed)> writes:
> I have a very large list of data (up to 20,000 lines), and I just want
> to print out the largest 50 values. What is the quickest way to sort
> this using perl?


For very large lists of data: while reading in the lines, keep a
sorted array with the up to 50 largest seen values.

But 20000 is not very large, or even large. Read the whole file and
sort it.

If you say "sort" to search.cpan.org, it will show you some modules
that are likely to be useful.
 
Reply With Quote
 
 
 
 
Gregory Toomey
Guest
Posts: n/a
 
      01-27-2005
Alex Hart wrote:

> I have a very large list of data (up to 20,000 lines), and I just want
> to print out the largest 50 values. What is the quickest way to sort
> this using perl?
>
> Thanks in advance.
>
> - Alex Hart


Selection sort - just do the outer loop 50 times.
http://en.wikipedia.org/wiki/Selection_sort

The time complexity is O(n) (sorting top k items ;k<<n)
compared to O(n log n) for sorting the whole list.

gtoomey
 
Reply With Quote
 
John W. Krahn
Guest
Posts: n/a
 
      01-27-2005
Alex Hart wrote:
> I have a very large list of data (up to 20,000 lines), and I just want
> to print out the largest 50 values. What is the quickest way to sort
> this using perl?


my @largest = `sort -n yourfile | tail -50`;



John
--
use Perl;
program
fulfillment
 
Reply With Quote
 
Eric Bohlman
Guest
Posts: n/a
 
      01-27-2005
"Alex Hart" <(E-Mail Removed)> wrote in
news:(E-Mail Removed) oups.com:

> I have a very large list of data (up to 20,000 lines), and I just want
> to print out the largest 50 values. What is the quickest way to sort
> this using perl?


Most people wouldn't consider 20,000 items to be a "very large list."
I'd suggest first just trying the brute-force approach of reading them
into an array, sorting them, and taking the last 50 results. Only if
this turns out to be unacceptably slow or memory-consuming for your
application should you consider something different. Remember that
perl's sort routines are implemented in optimized C code, so even if you
implement a lower-time-complexity algorithm yourself, the value of N for
which it becomes faster may be quite large.

If the brute-force approach turns out to be unacceptable (again, based on
actual measurement, not guesswork), here's a linear-time (O(N)) selection
algorithm:

1) Read the first 50 values into the top 50 list.

2) For each subsequent value, find the smallest value in the top 50 list
that's less than it. If there isn't one, do nothing. If there is, kick
it out of the list and add the new value to the list.

Step 2 will be simpler to program if you sort the list once after step 1.
You'll find shift() and splice() particularly helpful.

[UNTESTED]

my @top;
while (<>) {
push @top,$_ if $.<=50;
@top=sort @top if $.=50;
if ($.>50) {
for (my $i=49;$i>=0;--$i) {
if ($top[$i] lt $_) {
# this is where the new value belongs
splice @top,$i,0,$_; # insert the new value
shift @top; # remove the smallest value
last;
}
}
}
}

 
Reply With Quote
 
Alex Hart
Guest
Posts: n/a
 
      01-27-2005
> Most people wouldn't consider 20,000 items to be a "very large list."
> I'd suggest first just trying the brute-force approach of reading

them
> into an array, sorting them, and taking the last 50 results. Only if


I'm developing a real-time application, and a particular sort was
taking several minutes when the list got up to 20,000. It might be a
problem with the data being too ordered to start with (or does per
shuffle before it sorts?). I already extract the sort key and just
sort on that. I can try the packed string sort-key, but I think even a
slow method of finding the top 50 must be faster than sorting the whole
list.

I'll try some of the suggestions here, and I'll be sure to benchmark
along the way.

 
Reply With Quote
 
xhoster@gmail.com
Guest
Posts: n/a
 
      01-27-2005
"Alex Hart" <(E-Mail Removed)> wrote:
> > Most people wouldn't consider 20,000 items to be a "very large list."
> > I'd suggest first just trying the brute-force approach of reading

> them
> > into an array, sorting them, and taking the last 50 results. Only if

>
> I'm developing a real-time application, and a particular sort was
> taking several minutes when the list got up to 20,000.


That is definitely pathological.

> It might be a
> problem with the data being too ordered to start with (or does per
> shuffle before it sorts?).


Depends on the version. Which version are you using?

> I'll try some of the suggestions here, and I'll be sure to benchmark
> along the way.


Unfortunately, benchmarking is extremely difficult in cases where there are
unknown and unpredictable pathological patterns in the data. Be careful.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
 
Reply With Quote
 
Anno Siegel
Guest
Posts: n/a
 
      01-27-2005
<(E-Mail Removed)> wrote in comp.lang.perl.misc:
> "Alex Hart" <(E-Mail Removed)> wrote:
> > > Most people wouldn't consider 20,000 items to be a "very large list."
> > > I'd suggest first just trying the brute-force approach of reading

> > them
> > > into an array, sorting them, and taking the last 50 results. Only if

> >
> > I'm developing a real-time application, and a particular sort was
> > taking several minutes when the list got up to 20,000.

>
> That is definitely pathological.


I agree. There's some gross inefficiency going on. We should see the
sort, it's certainly speed-up-able.

In general, the solution to the "bottom-n" problem is a heap. In pseudo-
code (using a mythical module Heap, but there are real ones on CPAN):

use Heap;
my $h = Heap->new();
my $n = 50;
for ( @data ) {
$h->insert( $_);
$h->extract_maximum if $h->size > $n;
}
my @top_n = map $h->extract_maximum, 1 .. $h->size;

If the heap implementation doesn't have a ->size method, it is easy to
use an external counter instead.

If $n is larger than the data size, the algorithm does a heap sort on
the data. If $n is smaller, it is faster than the equivalent sort.

Anno
 
Reply With Quote
 
David Combs
Guest
Posts: n/a
 
      02-20-2005
In article <ctb88g$ng0$(E-Mail Removed)-Berlin.DE>,
Anno Siegel <(E-Mail Removed)-berlin.de> wrote:
>...
>In general, the solution to the "bottom-n" problem is a heap. In pseudo-
>code (using a mythical module Heap, but there are real ones on CPAN):
>
> use Heap;
> my $h = Heap->new();
> my $n = 50;
> for ( @data ) {
> $h->insert( $_);
> $h->extract_maximum if $h->size > $n;
> }
> my @top_n = map $h->extract_maximum, 1 .. $h->size;
>
>If the heap implementation doesn't have a ->size method, it is easy to
>use an external counter instead.
>
>If $n is larger than the data size, the algorithm does a heap sort on
>the data. If $n is smaller, it is faster than the equivalent sort.
>
>Anno


Perhaps (not sure about this) it would be faster (fewer steps?)
to use one of the *newer* (than heapsort) priority-queue methods,
eg splay trees and the like?

(Or maybe no real difference?)

David


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Keeping track of the N largest values Roy Smith Python 11 12-31-2010 10:52 PM
Sorting list vs sorting vector boltar2003@boltar.world C++ 2 07-06-2010 09:40 AM
How to get a set of keys with largest values? Davy Python 7 11-13-2007 08:42 AM
Worlds Largest Photo and Worlds Largest Camera... Somebody Digital Photography 1 08-16-2007 02:51 AM
The smallest and largest values of numeric types tom@finland.com Python 9 04-18-2007 02:35 PM



Advertisments