Velocity Reviews > Finding 1000 largest numbers from a file having some billion numbers

# Finding 1000 largest numbers from a file having some billion numbers

Subra
Guest
Posts: n/a

 03-06-2007
Hi,

What is the best way to find the 1000 largest numbers from the file
way ? Do I need to go for B+ trees ??

Subramanya M

Richard Tobin
Guest
Posts: n/a

 03-06-2007
In article <(E-Mail Removed) om>,
Subra <(E-Mail Removed)> wrote:

> What is the best way to find the 1000 largest numbers from the file
>way ? Do I need to go for B+ trees ??

This is really just an algorithm question, nothing to do with C in
particular.

I can't see a better way to do it than keeping the 1000 largest so far
in a sorted list with some structure that allows you to quickly insert
a new item in the correct position and discard the old 1000th number.
A B+-tree has those properties.

Keep a note of the current 1000th value so that you can discard
smaller values without even looking at the tree. If you really have a
billion numbers, and they're in a random order, the vast majority will
be smaller than the currernt 1000th, so the efficiency of the structure
in which you keep the largest 1000 may be of little importance. You
might consider how the average number of insertions varies with the
number of items in the file.

-- Richard

--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.

Eric Sosman
Guest
Posts: n/a

 03-06-2007
Subra wrote:
> Hi,
>
> What is the best way to find the 1000 largest numbers from the file
> way ? Do I need to go for B+ trees ??

I'd suggest using a heap, as in Heapsort.

--
Eric Sosman
http://www.velocityreviews.com/forums/(E-Mail Removed)lid

bytebro
Guest
Posts: n/a

 03-06-2007
On 6 Mar, 10:30, "Subra" <(E-Mail Removed)> wrote:
> What is the best way to find the 1000 largest numbers from the file
> way ? Do I need to go for B+ trees ??

sort file | tail -1000

Richard Tobin
Guest
Posts: n/a

 03-06-2007
In article <(E-Mail Removed). com>,
bytebro <(E-Mail Removed)> wrote:

>> What is the best way to find the 1000 largest numbers from the file
>> way ? Do I need to go for B+ trees ??

>sort file | tail -1000

Apart from the fact that you at least need "sort -n", it's true that
that might be the best way if you only wanted to do it a small number
of times. If the number were increased from a billion to, say, a
hundred billion it would probably not be usable on most current machines.

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.

user923005
Guest
Posts: n/a

 03-06-2007
On Mar 6, 2:30 am, "Subra" <(E-Mail Removed)> wrote:
> Hi,
>
> What is the best way to find the 1000 largest numbers from the file
> way ? Do I need to go for B+ trees ??
>
> Subramanya M

Your question is better suited to news:comp.programming.

in detail in:
T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
Algorithms, Cambridge: The MIT Press, 1990.

It's O(N). All the other solutions posed here are O(N*log(N))

Eric Sosman
Guest
Posts: n/a

 03-06-2007
user923005 wrote On 03/06/07 13:12,:
> On Mar 6, 2:30 am, "Subra" <(E-Mail Removed)> wrote:
>
>>Hi,
>>
>> What is the best way to find the 1000 largest numbers from the file
>>way ? Do I need to go for B+ trees ??
>>
>>Subramanya M

>
>
> Your question is better suited to news:comp.programming.
>
> The answer to your question is called Quickselect() and is described
> in detail in:
> T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
> Algorithms, Cambridge: The MIT Press, 1990.

Observation #1: The "Quick Select" Google turns up
chooses *one* value from a set, not a thousand.

entire set, which consists in this case of a billion
elements.

> It's O(N). All the other solutions posed here are O(N*log(N))

Observation #3: This is wrong. For (counter-)example,
I suggested using a heap, whose time complexity would be
O(N*log(1000))=O(N).

--
(E-Mail Removed)

user923005
Guest
Posts: n/a

 03-06-2007
On Mar 6, 10:46 am, Eric Sosman <(E-Mail Removed)> wrote:
> user923005 wrote On 03/06/07 13:12,:
>
>
>
> > On Mar 6, 2:30 am, "Subra" <(E-Mail Removed)> wrote:

>
> >>Hi,

>
> >> What is the best way to find the 1000 largest numbers from the file
> >>way ? Do I need to go for B+ trees ??

>
> >>Subramanya M

>
> > Your question is better suited to news:comp.programming.

>
> > The answer to your question is called Quickselect() and is described
> > in detail in:
> > T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
> > Algorithms, Cambridge: The MIT Press, 1990.

>
> Observation #1: The "Quick Select" Google turns up
> chooses *one* value from a set, not a thousand.

Sorry, but google is full of crap (I bet it was a Wikipedia article
which half the time are bologna). True, one value is returned, but
the set is partitioned with the top k elements at the top of the set.
That's how it finds the one it is looking for, by partitioning random
partitions.

> entire set, which consists in this case of a billion
> elements.

There you have me. You would have to be able to load all billion
keys. As an alternative, you could process the data one million rows
at a time. Take the top 1000 from those one thousand sets and perform
the algorithm on that. It's still O(1), with a larger constant of
proportionality/

> > It's O(N). All the other solutions posed here are O(N*log(N))

>
> Observation #3: This is wrong. For (counter-)example,
> I suggested using a heap, whose time complexity would be
> O(N*log(1000))=O(N).

Using a heap is not O(n). If you consider the set a fixed size (e.g.
one billion) and you consider the region of interest a fixed size
(e.g. 1000) then true, we can call it O(n). But in algorithmic terms
both of those things are variables. We could (in fact) bogosort the
whole mess and then pick the top 1000 items. Since 1e9 is a constant,
and if N is a constant, then N! is a constant -- we could say that the
whole thing is still O(1). Of course, it does not work like that.

> --
> (E-Mail Removed)

CBFalconer
Guest
Posts: n/a

 03-06-2007
Subra wrote:
>
> What is the best way to find the 1000 largest numbers from the file
> way ? Do I need to go for B+ trees ??

Look up heaps. All you need is an array[1000] and a few auxiliary
variables.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discover of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews

user923005
Guest
Posts: n/a

 03-06-2007
It turns out that the ideal solution to his problem is discussed in
this paper:

"Improved Algorithms and Analysis for Secretary Problems and
Generalizations"
Miklos Ajtai, Nimrod Megiddo, Orli Waarts
IEEE Symposium on Foundations of Computer Science

It deals not only with the selection issue, but also directly with the
problem of not being able to hold all of the items in memory.