Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > Finding 1000 largest numbers from a file having some billion numbers


Finding 1000 largest numbers from a file having some billion numbers

 
 
Subra
03-06-2007
Hi,

What is the best way to find the 1000 largest numbers in a file
with an enormous number of entries? Can you please help me find
the way? Do I need to go for B+ trees?

Please help,
Subramanya M

 
Richard Tobin
03-06-2007
In article <(E-Mail Removed)>,
Subra <(E-Mail Removed)> wrote:

> What is the best way to find the 1000 largest numbers in a file
> with an enormous number of entries? Can you please help me find
> the way? Do I need to go for B+ trees?


This is really just an algorithm question, nothing to do with C in
particular.

I can't see a better way to do it than keeping the 1000 largest so far
in a sorted list with some structure that allows you to quickly insert
a new item in the correct position and discard the old 1000th number.
A B+-tree has those properties.

Keep a note of the current 1000th value so that you can discard
smaller values without even looking at the tree. If you really have a
billion numbers, and they're in a random order, the vast majority will
be smaller than the current 1000th, so the efficiency of the structure
in which you keep the largest 1000 may be of little importance. You
might consider how the average number of insertions varies with the
number of items in the file.

-- Richard

--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
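Richard's scheme, the sorted list of the 1000 largest plus the cheap compare against the current 1000th, might be sketched in C roughly as follows. The `offer` name, `long` values, and binary-search insertion are illustrative assumptions, not anything from the thread:

```c
#include <string.h>

#define K 1000

/* top[] holds the K largest values seen so far, sorted ascending,
   so top[0] is always the current "1000th largest". */
static long top[K];
static int count = 0;

static void offer(long v)
{
    int lo, hi;

    if (count < K) {                  /* still filling: insertion sort */
        int i = count++;
        while (i > 0 && top[i - 1] > v) {
            top[i] = top[i - 1];
            i--;
        }
        top[i] = v;
        return;
    }
    if (v <= top[0])                  /* the cheap test: on random input,
                                         almost every value stops here */
        return;

    /* binary-search the insertion point, shift down, drop old minimum */
    lo = 1;
    hi = K;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (top[mid] < v)
            lo = mid + 1;
        else
            hi = mid;
    }
    memmove(&top[0], &top[1], (lo - 1) * sizeof top[0]);
    top[lo - 1] = v;
}
```

The shift inside `offer` is O(K) in the worst case, but as Richard notes, on random input the `v <= top[0]` test rejects nearly everything, so the insertion cost barely matters.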
 
Eric Sosman
03-06-2007
Subra wrote:
> Hi,
>
> What is the best way to find the 1000 largest numbers in a file
> with an enormous number of entries? Can you please help me find
> the way? Do I need to go for B+ trees?


I'd suggest using a heap, as in Heapsort.

--
Eric Sosman
 
bytebro
03-06-2007
On 6 Mar, 10:30, "Subra" <(E-Mail Removed)> wrote:
> What is the best way to find the 1000 largest numbers in a file
> with an enormous number of entries? Can you please help me find
> the way? Do I need to go for B+ trees?


sort file | tail -1000

 
Richard Tobin
03-06-2007
In article <(E-Mail Removed)>,
bytebro <(E-Mail Removed)> wrote:

>> What is the best way to find the 1000 largest numbers in a file
>> with an enormous number of entries? Can you please help me find
>> the way? Do I need to go for B+ trees?

> sort file | tail -1000


Apart from the fact that you at least need "sort -n", it's true that
that might be the best way if you only wanted to do it a small number
of times. If the number were increased from a billion to, say, a
hundred billion it would probably not be usable on most current machines.

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
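For what it's worth, the corrected one-liner in full (GNU coreutils assumed; `numbers.txt` is a stand-in name, and `seq`/`shuf` only fabricate sample data here):

```shell
# fabricate a shuffled sample file, then pull out the 1000 largest values
seq 1 5000 | shuf > numbers.txt
sort -n numbers.txt | tail -n 1000 > top1000.txt
```

`sort -n` sorts numerically ascending, so `tail` takes the largest 1000; without `-n`, "9" would sort after "1000".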
 
user923005
03-06-2007
On Mar 6, 2:30 am, "Subra" <(E-Mail Removed)> wrote:
> Hi,
>
> What is the best way to find the 1000 largest numbers in a file
> with an enormous number of entries? Can you please help me find
> the way? Do I need to go for B+ trees?
>
> Please help,
> Subramanya M


Your question is better suited to news:comp.programming.

The answer to your question is called Quickselect() and is described
in detail in:
T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
Algorithms, Cambridge: The MIT Press, 1990.

It's O(N). All the other solutions posed here are O(N*log(N)).
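For the curious, a bare-bones quickselect might look like the following. This is only a sketch (random pivots, `int` keys, Hoare-style partitioning), not the CLRS presentation itself; it rearranges an in-memory array so that the k largest values end up in the top k positions:

```c
#include <stdlib.h>

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Rearrange a[0..n-1] so that the k largest elements occupy
   a[n-k..n-1] (in no particular order).  Average O(n). */
static void select_top_k(int *a, int n, int k)
{
    int lo = 0, hi = n - 1, target = n - k; /* split index sought */

    while (lo < hi) {
        int pivot = a[lo + rand() % (hi - lo + 1)];
        int i = lo, j = hi;
        while (i <= j) {                    /* Hoare-style partition */
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) swap(&a[i++], &a[j--]);
        }
        if (target <= j)                    /* recurse into one side only */
            hi = j;
        else if (target >= i)
            lo = i;
        else
            return;                         /* target sits between j and i */
    }
}
```

The point Eric raises still stands: this version needs the whole array in memory, which is exactly what a billion-element file makes awkward.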

 
Eric Sosman
03-06-2007
user923005 wrote on 03/06/07 13:12:
> On Mar 6, 2:30 am, "Subra" <(E-Mail Removed)> wrote:
>
>> Hi,
>>
>> What is the best way to find the 1000 largest numbers in a file
>> with an enormous number of entries? Can you please help me find
>> the way? Do I need to go for B+ trees?
>>
>> Please help,
>> Subramanya M
>
> Your question is better suited to news:comp.programming.
>
> The answer to your question is called Quickselect() and is described
> in detail in:
> T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
> Algorithms, Cambridge: The MIT Press, 1990.


Observation #1: The "Quick Select" Google turns up
chooses *one* value from a set, not a thousand.

Observation #2: It also needs random access to the
entire set, which consists in this case of a billion
elements.

> It's O(N). All the other solutions posed here are O(N*log(N))


Observation #3: This is wrong. For (counter-)example,
I suggested using a heap, whose time complexity would be
O(N*log(1000))=O(N).

--
(E-Mail Removed)
 
user923005
03-06-2007
On Mar 6, 10:46 am, Eric Sosman <(E-Mail Removed)> wrote:
> user923005 wrote on 03/06/07 13:12:
> > On Mar 6, 2:30 am, "Subra" <(E-Mail Removed)> wrote:
> >
> >> Hi,
> >>
> >> What is the best way to find the 1000 largest numbers in a file
> >> with an enormous number of entries? Can you please help me find
> >> the way? Do I need to go for B+ trees?
> >>
> >> Please help,
> >> Subramanya M
> >
> > Your question is better suited to news:comp.programming.
> >
> > The answer to your question is called Quickselect() and is described
> > in detail in:
> > T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
> > Algorithms, Cambridge: The MIT Press, 1990.
>
> Observation #1: The "Quick Select" Google turns up
> chooses *one* value from a set, not a thousand.


Sorry, but Google is full of crap (I bet it was a Wikipedia article;
half the time those are bologna). True, one value is returned, but
the set also ends up partitioned with the top k elements together at
the top. That's how it finds the one it is looking for: by repeatedly
partitioning around random pivots.

> Observation #2: It also needs random access to the
> entire set, which consists in this case of a billion
> elements.


There you have me. You would have to be able to load all billion
keys. As an alternative, you could process the data a million rows
at a time: take the top 1000 from each of those one thousand chunks
and perform the algorithm on the million survivors. It's still O(N),
with a larger constant of proportionality.
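That chunked two-pass idea might be sketched as follows, with plain `qsort()` standing in for the selection step (the `CHUNK` size, `int` keys, and function names here are illustrative assumptions):

```c
#include <stdio.h>
#include <stdlib.h>

#define K      1000
#define CHUNK  1000000

static int desc(const void *p, const void *q)
{
    int a = *(const int *)p, b = *(const int *)q;
    return (a < b) - (a > b);             /* sort descending */
}

/* Read ints from fp one chunk at a time; store the K largest,
   descending, in out[].  Returns how many were found (fewer than
   K only for short inputs). */
static int top_k_chunked(FILE *fp, int *out)
{
    static int buf[CHUNK];
    static int surv[2 * K];               /* at most K old + K new */
    int nsurv = 0, n, i;

    for (;;) {
        n = 0;
        while (n < CHUNK && fscanf(fp, "%d", &buf[n]) == 1)
            n++;
        if (n == 0)
            break;
        qsort(buf, n, sizeof buf[0], desc);
        for (i = 0; i < n && i < K; i++)  /* chunk's top K survive */
            surv[nsurv++] = buf[i];
        qsort(surv, nsurv, sizeof surv[0], desc);
        if (nsurv > K)                    /* compact back to K */
            nsurv = K;
    }
    for (i = 0; i < nsurv; i++)
        out[i] = surv[i];
    return nsurv;
}
```

Only one chunk plus the small survivor array is ever in memory at once, which is the whole point of the two-pass arrangement.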

> > It's O(N). All the other solutions posed here are O(N*log(N))

>
> Observation #3: This is wrong. For (counter-)example,
> I suggested using a heap, whose time complexity would be
> O(N*log(1000))=O(N).


Using a heap is not O(n). If you consider the set a fixed size (e.g.
one billion) and the region of interest a fixed size (e.g. 1000),
then true, we can call it O(n). But in algorithmic terms both of
those things are variables. We could (in fact) bogosort the whole
mess and then pick the top 1000 items. Since 1e9 is a constant, and
if N is a constant then N! is also a constant, we could say the
whole thing is still O(1). Of course, it does not work like that.




 
CBFalconer
03-06-2007
Subra wrote:
>
> What is the best way to find the 1000 largest numbers in a file
> with an enormous number of entries? Can you please help me find
> the way? Do I need to go for B+ trees?


Look up heaps. All you need is an array[1000] and a few auxiliary
variables.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews
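The array[1000]-plus-a-few-variables approach CBFalconer alludes to is a bounded min-heap: the root is the smallest value kept, so most inputs cost a single comparison and only the rare keeper triggers a sift-down. A rough sketch, with the element type and names being assumptions:

```c
#define K 1000

static long heap[K];   /* min-heap: heap[0] is the smallest kept value */
static int nheap = 0;

static void sift_down(int i)
{
    for (;;) {
        int c = 2 * i + 1;                 /* left child */
        long t;
        if (c >= nheap)
            break;
        if (c + 1 < nheap && heap[c + 1] < heap[c])
            c++;                           /* pick the smaller child */
        if (heap[i] <= heap[c])
            break;
        t = heap[i]; heap[i] = heap[c]; heap[c] = t;
        i = c;
    }
}

static void offer(long v)
{
    if (nheap < K) {                       /* filling phase: sift up */
        int i = nheap++;
        heap[i] = v;
        while (i > 0 && heap[(i - 1) / 2] > heap[i]) {
            long t = heap[i];
            heap[i] = heap[(i - 1) / 2];
            heap[(i - 1) / 2] = t;
            i = (i - 1) / 2;
        }
    } else if (v > heap[0]) {              /* beats current minimum */
        heap[0] = v;
        sift_down(0);
    }
}
```

Each accepted value costs O(log 1000) sift work, so a full pass is O(N log 1000), i.e. O(N) for fixed K, which is the complexity Eric claimed upthread.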


 
user923005
03-06-2007
It turns out that the ideal solution to his problem is discussed in
this paper:

"Improved Algorithms and Analysis for Secretary Problems and
Generalizations"
Miklos Ajtai, Nimrod Megiddo, Orli Waarts
IEEE Symposium on Foundations of Computer Science

It deals not only with the selection issue, but also directly with the
problem of not being able to hold all of the items in memory.


 