Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C++ > Selecting Container for "Top 20" Application

Reply
Thread Tools

Selecting Container for "Top 20" Application

 
 
Mike Copeland
Guest
Posts: n/a
 
      08-17-2013
I am looking for an STL container that will efficiently handle a "Top
20" list. Specifically, I have >300,000 data records that I must scan
to find and store the highest 20 values.
None of the available STL containers seems well suited for this
application. For example:
- array, even though fixed in size, seems cumbersome to use: no delete
or erase function is available; insertion at some point is difficult;
etc. Replacement of the "end value" and sorting might work, but it's
tedious and slow...
- vector, while having insert & erase functions, incurs repeated size
expansion (expensive!) and truncation to maintain a fixed size.
Appending a new value if it's greater than the 20th element, followed by
sorting and truncation might be less work, but it might be very slow to
execute.
- queue/deque, set, and stack seem inappropriate for this application,
and the others (map, list, etc.) are completely unusable here.
Am I missing something? Is/are there reasonably efficient ways to
use array or vector that are appropriate? TIA
 
Reply With Quote
 
 
 
 
Stefan Ram
Guest
Posts: n/a
 
      08-17-2013
http://www.velocityreviews.com/forums/(E-Mail Removed) (Mike Copeland) writes:
>Am I missing something? Is/are there reasonably efficient ways to


Interfaces. Do not code to an implementation. Code to an interface.

By ğinterfaceĞ I mean: (abstract) pure-virtual (base) class.

Write that interface to contain the functions you dream of
(the perfect container of your dreams, that is perfectly
suited to your needs)- as if they were already implemented.
Do not think about how to implement them at this time.

Next, implement (using a derived class with implementations
for the function signatures of the interface) them in a way
that optimizes not run-time, but your development time. If
you believe in testing or documentation, at this point, you
now might write documentation or tests for these classes.

Now, you can write your actual program using the dream
container.

Only when you then should observe that it is too slow, you
then can derive another class from the interface with a more
run-time efficient implementation.

 
Reply With Quote
 
 
 
 
Jorgen Grahn
Guest
Posts: n/a
 
      08-17-2013
On Sat, 2013-08-17, Stefan Ram wrote:
> (E-Mail Removed) (Mike Copeland) writes:
>>Am I missing something? Is/are there reasonably efficient ways to

>
> Interfaces. Do not code to an implementation. Code to an interface.


Are you sure you're replying to the right posting? I cannot see how
run-time polymorphism would help Mr Copeland -- and I cannot see that
he mistakenly tries to code to an implementation.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
 
Reply With Quote
 
Jorgen Grahn
Guest
Posts: n/a
 
      08-17-2013
On Sat, 2013-08-17, Mike Copeland wrote:
> I am looking for an STL container that will efficiently handle a "Top
> 20" list. Specifically, I have >300,000 data records that I must scan
> to find and store the highest 20 values.


Do you have all 300K records in a container already for some other
purpose? Or do you e.g. read them from file and really only want the
top 20?

Does ">300,000" mean "slightly more than that" or "the program should
deal gracefully with more entries than will fit into memory"? Because
that's feasible, and sometimes desirable.

In either case, it's not a matter of finding a container, but finding
a suitable algorithm+container pair.

Personally I think I'd settle for:
- accumulating into a std::set<Record>
- trim the set as soon as it grows to size 21
- if the set is at size 20, never try to insert a
record less than the current min (pointless)

That can handle any input data set size. The only drawback I can see
is it's slow if the input is already sorted. (Depending on what you
want it for, that can be a showstopper.)

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
 
Reply With Quote
 
Carlo Milanesi
Guest
Posts: n/a
 
      08-17-2013
Mike Copeland wrote:
> I am looking for an STL container that will efficiently handle a "Top
> 20" list. Specifically, I have >300,000 data records that I must scan
> to find and store the highest 20 values.
> None of the available STL containers seems well suited for this
> application.


In the standard library there are several sorting algorithms.
Look here for what is more appropriate to you:
http://en.wikibooks.org/wiki/Optimiz...artial_sorting

If your data is in the vector v, use:
std::nth_element(v.begin(), v.begin() + 20, v.end());
to get in the first 20 places the smallest 20 items of the vector,
unsorted; use the slower:
std:artial_sort(v.begin(), v.begin() + 20, v.end());
if you want them sorted.

--

Carlo Milanesi
http://carlomilanesi.wordpress.com/
 
Reply With Quote
 
kfrank29.c@gmail.com
Guest
Posts: n/a
 
      08-17-2013
Hello Mike!

On Saturday, August 17, 2013 3:07:18 PM UTC-4, Mike Copeland wrote:
> I am looking for an STL container that will efficiently handle a "Top
> 20" list. Specifically, I have >300,000 data records that I must scan
> to find and store the highest 20 values.


To clarify, let me add some details I am assuming about
the problem you are trying to solve.

You are reading 300,000 records sequentially from some
"external" source (e.g., a file or database), but not
storing them. The records, as they are read in, are
unordered.

You need to find and store in memory the top 20, as given
by some sort order.

Please correct me if my assumptions are wrong, especially
assuming that you don't need to store the entire 300,000
records.

For the sake of simplicity, I'll assume that all 300,000
records are distinct, and that your sort order is a "total"
order, that is, that for any two (distinct) records, either
A < B or B < A.

> None of the available STL containers seems well suited for this
> application.


I think std::set would work for this. Note, std::set
is explicitly a sorted container (unlike the new
std::unordered_set, a hash table.).

> For example:
> - array, even though fixed in size, seems cumbersome to use: no delete
> or erase function is available; insertion at some point is difficult;
> etc. Replacement of the "end value" and sorting might work, but it's
> tedious and slow...


Agreed. std::array doesn't sound appropriate.

> - vector,


From the perspective of your problem, array and vector are
basically the same, so I don't think vector is appropriate.

> while having insert & erase functions, incurs repeated size
> expansion (expensive!) and truncation to maintain a fixed size.


Some more comments:

But if you only need to store a maximum of 20 elements, the
size expansions won't matter. (And even if it did, you could
reserve the needed capacity.)

> Appending a new value if it's greater than the 20th element, followed by
> sorting and truncation might be less work, but it might be very slow to
> execute.


Agreed. Resorting upon insertion sounds suboptimal.

> - queue/deque, set, and stack seem inappropriate for this application,


Queue and stack are container adapters that live on top
of, for example, a std::deque. They and deque don't give
efficient insertion into the middle, and suffer the same
problems as array and vector. (Plus queue and stack hide
part of the interface provided by vector and deque that
you would want.)

But, as I mentioned above, std::set is different.

> and the others (map, list, etc.) are completely unusable here.


Well, I wouldn't say "completely unusuable," just
not really appropriate. You could use list, similar
to vector (with similar disadvantages), and you could
use map as if it were a set, simply ignoring the value
component of the key / value pair.

> Am I missing something? Is/are there reasonably efficient ways to
> use array or vector that are appropriate? TIA


I don't see a good way to use array or vector, but
std:set makes sense to me (given my assumptions,
above), as follows:

Loop over your records:

Record r = ...;
if (myTopTwentySet.size() < 20) myToTwentySet.insert (r);
else {
if (r > *myTopTwentySet.begin()) {
myTopTwentySet.erase (myTopTwentySet.begin());
myToTwentySet.insert (r);
}
}

When you're done looping over your records, myTopTwentySet
will contain the 20 largest records (assuming you had
at least 20 records), and, if you care, they will be in
sorted order, that is, myTopTwentySet.begin() will point
to the twentieth largest record, and myTopTwentySet.rbegin()
will point to the largest.

Note that almost all of the time you will only be doing
the test:

if (r > *myTopTwentySet.begin()) {...}

which will not be true, so the details of inserting and
resorting are not really that important as long as you
have a cheap way of keeping track of the smallest of your
current top twenty. For this, std::set is convenient, and
probably nearly optimal, but you won't lose that much by
"rolling your own" (out of array or vector or what-not).

The basic analysis says 300,000 times you test your current
record against your current twentieth best -- cheap, and
(except in perverse cases) -- I'm guessing here -- something
like 20 + log (300,000) times you do something more expensive,
namely replace / resort.


Good luck.


K. Frank
 
Reply With Quote
 
Stefan Ram
Guest
Posts: n/a
 
      08-17-2013
Jorgen Grahn <(E-Mail Removed)> writes:
>On Sat, 2013-08-17, Stefan Ram wrote:
>>Interfaces. Do not code to an implementation. Code to an interface.

>Are you sure you're replying to the right posting? I cannot see how
>run-time polymorphism would help Mr Copeland -- and I cannot see that
>he mistakenly tries to code to an implementation.


You are right: The run-time part was not necessary. I translated
from Java too literally. Compile-time polymorphism is sufficient here!

Otherwise, I had the impression that the OP worried about
efficiency too early, because he worried that ::std::vector
might be slow. So I stand by the rest of my post.

 
Reply With Quote
 
Jorgen Grahn
Guest
Posts: n/a
 
      08-17-2013
On Sat, 2013-08-17, Stefan Ram wrote:
> Jorgen Grahn <(E-Mail Removed)> writes:
>>On Sat, 2013-08-17, Stefan Ram wrote:
>>>Interfaces. Do not code to an implementation. Code to an interface.

>>Are you sure you're replying to the right posting? I cannot see how
>>run-time polymorphism would help Mr Copeland -- and I cannot see that
>>he mistakenly tries to code to an implementation.

>
> You are right: The run-time part was not necessary. I translated
> from Java too literally. Compile-time polymorphism is sufficient here!
>
> Otherwise, I had the impression that the OP worried about
> efficiency too early, because he worried that ::std::vector
> might be slow. So I stand by the rest of my post.


Yes, and I agree with that part. Too much worrying, and too little
experimentation.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
 
Reply With Quote
 
K. Frank
Guest
Posts: n/a
 
      08-17-2013
Hi Robert!

On Saturday, August 17, 2013 5:42:42 PM UTC-4, (E-Mail Removed) wrote:
> On Sat, 17 Aug 2013 12:07:18 -0700,(Mike Copeland)
> wrote:
>
> > I am looking for an STL container that will efficiently handle a "Top
> >20" list. Specifically, I have >300,000 data records that I must scan

> ...
> > - queue/deque, set, and stack seem inappropriate for this application,
> >and the others (map, list, etc.) are completely unusable here.
> > Am I missing something? Is/are there reasonably efficient ways to
> >use array or vector that are appropriate? TIA

>
> What am I missing here? A priority queue is the obvious structure.


I think std:riority_queue works. It's also a sorted
container (technically a container adapter), which is
the main point. The only disadvantage I can see depends
on what you want to do with the top 20 after you have
them. The priority_queue hides some of the interface of
its underlying container, and that might be inconvenient.


Thanks.


K. Frank
 
Reply With Quote
 
K. Frank
Guest
Posts: n/a
 
      08-17-2013
Hello Sam!

On Saturday, August 17, 2013 5:57:07 PM UTC-4, Sam wrote:
> Mike Copeland writes:
>
> > I am looking for an STL container that will efficiently handle a "Top
> > 20" list.

> ...
> std::set, std::multiset, std::map, or std::multimap will always iterate in
> key order.
>
> Put your records into one of these four, whichever one's appropriate,
> specifying a comparison functor that order on your key value.


The disadvantage of this is you end up paying the cost
of sorting (and also storing) the entire set of 300,000
records.

> Then iterate over the first twenty elements of your container.


This will indeed give you what you want (but at the extra
cost mentioned above).


Best.


K. Frank
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
python-noob - which container is appropriate for later exportinginto mySql + matplotlib ? someone Python 45 04-15-2013 12:28 PM
a = [ "1", "2", "3" ] v/s a = new Array ( "1", "2", "3" )identical in all ways? okey Javascript 1 08-25-2009 12:56 PM
std::transform container => std::abs(container) Steven T. Hatton C++ 4 12-05-2004 07:10 AM
STL: container's values setup by another container Maitre Bart C++ 2 02-11-2004 12:11 AM
std::container::iterator vs std::container::pointer Vivi Orunitia C++ 11 02-04-2004 08:09 AM



Advertisments