Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C++ > Selecting Container for "Top 20" Application

Reply
Thread Tools

Selecting Container for "Top 20" Application

 
 
Jorgen Grahn
Guest
Posts: n/a
 
      08-19-2013
On Mon, 2013-08-19, Robert Wessel wrote:
> On 19 Aug 2013 09:28:18 GMT, Jorgen Grahn <(E-Mail Removed)>
> wrote:
>
>>On Sun, 2013-08-18, Robert Wessel wrote:
>>> On Sun, 18 Aug 2013 12:21:46 -0700 (PDT), Paul N <(E-Mail Removed)>
>>> wrote:

>>...
>>>>Then at the end just deallocate spare. You only need 21 allocations no matter how many records you have (if you're really trying you could allocate one big chunk for all of them in one go) because you reuse the space. Insertions are quick because for a linked list you don't move anything, you just change pointers. And you only need to run through the list when you've got a record that is in the top 20 so far.
>>>
>>> Not to pick on you specifically Paul (the post had to go somewhere!),
>>> but has the state of programming knowledge really deteriorated so much
>>> that almost no one knows what a priority queue is?

>>
>>Surely you can't use std:riority_queue without placing all 300K
>>entries in it? I think that's the main reason people are avoiding
>>it.
>>
>>Perhaps that space requirement doesn't matter to the OP, but
>>it's tempting to try to avoid it, when it's obvious that it
>>/is/ avoidable.

>
>
> for(..300K items..)
> {
> get(item);
> pq.push(item);
> if (pq.size() > 20)
> pq.pop()
> }
>
>
> The priority queue always has the "highest"* value available for
> immediate removal. Let's say there are 20 items in the priority
> queue, stuff a new item in, and then you have 21, with the "highest"
> one of those immediately removable. Since we only want the top 20,
> just pull that extra one off and discard it.


Ah, of course! I was mentally placing the elements in the wrong
order, so that pop() would prematurely remove a candidate for the
top-20 list rather than an obvious non-candidate.

You're right, this is the obvious way to do it.

And perhaps I should be more used to dealing with priority queues.
I *do* know what it is, but I couldn't apply it correctly to this
problem.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
 
Reply With Quote
 
 
 
 
James Kanze
Guest
Posts: n/a
 
      08-19-2013
On Saturday, 17 August 2013 20:07:18 UTC+1, Mike Copeland wrote:
> I am looking for an STL container that will efficiently handle a "Top
> 20" list. Specifically, I have >300,000 data records that I must scan
> to find and store the highest 20 values.
> None of the available STL containers seems well suited for this
> application. For example:
> - array, even though fixed in size, seems cumbersome to use: no delete
> or erase function is available; insertion at some point is difficult;
> etc. Replacement of the "end value" and sorting might work, but it's
> tedious and slow...
> - vector, while having insert & erase functions, incurs repeated size
> expansion (expensive!) and truncation to maintain a fixed size.
> Appending a new value if it's greater than the 20th element, followed by
> sorting and truncation might be less work, but it might be very slow to
> execute.
> - queue/deque, set, and stack seem inappropriate for this application,
> and the others (map, list, etc.) are completely unusable here.
> Am I missing something? Is/are there reasonably efficient ways to
> use array or vector that are appropriate? TIA


The most obvious solution is std::vector, but keeping a maximum
of 20 elements in it, and using std:ush_heap and std:op_heap
to manage it. All of the algorithms you need are already
present in the standard library.

--
James
 
Reply With Quote
 
 
 
 
Luca Risolia
Guest
Posts: n/a
 
      08-19-2013
Sebastian Pfützner wrote:

> Finding top-elements via heap-extract: m*O(log n)
>
> So a heap can be the better choice. It would be interesting to measure,
> if this is true with n=300k and m=20.


Oh right, std::vector and std::make/push/pop_heap() might be a better choice,
but probably std:riority_queue is not ideal for the user needs, as it would
require to remove the elements from the container each time to get the "top
20" list.

 
Reply With Quote
 
Jorgen Grahn
Guest
Posts: n/a
 
      08-19-2013
On Mon, 2013-08-19, Luca Risolia wrote:
> Sebastian Pfützner wrote:
>
>> Finding top-elements via heap-extract: m*O(log n)
>>
>> So a heap can be the better choice. It would be interesting to measure,
>> if this is true with n=300k and m=20.

>
> Oh right, std::vector and std::make/push/pop_heap() might be a better choice,
> but probably std:riority_queue is not ideal for the user needs, as it would
> require to remove the elements from the container each time to get the "top
> 20" list.


Not /each/ time. Most elements can be discarded immediately without
inserting them, unless the input is already sorted.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
 
Reply With Quote
 
Sebastian Pfützner
Guest
Posts: n/a
 
      08-22-2013
On 19.08.2013 20:42, James Kanze wrote:
> On Saturday, 17 August 2013 20:07:18 UTC+1, Mike Copeland wrote:
>> I am looking for an STL container that will efficiently handle a "Top
>> 20" list. Specifically, I have >300,000 data records that I must scan
>> to find and store the highest 20 values.
>> None of the available STL containers seems well suited for this
>> application. For example:
>> - array, even though fixed in size, seems cumbersome to use: no delete
>> or erase function is available; insertion at some point is difficult;
>> etc. Replacement of the "end value" and sorting might work, but it's
>> tedious and slow...
>> - vector, while having insert & erase functions, incurs repeated size
>> expansion (expensive!) and truncation to maintain a fixed size.
>> Appending a new value if it's greater than the 20th element, followed by
>> sorting and truncation might be less work, but it might be very slow to
>> execute.
>> - queue/deque, set, and stack seem inappropriate for this application,
>> and the others (map, list, etc.) are completely unusable here.
>> Am I missing something? Is/are there reasonably efficient ways to
>> use array or vector that are appropriate? TIA

>
> The most obvious solution is std::vector, but keeping a maximum
> of 20 elements in it, and using std:ush_heap and std:op_heap
> to manage it. All of the algorithms you need are already
> present in the standard library.


This also has the advantage that you don't need to remove the
top-elements from the vector as with the priority_queue. std:op_heap
just swaps the top_element into the back. Do this 20 times and you have
your top-elements in [last - 20, last). You can simply reinert those
elements back into the heap. You also follow the same procedure when you
need to change the value of elements. (pop, change back element, push).

--
Sebastian Pfützner
http://www.velocityreviews.com/forums/(E-Mail Removed)
ICQ-ID: 39965036
 
Reply With Quote
 
Sebastian Pfützner
Guest
Posts: n/a
 
      08-22-2013
On 22.08.2013 13:13, Sebastian Pfützner wrote:
> On 19.08.2013 20:42, James Kanze wrote:
>> On Saturday, 17 August 2013 20:07:18 UTC+1, Mike Copeland wrote:
>>> I am looking for an STL container that will efficiently handle a "Top
>>> 20" list. Specifically, I have >300,000 data records that I must scan
>>> to find and store the highest 20 values.
>>> None of the available STL containers seems well suited for this
>>> application. For example:
>>> - array, even though fixed in size, seems cumbersome to use: no delete
>>> or erase function is available; insertion at some point is difficult;
>>> etc. Replacement of the "end value" and sorting might work, but it's
>>> tedious and slow...
>>> - vector, while having insert & erase functions, incurs repeated size
>>> expansion (expensive!) and truncation to maintain a fixed size.
>>> Appending a new value if it's greater than the 20th element, followed by
>>> sorting and truncation might be less work, but it might be very slow to
>>> execute.
>>> - queue/deque, set, and stack seem inappropriate for this application,
>>> and the others (map, list, etc.) are completely unusable here.
>>> Am I missing something? Is/are there reasonably efficient ways to
>>> use array or vector that are appropriate? TIA

>>
>> The most obvious solution is std::vector, but keeping a maximum
>> of 20 elements in it, and using std:ush_heap and std:op_heap
>> to manage it. All of the algorithms you need are already
>> present in the standard library.

>
> This also has the advantage that you don't need to remove the
> top-elements from the vector as with the priority_queue. std:op_heap
> just swaps the top_element into the back. Do this 20 times and you have
> your top-elements in [last - 20, last). You can simply reinert those
> elements back into the heap. You also follow the same procedure when you
> need to change the value of elements. (pop, change back element, push).


To clarify: I would store all elements in this vector and keep the
heap-sort-order in sync. This has the advantage that it is still
efficient if the top-elements are computed more than once after elements
are added/removed/changed. James' proposal does not have this property,
because the heap hast to be rebuild every time by iterating over all
elements even if only some elements changed.

--
Sebastian Pfützner
(E-Mail Removed)
ICQ-ID: 39965036
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
python-noob - which container is appropriate for later exportinginto mySql + matplotlib ? someone Python 45 04-15-2013 12:28 PM
a = [ "1", "2", "3" ] v/s a = new Array ( "1", "2", "3" )identical in all ways? okey Javascript 1 08-25-2009 12:56 PM
std::transform container => std::abs(container) Steven T. Hatton C++ 4 12-05-2004 07:10 AM
STL: container's values setup by another container Maitre Bart C++ 2 02-11-2004 12:11 AM
std::container::iterator vs std::container::pointer Vivi Orunitia C++ 11 02-04-2004 08:09 AM



Advertisments