Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C++ > ? about priority_queue how make objects unique

Reply
Thread Tools

? about priority_queue how make objects unique

 
 
Rich S.
Guest
Posts: n/a
 
      10-05-2005
Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so


------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT greater
than the top and then pushing obj but I'm convinced thats the best way.

Thanks for any help.

Regards.


 
Reply With Quote
 
 
 
 
mlimber
Guest
Posts: n/a
 
      10-05-2005
Rich S. wrote:
> Hi
>
> In an earlier posting I was asking how to read thru millions of data records
> keeping the top 2000 (where the top values are based upon a field in the
> record) and niklasb suggested using a priority_queue like so
>
>
> ------------------- start old message ---------------------
> A more efficient way than what you describe would be
> to use priority_queue. Define the predicate in such
> a way that the top() always returns the lowest scoring
> object.
>
> Let q be the priority_queue and suppose you want to
> find the top N scores (where N is 2000 in your example),
> the pseudo-code would be something like this:
>
> q.reserve(N);
>
> // add the first N elements to the priority queue
> for (i = 0; i < N; ++i)
> {
> if (!ReadRecord(&obj))
> break;
>
> q.push(obj);
> }
>
> // invariant: q contains highest N elements
> // q.top() is the lowest of the highest N elements
> while (ReadRecord(&obj))
> {
> if (obj > q.top())
> {
> q.pop();
> q.push(obj);
> }
> }
>
> ------------------- end of old message ---------------------
>
> This is working well for me but I'm getting cases where I have duplicate
> records in the top 2000 and I'm not quite sure how to fix that problem.
>
> I was thinking of running a loop around the pop until the obj is NOT greater
> than the top and then pushing obj but I'm convinced thats the best way.
>
> Thanks for any help.
>
> Regards.


Are you getting duplicates because there are duplicates in your file(s)
or because there is an error in the program? (If the former, you might
consider using a std::unique, perhaps in conjunction with a
std::vector, to find the first N unique records. See the docs here:
http://www.sgi.com/tech/stl/unique.html .)

Cheers! --M

 
Reply With Quote
 
 
 
 
Kai-Uwe Bux
Guest
Posts: n/a
 
      10-05-2005
Rich S. wrote:

> Hi
>
> In an earlier posting I was asking how to read thru millions of data
> records keeping the top 2000 (where the top values are based upon a field
> in the record) and niklasb suggested using a priority_queue like so
>
>
> ------------------- start old message ---------------------
> A more efficient way than what you describe would be
> to use priority_queue. Define the predicate in such
> a way that the top() always returns the lowest scoring
> object.
>
> Let q be the priority_queue and suppose you want to
> find the top N scores (where N is 2000 in your example),
> the pseudo-code would be something like this:
>
> q.reserve(N);
>
> // add the first N elements to the priority queue
> for (i = 0; i < N; ++i)
> {
> if (!ReadRecord(&obj))
> break;
>
> q.push(obj);
> }
>
> // invariant: q contains highest N elements
> // q.top() is the lowest of the highest N elements
> while (ReadRecord(&obj))
> {
> if (obj > q.top())
> {
> q.pop();
> q.push(obj);
> }
> }
>
> ------------------- end of old message ---------------------
>
> This is working well for me but I'm getting cases where I have duplicate
> records in the top 2000 and I'm not quite sure how to fix that problem.
>
> I was thinking of running a loop around the pop until the obj is NOT
> greater than the top and then pushing obj but I'm convinced thats the best
> way.


If you care about uniqueness, then maybe std::set is the way to go. Consider
someting like this:
#include <set>

template < typename T, unsigned long N >
struct top_N {

std::set<T> elements;

void insert ( T const & t ) {
elements.insert( t );
if ( elements.size() > N ) {
elements.erase( elements.begin() );
}
}

}; // struct top_N


Best

Kai-Uwe Bux
 
Reply With Quote
 
Kristo
Guest
Posts: n/a
 
      10-05-2005
Rich S. wrote:
> Hi
>
> In an earlier posting I was asking how to read thru millions of data records
> keeping the top 2000 (where the top values are based upon a field in the
> record) and niklasb suggested using a priority_queue like so


[snip old message]

> This is working well for me but I'm getting cases where I have duplicate
> records in the top 2000 and I'm not quite sure how to fix that problem.


What if you used a set as the underlying container for your
priority_queue? That would automagically prevent duplicates from being
inserted.

> I was thinking of running a loop around the pop until the obj is NOT greater
> than the top and then pushing obj but I'm convinced thats the best way.


That's too much work IMHO, and probably inefficient. Just let the set
do that for you.

Kristo

 
Reply With Quote
 
Mark P
Guest
Posts: n/a
 
      10-06-2005
Rich S. wrote:
> Hi
>
> In an earlier posting I was asking how to read thru millions of data records
> keeping the top 2000 (where the top values are based upon a field in the
> record) and niklasb suggested using a priority_queue like so
>
>
> ------------------- start old message ---------------------
> A more efficient way than what you describe would be
> to use priority_queue. Define the predicate in such
> a way that the top() always returns the lowest scoring
> object.
>
> Let q be the priority_queue and suppose you want to
> find the top N scores (where N is 2000 in your example),
> the pseudo-code would be something like this:
>
> q.reserve(N);
>
> // add the first N elements to the priority queue
> for (i = 0; i < N; ++i)
> {
> if (!ReadRecord(&obj))
> break;
>
> q.push(obj);
> }
>
> // invariant: q contains highest N elements
> // q.top() is the lowest of the highest N elements
> while (ReadRecord(&obj))
> {
> if (obj > q.top())
> {
> q.pop();
> q.push(obj);
> }
> }
>
> ------------------- end of old message ---------------------
>
> This is working well for me but I'm getting cases where I have duplicate
> records in the top 2000 and I'm not quite sure how to fix that problem.
>
> I was thinking of running a loop around the pop until the obj is NOT greater
> than the top and then pushing obj but I'm convinced thats the best way.
>
> Thanks for any help.
>
> Regards.
>
>


I agree with previous replies that a std::set is a good approach (just
make sure to check the return value of insert to see if an item was
actually inserted, if you want to ensure always 2000 records). Another
option is a hashtable to record which elts. are currently in the queue.
 
Reply With Quote
 
niklasb@microsoft.com
Guest
Posts: n/a
 
      10-06-2005
Kristo wrote:
> Rich S. wrote:
> > Hi
> >
> > In an earlier posting I was asking how to read thru millions of data records
> > keeping the top 2000 (where the top values are based upon a field in the
> > record) and niklasb suggested using a priority_queue like so

>
> [snip old message]
>
> > This is working well for me but I'm getting cases where I have duplicate
> > records in the top 2000 and I'm not quite sure how to fix that problem.

>
> What if you used a set as the underlying container for your
> priority_queue? That would automagically prevent duplicates from being
> inserted.


I believe the underlying container for priority_queue must be a
sequence container rather than an associative container. However,
you could use set _instead_ of priority_queue. The algorithm
doesn't change very much.

As before, this is just off the top of my head. I haven't compiled
it, much lest tested it, so use at your own risk and all that.

std::set<MyType> highest;
MyType obj;
while (ReadRecord(&obj))
{
// Is obj one of the highest so far?
if (highest.count() < count || obj > *highest.begin())
{
// Insert it into the set; this has no effect if the set
// already contains an identical value.
highest.insert(obj);

// If too many elements discard the lowest one.
if (highest.count() > count)
{
highest.erase(highest.begin());
}
}
}

 
Reply With Quote
 
Rich S.
Guest
Posts: n/a
 
      10-06-2005

"mlimber" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) oups.com...
> Rich S. wrote:
>> Hi
>>
>> In an earlier posting I was asking how to read thru millions of data
>> records
>> keeping the top 2000 (where the top values are based upon a field in the
>> record) and niklasb suggested using a priority_queue like so
>>
>>
>> ------------------- start old message ---------------------
>> A more efficient way than what you describe would be
>> to use priority_queue. Define the predicate in such
>> a way that the top() always returns the lowest scoring
>> object.
>>
>> Let q be the priority_queue and suppose you want to
>> find the top N scores (where N is 2000 in your example),
>> the pseudo-code would be something like this:
>>
>> q.reserve(N);
>>
>> // add the first N elements to the priority queue
>> for (i = 0; i < N; ++i)
>> {
>> if (!ReadRecord(&obj))
>> break;
>>
>> q.push(obj);
>> }
>>
>> // invariant: q contains highest N elements
>> // q.top() is the lowest of the highest N elements
>> while (ReadRecord(&obj))
>> {
>> if (obj > q.top())
>> {
>> q.pop();
>> q.push(obj);
>> }
>> }
>>
>> ------------------- end of old message ---------------------
>>
>> This is working well for me but I'm getting cases where I have duplicate
>> records in the top 2000 and I'm not quite sure how to fix that problem.
>>
>> I was thinking of running a loop around the pop until the obj is NOT
>> greater
>> than the top and then pushing obj but I'm convinced thats the best way.
>>
>> Thanks for any help.
>>
>> Regards.

>
> Are you getting duplicates because there are duplicates in your file(s)
> or because there is an error in the program? (If the former, you might
> consider using a std::unique, perhaps in conjunction with a
> std::vector, to find the first N unique records. See the docs here:
> http://www.sgi.com/tech/stl/unique.html .)
>
> Cheers! --M
>

yes duplicates in my data which I expected. I switched to a set based
solution and it works good so far.

Thanks for the tip about unique that'll be handy one day.


 
Reply With Quote
 
Rich S.
Guest
Posts: n/a
 
      10-06-2005

<(E-Mail Removed)> wrote in message
news:(E-Mail Removed) ups.com...
> Kristo wrote:
>> Rich S. wrote:
>> > Hi
>> >
>> > In an earlier posting I was asking how to read thru millions of data
>> > records
>> > keeping the top 2000 (where the top values are based upon a field in
>> > the
>> > record) and niklasb suggested using a priority_queue like so

>>
>> [snip old message]
>>
>> > This is working well for me but I'm getting cases where I have
>> > duplicate
>> > records in the top 2000 and I'm not quite sure how to fix that problem.

>>
>> What if you used a set as the underlying container for your
>> priority_queue? That would automagically prevent duplicates from being
>> inserted.

>
> I believe the underlying container for priority_queue must be a
> sequence container rather than an associative container. However,
> you could use set _instead_ of priority_queue. The algorithm
> doesn't change very much.
>
> As before, this is just off the top of my head. I haven't compiled
> it, much lest tested it, so use at your own risk and all that.
>
> std::set<MyType> highest;
> MyType obj;
> while (ReadRecord(&obj))
> {
> // Is obj one of the highest so far?
> if (highest.count() < count || obj > *highest.begin())
> {
> // Insert it into the set; this has no effect if the set
> // already contains an identical value.
> highest.insert(obj);
>
> // If too many elements discard the lowest one.
> if (highest.count() > count)
> {
> highest.erase(highest.begin());
> }
> }
> }
>


I switched to a set based solution and it works pretty good.

Thanks


 
Reply With Quote
 
Rich S.
Guest
Posts: n/a
 
      10-06-2005

"Mark P" <(E-Mail Removed)> wrote in message
news:_l_0f.9161$(E-Mail Removed) t...
> Rich S. wrote:
>> Hi
>>
>> In an earlier posting I was asking how to read thru millions of data
>> records keeping the top 2000 (where the top values are based upon a field
>> in the record) and niklasb suggested using a priority_queue like so
>>
>>
>> ------------------- start old message ---------------------
>> A more efficient way than what you describe would be
>> to use priority_queue. Define the predicate in such
>> a way that the top() always returns the lowest scoring
>> object.
>>
>> Let q be the priority_queue and suppose you want to
>> find the top N scores (where N is 2000 in your example),
>> the pseudo-code would be something like this:
>>
>> q.reserve(N);
>>
>> // add the first N elements to the priority queue
>> for (i = 0; i < N; ++i)
>> {
>> if (!ReadRecord(&obj))
>> break;
>>
>> q.push(obj);
>> }
>>
>> // invariant: q contains highest N elements
>> // q.top() is the lowest of the highest N elements
>> while (ReadRecord(&obj))
>> {
>> if (obj > q.top())
>> {
>> q.pop();
>> q.push(obj);
>> }
>> }
>>
>> ------------------- end of old message ---------------------
>>
>> This is working well for me but I'm getting cases where I have duplicate
>> records in the top 2000 and I'm not quite sure how to fix that problem.
>>
>> I was thinking of running a loop around the pop until the obj is NOT
>> greater than the top and then pushing obj but I'm convinced thats the
>> best way.
>>
>> Thanks for any help.
>>
>> Regards.
>>
>>

>
> I agree with previous replies that a std::set is a good approach (just
> make sure to check the return value of insert to see if an item was
> actually inserted, if you want to ensure always 2000 records). Another
> option is a hashtable to record which elts. are currently in the queue.


I switched to a set but was thinking about a map solution for speed.


 
Reply With Quote
 
Rich S.
Guest
Posts: n/a
 
      10-06-2005

"Kai-Uwe Bux" <(E-Mail Removed)> wrote in message
news:di1ce7$hcq$(E-Mail Removed)...
> Rich S. wrote:
>
>> Hi
>>
>> In an earlier posting I was asking how to read thru millions of data
>> records keeping the top 2000 (where the top values are based upon a field
>> in the record) and niklasb suggested using a priority_queue like so
>>
>>
>> ------------------- start old message ---------------------
>> A more efficient way than what you describe would be
>> to use priority_queue. Define the predicate in such
>> a way that the top() always returns the lowest scoring
>> object.
>>
>> Let q be the priority_queue and suppose you want to
>> find the top N scores (where N is 2000 in your example),
>> the pseudo-code would be something like this:
>>
>> q.reserve(N);
>>
>> // add the first N elements to the priority queue
>> for (i = 0; i < N; ++i)
>> {
>> if (!ReadRecord(&obj))
>> break;
>>
>> q.push(obj);
>> }
>>
>> // invariant: q contains highest N elements
>> // q.top() is the lowest of the highest N elements
>> while (ReadRecord(&obj))
>> {
>> if (obj > q.top())
>> {
>> q.pop();
>> q.push(obj);
>> }
>> }
>>
>> ------------------- end of old message ---------------------
>>
>> This is working well for me but I'm getting cases where I have duplicate
>> records in the top 2000 and I'm not quite sure how to fix that problem.
>>
>> I was thinking of running a loop around the pop until the obj is NOT
>> greater than the top and then pushing obj but I'm convinced thats the
>> best
>> way.

>
> If you care about uniqueness, then maybe std::set is the way to go.
> Consider
> someting like this:
> #include <set>
>
> template < typename T, unsigned long N >
> struct top_N {
>
> std::set<T> elements;
>
> void insert ( T const & t ) {
> elements.insert( t );
> if ( elements.size() > N ) {
> elements.erase( elements.begin() );
> }
> }
>
> }; // struct top_N
>
>
> Best
>
> Kai-Uwe Bux


Thanks, I did switch to a set solution and it works well.


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Re: How include a large array? Edward A. Falk C Programming 1 04-04-2013 08:07 PM
Is there a unique method in python to unique a list? Token Type Python 9 09-09-2012 02:13 PM
list question... unique values in all possible unique spots ToshiBoy Python 6 08-12-2008 05:01 AM
queue, deque, priority_queue newsock C++ 15 11-04-2003 02:26 PM
clearing a priority_queue Tino C++ 4 07-09-2003 10:05 PM



Advertisments