Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C++ > How to quickly search through arrays?

How to quickly search through arrays?

 
 
Steve555
02-22-2010
Hi

I have a set of 171 entities (in this case they're integers
representing musical phrases). These exist in sequences of length 4,
and I have a library (lookup table) of such sequences, e.g.:

2 58 93 170
38 39 100 150
38 40 41 87
etc.

The table is sorted and is quite sparse; it contains 333,932 sequences
out of a possible 855,036,081 (171^4).

I'm looking for fast ways to find matches for a query containing 1 or
2 wildcards.
e.g. find matches for:
2 58 ? 170
? 39 ? 150

At the moment I'm using brute force to just iterate through the
lookup table, but this is proving very slow as the algorithm needs to
do hundreds of searches in close to real time.

I know very little about searching, so any ideas for me to study would
be welcomed. I do know how to do a binary search, but only on the
first entry, so if that is a wildcard I'm stumped.

I'm happy to re-format/re-package the data if it could improve the
search speed. I'm not concerned about size or memory usage.

Thanks for any suggestions

Steve


 
Maxim Yegorushkin
02-22-2010
On 22/02/10 08:02, Steve555 wrote:
> I'm looking for fast ways to find matches for a query containing 1 or
> 2 wildcards.
> [...]
> I do know how to do a binary search, but only from the first entry, so
> if that is a wildcard I'm stumped.


Sounds like you might benefit from a ternary search tree. They offer
lookup performance comparable to hash tables, and they support
partial-match (wildcard) searches.

http://www.cs.princeton.edu/~rs/strings/
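For illustration, here is a minimal sketch of a ternary search tree over
the 4-integer sequences, with a partial-match search where -1 stands for
the '?' wildcard. This is only an outline of the idea, not Bentley and
Sedgewick's exact code:

```cpp
#include <cassert>
#include <vector>

// Node of a ternary search tree over fixed-length integer sequences.
struct Node {
    int split;              // the symbol stored at this node
    Node *lo, *eq, *hi;     // less-than, equal (next position), greater-than
    bool end;               // true if a full 4-symbol sequence ends here
    Node(int s) : split(s), lo(0), eq(0), hi(0), end(false) {}
};

// Insert the 4-symbol sequence 'seq', starting at position 'd'.
Node* insert(Node* n, const int* seq, int d) {
    if (!n) n = new Node(seq[d]);
    if (seq[d] < n->split)      n->lo = insert(n->lo, seq, d);
    else if (seq[d] > n->split) n->hi = insert(n->hi, seq, d);
    else if (d < 3)             n->eq = insert(n->eq, seq, d + 1);
    else                        n->end = true;
    return n;
}

// Partial-match search: query[i] == -1 matches any symbol at position i.
// Matching sequences are appended to 'out'.
void search(const Node* n, const int* query, int d,
            std::vector<int>& prefix, std::vector<std::vector<int> >& out) {
    if (!n) return;
    bool wild = (query[d] == -1);
    if (wild || query[d] < n->split) search(n->lo, query, d, prefix, out);
    if (wild || query[d] == n->split) {
        prefix.push_back(n->split);
        if (d == 3) { if (n->end) out.push_back(prefix); }
        else search(n->eq, query, d + 1, prefix, out);
        prefix.pop_back();
    }
    if (wild || query[d] > n->split) search(n->hi, query, d, prefix, out);
}
```

A query with no wildcards degenerates to a single root-to-leaf walk; each
wildcard widens the search to a subtree, which is exactly the partial-match
behaviour Bentley and Sedgewick describe.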

--
Max
 
Öö Tiib
02-22-2010
On Feb 22, 10:02 am, Steve555 <(E-Mail Removed)> wrote:
> [...]
> I do know how to do a binary search, but only from the first entry, so
> if that is a wildcard I'm stumped.

You may create index vectors into your table (sorted by the other
phrase positions). To search by the first phrase, use
std::lower_bound() on the table itself. To search by another phrase,
use std::lower_bound() on the corresponding sorted index. If you have
a static array (no new phrases are ever added), you may want to build
pre-searched indexes to speed it up even more.
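A rough sketch of the index-vector idea (all names invented for
illustration): store one (column value, row number) vector per column,
sorted by value, and binary-search whichever index matches the leftmost
non-wildcard column:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

typedef std::vector<int> Seq;        // one 4-phrase sequence
typedef std::pair<int, int> Entry;   // (value in a column, row number)

// Build the index for one column: pairs sorted by value (then row).
std::vector<Entry> makeIndex(const std::vector<Seq>& table, int col) {
    std::vector<Entry> idx;
    for (size_t i = 0; i < table.size(); ++i)
        idx.push_back(Entry(table[i][col], (int)i));
    std::sort(idx.begin(), idx.end());
    return idx;
}

// All rows whose indexed column equals 'value': one binary search,
// O(log n + matches), instead of a scan of all 334K rows.
std::vector<int> rowsWith(const std::vector<Entry>& idx, int value) {
    std::vector<Entry>::const_iterator it =
        std::lower_bound(idx.begin(), idx.end(), Entry(value, -1));
    std::vector<int> rows;
    for (; it != idx.end() && it->first == value; ++it)
        rows.push_back(it->second);
    return rows;
}
```

For a query like "? 39 ? 150" you would binary-search the column-1 index
for 39, then keep only the candidate rows whose column 3 equals 150.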

> I'm happy to re-format/re-package the data if it could improve the
> search speed. I'm not concerned about size or memory usage.


Possibly you still want to build a few alternative implementations and
profile them integrated into your application. A large memory
footprint is a factor that often drags performance down; it is common
to underestimate how devastating frequent cache misses are to
performance.

>
> Thanks for any suggestions
>
> Steve

 
Marcel Müller
02-22-2010
Hi,

Steve555 wrote:
> I'm looking for fast ways to find matches for a query containing 1 or
> 2 wildcards, e.g.:
> 2 58 ? 170
> ? 39 ? 150
> [...]
> At the moment I'm using brute force to just iterate through the
> lookup table, but this is proving very slow as the algorithm needs to
> do hundreds of searches in close to real time.


What is the change rate of your data set?

If the data changes rarely, create six hash tables, one for each pair
of columns. If you want to speed up requests with only one wildcard
even more, place a nested hash table for a third column in the value
type of the above hash tables. This makes all lookups with one or two
wildcards O(1). Note that you do not need to support all combinations,
because with three given columns you can choose a suitable pair-wise
hash table and filter on the third column.
If memory consumption matters more than scalability, you could replace
the many inner hash tables with sorted lists.

Btw., how large can your numbers grow? If an unsigned char is
sufficient, you can pack an entire line into a single 32-bit
structure. Even with unsigned short data, keeping redundant copies of
your rows in the data structures is better than storing references.
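Since each value is at most 171, it fits in one byte. A minimal sketch of
the packing (one possible byte layout, not necessarily Marcel's):

```cpp
#include <cassert>
#include <stdint.h>

// Pack four phrase values (each 0..255, here at most 171) into one
// 32-bit word: byte 0 holds the first phrase, byte 3 the last.
uint32_t pack(unsigned a, unsigned b, unsigned c, unsigned d) {
    return a | (b << 8) | (c << 16) | (d << 24);
}

// Extract the phrase at position i (0..3) again.
unsigned unpack(uint32_t seq, int i) {
    return (seq >> (8 * i)) & 0xFF;
}
```

A sorted std::vector<uint32_t> of packed rows is compact and
cache-friendly, and an exact-match lookup becomes a single binary search
on the packed value.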


> I do know how to do a binary search, but only from the first entry, so
> if that is a wildcard I'm stumped.


You need different entry points for different types of queries.


Marcel
 
Steve555
02-22-2010
On 22 Feb, 09:34, Maxim Yegorushkin <(E-Mail Removed)> wrote:
> Sounds like you might benefit from a ternary search tree. They offer
> lookup performance comparable to hash tables, and they support
> partial-match (wildcard) searches.
>
> http://www.cs.princeton.edu/~rs/strings/


Thanks Max, the Dr Dobbs link there has a lot of useful stuff.
 
Steve555
02-22-2010
On 22 Feb, 09:53, Öö Tiib <(E-Mail Removed)> wrote:
> You may create index vectors into your table (sorted by the other
> phrase positions). [...] If you have a static array (no new phrases
> are ever added), you may want to build pre-searched indexes to speed
> it up even more.

Did you mean make 3 other index vectors for the elements in the 2nd,
3rd, 4th positions?
I can see how that would allow me to use binary search on any
'column'.

I'm not sure what 'pre-searched indexes' means.

Thanks

Steve
 
Steve555
02-22-2010
On 22 Feb, 10:06, Marcel Müller <(E-Mail Removed)> wrote:
> If the data changes rarely, create six hash tables, one for each pair
> of columns. [...]
> Btw., how large can your numbers grow? If an unsigned char is
> sufficient, you can pack an entire line into a single 32-bit
> structure.

Thanks for the comprehensive reply. I've just googled as much as I
could to try and understand before replying. I really need to learn
about hash tables before I get why I need six of them!
(I did say I was new to search techniques.)

The lookup table (library of sequences) never changes. The elements
will never be bigger than 171, so yes, I could pack 4 of them into 32
bits. Are you, then, saying to store each sequence-of-4 in a single
unsigned long and do some kind of bit search? Sorry, I don't know how
that would work... how would I differentiate between the sequences
that exist in the library and those that don't?

Thanks

Steve


 
Gert-Jan de Vos
02-22-2010
On Feb 22, 2:32 pm, Steve555 <(E-Mail Removed)> wrote:
> The lookup table (library of sequences) never changes. The elements
> will never be bigger than 171, so yes, I could pack 4 of them into 32
> bits.


In that case you can generate a perfect hash function for the full
table, four more for the single-wildcard cases, and six for the
double-wildcard cases. This should give you the fastest possible
searches.
 
Öö Tiib
02-22-2010
On Feb 22, 3:24 pm, Steve555 <(E-Mail Removed)> wrote:
> Did you mean make 3 other index vectors for the elements in the 2nd,
> 3rd, 4th positions?
> I can see how that would allow me to use binary search on any
> 'column'.

Yes.

> I'm not sure what 'pre-searched indexes' means.


I meant: do not stop at making indexes.

Your array will have 334K records (let's call it 'array'). It is
sorted by the 0th phrase. There are also three 334K vectors containing
array indexes, sorted by the 1st, 2nd and 3rd phrases (let's call them
'index1', etc.).

Now suppose a search usually starts with a question like: where do the
sequences whose 2nd phrase is 38 start in 'index2'? Since your array
is static, you can search for that in advance and put the results into
a vector of 171 indexes-of-indexes (let's call it 'firstIndex2'). As a
result, the first array element with phrase 38, by index2, is:

array[ index2[ firstIndex2[38] ] ];

Similarly you make 'lastIndex2'. This leaves no searching at all over
the 334K space: you only scan between these two indexes, knowing that
all the matching records lie between them.
 
Marcel Müller
02-22-2010
Steve555 wrote:
> Thanks for the comprehensive reply. I've just googled as much as I
> could to try and understand before replying.
> I really need to learn about hash tables, before I get why I need six
> of them!
> (I did say I was new to search techniques )


Feel free to use std::hash_map (a common pre-standard extension) and
forget about the internal structure of hash tables.


> The lookup table (library of sequences) never changes. The elements
> will never be bigger than 171, so yes I could pack 4 of them into 32
> bits.
> Are you, then, saying to store each sequence-of-4 in a single
> unsigned long, and do some kind of bit search?


typedef unsigned char MyEntry[4];

will most likely fit your needs.

If you encounter problems because array types are a bit special in
C/C++, wrap it in a struct:

struct MyEntry
{ unsigned char Column[4];
};

Objects of type MyEntry are now 32 bits.


struct MyHashKey
{ unsigned char Key[2];
};

MyHashKey is intended to be used as the hash key for two arbitrary
columns.

Because MyHashKey is not an integral type, you must define a hash
functor to use it in a hash table.

struct MyHashFunc
{ size_t operator()(MyHashKey key) const
  { // portable version:
    return key.Key[0] | (key.Key[1] << 8);
    // fast non-portable version:
    // return *(const unsigned short*)key.Key;
  }
};

As content I would recommend something like:

template <typename Compare>
struct MyHashEntry
{ std::set<MyEntry, Compare> Matches;
};

It will contain the matching MyEntry lines of your data, ordered by
the remaining columns, to speed up lookups with one or no wildcards.
If you have two wildcards, enumerate over the entire sublist.
The Compare template parameter is needed because the content is sorted
by different columns depending on your key columns. It is a functor
that compares two MyEntries.

Now you need the comparator mentioned above (column indices are
0-based here):

template <int COL1, int COL2>
struct MyColumnComparer
{ bool operator()(const MyEntry& l, const MyEntry& r) const
  { unsigned lkey = (l.Column[COL1] << 8) | l.Column[COL2];
    unsigned rkey = (r.Column[COL1] << 8) | r.Column[COL2];
    return lkey < rkey;
  }
};

And now you can define your root data structures, mapping each pair of
key columns to the sorted set of matching rows:

std::hash_map<MyHashKey,
              MyHashEntry<MyColumnComparer<2,3> >,
              MyHashFunc> EntriesBy1234;

The above hash table instance is intended to store the entries hashed
by the first two columns and, for each distinct pair of values of
those columns, ordered by the remaining two. In the same way you can
define further entry points for your searches:

std::hash_map<MyHashKey,
              MyHashEntry<MyColumnComparer<1,3> >,
              MyHashFunc> EntriesBy1324;
std::hash_map<MyHashKey,
              MyHashEntry<MyColumnComparer<1,2> >,
              MyHashFunc> EntriesBy1423;
std::hash_map<MyHashKey,
              MyHashEntry<MyColumnComparer<0,3> >,
              MyHashFunc> EntriesBy2314;
std::hash_map<MyHashKey,
              MyHashEntry<MyColumnComparer<0,2> >,
              MyHashFunc> EntriesBy2413;
std::hash_map<MyHashKey,
              MyHashEntry<MyColumnComparer<0,1> >,
              MyHashFunc> EntriesBy3412;

For lookups with only one wildcard you can choose any table whose key
columns are among the given columns; there is some redundancy in this
case.


Note that all of this is only a rough idea; I wrote it down quite
quickly, so it may still contain syntactical errors.
Also note that the data structure is optimized only for lookups, not
for the population process. Changes are quite slow because they have
to be applied in several places simultaneously.


> Sorry, I don't know how that would work... how would I differentiate
> between the ones the exist in the library, and those that don't?


Sorry, I didn't understand your last question.


Marcel
 