Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C++ > Algorithm on search

Reply
Thread Tools

Algorithm on search

 
 
Vincent SHAO
Guest
Posts: n/a
 
      05-06-2008
Search engine have to record all of the query string. Now i have a
search engine log which contains 10 milllion query strings, but almost
of them are repeated, not more than 3 million of them are non-
repeated.
My task is to pick the top 10 most popular query string, memory < 1G,
the length of the query string is no more than 255.

The faster, the better.
the principal solutions, algorithm and data structure.

Thank you.
 
Reply With Quote
 
 
 
 
Pascal J. Bourguignon
Guest
Posts: n/a
 
      05-06-2008
Vincent SHAO <(E-Mail Removed)> writes:

> Search engine have to record all of the query string. Now i have a
> search engine log which contains 10 milllion query strings, but almost
> of them are repeated, not more than 3 million of them are non-
> repeated.
> My task is to pick the top 10 most popular query string, memory < 1G,
> the length of the query string is no more than 255.
>
> The faster, the better.
> the principal solutions, algorithm and data structure.


Sounds like homework... Is it homework?


In real "search engine" queries, it's not strings, but _sets_ of
keywords that are searched, so already your data structures will be
suboptimal... That is, unless it's homework.

In anycase, have a look at http://en.wikipedia.org/wiki/Bloom_filter

--
__Pascal Bourguignon__
 
Reply With Quote
 
 
 
 
Jim Langston
Guest
Posts: n/a
 
      05-06-2008
Vincent SHAO wrote:
> Search engine have to record all of the query string. Now i have a
> search engine log which contains 10 milllion query strings, but almost
> of them are repeated, not more than 3 million of them are non-
> repeated.
> My task is to pick the top 10 most popular query string, memory < 1G,
> the length of the query string is no more than 255.
>
> The faster, the better.
> the principal solutions, algorithm and data structure.
>
> Thank you.


My first attempt would be to stuff the query strings into a map with the
query string (or a hash of it) as the key, the number of times it occurs as
the data.

Then a loop to read the data and sort, or simply compare counts and store
the keys for the top 10.


--
Jim Langston
http://www.velocityreviews.com/forums/(E-Mail Removed)


 
Reply With Quote
 
James Kanze
Guest
Posts: n/a
 
      05-07-2008
On May 6, 2:15 pm, (E-Mail Removed) (Pascal J. Bourguignon)
wrote:
> Vincent SHAO <(E-Mail Removed)> writes:
> > Search engine have to record all of the query string. Now i
> > have a search engine log which contains 10 milllion query
> > strings, but almost of them are repeated, not more than 3
> > million of them are non-repeated. My task is to pick the
> > top 10 most popular query string, memory < 1G, the length of
> > the query string is no more than 255.


> > The faster, the better. the principal solutions, algorithm
> > and data structure.


> Sounds like homework... Is it homework?


It sounds like a problem that really could come up
professionally (which doesn't mean it isn't homework---a well
conceived course will use realistic examples).

> In real "search engine" queries, it's not strings, but _sets_ of
> keywords that are searched, so already your data structures will be
> suboptimal... That is, unless it's homework.


What about SQL or LDAP? The problem is that he probably should
consider requests that different only in white space to be the
same; for that matter (considering SQL), something like "SELECT
* FROM table1 WHERE a > 5 AND b < 10" is the same as "select *
from table1 where b < 10 and a > 5". He probably needs to
normalize the input somewhat, but how far depends on the project
requirements.

Anyway, the obvious solution is to use an std::map< Query, long >
for starters, then copy the tabluation into a vector (of
Map::value_type) and sort that on the criteria he's interested
in. If that's not fast enough, replace std::map with a not
yet standard hash_map, and only if that's not fast enough, to
start worrying about other solutions.

> In anycase, have a look athttp://en.wikipedia.org/wiki/Bloom_filter


That saves space, not time, and probably isn't appropriate for
such a small table. (Space savings become time savings when the
result in less disk accesses. In his case, however, the
complete table will almost certainly fit into real memory.)

--
James Kanze (GABI Software) email:(E-Mail Removed)
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Filtered Back Projection Algorithm (FBP Algorithm) Bapaiah Katepalli VHDL 1 06-23-2006 04:50 PM
search within a search within a search - looking for better way...my script times out Abby Lee ASP General 5 08-02-2004 04:01 PM
Article: An Analysis of the Google Search Engine Algorithm Johann Blake ASP .Net 0 01-21-2004 03:21 PM
Key generation algorithm and Cipher algorithm Ahmed Moustafa Java 0 11-15-2003 06:35 AM



Advertisments