efficient way of looking up huge hashes

Ram Prasad
      05-25-2007
Hi,
I am quite a newbie to C programming (not new to programming
itself, though).
I am writing an email blacklist application that will look up huge DB
files like

----------
somedomain.com=emailid1
someotherdomain.com=emailid2
somedomain.com=senderdomain1
....
....

---------------------

There would be anywhere between 10k-200k such records.
What is the best way of looking up such a list? I want the
queries to return very fast; the memory footprint would ideally be low,
but that is not a limiting factor.

Should I use something like a Berkeley DB and read from a DB file, or
should I read the entire thing into memory?


Thanks
Ram

PS:
Note to Spammers: Go ahead, send me spam
(E-Mail Removed)
http://ecm.netcore.co.in/spamtrap.html

 
Richard Tobin
      05-25-2007
In article <(E-Mail Removed). com>,
Ram Prasad <(E-Mail Removed)> wrote:

>I am writing an email blacklist application that will look up huge DB
>files like
>
>----------
>somedomain.com=emailid1
>someotherdomain.com=emailid2
>somedomain.com=senderdomain1
>...
>...
>
>---------------------
>
>there would be anywhere between 10k-200k such records


Even with 200k records that's still only about 5MB, which is hardly
huge. Does a single run of the program just do one lookup? If so,
and it runs infrequently, you could just read the file in and compare
the strings as you go. On the other hand, if it sits there repeatedly
processing addresses, it would be reasonable to use an in-memory hash
table.

If you're doing single lookups but running the program very often (say
once a second), or if you think the list could get much bigger, then
an on-disk hash table such as Berkeley DB would be the way to go. It
has the advantage that it won't read the whole file.
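
For the in-memory case, a minimal fixed-size chained hash table is
enough. An untested sketch (the names are mine, not from any library):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536

struct node { char *key; struct node *next; };
static struct node *bucket[NBUCKETS];

static unsigned hash(const char *s)        /* djb2-style string hash */
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

void table_add(const char *key)            /* call once per record at load */
{
    unsigned h = hash(key);
    size_t len = strlen(key) + 1;
    struct node *n = malloc(sizeof *n);
    if (n == NULL || (n->key = malloc(len)) == NULL)
        exit(EXIT_FAILURE);
    memcpy(n->key, key, len);
    n->next = bucket[h];
    bucket[h] = n;
}

int table_contains(const char *key)        /* 1 if key exists, 0 otherwise */
{
    const struct node *n;
    for (n = bucket[hash(key)]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}

With 200k records and 64k buckets the chains average about three
nodes, so each lookup is one hash plus a couple of string compares.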

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
 
Ram Prasad
      05-25-2007
On May 25, 5:46 pm, (E-Mail Removed) (Richard Tobin) wrote:
> In article <(E-Mail Removed). com>,
> Ram Prasad <(E-Mail Removed)> wrote:
>
> >I am writing an email blacklist application that will look up huge DB
> >files like
> >
> >----------
> >somedomain.com=emailid1
> >someotherdomain.com=emailid2
> >somedomain.com=senderdomain1
> >...
> >...
> >
> >---------------------
> >
> >there would be anywhere between 10k-200k such records
>
> Even with 200k records that's still only about 5MB, which is hardly
> huge. Does a single run of the program just do one lookup? If so,
> and it runs infrequently, you could just read the file in and compare
> the strings as you go. On the other hand, if it sits there repeatedly
> processing addresses, it would be reasonable to use an in-memory hash
> table.
>
> If you're doing single lookups but running the program very often (say
> once a second), or if you think the list could get much bigger, then
> an on-disk hash table such as Berkeley DB would be the way to go. It
> has the advantage that it won't read the whole file.


This will be running in a mail-filter daemon. A single instance
would potentially do thousands of lookups. Since the program would be a
daemon, I could read the entire DB into memory during startup and
use it from memory.
Which libraries should I use for such lookups? I don't
need a full hash API, just an if_key_exists() lookup.
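
Something with an interface like this is all I need (hypothetical
names, just to show what I mean):

int blacklist_load(const char *path);   /* read the DB file at startup */
int blacklist_has(const char *key);     /* 1 if key exists, 0 otherwise */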






 
Duncan Muirhead
      05-25-2007
On Fri, 25 May 2007 06:17:50 -0700, Ram Prasad wrote:

> On May 25, 5:46 pm, (E-Mail Removed) (Richard Tobin) wrote:
>> In article <(E-Mail Removed). com>,
>> Ram Prasad <(E-Mail Removed)> wrote:
>>
>> >I am writing an email blacklist application that will look up huge DB
>> >files like
>> >
>> >----------
> <snip>
> This will be running in a mail-filter daemon. A single instance
> would potentially do thousands of lookups. Since the program would be a
> daemon, I could read the entire DB into memory during startup and
> use it from memory.
> Which libraries should I use for such lookups? I don't
> need a full hash API, just an if_key_exists() lookup.


http://judy.sourceforge.net/
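
A JudySL array maps NUL-terminated strings to word-sized values, so a
membership test is a single macro call. Roughly like this, if I remember
the JudySL macros correctly (check Judy.h and the man pages):

#include <stdint.h>
#include <Judy.h>

Pvoid_t blacklist = (Pvoid_t)NULL;   /* empty JudySL array */
Word_t *pvalue;

/* insert a key at load time */
JSLI(pvalue, blacklist, (uint8_t *)"somedomain.com=emailid1");
*pvalue = 1;

/* per-lookup membership test: pvalue is NULL when the key is absent */
JSLG(pvalue, blacklist, (uint8_t *)"somedomain.com=emailid1");
if (pvalue != NULL)
    ;   /* blacklisted */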


 
Tor Rustad
      05-26-2007
Ram Prasad wrote:

> Should I use something like a Berkeley DB and read from a DB file, or
> should I read the entire thing into memory?


I'm having a similar problem, and was thinking about using Berkeley DB too.

I wouldn't worry too much about the performance issue: if you have the table
in memory, the OS might swap your unused memory pages to disk anyway.
Likewise, if you access some disk location a lot, it will typically be in
cache.

So, using Berkeley DB is a good idea. Knuth once said:

"premature optimization is the root of all evil"

--
Tor <torust [at] online [dot] no>

 
Ram Prasad
      05-28-2007
On May 26, 5:58 am, Tor Rustad <(E-Mail Removed)> wrote:
> Ram Prasad wrote:
>
> > Should I use something like a Berkeley DB and read from a DB file, or
> > should I read the entire thing into memory?
>
> I'm having a similar problem, and was thinking about using Berkeley DB too.
>
> I wouldn't worry too much about the performance issue: if you have the table
> in memory, the OS might swap your unused memory pages to disk anyway.
> Likewise, if you access some disk location a lot, it will typically be in
> cache.
>
> So, using Berkeley DB is a good idea.


I am evaluating multiple options. I think tinycdb looks very
promising:
http://www.corpit.ru/mjt/tinycdb.html

It compiles easily on my machine and the example scripts work without
much fuss.
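
The lookup side seems to boil down to a few calls. My reading of the
cdb.h interface (not tested beyond the bundled examples):

#include <fcntl.h>
#include <string.h>
#include <cdb.h>

/* once at startup:
 *   int fd = open("blacklist.cdb", O_RDONLY);
 *   struct cdb cdb;
 *   cdb_init(&cdb, fd);
 */

int blacklist_has(struct cdb *cdbp, const char *s)
{
    /* cdb_find() returns >0 when the key is present, 0 when absent */
    return cdb_find(cdbp, s, strlen(s)) > 0;
}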

Unfortunately, with so many options available there is no single
standard method. How does one make a choice?


 
Tor Rustad
      05-28-2007
Ram Prasad wrote:
> On May 26, 5:58 am, Tor Rustad <(E-Mail Removed)> wrote:
>> Ram Prasad wrote:
>>
>>> Should I use something like a Berkeley DB and read from a DB file, or
>>> should I read the entire thing into memory?
>>
>> I'm having a similar problem, and was thinking about using Berkeley DB too.
>> <snip>
>> So, using Berkeley DB is a good idea.
>
> I am evaluating multiple options. I think tinycdb looks very
> promising:
> http://www.corpit.ru/mjt/tinycdb.html


Nice and simple C API, but rather limited platform support.

> Unfortunately, with so many options available there is no single
> standard method. How does one make a choice?


Yes, there are pros and cons. If I have a tuning problem, I would
prototype multiple solutions, and select the one that gets the job done
with minimal drawbacks. In practice,

* robustness / maturity
* maintainability
* support / documentation
* portability
* fingerprint
* security
* error handling
* simplicity

etc.

might be important design parameters too. I rarely select the fastest
solution.

The simplest solution to compare against would perhaps be holding the
blacklist in memory and using qsort()/bsearch(). That prototype can be
made in no time.
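
Something like this, say (an untested sketch; it assumes the records
have been loaded into an array of strings at startup):

#include <stdlib.h>
#include <string.h>

static int cmpstr(const void *a, const void *b)
{
    /* qsort()/bsearch() pass pointers to the char * elements */
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* once, after loading all records */
void blacklist_sort(char **entries, size_t n)
{
    qsort(entries, n, sizeof *entries, cmpstr);
}

/* per lookup: 1 if key exists, 0 otherwise */
int blacklist_has(char **entries, size_t n, const char *key)
{
    return bsearch(&key, entries, n, sizeof *entries, cmpstr) != NULL;
}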

However, for expert advice, you should instead consult one of the many
database newsgroups.

--
Tor <torust [at] online [dot] no>
 
Ram Prasad
      05-30-2007
On May 28, 7:46 pm, Tor Rustad <torust_at_online.no> wrote:
> Ram Prasad wrote:
> <snip>
> > I am evaluating multiple options. I think tinycdb looks very
> > promising:
> > http://www.corpit.ru/mjt/tinycdb.html
>
> Nice and simple C API, but rather limited platform support.
>
> > Unfortunately, with so many options available there is no single
> > standard method. How does one make a choice?
>
> Yes, there are pros and cons. If I have a tuning problem, I would
> prototype multiple solutions, and select the one that gets the job done
> with minimal drawbacks. In practice,
>
> * robustness / maturity
> * maintainability
> * support / documentation
> * portability
> * fingerprint
> * security
> * error handling
> * simplicity
>
> etc.
>
> might be important design parameters too. I rarely select the fastest
> solution.
>
> The simplest solution to compare against would perhaps be holding the
> blacklist in memory and using qsort()/bsearch(). That prototype can be
> made in no time.


Can you please explain "fingerprint"?


Would you expect bsearch() on 5MB of data to be better than a hash-DB
lookup?
What are the implications of multiple instances running? Can they
share the same data area?


Thanks
Ram

PS:
Note to Spammers: Go ahead, send me spam
(E-Mail Removed)
http://ecm.netcore.co.in/spamtrap.html

 
Tor Rustad
      05-30-2007
Ram Prasad wrote:

> Can you please explain "fingerprint"?


That was a typo for "footprint", i.e. the memory usage.

> Would you expect bsearch() on 5MB of data to be better than a hash-DB
> lookup?


The relevant question for you: is a simple qsort/bsearch sufficient?


> What are the implications of multiple instances running? Can they
> share the same data area?


Those implications are system-specific; we don't discuss file locks,
mutexes, thread programming, etc. here.

--
Tor <torust [at] online [dot] no>
 