Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > huge dictionary

Reply
Thread Tools

huge dictionary

 
 
barbaros
Guest
Posts: n/a
 
      10-10-2005
Some say I don't provide enough detail.
Let me see what can I add ...
I need to build a large dictionary, in the sense
of the "map" class in C++ STL. I need to build it
quickly (if not it may take years to get something
meaningfull), so I need relatively quick write access.
Because it's so large I need to save (parts of) it on
the hard disk, and in the end to save it all on disk.
As about budget restrictions, I am commited to use
only free software (open source, to be more precise).

mmap seems to be the soltution, since it seems to be
really fast (thank you, moi and bane). I confess I am
not familiar with mmap or address space. I understand
on a 64 bit machine there would be no problem.
How do I know whether linux on a recent intel box is
32 or 64 bits ?

Actually, now I can imagine a way to compress a little
bit my dictionary. Since the keys are all strings,
instead of keeping all strings in memory I can have
a map with keys 'a','b','c',...,'z'.
Then the values of this map would be other maps,
again having letters as keys, and so on. That is,
I could work with a map os maps of maps ...
I guess this is more or less what moi suggested.
(By the way, moi, I see no syntax error in my message.)

Thank you everybody for your answers.
Cristian Barbarosie http://cmaf.fc.ul.pt/~barbaros

 
Reply With Quote
 
 
 
 
Maxim Yegorushkin
Guest
Posts: n/a
 
      10-10-2005

barbaros wrote:
> Some say I don't provide enough detail.
> Let me see what can I add ...
> I need to build a large dictionary, in the sense
> of the "map" class in C++ STL. I need to build it
> quickly (if not it may take years to get something
> meaningfull), so I need relatively quick write access.
> Because it's so large I need to save (parts of) it on
> the hard disk, and in the end to save it all on disk.
> As about budget restrictions, I am commited to use
> only free software (open source, to be more precise).


Use boost::multi_index for that. hashed indexes might be what you need.

> mmap seems to be the soltution, since it seems to be
> really fast (thank you, moi and bane). I confess I am
> not familiar with mmap or address space. I understand
> on a 64 bit machine there would be no problem.
> How do I know whether linux on a recent intel box is
> 32 or 64 bits ?


Check uname command output.
In C/C++ code you can get sizeof(void*).


> Actually, now I can imagine a way to compress a little
> bit my dictionary. Since the keys are all strings,
> instead of keeping all strings in memory I can have
> a map with keys 'a','b','c',...,'z'.
> Then the values of this map would be other maps,
> again having letters as keys, and so on. That is,
> I could work with a map os maps of maps ...
> I guess this is more or less what moi suggested.
> (By the way, moi, I see no syntax error in my message.)


This is called digital search tree (DST). It consumes a lot of memory
for nodes. A ternary tree has similar performance charasteristics to
that of DST but consumes less memory.

 
Reply With Quote
 
 
 
 
barbaros
Guest
Posts: n/a
 
      10-10-2005
moi wrote:
> BTW for a similar problem:
> http://www-db.stanford.edu/~backrub/google.html
> They manage to keep the 'dictionary' in core.
> (but they probably can afford to forget/ignore some 'strings' )


Yes, you are perfectly right to point out this similar problem.
Thank you for the link.

 
Reply With Quote
 
Branimir Maksimovic
Guest
Posts: n/a
 
      10-10-2005

"barbaros" <> wrote in message
news: ups.com...
> Some say I don't provide enough detail.
> Let me see what can I add ...
>
> mmap seems to be the soltution, since it seems to be
> really fast (thank you, moi and bane). I confess I am
> not familiar with mmap or address space. I understand
> on a 64 bit machine there would be no problem.
> How do I know whether linux on a recent intel box is
> 32 or 64 bits ?


Well you don't need too. Just organize pages.
Let's say you have page table in memory with pointer
to mmaped block , and flag for in memory
/out of memory status. Index into table represents
page number. Then you only need to store page number
and offset into page in say two integers, which will represent
pointer in you map object. Let's say page = 0x1 offset = 0x4000
then table[1].mmaped_pointer + 0x4000 gives real address.
You have to check if table[1].in_memory = true if not
mmap that page and update
table[page_num].mmaped_pointer = mmap(...,page_num*page_size);
then table[1].in_memory = true. You can put also access counter
table[1].acc_count and unmap pages that are not used so you
keep some maximum number of in memory pages.
This is normaly done by os but if you don't have enough address
space you have to do that too
When retreiving data you just have to know where is first node of
your data structure. Let's say it's on page 0 offset 0x0.

>
> Actually, now I can imagine a way to compress a little
> bit my dictionary. Since the keys are all strings,
> instead of keeping all strings in memory I can have
> a map with keys 'a','b','c',...,'z'.


You can use some hashing algorithm, and just store
hash values instead of strings.

Greetings, Bane.




 
Reply With Quote
 
moi
Guest
Posts: n/a
 
      10-10-2005
barbaros wrote:
> Some say I don't provide enough detail.
> Let me see what can I add ...
> I need to build a large dictionary, in the sense
> of the "map" class in C++ STL. I need to build it
> quickly (if not it may take years to get something


>
> mmap seems to be the soltution, since it seems to be
> really fast (thank you, moi and bane). I confess I am
> not familiar with mmap or address space. I understand
> on a 64 bit machine there would be no problem.


Mmap() *can* be elegant, but gets ugly in the
presence of writes/appends to the same file (as in your
case).
Also, there is *no performance benefit* over write().
Mmap just maps a piece of your diskfile into memory,
but underneath just as much disk I/O is needed.
In the ultimate case (unclustered read access) you end up with one read
per record-fetch.
Writing/appending is always faster (since multiple records can fit into
one one disk block)
Using a DB library does not change this, the library still has to
do the reading/writing on your behalf, and you en up with 10ms readtime
per block. It *can* save you some development effort.


> How do I know whether linux on a recent intel box is
> 32 or 64 bits ?


What another replyer suggested :
printf("%u", (unsigned) sizeof (void*));

If the machine is 64 bits you can install/compile a 64 bits
linux-distribution onto it.
Linux has 64 bit support since about 1994 (alpha/PPC)


> Actually, now I can imagine a way to compress a little
> bit my dictionary. Since the keys are all strings,
> instead of keeping all strings in memory I can have
> a map with keys 'a','b','c',...,'z'.
> Then the values of this map would be other maps,
> again having letters as keys, and so on. That is,
> I could work with a map os maps of maps ...


This is basically a tree-like structure.
Google for tree, trie, patricia or skiplist for more.
As another replyer suggested: tree-like structures can take
a lot of (memory) space, when badly designed/applied (eg when most of
the pointers are NULL)

> I guess this is more or less what moi suggested.
> (By the way, moi, I see no syntax error in my message.)


It was more about semantics. The paragraph made no sense to me.

> Thank you everybody for your answers.
> Cristian Barbarosie http://cmaf.fc.ul.pt/~barbaros
>



Good luck,
AvK
 
Reply With Quote
 
Branimir Maksimovic
Guest
Posts: n/a
 
      10-10-2005

"moi" <avk@localhost> wrote in message
news:...
> barbaros wrote:
>> Some say I don't provide enough detail.
>> Let me see what can I add ...
>> I need to build a large dictionary, in the sense
>> of the "map" class in C++ STL. I need to build it
>> quickly (if not it may take years to get something

>
>>
>> mmap seems to be the soltution, since it seems to be
>> really fast (thank you, moi and bane). I confess I am
>> not familiar with mmap or address space. I understand
>> on a 64 bit machine there would be no problem.

>
> Mmap() *can* be elegant, but gets ugly in the
> presence of writes/appends to the same file (as in your
> case).


Not neccessarily: ftruncate then mmap additional page(s)
(or lseek then write zeros to avoid fragmentation of file).
But, better is to just preallocate large enough file,
(write zero bytes, not ftruncate because of fragmentation)
if not enough then expand.

> Also, there is *no performance benefit* over write().


Disk write is disk write, but with mmap you don't
have to manually read/write from file to app memory.

> Mmap just maps a piece of your diskfile into memory,
> but underneath just as much disk I/O is needed.


See the difference. No need for memory to file buffering
and data conversion in application code, therefore mmap
is most natural way to implement persistent data structure.

> In the ultimate case (unclustered read access) you end up with one read
> per record-fetch.


Same as with read. Of course caching strategies can be different ,
resulting that either mmap or read can be faster depending on situation.
In my experince mmap is better when dealing with large random access
files (large swap file).
For pure sequential reading, read() should be better.

> Writing/appending is always faster (since multiple records can fit into
> one one disk block)
> Using a DB library does not change this, the library still has to
> do the reading/writing on your behalf, and you en up with 10ms readtime
> per block. It *can* save you some development effort.


Of course, except that everything goes through db interface and application
itself is limited by it. In this case this is not problem since db
eliminates
the need for hash table implementation, one just uses db interface
instead.

Greetings, Bane.


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
[jython] Problem with an huge dictionary keobox Python 3 05-22-2010 06:15 AM
Memory error due to the huge/huge input file size tejsupra@gmail.com Python 3 11-20-2008 07:21 PM
huge dictionary -> bsddb/pickle question lazy Python 2 06-15-2007 02:40 PM
[DICTIONARY] - Copy dictionary entries to attributes Ilias Lazaridis Python 6 02-21-2006 11:27 AM
huge dictionary barbaros C++ 15 10-10-2005 07:34 PM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57