Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Re: Writing huge Sets() to disk

Reply
Thread Tools

Re: Writing huge Sets() to disk

 
 
=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=
Guest
Posts: n/a
 
      01-10-2005
Tim Peters wrote:
> [Martin MOKREJÅ*]
>
>> just imagine, you want to compare how many words are in English, German,
>>Czech, Polish disctionary. You collect words from every language and record
>>them in dict or Set, as you wish.

>
>
> Call the set of all English words E; G, C, and P similarly.
>
>
>> Once you have those Set's or dict's for those 4 languages, you ask
>>for common words

>
>
> This Python expression then gives the set of words common to all 4:
>
> E & G & C & P
>
>
>>and for those unique to Polish.

>
>
> P - E - G - C
>
> is a reasonably efficient way to compute that.


Nice, is it equivalent to common / unique methods of Sets?

>
>
>>I have no estimates
>>of real-world numbers, but we might be in range of 1E6 or 1E8?
>>I believe in any case, huge.

>
>
> No matter how large, it's utterly tiny compared to the number of
> character strings that *aren't* words in any of these languages.
> English has a lot of words, but nobody estimates it at over 2 million
> (including scientific jargon, like names for chemical compounds):
>
> http://www.worldwidewords.org/articles/howmany.htm


As I've said, I analyze in real something else then languages.
However, it can be described with the example of words in different languages.

But nevertheless, imagine 1E6 words of size 15. That's maybe 1.5GB of raw
data. Will sets be appropriate you think?

>>My concern is actually purely scientific, not really related to analysis
>>of these 4 languages, but I believe it describes my intent quite well.
>>
>> I wanted to be able to get a list of words NOT found in say Polish,
>>and therefore wanted to have a list of all, theoretically existing words.
>>In principle, I can drop this idea of having ideal, theoretical lexicon.
>>But have to store those real-world dictionaries anyway to hard drive.

>
>
> Real-word dictionaries shouldn't be a problem. I recommend you store
> each as a plain text file, one word per line. Then, e.g., to convert
> that into a set of words, do
>
> f = open('EnglishWords.txt')
> set_of_English_words = set(f)


I'm aware I can't keep set_of_English_words in memory.

> f.close()


M.
 
Reply With Quote
 
 
 
 
Istvan Albert
Guest
Posts: n/a
 
      01-10-2005
Martin MOKREJÅ* wrote:


> But nevertheless, imagine 1E6 words of size 15. That's maybe 1.5GB of raw
> data. Will sets be appropriate you think?


You started out with 20E20 then cut back to 1E15 keys
now it is down to one million but you claim that these
will take 1.5 GB.

On my system storing 1 million words of length 15
as keys of a python dictionary is around 75MB.

I.
 
Reply With Quote
 
 
 
 
=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=
Guest
Posts: n/a
 
      01-10-2005
Istvan Albert wrote:
> Martin MOKREJÅ* wrote:
>
>
>> But nevertheless, imagine 1E6 words of size 15. That's maybe 1.5GB of raw
>> data. Will sets be appropriate you think?

>
>
> You started out with 20E20 then cut back to 1E15 keys
> now it is down to one million but you claim that these
> will take 1.5 GB.


I gave up the theoretical approach. Practically, I might need up
to store maybe those 1E15 keys.

So you say 1 million words is better to store in dictionary than
in a set and use your own function to get out those unique or common
words?

>
> On my system storing 1 million words of length 15
> as keys of a python dictionary is around 75MB.


Fine, that's what I wanted to hear. How do you improve the algorithm?
Do you delay indexing to the very latest moment or do you let your
computer index 999 999 times just for fun?

>
> I.


 
Reply With Quote
 
Istvan Albert
Guest
Posts: n/a
 
      01-10-2005
Martin MOKREJÅ* wrote:

> Istvan Albert wrote:


> So you say 1 million words is better to store in dictionary than
> in a set and use your own function to get out those unique or common
> words?


I have said nothing even remotely like that.

> Fine, that's what I wanted to hear. How do you improve the algorithm?
> Do you delay indexing to the very latest moment or do you let your
> computer index 999 999 times just for fun?


I think that you need to first understand how dictionaries work.
The time needed to insert a key is independent of
the number of values in the dictionary.

Istvan.
 
Reply With Quote
 
Tim Peters
Guest
Posts: n/a
 
      01-10-2005
[Martin MOKREJÅ*]
> ...
>
> I gave up the theoretical approach. Practically, I might need up
> to store maybe those 1E15 keys.


We should work on our multiplication skills here <wink>. You don't
have enough disk space to store 1E15 keys. If your keys were just one
byte each, you would need to have 4 thousand disks of 250GB each to
store 1E15 keys. How much disk space do you actually have? I'm
betting you have no more than one 250GB disk.

....

[Istvan Albert]
>> On my system storing 1 million words of length 15
>> as keys of a python dictionary is around 75MB.


> Fine, that's what I wanted to hear. How do you improve the algorithm?
> Do you delay indexing to the very latest moment or do you let your
> computer index 999 999 times just for fun?


It remains wholly unclear to me what "the algorithm" you want might
be. As I mentioned before, if you store keys in sorted text files,
you can do intersection and difference very efficiently just by using
the Unix `comm` utiltity.
 
Reply With Quote
 
=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=
Guest
Posts: n/a
 
      01-10-2005
Tim Peters wrote:
> [Martin MOKREJÅ*]
>
>>...
>>
>>I gave up the theoretical approach. Practically, I might need up
>>to store maybe those 1E15 keys.

>
>
> We should work on our multiplication skills here <wink>. You don't
> have enough disk space to store 1E15 keys. If your keys were just one
> byte each, you would need to have 4 thousand disks of 250GB each to
> store 1E15 keys. How much disk space do you actually have? I'm
> betting you have no more than one 250GB disk.


Can get about 700GB on raid5, but this doesn't help here of course.
I definitely appreciate your comments, I somehow forgot to make sure
I can store it. I was mainly distracted by the fact I might anyway
hit almost the same size, if there's too few words used in reality.
Therefore when looking for those words not found in such a dictionary,
I'd be close to the teoretical, maximal size of say in order of E15
or E14.


> ...
>
> [Istvan Albert]
>
>>>On my system storing 1 million words of length 15
>>>as keys of a python dictionary is around 75MB.

>
>
>>Fine, that's what I wanted to hear. How do you improve the algorithm?
>>Do you delay indexing to the very latest moment or do you let your
>>computer index 999 999 times just for fun?

>
>
> It remains wholly unclear to me what "the algorithm" you want might


My intent is do some analysis in protein science. But it can be really
thought of as analysing those 4 different languages.

> be. As I mentioned before, if you store keys in sorted text files,
> you can do intersection and difference very efficiently just by using
> the Unix `comm` utiltity.


Now I got your point. I understand the comm(1) is written in C, but it still
has to scan file1 once and file2 n-times, where n is a number of lines
in file1, right? Without any index ... I'll consider it, actually will test,
thanks!


I was really hoping I'll get an answer how to alter the indexes for dictionaries
in python.

You convinced me not to even try to construct to theoretical dictionary,
as it will take ages just to create. Even if I'd manage, I couldn't
save it (the theoretical and possibly not even the dict(theoret) - dict(real)
result).

Still, before I give the whole project, once more:

I'll parse some text files, isolates separate words and add them to
either Set(), list, dict, flatfile line by line.

Depending on the above, I'll compare them and look for those unique
to some "language". I need to keep track of frequencies used
in every such language, so the dict approach would be the best.
The number stored as a value vould be a float ^H^H^H^H^H^H Decimal()
type - very small number.

Once more, I expect to have between E4 or E5 to E8??? words
stored in 20 dictionaries (remember words of sizes in range 1-20?
Every of those 20 dictionaries should be, I believe, indexed just once.
The indexing method should know all entries in a given file are of same
size, i.e. 5 chars, 15 chars, 20 chars etc.

I already did implement the logic to walk through those 20 different
dictionaries from language a and from language b and find uout those
unique to a or common to both of them. Why I went to ask on this list
was to make sure I took right approach. Sets seemed to be better solution
for the simple comparison (common, unique). To keep track of those
very small frequencies, I anyway have to have dictionaries. I say
that again, how can I improve speed of dictionary indexing?
It doesn't matter here if I have 10E4 keys in dictionary or
10E8 keys in a dictionary.

Or I wanted to hear: go for Sets(), having set with 1E8 keys
might take 1/10 of the size you need for a dictionary ... but
you cannot dump them efficiently on a disk. Using shelve will
cost you maybe 2x the size of dictionary approach and will
be also slover that writing dictionary directly.

Something along these words. I really appreciate your input!
Martin




 
Reply With Quote
 
Tim Peters
Guest
Posts: n/a
 
      01-10-2005
[Tim Peters]
>> As I mentioned before, if you store keys in sorted text files,
>> you can do intersection and difference very efficiently just by using
>> the Unix `comm` utiltity.


[Martin MOKREJÅ*]
> Now I got your point. I understand the comm(1) is written in C, but it still
> has to scan file1 once and file2 n-times, where n is a number of lines
> in file1, right? Without any index ... I'll consider it, actually will test,
> thanks!


`comm` is much more efficient than that. Note that the input files
have to be sorted. Then, in a single pass over both files (not 2
passes, not 3, not n, just 1), it can compute any subset of these
three (do `man comm`):

1. The words common to both files.
2. The words unique to "the first" file.
3. The words unique to "the second" file.

It's essentially just doing a merge on two sorted lists, and how it
works should be obvious if you give it some thought. It takes time
proportional to the sum of the lengths of the input files, and nothing
*can* be faster than that.

> I was really hoping I'll get an answer how to alter the indexes for dictionaries
> in python.


Sorry, I don't have a guess for what that might mean.

> You convinced me not to even try to construct to theoretical dictionary,
> as it will take ages just to create. Even if I'd manage, I couldn't
> save it (the theoretical and possibly not even the dict(theoret) - dict(real)
> result).


Worse, it didn't sound like a good approach even if you could save it.

> Still, before I give the whole project, once more:
>
> I'll parse some text files, isolates separate words and add them to
> either Set(), list, dict, flatfile line by line.
>
> Depending on the above, I'll compare them and look for those unique
> to some "language". I need to keep track of frequencies used
> in every such language,


Frequencies of what? "A language" here can contain some words
multiple times each?

> so the dict approach would be the best. The number stored as a value vould
> be a float ^H^H^H^H^H^H Decimal() type - very small number.


Integers are the natural type for counting, and require less memory
than floats or decimals.

> Once more, I expect to have between E4 or E5 to E8??? words
> stored in 20 dictionaries (remember words of sizes in range 1-20?
> Every of those 20 dictionaries should be, I believe, indexed just once.
> The indexing method should know all entries in a given file are of same
> size, i.e. 5 chars, 15 chars, 20 chars etc.


I think you're making this more complicated than it needs to be.

> I already did implement the logic to walk through those 20 different
> dictionaries from language a and from language b and find uout those
> unique to a or common to both of them. Why I went to ask on this list
> was to make sure I took right approach. Sets seemed to be better solution
> for the simple comparison (common, unique). To keep track of those
> very small frequencies, I anyway have to have dictionaries. I say
> that again, how can I improve speed of dictionary indexing?
> It doesn't matter here if I have 10E4 keys in dictionary or
> 10E8 keys in a dictionary.


What reason do you have for believing that the speed of dictionary
indexing is worth any bother at all to speed up? Dict lookup is
generally considered to be extremely fast already. If you're just
*guessing* that indexing is the bottleneck, you're probably guessing
wrong -- run a profiler to find out where the real bottleneck is.
 
Reply With Quote
 
=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=
Guest
Posts: n/a
 
      01-11-2005
Tim Peters wrote:
> [Tim Peters]
>
>>>As I mentioned before, if you store keys in sorted text files,
>>>you can do intersection and difference very efficiently just by using
>>>the Unix `comm` utiltity.

>
>
> [Martin MOKREJÅ*]
>
>>Now I got your point. I understand the comm(1) is written in C, but it still
>>has to scan file1 once and file2 n-times, where n is a number of lines
>>in file1, right? Without any index ... I'll consider it, actually will test,
>>thanks!

>
>
> `comm` is much more efficient than that. Note that the input files
> have to be sorted. Then, in a single pass over both files (not 2
> passes, not 3, not n, just 1), it can compute any subset of these
> three (do `man comm`):
>
> 1. The words common to both files.
> 2. The words unique to "the first" file.
> 3. The words unique to "the second" file.


I read the manpage actually before answering to the list.

> It's essentially just doing a merge on two sorted lists, and how it
> works should be obvious if you give it some thought. It takes time
> proportional to the sum of the lengths of the input files, and nothing
> *can* be faster than that.


Well, I think I understand now because "Scott David Daniels" wrote
probably the very same logic in python code in this thread.

>>I was really hoping I'll get an answer how to alter the indexes for dictionaries
>>in python.

>
>
> Sorry, I don't have a guess for what that might mean.


I'm not an expert, mysql for example givs you ability to index say
first 10 characters of a text column, which are typically varchar.
Just for curiosity I'd like to know how do it in python.

When importing data from a flatfile into mysql table, there's an
option to delay indexing to the very last moment, when all keys are
loaded (it doesn't make sense to re-create index after each new
row into table is added). I believe it's exactly same waste of cpu/io
in this case - when I create a dictionary and fill it with data,
I want to create index afterward, not after every key/value pair
is recorded.

>>You convinced me not to even try to construct to theoretical dictionary,
>>as it will take ages just to create. Even if I'd manage, I couldn't
>>save it (the theoretical and possibly not even the dict(theoret) - dict(real)
>>result).

>
>
> Worse, it didn't sound like a good approach even if you could save it.
>
>
>>Still, before I give the whole project, once more:
>>
>>I'll parse some text files, isolates separate words and add them to
>>either Set(), list, dict, flatfile line by line.
>>
>>Depending on the above, I'll compare them and look for those unique
>>to some "language". I need to keep track of frequencies used
>>in every such language,

>
>
> Frequencies of what? "A language" here can contain some words
> multiple times each?


Yes, to compute the frequency of each word used in say "language A",
I'll count number of occurencies and them compute frequency of it.
Frequency number should be recorded as a value in the dictionary,
where the keys are unique and represent the word itself (or it's hash
as recommended already).

Once more, the dictionary will contain every word only once, it will be really
a unique key.

>>so the dict approach would be the best. The number stored as a value vould
>>be a float ^H^H^H^H^H^H Decimal() type - very small number.

>
>
> Integers are the natural type for counting, and require less memory
> than floats or decimals.


I hoped I was clear ... when I said I'll compute frequency of those words.
The algorithm to compute it will be subject to change during developemnt.


>>Once more, I expect to have between E4 or E5 to E8??? words
>>stored in 20 dictionaries (remember words of sizes in range 1-20?
>>Every of those 20 dictionaries should be, I believe, indexed just once.
>>The indexing method should know all entries in a given file are of same
>>size, i.e. 5 chars, 15 chars, 20 chars etc.

>
>
> I think you're making this more complicated than it needs to be.


I hope the algoritm can save some logic. For example, can turn off locking support,
index only part of the key etc.

>
>
>>I already did implement the logic to walk through those 20 different
>>dictionaries from language a and from language b and find uout those
>>unique to a or common to both of them. Why I went to ask on this list
>>was to make sure I took right approach. Sets seemed to be better solution
>>for the simple comparison (common, unique). To keep track of those
>>very small frequencies, I anyway have to have dictionaries. I say
>>that again, how can I improve speed of dictionary indexing?
>>It doesn't matter here if I have 10E4 keys in dictionary or
>>10E8 keys in a dictionary.

>
>
> What reason do you have for believing that the speed of dictionary
> indexing is worth any bother at all to speed up? Dict lookup is
> generally considered to be extremely fast already. If you're just
> *guessing* that indexing is the bottleneck, you're probably guessing
> wrong -- run a profiler to find out where the real bottleneck is.


For example, looking up a key with it's value only once in the whole for loop tells me
I don't need an index. Yes, I'll do this 4 times for those 4 languages, but
still I think it's faster to live without an index, when I can sort
records. I think I like the sorted text file approach, and the dictionary
approach without an index would be almost the same, especially if I manage
to tell the db layout not to move the cursor randomly but just to walk down the
pre-sorted data.

As an extra, I could record those frequences within a dictionary.
Actually, I need them, that's why I do all this work.

The idea of using Sets() I had just to find more quickly unique/common
words because of the possible speed gain. But I don't see a way to solve
this without dictionaries.



 
Reply With Quote
 
Mike C. Fletcher
Guest
Posts: n/a
 
      01-11-2005
Martin MOKREJÅ* wrote:

> Tim Peters wrote:
>

....

>>> I was really hoping I'll get an answer how to alter the indexes for
>>> dictionaries
>>> in python.

>>
>>
>>
>> Sorry, I don't have a guess for what that might mean.

>
>
> I'm not an expert, mysql for example givs you ability to index say
> first 10 characters of a text column, which are typically varchar.
> Just for curiosity I'd like to know how do it in python.
>
> When importing data from a flatfile into mysql table, there's an
> option to delay indexing to the very last moment, when all keys are
> loaded (it doesn't make sense to re-create index after each new
> row into table is added). I believe it's exactly same waste of cpu/io
> in this case - when I create a dictionary and fill it with data,
> I want to create index afterward, not after every key/value pair
> is recorded.


Okay, you seem to be missing this key idea:

A hash-table (dict) is approximately the same level of abstraction
as a btree index.

Your MySQL "index" is likely implemented as a btree. A hash-table could
just as readily be used to implement the index. When you insert into
either of these structures (btree or hash), you are not creating an
"index" separate from the structure, the structure *is* an "index" of
the type you are thinking about. These structures are both efficient
representations that map from a data-value to some other data-value.
Hashes with a good hash function (such as Python's dicts) are basically
O(1) (worst case O(n), as Tim notes), while Btrees (such as common
database indices) are O(log(n)) (or something of that type, basically it
grows much more slowly than n).

>>> Once more, I expect to have between E4 or E5 to E8??? words
>>> stored in 20 dictionaries (remember words of sizes in range 1-20?
>>> Every of those 20 dictionaries should be, I believe, indexed just once.
>>> The indexing method should know all entries in a given file are of same
>>> size, i.e. 5 chars, 15 chars, 20 chars etc.

>>
>>
>>
>> I think you're making this more complicated than it needs to be.

>
>
> I hope the algoritm can save some logic. For example, can turn off
> locking support,
> index only part of the key etc.


I'd tend to agree with Tim. You're making this all far too complex in
what appears to be the absence of any real need. There's a maxim in
computer programming that you avoid, wherever possible, what is called
"premature optimisation". You are here trying to optimise away a
bottleneck that doesn't exist (indexing overhead, and locking support
are basically nil for a dictionary). It is almost a certainty that
these are not *real* bottlenecks in your code (what with not existing),
so your time would be better spent writing a "dumb" working version of
the code and profiling it to see where the *real* bottlenecks are.

> For example, looking up a key with it's value only once in the whole
> for loop tells me
> I don't need an index. Yes, I'll do this 4 times for those 4
> languages, but
> still I think it's faster to live without an index, when I can sort
> records. I think I like the sorted text file approach, and the dictionary
> approach without an index would be almost the same, especially if I
> manage
> to tell the db layout not to move the cursor randomly but just to walk
> down the
> pre-sorted data.


Again, you don't really have a cursor with a dictionary (and since it's
randomly ordered, a cursor wouldn't mean anything). A *btree* has an
order, but not a dictionary. You could construct a sorted list in
memory that would approximate what I *think* you're thinking of as a
dictionary-without-an-index.

Good luck,
Mike

________________________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://www.vrplumber.com
http://blog.vrplumber.com

 
Reply With Quote
 
=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=
Guest
Posts: n/a
 
      01-14-2005
Tim Peters wrote:
> [Martin MOKREJÅ*]
>
>>...
>>
>>I gave up the theoretical approach. Practically, I might need up
>>to store maybe those 1E15 keys.

>
>
> We should work on our multiplication skills here <wink>. You don't
> have enough disk space to store 1E15 keys. If your keys were just one
> byte each, you would need to have 4 thousand disks of 250GB each to
> store 1E15 keys. How much disk space do you actually have? I'm
> betting you have no more than one 250GB disk.
>
> ...
>
> [Istvan Albert]
>
>>>On my system storing 1 million words of length 15
>>>as keys of a python dictionary is around 75MB.

>
>
>>Fine, that's what I wanted to hear. How do you improve the algorithm?
>>Do you delay indexing to the very latest moment or do you let your
>>computer index 999 999 times just for fun?

>
>
> It remains wholly unclear to me what "the algorithm" you want might
> be. As I mentioned before, if you store keys in sorted text files,
> you can do intersection and difference very efficiently just by using
> the Unix `comm` utiltity.


This comm(1) approach doesn't work for me. It somehow fails to detect
common entries when the offset is too big.

file 1:

A
F
G
I
K
M
N
R
V
AA
AI
FG
FR
GF
GI
GR
IG
IK
IN
IV
KI
MA
NG
RA
RI
VF
AIK
FGR
FRA
GFG
GIN
GRI
IGI
IGR
IKI
ING
IVF
KIG
MAI
NGF
RAA
RIG


file 2:

W
W
W
W
W
W
W
W
W
W
AA
AI
FG
FR
GF
GI
GR
IG
IK
IN
IV
KI
MA
NG
RA
RI
VF
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Memory error due to the huge/huge input file size tejsupra@gmail.com Python 3 11-20-2008 07:21 PM
Re: Writing huge Sets() to disk =?ISO-8859-2?Q?Martin_MOKREJ=A9?= Python 5 01-17-2005 01:25 PM
Writing huge Sets() to disk =?ISO-8859-2?Q?Martin_MOKREJ=A9?= Python 7 01-11-2005 08:42 AM
Re: Writing huge Sets() to disk =?ISO-8859-15?Q?Martin_MOKREJ=A6?= Python 4 01-11-2005 01:13 AM
Re: Writing huge Sets() to disk Tim Peters Python 3 01-11-2005 12:11 AM



Advertisments