Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Writing huge Sets() to disk

Writing huge Sets() to disk

 
 
Martin MOKREJŠ
 
      01-10-2005
Hi,
 I have sets.Set() objects having up to 20E20 items,
each composed of up to 20 characters. Keeping
them in memory on a 1GB machine puts me quickly into swap.
I don't want to use the dictionary approach, as I see no sense
in storing None as a value. The items in a set are unique.

How can I write them efficiently to disk? To be more exact,
I have 20 sets. _set1 has 1E20 keys of size 1 character.

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q', 'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
for aa1 in alphabet:
    # l = [aa1]
    # _set1.add(aa1)
    for aa2 in alphabet:
        # l.append(aa2)
        # _set2.add(''.join(l))
[cut]

The reason I went for sets instead of lists is speed and the
availability of unique, common and other methods.
What would you propose as an elegant solution?
Actually, even those nested for loops take ages.
M.
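[Editorial sketch, not part of the original post: since the full sets cannot fit in memory, one option along the lines the replies below suggest is to stream each string straight to a file as it is generated, so memory use stays constant. The function and file names here are invented for illustration, and the duplicate 'A' in the alphabet is replaced with 'Z'.]

```python
import itertools

alphabet = 'GAVLIPSTCMZQFYWKRHDE'  # 20 distinct letters ('Z' replacing the duplicate 'A')

def write_combos(path, length):
    """Stream every length-`length` string over the alphabet to `path`,
    one per line, without ever holding them all in memory."""
    count = 0
    with open(path, 'w') as f:
        for combo in itertools.product(alphabet, repeat=length):
            f.write(''.join(combo) + '\n')
            count += 1
    return count

# write_combos('set2.txt', 2) writes 20**2 == 400 lines
```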
 
 
 
 
 
Paul McGuire
 
      01-10-2005
"Martin MOKREJŠ" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> Hi,
>  I have sets.Set() objects having up to 20E20 items,
> each composed of up to 20 characters. Keeping
> them in memory on a 1GB machine puts me quickly into swap.
> I don't want to use the dictionary approach, as I see no sense
> in storing None as a value. The items in a set are unique.
>
>  How can I write them efficiently to disk? To be more exact,
> I have 20 sets. _set1 has 1E20 keys of size 1 character.
>
> alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q', 'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
> for aa1 in alphabet:
>     # l = [aa1]
>     # _set1.add(aa1)
>     for aa2 in alphabet:
>         # l.append(aa2)
>         # _set2.add(''.join(l))
> [cut]
>
> The reason I went for sets instead of lists is speed and the
> availability of unique, common and other methods.
> What would you propose as an elegant solution?
> Actually, even those nested for loops take ages.
> M.


_set1 only has 19 keys of size 1 character - 'A' is duplicated.

Assuming you replace 'A' with another character (say 'Z'), then here is what
you get:
_set1 - 20 elements (20**1)
_set2 - 400 elements (20**2)
_set3 - 8000 elements (20**3)
...
_set20 - 20**20 elements, i.e. about 10**(1.301*20) or 1E26

If you have no duplicates in your alphabet, then you won't need to use sets;
every combination will be unique. In this case, just use a generator.

Here's a recursive generator approach that may save you a bunch of nested
editing (set maxDepth as high as you dare):

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q', 'F',
'Y', 'W', 'K', 'R', 'H', 'D', 'E')

maxDepth = 3

def genNextCombo(root=list(), depth=1):
    for a in alphabet:
        thisRoot = root + [a]
        yield "".join(thisRoot)
        if depth < maxDepth:
            for c in genNextCombo(thisRoot, depth+1):
                yield c

for c in genNextCombo():
    print c  # or write to file, or whatever



-- Paul
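[Editorial aside, not from the original reply: in later Python versions (2.6 and up, after this thread), itertools.product expresses the same enumeration without recursion. For maxDepth = 3 it yields the same 20 + 400 + 8000 strings, though length by length rather than in the depth-first order of the recursive version.]

```python
from itertools import product

alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M',
            'Z', 'Q', 'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
max_depth = 3

def gen_combos():
    # Enumerate all strings of length 1..max_depth over the alphabet.
    for depth in range(1, max_depth + 1):
        for combo in product(alphabet, repeat=depth):
            yield ''.join(combo)
```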


 
 
 
 
 
Martin MOKREJŠ
 
      01-10-2005
Paul McGuire wrote:
> "Martin MOKREJŠ" <(E-Mail Removed)> wrote in message
> news:(E-Mail Removed)...
>
>> Hi,
>>  I have sets.Set() objects having up to 20E20 items,
>> each composed of up to 20 characters. Keeping
>> them in memory on a 1GB machine puts me quickly into swap.
>> I don't want to use the dictionary approach, as I see no sense
>> in storing None as a value. The items in a set are unique.
>>
>>  How can I write them efficiently to disk? To be more exact,
>> I have 20 sets. _set1 has 1E20 keys of size 1 character.
>>
>> alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q', 'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
>> for aa1 in alphabet:
>>     # l = [aa1]
>>     # _set1.add(aa1)
>>     for aa2 in alphabet:
>>         # l.append(aa2)
>>         # _set2.add(''.join(l))
>> [cut]
>>
>> The reason I went for sets instead of lists is speed and the
>> availability of unique, common and other methods.
>> What would you propose as an elegant solution?
>> Actually, even those nested for loops take ages.
>> M.

>
>
> _set1 only has 19 keys of size 1 character - 'A' is duplicated.


Ah, you are right. That's a typo; I still have to figure out which character
to use, but 'Z' will do for the tests anyway.

>
> Assuming you replace 'A' with another character (say 'Z'), then here is what
> you get:
> _set1 - 20 elements (20**1)
> _set2 - 400 elements (20**2)
> _set3 - 8000 elements (20**3)
> ...
> _set20 - 20**20 elements, i.e. about 10**(1.301*20) or 1E26
>
> If you have no duplicates in your alphabet, then you won't need to use sets;
> every combination will be unique. In this case, just use a generator.


As I noted in a later response, I will actually compare these ideal sets to
some real-world examples. I have no idea how many entries will be common and
how many unique. The only thing I know is that even in the real world I might
be in size ranges of 2E6 or perhaps 2E8.

> Here's a recursive generator approach that may save you a bunch of nested
> editing (set maxDepth as high as you dare):
>
> alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'Z', 'Q', 'F',
> 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
>
> maxDepth = 3
>
> def genNextCombo(root=list(), depth=1):
>     for a in alphabet:
>         thisRoot = root + [a]
>         yield "".join(thisRoot)
>         if depth < maxDepth:
>             for c in genNextCombo(thisRoot, depth+1):
>                 yield c
>
> for c in genNextCombo():
>     print c  # or write to file, or whatever


Works nicely, though http://www.python.org/doc/2.3.4/ref/yield.html doesn't
really explain what happens here. It is quite fast, and it definitely saves
writing those nested for loops by hand.
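[Editorial miniature of what `yield` does, for readers of the archive: calling a generator function builds an iterator; the body runs lazily and pauses at each yield. The function name here is invented.]

```python
def count_up(n):
    # A generator function: calling it runs none of the body yet.
    i = 1
    while i <= n:
        yield i   # pause here and hand `i` to the caller
        i += 1

g = count_up(3)
first = next(g)   # 1 -- the body ran just far enough to hit the first yield
rest = list(g)    # [2, 3] -- resuming runs it to completion
```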

Thanks Paul!
M.
 
 
John Lenton
 
      01-10-2005
You probably want to look into building set-like objects on top of tries,
given the homogeneity of your language. You should see improvements in both
size and speed.
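[Editorial sketch of the "set-like objects on top of tries" idea; the class and method names are invented, not a library API. With a 20-letter alphabet, shared prefixes are stored only once, which can cut memory compared to storing every string whole.]

```python
class TrieSet:
    """A minimal set of strings backed by a dict-of-dicts trie."""

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True  # end-of-word marker

    def __contains__(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return '$' in node
```

Note that membership distinguishes stored words from mere prefixes: after adding 'GAV', the string 'GA' is not a member until it is added itself.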

 
 
Simo Melenius
 
      01-10-2005
"John Lenton" <(E-Mail Removed)> writes:

> You probably want to look into building set-like objects on top of tries,
> given the homogeneity of your language. You should see improvements in both
> size and speed.


Ternary search trees give _much_ better space-efficiency compared to
tries, at the expense of only slightly slower build and search time.
This is especially relevant, as the OP mentioned he could have huge
sets of data.
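[Editorial sketch of a ternary search tree, for comparison with the trie above; all names are invented. Each node holds a single character plus three links, which is typically leaner than a full dict per trie node.]

```python
class TSTNode:
    __slots__ = ('ch', 'lo', 'eq', 'hi', 'end')

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.end = False  # True if a stored word ends at this node

class TST:
    """A minimal set of strings backed by a ternary search tree."""

    def __init__(self):
        self.root = None

    def add(self, word):
        self.root = self._add(self.root, word, 0)

    def _add(self, node, word, i):
        ch = word[i]
        if node is None:
            node = TSTNode(ch)
        if ch < node.ch:
            node.lo = self._add(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._add(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._add(node.eq, word, i + 1)
        else:
            node.end = True
        return node

    def __contains__(self, word):
        node, i = self.root, 0
        while node is not None:
            ch = word[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            elif i + 1 < len(word):
                node, i = node.eq, i + 1
            else:
                return node.end
        return False
```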


br,
S
 
 
Martin MOKREJŠ
 
      01-10-2005
Simo Melenius wrote:
> "John Lenton" <(E-Mail Removed)> writes:
>
>
>> You probably want to look into building set-like objects on top of tries,
>> given the homogeneity of your language. You should see improvements in both
>> size and speed.

>
>
> Ternary search trees give _much_ better space-efficiency compared to
> tries, at the expense of only slightly slower build and search time.
> This is especially essential as the OP mentioned he could have huge
> sets of data.


Hi Simo and John,
would you please point me to some docs so I can learn what you are talking about?
Many thanks!
Martin
 
 
John Lenton
 
      01-11-2005
On Tue, Jan 11, 2005 at 12:33:42AM +0200, Simo Melenius wrote:
> "John Lenton" <(E-Mail Removed)> writes:
>
> > You probably want to look into building set-like objects on top of tries,
> > given the homogeneity of your language. You should see improvements in both
> > size and speed.

>
> Ternary search trees give _much_ better space-efficiency compared to
> tries, at the expense of only slightly slower build and search time.
> This is especially essential as the OP mentioned he could have huge
> sets of data.


hmm! sounds like *some*body needs to go read up on ternary trees,
then.

Ok, ok, I'm going...

--
John Lenton

 
 
Bengt Richter
 
      01-11-2005
On Mon, 10 Jan 2005 17:11:09 +0100, Martin MOKREJŠ <(E-Mail Removed)> wrote:

>Hi,
> I have sets.Set() objects having up to 20E20 items,

What notation are you using when you write 20E20? IOW, ISTM 1E9 is a
billion, so 20E20 would be 2000 billion billion. Please clarify.

>each composed of up to 20 characters. Keeping
>them in memory on a 1GB machine puts me quickly into swap.
>I don't want to use the dictionary approach, as I see no sense
>in storing None as a value. The items in a set are unique.
>
> How can I write them efficiently to disk? To be more exact,
>I have 20 sets. _set1 has 1E20 keys of size 1 character.
>
>alphabet = ('G', 'A', 'V', 'L', 'I', 'P', 'S', 'T', 'C', 'M', 'A', 'Q', 'F', 'Y', 'W', 'K', 'R', 'H', 'D', 'E')
>for aa1 in alphabet:
>    # l = [aa1]
>    # _set1.add(aa1)
>    for aa2 in alphabet:
>        # l.append(aa2)
>        # _set2.add(''.join(l))
>[cut]
>
> The reason I went for sets instead of lists is the speed,
>availability of unique, common and other methods.
>What would you propose as an elegant solution?
>Actually, even those nested for loops take ages.


If you will explain a little what you are doing with these set "items"
perhaps someone will think of another way to represent and use your data.

Regards,
Bengt Richter
 