
matching strings in a large set of strings

 
 
Karin Lagesen
04-29-2010
Hello.

I have approx 83 million strings, all 14 characters long. I need to be
able to take another string and find out whether this one is present
within the 83 million strings.

Now, I have tried storing these strings as a list, a set and a dictionary.
I know that finding things in a set and a dictionary is very much faster
than working with a list, so I tried those first. However, I run out of
memory building both the set and the dictionary, so what I seem to be left
with is the list, and using the in method.

I imagine that there should be a faster/better way than this?

TIA,

Karin

 
Peter Otten
04-29-2010
Karin Lagesen wrote:

> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?


Do you need all matches or do you just have to know whether there are any?
Can the search string be shorter than 14 characters?

One simple improvement over the list may be to use one big string instead
of the 83 million short ones and then search it using string methods.

Peter
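For illustration, a minimal sketch of that one-big-string idea (the names
are made up, not code from the thread; it assumes every record is exactly
14 characters):

RECORD_LEN = 14  # assumed fixed record length

def build_blob(records):
    # Concatenate all 14-character records into one big str.
    return "".join(records)

def blob_contains(blob, needle):
    # A genuine hit must start at a multiple of 14; anything else
    # straddles two adjacent records and is a false positive.
    if len(needle) != RECORD_LEN:
        return False
    pos = blob.find(needle)
    while pos != -1:
        if pos % RECORD_LEN == 0:
            return True
        pos = blob.find(needle, pos + 1)
    return False

At 83 million records the blob holds roughly 83e6 * 14 bytes, about
1.2 GB of character data, which is far less overhead than 83 million
separate string objects.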
 
Miki
04-29-2010

> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.

Have a look at the shelve module.

If you want to write the algorithm yourself, I suggest
http://en.wikipedia.org/wiki/Trie

HTH,
--
Miki
http://pythonwise.blogspot.com
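For illustration, a rough sketch of the shelve idea (the file name and
helper names below are made up; shelve keeps its keys on disk, so the 83
million strings never have to sit in RAM at once):

import shelve

def build_index(strings, path="strings.shelf"):
    # Build once: each string becomes a key in the on-disk shelf.
    db = shelve.open(path)
    try:
        for s in strings:
            db[s] = True  # the value is irrelevant, only key lookup matters
    finally:
        db.close()

def lookup(needle, path="strings.shelf"):
    db = shelve.open(path, flag="r")
    try:
        return needle in db
    finally:
        db.close()

Building the shelf for 83 million keys will take a while, but each lookup
afterwards touches only a few disk pages instead of loading everything.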
 
Paul Rudin
04-30-2010
"Karin Lagesen" <(E-Mail Removed)> writes:

> Hello.
>
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?
>


Shouldn't a set with 83 million 14 character strings be fine in memory
on a stock PC these days? I suppose if it's low on ram you might start
swapping which will kill performance. Perhaps the method you're using to
build the data structures creates lots of garbage? How much ram do you
have and how much memory does the python process use as it builds your
data structures?

A set should give good performance if the target string is also 14
characters.

If you do end up with the list then try using bisect
<http://docs.python.org/library/bisect.html> which should be quicker
than just using "in" (which I think just scans linearly through the list
testing for equality).
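A minimal sketch of that sorted-list lookup (illustrative names; it
assumes the list has already been sorted once up front):

import bisect

def contains_sorted(sorted_strings, needle):
    # bisect_left does O(log n) comparisons instead of a linear scan.
    i = bisect.bisect_left(sorted_strings, needle)
    return i < len(sorted_strings) and sorted_strings[i] == needle

# Usage (hypothetical data):
# strings.sort()
# contains_sorted(strings, "ABCDEFGHIJKLMN")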

There are other algorithms with better theoretical performance than
bisect for this particular problem, but you get into trade-offs between
executing things in Python as opposed to C once you start hand-coding
things in Python.
 
Steven D'Aprano
04-30-2010
On Fri, 30 Apr 2010 08:23:39 +0100, Paul Rudin wrote:

> "Karin Lagesen" <(E-Mail Removed)> writes:
>
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>>
>> Now, I have tried storing these strings as a list, a set and a
>> dictionary. I know that finding things in a set and a dictionary is
>> very much faster than working with a list, so I tried those first.
>> However, I run out of memory building both the set and the dictionary,
>> so what I seem to be left with is the list, and using the in method.
>>
>> I imagine that there should be a faster/better way than this?
>>
>>

> Shouldn't a set with 83 million 14 character strings be fine in memory
> on a stock PC these days?


Not even close. Using Python 2.6:

>>> s = "12345678901234"
>>> assert len(s) == 14
>>> import sys
>>> sys.getsizeof(s)

38

So a single 14 char string takes 38 bytes.


>>> import random, string
>>> chars = list(string.letters + string.digits)*4
>>> def rnd_str():
...     random.shuffle(chars)
...     return ''.join(chars[:14])
...
>>> s = set()
>>> while len(s) < 83000:
...     s.add(rnd_str())
...
>>> sys.getsizeof(s)
1048688


So a set with 83,000 such strings takes approximately 1 MB. So far fairly
trivial. But that's just the memory used by the container (the set), not
the contents. 38 bytes * 83,000 strings = another 3 MB, which of course
is trivial for a modern PC. But the OP is using 83 million such strings,
not 83 thousand, which gives us a grand total of at least 3 gigabytes. An
entry-level desktop PC these days generally has 2 GB, and entry-level
notebooks might have half a gig.

If the OP is on a 64 bit system, every pointer will be twice the size,
leading to even larger memory requirements. And if the strings are
unicode, then potentially they could be anything up to four times larger
each. Worst case, the OP could need something of the order of 24 GB to
store the strings all in memory.


--
Steven
 
Paul Rudin
04-30-2010
Duncan Booth <(E-Mail Removed)> writes:

> Paul Rudin <(E-Mail Removed)> wrote:
>
>> Shouldn't a set with 83 million 14 character strings be fine in memory
>> on a stock PC these days? I suppose if it's low on ram you might start
>> swapping which will kill performance. Perhaps the method you're using
>> to build the data structures creates lots of garbage? How much ram do
>> you have and how much memory does the python process use as it builds
>> your data structures?

>
> Some simple experiments should show you that a stock PC running a 32 bit
> Python will struggle:
>
> >>> s = "12345678901234"
> >>> sys.getsizeof(s)
> 38
> >>> 83*38
> 3154
>
> So more than 3GB just for the strings (and that's for Python 2.x; on
> Python 3.x you'll need nearly 5GB).
>
> Running on a 64 bit version of Python should be fine, but for a 32 bit
> system a naive approach just isn't going to work.


It depends - a few gigs of RAM can be cheap compared with programmer
time. If you know you can solve a problem by spending a few euros on
some extra RAM, it can be a good solution! It depends, of course, on
where the code is being deployed - if it's something that's intended to
be deployed widely then you can't expect thousands of people to go out
and buy more RAM - but if it's a one-off deployment for a particular
environment then it can be the best way to go.

 
News123
05-01-2010
Dennis Lee Bieber wrote:
> On Thu, 29 Apr 2010 11:38:28 +0200, "Karin Lagesen"
> <(E-Mail Removed)> declaimed the following in comp.lang.python:
>
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>>


>>

> So don't load them into memory... First use a file-based (not memory
> limited) sort to put the strings into a sorted file; with fixed-length
> records you can then seek directly to any record.
>
> That lets you do a binary search on the file. Much faster than a
> linear search (linear search will average out to 41.5M read operations;
> binary should be around 10000 reads)


Don't you mean 27 reads instead of 41.5M reads?

>>> from math import log
>>> log(83e6)/log(2)
26.306608000671101



N
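For illustration, a sketch of the kind of sorted-file binary search being
discussed; the record layout (14 ASCII characters plus a newline per
line, file sorted) and the function name are assumptions, not something
from the original post:

import os

RECORD_LEN = 15  # 14 characters plus a newline (assumed layout)

def file_contains(path, needle):
    # Binary search over a sorted file of fixed-length records.
    target = needle.encode("ascii")
    n_records = os.path.getsize(path) // RECORD_LEN
    lo, hi = 0, n_records
    with open(path, "rb") as f:
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_LEN)
            record = f.read(RECORD_LEN - 1)  # the 14 data bytes, not the newline
            if record == target:
                return True
            if record < target:
                lo = mid + 1
            else:
                hi = mid
    return False

With 83 million records this does on the order of log2(83e6), i.e.
roughly 27, seeks per lookup, which matches the figure above.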
 
Stefan Behnel
05-01-2010
Duncan Booth, 30.04.2010 10:20:
> So more than 3GB just for the strings (and that's for Python 2.x; on
> Python 3.x you'll need nearly 5GB).
>
> Running on a 64 bit version of Python should be fine, but for a 32 bit
> system a naive approach just isn't going to work.
>
> Option 1: use a trie. That should reduce storage, maybe it will reduce
> it enough, maybe not. It depends on the data.


Depending on the implementation and the data, a trie can also use a lot
/more/ space than the set of strings that it contains. The 83 million 14
character strings can well include a relatively large subset of the
possible permutations (imagine purely numeric, hex-numeric or lower-case
alpha-numeric strings, for example), so the trie will likely need to branch
very often with very low intermediate run length. If you use pointers for
trie branches, that's at least 8 bytes per branch on a 64bit system, versus
1 byte per character in a byte string list. Depending on the ratio of
branches to characters, one or the other may win. So a "naive approach"
likely won't work for tries either.

Stefan
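To make the branch cost concrete, here is a minimal nested-dict trie
sketch (illustrative only, not from the thread); every branch node below
is a dict, which costs far more than one byte per character in a packed
string:

def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})  # each branch is another dict
    node[None] = True  # end-of-word marker

def trie_contains(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return None in node

# Usage:
# root = {}
# trie_insert(root, "ABCDEFGHIJKLMN")
# trie_contains(root, "ABCDEFGHIJKLMN")  # True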

 
Dennis Lee Bieber
05-01-2010
On Sat, 01 May 2010 13:48:02 +0200, News123 <(E-Mail Removed)> declaimed
the following in gmane.comp.python.general:

> Dennis Lee Bieber wrote:
> > That lets you do a binary search on the file. Much faster than a
> > linear search (linear search will average out to 41.5M read operations;
> > binary should be around 10000 reads)

>
> Don't you mean 27 reads instead of 41.5M reads?
>

Don't you mean my 10000 reads? The 41.5M is the average for the
linear search.

> >>> from math import log
> >>> log(83e6)/log(2)
> 26.306608000671101

Probably -- it was late at night and I was working things out in my
head...
--
Wulfraed Dennis Lee Bieber AF6VN
HTTP://wlfraed.home.netcom.com/

 
News123
05-01-2010
Dennis Lee Bieber wrote:
> On Sat, 01 May 2010 13:48:02 +0200, News123 <(E-Mail Removed)> declaimed
> the following in gmane.comp.python.general:
>
>> Dennis Lee Bieber wrote:
>>> That lets you do a binary search on the file. Much faster than a
>>> linear search (linear search will average out to 41.5M read operations;
>>> binary should be around 10000 reads)

>> Don't you mean 27 reads instead of 41.5M reads?
>>

> Don't you mean my 10000 reads? The 41.5M is the average for the
> linear search.
>

Indeed, this is what I meant. Or, phrased differently:
"about 27 reads with a binary search instead of an average of 41.5M reads
with a linear search."




>> >>> from math import log
>> >>> log(83e6)/log(2)
>> 26.306608000671101

> Probably -- it was late at night and I was working things out in my
> head...


I know about late nights. I just wondered whether I overlooked something.
 