Match beginning of two strings

 
 
Jim Richardson
08-04-2003

On Sun, 3 Aug 2003 12:44:59 -0600,
Andrew Dalke <(E-Mail Removed)> wrote:
> Jim Richardson:
>> Why bother finding out which one is the shorter? If you try the compare,
>> and you run out of the other to compare to, then by default, it's not
>> the same.

>
> Because I wasn't sure if the strings had embedded NULs in them.
> Python strings allow those.
>


Ah, I hadn't considered that.
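For illustration, a made-up pair showing why a NUL-terminated C
comparison would mis-handle such strings:

a = "abc\0def"
b = "abc\0xyz"
# the common prefix "abc\0" runs right through the embedded NUL,
# so a C routine treating '\0' as a terminator would stop too early
print repr(a[:4]) # -> 'abc\x00'
print a[:4] == b[:4] # -> True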

> Otherwise something like this would work
>
> char *s1, *s2 = ... the strings from Python
> char *s = s1;
> while ( *s1 && (*s1++ == *s2++))
> ;
> return the string from s->s1, or just the size s1-s.
>
> Andrew
> (E-Mail Removed)
>
>




--
Jim Richardson http://www.eskimo.com/~warlock

Linux, because eventually, you grow up enough to be trusted with a fork()
 
Alex Martelli
08-04-2003
Jeff Epler wrote:
...
> .. unfortunately, it seems to be slower than the first method. On my
> machine (800MHz PIII):
> $ python timeit.py -s 'import ravi' \
> 'ravi.extract("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
> 10000 loops, best of 3: 32.7 usec per loop


Here's my proposal...:

import sys

def extract(a, b):
    m = min(len(a), len(b))
    for i in range(m):
        if a[i] != b[i]:
            return a[:i]
    return a[:m]

def extract2(a, b):
    for i, ai, bi in zip(xrange(sys.maxint), a, b):
        if ai != bi: return a[:i]
    # no mismatch: the common prefix is the whole shorter string
    return a[:min(len(a), len(b))]

def extract3(a, b):
    for i, ai in enumerate(a):
        if ai != b[i:i+1]:
            return a[:i]
    return a

[alex@lancelot python2.3]$ python -O timeit.py -s 'import exa'
'exa.extract("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
100000 loops, best of 3: 13.9 usec per loop
[alex@lancelot python2.3]$ python -O timeit.py -s 'import exa'
'exa.extract2("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
10000 loops, best of 3: 19.7 usec per loop
[alex@lancelot python2.3]$ python -O timeit.py -s 'import exa'
'exa.extract3("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
100000 loops, best of 3: 15.7 usec per loop

Now add after the "import sys" two lines:

import psyco
psyco.full()

and run again:

[alex@lancelot python2.3]$ python -O timeit.py -s 'import exa'
'exa.extract("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
1000000 loops, best of 3: 0.771 usec per loop

[alex@lancelot python2.3]$ python -O timeit.py -s 'import exa'
'exa.extract3("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
100000 loops, best of 3: 3.37 usec per loop

However, extract2 doesn't run correctly with psyco (gets a MemoryError).

Still, the 18-times acceleration that psyco is able to effect on
the naive extract IS typical of psyco's effect on functions coded
in simple, elementary terms. When you really need speed, assuming
that your processor is Intel-compatible, consider psyco (of course,
you'll generally use psyco.profile, or something more selective
still, rather than psyco.full) -- orders-of-magnitude improvements
on low-level bottleneck functions are anything but surprising...
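
As a minimal sketch of those more selective forms (using only psyco's
documented entry points; 'exa' is the module from the timings above):

import psyco
# psyco.full() # compile everything, as in the timings above
psyco.profile() # or: compile only what runtime profiling finds hot
# import exa # or, more selective still, bind just one
# psyco.bind(exa.extract) # known bottleneck function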


Alex

 
Alex Martelli
08-04-2003
Ravi wrote:

> Hi,
>
> I have about 200GB of data that I need to go through and extract the
> common first part of a line. Something like this.
>
> >>>a = "abcdefghijklmnopqrstuvwxyz"
> >>>b = "abcdefghijklmnopBHLHT"
> >>>c = extract(a,b)
> >>>print c

> "abcdefghijklmnop"
>
> Here I want to extract the common string "abcdefghijklmnop". Basically I
> need a fast way to do that for any two given strings. For my situation,
> the common string will always be at the beginning of both strings. I can


Here's my latest study on this:

*** pexa.py:

import sys

import psyco
psyco.full()

import cexa
import exa

def extract(a, b):
    m = min(len(a), len(b))
    for i in range(m):
        if a[i] != b[i]:
            return a[:i]
    return a[:m]

def extract2(a, b):
    for i, ai, bi in zip(xrange(len(a)), a, b):
        if ai != bi: return a[:i]
    # no mismatch: the common prefix is the whole shorter string
    return a[:min(len(a), len(b))]

def extract3(a, b):
    for i, ai in enumerate(a):
        if ai != b[i:i+1]:
            return a[:i]
    return a

extract_pyrex = exa.exa

extract_c = cexa.cexa

*** exa.pyx:

def exa(a, b):
    cdef int la
    cdef int lb
    la = len(a)
    lb = len(b)
    cdef int lmin
    lmin = min(la, lb)
    cdef int i
    i = 0
    while i < lmin:
        if a[i] != b[i]:
            return a[:i]
        i = i + 1
    if lmin == la:
        return a
    else:
        return b

*** cexa.c:

#include <Python.h>

static PyObject*
cexa(PyObject* self, PyObject* args)
{
    char *a, *b;
    int la, lb;
    int lmin, i;
    if(!PyArg_ParseTuple(args, "s#s#", &a, &la, &b, &lb))
        return 0;

    /* lmin = length of the shorter string */
    lmin = la;
    if(lb < lmin) lmin = lb;

    for(i=0; i<lmin; i++)
        if(a[i] != b[i])
            break;

    /* the common prefix is the first i bytes of either string */
    return Py_BuildValue("s#", a, i);
}

static PyMethodDef cexaMethods[] = {
    {"cexa", cexa, METH_VARARGS, "Extract common prefix"},
    {0}
};

void
initcexa(void)
{
    Py_InitModule("cexa", cexaMethods);
}

I've built the pyrex-coded extension with:

from distutils.core import setup
from distutils.extension import Extension
from Pyrex.Distutils import build_ext

setup(
    name = "exa",
    ext_modules=[
        Extension("exa", ["exa.pyx"])
    ],
    cmdclass = {'build_ext': build_ext}
)

and the C-coded one with:

from distutils.core import setup
from distutils.extension import Extension

setup(
    name = "cexa",
    ext_modules=[
        Extension("cexa", ["cexa.c"])
    ],
)

and my measurements give me:

[alex@lancelot exi]$ python -O timeit.py -s 'import pexa' \
> 'pexa.extract("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'

100000 loops, best of 3: 2.39 usec per loop
[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
'pexa.extract("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
100000 loops, best of 3: 2.14 usec per loop
[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
'pexa.extract2("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
10000 loops, best of 3: 30.2 usec per loop
[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
'pexa.extract3("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
100000 loops, best of 3: 9.59 usec per loop
[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
'pexa.extract_pyrex("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
10000 loops, best of 3: 21.8 usec per loop
[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
'pexa.extract_c("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
100000 loops, best of 3: 1.88 usec per loop
[alex@lancelot exi]$

So, it seems you can still get a tiny drop of extra speed
with a C-coded extension, though it's doubtful whether it's
worth the bother wrt the psyco-optimized simple Python code
in function 'extract'. I'm not sure where I went wrong in
the Pyrex coding (it doesn't seem to be performing anywhere
near as well as I thought it might) and I'll be happy for a real
Pyrex expert to show me the way.

Of course, as others have pointed out, it's unclear from
your problem description that doing such operations pairwise
on a lot of pairs of strings is actually what you need. It
IS quite possible that what you're doing could often be
better modeled, e.g., by repeated "prefix extractions"
between ONE fixed string and several other candidate strings;
or "prefix extraction" between a set of more than two strings.
In each case, it's likely that you can get much better
performance by more sophisticated algorithms. However,
which algorithms those might be is unclear unless you can
provide more details on what you're doing.
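
For the several-strings case, note also that the standard library's
os.path.commonprefix, despite its name, works character-wise on any
list of strings (it's pure Python, so the speed caveats above still
apply); a quick sketch:

import os.path
strings = ["abcdefghijklmnopqrstuvwxyz",
           "abcdefghijklmnopBHLHT",
           "abcdefgXYZ"]
print os.path.commonprefix(strings) # -> 'abcdefg'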


Alex

 
John Machin
08-04-2003
On Mon, 04 Aug 2003 11:56:04 GMT, Alex Martelli <(E-Mail Removed)>
wrote:

> I'm not sure where I went wrong in
>the Pyrex coding (it doesn't seem to be performing anywhere
>near as well as I thought it might) and I'll be happy for a real
>Pyrex expert to show me the way.


I don't call myself an expert, but here's my best shot:

If you look at the generated C code, you'll see lots of conversion
between C and Python types. The trick is to get your args into C, stay
in C as much as possible, and ship back a Python return value.

It's made harder with strings, as there is not (yet) any way of hinting
to Pyrex to use the "s#" gadget; you have to DIY, see below.

cdef extern from "Python.h":
    int PyString_Size(object s)

def exa2(arga, argb):
    cdef int la, lb, lmin, i
    cdef char *a, *b

    a = arga
    b = argb
    la = PyString_Size(arga)
    lb = PyString_Size(argb)
    # living dangerously, not testing for error;
    # easy to eyeball for correctness in this case,
    # but ...
    if la <= lb:
        lmin = la
    else:
        lmin = lb
    i = 0
    while i < lmin:
        if a[i] != b[i]:
            return arga[:i]
        i = i + 1
    if lmin == la:
        return arga
    else:
        return argb

 
Richie Hindle
08-04-2003

[Alex]
> I'm not sure where I went wrong in
> the Pyrex coding (it doesn't seem to be performing anywhere
> as well as I thought it might) and I'll be happy for real
> Pyrex expert to show me the way.


I'm not an expert, but I can see a few easily-fixed problems. The line:

if a[i] != b[i]:

is working with Python strings when it could be working with C
strings. Here's the original code and its output on my machine:

def exa(a, b):
    cdef int la
    cdef int lb
    la = len(a)
    lb = len(b)
    cdef int lmin
    lmin = min(la, lb)
    cdef int i
    i = 0
    while i < lmin:
        if a[i] != b[i]:
            return a[:i]
        i = i + 1
    if lmin == la:
        return a
    else:
        return b

100000 loops, best of 3: 9.11 usec per loop

Here's a modified version of the code comparing C strings:

def exa(a, b):
    cdef char* c_a # `a` as a C string
    cdef char* c_b # `b` as a C string
    cdef int la
    cdef int lb

    c_a = a
    c_b = b
    la = len(a)
    lb = len(b)
    cdef int lmin
    lmin = min(la, lb)
    cdef int i
    i = 0
    while i < lmin:
        if c_a[i] != c_b[i]:
            return a[:i]
        i = i + 1
    if lmin == la:
        return a
    else:
        return b

100000 loops, best of 3: 5.79 usec per loop

Almost twice as fast. Looking at the generated C is always
worthwhile when optimising Pyrex code - here's the code that does
the comparison against Python strings:

/* "D:\src\tests\pyrex\exa.pyx":11 */
__pyx_3 = PyInt_FromLong(__pyx_v_i); if (!__pyx_3) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
__pyx_1 = PyObject_GetItem(__pyx_v_a, __pyx_3); if (!__pyx_1) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
Py_DECREF(__pyx_3); __pyx_3 = 0;
__pyx_5 = PyInt_FromLong(__pyx_v_i); if (!__pyx_5) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
__pyx_2 = PyObject_GetItem(__pyx_v_b, __pyx_5); if (!__pyx_2) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
Py_DECREF(__pyx_5); __pyx_5 = 0;
if (PyObject_Cmp(__pyx_1, __pyx_2, &__pyx_4) < 0) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
__pyx_4 = __pyx_4 != 0;
Py_DECREF(__pyx_1); __pyx_1 = 0;
Py_DECREF(__pyx_2); __pyx_2 = 0;
if (__pyx_4) {

vs.

/* "D:\src\tests\pyrex\exa.pyx":16 */
__pyx_5 = ((__pyx_v_c_a[__pyx_v_i]) != (__pyx_v_c_b[__pyx_v_i]));
if (__pyx_5) {

for C strings. There's another similar optimisation that the C
output leads you to: you can use strlen rather than Python's len:

cdef extern from "string.h":
    int strlen(char*)

def exa(a, b):
    cdef char* c_a # `a` as a C string
    cdef char* c_b # `b` as a C string
    cdef int la
    cdef int lb

    c_a = a
    c_b = b
    la = strlen(c_a)
    lb = strlen(c_b)
    cdef int lmin
    lmin = min(la, lb)
    cdef int i
    i = 0
    while i < lmin:
        if c_a[i] != c_b[i]:
            return a[:i]
        i = i + 1
    if lmin == la:
        return a
    else:
        return b

100000 loops, best of 3: 3.58 usec per loop

That replaces:

/* "D:\src\tests\pyrex\exa.pyx":4 */
__pyx_1 = __Pyx_GetName(__pyx_b, "len"); if (!__pyx_1) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 4; goto __pyx_L1;}
__pyx_2 = PyTuple_New(1); if (!__pyx_2) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 4; goto __pyx_L1;}
Py_INCREF(__pyx_v_a);
PyTuple_SET_ITEM(__pyx_2, 0, __pyx_v_a);
__pyx_3 = PyObject_CallObject(__pyx_1, __pyx_2); if (!__pyx_3) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 4; goto __pyx_L1;}
Py_DECREF(__pyx_1); __pyx_1 = 0;
Py_DECREF(__pyx_2); __pyx_2 = 0;
__pyx_4 = PyInt_AsLong(__pyx_3); if (PyErr_Occurred()) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 4; goto __pyx_L1;}
Py_DECREF(__pyx_3); __pyx_3 = 0;
__pyx_v_la = __pyx_4;

/* "D:\src\tests\pyrex\exa.pyx":5 */
__pyx_1 = __Pyx_GetName(__pyx_b, "len"); if (!__pyx_1) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; goto __pyx_L1;}
__pyx_2 = PyTuple_New(1); if (!__pyx_2) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; goto __pyx_L1;}
Py_INCREF(__pyx_v_b);
PyTuple_SET_ITEM(__pyx_2, 0, __pyx_v_b);
__pyx_3 = PyObject_CallObject(__pyx_1, __pyx_2); if (!__pyx_3) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; goto __pyx_L1;}
Py_DECREF(__pyx_1); __pyx_1 = 0;
Py_DECREF(__pyx_2); __pyx_2 = 0;
__pyx_4 = PyInt_AsLong(__pyx_3); if (PyErr_Occurred()) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; goto __pyx_L1;}
Py_DECREF(__pyx_3); __pyx_3 = 0;
__pyx_v_lb = __pyx_4;

with:

/* "D:\src\tests\pyrex\exa.pyx":12 */
__pyx_v_la = strlen(__pyx_v_c_a);

/* "D:\src\tests\pyrex\exa.pyx":13 */
__pyx_v_lb = strlen(__pyx_v_c_b);

and leaves the call to 'min' as the only remaining huge block of C.
The final version looks like this, eliminating 'min' (Greg, can we have
the ternary operator in Pyrex please? <good_mood ? wink : frown>)

cdef extern from "string.h":
    int strlen(char*)

def exa(a, b):
    cdef char* c_a # `a` as a C string
    cdef char* c_b # `b` as a C string
    cdef int la
    cdef int lb

    c_a = a
    c_b = b
    la = strlen(c_a)
    lb = strlen(c_b)
    cdef int lmin
    if la < lb:
        lmin = la
    else:
        lmin = lb
    cdef int i
    i = 0
    while i < lmin:
        if c_a[i] != c_b[i]:
            return a[:i]
        i = i + 1
    if lmin == la:
        return a
    else:
        return b


1000000 loops, best of 3: 0.803 usec per loop

Over ten times quicker than the original, for the sake of a couple of
small tweaks driven by looking at the C output. Although the C still
looks very verbose at first glance, it's now substantially the same as
Alex's cexa.c.

Hope that helps,

--
Richie Hindle
(E-Mail Removed)


 
Bengt Richter
08-04-2003
On Mon, 04 Aug 2003 11:56:04 GMT, Alex Martelli <(E-Mail Removed)> wrote:

>Ravi wrote:
>[...]
>
>Here's my latest study on this:
>
>*** pexa.py:
>

[...]

JFTHOI, if you have the inclination, I'm curious how this slightly
different 2.3-dependent version would fare in your harness on your
system with the rest:

def commonprefix(s1, s2): # very little tested!
    try:
        for i, c in enumerate(s1):
            if c != s2[i]: return s1[:i]
    except IndexError:
        return s1[:i]
    return s1
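
(In the same harness as above, that would presumably be timed with
something along the lines of:

python -O timeit.py -s 'import pexa' \
'pexa.commonprefix("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'

assuming commonprefix is added to pexa.py.)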

[...]

>
>and my measurements give me:
>
>[alex@lancelot exi]$ python -O timeit.py -s 'import pexa' \
>> 'pexa.extract("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'

>100000 loops, best of 3: 2.39 usec per loop
>[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
>'pexa.extract("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
>100000 loops, best of 3: 2.14 usec per loop
>[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
>'pexa.extract2("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
>10000 loops, best of 3: 30.2 usec per loop
>[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
>'pexa.extract3("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
>100000 loops, best of 3: 9.59 usec per loop
>[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
>'pexa.extract_pyrex("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
>10000 loops, best of 3: 21.8 usec per loop
>[alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
>'pexa.extract_c("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
>100000 loops, best of 3: 1.88 usec per loop
>[alex@lancelot exi]$
>

Interesting, but I think I will have to write a filter so I can
see a little more easily what your timeit.py outputs say.

Regards,
Bengt Richter
 
Ravi
08-04-2003
Ravi wrote:
> Hi,
>
> I have about 200GB of data that I need to go through and extract the
> common first part of a line. Something like this.
>
> >>>a = "abcdefghijklmnopqrstuvwxyz"
> >>>b = "abcdefghijklmnopBHLHT"
> >>>c = extract(a,b)
> >>>print c

> "abcdefghijklmnop"
>
> Here I want to extract the common string "abcdefghijklmnop". Basically I
> need a fast way to do that for any two given strings. For my situation,
> the common string will always be at the beginning of both strings. I can
> use regular expressions to do this, but from what I understand there is
> a lot of overhead. New data is being generated at the rate of about 1GB
> per hour, so this needs to be reasonably fast while leaving CPU time for
> other processes.
>
> Thanks
> Ravi
>


I really appreciate all your help, Alex, Jim, Jeff, Andrew, John, Richie
and Bengt. However I have this problem taken care of now. Took around 6
hours to run on a P4 2.8GHz 1.0GB DDR (I suspect I/O limitations). As
for the data, if you want to know about it just for the sake of an
optimized algorithm, there are no Null (\0) characters in the strings
(actually they're Base64), and I've included a typical pair of strings.
The version I used was Andrew's.

Someone suggested that this would be better done in larger sets than
just pairs. That's not suitable because of the structure of the data:
two strings might be highly correlated, but are probably quite different
from another pair of strings. Perhaps more significantly, correlation in
sets of greater than two has no physical significance to the experiment.

I grabbed this from a typical data file. So I would want to be
extracting 'A832nv81a'
"
A832nv81a81nW103v9c24jgpy92T
A832nv81aTyqiep4v9c324jgpy92T
"

Thanks for your help everyone; coming from a Perl (it's a four-letter
word to me) world, I'm very impressed by how helpful all of you are.

Ravi

 
John Machin
08-04-2003
Richie Hindle <(E-Mail Removed)> wrote in message news:<(E-Mail Removed)>...
>
> for C strings. There's another similar optimisation that the C
> output leads you to: you can use strlen rather than Python's len:
>


You can, if you don't care about the possibility that the input may contain NULs.
 
Bengt Richter
08-07-2003
On Mon, 04 Aug 2003 18:53:27 -0400, Ravi <(E-Mail Removed)> wrote:

>[...]
>I grabbed this from a typical data file. So I would want to be
>extracting 'A832nv81a'
>"
>A832nv81a81nW103v9c24jgpy92T
>A832nv81aTyqiep4v9c324jgpy92T
>"

I still don't understand your use case:

1) Are you periodically batch processing to look for near-pairs in a 200GB
unsorted list of strings? Are they just in a huge unsorted ascii file with
line feeds delimiting?

or

1a) Does someone send you a single string in a message every now and then (how often?)
and ask you to find the closest match in the data base? I.e., 200GB at 30 bytes/string
would be 6.67 billion strings. Which you say you now crunch in ~6hrs? You don't do that
amount of crunching just for one match request, do you?

1b) If you get a lot of "match requests," wouldn't you gain a lot by at least partially
ordering the incoming streams into separate buckets? E.g., the simplest thing would be
to have 64 files corresponding to the first character, or even 64*64 files for the first two
(see the sketch after this list). Then have 64*64 file buffers (not open files) of say 1000
strings each, which would be 64*64*1000*30 or call it 32k/file buffer ->
64*64*32*1024/(1024*1024) -> 128 megabytes of buffers,
and on the average you'd expect 1GB/hr to fill a buffer of 32k about
(1e9/3600)/(32*1024) -> 8.477 times/second. It should be reasonable to open a file for append
and write 32k and close it that often, IWT. Whatever box writes the 200GB data now could either
do the partitioning itself or send it by gigabit ethernet to a dedicated box for that, IWT. Or
you could distribute the load by partitioning the ethernet destinations per the leading n bits
and let 2**n boxes maybe do even more ordering on the fly.

Even without distributing the load, this could give you 200GB/4096 (if evenly distributed!!)
or only '%e'%(200e9/4096) ->4.882813e+007 or less than 50MB to read off disk and search
to make a probe with a single string. If you sorted when you probed and wrote back the
sorted data with an indication that it was sorted, further probes to that same partition could
go a lot faster.

2) Why do the string lengths vary if they are samples from some process? Are they
effectively just right-stripped from some fixed length, which would be a guaranteed max length?
2a) what are the max and min lengths?

3) what is the distribution of leading characters as they come from the source?
3a) E.g., is the first character uniformly distributed within its possible Base64 codes?
3b) Are they uncorrelated timewise? Or e.g. do they arrive in clumps of <<64 possibilities for a time?
3c) Are the individual strings' characters randomly distributed and uncorrelated within the string?
3d) You say 9000GB compresses to 700GB. That suggests a lot of correlation and/or uneven distributions.
Is there a fixed dictionary you could use to repack the strings bit-wise with huffman and/or
rle compression?
3d1) What is the significance of the Base64 character boundaries? I.e., would a common prefix of
8-bit bytes representing sequentially compressed string be good enough, even if the first non-
matching byte contained some bits that did match and would have been included in base64?

3d2) Note that compressing during partitioned 4096-bucket storage could save a lot of space
as well as speed up matching.

4) 1GB/hr translates to about 10k strings like your example per second.
4a) Are they just logged sequentially to a 200GB data store? (cf. 1b)
4b==2a) Is there a max and/or min length to these strings? (the above are 29 & 30 chars).
4c) You say they are base 64. That means binary would make (6/8)*200GB = 150GB, and the
strings (6/8)*30 ~ 22.5 or say 24 for a nice multiple of 8, even without compression.
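
A rough Python sketch of the two-character bucketing idea in 1b (the
file layout and names are made up for illustration, and the in-memory
buffering is omitted):

import os

def bucket_path(line, root='buckets'):
    # one append-only file per two-character prefix; note that real
    # Base64 includes '/', which would need escaping in a file name
    return os.path.join(root, line[:2])

def store(line, root='buckets'):
    f = open(bucket_path(line, root), 'a')
    f.write(line + '\n')
    f.close()

A probe then only needs to read the single bucket file that shares the
probe string's first two characters.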

Just some thoughts. Might want to check my arithmetic.

>
>Thanks for your help everyone; coming from a Perl (it's a four-letter
>word to me) world, I'm very impressed by how helpful all of you are.
>

Some newsgroups are like that. In the past I spent a fair amount of time on the Borland
Delphi n.g., and that was mostly very friendly and helpful too. I think that's just the
way people are unless some stupidities get in the way. Renaissance yes! Armageddon no!

Regards,
Bengt Richter
 
Mañungo
08-22-2003
"Andrew Dalke" <(E-Mail Removed)> wrote in message news:<bgi9g8$dc0$(E-Mail Removed)>...
> Ravi:
> > Read in both strings.
> > Check to see if the first character matches.
> > If yes:
> > Check halfway through the string and see if that character matches
> > Repeatedly check halfway until the difference point is found.
> > Go back through from the difference point backwards and make sure
> > the characters match from the start to the difference point.
> >
> > I timed it, and it seems to be doing about 3.5usec per loop.

>
> There's a lot of overhead for doing that. Have you tried the simple
>
> char *s1 = ... the first string ..
> char *s2 = ... the second string ..
> n = ... the shorter of the two ..
> for(i=0; i<n; i++) {
>     if (*s1++ != *s2++) {
>         break;
>     }
> }
> return ... the string s1[:i] (or even just the int i)
>
> Easy to understand, and the CPU is spending almost its whole
> time doing character tests.
>
> Andrew
> (E-Mail Removed)


First, I'm not a python programmer... but I think it is better to compare
whole machine words (ints) than single chars. Something like this:

int *a = (int *) ...first string...;
int *b = (int *) ...second string...;

for( i = 0; i < n; i += 4 ) /* assumes the length n is a multiple of 4 */
    if( *a++ != *b++ )
        break;
/* back up to the word that differed; the loop has already stepped past it */
char *aa = (char *) (a - 1);
char *bb = (char *) (b - 1);
if( *aa++ == *bb++ ) i++;
if( *aa++ == *bb++ ) i++;
if( *aa == *bb ) i++;

and return i.
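
Going back to the halving strategy Ravi described in the quote above,
a rough Python realization (my own sketch, not Ravi's actual code)
might be:

def extract_bisect(a, b):
    m = min(len(a), len(b))
    if a[:m] == b[:m]:
        return a[:m]
    # invariant: a[:lo] == b[:lo], while a[:hi] != b[:hi]
    lo, hi = 0, m
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if a[:mid] == b[:mid]:
            lo = mid
        else:
            hi = mid
    return a[:lo]

Note each probe slices and compares up to m characters, so this does
more total work than the simple left-to-right scan; it is shown only
to make the quoted description concrete.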
 