Efficient String Lookup?

 
 
Chris S.
 
      10-16-2004
I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
is anything), which I want to match with a test string (e.g. 'abcdef').
What would be the best way for me to store my strings so lookup is as
fast as possible?
 
Michael Hoffman
 
      10-16-2004
Chris S. wrote:
> I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
> is anything), which I want to match with a test string (e.g. 'abcdef').
> What would be the best way for me to store my strings so lookup is as
> fast as possible?


If it were me, I would store them as compiled regular expressions.

See the re module documentation and use re.compile().
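
For example, a minimal sketch (assuming '#' stands for exactly one
character and the match has to cover the whole test string):

import re

# 'abc#e#' with each '#' translated to '.', anchored with '$' so the
# pattern has to span the whole string
pattern = re.compile('abc.e.$')

print pattern.match('abcdef') is not None   # True
print pattern.match('abcdefg') is not None  # False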

If you want a better solution, it might help if you supply a little more
information about your problem and why this solution is unsuitable
(maybe it is :]).
--
Michael Hoffman
 
Diez B. Roggisch
 
      10-16-2004
Chris S. wrote:

> I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
> is anything), which I want to match with a test string (e.g. 'abcdef').
> What would be the best way for me to store my strings so lookup is as
> fast as possible?


As a compiled regular expression, I guess - you don't give much info here,
so maybe there is a better way. But to me it looks like a classic regexp
thing. Maybe, if your wildcards are equivalent to .*, successive
string searches like this will help you:

pattern = 'abc#e#'.split('#')   # -> ['abc', 'e', '']
s = 'abcdef'
found = True
pos = 0
for p in pattern:
    h = s.find(p, pos)   # continue searching after the previous hit
    if h != -1:
        pos = h + len(p)
    else:
        found = False
        break


That might be faster, if the string.find operation uses something other than
simple brute-force linear searching - but I don't know enough about the
internals of Python's string implementation to give a definite answer
here.

But to be honest: I don't think regexps are easy to beat, unless your
use case is modeled in a way that lends itself to other approaches.

--
Regards,

Diez B. Roggisch
 
Josiah Carlson
 
      10-16-2004

> I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
> is anything), which I want to match with a test string (e.g. 'abcdef').
> What would be the best way for me to store my strings so lookup is as
> fast as possible?


Start with a Trie, and virtually merge branches as necessary.

- Josiah
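
A minimal sketch of the idea (illustrative code with made-up helper
names; it assumes '#' never appears in the test strings themselves):
store the patterns in a dict-of-dicts trie, and at each character of the
test string follow both the literal branch and the '#' branch - that is
the "virtual merge" of branches.

def trie_insert(trie, pattern, value):
    # walk/extend the dict-of-dicts trie; the None key marks
    # "a pattern ends here" and stores its associated values
    node = trie
    for ch in pattern:
        node = node.setdefault(ch, {})
    node.setdefault(None, []).append(value)

def trie_match(trie, s, i=0):
    # return the values of every stored pattern that matches s exactly
    if i == len(s):
        return trie.get(None, [])
    hits = []
    for edge in (s[i], '#'):    # literal branch, then wildcard branch
        child = trie.get(edge)
        if child is not None:
            hits.extend(trie_match(child, s, i + 1))
    return hits

trie = {}
trie_insert(trie, 'abc#e#', 'pattern 1')
trie_insert(trie, 'ab####', 'pattern 2')
print trie_match(trie, 'abcdef')   # ['pattern 1', 'pattern 2']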

 
Nicolas Lehuen
 
      10-16-2004
Josiah Carlson wrote:
>>I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
>>is anything), which I want to match with a test string (e.g. 'abcdef').
>>What would be the best way for me to store my strings so lookup is as
>>fast as possible?

>
>
> Start with a Trie, and virtually merge branches as necessary.
>
> - Josiah
>


Yup, you might also have a look at the Aho-Corasick algorithm, which can
match a test string against a big number of strings quite efficiently:

http://www-sr.informatik.uni-tuebing...ler/AC/AC.html

You'll have to adapt the algorithm so that it can handle your wildcard,
though. I found it easy to implement the '.' (one-character wildcard),
but the '.*' (zero-or-more-characters wildcard) forces you to
backtrack.
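
For reference, here is a compact sketch of the basic Aho-Corasick
automaton for plain, wildcard-free patterns (illustrative code, not the
implementation behind the link above):

class AhoCorasick:
    '''Finds every occurrence of every pattern in a single pass.'''

    def __init__(self, patterns):
        self.goto = [{}]    # one transition dict per state
        self.fail = [0]     # failure links
        self.out = [[]]     # patterns recognized on entering a state
        for pat in patterns:
            state = 0
            for ch in pat:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].append(pat)
        # breadth-first construction of the failure links
        queue = self.goto[0].values()
        while queue:
            state = queue.pop(0)
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def search(self, text):
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pat in self.out[state]:
                hits.append((i - len(pat) + 1, pat))
        return hits

ac = AhoCorasick(['he', 'she', 'hers'])
print ac.search('ushers')   # [(1, 'she'), (2, 'he'), (2, 'hers')]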

But if you can use the regexp engine, do not hesitate; it will save you
a lot of headaches. Unless of course you're a student and your teacher
asked you this question.

Regards,

Nicolas
 
Chris S.
 
      10-16-2004
Michael Hoffman wrote:
> Chris S. wrote:
>
>> I have a number of strings, containing wildcards (e.g. 'abc#e#' where
>> # is anything), which I want to match with a test string (e.g.
>> 'abcdef'). What would be the best way for me to store my strings so
>> lookup is as fast as possible?

>
>
> If it were me, I would store them as compiled regular expressions.
>
> See the re module documentation and use re.compile().
>
> If you want a better solution, it might help if you supply a little more
> information about your problem and why this solution is unsuitable
> (maybe it is :]).


The problem is I want to associate some data with my pattern, as in a
dictionary. Basically, my application consists of a number of
conditions, represented as strings with wildcards. Associated with each
condition is arbitrary data explaining "what I must do". My task is to
find this data by matching a state string against these condition
strings. Of course, the brute-force approach is to just add each pattern
to a dictionary and linearly search every key for a match. To improve on
this, I considered a trie, implemented as a special dictionary:

class Trie(dict):
    '''Implements a traditional Patricia-style trie.
    Keys must be sequence types. None key represents a value.'''

    def __init__(self):
        dict.__init__(self)

    def __setitem__(self, key, value):
        assert key, 'invalid key '+str(key)
        d = self
        last = None
        for n in key:
            if n not in d:
                dict.__setitem__(d, n, {})
            last = (d, n)
            d = dict.__getitem__(d, n)
        (d, n) = last
        dict.__getitem__(d, n)[None] = value

    def __getitem__(self, key):
        d = self
        for n in key:
            assert n in d, 'invalid key '+str(key)
            d = dict.__getitem__(d, n)
        assert None in d, 'missing value for key '+str(key)
        return dict.__getitem__(d, None)

    def __delitem__(self, key):
        previous = []
        d = self
        for n in key:
            assert n in d, 'invalid key '+str(key)
            previous.append((d, n))
            d = dict.__getitem__(d, n)
        assert None in d, 'missing value for key '+str(key)
        # remove value
        dict.__delitem__(d, None)
        # find and remove empty keys
        while len(previous):
            (d, n) = previous.pop()
            if not len(dict.__getitem__(d, n)):
                dict.__delitem__(d, n)
            else:
                break
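
For example, usage looks like this:

t = Trie()
t['abc'] = 'some data'
print t['abc']    # -> 'some data'
del t['abc']
print len(t)      # -> 0; empty branches get pruned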

However, I'm uncertain about the efficiency of this approach. I'd like
to use regexps, but how would I associate data with each pattern?
 
Chris S.
 
      10-16-2004
Diez B. Roggisch wrote:

> That might be faster, if the string.find operation uses something other than
> simple brute-force linear searching - but I don't know enough about the
> internals of Python's string implementation to give a definite answer
> here.
>
> But to be honest: I don't think regexps are easy to beat, unless your
> use case is modeled in a way that lends itself to other approaches.


The problem is, I want to find all patterns that match my test string,
so even with re I'd still have to search through every pattern, which is
what I'm trying to avoid. Something like a trie might be better, but
tries don't seem very efficient when implemented in Python.
 
Andrew Dalke
 
      10-17-2004
Chris S. wrote:
> The problem is I want to associate some data with my pattern, as in a
> dictionary. Basically, my application consists of a number of
> conditions, represented as strings with wildcards. Associated to each
> condition is arbitrary data explaining "what I must do".

...
> However, I'm uncertain about the efficiency of this approach. I'd like
> to use regexps, but how would I associate data with each pattern?


One way is with groups. Make each pattern into a regexp,
then concatenate them as

(pat1)|(pat2)|(pat3)| ... |(patN)

Do the match and find which group has the non-None value.

You may need to tack a "$" on the end of the string (in which
case remember to enclose everything in a () so the $ doesn't
affect only the last pattern).

One thing to worry about is that you can only have 99 groups
in a pattern.

Here's example code.


import re

config_data = [
    ("abc#e#", "Reactor meltdown imminent"),
    ("ab##", "Antimatter containment field breach"),
    ("b####f", "Coffee too strong"),
]

as_regexps = ["(%s)" % pattern.replace("#", ".")
              for (pattern, text) in config_data]

full_regexp = "|".join(as_regexps) + "$"
pat = re.compile(full_regexp)

input_data = [
    "abadb",
    "abcdef",
    "zxc",
    "abcq",
    "b1234f",
]

for text in input_data:
    m = pat.match(text)
    if not m:
        print "%s? That's okay." % (text,)
    else:
        for i, val in enumerate(m.groups()):
            if val is not None:
                print "%s? We've got a %r warning!" % (text,
                                                       config_data[i][1],)



Here's the output I got when I ran it:


abadb? We've got a 'Antimatter containment field breach' warning!
abcdef? We've got a 'Reactor meltdown imminent' warning!
zxc? That's okay.
abcq? We've got a 'Antimatter containment field breach' warning!
b1234f? We've got a 'Coffee too strong' warning!
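
If you have more patterns than the group limit allows, one workaround
(an illustrative sketch, not part of the recipe above, following the
advice about enclosing everything so the "$" applies to every
alternative) is to compile the patterns in batches and remember each
batch's offset. Note that an alternation reports only the first
alternative that matches, so this yields at most one hit per batch:

import re

def compile_batches(patterns, batch_size=99):
    batches = []
    for start in range(0, len(patterns), batch_size):
        chunk = patterns[start:start + batch_size]
        regexps = ["(%s)" % p.replace("#", ".") for p in chunk]
        # non-capturing group keeps group numbers aligned with chunk
        regexp = "(?:" + "|".join(regexps) + ")$"
        batches.append((re.compile(regexp), start))
    return batches

def first_matches(batches, text):
    # collect the index of the first matching pattern in each batch
    hits = []
    for pat, offset in batches:
        m = pat.match(text)
        if m:
            for i, val in enumerate(m.groups()):
                if val is not None:
                    hits.append(offset + i)
                    break
    return hits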


Andrew
 
Bengt Richter
 
      10-17-2004
On Sat, 16 Oct 2004 09:11:37 GMT, "Chris S." <(E-Mail Removed)> wrote:

>I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
>is anything), which I want to match with a test string (e.g. 'abcdef').
>What would be the best way for me to store my strings so lookup is as
>fast as possible?


Insufficient info. But 'fast as possible' suggests putting your strings in
a flex grammar and generating a parser in C. See
http://www.gnu.org/software/flex/
Defining a grammar is a good exercise in precise definition of the problem anyway.

If you want to do it in python, you still need to be more precise...

- is # a single character? any number of characters?
- if your test string were 'abcdefabcdef' would you want 'abc#e#' to match the whole thing?
or match abcdef twice?
- if one wild card string matches, does that "use up" the test string so other wild card strings
mustn't match? If so, what has priority? Longest? Shortest? Other criterion?
- etc etc

Regards,
Bengt Richter
 
Chris S.
 
      10-17-2004
Bengt Richter wrote:

> On Sat, 16 Oct 2004 09:11:37 GMT, "Chris S." <(E-Mail Removed)> wrote:
>
>
>>I have a number of strings, containing wildcards (e.g. 'abc#e#' where #
>>is anything), which I want to match with a test string (e.g. 'abcdef').
>>What would be the best way for me to store my strings so lookup is as
>>fast as possible?

>
>
> Insufficient info. But 'fast as possible' suggests putting your strings in
> a flex grammar and generating a parser in C. See
> http://www.gnu.org/software/flex/
> Defining a grammar is a good exercise in precise definition of the problem anyway.
>
> If you want to do it in python, you still need to be more precise...
>
> - is # a single character? any number of characters?
> - if your test string were 'abcdefabcdef' would you want 'abc#e#' to match the whole thing?
> or match abcdef twice?
> - if one wild card string matches, does that "use up" the test string so other wild card strings
> mustn't match? If so, what has priority? Longest? Shortest? Other criterion?
> - etc etc


Sorry for the ambiguity. My case is actually pretty simple. '#'
represents any single character, so it's essentially the same as re's
'.'. The match must be exact, so the string and pattern must be of equal
lengths. Each wildcard is independent of other wildcards. For example,
suppose we restricted the possible characters to 1 and 0, then the
pattern '##' would only match '00', '01', '10', and '11'. This pattern
would not match '0', '111', etc. I feel that a trie would work well, but
I'm concerned that for large patterns the overhead of a pure-Python
implementation would be too great.
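
That behaviour is easy to reproduce with re by translating '#' to '.'
and anchoring the pattern - a quick check:

import re

# '##' with each '#' -> '.', anchored so pattern and string must be
# of equal length
pat = re.compile('..$')
for s in ('00', '01', '10', '11', '0', '111'):
    print s, pat.match(s) is not None
# prints True for the four length-2 strings, False for '0' and '111'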
 