Comparing strings from the back?

 
 
Chris Angelico
 
      09-04-2012
On Wed, Sep 5, 2012 at 2:32 AM, Johannes Bauer <(E-Mail Removed)> wrote:
> How do you arrive at that conclusion? When comparing two random strings,
> I just derived
>
> n = (256 / 255) * (1 - 256 ^ (-c))
>
> where n is the average number of character comparisons and c the length
> of the strings. The rationale is as follows: the first character has to
> be compared in any case, the second with a probability of 1/256, the
> third with 1/(256^2), and so on.


That would be for comparing two random areas of memory. Python strings
don't have 256 options per character, and in terms of actual strings
there are far more possibilities. The strings that a program is going to
compare for equality will use a vastly restricted alphabet; in a lot of
cases there might be only a few dozen plausible characters.

But even so, it's going to scale approximately linearly with the
string length. If they're really random, then yes, there's little
chance that either a 1MB string or a 2MB string will be the same, but
with real data, they might very well have a long common prefix. So
it's still going to be more or less O(n).
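
For what it's worth, here is a quick simulation sketch (mine, not part of
the original post; the helper names are made up) that shows both effects at
once, assuming 256 equally likely values per position:

import random

def first_mismatch(a, b):
    # Characters examined before an equality test of equal-length strings can stop.
    for i, (x, y) in enumerate(zip(a, b), 1):
        if x != y:
            return i
    return len(a)

def random_string(n, alphabet):
    return "".join(random.choice(alphabet) for _ in range(n))

alphabet = [chr(c) for c in range(256)]
trials = 10000

# Purely random 64-char strings: the average stays near 256/255 ~= 1.004.
rand_avg = sum(first_mismatch(random_string(64, alphabet),
                              random_string(64, alphabet))
               for _ in range(trials)) / trials

# Strings sharing a 32-char prefix: the cost is dominated by that prefix.
prefix = random_string(32, alphabet)
pref_avg = sum(first_mismatch(prefix + random_string(32, alphabet),
                              prefix + random_string(32, alphabet))
               for _ in range(trials)) / trials

print("random strings:        avg comparisons", rand_avg)
print("32-char shared prefix: avg comparisons", pref_avg)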

ChrisA
 
MRAB
      09-05-2012
On 05/09/2012 03:18, Neil Hodgson wrote:
> Roy Smith:
>
>> I'm wondering if it might be faster to start at the ends of the strings
>> instead of at the beginning? If the strings are indeed equal, it's the
>> same amount of work starting from either end.

>
> Most people write loops that go forwards. This leads to the
> processor designers prioritizing detection mechanisms like cache
> prefetching for that case.
>
> However, it's not always the case: a couple of years ago Intel
> contributed a memcpy implementation to glibc that went backwards for a
> performance improvement. An explanation in a related patch, involving
> speculative and committed operations and address bits on some processors,
> quickly lost me:
> paragraph 3 of
> http://lists.freedesktop.org/archive...st/000423.html
>
> The memcpy patch was controversial as it broke Adobe Flash which
> assumed memcpy was safe like memmove.
>

I don't think I've ever used memcpy, only memmove.
 
Roy Smith
      09-05-2012
In article <(E-Mail Removed)>,
Neil Hodgson <(E-Mail Removed)> wrote:

> The memcpy patch was controversial as it broke Adobe Flash


An added benefit!
 
Peter Otten
      09-05-2012
Roy Smith wrote:

> There's been a bunch of threads lately about string implementations, and
> that got me thinking (which is often a dangerous thing).
>
> Let's assume you're testing two strings for equality. You've already
> done the obvious quick tests (i.e. they're the same length), and you're
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning? If the strings are indeed equal, it's the
> same amount of work starting from either end. But, if it turns out that
> for real-life situations, the ends of strings have more entropy than the
> beginnings, the odds are you'll discover that they're unequal quicker by
> starting at the end.
>
> It doesn't seem implausible that this is the case. For example, most
> of the filenames I work with begin with "/home/roy/". Most of the
> strings I use as memcache keys have one of a small number of prefixes.
> Most of the strings I use as logger names have common leading
> substrings. Things like credit card and telephone numbers tend to have
> much more entropy in the trailing digits.


But do you actually compare them with each other? Even if you do, I'd guess
that Python looks at the hash values most of the time.
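
A rough sketch of that guess (mine, not Peter's): at least in CPython, a
dict lookup compares the stored hashes first and only falls back to a
character-by-character __eq__ when a hash matches, so a str subclass that
counts equality calls barely sees any:

class CountingStr(str):
    """str subclass that records how often full equality comparison runs."""
    eq_calls = 0

    def __eq__(self, other):
        CountingStr.eq_calls += 1
        return str.__eq__(self, other)

    __hash__ = str.__hash__  # hash exactly like a plain str

cache = {CountingStr("/home/roy/file%d.txt" % i): i for i in range(1000)}

CountingStr.eq_calls = 0
# Build the probe key at runtime so identity shortcuts don't hide the comparison.
hit = cache["".join(["/home/roy/", "file500.txt"])]
miss = cache.get("/home/roy/nosuchfile.txt")

print("hit:", hit, "miss:", miss, "equality calls:", CountingStr.eq_calls)
# Typically one equality call for the hit and none for the miss: the hash
# filters out almost every candidate before any characters are compared.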

> On the other hand, hostnames (and thus email addresses) exhibit the
> opposite pattern.


Neither do English words:

$ python3 reverse_compare2.py compare /usr/share/dict/words --word-len 8

comparing every pair in a sample of 1000 8-char words
taken from '/usr/share/dict/words'

head
1: 477222 ************************************************************
2:  18870 **
3:   2870
4:    435
5:     74
6:     17
7:     12

tail
1: 386633 ************************************************************
2:  74966 ***********
3:  29698 ****
4:   6536 *
5:   1475
6:    154
7:     28
8:     10

> Anyway, it's just a thought. Has anybody studied this for real-life
> usage patterns?
>
> I'm also not sure how this works with all the possible UCS/UTF encodings.
> With some of them, you may get the encoding semantics wrong if you don't
> start from the front.



 
Chris Angelico
      09-05-2012
On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten <(E-Mail Removed)> wrote:
> comparing every pair in a sample of 1000 8-char words
> taken from '/usr/share/dict/words'
>
> head
> 1: 477222 ************************************************************
> 2:  18870 **
> ...


Not understanding this. What are the statistics, and what (if it's not
obvious from the previous answer) do they prove?

ChrisA
 
Johannes Bauer
      09-05-2012
On 04.09.2012 20:07, Steven D'Aprano wrote:
> A reasonable, conservative assumption is to calculate the largest
> possible value of the average for random strings. That largest value
> occurs when the alphabet is as small as possible, namely two characters.
> In practice, strings come from a larger alphabet, up to 1114111 different
> characters for full Unicode strings, so the average for them will be less
> than the average we calculate now.
>
> So for unequal strings, the number of comparisons is equally likely to be
> 1, 2, 3, ..., N. The average then is:


That is exactly what I don't agree with. For unequal random strings, the
number of comparisons is *not* uniformly distributed. For demonstration,
let's look at bitstrings and the string "1010". There are 15 bitstrings of
the same length which are unequal to it, and I'll put the number of
comparisons needed (going left to right) next to each:

0000 1
0001 1
0010 1
0011 1
0100 1
0101 1
0110 1
0111 1
1100 2
1101 2
1110 2
1111 2
1000 3
1001 3
1011 4

i.e. about 1.73 (26/15) comparisons on average with random strings,
compared to the 2.5 comparisons you end up with. My formula does respect
lazy termination (because the probability of needing more comparisons
drops off exponentially).
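
That enumeration can be checked mechanically; a few lines of Python (mine,
added here purely for illustration) reproduce the 26/15 average:

from itertools import product

def comparisons(a, b):
    # Characters looked at before a left-to-right equality test can stop.
    for i, (x, y) in enumerate(zip(a, b), 1):
        if x != y:
            return i
    return len(a)

target = "1010"
others = ["".join(bits) for bits in product("01", repeat=4)
          if "".join(bits) != target]
counts = [comparisons(target, s) for s in others]
print(len(others), sum(counts) / len(counts))   # -> 15 1.7333...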

> sum([1, 2, 3, ..., N])/N
>
> which by a bit of simple algebra works out to be (N+1)/2, or half the
> characters as I said.


Yes, but it's a flawed assumption you're making, since you're neglecting
lazy evaluation and the early abort of the comparison.

Best regards,
Johannes

--
>> Where had you predicted the quake EXACTLY, again?

> At least not publicly!

Ah, the latest and to this day most ingenious feat of our great
cosmologist: the secret prediction.
 - Karl Kaos about Rüdiger Thomas in dsa <hidbv3$om2$(E-Mail Removed)>
 
Johannes Bauer
      09-05-2012
On 04.09.2012 23:59, Chris Angelico wrote:

>> n = (256 / 255) * (1 - 256 ^ (-c))
>>
>> where n is the average number of character comparisons and c the length
>> of the strings. The rationale is as follows: the first character has to
>> be compared in any case, the second with a probability of 1/256, the
>> third with 1/(256^2), and so on.

>
> That would be for comparing two random areas of memory. Python strings
> don't have 256 options per character; and in terms of actual strings,
> there's so many possibilities.


The unit of comparison is completely arbitrary; the same formula can
equally be applied to bitstrings if the base is changed accordingly.
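
Spelled out generally (this restatement is added here and wasn't in the
original post), with sigma the alphabet size and c the string length as in
the formula above: comparison k is only reached when the first k-1
characters agree, which happens with probability sigma^(-(k-1)), so

    n = 1 + 1/sigma + 1/sigma^2 + ... + 1/sigma^(c-1)
      = (1 - sigma^(-c)) / (1 - 1/sigma)

For sigma = 256 this is the (256/255) * (1 - 256^(-c)) above; for bitstrings
(sigma = 2, c = 4) it gives 15/8 = 1.875, which agrees with the four-bit
enumeration earlier in the thread once the single equal string (4
comparisons) is included. Either way the value is bounded by
sigma/(sigma - 1) and never comes close to N/2.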

> The strings that a program is going to
> compare for equality are going to use a vastly restricted alphabet;
> for a lot of cases, there might be only a few dozen plausible
> characters.


This is very true. But if so, I would like to see the assumption that
you're making (which might very well be plausible) in order to arrive at
an average of N/2 comparisons (which just seems *way* off).

> But even so, it's going to scale approximately linearly with the
> string length. If they're really random, then yes, there's little
> chance that either a 1MB string or a 2MB string will be the same, but
> with real data, they might very well have a long common prefix. So
> it's still going to be more or less O(n).


I understand what you're saying and surely this is true to some extent
(someone mentioned filenames, which is an excellent example). However, in
order to derive a *formula* you need to formalize your assumptions.
With random data it's definitely not O(n), and even in reality (with a
more limited alphabet) it will usually abort early:

Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
assumes O(1) comparisons! If the comparison operation itself were in
O(n), the total sorting complexity would be O(n^2 log(n)), which is
definitely false. Most comparisons will abort *very* early (after the
first character).

Best regards,
Johannes

 
Johannes Bauer
      09-05-2012
On 05.09.2012 11:24, Johannes Bauer wrote:

> Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
> assumes O(1) comparisons! If the comparison operation itself were in
> O(n), the total sorting complexity would be O(n^2 log(n)), which is
> definitely false. Most comparisons will abort *very* early (after the
> first character).


I've just written a program to demonstrate this (it's attached below). I
used my aspell dictionary, which contains about 100k words. To keep the
complexity down I only used words of the most common length (in this
case 9 characters, or ~13k words). Here are the results (note it's
randomized, so they'll always differ a little, but they're pretty stable):

Max Cmp: 100
Sum Cmp: 32111
Avg Cmp: 2.393842254361115
Num words : 13414
Min wordlen: 9
Max wordlen: 9
Avg wordlen: 9.0

Going up in word length (only showing part of the output now):

Avg Cmp: 2.2374929735806632
Num words : 10674
Avg wordlen: 10.0

Avg Cmp: 2.261727078891258
Num words : 7504
Avg wordlen: 11.0

Avg Cmp: 2.2335647202939977
Num words : 4898
Avg wordlen: 12.0

Avg Cmp: 2.1743341404358354
Num words : 2891
Avg wordlen: 13.0

Avg Cmp: 2.156782549420586
Num words : 1467
Avg wordlen: 14.0

Avg Cmp: 2.112449799196787
Num words : 747
Avg wordlen: 15.0

So, interestingly, in this real-world case(tm), the average number of
comparisons does not scale with O(n). It actually goes down (which is
probably due to the limited number of words of greater length).

In summary, let's please not debate "feelings" or such. Either
measurements (with a clear indication of what data they're based on and
how the test was conducted) or derivations (with an explanation of the
assumptions). Feelings can be very misleading in cases like this.

Best regards,
Johannes



#!/usr/bin/python3
import random

class Word():
    """Wraps a word and counts how many character comparisons __lt__ makes."""

    def __init__(self, s):
        self._s = s
        self._c = 0

    def __lt__(self, other):
        # Shorter words sort first without any character comparisons.
        if len(self) < len(other):
            return True
        elif len(self) > len(other):
            return False

        # Compare character by character and stop at the first difference.
        for i in range(len(self)):
            self._c += 1
            if self._s[i] < other._s[i]:
                return True
            elif self._s[i] > other._s[i]:
                return False
        return False

    def __repr__(self):
        return "%s<%d>" % (self._s, self._c)

    def __len__(self):
        return len(self._s)

    def getcmp(self):
        return self._c

# Read the word list and keep only the 9-character words.
wordlist = [ ]
for line in open("dict", "r"):
    word = Word(line[:-1])
    if len(word) == 9:
        wordlist.append(word)

random.shuffle(wordlist)
wordlist.sort()

comparisons = [ word.getcmp() for word in wordlist ]
print("Max Cmp:", max(comparisons))
print("Sum Cmp:", sum(comparisons))
print("Avg Cmp:", sum(comparisons) / len(comparisons))

wordlengths = [ len(word) for word in wordlist ]
print("Min wordlen:", min(wordlengths))
print("Max wordlen:", max(wordlengths))
print("Avg wordlen:", sum(wordlengths) / len(wordlengths))


 
Peter Otten
      09-05-2012
Chris Angelico wrote:

> On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten <(E-Mail Removed)> wrote:
>> comparing every pair in a sample of 1000 8-char words
>> taken from '/usr/share/dict/words'
>>
>> head
>> 1: 477222 ************************************************************
>> 2:  18870 **
>> ...


tail
1: 386633 ************************************************************
2:  74966 ***********


> Not understanding this. What are the statistics,


I measured how many chars they have in common for all combinations of 1000
words taken from /usr/share/dict/words.

477222 pairs have one char in common, 18870 pairs have two chars in common
when compared from the *beginning* of the string.

386633 pairs have one char in common, 74966 pairs have two chars in common
when compared from the *end* of the string.

> and what (if it's not obvious from the previous answer) do they prove?

They demonstrate that for two words from that particular corpus it is likely
that a[::-1] == b[::-1] has to take more characters into account (1.34 versus
1.05 on average) than a == b, i.e. comparing from the back should be slower
rather than faster.

If that doesn't help, here's the code

import random
import itertools
from contextlib import contextmanager
from collections import defaultdict

@contextmanager
def open_corpus(args):
    with open(args.name) as instream:
        words = (line.strip() for line in instream)
        #words = (word for word in words if not word.endswith("'s"))
        yield words

def pick(items, count):
    items = list(items)
    return random.sample(items, count)

def count_common(a, b):
    # Number of characters examined before the first difference.
    i = 0
    for i, (x, y) in enumerate(zip(a, b), 1):
        if x != y:
            break
    return i

def corpus(args):
    with open_corpus(args) as words:
        wordlens = (len(line.strip()) for line in words)
        freq = defaultdict(int)
        for wordlen in wordlens:
            freq[wordlen] += 1
    show_histogram(freq)

def show_histogram(freq):
    peak = max(freq.values())
    freq_width = len(str(peak))
    max_freq = max(freq)
    len_width = len(str(max_freq))
    for i in range(min(freq), max_freq+1):
        value = freq[i]
        print("{:{}}: {:{}} {}".format(
            i, len_width, value,
            freq_width, "*" * (60 * value // peak)))

def compare(args):
    with open_corpus(args) as words:
        words = pick(
            (word for word in words if len(word) == args.word_len),
            args.sample_size)
    n = 0
    head_common = defaultdict(int)
    tail_common = defaultdict(int)
    n_tail = n_head = 0
    for n, (a, b) in enumerate(itertools.combinations(words, 2), 1):
        cc = count_common(a, b)
        n_head += cc
        head_common[cc] += 1
        ccr = count_common(a[::-1], b[::-1])
        n_tail += ccr
        tail_common[ccr] += 1

    print(n, "combinations")
    print("average common chars (head) {:.3}".format(n_head/n))
    print("average common chars (tail) {:.3}".format(n_tail/n))

    print("comparing every pair in a "
          "sample of {sample} {len}-char words\n"
          "taken from {name!r}".format(
              sample=args.sample_size,
              len=args.word_len,
              name=args.name))
    print()
    print("head")
    show_histogram(head_common)
    print()
    print("tail")
    show_histogram(tail_common)

def main():
    import argparse
    parser = argparse.ArgumentParser()
    parsers = parser.add_subparsers()

    pcorpus = parsers.add_parser("corpus")
    pcorpus.add_argument("name")
    pcorpus.set_defaults(func=corpus)

    pcompare = parsers.add_parser("compare")
    pcompare.add_argument("name")
    pcompare.add_argument("--sample-size", type=int, default=1000)
    # The original default was cut off in the archived post; 8 matches the
    # example invocation quoted earlier in the thread.
    pcompare.add_argument("--word-len", type=int, default=8)
    pcompare.set_defaults(func=compare)
    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()


 
Steven D'Aprano
      09-05-2012
On Wed, 05 Sep 2012 11:43:08 +0200, Johannes Bauer wrote:

> On 05.09.2012 11:24, Johannes Bauer wrote:
>
>> Consider sorting a large dictionary. Sorting is in O(n log(n)), but
>> this assumes O(1) comparisons! If the comparison operation itself were
>> in O(n), the total sorting complexity would be O(n^2 log(n)),


This shows some significant confusion about Big Oh complexity analysis
here. When you say that sorting the list of words is O(N log N), the N
refers to the number of words in the dictionary. But when you speak about
comparisons being O(N), the N doesn't refer to the size of the dict, but
the size of the individual strings being compared. There is no reasonable
comparison algorithm where the effort required to compare two strings
depends on the size of the dictionary they have come from.

Since these are *completely different Ns*, you can't combine them to get
O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer
nonsense. If you state it like this:

sorting the list of words is O(N log N) where N = length of the list
comparing the words is O(M) where M = the average length of the words

then the overall complexity is O(N log N)*O(M), but since M is a constant
for any particular dictionary (it's just the average length of the words)
that reduces down to O(N log N).
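
To put rough numbers on that (my arithmetic, purely for scale): sorting
N = 100,000 words needs on the order of N*log2(N), about 1.7 million
comparisons, and with M around 9 each comparison touches at most about 9
characters, i.e. roughly 15 million character operations in the worst case.
Doubling the average word length doubles that constant factor but leaves
the O(N log N) count of comparisons unchanged.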

To say that sorting a dictionary is O(N log N) does *not* imply that
string comparisons are O(1). It only implies that string comparisons
don't depend on the size of the dictionary. Comparing:

s1 = 'a'*30002
s2 = 'a'*30001 + 'b'

takes the same amount of time whether they are in a dictionary of 100
words or one of 100000 words. For the purposes of calculating the
algorithmic complexity of sorting a large list of words, we can treat
comparisons as an elementary operation even though they actually aren't.
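
A rough timing sketch (mine, not Steven's; the helper function is made up)
of that point: the cost of one comparison tracks where the first difference
sits, and is unaffected by the size of any container the strings live in.

import timeit

def compare_time(n):
    # Time an equality test whose first differing character sits at index n.
    s1 = 'a' * (n + 1)
    s2 = 'a' * n + 'b'
    return timeit.timeit('s1 == s2', globals={'s1': s1, 's2': s2}, number=10000)

for n in (100, 1000, 10000, 100000):
    print(n, compare_time(n))
# The measured time grows roughly linearly with n; putting the same strings
# into a 100-element or a 100000-element dict changes nothing about it.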


>> which is
>> definitely false. Most comparisons will abort *very* early (after the
>> first character).


You are making unjustified assumptions about the distribution of letters
in the words. This might be a list of long chemical compounds where the
words typically differ only in their suffix. It might be a list of people
with titles:

Herr Professor Frederick Schmidt
Herr Professor Frederick Wagner
....

But either way, it doesn't matter for the Big Oh behaviour of sorting the
words.


> I've just written a program to demonstrate this (below it's attached).


I have no idea why you think this program demonstrates anything about
string equality comparisons. What you are counting is not the average
number of comparisons for string equality, but the number of comparisons
when sorting. These are *completely different*, because sorting a list
does not require testing each string against every other string.

Whatever you have measured, it has nothing to do with the Big Oh
complexity of string comparisons.

It seems to me that you have misunderstood the purpose of Big Oh
notation. It gives an asymptotic upper bound on the amount of work done,
ignoring all lower terms and multiplicative factors, not an exact formula.

If something is O(N), then this tells you:

1) the number of algorithmic steps is proportional to N, plus or
minus some terms which are less than N (e.g. sqrt(N), or log(N),
or 1/N, or some constant, etc.);

2) if you ignore those other terms, and keep all other influences
equal, then doubling N will *at worst* double the number of
algorithmic steps;

3) if all those algorithmic steps take exactly the same amount of
time, then doubling N will take twice as much time.

But of course *none* of those assumptions hold for measurements of actual
elapsed time. In real life, operations rarely take constant time, you
don't always see worst case behaviour, and the lower terms are not
necessarily insignificant enough to ignore.


[...]
> So, interestingly, in this real-world case(tm), the complexity does not
> scale with O(n). It actually goes down (which is probably due to the
> limit amount of words of higher length).


I would guess that it probably has more to do with the ability of the
timsort algorithm to exploit local order within a list and avoid
performing comparisons.


> In summary, let's please not debate "feelings" or such.


I have no idea what feelings you are referring to.



--
Steven
 