Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Number of distinct substrings of a string [continuation]

Reply
Thread Tools

Number of distinct substrings of a string [continuation]

 
 
n00m
Guest
Posts: n/a
 
      11-29-2009
http://www.spoj.pl/problems/DISUBSTR/
Got accepted both in C++(0.03s) and Python(0.12s).
Don't know exactly how they benchmark solutions but
my home tests proved Python is a fellow fast beast of C++.
Quite unexpected (I expected Py would be by ~10 times slower).
PS
Both my codes are identical in their algorithms.


Borland
compiler
(bcc32.exe) Python 2.5 + Psyco
================================
1 1
2 2
13 13
5 5
9 9
59078 59078
1000 1000
321000 321000
=============================
0.016 0.0150001049042 <--- exec times
=============================

----------------------------------------------------
My test input file:
----------------------------------------------------
8

a aa

abbaz

CCCCC ABABA

fgjtjdjjdjdjjryyryryududdisisisidfjfjnfnfgngnnghhh hhhhdeggrgtfgjtjdjjdjdjjryyryryududdisisisidfjfjnf nfgngnnghhhhhhhdeggrgtfgjtjdjjdjdjjryyryryududdisi sisidfjfjnfnfgngnnghhhhhhhdeggrgtfgjtjdjjdjdjjryyr yryududdisisisidfjfjnfnfgngnnghhhhhhhdeggrgtfgjtjd jjdjdjjryyryryududdisisisidfjfjnfnfgngnnghhhhhhhde ggrgtfgjtjdjjdjdjjryyryryududdisisisidfjfjnfnfgngn nghhhhhhhdeggrgtfgjtjdjjdjdjjryyryryududdisisisidf jfjnfnfgngnnghhhhhhhdeggrgtfgjtjdjjdjdjjryyryryudu ddisisisidfjfjnfnfgngnnghhhhhhhdeggrgtfgjtjdjjdjdj jryyryryududdisisisidfjfjnfnfgngnnghhhhhhhdeggrgtf gjtjdjjdjdjjryyryryududdisisisidfjfjnfnfgngnnghhhh hhhdeggrgtfgjtjdjjdjdjjryyryryududdisisisidfjfjnfn fgngnnghhhhhhhdeggrgtfgjtjdjjdjdjjryyryryududdisis isidfjfjnfnfgngnnghhhhhhhdeggrgtfgjtjdjjdjdjjryyry ryududdisisisidfjfjnfnfgngnnghhhhhhhdeggrgtfgjtjdj jdjdjjryyryryududdisisisidfjfjnfnfgngnnghhhhhhhdeg grgtfgjtjdjjdjdjjryyryryududdisisisidfjfjnfnfgngnn ghhhhhhhdeggrgtfgjtjdjjdjdjjryyryryududdisisisidfj fjnfnfgngnnghhhhhhhdeggrgtfgjtjdjjdjdjjryyryryudud


aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb cccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccc
 
Reply With Quote
 
 
 
 
n00m
Guest
Posts: n/a
 
      11-29-2009
My Py solution:

================================================== ===

import psyco
psyco.full()

def foo(s):
n = len(s)
s = s + ' '
a = [[] for i in xrange(12]
ans = 0
for i in xrange(n - 1, -1, -1):
lev = 0
for st in xrange(len(a[ord(s[i])]) - 1, -1, -1):
j = a[ord(s[i])][st]
if (n - j <= lev):
break
if (s[j + lev] != s[i + lev]):
continue
if (s[j + lev / 2] != s[i + lev / 2]):
continue
k = 0
while (s[j + k] == s[i + k]):
k += 1
if (k > lev):
lev = k
a[ord(s[i])] += [i]
ans += n - i - lev
return ans


from time import time
t = time()

import sys
sys.stdin = open('D:/88.txt', 'rt')
f = sys.stdin.read().split()
sys.stdin.close()

z = open('D:/99.txt', 'wt')

for tc in range(int(f[0])):
s = f[tc + 1]
print >> z, foo(s)

print >> z, time() - t
z.close()

================================================== ======
 
Reply With Quote
 
 
 
 
n00m
Guest
Posts: n/a
 
      11-29-2009
This worked out in 5.28s
Imo it's not that *much* slower
(of course, Psyco can't help here)
===================================

import itertools
def subs(s):
return len(set(itertools.chain(
s[i:j]
for i in xrange(len(s))
for j in xrange(i, len(s)+1)))) - 1


from time import time
t = time()

import sys
sys.stdin = open('D:/88.txt', 'rt')
f = sys.stdin.read().split()
sys.stdin.close()

z = open('D:/99.txt', 'wt')

for tc in range(int(f[0])):
s = f[tc + 1]
print >> z, subs(s)

print >> z, time() - t
z.close()

==================================================
 
Reply With Quote
 
Bearophile
Guest
Posts: n/a
 
      11-29-2009
n00m:
> This worked out in 5.28s
> Imo it's not that *much* slower
> (of course, Psyco can't help here)


There's no need of a chain here, so you can rewrite this:

import itertools
def subs(s):
return len(set(itertools.chain(
s[i:j]
for i in xrange(len(s))
for j in xrange(i, len(s)+1)))) - 1

as:

def subs(s):
return len(set(s[i:j] for i in xrange(len(s))
for j in xrange(i, len(s)+1))) - 1


Psyco helps a little (about 10% on my PC) if you use it like this:

from time import time
import psyco; psyco.full()

def main():
t = time()
fin = open("input.txt")
fout = open("output.txt", "w")
fin.next()
for line in fin:
r = set()
s = line.rstrip()
len_s = len(s)
for i in xrange(len_s):
for j in xrange(i, len_s + 1):
r.add(s[i : j])
print >> fout, len(r) - 1
fout.close()
print time() - t

main()

Bye,
bearophile
 
Reply With Quote
 
Bearophile
Guest
Posts: n/a
 
      11-29-2009
n00m:
> My Py solution:
> ...


Cute. You can replace:
a[ord(s[i])] += [i]
With:
a[ord(s[i])].append(i)

If you want to micro-optimize this with Psyco may be a little faster:
(lev >> 1)
Than:
lev // 2

But to increase speed it's usually better to reduce memory
allocations. So you can try to find a way to pull this out of the
function:
a = [[] for i in xrange(12]
And to avoid a true append:
a[ord(s[i])] += [i]

(There are many ways to avoid an append. One of them is to have an
already allocated list/array and just bump forward an index variable
that starts from 0. This is worth doing especially when you use
Psyco).

Bye,
bearophile
 
Reply With Quote
 
Bearophile
Guest
Posts: n/a
 
      11-29-2009
n00m:
> my home tests proved Python is a fellow fast beast of C++.
> Quite unexpected (I expected Py would be by ~10 times slower).
> PS
> Both my codes are identical in their algorithms.


> =============================
> 0.016 * * * * 0.0150001049042 * <--- exec times


Maybe in your C++ code there's something that can be improved, this is
a 1:1 translation to D (V.1) language (using dlibs) and it's about 2.2
times faster than the Psyco version:
http://codepad.org/MQLj0ydB
Using a smarter usage of memory (that is avoiding all or most memory
allocations inside all loops), the performance difference will surely
grow.

Bye,
bearophile
 
Reply With Quote
 
n00m
Guest
Posts: n/a
 
      11-29-2009
On Nov 29, 3:15*pm, Bearophile <bearophileH...@lycos.com> wrote:
> Maybe in your C++ code there's something that can be improved, this is
> a 1:1 translation to D (V.1) language (using dlibs) and it's about 2.2
> times faster than the Psyco version:http://codepad.org/MQLj0ydB
> Using a smarter usage of memory (that is avoiding all or most memory
> allocations inside all loops), the performance difference will surely
> grow.


Very interesting. Thanks.
D code looks pretty neat. Btw D lang is among accepted langs there.
Even if Py by 4x *slower* -- it's still a perfect Ok for it (C# will
be
much (much) slower than Python).

Micro-optimizations.
Of course, the best optimization would be to implement Suffix Tree:
http://en.wikipedia.org/wiki/Trie
Currently I hardly understand/know/read/etc its core idea. My algo is
plainly stupid as soldier muddy boots.

My C++ code:

<BEGIN>
#include <stdlib.h>
#include <stdio.h>
//#include <string.h>
//#include <ctype.h>
//#include <math.h>
#include <time.h>
#include <iostream>
#include <vector>
#include <string>
using namespace std;

int main() {
clock_t start_time = clock();
freopen("88.txt", "rt", stdin);
freopen("99.txt", "wt", stdout);

int tcs;
string s;
cin >> tcs;
while (tcs-- > 0) {
cin >> s;
int n = s.size();
s = s + ' ';
vector< vector<int> > a(12;
int ans = 0;
for (int i = n - 1; i >= 0; --i) {
int lev = 0;
for (int st = (int)a[s[i]].size() - 1; st >= 0; --st) {
int j = a[s[i]][st];
if (n - j <= lev) break;
if (s[j + lev] != s[i + lev]) continue;
if (s[j + lev / 2] != s[i + lev / 2]) continue;
int k = 0;
while (s[j + k] == s[i + k]) ++k;
if (k > lev) lev = k;
}
a[s[i]].push_back(i);
ans += n - i - lev;
}
cout << ans << endl;
}

cout << (clock() - start_time) / CLOCKS_PER_SEC << endl;

return 0;
}
<END>







 
Reply With Quote
 
n00m
Guest
Posts: n/a
 
      11-29-2009
http://en.wikipedia.org/wiki/Suffix_tree

Looks not very friendly appealing
 
Reply With Quote
 
n00m
Guest
Posts: n/a
 
      11-29-2009
Tested both my codes against
a random string of length = 10000.

===========================================

from random import choice
s = ''
for i in xrange(10000):
s += choice(('a','b','c','d','e','f'))

===========================================

C++: ~0.28s
Python: ~0.48s

PS
I suspect that building of Suffix Tree would
be a big exec.time-consuming overhead
 
Reply With Quote
 
Bearophile
Guest
Posts: n/a
 
      11-29-2009
n00m:

> I suspect that building of Suffix Tree would
> be a big exec.time-consuming overhead


In C/D/C++ there are ways to allocate memory in smarter ways, using
pools, arenas, stacks, freelists, etc. With that, using a compact
encoding of tree nodes (there are many different tricks to reduce
space used by a suffix tree), you can go fast enough. Probably a basic
implementation too will suffice in many cases

Anyway, you may try a pure-Python2.x implementation:
http://suffixtree.googlecode.com/fil...fixtree-0.1.py
If you find time & will to try to use that (or similar code) to
improve your Python solution to the problem 99, you can show us the
code here...

Bye,
bearophile
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
number of distinct elements in a const container and iterating over the distinct elements Hicham Mouline C++ 1 04-11-2010 10:56 AM
Is there a command that returns the number of substrings in a string? Peng Yu Python 0 10-24-2009 02:31 AM
Replacing palindrome substrings of an input string with a given string Tung Chau C Programming 1 08-06-2004 07:27 PM
Replacing palindrome substrings of an input string with a given string. Any effecient algorithm? Tung Chau C Programming 0 08-06-2004 10:18 AM
splitting string with unknown number of similar substrings Jan Biel Perl Misc 2 04-02-2004 05:43 PM



Advertisments