On 20 Jun 2004 17:10:56 GMT, Anno Siegel
<> wrote:
> Aronaxis, the Sourceror <> wrote in
> comp.lang.perl.misc:
>> # ind_compare($string1,$string2 [,accuracy] )
>> # calculates strings "similarity"
>> # algorithm is cutted from some old Pascal code, and rewritten to use
>> # perl RE-engine backtracking for speed.
>>
>> use warnings;
>> use strict;
>>
>> sub DEBUG () {1}
>>
>> our ($cnt, $match);
#inserted:
our $i;
>> sub ind_compare ($$;$) {
>> my $max_len = $_[2];
>>
>> # numification for security reasons.
>> $max_len="" unless $max_len +=0;
>>
>> # WHY NOT?!
>> # my ($cnt,$match)=(0,0);
>> ($cnt, $match) = (0,0);
>>
>> use re 'eval'; # because of $max_len interpolation
>> # in regex below. But we cleaned it.
>>
>> # loop for comparing $_[0] against $_[1], and $_[1] against $_[0] too
>> for my $i (0,1) {
#changed:
for $i (0,1) {
>> $_[$i] =~ m{
>> ( .{1,$max_len} )
>> (?{
>> $cnt++;
>> $match++ if index( $_[1-$i], $1 ) != -1;
>> })
>> (?!) # that always fail and force backtracking.
>> }x;
>> }
>>
>> print "$match/$cnt\n" if DEBUG;
>>
>> return 0 unless $cnt;
>> $match/$cnt;
>> }
>>
>>
>> print ind_compare( "abcdefgh","abcdefgh" ), "\n\n";
>> print ind_compare( "abcdefg!","abcdefgh" ), "\n\n";
>>
>>
{...}
> I don't have a pat explanation, except to point out that "(?{ ... })"
> is still experimental. It has had scoping issues from day one and
> still has.
>
>> looks like some scoping issues, but then why there are no problems with
>> lexical scoped $i, or @_ (which AFAIK is also lexical) ?
>
> @_ is most definitely a package variable.
hm.. I'm sure I've read something claiming that @_ in perl5 behave more as
lexical than as package.. but I think you're right. And it's too hard to
feel the difference in this case... I think, if that function will make a
reference to @_ and store it somewhere, it would break next invocation the
same way as described. Looks like function internal variables tend to use
the same "addresses" on next invocation, if only they can..
("hm.."x2; I tested it now

. \@_ remains the same on several
invocations, but when I pushed \@_ to package array @temp, forcing perl to
change the \@_ on next invocation, regex still works correcly, as before..)
> The distinction is not with lexical vs. local. If you change "our" to
> "my", but keep the declarations out of the sub, you'll see the same
> difference.
I tested this too. for ($cnt,$match) it's correct. for $i - no. It should
be "our". I think that's because perl's "for" behaves slightly different
based on how variable with same name as "for"'s ..er.. iterator declared
in outer scope: as package or as lexical. Package var will be localized in
loop, but if there is lexical $i in scope, then "for $i (...)" works as
"for my $i (...)" does. It's only my assumption.. am I right?
> I also believe you are mistaken about there being no problems with $i.
> In fact, its behavior can also be described as it having a different
> value inside the regex and outside. This means that in effect you are
> doing "index( $_[1], $1)" both times through the loop, though the regex
> is first matched against $_[0] and then against $_[1].
yeah, thank you. I used badly chosen tests and didn't notice a difference.
> Check this by removing "my" in front of "$i" and adding "$i" to the list
> of "our" variables outside the loop. You will find that the match count
> changes from 64 to 56 in the second case, which, I think, is correct.
> (You are counting how many substrings of each string are also substrings
> of the other, right?)
right.
> On a more general note, why are you doing this? As far as I can see,
> there is no advantage in using the regex backtracking mechanism and
> (?{ ... }) against a more conventional method of calling Perl code.
> What you are really using it for is a mechanism to walk through
> all substrings of a string. That's somewhat neat, but not necessarily
> a speed gain. If the original Pascal program is anything like I imagine
> it to be, it will be hard to beat with a Perl program.
actually, it was my rewrite of another perl program, which made by my
friend from pascal program using "one-to-one" translation. It involved
many length(), substr(), eq, and nested loops, but he didn't ever used
index().
he send me his code and some another variants on other languages with
comparison chart:
Perl - 17.50s
O'Caml raw bytecode - 9.77s
O'Caml funcional bytecode - 9.55s
O'Caml funcional native - 1.68s
O'Caml raw native - 1.63s
his perl program was too lowlevel, and as my experience tells me,
"lowlevel" in perl's case means "slow". and ugly too. And I wonder if I
can use internal regex loop for memory saving (I didn't know how long are
lines he used to compare) and speed.. btw, I never used (?{}) before and
was curious. So you've seen what curiosity made to fox.
BTW, my rewrite took only 5.2s on former test.
> To do it in Perl, I wouldn't employ the regex engine at all. Instead
> I'd base it on a procedure to extract all non-empty substrings from
> a string. Ignoring the requirement for a maximal string length for
> simplicity, something like this would do:
>
> use constant DEBUG => 1;
>
> print ind_compare( "abcdefgh","abcdefgh" ), "\n\n";
> print ind_compare( "abcdefg!","abcdefgh" ), "\n\n";
>
> sub ind_compare {
> my ( $s, $t) = @_;
> my ( $count, $match);
> for ( 1, 2 ) {
> my @sub = all_substr( $s);
> $count += @sub;
> $match += grep 1 + index( $t, $_), @sub;
> ( $s, $t) = ( $t, $s);
> }
>
> print "$match/$count\n" if DEBUG;
> $count ? $match/$count : 0;
> }
>
> sub all_substr {
> my $s = shift;
> my @sub;
> while ( length $s ) {
> push @sub, map substr( $s, $_), 0 .. length( $s) - 1;
> chop $s;
> }
> @sub;
> }
>
> Anno
Thank you, it was real pleasure to read such a code.. it's perfect.
only one drawback I can see - it generates list of all possible substrings
before checking it, which can take too much memory on arbitrarily long
strings.. but I'm in doubt if this function would be useful on long
strings comparison. so your version is great.
Alexey