Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > Matching mixed up words

Reply
Thread Tools

Matching mixed up words

 
 
Michael T. Davis
Guest
Posts: n/a
 
      04-12-2005
Say I want to match "gremlin" or the letters that compose the word
"gremlin", but in any order. Note that once "g" is consumed, the set of
available letters no longer includes "g". (Also, "g" isn't necessarily
going to be the first letter.) I would anticipate that a proper solution
for a word of <N> letters would approach a complexity (or "big O") of N!
(read "N factorial"). Is there a solution which could be implemented as
a single match, or would this require some extra code around a match?

Thanks,
Mike
--
| Systems Specialist: CBE,MSE
Michael T. Davis | Departmental Networking/Computing
http://www.ecr6.ohio-state.edu/~davism/ | The Ohio State University
| 197 Watts, (614) 292-6928
 
Reply With Quote
 
 
 
 
A. Sinan Unur
Guest
Posts: n/a
 
      04-12-2005
http://www.velocityreviews.com/forums/(E-Mail Removed)-state.edu (Michael T. Davis) wrote in
news:d3h7bu$hbt$(E-Mail Removed)-state.edu:

> Say I want to match "gremlin" or the letters that compose the
> word "gremlin", but in any order. Note that once "g" is consumed,
> the set of available letters no longer includes "g". (Also, "g" isn't
> necessarily going to be the first letter.) I would anticipate that a
> proper solution for a word of <N> letters would approach a complexity
> (or "big O") of N! (read "N factorial").


You are too pessimistic

> Is there a solution which could be implemented as a single match,
> or would this require some extra code around a match?


I don't see any mention of regexes in your post. I am not sure if that
is what you are after. There is a simple solution to this that falls
directly from your explanation of the problem:

use strict;
use warnings;

use Data:umper;

sub check {
my ($orig, $target) = @_;

my %c;

use integer;
my @l = split //, $orig;
++$c{$_} for @l;

@l = split //, $target;
for (@l) {
if(exists($c{$_}) and $c{$_}) {
--$c{$_};
}
}

@l = grep { $_ > 0 } values %c;
scalar @l ? 0 : 1;
}

my %check = (
sinan => [ 'nasin', 'nasina', 'lasin' ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);

for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
print "$k matches $t?: ";
if(check($k, $t)) {
print "Yes\n";
} else {
print "No\n";
}
}
}


__END__


I am sure someone will show a regex solution that I have overlooked.

By the way, your signature is not formatted properly:

> --


The proper signature marker is two dashes followed by a space and a
newline. Please do use that.

Sinan


--
A. Sinan Unur <(E-Mail Removed)>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/cl...uidelines.html
 
Reply With Quote
 
 
 
 
thundergnat
Guest
Posts: n/a
 
      04-12-2005
Michael T. Davis wrote:
> Say I want to match "gremlin" or the letters that compose the word
> "gremlin", but in any order. Note that once "g" is consumed, the set of
> available letters no longer includes "g". (Also, "g" isn't necessarily
> going to be the first letter.) I would anticipate that a proper solution
> for a word of <N> letters would approach a complexity (or "big O") of N!
> (read "N factorial"). Is there a solution which could be implemented as
> a single match, or would this require some extra code around a match?
>


I'm sure it could be done more efficiently but it was an interesting
little diversion. I wandered a little from the OPs spec since I am
ignoring spaces, punctuation and case, I guess.


use warnings;
use strict;

my $phrase = 'George W. Bush';

my %letters;

for (split//, $phrase){
$letters{lc($_)}++ if /[a-zA-Z]/;
}

while (<DATA>){
chomp (my $test_phrase = $_);
my $no_match;
my %testhash = %letters;
for (split//, $test_phrase){
if (/[a-zA-Z]/){
if (--$testhash{lc($_)} < 0){
$no_match++;
last;
}
}
}
for (values %testhash){
last if $no_match;
if ($_ < 0){
$no_match++;
}
}
print "Phrase \"$test_phrase\" ".($no_match ?
'does not match' : 'matches')." $phrase.\n";
}

__DATA__
NOT A MATCH
SHRUB EGG WOE
BUG GORE HEWS
GOB SEWER HUG
WEB USER GOGH
RUBES EGG WHO
BUG GREW HOSE
WHOSE BUGGER
BEG WORSE UGH
A BOGUS ENTRY
 
Reply With Quote
 
xhoster@gmail.com
Guest
Posts: n/a
 
      04-12-2005
(E-Mail Removed)-state.edu (Michael T. Davis) wrote:
> Say I want to match "gremlin" or the letters that compose the
> word "gremlin", but in any order. Note that once "g" is consumed, the
> set of available letters no longer includes "g". (Also, "g" isn't
> necessarily going to be the first letter.) I would anticipate that a
> proper solution for a word of <N> letters would approach a complexity (or
> "big O") of N! (read "N factorial"). Is there a solution which could be
> implemented as a single match, or would this require some extra code
> around a match?


canon("gremlin") eq canon($foo) or die;

sub canon {
join "", sort split //, $_[0];
};

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
 
Reply With Quote
 
A. Sinan Unur
Guest
Posts: n/a
 
      04-12-2005
http://www.velocityreviews.com/forums/(E-Mail Removed) wrote in news:20050412170728.908$(E-Mail Removed):

> (E-Mail Removed)-state.edu (Michael T. Davis) wrote:
>> Say I want to match "gremlin" or the letters that compose the
>> word "gremlin", but in any order match?


....

> canon("gremlin") eq canon($foo) or die;
>
> sub canon {
> join "", sort split //, $_[0];
> };


That's what I call the power of a clear mind

Simple and elegant.

Sinan

--
A. Sinan Unur <(E-Mail Removed)>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/cl...uidelines.html
 
Reply With Quote
 
Tassilo v. Parseval
Guest
Posts: n/a
 
      04-12-2005
Also sprach A. Sinan Unur:

> (E-Mail Removed)-state.edu (Michael T. Davis) wrote in
> news:d3h7bu$hbt$(E-Mail Removed)-state.edu:
>
>> Say I want to match "gremlin" or the letters that compose the
>> word "gremlin", but in any order. Note that once "g" is consumed,
>> the set of available letters no longer includes "g". (Also, "g" isn't
>> necessarily going to be the first letter.) I would anticipate that a
>> proper solution for a word of <N> letters would approach a complexity
>> (or "big O") of N! (read "N factorial").

>
> You are too pessimistic
>
>> Is there a solution which could be implemented as a single match,
>> or would this require some extra code around a match?

>
> I don't see any mention of regexes in your post. I am not sure if that
> is what you are after. There is a simple solution to this that falls
> directly from your explanation of the problem:
>
> use strict;
> use warnings;
>
> use Data:umper;
>
> sub check {
> my ($orig, $target) = @_;
>
> my %c;
>
> use integer;
> my @l = split //, $orig;
> ++$c{$_} for @l;
>
> @l = split //, $target;
> for (@l) {
> if(exists($c{$_}) and $c{$_}) {
> --$c{$_};
> }
> }
>
> @l = grep { $_ > 0 } values %c;
> scalar @l ? 0 : 1;
> }
>
> my %check = (
> sinan => [ 'nasin', 'nasina', 'lasin' ],
> gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
> );
>
> for my $k (keys %check) {
> for my $t (@{ $check{$k} }) {
> print "$k matches $t?: ";
> if(check($k, $t)) {
> print "Yes\n";
> } else {
> print "No\n";
> }
> }
> }


A faster solution appears to involve sort(): split both strings, sort
them and compare for equality. According to a benchmark:

use strict;
use Benchmark qw/cmpthese/;

sub check {
my ($orig, $target) = @_;
my %c;
use integer;
my @l = split //, $orig;
++$c{$_} for @l;
@l = split //, $target;
for (@l) {
if(exists($c{$_}) and $c{$_}) {
--$c{$_};
}
}
@l = grep { $_ > 0 } values %c;
scalar @l ? 0 : 1;
}

sub check_sort {
my ($orig, $target) = @_;
my $o = join '', sort split //, $orig;
my $t = join '', sort split //, $target;
return $o eq $t;
}

my %check = (
sinan => [ 'nasin', 'nasina', 'lasin', ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);

cmpthese(-2, {
histo => sub {
for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
check($k, $t);
}
}
},
sort => sub {
for my $k (keys %check) {
for my $t (@{ $check{$k} }) {
check_sort($k, $t);
}
}
},
});
__END__
Rate histo sort
histo 2280/s -- -43%
sort 3992/s 75% --

This might however be due to a denser implementation of check_sort()
avoiding temporary variables etc.

Also, check_sort() is more correct as it wont falsely report 'sinan' and
'nasina' as matching, which check() does. I'd write check() thusly:

sub check {
my ($orig, $target) = @_;
my %c;
++$c{$_} for split //, $orig;
--$c{$_} for split //, $target;
return ! grep $_, values %c;
}

This is still slower by roughly 25% than using sort. The 'use integer'
appears to have no effect on the benchmark.

Tassilo
--
use bigint;
$n=71423350343770280161397026330337371139054411854 220053437565440;
$m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($ m+=<=200);
 
Reply With Quote
 
A. Sinan Unur
Guest
Posts: n/a
 
      04-12-2005
"Tassilo v. Parseval" <(E-Mail Removed)> wrote in
news:slrnd5ohfe.178.tassilo.von.parseval@localhost .localdomain:

> sub check_sort {
> my ($orig, $target) = @_;
> my $o = join '', sort split //, $orig;
> my $t = join '', sort split //, $target;
> return $o eq $t;
> }

....

> Also, check_sort() is more correct as it wont falsely report 'sinan'
> and 'nasina' as matching, which check() does.


And to think that I actually look at the output, and somehow did not
notice my error. Thank you for catching that.

Sinan

--
A. Sinan Unur <(E-Mail Removed)>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/cl...uidelines.html
 
Reply With Quote
 
Michael T. Davis
Guest
Posts: n/a
 
      04-13-2005
Just to be clear, I'm looking for a regex-based mechanism that will
work within the confines of "m/.../". I would imagine it's going to need to
rely on the "(${code})" construct.

BTW, my signature includes a trailing space at the end of the first
line, but the gateway I'm using apparently strips it off. I have alerted
them to the mistake.

Regards,
Mike
--
| Systems Specialist: CBE,MSE
Michael T. Davis | Departmental Networking/Computing
http://www.ecr6.ohio-state.edu/~davism/ | The Ohio State University
| 197 Watts, (614) 292-6928
 
Reply With Quote
 
Tassilo v. Parseval
Guest
Posts: n/a
 
      04-13-2005
Also sprach Michael T. Davis:

> Just to be clear, I'm looking for a regex-based mechanism that will
> work within the confines of "m/.../". I would imagine it's going to need to
> rely on the "(${code})" construct.


Most likely even (??{CODE}). However, any of my attempts so far ended up
in a segmentation fault or 'panic: '. I knew that some of these extended
patterns are flagged as experimental but I didn't expect them to be that
fragile. It's tricky enough coming up with a pure regex solution but
here you'll also need to find one that wont crash perl. So I wouldn't
bother.

Tassilo
--
use bigint;
$n=71423350343770280161397026330337371139054411854 220053437565440;
$m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($ m+=<=200);
 
Reply With Quote
 
Anno Siegel
Guest
Posts: n/a
 
      04-13-2005
A. Sinan Unur <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> (E-Mail Removed)-state.edu (Michael T. Davis) wrote in
> news:d3h7bu$hbt$(E-Mail Removed)-state.edu:


[how to test for anagrams]

> use strict;
> use warnings;
>
> use Data:umper;
>
> sub check {
> my ($orig, $target) = @_;
>
> my %c;
>
> use integer;
> my @l = split //, $orig;
> ++$c{$_} for @l;
>
> @l = split //, $target;
> for (@l) {
> if(exists($c{$_}) and $c{$_}) {
> --$c{$_};
> }
> }
>
> @l = grep { $_ > 0 } values %c;
> scalar @l ? 0 : 1;
> }
>
> my %check = (
> sinan => [ 'nasin', 'nasina', 'lasin' ],
> gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
> );
>
> for my $k (keys %check) {
> for my $t (@{ $check{$k} }) {
> print "$k matches $t?: ";
> if(check($k, $t)) {
> print "Yes\n";
> } else {
> print "No\n";
> }
> }
> }
>
>
> __END__
>
>
> I am sure someone will show a regex solution that I have overlooked.


A regex solution seems unlikely. It would require jumping back and
forth in a string while keeping track of what was matched where.
Regexes aren't very good at that.

Using a hash for counting is just fine. It is basically a well known
data structure that implements what has been called "bags". Bags are
like sets, but each element (a hash key) can be contained multiple times
(the hash value). Containment and equality of bags are defined in the
obvious way. Then, to check if two strings are anagrams, create the
corresponding bags and test for equality. Code:

my %check = (
sinan => [ 'nasin', 'nasina', 'lasin' ],
gremlin => [ 'mergnil', 'mgrelanl', 'gremlin' ],
);

for my $k (keys %check) {
my $bk = Bag->embag( $k);
for my $t (@{ $check{$k} }) {
print "$k matches $t?: ";
print $bk eq Bag->embag( $t) ? "Yes\n" : "No\n";
}
}

################################################## #######################

package Bag;

sub embag { # create a bag of letters from a string
my $class = shift;
my %bag;
$bag{ $_} ++ for split //, shift;
bless \ %bag, $class;
}

sub contained {
my ( $b1, $b2) = @_;
$b2->{ $_} and $b1->{ $_} > $b2->{ $_} and return 0 for keys %$b1;
1;
}

use overload(
le => 'contained',
eq => sub { $_[ 0] le $_[ 1] and $_[ 1] le $_[ 0] },
);

__END__

Anno
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Replace stop words (remove words from a string) BerlinBrown Python 6 01-17-2008 02:37 PM
Mixed Content XML pattern matching phaeton123 XML 2 08-29-2006 09:23 AM
Words Words utab C++ 6 02-16-2006 07:00 PM
Non-noise words are incorrectly recognised as noise words. Peter Strĝiman ASP .Net 1 08-23-2005 01:26 PM
Re: A little bit of help regarding my linked list program required. - "words.c" - "words.c" Richard Heathfield C Programming 7 10-05-2003 02:38 PM



Advertisments