Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > Pattern Searching

Reply
Thread Tools

Pattern Searching

 
 
Shiraz
Guest
Posts: n/a
 
      06-22-2005
This is my goal, if i have an input file:

BEGIN----------
Smith
Johnathan
Smiths
Mona
Moynaham
END------------

i want to generate a minimum match so that with the minimum match
string, each entry can be identified. the output would be:

BEGIN----------
Smith - Smith
Smiths - Smiths or hs
Mona - Mon or ona
Moynaham - Moy or nah or aha or ham
END------------

Is there any built-in function to do such analysis in perl?

 
Reply With Quote
 
 
 
 
A. Sinan Unur
Guest
Posts: n/a
 
      06-23-2005
"Shiraz" <(E-Mail Removed)> wrote in news:1119479503.935902.303370
@z14g2000cwz.googlegroups.com:

> This is my goal, if i have an input file:
>
> BEGIN----------
> Smith
> Johnathan
> Smiths
> Mona
> Moynaham
> END------------
>
> i want to generate a minimum match so that with the minimum match
> string, each entry can be identified. the output would be:
>
> BEGIN----------
> Smith - Smith
> Smiths - Smiths or hs
> Mona - Mon or ona
> Moynaham - Moy or nah or aha or ham
> END------------
>
> Is there any built-in function to do such analysis in perl?


No.

But you can always browse through CPAN to see if someone has written a
module to help with this task.

Sinan
--
A. Sinan Unur <(E-Mail Removed)>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/cl...uidelines.html
 
Reply With Quote
 
 
 
 
Arne Ruhnau
Guest
Posts: n/a
 
      06-23-2005
Shiraz wrote:
> This is my goal, if i have an input file:
>
> BEGIN----------
> Smith
> Johnathan
> Smiths
> Mona
> Moynaham
> END------------
>
> i want to generate a minimum match so that with the minimum match
> string, each entry can be identified. the output would be:
>
> BEGIN----------
> Smith - Smith
> Smiths - Smiths or hs
> Mona - Mon or ona
> Moynaham - Moy or nah or aha or ham
> END------------
>
> Is there any built-in function to do such analysis in perl?


Hmm... my first idea would be to, for every entry, extract every possible
2..length($entry) - Gram (i.e. each sequence of 2..length($entry) letters)
and count them.
For Smith, this would become
Sm, mi, it, th, Smi, mit, ith, Smit, mith, Smith
and for Smiths
Sm, mi, it, th, hs, Smi, mit, ith, ths, Smit, mith, iths, Smith, miths, Smiths

(thus, there are more unique-to-the-entry substrings than you listed above)

Of course, one would have to remember which entry generated the ngram just
counted. Afterwards, you can grep all those from the resulting tree whose
terminal nodes (i.e. counts) are exactly one, i.e. which were unique to the
associated word.

hth,

Arne Ruhnau
 
Reply With Quote
 
Anno Siegel
Guest
Posts: n/a
 
      06-23-2005
Arne Ruhnau <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> Shiraz wrote:
> > This is my goal, if i have an input file:
> >
> > BEGIN----------
> > Smith
> > Johnathan
> > Smiths
> > Mona
> > Moynaham
> > END------------
> >
> > i want to generate a minimum match so that with the minimum match
> > string, each entry can be identified. the output would be:
> >
> > BEGIN----------
> > Smith - Smith
> > Smiths - Smiths or hs
> > Mona - Mon or ona
> > Moynaham - Moy or nah or aha or ham
> > END------------
> >
> > Is there any built-in function to do such analysis in perl?

>
> Hmm... my first idea would be to, for every entry, extract every possible
> 2..length($entry) - Gram (i.e. each sequence of 2..length($entry) letters)


Why only strings of length 2 or more? "y" uniquely identifies "Moynaham"
in the list above.

> and count them.


> For Smith, this would become
> Sm, mi, it, th, Smi, mit, ith, Smit, mith, Smith
> and for Smiths
> Sm, mi, it, th, hs, Smi, mit, ith, ths, Smit, mith, iths, Smith, miths, Smiths
>
> (thus, there are more unique-to-the-entry substrings than you listed above)
>
> Of course, one would have to remember which entry generated the ngram just
> counted. Afterwards, you can grep all those from the resulting tree whose
> terminal nodes (i.e. counts) are exactly one, i.e. which were unique to the
> associated word.


If you know the substring is unique (i.e. is a substring of only one of
the names) you don't *have* to remember which name generated which
substring. You can find it by looking which of the names contains the
substring.

Anno
 
Reply With Quote
 
Arne Ruhnau
Guest
Posts: n/a
 
      06-23-2005
Anno Siegel wrote:
> Arne Ruhnau <(E-Mail Removed)> wrote in comp.lang.perl.misc:
>
>>Shiraz wrote:
>>
>>>This is my goal, if i have an input file:
>>>
>>>BEGIN----------
>>>Smith
>>>Johnathan
>>>Smiths
>>>Mona
>>>Moynaham
>>>END------------
>>>
>>>i want to generate a minimum match so that with the minimum match
>>>string, each entry can be identified. the output would be:
>>>
>>>BEGIN----------
>>>Smith - Smith
>>>Smiths - Smiths or hs
>>>Mona - Mon or ona
>>>Moynaham - Moy or nah or aha or ham
>>>END------------
>>>
>>>Is there any built-in function to do such analysis in perl?

>>
>>Hmm... my first idea would be to, for every entry, extract every possible
>>2..length($entry) - Gram (i.e. each sequence of 2..length($entry) letters)

>
> Why only strings of length 2 or more? "y" uniquely identifies "Moynaham"
> in the list above.


Well, I thought unique unigrams could be too sparse / non-existent, so that
there is no point in looking at them anyway. But for completeness/maximally
short unique substrings, they should be included.

<snip>
>>Of course, one would have to remember which entry generated the ngram just
>>counted. Afterwards, you can grep all those from the resulting tree whose
>>terminal nodes (i.e. counts) are exactly one, i.e. which were unique to the
>>associated word.

>
> If you know the substring is unique (i.e. is a substring of only one of
> the names) you don't *have* to remember which name generated which
> substring. You can find it by looking which of the names contains the
> substring.


Correct, hadn't thought of that... But if you want to make some kind of
look-up based on the unique substring, returning the string identified by
it, you have to know it (to prevent yourself from scanning the original
list). Or am I getting it wrong?

Arne
 
Reply With Quote
 
Anno Siegel
Guest
Posts: n/a
 
      06-23-2005
Arne Ruhnau <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> Anno Siegel wrote:
> > Arne Ruhnau <(E-Mail Removed)> wrote in

> comp.lang.perl.misc:
> >
> >>Shiraz wrote:
> >>
> >>>This is my goal, if i have an input file:
> >>>
> >>>BEGIN----------
> >>>Smith
> >>>Johnathan
> >>>Smiths
> >>>Mona
> >>>Moynaham
> >>>END------------
> >>>
> >>>i want to generate a minimum match so that with the minimum match
> >>>string, each entry can be identified. the output would be:
> >>>
> >>>BEGIN----------
> >>>Smith - Smith
> >>>Smiths - Smiths or hs
> >>>Mona - Mon or ona
> >>>Moynaham - Moy or nah or aha or ham
> >>>END------------
> >>>
> >>>Is there any built-in function to do such analysis in perl?
> >>
> >>Hmm... my first idea would be to, for every entry, extract every possible
> >>2..length($entry) - Gram (i.e. each sequence of 2..length($entry) letters)


[...]

> >>Of course, one would have to remember which entry generated the ngram just
> >>counted. Afterwards, you can grep all those from the resulting tree whose
> >>terminal nodes (i.e. counts) are exactly one, i.e. which were unique to the
> >>associated word.

> >
> > If you know the substring is unique (i.e. is a substring of only one of
> > the names) you don't *have* to remember which name generated which
> > substring. You can find it by looking which of the names contains the
> > substring.

>
> Correct, hadn't thought of that... But if you want to make some kind of
> look-up based on the unique substring, returning the string identified by
> it, you have to know it (to prevent yourself from scanning the original
> list). Or am I getting it wrong?


Sure, you probably will eventually build a lookup table for the unique
substrings, but you don't have to keep track of each substring's origin
while you find them.

Anno

 
Reply With Quote
 
Shiraz
Guest
Posts: n/a
 
      06-23-2005
Arne, i appreciate the input.... i'll see what kind of tuning can be
done to accomplish that fast since the input file maybe as big as 50000
lines at a time.....

i'll put up the code in a couple of days.

Thanks all

 
Reply With Quote
 
A. Sinan Unur
Guest
Posts: n/a
 
      06-23-2005
Shiraz wrote:

> Arne, i appreciate the input....


What input was that? Please quote some context when posting replies.

> done to accomplish that fast since the input file maybe as big as 50000
> lines at a time.....


50000 is really not that huge a number.

> i'll put up the code in a couple of days.


Please read the posting guidelines for this group by then.

Sinan
 
Reply With Quote
 
xhoster@gmail.com
Guest
Posts: n/a
 
      06-23-2005
"Shiraz" <(E-Mail Removed)> wrote:
> This is my goal, if i have an input file:
>
> BEGIN----------
> Smith
> Johnathan
> Smiths
> Mona
> Moynaham
> END------------
>
> i want to generate a minimum match so that with the minimum match
> string, each entry can be identified. the output would be:
>
> BEGIN----------
> Smith - Smith
> Smiths - Smiths or hs
> Mona - Mon or ona
> Moynaham - Moy or nah or aha or ham
> END------------


Smith has *no* unambiguous match string!

What do you mean by "minimum"? "hs" is shorter than "Smiths" so why
is "Smiths" included if you only want the minimum?

>
> Is there any built-in function to do such analysis in perl?


No.

sub foo {
my %h2;
foreach (@_) {
my %h;
foreach my $p1 (0..length($_)-1) {
foreach my $p2 ($p1 .. length($_)-1) {
$h{substr $_, $p1, $p2-$p1+1}=1;
}
};
$h2{$_}++ foreach keys %h;
};
return map { my $t=$_; $t, grep -1!=index($_,$t),@_}
grep $h2{$_}==1, keys %h2;
};

print join "\t", foo(qw/Smith Smiths Mona Moynaham/);

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
 
Reply With Quote
 
Shiraz
Guest
Posts: n/a
 
      06-23-2005
Arne,
the input about how to analyze the problem, look for strings of various
lengths and see which one is unique to an entry

xhos
if in the list that i get contains, both smith and smiths, there is
nothing i can do about it, so to identify them separately, i need a
total match of smith, to identify 'smith' and i need 'hs' to identify
'Smiths';
you were correct about the hs;

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Searching regular expr: How to match the following pattern with quotes, brackets and semicolons? Peter Stacy Perl Misc 1 11-08-2009 09:27 AM
Google search result to be URL-limited when searching site, but notwhen searching Web stumblng.tumblr Javascript 1 02-04-2008 09:01 AM
searching for a pattern for multiple matches jhu Perl Misc 6 11-26-2007 10:49 AM
Searching a pattern gonas Java 2 11-21-2004 06:58 PM
searching for null in pattern Brian Andrus Perl Misc 6 11-08-2003 02:15 AM



Advertisments