Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > Converting a string to multiple search patterns

Reply
Thread Tools

Converting a string to multiple search patterns

 
 
Tore Aursand
Guest
Posts: n/a
 
      06-08-2004
Hi all!

I'm stumped on this one: I have an application where I need to refine the
search mechanism. The concept is quite simple: Get a string, convert it
to separate words, count (and "score") each word for each document, and
then display the result based on the score;

my $query = 'A B C D';
my @words = split( /\s+/, $query );
foreach ( @documents ) {
# ...
}

I need to refine it, as said. I want a higher score for word sequences,
and in a particular order. For the example above ('A B C D'), I want to
match in this order:

1. A B C D
2. A B C
3. B C D
4. A B
5. C D
6. A C
7. B D
9. A D
9. A
10. B
11. C
12. D

Anyone know of a module which can accomplis this? I really haven't tried
with anything yet, 'cause I have no clue on how to do it. The closest
thing I've been, has been with the Algorithm:ermute module. It doesn't
give me what I want "out of the box", though...

Any ideas?


--
Tore Aursand <(E-Mail Removed)>
"Progress is made by lazy men looking for easier ways to do things."
(Robert Heinlein)
 
Reply With Quote
 
 
 
 
Anno Siegel
Guest
Posts: n/a
 
      06-08-2004
Tore Aursand <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> Hi all!
>
> I'm stumped on this one: I have an application where I need to refine the
> search mechanism. The concept is quite simple: Get a string, convert it
> to separate words, count (and "score") each word for each document, and
> then display the result based on the score;
>
> my $query = 'A B C D';
> my @words = split( /\s+/, $query );
> foreach ( @documents ) {
> # ...
> }
>
> I need to refine it, as said. I want a higher score for word sequences,
> and in a particular order. For the example above ('A B C D'), I want to
> match in this order:
>
> 1. A B C D
> 2. A B C
> 3. B C D
> 4. A B
> 5. C D
> 6. A C
> 7. B D
> 9. A D
> 9. A
> 10. B
> 11. C
> 12. D


I'm missing "A B D", "A C D", and " B C " from the collection.
Are these entirely arbitrary?

> Anyone know of a module which can accomplis this? I really haven't tried
> with anything yet, 'cause I have no clue on how to do it. The closest
> thing I've been, has been with the Algorithm:ermute module. It doesn't
> give me what I want "out of the box", though...


I'm not sure what you are asking. Is it the generation of all selections
of 1 .. 4 objects from a set of 4? These don't correspond to permutations,
but to four-digit binary numbers (so there are 2**4 - 1 = 15 of them,
not counting the empty selection). I'm sure there is a module on CPAN
to generate them, but ad-hoc solutions aren't too hard either.

Or is the issue how to assign a score to each of a collection of
regexes and retrieve the score after each match? This can be done
using the (?{}) construct to execute code at match time.

Starting from your list (@lines, say) above, I'd generate a list @score
of pairs where each pair holds a score and a string to match:

my @score = map [ split /\./], @lines;
$_->[ 1] =~ tr/ //d for @score;

The second line simplifies things by deleting all blanks from the strings
to match. Your practical regexes may look different.

Build an alternation of patterns where each pattern includes code
to set a variable ($scored) to the corresponding score:

my $rex = join '|', map "$_->[ 1](?\{ \$scored = $_->[ 0] \})", @score;

Generate a test string and check it.

my $text = join '', map qw( A B C D E)[ rand 5], 1 .. 100;

my $scored;
use re 'eval';
while ( $text =~ /($rex)/g ) {
print "score $scored: $1\n";
}

Anno
 
Reply With Quote
 
 
 
 
Jeff 'japhy' Pinyan
Guest
Posts: n/a
 
      06-08-2004
[posted & mailed]

On Tue, 8 Jun 2004, Tore Aursand wrote:

>I'm stumped on this one: I have an application where I need to refine the
>search mechanism. The concept is quite simple: Get a string, convert it
>to separate words, count (and "score") each word for each document, and
>then display the result based on the score;
>
> my $query = 'A B C D';
> my @words = split( /\s+/, $query );
> foreach ( @documents ) {
> # ...
> }
>
>I need to refine it, as said. I want a higher score for word sequences,
>and in a particular order. For the example above ('A B C D'), I want to
>match in this order:
>
> 1. A B C D
> 2. A B C
> 3. B C D
> 4. A B
> 5. C D
> 6. A C
> 7. B D
> 9. A D
> 9. A
> 10. B
> 11. C
> 12. D


As Anno said, you're missing three (non-empty) strings:

A B D
A C D
B C

Now, does this mean you want to attempt matching them in that order?
Like, you want your regex to do:

/ABCD|ABC|BCD|ABD|ACD|AB|BC|CD|AC|AD|BD|A|B|C|D/

Is that what you want your regex to do?

>Anyone know of a module which can accomplis this? I really haven't tried
>with anything yet, 'cause I have no clue on how to do it. The closest
>thing I've been, has been with the Algorithm:ermute module. It doesn't
>give me what I want "out of the box", though...


I think you can accomplish it by way of embedded code in your regex, but
I'd need a little more information about the aim of the regex before I
could write one.

--
Jeff Pinyan RPI Acacia Brother #734 RPI Acacia Corp Secretary
"And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)

 
Reply With Quote
 
Tore Aursand
Guest
Posts: n/a
 
      06-08-2004
On Tue, 08 Jun 2004 11:53:53 +0000, Anno Siegel wrote:
>> 1. A B C D
>> 2. A B C
>> 3. B C D
>> 4. A B
>> 5. C D
>> 6. A C
>> 7. B D
>> 9. A D
>> 9. A
>> 10. B
>> 11. C
>> 12. D


> I'm missing "A B D", "A C D", and " B C " from the collection.


Doh! You're right;

1. A B C D
2. A B C
3. B C D
4. A B D <--
5. A C D <--
6. A B
7. B C <--
8. C D
9. A C
10. B D
11. A D
12. A
13. B
14. C
15. D

> I'm not sure what you are asking. Is it the generation of all selections
> of 1 .. 4 objects from a set of 4?


Yes. I have a string containing 'A B C D'. I split those to an array,
and I want to generate a new list (as above), ie. and AoA;

$VAR1 = [
[ A, B, C, D ],
[ A, B, C ],
[ B, C, D ],
 
Reply With Quote
 
Tore Aursand
Guest
Posts: n/a
 
      06-08-2004
On Tue, 08 Jun 2004 09:31:39 -0400, Jeff 'japhy' Pinyan wrote:
> [...]
> Now, does this mean you want to attempt matching them in that order?


Yes, As mentioned, the matching isn't the problem. It's generating the
structure/pattern which is the problem. See my reply to Anno for more
information.


--
Tore Aursand <(E-Mail Removed)>
"First get your facts; then you can distort them at your leisure."
(Mark Twain)
 
Reply With Quote
 
Anno Siegel
Guest
Posts: n/a
 
      06-08-2004
Tore Aursand <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> On Tue, 08 Jun 2004 11:53:53 +0000, Anno Siegel wrote:
> >> 1. A B C D
> >> 2. A B C
> >> 3. B C D
> >> 4. A B
> >> 5. C D
> >> 6. A C
> >> 7. B D
> >> 9. A D
> >> 9. A
> >> 10. B
> >> 11. C
> >> 12. D

>
> > I'm missing "A B D", "A C D", and " B C " from the collection.

>
> Doh! You're right;
>
> 1. A B C D
> 2. A B C
> 3. B C D
> 4. A B D <--
> 5. A C D <--
> 6. A B
> 7. B C <--
> 8. C D
> 9. A C
> 10. B D
> 11. A D
> 12. A
> 13. B
> 14. C
> 15. D
>
> > I'm not sure what you are asking. Is it the generation of all selections
> > of 1 .. 4 objects from a set of 4?

>
> Yes. I have a string containing 'A B C D'. I split those to an array,
> and I want to generate a new list (as above), ie. and AoA;
>
> $VAR1 = [
> [ A, B, C, D ],
> [ A, B, C ],
> [ B, C, D ],
> .
> .
> .
> ]
>
> Once I have this structure, it's quite easy to do the scoring. But I'm
> really stuck on how to generate this structure...


I found the other part harder.

sub selections {
my @sel = [];
for my $elem ( @_ ) {
unshift @sel, map [ $elem, @$_], @sel;
}
@sel;
}

Anno
 
Reply With Quote
 
Brad Baxter
Guest
Posts: n/a
 
      06-08-2004
On Tue, 8 Jun 2004, Tore Aursand wrote:
> Doh! You're right;
>
> 1. A B C D
> 2. A B C
> 3. B C D
> 4. A B D <--
> 5. A C D <--
> 6. A B
> 7. B C <--
> 8. C D
> 9. A C
> 10. B D
> 11. A D
> 12. A
> 13. B
> 14. C
> 15. D
>
> > I'm not sure what you are asking. Is it the generation of all selections
> > of 1 .. 4 objects from a set of 4?

>
> Yes. I have a string containing 'A B C D'. I split those to an array,
> and I want to generate a new list (as above), ie. and AoA;


The order you list isn't the order I'd expect from a simple generation of
all selections, so tell me if my interpretation is correct:

o More terms = higher score
o More adjacent terms = higher score
o Leftward terms = higher score than those to the right

Right?

Regards,

Brad
 
Reply With Quote
 
Jeff 'japhy' Pinyan
Guest
Posts: n/a
 
      06-08-2004
[posted & mailed]

On Tue, 8 Jun 2004, Jeff 'japhy' Pinyan wrote:

>[posted & mailed]
>
>Now, does this mean you want to attempt matching them in that order?
>Like, you want your regex to do:
>
> /ABCD|ABC|BCD|ABD|ACD|AB|BC|CD|AC|AD|BD|A|B|C|D/
>
>Is that what you want your regex to do?


I *think* this is what you want your regex to do:

my $rx = qr{
(?{ local ($s, $f) = (0, 1) })
^ \s*
(?: A (?{ $s += $f <<= 1 }) | (?{ $f = 1 }) ) \s*
(?: B (?{ $s += $f <<= 1 }) | (?{ $f = 1 }) ) \s*
(?: C (?{ $s += $f <<= 1 }) | (?{ $f = 1 }) ) \s*
(?: D (?{ $s += $f <<= 1 }) | (?{ $f = 1 }) ) \s*
$
(?{ $s })
}x;

You don't need to worry about the set logic now -- the regex engine will
take care of that.

That regex will match "ABCD", "ABC", "BCD" (and " BCD"), etc.
Specifically, it will match the 15 non-empty strings listed, with optional
whitespace throughout. Modification of HOW it matches is simple. What's
important to see is the scoring algorithm.

Score ($s) starts out at 0, and the factor ($f) starts out at 1. Every
time we hit a consecutive keyword, we left-shift the factor by 1 (that is,
we multiply it by 2), and add it to the score. Every time the next
keyword is missing, we reset the factor to 1. This means that the string
"ABCD" has the maximum score of 30 (2 + 4 + 8 + 16), and single-character
strings like "B" have the minimum non-zero score of 2.

To *use* this, we would do something like:

# this could be made more efficient via a Guttman-Rosler
# Transform, but that's not the point here

@sorted_by_score =
map { $_->[1] }
sort { $b->[0] <=> $a->[0] }
map { /$rx/ ? [ $^R, $_ ] : () }
@data;

The $^R variable contains the return value of the most recent (?{ ... })
in a regex -- for us, this is the one that merely holds the value of $s.
We then sort the data by its score, and extract the data.

If this is totally off-base, I apologize. On the other hand, if this is
exactly what you're looking for, then the process of constructing the
regex is simple:

use re 'eval';
my @kw = qw( A B C D );
my $rx = qr{
(?{ local ($s, $f) = (0, 1) })
^ \s*
@{[ map "(?:\Q$_\E(?{ \$s += \$f <<= 1 })|(?{ \$f = 1 }) )\\s*", @kw ]}
$
(?{ $s })
}x;

That does it.

--
Jeff Pinyan RPI Acacia Brother #734 RPI Acacia Corp Secretary
"And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)



 
Reply With Quote
 
Tore Aursand
Guest
Posts: n/a
 
      06-08-2004
On Tue, 08 Jun 2004 11:02:29 -0400, Brad Baxter wrote:
> [...]
> o More terms = higher score
> o More adjacent terms = higher score
> o Leftward terms = higher score than those to the right


That's correct. Am I missing something? Is my list wrong? Please
correct me or make any suggestions.


--
Tore Aursand <(E-Mail Removed)>
"What we anticipate seldom occurs. What we least expected generally
happens." (Benjamin Disraeli)
 
Reply With Quote
 
Tore Aursand
Guest
Posts: n/a
 
      06-08-2004
On Tue, 08 Jun 2004 14:52:41 +0000, Anno Siegel wrote:
> sub selections {
> my @sel = [];
> for my $elem ( @_ ) {
> unshift @sel, map [ $elem, @$_], @sel;
> }
> @sel;
> }


This one almost does the job, but it doesn't output what I want; The
elements are mostly reversed, so reverse()'ing @_ makes it a bit better.

Still, 'A B D' comes before 'B C D'. Take a look at my previous reply to
you:

1. A B C D
2. A B C
3. B C D
4. A B D
5. A C D
6. A B
7. B C
8. C D
9. A C
10. B D
11. A D
12. A
13. B
14. C
15. D

Try this simple script using your subroutine (rewritten, 'cause
"sometimes" Pan won't let me paste thing I mark in xterm):

my @array = selections( qw(A B C D) );
for ( 0..$#array ) {
print "$_. " . join(' ', @{$array[$_]}) . "\n";
}

But you're close, Anno!


--
Tore Aursand <(E-Mail Removed)>
"What we do is never understood, but only praised and blamed."
(Friedrich Nietzsche)
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
checking a string against multiple patterns tomasz Python 7 12-18-2007 07:27 PM
replace multiple patterns of a string in single pass Vellingiri Arul Ruby 1 09-19-2007 01:48 PM
Interactive Find and Replace String Patterns on Multiple Files Xah Lee Java 0 06-14-2006 08:54 PM
Interactive Find and Replace String Patterns on Multiple Files Xah Lee Python 0 06-14-2006 08:54 PM
where to find good patterns and sources of patterns (was Re: singletons) crichmon C++ 4 07-07-2004 10:02 PM



Advertisments