running a sub inside regex

 
 
Thomas Isenbarger
Guest
Posts: n/a
 
      11-17-2003
I have never posted to this group before, so please forgive me if I am
posting in the wrong place

in the perl script below, I am trying to construct a string, within a
regex, based on an earlier captured match.

i think something about the last elsif block doesn't work. the regex
within the while loop near the bottom (before the subroutine) returns no
matches to sequences that should work.

for molecular biologists out there, I am trying to find the reverse
complement on the fly inside the regex. the @pairing array is something
like @pairing = (AU, CG, GC, UA) and defines the substitutions to be made
in order to build a reverse complement.

for non molecular biologists out there, a sequence that should match is
'acgu' if you need a test case.

input for the program:

use "acgu" for the target sequence.

then, for the sequence elements enter "1s=ac", RETURN, "1r", RETURN, RETURN.

please contact me at isen (AT) mgh (DOT) molbio (DOT) harvard (DOT) edu
if you are willing to help and/or need more information.

thank-you,
Tom Isenbarger


#!/usr/bin/perl -w
use re 'eval';    # allows execution of code within regex expressions (??{})

@pairing = (AU, GC, CG, UA, NN);

print "enter target sequence ";
$target = uc<STDIN>;
chomp($target);

if ($target =~ /[^ACGTU]/) {
    die "invalid characters found in target sequence. exiting isenfind\n\n";
}

#ask to keep IUB codes or convert them?
#ask to leave N or not?

print "enter sequence elements, one line at a time (return to stop)\n";

$input = "";
$i = 0;

do {
    $input = uc<STDIN>;
    chomp ($input);
    $input =~ s/\s+//g;
    $element[$i] = $input;
    $i++;
} until ($input eq "");

for ($i = 0; $i < @element; $i++) {
    $item = $element[$i];
    print "$item\n";
    if ($item =~ /^{/) {    # a base pairing rule
        $pairingstring = $item;
        $pairingstring =~ s/[{}]//g;
        @pairing = split /,/, $pairingstring;
        #check for valid pair rule syntax
        print "pairing now @pairing\n";
    }

    elsif ($item =~ /^(\d+)S=([ACGUBDHKMNRSVWY]+)/) {    # a specific sequence element to remember
        $pattern = "(".$2.")";
        $regex .= $pattern;
        $regexpos[$1] = ($i+1);    # record the position of this sequence in the element list
    }

    elsif ($item =~ /^(\d+)S=(\d+)-(\d+)/) {    # a non-specific sequence element to remember
        $pattern = "([ACGU]{$2,$3})";
        $regex .= $pattern;
        $regexpos[$1] = ($i+1);
    }

    elsif ($item =~/^(\d+)S/) {
        $lookwhere = $regexpos[$1];
        $pattern = "(\\".$lookwhere.")";
        $regex .= $pattern;
    }

    elsif ($item =~ /^[ACGUBDHKMNRSVWY]+/) {    # a specific sequence element
        $pattern = "(".$item.")";
        $regex .= $pattern;
    }

    elsif ($item =~ /^(\d+)-(\d+)/) {    # a non-specific sequence element
        $pattern = "([ACGU]{$1,$2})";
        $regex .= $pattern;
    }

    elsif ($item =~ /^(\d+)P/) {    # a palindrome of an earlier saved element
        $lookwhere = $regexpos[$1];
        $pattern = "(??{reverse \$".$lookwhere."})";
        $regex .= $pattern;
    }

    elsif ($item =~ /^(\d+)R/) {    # a reverse complement of an earlier saved element
        $lookwhere = $regexpos[$1];
        # use ' quotes so that @pairing is not interpolated
        $pattern = "(??{revcomp (\$".$lookwhere.', @pairing)})';
        $regex .= $pattern;
    }

}

use re 'debug';

print "regex $regex\n";
print "target $target\n";

while ($target =~ /$regex/g) {
    $position = pos $target;
    print "$& $position\n";
}

### subroutines

sub revcomp {
    my $sequence = shift (@_);
    my @pairing = @_;
    my $rc = undef;

    foreach $pair (@pairing) {
        ($first, $second) = split //, $pair;
        $match{$first} .= $second;
    }
    foreach $key (keys(%match)) {
        if (length($match{$key}) > 1) {
            $match{$key} = "[".$match{$key}."]";
        }
    }

    # process string one char at a time using substitutions in %match
    @string = split (//, reverse ($sequence));

    foreach $base (@string) {
        $rc .= $match{$base};
    }

    print "in sub rc = $rc\n\n";

    return $rc;    # return reverse complement of sequence
}
 

Austin P. So (Hae Jin)
      11-17-2003


Thomas Isenbarger wrote:

> in the perl script below, I am trying to construct a string, within a
> regex, based on an earlier captured match.


> i think something about the last elsif block doesn't work. the regex
> within the while loop near the bottom (before the subroutine) returns no
> matches to sequences that should work.
>
> for molecular biologists out there, I am trying to find the reverse
> complement on the fly inside the regex. the @pairing array is something
> like @pairing = (AU, CG, GC, UA) and defines the substitutions to be made
> in order to build a reverse complement.


It is quite possible to do it on the fly...something like

if ($string =~ m/($substring|revcomp($substring))/ig) {
$match_position = pos + 1;
$match = $1;
...
}

sub revcomp {
#make reverse complement
return
}

Everyone has their own style of perl, but I'm having a hard time
deciphering what you are hoping to do...so as I'm not all that sure what
you wish to do, this particular solution may not be what you want...

Austin

 

Jay Tilton
      11-17-2003
"Austin P. So (Hae Jin)" <> wrote:

: if ($string =~ m/($substring|revcomp($substring))/ig) {
: $match_position = pos + 1;
: $match = $1;
: ...
: }
:
: sub revcomp {
: #make reverse complement
: return
: }

Did you test that?

The m// operator is like a double-quotish string. Subroutine calls are
not interpolated without some extra work.

if ($string =~ m/($substring|@{[revcomp($substring)]})/ig) { ... }
                             ^^^                   ^^
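
A minimal, self-contained sketch of the interpolation above. The revcomp() body and the first_hit() helper are assumptions for illustration (plain Watson-Crick pairing on RNA letters), not code from the thread:

```perl
use strict;
use warnings;

# Hypothetical revcomp(): reverse the sequence, then swap each base
# for its Watson-Crick partner.
sub revcomp {
    my $seq = reverse shift;
    $seq =~ tr/ACGUacgu/UGCAugca/;
    return $seq;
}

# Hypothetical helper: return the first hit of $substring or of its
# reverse complement in $string, or undef if there is none.
sub first_hit {
    my ($string, $substring) = @_;
    # @{[ ... ]} builds an anonymous array from the call and dereferences
    # it, so the sub's return value is interpolated into the pattern.
    return $string =~ m/($substring|@{[ revcomp($substring) ]})/i ? $1 : undef;
}

print first_hit("CCUACGG", "GUA"), "\n";    # prints UAC
```

The inputs here are plain base letters; anything that may contain regex metacharacters would want \Q...\E around the interpolated parts.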
 

Tad McClellan
      11-18-2003
Jay Tilton <> wrote:
> "Austin P. So (Hae Jin)" <> wrote:
>
>: if ($string =~ m/($substring|revcomp($substring))/ig) {
>: $match_position = pos + 1;
>: $match = $1;
>: ...
>: }
>:
>: sub revcomp {
>: #make reverse complement
>: return
>: }
>
> Did you test that?
>
> The m// operator is like a double-quotish string. Subroutine calls are
> not interpolated without some extra work.
>
> if ($string =~ m/($substring|@{[revcomp($substring)]})/ig) { ... }



And shouldn't it be either:

if ( m// )

or

while ( m//g )

??
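
A small sketch of the difference; hit_positions() is a hypothetical helper, not from the thread. In scalar context m//g resumes from pos() on each call, so `while` walks through every hit, whereas a plain `if ( m// )` only answers whether there is one:

```perl
use strict;
use warnings;

# Collect the 1-based start position of every non-overlapping hit.
sub hit_positions {
    my ($string, $substring) = @_;
    my @starts;
    while ($string =~ m/$substring/ig) {
        # $-[0] is the 0-based offset where the last match started.
        push @starts, $-[0] + 1;
    }
    return @starts;
}

print join(",", hit_positions("ACGUACGU", "CG")), "\n";    # prints 2,6
```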


--
Tad McClellan SGML consulting
Perl programming
Fort Worth, Texas
 

Jeff 'japhy' Pinyan
      11-18-2003
[posted & mailed]

On Mon, 17 Nov 2003, Thomas Isenbarger wrote:

>for molecular biologists out there, I am trying to find the reverse
>complement on the fly inside the regex. the @pairing array is something
>like @pairing = (AU, CG, GC, UA) and defines the substitutions to be made
>in order to build a reverse complement.
>
>for non molecular biologists out there, a sequence that should match is
>'acgu' if you need a test case.


A context-free grammar for this would be:

S -> a S u | u S a | c S g | g S c | [nothing]

You *could* do this with Perl regexes, but it's unwieldy (and inefficient,
I can assure you):

my $pair;
$pair = qr{
    a (??{ $pair }) u |
    u (??{ $pair }) a |
    c (??{ $pair }) g |
    g (??{ $pair }) c |
    (?# nothing )
}x;
if ("acgu" =~ /^($pair)$/) {
    print "matched '$1'\n";
}

However, it's probably easier just to match a sequence, and then try to
match its reverse:

if ("acgu" =~ /^(([acgu]+)(??{ complement($2) }))$/) {
    print "matched '$1' ('$2')\n";
}

sub complement {
    my $str = reverse shift;
    $str =~ tr/aucg/uagc/;
    return $str;
}

I didn't need to use "use re 'eval'" for either of these, by the way,
because the variables in the regexes are qr// objects.

--
Jeff Pinyan RPI Acacia Brother #734 2003 Rush Chairman
"And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)


 

Austin P. So (Hae Jin)
      11-18-2003


Jay Tilton wrote:
> "Austin P. So (Hae Jin)" <> wrote:
>
> : if ($string =~ m/($substring|revcomp($substring))/ig) {
> : $match_position = pos + 1;
> : $match = $1;
> : ...
> : }


Oops...it should be:
$match_position = (pos $string) + 1;

> : sub revcomp {
> : #make reverse complement
> : return
> : }
>
> Did you test that?


Actually I did after you posted...just to be sure since it had been a
while since I did this...

And of course it didn't work...I guess my fallback is that I'm a crappy
perl programmer......I think I actually rewrote it too way back when...

> The m// operator is like a double-quotish string. Subroutine calls are
> not interpolated without some extra work.
>
> if ($string =~ m/($substring|@{[revcomp($substring)]})/ig) { ... }


Yep. That works brilliantly. Thanks.

And just to polish it further, to get the substring start site, it
should be:

if ($string =~ m/(?=($substring|@{[revcomp($substring)]}))/ig) {...}
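
A sketch of how the lookahead form behaves in a loop. start_sites() and the revcomp() body are assumptions for illustration. Because the lookahead consumes nothing, pos() stays at the start of each hit and overlapping hits are reported too:

```perl
use strict;
use warnings;

# Hypothetical revcomp(): reverse, then swap Watson-Crick partners.
sub revcomp {
    my $seq = reverse shift;
    $seq =~ tr/ACGUacgu/UGCAugca/;
    return $seq;
}

# 1-based start sites of $substring or its reverse complement,
# overlapping hits included.
sub start_sites {
    my ($string, $substring) = @_;
    my @sites;
    while ($string =~ m/(?=($substring|@{[ revcomp($substring) ]}))/ig) {
        # The match is zero-width, so pos() is the start of the hit in $1.
        push @sites, pos($string) + 1;
    }
    return @sites;
}

print join(",", start_sites("AUAUAU", "AUA")), "\n";    # prints 1,2,3,4
```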

Good thing I started lurking this newsgroup again...

BTW...where is a reference for subroutine calls within a regex?


Austin

 

Austin P. So (Hae Jin)
      11-18-2003


Tad McClellan wrote:

> And shouldn't it be either:
>
> if ( m// )
>
> or
>
> while ( m//g )
>
> ??


Right. My bad...

the latter to get all the substring instances...

Austin

 

Tad McClellan
      11-18-2003
Austin P. So (Hae Jin) <> wrote:

> BTW...where is a reference for subroutine calls within a regex?



Step 1 is to recognize that it isn't the regexness that matters,
it is the double-quotishness that matters.

Step 2 is to lookup subroutine calls within a double-quotish string.


perldoc -q expand

How do I expand function calls in a string?
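
For illustration, a short sketch of the two idioms for expanding a call inside a double-quotish string, and the same thing inside a pattern. The complement() sub mirrors the one posted earlier in the thread; the variable names are made up:

```perl
use strict;
use warnings;

sub complement {
    my $str = reverse shift;
    $str =~ tr/aucg/uagc/;
    return $str;
}

# List form: anonymous array, dereferenced inside the string.
my $as_list = "rc is @{[ complement('ac') ]}";

# Scalar form: reference to the return value, dereferenced in place.
my $as_scalar = "rc is ${\ complement('ac') }";

# A pattern is double-quotish too, so the same expansion works there.
my $hit = "acgu" =~ /^ac@{[ complement('ac') ]}$/ ? 1 : 0;

print "$as_list\n$as_scalar\n$hit\n";
```

Both strings come out as "rc is gu", and the match succeeds because the pattern is expanded to ^acgu$ before it is compiled.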


--
Tad McClellan SGML consulting
Perl programming
Fort Worth, Texas
 

Austin P. So (Hae Jin)
      11-18-2003
Tad McClellan wrote:
> Austin P. So (Hae Jin) <> wrote:
>>BTW...where is a reference for subroutine calls within a regex?


> Step 1 is to recognize that it isn't the regexness that matters,
> it is the double-quotishness that matters.


Uh...okay...obviously I haven't gotten to a point where I've had to
understand these kinds of subtleties in perl...

> Step 2 is to lookup subroutine calls within a double-quotish string.
> perldoc -q expand
> How do I expand function calls in a string?


Hmmm...I guess I never really thought about the search pattern within a
regex to be considered a "double-quotish string"...


Austin



 

Thomas Isenbarger
      11-18-2003
thanks for all your help people.

here is a (perhaps) better description of what i want to do:

first, I want to allow the user to input a literal sequence to match such
as 'ACCCUCUAUUCUC', and also allow the user to match arbitrary sequence
elements of any length and then be able to match that pattern again, the
reverse, or the reverse complement of that sequence. I also want to allow
the user to input his own pairing rules for revcomp matches. For example,
if you would want to match an RNA hairpin structure of 10-20 bases with a
4 base loop, allowing for non watson and crick pairing, I would think of
it this way:

match any sequence that is a minimum length of 10 to a maximum length of
20, followed by any 4 bases, followed by the reverse complement of what
was matched in the first part as defined by the pairing rules A-U, C-G,
G-C, U-A, U-G, G-U.

in my program this would be:

(AU, CG, GC, UA, UG, GU)
1s=10-20
4-4
1r
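
One way the hairpin search described above could be sketched, assuming Perl 5.18 or later (where a literal (??{ }) block can see lexical variables and needs no use re 'eval'). The names pairing_table(), revcomp_pattern() and find_hairpins() are hypothetical, and the test sequence is made up; this is not the PatSearch algorithm, only an illustration of the idea:

```perl
use strict;
use warnings;

# Turn rules such as qw(AU CG GC UA UG GU) into a lookup:
# base => string of every base allowed to pair with it.
sub pairing_table {
    my %partners;
    for my $pair (@_) {
        $partners{ substr($pair, 0, 1) } .= substr($pair, 1, 1);
    }
    return \%partners;
}

# Build a pattern string that matches any reverse complement of $seq
# under the rules. No regex is used in here, since it runs mid-match.
sub revcomp_pattern {
    my ($seq, $partners) = @_;
    my $pattern = '';
    for my $i (reverse 0 .. length($seq) - 1) {
        my $allowed = $partners->{ substr($seq, $i, 1) };
        # (?!) always fails, so a base with no partner can never match.
        $pattern .= defined $allowed ? "[$allowed]" : '(?!)';
    }
    return $pattern;
}

# Stem of $min..$max bases, a loop of $loop bases, then the paired stem.
# Returns a list of [1-based start, matched text].
sub find_hairpins {
    my ($target, $min, $max, $loop, @rules) = @_;
    my $partners = pairing_table(@rules);
    my @hits;
    while ($target =~ /(([ACGU]{$min,$max})[ACGU]{$loop}(??{ revcomp_pattern($2, $partners) }))/g) {
        push @hits, [ $-[0] + 1, $1 ];
    }
    return @hits;
}

for my $hit (find_hairpins("GGACUUUUGUUC", 4, 4, 4, qw(AU CG GC UA UG GU))) {
    print "$hit->[0] $hit->[1]\n";    # prints "1 GGACUUUUGUUC"
}
```

The lookup table is built once per search and kept in a lexical, so repeated calls from inside the match do not accumulate state between invocations.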

of course, i don't want any of these particular values hard wired, I want
them all to be input by the user. the elements I want to include (so far)
are these:

-arbitrary sequence of length N to M for matching later: 1s=N-M
-literal: some string of ACGU such as: ACCCUAUA
-literal to match and remember for later: 1s=ACCCUA (for example)
-match a remembered sequence again: 1s (for example, to find tandem
repeats 1 to 10 bases long I would use the lines below)

1s=1-10
1s

-match the reverse of a sequence already matched: 1p (for example, to
find inverted repeats 1 to 10 bases long I would use the lines below)

1s=1-10
1p

-match the revcomp of a sequence already matched: 1r (for example, to
find hairpins I would use the lines below)

1s=1-10
1r

-pairing rules: (AU, CG, UA, GC, AG, GA, UG, GU)
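
As a rough illustration of how the 1s and 1p elements could map onto plain regex constructs (assuming Perl 5.18 or later for the literal (??{ }) block; tandem_repeats() and inverted_repeats() are hypothetical names): a remembered element matched again is a backreference, and its reverse is a postponed subexpression. The explicit `scalar` matters, because reverse in list context returns its single argument unchanged.

```perl
use strict;
use warnings;

# 1s=MIN-MAX followed by 1s: a stretch immediately repeated.
sub tandem_repeats {
    my ($target, $min, $max) = @_;
    my @hits;
    while ($target =~ /([ACGU]{$min,$max})\1/g) {
        push @hits, "$1$1";
    }
    return @hits;
}

# 1s=MIN-MAX followed by 1p: a stretch followed by its own reverse.
sub inverted_repeats {
    my ($target, $min, $max) = @_;
    my @hits;
    while ($target =~ /([ACGU]{$min,$max})(??{ scalar reverse $1 })/g) {
        push @hits, $1 . scalar reverse $1;
    }
    return @hits;
}

print join(",", tandem_repeats("ACGACGUU", 1, 10)), "\n";      # prints ACGACG,UU
print join(",", inverted_repeats("ACGGCAUU", 2, 10)), "\n";    # prints ACGGCA
```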

I also want to eventually do this with approximate matching to allow
mismatches, insertions, and deletions.

I am essentially trying to duplicate the PatSearch program (Nucleic Acids
research 2003, 31(13):360 that is available on the web, but not as an
executable I can install locally.

Thanks for the offer to help and let me know what more information you
need from me.
 