Why this Regex not working?

 
 
Looking
Guest
Posts: n/a
 
      09-16-2004
$s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
#$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
print "$s\n";

The second regex works. I wonder why the first regex is not working?
I am trying to get whatever is between the first pair of "" or '' after
content=. This is for parsing the header of HTML pages.

The first regex gave me this:
"this is what i want " asd " sdf " adfa

But I need this:
this is what i want


 
Mark Clements
Guest
Posts: n/a
 
      09-16-2004
Looking wrote:
> $s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
> $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
> #$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
> print "$s\n";
>
> The second regex works. I wonder why the first regex is not working?
> I am trying to get whatever is between the first pair of "" or '' after
> content=. It is parsing the header file of HTML pages.
>
> The first regex gave me this:
> "this is what i want " asd " sdf " adfa
>
> But I need this:
> this is what i want

You may want to check out HTTP::Headers rather than doing this yourself.

With this regex

(this won't work for readers using proportional fonts)

$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
                              ^

The problem is that, for a non-greedy match, the question mark must be
immediately adjacent to the *, i.e. you need to remove the parentheses or
put the ? inside them. As written, (.*)? just makes the group optional.
Also, you don't need the | (pipe symbol) inside [] character classes.

regards,

Mark



 
Anno Siegel
Guest
Posts: n/a
 
      09-16-2004
Looking <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> $s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
> $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
                         ^         ^
Do you actually want to allow | besides " and ' for quotes? I think
you have conflated character class notation and alternation.
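
A minimal sketch of the difference, assuming nothing beyond core Perl (the variable names are only illustrative): inside a character class the pipe is an ordinary character, so ["|'] accepts three characters, not two.

```perl
use strict;
use warnings;

# Inside [...] every character is literal, so ["|'] is a three-character
# class: double quote, pipe, single quote.
my $class_with_pipe = ( '|' =~ /^["|']$/ ) ? 1 : 0;

# Alternation belongs outside a class; (?:"|') accepts only the two quotes.
my $alternation = ( '|' =~ /^(?:"|')$/ ) ? 1 : 0;

# The plain class ["'] says the same thing more briefly.
my $plain_class = ( '|' =~ /^["']$/ ) ? 1 : 0;

print "class with pipe matches '|': $class_with_pipe\n";
print "alternation matches '|':     $alternation\n";
print "plain class matches '|':     $plain_class\n";
```

Only the first pattern accepts a bare | as a "quote", which is probably not what was intended.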

> #$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
> print "$s\n";
>
> The second regex works. I wonder why the first regex is not working?
> I am trying to get whatever is between the first pair of "" or '' after
> content=. It is parsing the header file of HTML pages.


Better use a real HTML parser.

> The first regex gave me this:
> "this is what i want " asd " sdf " adfa
>
> But I need this:
> this is what i want


Simple. /.*/ is greedy, it matches the longest string it can while
still having the rest of the pattern match. So it picks up everything
until the last " or ' in the line. The question mark in /(.*)?/
serves no purpose. You probably meant to put it inside the parentheses:
/(.*?)/. In that position the match will be non-greedy.
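
As a rough illustration of the point above, here is a small sketch that runs both forms against the string from the original post. It uses a plain match instead of s///, and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $input = q{sadf content= "this is what i want " asd " sdf " adfa " sdf};

# Greedy: (.*) runs to the end of the string, then backtracks only as far
# as the LAST quote, so the capture spans several quoted pieces.
my ($greedy) = $input =~ /content=.*?["'](.*)["']/si;

# Non-greedy: (.*?) stops at the FIRST closing quote.
my ($lazy) = $input =~ /content=.*?["'](.*?)["']/si;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The lazy capture keeps the trailing space that sits before the closing quote in the input; trim it separately if that matters.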

Anno
 
 
John W. Krahn
Guest
Posts: n/a
 
      09-16-2004
Looking wrote:
> $s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
> $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
> #$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
> print "$s\n";
>
> The second regex works. I wonder why the first regex is not working?


That is because *, + and ? are greedy and will match as many characters as
possible so (.*) will match everything to the end until the last ", | or '
character. (Why are you trying to match the | character?) You probably want
something like:

$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;


John
--
use Perl;
program
fulfillment
 
 
Looking
Guest
Posts: n/a
 
      09-16-2004
> Looking wrote:
> > $s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
> > $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
> > #$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
> > print "$s\n";
> >
> > The second regex works. I wonder why the first regex is not working?

>
> That is because *, + and ? are greedy and will match as many characters as
> possible so (.*) will match everything to the end until the last ", | or '
> character. (Why are you trying to match the | character?) You probably want
> something like:
>
> $s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;


May I ask what \1 is? I tried searching for \1 on Google, but the
string is too short.
I need to get whatever is between the first 2 pairs of "" or '' after
content=

>
>
> John
> --
> use Perl;
> program
> fulfillment




 
 
Looking
Guest
Posts: n/a
 
      09-16-2004
> Looking wrote:
> > $s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
> > $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
> > #$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
> > print "$s\n";
> >
> > The second regex works. I wonder why the first regex is not working?
> > I am trying to get whatever is between the first pair of "" or '' after
> > content=. It is parsing the header file of HTML pages.
> >
> > The first regex gave me this:
> > "this is what i want " asd " sdf " adfa
> >
> > But I need this:
> > this is what i want

> You may want to check out HTTP::Headers rather than doing this yourself.
>


If you mean HTML::HeadParser, I tried it and it is not working.

This is the sample it gave:
$h = HTTP::Headers->new;
$p = HTML::HeadParser->new($h);
$p->parse(<<EOT);
<title>Stupid example</title>
<base href="http://www.linpro.no/lwp/">
Normal text starts here.
EOT
undef $p;
print $h->title; # should print "Stupid example"

I tried to use $h->description, but it does not return anything. I am trying
to get keywords, description etc., but got nothing.
If you know where the bugs are, let me know.


 
 
Looking
Guest
Posts: n/a
 
      09-16-2004

> > $s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
> > $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
> > #$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
> > print "$s\n";
> >
> > The second regex works. I wonder why the first regex is not working?

>
> That is because *, + and ? are greedy and will match as many characters as
> possible so (.*) will match everything to the end until the last ", | or '
> character. (Why are you trying to match the | character?) You probably want
> something like:
>
> $s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;
>


By the way, I assume \1 is the same as $1 but on the left side. Your code is
not working: it does not match anything, although I think your idea is right.

$s=qq( "sadf content= "this is what i' want " asd " sdf " adfa " sdf' );
#$s =~ s/.*content=.*?["'](.*?)["'].*/$1/si;
$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;
print "$s\n";

I hope it can return
this is what i' want
but yours return
"sadf content= "this is what i' want " asd " sdf " adfa " sdf'
so, no match.


 
 
Mark Clements
Guest
Posts: n/a
 
      09-16-2004
Looking wrote:

>> $s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;
>
> May I ask what \1 is? I am trying to do a search of \1 on google but this
> string is too short.
> I need to get whatever is between the first 2 pairs of "" or '' after
> content=

You need to read up on regexps. Check out

man perlre

For the record, \1 is a backreference, i.e. it refers to a previously
matched and captured part of the regexp.

so

(["'])([^\1]*)[\1]

matches " or ', followed by any character other than these zero or more
times, followed by whichever of " and ' was matched the first time.

\1, \2 etc are typically used within the regexp itself, and $1, $2 etc
outside it (or in the second part of a s/// operation).

Mark
 
 
Mark Clements
Guest
Posts: n/a
 
      09-16-2004
Looking wrote:

> If you mean HTML::HeadParser
> I tried it and it is not working!.

Er - I misread your requirement as parsing HTTP headers rather than the
<HEAD> section of an HTML document. Sorry for leading you down the wrong
path. Try this


use strict;
use warnings;

use HTML::HeadParser;

my $p = HTML::HeadParser->new();
$p->parse(<<EOT);
<title>Stupid example</title>
<base href="http://www.linpro.no/lwp/">
Normal text starts here.
EOT
print $p->header("title");
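
For the description and keywords asked about earlier in the thread, the following is a sketch only. It assumes HTML::HeadParser is installed and relies on its documented behaviour of exposing <meta name="..."> values as headers with an X-Meta- prefix. The sample document and its meta values are invented for illustration.

```perl
use strict;
use warnings;

use HTML::HeadParser;

# Hypothetical sample document; the meta values are made up.
my $html = <<'EOT';
<html><head>
<title>Stupid example</title>
<meta name="description" content="A short page summary">
<meta name="keywords" content="perl, regex, html">
</head>
<body>Normal text starts here.</body></html>
EOT

my $p = HTML::HeadParser->new;
$p->parse($html);

# <title> is stored as the Title header; <meta name="foo"> values are
# stored under the header name X-Meta-Foo.
my $title       = $p->header('Title');
my $description = $p->header('X-Meta-Description');
my $keywords    = $p->header('X-Meta-Keywords');

print "$title\n$description\n$keywords\n";
```

There is no description method to call here; the meta values are reached through header() with the prefixed name.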

 
 
Jeff 'japhy' Pinyan
Guest
Posts: n/a
 
      09-16-2004
On Thu, 16 Sep 2004, Mark Clements wrote:

>For the record, \1 is a backreference ie it refers to a previously
>matched and captured part of the regexp.
>
>so
>
>(["'])([^\1]*)[\1]
>
>matches " or ', followed by any character other than these zero or more
>times, followed by whichever of " and ' was matched the first time.


No it doesn't. Character classes are created when the regex is compiled,
but \1 is not known until the regex is EXECUTED. Using \1 inside a
character class is the same as using \x01 or \001, it's the ASCII
character whose ordinal value is 1.
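
One way to get the intended effect, sketched here under the assumption that the value never contains its own delimiter, is to keep \1 outside any character class and make the capture non-greedy. The test string is the one from the earlier post; the variable names are illustrative.

```perl
use strict;
use warnings;

my $input = q{ "sadf content= "this is what i' want " asd " sdf " adfa " sdf' };

# Capture the opening quote in group 1, then require the SAME quote to
# close the value. \1 sits outside any character class here, so it is a
# real backreference; (.*?) stops at the first matching closer.
my ($quote, $value) = $input =~ /content=\s*(["'])(.*?)\1/si;

# Inside a class, \1 is just the character with ordinal value 1.
my $class_is_ctrl_a = ( "\x01" =~ /^[\1]$/ ) ? 1 : 0;

print "quote: [$quote]\n";
print "value: [$value]\n";
print "[\\1] matches chr(1): $class_is_ctrl_a\n";
```

Because the closer must equal the opener, the lone ' inside the value does not end the match.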

--
Jeff "japhy" Pinyan % How can we ever be the sold short or
RPI Acacia Brother #734 % the cheated, we who for every service
Senior Dean, Fall 2004 % have long ago been overpaid?
RPI Corporation Secretary %
http://japhy.perlmonk.org/ % -- Meister Eckhart


 