Regexp to match an URL in an HTML <a href=""></a> tag

 
 
Charles Nadeau
11-15-2003
Hello,

I am trying to craft a regular expression to extract a URL from an <a
href=""></a> tag, and the one I have doesn't seem right.
I use the regular expression from this snippet of code:

foreach my $message (@messages)
{
    my @match = ($message->decoded =~ /\bhref="(http.*)">.*/gi);

    foreach my $match (@match)
    {
        print $match, "\n";
    }
}

but the results are not exactly what I need. An excerpt of the output
looks like:

http://2%30%33.197.%3204.1%355/mout/
http://www.superrxsalesman.info/aff1/?mulish
http://www.superrxsalesman.info/aff1/?acme
http://www.superrxsalesman.info/aff1/?blister
http://www.superrxsalesman.info/aff1/?samba
http://www.superrxsalesman.info/aff1/?depot"><font color="#0033CC
http://www.superrxsalesman.info/aff1/?procter"><font color="#0033CC
http://www.superrxsalesman.info/aff1/?use"><font color="#0033CC
http://www.superrxsalesman.info/aff1/?butane"><font color="#0033CC
http://www.superrxsalesman.info/aff1/?fiche"><font color="#0033CC

The first 5 lines are exactly what I want, but I don't understand why the
following lines include the closing " and the characters after it. Basically,
I want to keep only what is between the quotes of the href="" attribute.
Could anybody tell me what is wrong with my regular expression?
Thanks!

Charles

--
Charles-E. Nadeau Ph.D
http://radio.weblogs.com/0111823/
 
 
 
 
 
Gunnar Hjalmarsson
11-15-2003
Charles Nadeau wrote:
> I am trying to craft a regular expression to extract a URL from an
> <a href=""></a> tag, and the one I have doesn't seem right. I use
> the regular expression from this snippet of code:
>
> foreach my $message (@messages)
> {
> my @match=($message->decoded=~/\bhref="(http.*)">.*/gi);
>
> foreach my $match(@match)
> {
> print $match,"\n";
> }
>
> }
>
> but it doesn't lead to results that are exactly what I need.


http://theoryx5.uwinnipeg.ca/CPAN/pe...ract_URLs.html

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

 
 
 
 
 
Andy R
11-15-2003

"Charles Nadeau" wrote in message:
> Hello,
>
> I am trying to craft a regular expression to extract a URL from an <a
> href=""></a> tag, and the one I have doesn't seem right.
> I use the regular expression from this snippet of code:
>
> foreach my $message (@messages)
> {
> my @match=($message->decoded=~/\bhref="(http.*)">.*/gi);
>
> foreach my $match(@match)
> {
> print $match,"\n";
> }
>
> }
>
> but it doesn't lead to results that are exactly what I need. An excerpt of
> what I get as an output looks like:
>
> http://2%30%33.197.%3204.1%355/mout/
> http://www.superrxsalesman.info/aff1/?mulish
> http://www.superrxsalesman.info/aff1/?acme
> http://www.superrxsalesman.info/aff1/?blister
> http://www.superrxsalesman.info/aff1/?samba
> http://www.superrxsalesman.info/aff1/?depot"><font color="#0033CC
> http://www.superrxsalesman.info/aff1/?procter"><font color="#0033CC
> http://www.superrxsalesman.info/aff1/?use"><font color="#0033CC
> http://www.superrxsalesman.info/aff1/?butane"><font color="#0033CC
> http://www.superrxsalesman.info/aff1/?fiche"><font color="#0033CC
>
> The first 5 lines are exactly what I want, but I don't understand why the
> following lines include the closing " and the characters after it. Basically,
> I want to keep only what is between the quotes of the href="" attribute.
> Could anybody tell me what is wrong with my regular expression?
> Thanks!
>
> Charles
>
> --
> Charles-E. Nadeau Ph.D
> http://radio.weblogs.com/0111823/


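Your .* is greedy: it matches as much as it can, so when a line contains
more than one "> the capture runs up to the last one. Use a ? to make the
match non-greedy, i.e.:

my @match=($message->decoded=~/\bhref="(http.*?)">.*/gi);

That should work, though I've not tested it.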
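
A rough, self-contained sketch of the difference, for anyone who wants to try it. The sample line in $html is made up to resemble the output in the question (it is not from a real message), and the trailing .* is dropped from the patterns: since . does not match a newline, that trailing .* swallows the rest of the line and would hide a second href on the same line. The third pattern, using a negated character class, is an alternative I'm suggesting rather than something from the original code.

```perl
use strict;
use warnings;

# Hypothetical sample line with two links, modeled on the output in the question.
my $html = '<a href="http://www.superrxsalesman.info/aff1/?depot"><font color="#0033CC">x</font></a>'
         . ' <a href="http://www.superrxsalesman.info/aff1/?acme">y</a>';

# Greedy: .* grabs the rest of the line, then backtracks to the LAST "> it can find.
my @greedy = $html =~ /\bhref="(http.*)">/gi;

# Non-greedy: .*? stops at the FIRST "> after the URL.
my @lazy = $html =~ /\bhref="(http.*?)">/gi;

# Negated class: [^"]* can never run past the closing quote of the attribute.
my @negated = $html =~ /\bhref="(http[^"]*)"/gi;

print "greedy:  $_\n" for @greedy;
print "lazy:    $_\n" for @lazy;
print "negated: $_\n" for @negated;
```

On this sample the greedy pattern should return a single over-long capture, while the other two return the two URLs on their own. For arbitrary real-world HTML (single-quoted or unquoted attributes, extra attributes after href), a parser module such as HTML::LinkExtor is likely to be more robust than any of these patterns.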

Andy R


 
 
 
 