Regexp kicking my ass

 
 
Tuc
Guest
Posts: n/a
 
      01-27-2005
Hi,

I'm trying to get a regexp to make a match, and it's not working,
and it's kicking my ass. The text I'm going against is :

$text='<div id="sr_SearchResultsPageNavTop"> <div
id="sr_SaveSearchImage"><img
src="http://images.match.com/match//search/sr_NavIconPlaceHolder.gif"
width="15
" height="12" alt="" border="0"></div> <div
id="sr_ViewPhotoGalleryText"><a
href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0
&RN=2102522&lid=7&PN=1&DO=2" class="cssGlobalLinks_PageNav"
id="lnkSaveThisSearch">viewas photo gallery</a></div> <div
id="sr_Pagination"><span
class="cssGlobalSysText_LightGray">page&nbsp;</span><a
href="some.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=1&DO=0"
class="cssSr_PaginationCurrentPage" id="lnkPage">1</a><a
href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=2&DO=0"class="cssSr_PageNav"
id="lnkPage">2</a><a
href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=3&DO=0"
class="cssSr_PageNav" id="lnkPage">';

What I'm looking for is the URL between the href and
cssSr_PaginationCurrentPage. When I do it, it ends up starting at
the first href and going all the way to the
cssSr_PaginationCurrentPage. I've tried \b, I've tried {}, I tried
()'s.... And I just can't get it to get the one URL of
some.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=1&DO=0

How am I to tell it to start at the cssSr_PaginationCurrentPage
and work backwards to the first instance of href=" ?


Thanks, Tuc

 
 
 
 
 
Gunnar Hjalmarsson
Guest
Posts: n/a
 
      01-27-2005
Tuc wrote:
> I'm trying to get a regexp to make a match, and it's not working,
> and it's kicking my ass. The text I'm going against is :
>
> $text='<div id="sr_SearchResultsPageNavTop"> <div
> id="sr_SaveSearchImage"><img
> src="http://images.match.com/match//search/sr_NavIconPlaceHolder.gif"
> width="15
> " height="12" alt="" border="0"></div> <div
> id="sr_ViewPhotoGalleryText"><a
> href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0
> &RN=2102522&lid=7&PN=1&DO=2" class="cssGlobalLinks_PageNav"
> id="lnkSaveThisSearch">viewas photo gallery</a></div> <div
> id="sr_Pagination"><span
> class="cssGlobalSysText_LightGray">page&nbsp;</span><a
> href="some.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=1&DO=0"
> class="cssSr_PaginationCurrentPage" id="lnkPage">1</a><a
> href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=2&DO=0"class="cssSr_PageNav"
> id="lnkPage">2</a><a
> href="come.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=3&DO=0"
> class="cssSr_PageNav" id="lnkPage">';
>
> What I'm looking for is the url between the href and
> cssSr_PaginationCurrentPage .


This may or may not work:

if ( $text =~ /<a\s+href\s*=\s*
    (?:(?:(["'])(\S+)\1)|(\S+))
    [^>]*class\s*=\s*(?:["'])?cssSr_PaginationCurrentPage/x ) {
    print $+;
}
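
A note on the `print $+` above: `$+` holds the text of the highest-numbered capture group that took part in the match, so it yields the URL whichever alternative matched. As a minimal sketch (assuming Perl 5.10 or later; the two sample tags are made up and shortened, not taken from the original text), a branch-reset group `(?|...)` puts the URL in `$1` for every alternative instead:

```perl
use strict;
use warnings;

# Hypothetical shortened samples: one quoted href, one unquoted.
my @samples = (
    '<a href="some.aspx?PN=1" class="cssSr_PaginationCurrentPage">',
    '<a href=some.aspx?PN=1 class=cssSr_PaginationCurrentPage>',
);

my @urls;
for my $tag (@samples) {
    if ( $tag =~ /<a\s+href\s*=\s*
                  (?| "([^"]*)"      # double-quoted URL
                    | '([^']*)'      # single-quoted URL
                    | ([^\s>]+)      # non-quoted URL
                  )
                  [^>]*class\s*=\s*["']?cssSr_PaginationCurrentPage/x ) {
        push @urls, $1;    # same group number in every branch
    }
}
print "$_\n" for @urls;
```

Using `[^"]*` rather than `\S+` for the quoted case also stops the capture at the closing quote without relying on backtracking.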

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
 
 
 
 
terry l. ridder
Guest
Posts: n/a
 
      01-27-2005
On Wed, 26 Jan 2005, Tuc wrote:

> Hi,
>
> I'm trying to get a regexp to make a match, and its not working,
> and its kicking my ass. The text I'm going against is :
>

<snip>
>
> What I'm looking for is the url between the href and
> cssSr_PaginationCurrentPage. When I do it, it ends up starting at
> the first href and going all the way to the
> cssSr_PaginationCurrentPage. I've tried \b, I've tried {}, I tried
> ()'s.... And I just can't get it to get the one URL of
> some.aspx?sid=A1065D66-8275-47BE-85F2-AC161E2D6D26&theme=214&trackingid=0&RN=2102522&lid=8&PN=1&DO=0
>
> How am I to tell it to start at the cssSr_PaginationCurrentPage
> and work backwards to the first instance of href="
>


perhaps you need to 'divide and conquer'.

this works for me.

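use strict;
use warnings;

# $text is assumed to hold the HTML from the original post.
# Capture from the first href=" up to the class of interest.
if ( $text =~ /href="(.*?)class="cssSr_PaginationCurrentPage/s )
{
    my $url = $1;
    # Greedy .* drops everything up to the last href=" in the capture.
    $url =~ s/^.*href="//s;
    # Strip the closing quote and any trailing whitespace.
    $url =~ s/"\s*$//;
    print STDOUT "url == ``". $url . "''\n";
}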
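
The clean-up substitutions above are needed because the regex engine always reports the leftmost match: `.*?` is non-greedy about where it stops, but the match still begins at the first `href="` in the text. Here is a minimal sketch of an alternative, assuming the URL is double-quoted and that `class` follows `href` inside the same tag. The sample text is a shortened, made-up stand-in for the original `$text`. Negated character classes keep the match from running across a closing quote or a `>`:

```perl
use strict;
use warnings;

# Shortened stand-in for $text: three links, only the second is the current page.
my $text = join "\n",
    '<a href="come.aspx?PN=1&DO=2" class="cssGlobalLinks_PageNav">gallery</a>',
    '<a href="some.aspx?PN=1&DO=0"',
    'class="cssSr_PaginationCurrentPage" id="lnkPage">1</a>',
    '<a href="come.aspx?PN=2&DO=0" class="cssSr_PageNav" id="lnkPage">2</a>';

# Non-greedy .*? still starts at the FIRST href=" it can match from.
my ($too_much) = $text =~ /href="(.*?)"\s*class="cssSr_PaginationCurrentPage/s;

# [^"]* cannot cross the URL's closing quote and [^>]* cannot leave the tag,
# so an attempt starting at the wrong href fails and the engine moves on.
my ($url) = $text =~ /href="([^"]*)"[^>]*class="cssSr_PaginationCurrentPage/;

print "non-greedy: $too_much\n";
print "negated classes: $url\n";
```

The first capture begins with `come.aspx` and runs on into the next tag, while the second holds only the `some.aspx` URL.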

>
>
> Thanks, Tuc
>
>


--
terry l. ridder ><>
 
 
terry l. ridder
Guest
Posts: n/a
 
      01-27-2005
On Thu, 27 Jan 2005, Gunnar Hjalmarsson wrote:

>
> This may or may not work:
>
> if ( $text =~ /<a\s+href\s*=\s*
>     (?:(?:(["'])(\S+)\1)|(\S+))
>     [^>]*class\s*=\s*(?:["'])?cssSr_PaginationCurrentPage/x ) {
>     print $+;
> }
>
>


that works rather well.
beats my 'divide and conquer' approach.

--
terry l. ridder ><>
 
 
Gunnar Hjalmarsson
Guest
Posts: n/a
 
      01-27-2005
terry l. ridder wrote:
> On Thu, 27 Jan 2005, Gunnar Hjalmarsson wrote:
>> This may or may not work:
>>
>> if ( $text =~ /<a\s+href\s*=\s*
>>     (?:(?:(["'])(\S+)\1)|(\S+))
>>     [^>]*class\s*=\s*(?:["'])?cssSr_PaginationCurrentPage/x ) {
>>     print $+;
>> }

>
> that works rather well.


A shorter (and clearer) variant would be:

if ( $text =~ /href\s*=\s*
    (?:
        (?:
            (["'])(\S+)\1    # quoted URL
        )
        |
        (\S+)                # non-quoted URL
    )
    [^>]+cssSr_PaginationCurrentPage/x ) {
    print $+;
}

Yeah, it works, provided that

1) the class attribute actually does come after the href attribute, and

2) no 'weird' attribute such as

someattr="x > z"

has been put in between.

Which I suppose illustrates Bob's point that it *is* difficult to parse
HTML with regular expressions...
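
Both caveats can be softened, though not removed, by working in two steps: first grab each whole `<a ...>` tag with a pattern that skips over quoted attribute values (so a `>` inside quotes does not end the tag), then read the attributes into a hash and test `class` there, regardless of order. This is a rough sketch in core Perl (5.10 or later for the branch-reset group), not a real HTML parser; the sample tags and the helper name `current_page_href` are invented for illustration.

```perl
use strict;
use warnings;

# Return the href of the first <a> tag whose class is cssSr_PaginationCurrentPage.
sub current_page_href {
    my ($html) = @_;

    # Step 1: each <a ...> tag; quoted values may contain '>'.
    while ( $html =~ /<a\b((?:"[^"]*"|'[^']*'|[^>"'])*)>/gi ) {
        my $attr_text = $1;
        my %attr;

        # Step 2: name=value pairs in any order, quoted or bare.
        while ( $attr_text =~ /([\w:-]+)\s*=\s*(?|"([^"]*)"|'([^']*)'|([^\s"'>]+))/g ) {
            $attr{ lc $1 } = $2;
        }

        return $attr{href}
            if defined $attr{class}
            && $attr{class} eq 'cssSr_PaginationCurrentPage'
            && defined $attr{href};
    }
    return;
}

# Made-up sample: class before href, plus an attribute containing '>'.
my $html = '<a href="come.aspx?PN=2" class="cssSr_PageNav">2</a>'
         . '<a class="cssSr_PaginationCurrentPage" someattr="x > z" href="some.aspx?PN=1">1</a>';

my $found = current_page_href($html);
print defined $found ? "$found\n" : "no match\n";
```

Malformed markup, comments, or script blocks containing `<a` would still trip this up, which is where a dedicated HTML parsing module becomes the safer choice.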

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 