Regular expression, getting href which is followed by img tag with specific src

 
 
fatted
Guest
Posts: n/a
 
      08-20-2003
From an HTML file, I'd like to extract the href value of an <a> tag which
contains an <img> tag whose src value I'm searching on.

Basically (but there's more!):
<a href="IwantThis.html"><img src="importantimage.gif"></a>

(Un)Interesting part:
I first match a line from the html file containing importantimage.gif,
I then try to find my href value on this line.
But this line contains multiple <a> tags, (which have href values and
might also have an <img> tag with associated src value). Also all of
the <a> tags and <img> tags have more than one attribute.
So the line actually looks something like this:
<a class="red" href="uninteresting.html" target="_new">Not so exciting
text</a><a href="equallyboring.html" class = "blue">yawn</a><a
class="green" href="IwantThis.html"><img border="0"
src="importantimage.gif" alt="MeMe"></a>

My code:

use warnings;
use strict;

open(FILE,"<","4body.html");
while(<FILE>)
{
    my $line = $_;
    if($line =~ /importantimage\.gif/i)
    {
        if($line =~ /<a.+?href="(.+?)".+?src="importantimage\.gif".+?><\/a>/)
        {
            print $1."\n";
        }
    }
}

which results in:

uninteresting.html

I think I understand why it gets this value, but I can't get the value
I want
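
A note for readers following along: a non-greedy quantifier makes each piece of the pattern as short as possible, but the regex engine still reports the leftmost position at which the whole pattern can succeed. Here that is the very first <a on the line, so the first href is captured. A minimal sketch, assuming the sample line above joined back onto one line (the variable names are only for illustration):

```perl
use strict;
use warnings;

# The sample line from the post, on a single line.
my $line = '<a class="red" href="uninteresting.html" target="_new">Not so exciting text</a>'
         . '<a href="equallyboring.html" class = "blue">yawn</a>'
         . '<a class="green" href="IwantThis.html"><img border="0" src="importantimage.gif" alt="MeMe"></a>';

my ($start, $href);
if ( $line =~ /<a.+?href="(.+?)".+?src="importantimage\.gif".+?><\/a>/ ) {
    $start = $-[0];   # offset at which the overall match began
    $href  = $1;      # first capture group
}

# The match begins at offset 0 (the first <a), so the first href is captured.
print "match begins at offset $start and captures $href\n";
```

Nothing in the pattern stops .+? from running across </a> and into the next tag, which is why the match can stretch from the first link all the way to the image.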
 
 
 
 
 
codyhess
Guest
Posts: n/a
 
      08-20-2003

Your parentheses are set to capture the first bit of ".+" in the scalar.
If you want the third link you should make your expression more
specific. Instead of

if($line =~ /<a.+?href="(.+?)".+?src="importantimage\.gif".+?><\/a>/)

try

if($line =~ /<a.+?href=".+?".+?href=".+".+href="(.+).+src="importantimage\.gif".+?><\/a>/)



Why are you using .+? instead of .+

uh....?
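
On the closing question: .+ takes as much as it can and gives characters back only when the rest of the pattern forces it to, while .+? takes as little as it can. Inside a capture the difference is easy to see. A small sketch on a made-up tag (the string and variable names are assumptions for illustration, not data from the thread):

```perl
use strict;
use warnings;

# Made-up tag with two quoted attributes.
my $tag = '<a href="one.html" class="x">';

my ($greedy) = $tag =~ /href="(.+)"/;    # runs on to the LAST double quote
my ($lazy)   = $tag =~ /href="(.+?)"/;   # stops at the FIRST closing quote

print "greedy: $greedy\n";   # greedy: one.html" class="x
print "lazy:   $lazy\n";     # lazy:   one.html
```

For the same reason, a greedy (.+) capture placed before src= can be expected to run past the closing quote of the href value.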


--
print "Aspiring to be just another perl hacker,"


Posted via http://dbforums.com
 
 
 
 
 
Fatted
Guest
Posts: n/a
 
      08-20-2003
"Tad McClellan" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> fatted <(E-Mail Removed)> wrote:


> You should use a module that understands HTML for processing HTML data.


Unfortunately I don't think that will help me with my problem, I want to
extract the value of a href, for an <a> tag, preceding an <img> tag which
has an attribute src with a specific value. I'm not sure what module does
this. (I'm going to look again though!)

> > Basically (but theres more!):
> ><a href="IwantThis.html"><img src="importantimage.gif"></a>
> >
> > (Un)Interesting part:
> > I first match a line

>
>
> "lines" do not matter in HTML.


Thanks for the reminder. However, if I were to use perl to parse a plain
text file (which just happened to contain html), "lines" do matter. I
first wanted to find the line (thereby ignoring all the rest of the html)
which contained the <img src="importantimage.gif" (there just happens to be
lots of tags on this line), and then try to find the preceding value of the
<a> tag's href. I was trying to break the problem down (in my own little way).


> > So the line

> ^^^^^^^^
>
> "the line" is singular, you didn't post 1 line, you posted 4 lines.


I posted 1 line (at least that was the attempt), unfortunately Google groups
did a bit of a hatchet job on it, and it got spread over 4 lines. That's why
I referred to one line.

> > actually looks something like this:
> ><a class="red" href="uninteresting.html" target="_new">Not so exciting
> > text</a><a href="equallyboring.html" class = "blue">yawn</a><a
> > class="green" href="IwantThis.html"><img border="0"
> > src="importantimage.gif" alt="MeMe"></a>

>
>
> If that _was_ really all on a single line, then it would still be
> equivalent HTML, since most whitespace does not matter in HTML data.
>
> <br>
> and
> <br >
> and
> <br
> >

>
> Are all the same HTML data.


Revision is always good

> > open(FILE,"<","4body.html");

>
>
> You should always, yes *always*, check the return value from open():


I know, I know, but I was working just on the regular expression in a tester
script, so it'd be obvious if there was a file problem (my real script does
check the return value. Honest). Good habits are good habits though.

> open(FILE, '<', '4body.html') or die "could not open '4body.html' $!";
>
>
> > while(<FILE>)
> > {
> > my $line = $_;

>
>
> If you want it in $line instead of $_ then you can put it
> in $line straightaway:
>
> while ( my $line = <FILE> )


Good point.
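
The two suggestions quoted above (check open, read straight into $line) can be put together roughly as follows. This is only a sketch: it writes a small temporary file as a stand-in for 4body.html so that it runs on its own, and it uses a lexical filehandle, which is a stylistic choice that was not part of the quoted advice.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for 4body.html so the sketch is self-contained (made-up content).
my ($tmp, $path) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$tmp} qq{<p>nothing here</p>\n},
             qq{<a href="IwantThis.html"><img src="importantimage.gif"></a>\n};
close $tmp or die "could not close '$path': $!";

# Three-argument open with a checked return value.
open my $fh, '<', $path or die "could not open '$path': $!";

my @hits;
while ( my $line = <$fh> ) {     # read directly into $line, no copy from $_
    next unless $line =~ /importantimage\.gif/i;
    push @hits, $line;
}
close $fh;

print scalar(@hits), " matching line(s)\n";
```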

> This will NOT do what you asked, because it does not handle
> arbitrary HTML, it handles only the one case that you have shown.


You're right it won't do what I asked, I think the google wrap, put you off.

> It can be easily broken by legal HTML.


I'll try to keep my HTML as bad as my perl code

> It would work correctly if I had used a module that understands
> HTML data...


See my first comment, but I'd be delighted to be proved wrong. In the mean
time, I'd still appreciate some tips on the regular expression...

> ------------------------------------
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> my $html = '
> <a class="red" href="uninteresting.html" target="_new">Not so exciting
> text</a><a href="equallyboring.html" class = "blue">yawn</a><a
> class="green" href="IwantThis.html"><img border="0"
> src="importantimage.gif" alt="MeMe"></a>';
>
>
> while ( $html =~ m#(<a\s.*?</a>)#sg ) {
>     my $anchor = $1;
>     next unless $anchor =~ /src="importantimage\.gif"/;
>
>     print "$1\n" if $anchor =~ /href="([^"]*)/;
> }





 
 
Tad McClellan
Guest
Posts: n/a
 
      08-20-2003
Fatted <(E-Mail Removed)> wrote:
> "Tad McClellan" <(E-Mail Removed)> wrote in message
> news:(E-Mail Removed)...
>> fatted <(E-Mail Removed)> wrote:

>
>> You should use a module that understands HTML for processing HTML data.

>
> Unfortunately I don't think that will help me with my problem,



Yes it will. That is why I suggested it.


> I want to
> extract the value of a href, for an <a> tag, preceding an <img> tag which
> has an attribute src with a specific value. I'm not sure what module does
> this. (I'm going to look again though!)



I understood what you wanted to do quite clearly, that's why the
code that I already posted does just what you describe above!

Did you run the program?


>> "lines" do not matter in HTML.

>
> Thanks for the reminder



But you are going to forget it again before you get to the
end of your followup...


> I
> first wanted to find the line



If you think of "lines" when processing HTML you aren't thinking
correctly, and it will hurt you at some point.

So don't do that.


> which contained the <img src="importantimage.gif" (there just happens to be
> lots of tags on this line), and then try to find the preceding value of the
><a> tags href.



That is what my code does.


> I posted 1 line (at least that was the attempt), unfortunately Google groups
> did a bit of a hatchet job on it, and it got spread over 4 lines. That's why
> I referred to one line.



Yes I expected that that is what happened.

Have you seen the Posting Guidelines that are posted here frequently?

If you had said it "in Perl" then you could have conveyed your
actual data without "helpful" tools (attempting to) break it for you.


$html = '<a class="red" href="uninteresting.html" target="_new">'
. 'Not so exciting text</a><a href="equallyboring.html" '
. 'class = "blue"> ...';


>> If that _was_ really all on a single line, then it would still be
>> equivalent HTML, since most whitespace does not matter in HTML data.



>> This will NOT do what you asked, because it does not handle
>> arbitrary HTML, it handles only the one case that you have shown.

>
> You're right it won't do what I asked,



You're wrong, it *will* do what you asked.

Did you run the program?

It prints

IwantThis.html

isn't that what you wanted to be able to find?

But it will not work for real-world HTML, only for the specific
example of HTML that you posted. This legal HTML would break
it for instance:

<a class="green" href="Ido*NOT*wantThis.html">
<!-- src="importantimage.gif" -->
</a>

Whereas a Real HTML parser would not report that false positive.
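
To see the false positive concretely, the regex loop from earlier in the thread can be run against the commented-out example. A sketch, with the results collected in an array (the @found name is an addition for illustration) so they can be inspected:

```perl
use strict;
use warnings;

# The counter-example from the post: src appears only inside a comment.
my $html = <<'HTML';
<a class="green" href="Ido*NOT*wantThis.html">
<!-- src="importantimage.gif" -->
</a>
HTML

my @found;
while ( $html =~ m#(<a\s.*?</a>)#sg ) {
    my $anchor = $1;
    # A plain regex cannot tell a comment from a real <img> element.
    next unless $anchor =~ /src="importantimage\.gif"/;
    push @found, $1 if $anchor =~ /href="([^"]*)/;
}

print "reported: $_\n" for @found;   # the link is reported although it holds no image
```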


> I think the google wrap, put you off.



No it didn't.

First, my code does exactly what you asked for with the data you gave.
(and if you modify the data to be all on one line, it will _still_
do the Right Thing.)

Did you run the program?

Secondly, the word-wrapping did *not* break anything, because the
HTML is equivalent whether wrapped or all on a single line.

Your code should be able to handle HTML, and line breaks don't matter
in HTML, so your code should be able to handle the data either way.
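
If a regex is used at all, one way to stop depending on line boundaries is to hold the whole document in one scalar and write the pattern so that whitespace inside a tag, newlines included, is harmless. A sketch under the assumption that attribute values never contain > (find_href is a made-up helper name, and this is still not a substitute for a parser):

```perl
use strict;
use warnings;

# The sample exactly as it was wrapped in the post (four lines).
my $wrapped = <<'HTML';
<a class="red" href="uninteresting.html" target="_new">Not so exciting
text</a><a href="equallyboring.html" class = "blue">yawn</a><a
class="green" href="IwantThis.html"><img border="0"
src="importantimage.gif" alt="MeMe"></a>
HTML

# The same data with every newline turned into a space (one line).
(my $single = $wrapped) =~ tr/\n/ /;

# Hypothetical helper: [^>] and \s both match newlines, so a tag that is
# wrapped across lines is treated the same as one that is not.
sub find_href {
    my ($html) = @_;
    my ($href) = $html =~ m{<a\s[^>]*?href="([^"]*)"[^>]*>\s*<img\s[^>]*?src="importantimage\.gif"};
    return $href;
}

print find_href($wrapped), "\n";
print find_href($single),  "\n";
```

Because [^>]* cannot cross a tag boundary, the href and the src have to sit in directly adjacent tags, which also keeps the match from starting at an earlier, unrelated link.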


>> It would work correctly if I had used a module that understands
>> HTML data...

>
> See my first comment, but I'd be delighted to be proved wrong.

^^^^^^^^^^^^

I'll do that a little farther down.


> In the mean
> time, I'd still appreciate some tips on the regular expression...



Trying to accomplish what you want with regular expressions is the
path to madness. You can work on it for many days and it will
still be easily broken by legal HTML data.

I know, I've been doing this sort of thing for 13 years.

Regexes are not sufficiently powerful for the job you need done.

You need a Real Parser.


[snip working code]

You can do it in less than 10 lines of code with HTML::Tree

http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/


---------------------------------------------------------
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = '
<a class="red" href="uninteresting.html" target="_new">Not so exciting
text</a><a href="equallyboring.html" class = "blue">yawn</a><a
class="green" href="IwantThis.html"><img border="0"
src="importantimage.gif" alt="MeMe"></a>
';

# $html =~ s/\n/ /g; # make it all on one line

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
$tree->eof;   # tell the parser the document is complete

# find elements containing: src="importantimage.gif"
foreach my $img ( $tree->look_down('src', 'importantimage.gif') ) {
    next unless $img->tag eq 'img';          # ensure the "src" attr was on
                                             # an <img> element

    next unless $img->parent->tag eq 'a';    # ensure parent is an <a> element
    my $href = $img->parent->attr('href');   # grab its "href" attr value

    print "$href\n";
}

$tree->delete;
---------------------------------------------------------


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
 
fatted
Guest
Posts: n/a
 
      08-21-2003
(E-Mail Removed) (Tad McClellan) wrote in message news:<(E-Mail Removed)>...
> Fatted <(E-Mail Removed)> wrote:
> > "Tad McClellan" <(E-Mail Removed)> wrote in message
> > news:(E-Mail Removed)...
> >> fatted <(E-Mail Removed)> wrote:

>
> >> You should use a module that understands HTML for processing HTML data.

> >
> > Unfortunately I don't think that will help me with my problem,

>
>
> Yes it will. That is why I suggested it.


Perhaps. I meant that I couldn't see *how* it would help with my
problem.

>
> > I want to
> > extract the value of a href, for an <a> tag, preceding an <img> tag which
> > has an attribute src with a specific value. I'm not sure what module does
> > this. (I'm going to look again though!)

>
>
> I understood what you wanted to do quite clearly, that's why the
> code that I already posted does just what you describe above!
>
> Did you run the program?


I did, but some idiot copy-pasted incorrectly. When I catch that
guy...

>
> >> "lines" do not matter in HTML.

> >
> > Thanks for the reminder

>
>
> But you are going to forget it again before you get to the
> end of your followup...


Just put the gun down son... No I really do understand how HTML works.
I talked about a line, because, I am absolutely sure that the <a><img
/></a> tags which I'm interested in are always on one text line from
the html file.

> > I
> > first wanted to find the line

>
>
> If you think of "lines" when processing HTML you aren't thinking
> correctly, and it will hurt you at some point.
>
> So don't do that.


No more please

>
>
> > which contained the <img src="importantimage.gif" (there just happens to be
> > lots of tags on this line), and then try to find the preceding value of the
> ><a> tags href.

>


<snip>


> You can do it in less than 10 lines of code with HTML::Tree
>
> http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/
> ---------------------------------------------------------
> #!/usr/bin/perl
> use strict;
> use warnings;
> use HTML::TreeBuilder;
>
> my $html = '
> <a class="red" href="uninteresting.html" target="_new">Not so exciting
> text</a><a href="equallyboring.html" class = "blue">yawn</a><a
> class="green" href="IwantThis.html"><img border="0"
> src="importantimage.gif" alt="MeMe"></a>
> ';
>
> # $html =~ s/\n/ /g; # make it all on one line
>
> my $tree = HTML::TreeBuilder->new();
> $tree->parse($html);
> $tree->eof;   # tell the parser the document is complete
>
> # find elements containing: src="importantimage.gif"
> foreach my $img ( $tree->look_down('src', 'importantimage.gif') ) {
>     next unless $img->tag eq 'img';          # ensure the "src" attr was on
>                                              # an <img> element
>
>     next unless $img->parent->tag eq 'a';    # ensure parent is an <a> element
>     my $href = $img->parent->attr('href');   # grab its "href" attr value
>
>     print "$href\n";
> }
>
> $tree->delete;
> ---------------------------------------------------------


Thanks.

I also figured out what was wrong (keep the list short) with the
regular expression in my original post. I had:

if($line =~ /<a.+?href="(.+?)".+?src="importantimage\.gif".+?><\/a>/)

But if I'd tried:

if($line =~ /<a.+href="(.+?)".+?src="importantimage\.gif".+><\/a>/)

I would have managed. Although I'll have to think about that a bit
more.
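
Some of that thinking, sketched out: the greedy .+ after <a first swallows the rest of the line and then backs up, so the href that is tried first is the last one on the line. That gives the wanted answer for the sample, but the pattern can still pair an href with an image that its link does not contain. The second input below is made up to show this, and greedy_href is a hypothetical wrapper name:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the revised pattern from the post.
sub greedy_href {
    my ($line) = @_;
    my ($href) = $line =~ /<a.+href="(.+?)".+?src="importantimage\.gif".+><\/a>/;
    return $href;
}

# The sample line from the thread, on one line.
my $sample = '<a class="red" href="uninteresting.html" target="_new">Not so exciting text</a>'
           . '<a href="equallyboring.html" class = "blue">yawn</a>'
           . '<a class="green" href="IwantThis.html"><img border="0" src="importantimage.gif" alt="MeMe"></a>';

# Made-up line: the image is NOT inside any link, and a different link follows it.
my $unlinked = '<a href="boring.html">text</a><img src="importantimage.gif" alt="x">'
             . '<a href="z.html"><img src="other.gif"></a>';

print greedy_href($sample),   "\n";   # IwantThis.html
print greedy_href($unlinked), "\n";   # boring.html, although that link does not wrap the image
```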
 