Help on regular expression match

 
 
Johnny Lee
      09-23-2005
Hi,
I've run into a problem matching a regular expression in Python. I hope
one of you can help me. Here are the details:

I have many tags like this:
xxx<a href="http://xxx.xxx.xxx" xxx>xxx
xxx<a href="wap://xxx.xxx.xxx" xxx>xxx
xxx<a href="http://xxx.xxx.xxx" xxx>xxx
.....
And I want to extract every "http://xxx.xxx.xxx" URL, so I do it
like this:
import re
httpPat = re.compile("(<a )(href=\")(http://.*)(\")")
result = httpPat.findall(data)
I use this to inspect the output:
for i in result:
    print i[2]
Surprisingly, I get some output like this:
http://xxx.xxx.xxx">xxx</a>xxx
In fact it comes from this kind of source:
<a href="http://xxx.xxx.xxx">xxx</a>xxx"
But some results are right. How can I get all the
answers clean, like "http://xxx.xxx.xxx"? Thanks for your help.


Regards,
Johnny

 
Fredrik Lundh
      09-23-2005
Johnny Lee wrote:

> I've met a problem in match a regular expression in python. Hope
> any of you could help me. Here are the details:
>
> I have many tags like this:
> xxx<a href="http://xxx.xxx.xxx" xxx>xxx
> xxx<a href="wap://xxx.xxx.xxx" xxx>xxx
> xxx<a href="http://xxx.xxx.xxx" xxx>xxx
> .....
> And I want to find all the "http://xxx.xxx.xxx" out, so I do it
> like this:
> httpPat = re.compile("(<a )(href=\")(http://.*)(\")")
> result = httpPat.findall(data)
> I use this to observe my output:
> for i in result:
>     print i[2]
> Surprisingly I will get some output like this:
> http://xxx.xxx.xxx">xxx</a>xxx
> In fact it's filtered from this kind of source:
> <a href="http://xxx.xxx.xxx">xxx</a>xxx"
> But some result are right, I wonder how can I get the all the
> answers clean like "http://xxx.xxx.xxx"? Thanks for your help.


".*" gives the longest possible match (you can think of it as searching
backwards from the right end). if you want to search for "everything until a
given character", searching for "[^x]*x" is often a better choice than ".*x".
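
To make the difference concrete, here is a minimal sketch. It uses Python 3 syntax (unlike the Python 2 code elsewhere in this thread), and the sample string is made up for illustration:

```python
import re

# Hypothetical sample: one tag carrying two quoted attributes.
text = '<a href="http://a.example" class="x">first</a>'

# Greedy: ".*" grabs as much as it can, then backs up to the LAST quote.
greedy = re.findall(r'href="(.*)"', text)

# Negated class: '[^"]*' cannot cross a quote, so it stops at the first one.
bounded = re.findall(r'href="([^"]*)"', text)

print(greedy)
print(bounded)
```

The greedy pattern swallows the second attribute as well, while the negated character class ends at the closing quote of the href value.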

in this case, I suggest using something like

print re.findall("href=\"([^\"]+)\"", text)

or, if you're going to parse HTML pages from many different sources, a
real parser:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href":
                    print value

p = MyHTMLParser()
p.feed(text)
p.close()

see:

http://docs.python.org/lib/module-HTMLParser.html
http://docs.python.org/lib/htmlparser-example.html
http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html
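
For readers on Python 3, where the HTMLParser module was renamed to html.parser, the same idea might look like the sketch below. The class name LinkCollector, the links attribute, and the sample text are assumptions made for illustration, not part of the original suggestion:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                # Attributes without a value (like the bare "xxx") arrive as None.
                if key == "href" and value is not None:
                    self.links.append(value)


# Made-up input modelled on the tags in the question.
sample = ('xxx<a href="http://xxx.xxx.xxx" xxx>xxx</a>xxx\n'
          'xxx<a href="wap://xxx.xxx.xxx" xxx>xxx</a>xxx\n')

collector = LinkCollector()
collector.feed(sample)
collector.close()

# Keep only the http:// links, as the question asks.
http_links = [u for u in collector.links if u.startswith("http://")]
print(http_links)
```

Collecting the values in a list instead of printing them makes it easy to filter by scheme afterwards.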

</F>



 
Johnny Lee
      09-23-2005

Fredrik Lundh wrote:
> ".*" gives the longest possible match (you can think of it as searching
> backwards from the right end). if you want to search for "everything until a
> given character", searching for "[^x]*x" is often a better choice than ".*x".
>
> in this case, I suggest using something like
>
> print re.findall("href=\"([^\"]+)\"", text)
>
> or, if you're going to parse HTML pages from many different sources, a
> real parser:
>
> from HTMLParser import HTMLParser
>
> class MyHTMLParser(HTMLParser):
>
>     def handle_starttag(self, tag, attrs):
>         if tag == "a":
>             for key, value in attrs:
>                 if key == "href":
>                     print value
>
> p = MyHTMLParser()
> p.feed(text)
> p.close()
>
> see:
>
> http://docs.python.org/lib/module-HTMLParser.html
> http://docs.python.org/lib/htmlparser-example.html
> http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html
>
> </F>


Thanks for your help.
I found another solution: simply adding a '?' after ".*", which
makes the match non-greedy, so it consumes as few characters as
possible while still matching the regular expression.
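
A minimal sketch of that change, in Python 3 syntax, using the problematic source line from the first post as sample data (the variable names greedy_pat, lazy_pat, greedy_urls and lazy_urls are made up for illustration):

```python
import re

# The kind of source line that produced the unexpected output.
data = '<a href="http://xxx.xxx.xxx">xxx</a>xxx"'

# Original pattern: greedy ".*" runs on to the last quote on the line.
greedy_pat = re.compile('(<a )(href=")(http://.*)(")')

# Same pattern with ".*?": stops at the first quote that lets the match succeed.
lazy_pat = re.compile('(<a )(href=")(http://.*?)(")')

# findall returns one tuple per match; index 2 is the URL group.
greedy_urls = [m[2] for m in greedy_pat.findall(data)]
lazy_urls = [m[2] for m in lazy_pat.findall(data)]

print(greedy_urls)
print(lazy_urls)
```

Note that "." does not match a newline by default, so both variants stay within a single line of input.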
As for HTMLParser, there is another problem (take my code for example):

import urllib
import htmllib
import formatter
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(urllib.urlopen(baseUrl).read())
parser.close()
for url in parser.anchorlist:
    if url[0:7] == "http://":
        print url

When baseUrl = "http://www.nba.com", this raises an
HTMLParseError because of the line "<! Copyright IBM Corporation,
2001, 2002 !>". I found that this line is inside <script> tags;
maybe that's the cause?

 
John J. Lee
      09-24-2005
"Fredrik Lundh" writes:
[...]
> or, if you're going to parse HTML pages from many different sources, a
> real parser:
>
> from HTMLParser import HTMLParser
>
> class MyHTMLParser(HTMLParser):
>
>     def handle_starttag(self, tag, attrs):
>         if tag == "a":
>             for key, value in attrs:
>                 if key == "href":
>                     print value
>
> p = MyHTMLParser()
> p.feed(text)
> p.close()
>
> see:
>
> http://docs.python.org/lib/module-HTMLParser.html
> http://docs.python.org/lib/htmlparser-example.html
> http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html


It's worth noting that module HTMLParser is less tolerant of the bad
HTML you find in the real world than is module sgmllib, which has a
similar interface. There are also third party libraries like
BeautifulSoup and mxTidy that you may find useful for parsing "HTML as
deployed" (i.e. bad HTML, often).

Also, htmllib is an extension to sgmllib, and will do your link
parsing with even less effort:

import htmllib, formatter, urllib2
pp = htmllib.HTMLParser(formatter.NullFormatter())
pp.feed(urllib2.urlopen("http://python.org/").read())
print pp.anchorlist


Module HTMLParser does have better support for XHTML, though.


John

 
John J. Lee
      09-24-2005
"Johnny Lee" writes:

> Fredrik Lundh wrote:

[...]
> To the HTMLParser, there is another problem (take my code for example):
>
> import urllib
> import formatter
> parser = htmllib.HTMLParser(formatter.NullFormatter())
> parser.feed(urllib.urlopen(baseUrl).read())
> parser.close()
> for url in parser.anchorlist:
>     if url[0:7] == "http://":
>         print url
>
> when the baseUrl="http://www.nba.com", there will raise an
> HTMLParseError because of a line of code "<! Copyright IBM Corporation,
> 2001, 2002 !>". I found that this line of code is inside <script> tags,
> maybe it's because of this?


No, it's because they're using a broken HTML comment (should be
"<!--comment-->"). BeautifulSoup is more tolerant:

import urllib2
from BeautifulSoup import BeautifulSoup
bs = BeautifulSoup(urllib2.urlopen('http://www.nba.com/').read())
for el in bs.fetch('a'):
    print el['href']


Or you could pre-process the HTML using mxTidy, and carry on using
module htmllib.
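
For readers on Python 3, where htmllib no longer exists: as far as I can tell, the standard html.parser module is lenient about a malformed declaration like the one above and carries on parsing. A minimal sketch under that assumption, with a made-up page and a hypothetical AnchorList class that mimics htmllib's anchorlist attribute:

```python
from html.parser import HTMLParser


class AnchorList(HTMLParser):
    """Gathers href values into .anchorlist (name borrowed from htmllib)."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.anchorlist.append(href)


# Made-up page containing the broken comment quoted in the question.
page = ('<html><body>\n'
        '<! Copyright IBM Corporation, 2001, 2002 !>\n'
        '<a href="http://www.nba.com/news">news</a>\n'
        '<a href="/scores">scores</a>\n'
        '</body></html>')

parser = AnchorList()
parser.feed(page)
parser.close()

# Same filter as in the question: keep absolute http:// links only.
http_urls = [url for url in parser.anchorlist if url[0:7] == "http://"]
print(http_urls)
```

This only shows that the parser gets past the broken comment; for heavily damaged markup a more tolerant library may still be the better choice.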

Hmm, are you the same Johnny Lee who contributed the MSIE cookie
support to LWP?


John

 