Velocity Reviews - Computer Hardware Reviews


Regexp

 
 
gervaz
Guest
Posts: n/a
 
      01-19-2009
Hi all, I need to find all the addresses in an HTML source page. I'm
using:
'href="(?P<url>http://mysite.com/[^"]+)">(<b>)?(?P<name>[^</a>]+)(</b>)?</a>'
but the [^</a>]+ pattern matches any run of characters other than <,
/, a and >, whereas I only want to exclude the string "</a>". How can I
specify "do not match the string 'blabla'"?

Thanks
 
 
 
 
 
MRAB
Guest
Posts: n/a
 
      01-19-2009
gervaz wrote:
> Hi all, I need to find all the address in a html source page, I'm
> using:
> 'href="(?P<url>http://mysite.com/[^"]+)">(<b>)?(?P<name>[^</a>]+)(</b>)?</a>'
> but the [^</a>]+ pattern retrieve all the strings not containing <
> or / or a etc, although I just not want the word "</a>". How can I
> specify: 'do not search the string "blabla"?'
>

If the name is followed by "<" then just match the name with [^<]+:

href="(?P<url>http://mysite\.com/[^"]+)">(<b>)?(?P<name>[^<]+)(</b>)?</a>


I've also changed mysite.com to mysite\.com because an unescaped . matches
any character, whereas you presumably want to match a literal ".".
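
A minimal sketch of the suggestion above, assuming Python 3 and the standard `re` module. Only the pattern comes from this thread; the sample HTML string and the variable names are made up for illustration.

```python
import re

# Pattern from the reply above: the dot in "mysite.com" is escaped and the
# link text is matched with [^<]+ (everything up to the next "<").
pattern = re.compile(
    r'href="(?P<url>http://mysite\.com/[^"]+)">(<b>)?(?P<name>[^<]+)(</b>)?</a>'
)

# Hypothetical sample input, not taken from the original post.
html = (
    '<a href="http://mysite.com/one.html">First</a> '
    '<a href="http://mysite.com/two.html"><b>Second</b></a> '
    '<a href="http://mysiteXcom/three.html">Third</a>'
)

# Collect (url, name) pairs for every match in the sample.
links = [(m.group("url"), m.group("name")) for m in pattern.finditer(html)]
print(links)
```

The third link is skipped only because the dot is escaped; with the unescaped `mysite.com` the `.` would also accept the `X`. Note that [^<]+ stops at the first "<", so link text containing nested tags other than the optional <b> will not match.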
 
 
 
 
 
Diez B. Roggisch
Guest
Posts: n/a
 
      01-19-2009
gervaz wrote:

> Hi all, I need to find all the address in a html source page, I'm
> using:
> 'href="(?P<url>http://mysite.com/[^"]+)">(<b>)?(?P<name>[^</a>]+)(</b>)?</a>'
> but the [^</a>]+ pattern retrieve all the strings not containing <
> or / or a etc, although I just not want the word "</a>". How can I
> specify: 'do not search the string "blabla"?'


You should consider using BeautifulSoup or lxml's error-tolerant parser to
work with HTML documents.

Sooner or later your regex-based processing is bound to fail, as documents
get more complicated. Better to use the right tool for the job.

The code should look like this (untested):

from BeautifulSoup import BeautifulSoup
html = """<html><a href="http://mysite.com/foobar/baz">link</a></html>"""

res = []
soup = BeautifulSoup(html)
for tag in soup.findAll("a"):
    if tag["href"].startswith("http://mysite.com"):
        res.append(tag["href"])


Not so hard, and *much* more robust.

Diez
 
 
Peter Otten
Guest
Posts: n/a
 
      01-19-2009
gervaz wrote:

> Hi all, I need to find all the address in a html source page, I'm
> using:
> 'href="(?P<url>http://mysite.com/[^"]+)">(<b>)?(?P<name>[^</a>]+)(</b>)?</a>'
> but the [^</a>]+ pattern retrieve all the strings not containing <
> or / or a etc, although I just not want the word "</a>". How can I
> specify: 'do not search the string "blabla"?'


Have you considered BeautifulSoup?

from BeautifulSoup import BeautifulSoup
from urlparse import urlparse

for a in BeautifulSoup(page)("a"):
    try:
        href = a["href"]
    except KeyError:
        pass
    else:
        url = urlparse(href)
        if url.hostname == "mysite.com":
            print href

Peter
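
For readers without BeautifulSoup installed, the same idea can be sketched with Python 3's standard library alone. This swaps in `html.parser.HTMLParser` and `urllib.parse.urlparse` for the BeautifulSoup calls shown above; the `LinkCollector` class, its attributes and the sample page are hypothetical names invented for this illustration, and the standard parser may be less forgiving of badly broken markup than BeautifulSoup.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for <a> tags that point at one host."""

    def __init__(self, hostname):
        super().__init__()
        self.hostname = hostname
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Anchors without an href, or pointing at another host, are skipped.
            if href and urlparse(href).hostname == self.hostname:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> or <i> is collected too.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical sample page, not taken from the thread.
page = (
    '<p><a href="http://mysite.com/a.html"><b>Bold</b> name</a> '
    '<a name="top">no href</a> '
    '<a href="http://other.example/b.html">elsewhere</a></p>'
)

collector = LinkCollector("mysite.com")
collector.feed(page)
collector.close()
print(collector.links)
```

Comparing the parsed hostname, as in the reply above, avoids false positives that a plain prefix check on the URL string could let through.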
 
 
Ant
Guest
Posts: n/a
 
      01-19-2009
A 0-width positive lookahead is probably what you want here:

>>> import re
>>> s = """
... hdhd <a href="http://mysite.com/blah.html">Test <i>String</i> OK</a>
...
... """
>>> p = r'href="(http://mysite.com/[^"]+)">(.*)(?=</a>)'
>>> m = re.search(p, s)
>>> m.group(1)
'http://mysite.com/blah.html'
>>> m.group(2)
'Test <i>String</i> OK'

The (?=...) bit is the lookahead, and won't consume any of the string
you are searching. I've binned the named groups for clarity.

The beautiful soup answers are a better bet though - they've already
done the hard work, and after all, you are trying to roll your own
partial HTML parser here, which will struggle with badly formed html...
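
One caveat with the lookahead pattern above is that `.*` is greedy, so with two links on one line it runs to the last `</a>`. The sketch below (Python 3, with a made-up input string and made-up variable names) compares it with two common alternatives: a non-greedy `.*?`, and a negative lookahead `(?:(?!</a>).)+`, which is probably the closest regex answer to the original "do not match this string" question.

```python
import re

# Hypothetical input with two links on the same line.
s = ('<a href="http://mysite.com/a.html">First <i>one</i></a> and '
     '<a href="http://mysite.com/b.html">Second</a>')

# Greedy .* runs to the LAST </a> on the line before the lookahead succeeds.
greedy = re.compile(r'href="(http://mysite\.com/[^"]+)">(.*)(?=</a>)')

# Non-greedy .*? stops at the first </a>.
lazy = re.compile(r'href="(http://mysite\.com/[^"]+)">(.*?)</a>')

# Negative lookahead: consume one character at a time, as long as
# "</a>" does not start at that position.
tempered = re.compile(
    r'href="(http://mysite\.com/[^"]+)">((?:(?!</a>).)+)</a>'
)

greedy_names = [m.group(2) for m in greedy.finditer(s)]
lazy_names = [m.group(2) for m in lazy.finditer(s)]
tempered_names = [m.group(2) for m in tempered.finditer(s)]

print(greedy_names)
print(lazy_names)
print(tempered_names)
```

The greedy version yields a single match that swallows both links, while the other two yield one name per link. None of these handle link text that spans several lines unless `re.DOTALL` is added.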
 
 
gervaz
Guest
Posts: n/a
 
      01-19-2009

Ok, thank you all, I'll take a look at beautiful soup, albeit the
lookahead solution fits better for the little I have to do.
 
 
Diez B. Roggisch
Guest
Posts: n/a
 
      01-19-2009
gervaz wrote:

> Ok, thank you all, I'll take a look at beautiful soup, albeit the
> lookahead solution fits better for the little I have to do.


Little things tend to get out of hand quickly... This is the reason why so
many gave you the hint.

Diez
 