regex help

 
 
David
 
      07-08-2009
Hi

I have a few regexes I need to write, but I'm struggling to come up with a
nice way of doing them. More than anything I'm here to learn some
tricks and some neat code rather than just get an answer - although
that's obviously what I would like to get to.

Problem 1 -

<span class="chg"
id="ref_678774_cp">(25.47%)</span><br>

I want to extract 25.47 from here - so far I've tried -

xPer = re.search('<span class="chg" id="ref_"'+str(xID.group(1))+'"_cp
\">(.*?)%', content)

and

xPer = re.search('<span class=\"chg\" id=\"ref_"+str(xID.group(1))+"_cp
\">\((\d*)%\)</span><br>', content)

neither of these seem to do what I want - am I not doing this
correctly? (obviously!)

Problem 2 -

<td>&nbsp;</td>

<td width="1%" class=key>Open:
</td>
<td width="1%" class=val>5.50
</td>
<td>&nbsp;</td>
<td width="1%" class=key>Mkt Cap:
</td>
<td width="1%" class=val>6.92M
</td>
<td>&nbsp;</td>
<td width="1%" class=key>P/E:
</td>
<td width="1%" class=val>21.99
</td>


I want to extract the Open, Mkt Cap and P/E values - but apart from
doing loads of individual REs, which I think would look messy, I can't
think of a better and neater-looking way. Any ideas?

Cheers

David

 
Chris Rebert
 
      07-08-2009
On Wed, Jul 8, 2009 at 3:06 PM, David wrote:
> Hi
>
> I have a few regexes I need to write, but I'm struggling to come up with a
> nice way of doing them. More than anything I'm here to learn some
> tricks and some neat code rather than just get an answer - although
> that's obviously what I would like to get to.
>
> Problem 1 -
>
> <span class="chg"
> id="ref_678774_cp">(25.47%)</span><br>
>
> I want to extract 25.47 from here - so far I've tried -
>
> xPer = re.search('<span class="chg" id="ref_"'+str(xID.group(1))+'"_cp
> \">(.*?)%', content)
>
> and
>
> xPer = re.search('<span class=\"chg\" id=\"ref_"+str(xID.group(1))+"_cp
> \">\((\d*)%\)</span><br>', content)
>
> neither of these seem to do what I want - am I not doing this
> correctly? (obviously!)
>
> Problem 2 -
>
> <td>&nbsp;</td>
>
> <td width="1%" class=key>Open:
> </td>
> <td width="1%" class=val>5.50
> </td>
> <td>&nbsp;</td>
> <td width="1%" class=key>Mkt Cap:
> </td>
> <td width="1%" class=val>6.92M
> </td>
> <td>&nbsp;</td>
> <td width="1%" class=key>P/E:
> </td>
> <td width="1%" class=val>21.99
> </td>
>
>
> I want to extract the Open, Mkt Cap and P/E values - but apart from
> doing loads of individual REs, which I think would look messy, I can't
> think of a better and neater-looking way. Any ideas?


Use an actual HTML parser? Like BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/), for instance.

I will never understand why so many people try to parse/scrape
HTML/XML with regexes...

Cheers,
Chris
--
http://blog.rebertia.com
 
Tim Harig
 
      07-08-2009
On 2009-07-08, Chris Rebert wrote:
> On Wed, Jul 8, 2009 at 3:06 PM, David wrote:
>> I want to extract the Open, Mkt Cap and P/E values - but apart from
>> doing loads of individual REs, which I think would look messy, I can't
>> think of a better and neater-looking way. Any ideas?


You are downloading market data? Yahoo offers its stats in CSV format that
is easier to parse without a dedicated parser.

> Use an actual HTML parser? Like BeautifulSoup
> (http://www.crummy.com/software/BeautifulSoup/), for instance.


I agree with your sentiment exactly. If the regex he is trying to write is
difficult enough that he has to ask, then yes, he should be using a
parser.

> I will never understand why so many people try to parse/scrape
> HTML/XML with regexes...


Why? Because sometimes it is good enough to get the job done easily.
 
Rhodri James
 
      07-08-2009
On Wed, 08 Jul 2009 23:06:22 +0100, David wrote:

> Hi
>
> I have a few regexes I need to write, but I'm struggling to come up with a
> nice way of doing them. More than anything I'm here to learn some
> tricks and some neat code rather than just get an answer - although
> that's obviously what I would like to get to.
>
> Problem 1 -
>
> <span class="chg"
> id="ref_678774_cp">(25.47%)</span><br>
>
> I want to extract 25.47 from here - so far I've tried -
>
> xPer = re.search('<span class="chg" id="ref_"'+str(xID.group(1))+'"_cp
> \">(.*?)%', content)


Supposing that str(xID.group(1)) == "678774", let's see how that string
concatenation turns out:

<span class="chg" id="ref_"678774"_cp">(.*?)%

The obvious problems here are the spurious double quotes around the ID, the
spurious (but harmless) escaping of a double quote, and the lack of a
backslash-escaped open parenthesis before the group. The last of these you
can always strip off the captured text later, but the first sinks the match
rather thoroughly.
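To make this concrete, here is a minimal sketch that rebuilds the pattern the way the first attempt does and then tries a repaired version. The sample content and the stand-in ref_id (for str(xID.group(1))) are assumptions taken from the post, and the repaired pattern is just one possible fix, not necessarily the one David ended up using.

```python
import re

# Sample markup from the original post; note the line break between attributes.
content = '<span class="chg"\nid="ref_678774_cp">(25.47%)</span><br>'
ref_id = "678774"  # stand-in for str(xID.group(1))

# The first attempt, as written: stray double quotes end up around the ID.
broken = '<span class="chg" id="ref_"' + ref_id + '"_cp\">(.*?)%'
print(broken)                      # <span class="chg" id="ref_"678774"_cp">(.*?)%
print(re.search(broken, content))  # None

# One possible repair: drop the stray quotes, allow any whitespace between
# the attributes, and escape the ID in case it contains regex metacharacters.
fixed = r'<span class="chg"\s+id="ref_' + re.escape(ref_id) + r'_cp">(.*?)%'
match = re.search(fixed, content)
raw = match.group(1)               # '(25.47' - the open parenthesis is still there
value = raw.lstrip("(")            # strip it off afterwards
print(value)                       # 25.47
```

Stripping the parenthesis afterwards keeps the pattern closest to the original attempt; matching it explicitly with \( in the pattern is the tidier alternative.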

>
> and
>
> xPer = re.search('<span class=\"chg\" id=\"ref_"+str(xID.group(1))+"_cp
> \">\((\d*)%\)</span><br>', content)


With only two single quotes present, the biggest problem should be obvious.
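A quick way to see this, as a sketch: with only one pair of single quotes, everything between them is a single string literal, so the +str(...)+ part is never executed and simply ends up as text inside the pattern. The fragment below is just the middle of the second attempt, with the wrapped line joined.

```python
# Fragment of the second attempt: one single-quoted literal, so the
# double quotes and plus signs are ordinary characters, not Python syntax.
fragment = 'id=\"ref_"+str(xID.group(1))+"_cp\">'
print(fragment)  # id="ref_"+str(xID.group(1))+"_cp">

# The call was never evaluated; its source text is part of the pattern.
print("str(xID.group(1))" in fragment)  # True
```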

Unfortunately if you just fix the obvious in either of the two regular
expressions, you're setting yourself up for a fall later on. As The Fine
Manual says right at the top of the page on the re module
(http://docs.python.org/library/re.html), you want to be using raw string
literals when you're dealing with regular expressions, because you want
the backslashes getting through without being interpreted specially by
Python's own parser. As it happens, you get away with it in this case:
neither '\d' nor '\(' has a special meaning to Python, so they aren't
changed, and '\"' is interpreted as '"', which happens to be the right
thing anyway.
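As a small illustration of why raw strings matter (a sketch, not part of the original reply): '\d' happens to survive in an ordinary string, but '\b' does not, because Python turns it into a backspace character before the re module ever sees it. The repaired pattern at the end is one possible version of the second attempt. Here ref_id stands in for str(xID.group(1)), and the [\d.]+ group is an assumption on my part, since \d* alone cannot match the decimal point in 25.47.

```python
import re

# In an ordinary string, "\b" is a backspace; in a raw string it stays
# backslash + "b", which the re module reads as a word boundary.
print(len("\b"), len(r"\b"))             # 1 2
print(re.findall("\bcp\b", "ref cp"))    # []
print(re.findall(r"\bcp\b", "ref cp"))   # ['cp']

content = '<span class="chg"\nid="ref_678774_cp">(25.47%)</span><br>'
ref_id = "678774"  # stand-in for str(xID.group(1))

# Raw string pieces, real concatenation, escaped parentheses around the group.
pattern = (
    r'<span class="chg"\s+id="ref_' + re.escape(ref_id) + r'_cp">'
    r'\(([\d.]+)%\)</span>'
)
match = re.search(pattern, content)
print(match.group(1))  # 25.47
```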


> Problem 2 -
>
> <td>&nbsp;</td>
>
> <td width="1%" class=key>Open:
> </td>
> <td width="1%" class=val>5.50
> </td>
> <td>&nbsp;</td>
> <td width="1%" class=key>Mkt Cap:
> </td>
> <td width="1%" class=val>6.92M
> </td>
> <td>&nbsp;</td>
> <td width="1%" class=key>P/E:
> </td>
> <td width="1%" class=val>21.99
> </td>
>
>
> I want to extract the Open, Mkt Cap and P/E values - but apart from
> doing loads of individual REs, which I think would look messy, I can't
> think of a better and neater-looking way. Any ideas?


What you're trying to do is inherently messy. You might want to use
something like BeautifulSoup to hide the mess, but never having had
cause to use it myself I couldn't say for sure.
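If a regex is still preferred for a small, fixed snippet like this, a single pattern with two groups and re.findall avoids writing one expression per field. This is only a sketch, and it assumes the markup looks exactly like the sample in the post (unquoted class=key / class=val, with the value cell directly after the key cell); it will break if the page layout changes.

```python
import re

html = """<td>&nbsp;</td>

<td width="1%" class=key>Open:
</td>
<td width="1%" class=val>5.50
</td>
<td>&nbsp;</td>
<td width="1%" class=key>Mkt Cap:
</td>
<td width="1%" class=val>6.92M
</td>
<td>&nbsp;</td>
<td width="1%" class=key>P/E:
</td>
<td width="1%" class=val>21.99
</td>
"""

# One pattern, two groups: the text of a "key" cell (without its colon) and
# the text of the "val" cell that follows it. \s* soaks up the line breaks.
pair = re.compile(
    r'class=key>\s*(.*?):\s*</td>\s*'
    r'<td[^>]*class=val>\s*(.*?)\s*</td>'
)
values = dict(pair.findall(html))
print(values)  # {'Open': '5.50', 'Mkt Cap': '6.92M', 'P/E': '21.99'}
```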

--
Rhodri James *-* Wildebeest Herder to the Masses
 
Peter Otten
 
      07-09-2009
David wrote:

> <td>&nbsp;</td>
>
> <td width="1%" class=key>Open:
> </td>
> <td width="1%" class=val>5.50
> </td>
> <td>&nbsp;</td>
> <td width="1%" class=key>Mkt Cap:
> </td>
> <td width="1%" class=val>6.92M
> </td>
> <td>&nbsp;</td>
> <td width="1%" class=key>P/E:
> </td>
> <td width="1%" class=val>21.99
> </td>
>
>
> I want to extract the Open, Mkt Cap and P/E values - but apart from
> doing loads of individual REs, which I think would look messy, I can't
> think of a better and neater-looking way. Any ideas?


>>> from BeautifulSoup import BeautifulSoup
>>> bs = BeautifulSoup("""<td>&nbsp;</td>
...
... <td width="1%" class=key>Open:
... </td>
... <td width="1%" class=val>5.50
... </td>
... <td>&nbsp;</td>
... <td width="1%" class=key>Mkt Cap:
... </td>
... <td width="1%" class=val>6.92M
... </td>
... <td>&nbsp;</td>
... <td width="1%" class=key>P/E:
... </td>
... <td width="1%" class=val>21.99
... </td>
... """)
>>> for key in bs.findAll(attrs={"class": "key"}):
...     value = key.findNext(attrs={"class": "val"})
...     print key.string.strip(), "-->", value.string.strip()
...
Open: --> 5.50
Mkt Cap: --> 6.92M
P/E: --> 21.99
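The session above uses BeautifulSoup 3 under Python 2. For readers without that library, here is a rough Python 3 equivalent built on the standard library's html.parser instead (a different tool, swapped in so the sketch runs with no extra install). The KeyValParser class and its attribute names are hypothetical, made up for this example, and the pairing step assumes key and value cells strictly alternate as they do in the sample.

```python
from html.parser import HTMLParser


class KeyValParser(HTMLParser):
    """Collect the text of <td class=key> and <td class=val> cells in order."""

    def __init__(self):
        super().__init__()
        self.current = None   # "key", "val", or None when outside such a cell
        self.cells = []       # list of (kind, text) tuples

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "td" and cls in ("key", "val"):
            self.current = cls
            self.cells.append((cls, ""))

    def handle_data(self, data):
        if self.current:
            kind, text = self.cells[-1]
            self.cells[-1] = (kind, text + data)

    def handle_endtag(self, tag):
        if tag == "td":
            self.current = None


html = """<td>&nbsp;</td>

<td width="1%" class=key>Open:
</td>
<td width="1%" class=val>5.50
</td>
<td>&nbsp;</td>
<td width="1%" class=key>Mkt Cap:
</td>
<td width="1%" class=val>6.92M
</td>
<td>&nbsp;</td>
<td width="1%" class=key>P/E:
</td>
<td width="1%" class=val>21.99
</td>
"""

parser = KeyValParser()
parser.feed(html)
parser.close()

texts = [text.strip() for _, text in parser.cells]
# Keys and values alternate in this markup, so pair them up.
pairs = list(zip(texts[0::2], texts[1::2]))
for key, value in pairs:
    print(key, "-->", value)
```

With the sample markup this prints the same three lines as the BeautifulSoup session.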


 