Velocity Reviews - Computer Hardware Reviews


What's the best way to write this regular expression?

 
 
John Salerno
      03-06-2012
Thanks. I'm thinking the choice might be between lxml and Beautiful
Soup, but since BS can use lxml as its parser, I'm trying to figure out the
difference between them. I don't necessarily need the simplest
(html.parser), but I want to choose one that is simple enough yet
powerful enough that I won't have to learn another method later.




On Tue, Mar 6, 2012 at 5:35 PM, Ian Kelly <(E-Mail Removed)> wrote:
> On Tue, Mar 6, 2012 at 4:05 PM, John Salerno <(E-Mail Removed)> wrote:
>>> Anything that allows me NOT to use REs is welcome news, so I look forward to learning about something new!
>>
>> I should ask though...are there alternatives already bundled with Python that I could use? Now that you mention it, I remember something called HTMLParser (or something like that) and I have no idea why I never looked into that before I messed with REs.
>
> HTMLParser is pretty basic, although it may be sufficient for your
> needs. It just converts an HTML document into a stream of start tags,
> end tags, and text, with no guarantee that the tags will actually
> correspond in any meaningful way. lxml can be used to output an
> actual hierarchical structure that may be easier to manipulate and
> extract data from.
>
> Cheers,
> Ian
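
To make the "stream of start tags, end tags, and text" concrete, here is a minimal sketch using the standard library's html.parser module (the Python 3 name; in Python 2 the module was called HTMLParser). The EventLogger class, the parse_events helper, and the sample markup are invented for illustration. The markup is deliberately mismatched to show that the parser just reports what it sees.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records every parser event in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.events.append(("text", data.strip()))


def parse_events(markup):
    """Feed markup to the parser and return the flat list of events."""
    logger = EventLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


# Deliberately mismatched: <b> is never closed, </i> was never opened.
sample = "<p>Song: <b>Title</i></p>"
events = parse_events(sample)
```

The stray </i> shows up as an ordinary "end" event; pairing tags into a tree is left to the caller, which is the gap a tree-building library fills.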

 
Steven D'Aprano
      03-06-2012
On Tue, 06 Mar 2012 15:05:39 -0800, John Salerno wrote:

>> Anything that allows me NOT to use REs is welcome news, so I look
>> forward to learning about something new!

>
> I should ask though...are there alternatives already bundled with Python
> that I could use? Now that you mention it, I remember something called
> HTMLParser (or something like that) and I have no idea why I never
> looked into that before I messed with REs.


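# htmllib exists only in Python 2; it was removed in Python 3,
# where html.parser is the closest standard-library module.
import htmllib
help(htmllib)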

The help is pretty minimal and technical; you might like to google for a
tutorial or two:

https://duckduckgo.com/html/?q=pytho...lib%20tutorial

Also, you're still double-posting.


--
Steven
 
John Salerno
      03-06-2012
> Also, you're still double-posting.

Grr. I just reported it to Google, but I think if I start to frequent the newsgroup again I'll have to switch to Thunderbird, or perhaps I'll just try switching back to the old Google Groups interface. I think the issue is the new interface.

Sorry.
 
Prasad, Ramit
      03-07-2012
> > Also, you're still double-posting.
>
> Grr. I just reported it to Google, but I think if I start to frequent the
> newsgroup again I'll have to switch to Thunderbird, or perhaps I'll just
> try switching back to the old Google Groups interface. I think the issue is
> the new interface.
>
> Sorry.


Oddly, I see no double posting for this thread on my end (email list).

Ramit


 
Terry Reedy
      03-07-2012
On 3/6/2012 6:05 PM, John Salerno wrote:
>> Anything that allows me NOT to use REs is welcome news, so I look
>> forward to learning about something new!

>
> I should ask though...are there alternatives already bundled with
> Python that I could use?


lxml is more or less upward compatible with xml.etree in the stdlib.
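
As a rough sketch of what that compatibility means in practice, the following uses xml.etree.ElementTree from the stdlib on a small, well-formed document. The markup and the list_songs helper are invented for illustration. lxml.etree offers a largely similar interface (fromstring, findall, findtext), so code like this often carries over after changing the import, though that should be treated as an assumption to verify rather than a guarantee.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample document.
DOC = """
<songs>
  <song><title>First</title><artist>A</artist></song>
  <song><title>Second</title><artist>B</artist></song>
</songs>
"""


def list_songs(xml_text):
    """Return (title, artist) pairs by walking the element tree."""
    root = ET.fromstring(xml_text.strip())
    return [
        (song.findtext("title"), song.findtext("artist"))
        for song in root.findall("song")
    ]


songs = list_songs(DOC)
```

Unlike a flat event stream, each song element here owns its title and artist children, so extraction is a matter of navigating the tree.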

--
Terry Jan Reedy

 
Terry Reedy
      03-07-2012
On 3/6/2012 6:57 PM, John Salerno wrote:
>> Also, you're still double-posting.

>
> Grr. I just reported it to Google, but I think if I start to frequent
> the newsgroup again I'll have to switch to Thunderbird, or perhaps
> I'll just try switching back to the old Google Groups interface. I
> think the issue is the new interface.


I am not seeing the double posting, but I use Thunderbird + the
news.gmane.org mirrors of python-list and others.

--
Terry Jan Reedy

 
Roy Smith
      03-07-2012
In article
<12783654.1174.1331073814011.JavaMail.geo-discussion-forums@yner4>,
John Salerno <(E-Mail Removed)> wrote:

> I sort of have to work with what the website gives me (as you'll see below),
> but today I encountered an exception to my RE. Let me just give all the
> specific information first. The point of my script is to go to the specified
> URL and extract song information from it.


Rule #1: Don't try to parse XML, HTML, or any other kind of ML with
regular expressions.

Rule #2: Use a dedicated ML parser. I like lxml (http://lxml.de/).
There are other possibilities.

Rule #3: If in doubt, see rule #1.
 
John Salerno
      03-07-2012
After a bit of reading, I've decided to use Beautiful Soup 4, with
lxml as the parser. I considered simply using lxml to do all the work,
but I just got lost in the documentation and tutorials. I couldn't
find a clear explanation of how to parse an HTML file and then
navigate its structure.

The Beautiful Soup 4 documentation was very clear, and BS4 itself is
so simple and Pythonic. And best of all, since version 4 no longer
does the parsing itself, you can choose your own parser, and it works
with lxml, so I'll still be using lxml, but with a nice, clean overlay
for navigating the tree structure.
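
For readers who have not seen the combination, a minimal sketch of Beautiful Soup 4 layered over a chosen parser might look like the following. It assumes the third-party bs4 package is installed (plus lxml, if "lxml" is passed as the parser). The extract_links helper is invented for illustration and is not taken from the script discussed in this thread.

```python
from bs4 import BeautifulSoup


def extract_links(markup, parser="lxml"):
    """Return (text, href) pairs for every <a> tag that has an href.

    `parser` is handed straight to BeautifulSoup: "lxml" needs the lxml
    package, while "html.parser" uses the one bundled with Python.
    """
    soup = BeautifulSoup(markup, parser)
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]
```

Calling extract_links(page_source) uses lxml; passing "html.parser" as the second argument swaps in the bundled parser without changing any of the navigation code, which is the appeal of keeping parsing and tree navigation separate.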

Thanks for the advice!
 
Paul Rubin
      03-07-2012
John Salerno <(E-Mail Removed)> writes:
> The Beautiful Soup 4 documentation was very clear, and BS4 itself is
> so simple and Pythonic. And best of all, since version 4 no longer
> does the parsing itself, you can choose your own parser, and it works
> with lxml, so I'll still be using lxml, but with a nice, clean overlay
> for navigating the tree structure.


I haven't used BS4 but have made good use of earlier versions.

The main thing to understand is that an awful lot of HTML in the real world
is malformed and will break an XML parser or anything that expects
syntactically valid HTML. People tend to write HTML that works well
enough to render decently in browsers, whose parsers therefore have to
be tolerant of errors. Beautiful Soup also tries to make sense of
crappy, malformed HTML. Partly as a result, it's dog slow compared to
any serious XML parser. But it works very well if you don't mind the
low speed.
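
The strict-versus-tolerant difference can be sketched with the standard library alone. This is only an illustration under stated assumptions: xml.etree.ElementTree stands in for "an XML parser", html.parser stands in for a tolerant HTML parser (it is not Beautiful Soup), and the sample markup and helper names are made up. It says nothing about speed.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical browser-tolerated HTML: <br> is never closed.
SLOPPY = "<p>line one<br>line two</p>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers text chunks, ignoring tag balance."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_text(markup):
    """Return the text chunks a tolerant HTML parser finds."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


xml_ok = parses_as_xml(SLOPPY)   # strict parser rejects the unclosed <br>
chunks = html_text(SLOPPY)       # tolerant parser still yields the text
```

The XML parser gives up at the mismatched closing tag, while the HTML parser carries on and still hands back both pieces of text.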
 
John Salerno
      03-07-2012
Ok, first major roadblock. I have no idea how to install Beautiful
Soup or lxml on Windows! All I can find are .tar files. Based on what
I've read, I can use the easy_install command to install these types of
files, but when I went to download the setuptools package, it only
seemed to support Python 2.7. I'm using 3.2. Is 2.7 just the minimum
version it requires? It didn't say something like "2.7+", so I wasn't
sure, and I don't want to start installing a bunch of stuff that will
clog up my directories and not even work.

What's the best way for me to install these two packages? I've also
seen a reference to using setup.py...is that a separate package too,
or is that something that comes with Python by default?

Thanks.
 