PyParsing module or HTMLParser

 
 
Lad
      03-28-2005
I came across the pyparsing module by Paul McGuire. It seems nice, but
I am not sure it is the best fit for my needs.
I need to extract some text from an HTML page. The text is in tables,
and a table can be nested inside another table.
Is it better and easier to use pyparsing or HTMLParser?

Thanks for suggestions.
La.
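
For comparison, here is a minimal sketch of the HTMLParser route for text in nested tables, using only the standard library (html.parser in Python 3). The class name TableTextExtractor, its attributes, and the sample HTML are made up for illustration, and the sketch assumes every cell and table is closed properly, which real pages often do not guarantee.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect the text of each table cell together with its table nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of <table> elements currently open
        self.open_cells = []  # stack of text buffers, one per open cell
        self.cells = []       # (depth, text) pairs, in the order cells close

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif tag in ("td", "th"):
            self.open_cells.append([])

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
        elif tag in ("td", "th") and self.open_cells:
            # join the pieces and collapse runs of whitespace
            text = " ".join("".join(self.open_cells.pop()).split())
            if text:
                self.cells.append((self.depth, text))

    def handle_data(self, data):
        # text belongs to the innermost open cell only
        if self.open_cells:
            self.open_cells[-1].append(data)


# invented sample: one table nested inside a cell of another
html = (
    "<table><tr><td>outer"
    "<table><tr><td>inner</td></tr></table>"
    "</td></tr></table>"
)
parser = TableTextExtractor()
parser.feed(html)
parser.close()
print(parser.cells)
```

Inner cells close first, so they appear before the cell that contains them; the depth value tells the two levels apart.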

 
Bill Mill
      03-28-2005
On 28 Mar 2005 12:01:34 -0800, Lad wrote:
> I came across the pyparsing module by Paul McGuire. It seems nice, but
> I am not sure it is the best fit for my needs.
> I need to extract some text from an HTML page. The text is in tables,
> and a table can be nested inside another table.
> Is it better and easier to use pyparsing or HTMLParser?
>


You might want to check out BeautifulSoup at:
http://www.crummy.com/software/BeautifulSoup/ .

Peace
Bill Mill
bill.mill at gmail.com
 
EuGeNe
      03-28-2005
Lad wrote:
> I came across the pyparsing module by Paul McGuire. It seems nice, but
> I am not sure it is the best fit for my needs.
> I need to extract some text from an HTML page. The text is in tables,
> and a table can be nested inside another table.
> Is it better and easier to use pyparsing or HTMLParser?
>
> Thanks for suggestions.
> La.
>


Check BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/); it
did the job for me!

--
EuGeNe

[----
www.boardkulture.com
www.actiphot.com
www.xsbar.com
----]
 
Paul McGuire
      03-29-2005
La -

In general, I have shied away from doing general-purpose HTML parsing
with pyparsing. It's a crowded field, and it's likely that there are
better candidates out there for your problem. I've heard good things
about BeautifulSoup, but I've also heard from at least one person that
they prefer pyparsing to BS.

I personally have had good luck with *simple* HTML scraping with
pyparsing, such as extracting data from tables. It just depends on how
variable your source text is. Tables within tables may be a bit
challenging, but we'll never know unless you provide more to go on. If
you post a URL or some sample HTML, I could give you a more definitive
answer (possibly even a working code sample, you never know).

-- Paul
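
To give an idea of what "simple" table scraping with pyparsing can look like, here is a small sketch. It uses pyparsing's built-in makeHTMLTags helper and SkipTo; the sample HTML and the names cell and cells are invented for illustration. It treats every cell as flat text, so it makes no attempt to track tables within tables.

```python
from pyparsing import SkipTo, makeHTMLTags

# makeHTMLTags returns expressions for the opening and closing tag;
# the opening-tag expression also accepts attributes such as <td class="x">
td_start, td_end = makeHTMLTags("td")

# a cell is everything between <td ...> and the next </td>
cell = td_start.suppress() + SkipTo(td_end).setResultsName("cell") + td_end.suppress()

# invented sample row; a real page would be read from a file or URL
html = '<table><tr><td class="label">Phone:</td><td>852 35763877</td></tr></table>'

cells = [tokens["cell"].strip() for tokens, start, end in cell.scanString(html)]
print(cells)
```

Because SkipTo stops at the first closing </td> it finds, a cell that itself contains a table would be cut short, which is why nested tables need more care.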

 
Lad
      03-30-2005
Paul,
Thank you for your reply.

Here is a test page that I would like to try with pyparsing:

http://www.ourglobalmarket.com/Test.htm

From that page I would like to extract:

Title (it is below "Lanjin Electronics Co., Ltd."; on the test page it is
"Sell 2.4GHz Wireless Mini Color Camera With Audio Function")
Description (below the title, next to the picture)
Contact person
Company name
Address
Fax
Phone
Website address

Do you think pyparsing will work for that?

Best regards,
Lad.

 
Paul McGuire
      03-30-2005
Lad -

Well, here's what I've got so far. I'll leave the extraction of the
description to you as an exercise, but as a clue, it looks like it is
delimited by "<b>View Detail</b></a></td></tr></tbody></table> <br>" at
the beginning, and "Quantity: 500<br>" at the end, where 500 could be
any number. This program will print out:

['Title:', 'Sell 2.4GHz Wireless Mini Color Camera With Audio Function Manufacturers Hong Kong - Exporters, Suppliers, Factories, Seller']
['Contact:', 'Mr. Simon Cheung']
['Company:', 'Lanjin Electronics Co., Ltd.']
['Address:', 'Rm 602, 6/F., Tung Ning Bldg., 2 Hillier Street, Sheung Wan , Hong Kong\n , HK\n ( Hong Kong )']
['Phone:', '852 35763877']
['Fax:', '852 31056238']
['Mobile:', '852-96439737']

So I think pyparsing will get you pretty far along the way. Code
attached below.

-- Paul

===================================
from pyparsing import *
from urllib.request import urlopen

# get input data
url = "http://www.ourglobalmarket.com/Test.htm"
page = urlopen( url )
# the encoding is a guess; adjust it if the page declares another one
pageHTML = page.read().decode( "utf-8", errors="replace" )
page.close()

#~ I would like to extract the title ( it is below Lanjin Electronics
#~ Co., Ltd. )
#~ (Sell 2.4GHz Wireless Mini Color Camera With Audio Function )

#~ description - below the title next to the picture
#~ Contact person
#~ Company name
#~ Address
#~ fax
#~ phone
#~ Website Address

LANGBRK = Literal("<")
RANGBRK = Literal(">")
SLASH = Literal("/")
tagAttr = Word(alphanums) + "=" + dblQuotedString

# helpers for defining HTML tag expressions; the tags themselves are
# suppressed so that only the text between them shows up in the results
def startTag( tagname ):
    return ( LANGBRK + CaselessLiteral(tagname) +
             ZeroOrMore(tagAttr) + RANGBRK ).suppress()
def endTag( tagname ):
    return ( LANGBRK + SLASH + CaselessLiteral(tagname) + RANGBRK ).suppress()
def makeHTMLtags( tagname ):
    return startTag(tagname), endTag(tagname)
def strong( expr ):
    return strongStartTag + expr + strongEndTag

strongStartTag, strongEndTag = makeHTMLtags("strong")
titleStart, titleEnd = makeHTMLtags("title")
tdStart, tdEnd = makeHTMLtags("td")
h1Start, h1End = makeHTMLtags("h1")

# one expression per field; labeled fields are a <td> holding the label
# in <strong>, followed by a <td> holding the value
title = titleStart + SkipTo( titleEnd ).setResultsName("title") + titleEnd
contactPerson = tdStart + h1Start + \
                SkipTo( h1End ).setResultsName("contact")
company = ( tdStart + strong("Company:") + tdEnd + tdStart ) + \
                SkipTo( tdEnd ).setResultsName("company")
address = ( tdStart + strong("Address:") + tdEnd + tdStart ) + \
                SkipTo( tdEnd ).setResultsName("address")
phoneNum = ( tdStart + strong("Phone:") + tdEnd + tdStart ) + \
                SkipTo( tdEnd ).setResultsName("phoneNum")
faxNum = ( tdStart + strong("Fax:") + tdEnd + tdStart ) + \
                SkipTo( tdEnd ).setResultsName("faxNum")
mobileNum = ( tdStart + strong("Mobile:") + tdEnd + tdStart ) + \
                SkipTo( tdEnd ).setResultsName("mobileNum")
webSite = ( tdStart + strong("Website Address:") + tdEnd + tdStart ) + \
                SkipTo( tdEnd ).setResultsName("webSite")
scrapes = ( title | contactPerson | company | address | phoneNum |
            faxNum | mobileNum | webSite )

# use parse actions to remove hyperlinks
linkStart, linkEnd = makeHTMLtags("a")
linkExpr = linkStart + SkipTo( linkEnd ) + linkEnd
def stripHyperLink(s,l,t):
    return [ t[0], linkExpr.transformString( t[1] ) ]
company.setParseAction( stripHyperLink )

# use parse actions to add labels for data elements that don't
# have labels in the HTML
def prependLabel(pre):
    def prependAction(s,l,t):
        return [pre] + t[:]
    return prependAction
title.setParseAction( prependLabel("Title:") )
contactPerson.setParseAction( prependLabel("Contact:") )

# scan the whole page and print every match
for tokens,start,end in scrapes.scanString( pageHTML ):
    print( tokens )
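
One possible way to approach the description exercise is sketched below. It assumes the two delimiters described at the top of this post and uses SkipTo to grab whatever lies between them. The sample HTML is invented so the snippet runs on its own, and the names desc_start, desc_end, and description are illustrative only; the markup on the real page may differ in whitespace or case, in which case the start literal would need loosening.

```python
from pyparsing import Literal, SkipTo, Word, nums

# delimiters as described in the post; the quantity can be any number
desc_start = Literal("<b>View Detail</b></a></td></tr></tbody></table> <br>")
desc_end = Literal("Quantity:") + Word(nums) + Literal("<br>")

# keep only the text between the two delimiters
description = (
    desc_start.suppress()
    + SkipTo(desc_end).setResultsName("description")
    + desc_end.suppress()
)

# invented stand-in for the relevant part of the page
sample_html = (
    '<table><tbody><tr><td><a href="detail.htm">'
    "<b>View Detail</b></a></td></tr></tbody></table> <br>"
    "Wireless mini camera with audio.<br>"
    "Quantity: 500<br>"
)

descriptions = [
    tokens["description"].strip()
    for tokens, start, end in description.scanString(sample_html)
]
print(descriptions)
```

Any markup inside the description (such as the trailing <br> here) is returned as-is; it could be stripped afterwards with transformString, the same way the hyperlinks are removed from the company name.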

 
Lad
      03-31-2005
Paul, thanks a lot.
It seems to work, but I will have to study the sample hard to be able to
do the exercise (extracting the description) successfully. May I email
you if I need some help with it?
Thanks again for your help.
Lad.

 
Paul McGuire
      03-31-2005
Yes, drop me a note if you get stuck.

-- Paul
base64.decodestring('cHRtY2dAYXVzdGluLnJyLmNvbQ==' )

 