extracting from web pages but got disordered words sometimes

 
 
Frank Potter
      01-27-2007
There are ten web pages I want to process,
from http://www.af.shejis.com/new_lw/html/125926.shtml
to http://www.af.shejis.com/new_lw/html/125936.shtml

Each of them uses the Chinese charset "gb2312", and Firefox
displays all of them correctly, as readable Chinese.

My task is to fetch every page, extract its HTML title, and
display the title in a Linux shell terminal.

My problem is that for some pages I get a human-readable title (in
Chinese), but for other pages I get garbled text. Since every page
has the same charset, I don't know why I can't get every title in the
same way.

Here's my Python code, get_title.py:

Code:
#!/usr/bin/python
import urllib2
from BeautifulSoup import BeautifulSoup

min_page=125926
max_page=125936

def make_page_url(page_index):
    return ur"".join([ur"http://www.af.shejis.com/new_lw/html/",
                      str(page_index),ur".shtml"])

def get_page_title(page_index):
    url=make_page_url(page_index)
    print "now getting: ", url
    user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers={'User-Agent':user_agent}
    req=urllib2.Request(url,None,headers)
    response=urllib2.urlopen(req)
    #print response.info()
    page=response.read()

    #extract title with Beautiful Soup
    soup=BeautifulSoup(page)
    full_title=str(soup.html.head.title.string)

    #title is in the format of "title --title"
    #use this code to delete the "--" and the duplicate title
    title=full_title[full_title.rfind('-')+1::]

    return title

for i in xrange(min_page,max_page):
    print get_page_title(i)
Will somebody please help me out? Thanks in advance.

 
Paul McGuire
      01-27-2007
This pyparsing solution seems to extract what you were looking for,
but I don't know if this will render to Chinese or not.

-- Paul

from pyparsing import makeHTMLTags,SkipTo
import urllib

titleStart,titleEnd = makeHTMLTags("title")
scanExpr = (titleStart + SkipTo("- -",include=True) +
            SkipTo(titleEnd).setResultsName("titleChars") + titleEnd)

def extractTitle(htmlSource):
    titleSource = scanExpr.searchString(htmlSource, maxMatches=1)[0]
    return titleSource.titleChars


for urlIndex in range(125926,125936+1):
    url = "http://www.af.shejis.com/new_lw/html/%d.shtml" % urlIndex
    pg = urllib.urlopen(url)
    html = pg.read()
    pg.close()
    print url,':',extractTitle(html)


Gives:

http://www.af.shejis.com/new_lw/html/125926.shtml : GSM本地网组网方式
http://www.af.shejis.com/new_lw/html/125927.shtml : GSM本地网组网方式初探
http://www.af.shejis.com/new_lw/html/125928.shtml : GSM的数据业务
http://www.af.shejis.com/new_lw/html/125929.shtml : GSM的数据业务和承载能力
http://www.af.shejis.com/new_lw/html/125930.shtml : GSM的网络演进-从GSM到GPRS到3G (附图)
http://www.af.shejis.com/new_lw/html/125931.shtml : GSM短消⒁滴裨谒樽远獗ㄏ低持械挠τ矛
http://www.af.shejis.com/new_lw/html/125932.shtml : GSM交换系统的网络优化
http://www.af.shejis.com/new_lw/html/125933.shtml : GSM切换掉话的分析敖饩霭旆
http://www.af.shejis.com/new_lw/html/125934.shtml : GSM手机拨叫市话?榫钟没Ч收系钠饰
http://www.af.shejis.com/new_lw/html/125935.shtml : GSM手机到WCDMA终端的演变
http://www.af.shejis.com/new_lw/html/125936.shtml : GSM手机的维修方法

 
Paul McGuire
      01-27-2007

After looking at the pyparsing results, I think I see the problem with
your original code. You are selecting only the characters after the
rightmost "-" character, but you really want to select everything to
the right of "- -". In some of the titles, the encoded Chinese
includes a "-" character, so you are chopping off everything before
that.

Try changing your code to:
title=full_title.split("- -")[1]

I think then your original program will work.

-- Paul
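
To illustrate the difference between the two slicing approaches discussed above, here is a small sketch in modern Python 3 (the thread's code is Python 2). The sample title and its "prefix - -title" layout are assumptions made up for the example, and the helper names are hypothetical; the real page titles may be laid out differently.

```python
# Hypothetical title in the assumed "prefix - -title" layout.
full_title = "GSM - -GSM的网络演进-从GSM到GPRS到3G"


def title_after_last_hyphen(full_title):
    # Original approach: keep everything after the rightmost "-".
    # This truncates any title that itself contains a hyphen.
    return full_title[full_title.rfind("-") + 1:]


def title_after_separator(full_title, separator="- -"):
    # Suggested approach: keep everything after the first "- -".
    # partition() leaves the separator slot empty when it is not found,
    # in which case the whole string is returned unchanged.
    head, found, tail = full_title.partition(separator)
    return tail if found else full_title


print(title_after_last_hyphen(full_title))
print(title_after_separator(full_title))
```

Note that `full_title.split("- -")[1]` behaves the same way when the separator occurs exactly once, but raises `IndexError` if it is missing, which is why the sketch uses `partition()` instead.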

 
 
Frank Potter
      01-28-2007
Thank you. I tried again and figured it out.
It was an issue with how I used Beautiful Soup. I worked with it a year ago, also
on Chinese HTML pages, and had no errors. I read the
old code and found the difference: decode the page to unicode before
feeding it to Beautiful Soup, and then everything is OK.
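
A minimal sketch of that idea, written for modern Python 3 rather than the Python 2 used in this thread: decode the raw bytes with the page's declared charset first, and only then parse the resulting text. To keep the example self-contained it uses the standard library's `html.parser` instead of Beautiful Soup and an in-memory sample page instead of a network fetch; `TitleParser`, `extract_title`, and `sample_page` are hypothetical names invented for the illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(raw_bytes, encoding="gb2312"):
    # Decode to text BEFORE parsing, so the parser never has to guess
    # the charset. Undecodable bytes become U+FFFD instead of raising.
    text = raw_bytes.decode(encoding, errors="replace")
    parser = TitleParser()
    parser.feed(text)
    parser.close()
    return parser.title.strip()


# Stand-in for response.read(): a tiny page encoded as gb2312 bytes.
sample_page = (
    "<html><head><title>GSM的数据业务</title></head><body></body></html>"
).encode("gb2312")

print(extract_title(sample_page))
```

Decoding the same bytes with the wrong codec (for example `extract_title(sample_page, "latin-1")`) yields unreadable text. If a page labelled "gb2312" contains characters outside that set, passing `"gb18030"`, a superset codec, may be more forgiving.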
