Velocity Reviews


Frank Potter 01-27-2007 11:18 AM

extracting from web pages but got disordered words sometimes
 
There are ten web pages I want to deal with.
from http://www.af.shejis.com/new_lw/html/125926.shtml
to http://www.af.shejis.com/new_lw/html/125936.shtml

Each of them uses the Chinese charset "gb2312", and Firefox
displays all of them correctly, as readable Chinese.

My job is to fetch every page, extract its HTML title, and
display the title in a Linux shell terminal.

My problem is that for some pages I get a human-readable title (in
Chinese), but for other pages I get garbled text. Since every page
has the same charset, I don't know why I can't get every title in the
same way.

Here's my python code, get_title.py :

Code:

#!/usr/bin/python
import urllib2
from BeautifulSoup import BeautifulSoup

min_page=125926
max_page=125936

def make_page_url(page_index):
    return ur"".join([ur"http://www.af.shejis.com/new_lw/html/",str(page_index),ur".shtml"])

def get_page_title(page_index):
    url=make_page_url(page_index)
    print "now getting: ", url
    user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers={'User-Agent':user_agent}
    req=urllib2.Request(url,None,headers)
    response=urllib2.urlopen(req)
    #print response.info()
    page=response.read()

    #extract title by beautiful soup
    soup=BeautifulSoup(page)
    full_title=str(soup.html.head.title.string)

    #title is in the format of "title --title"
    #use this code to delete the "--" and the duplicate title
    title=full_title[full_title.rfind('-')+1::]

    return title

for i in xrange(min_page,max_page):
    print get_page_title(i)

Will somebody please help me out? Thanks in advance.


Paul McGuire 01-27-2007 07:18 PM

Re: extracting from web pages but got disordered words sometimes
 


This pyparsing solution seems to extract what you were looking for,
but I don't know if this will render to Chinese or not.

-- Paul

from pyparsing import makeHTMLTags,SkipTo
import urllib

titleStart,titleEnd = makeHTMLTags("title")
# skip past the "- -" separator, then capture everything up to </title>
scanExpr = (titleStart + SkipTo("- -",include=True) +
            SkipTo(titleEnd).setResultsName("titleChars") + titleEnd)

def extractTitle(htmlSource):
    titleSource = scanExpr.searchString(htmlSource, maxMatches=1)[0]
    return titleSource.titleChars


for urlIndex in range(125926,125936+1):
    url = "http://www.af.shejis.com/new_lw/html/%d.shtml" % urlIndex
    pg = urllib.urlopen(url)
    html = pg.read()
    pg.close()
    print url,':',extractTitle(html)


Gives:

http://www.af.shejis.com/new_lw/html/125926.shtml : GSM本地网组网方式
http://www.af.shejis.com/new_lw/html/125927.shtml : GSM本地网组网方式初探
http://www.af.shejis.com/new_lw/html/125928.shtml : GSM的数据业务
http://www.af.shejis.com/new_lw/html/125929.shtml : GSM的数据业务和承载能力
http://www.af.shejis.com/new_lw/html/125930.shtml : GSM的网络演进-从GSM到GPRS到3G (附图)
http://www.af.shejis.com/new_lw/html/125931.shtml : GSM短消⒁滴裨谒樽远獗ㄏ低持械挠τ矛
http://www.af.shejis.com/new_lw/html/125932.shtml : GSM交换系统的网络优化
http://www.af.shejis.com/new_lw/html/125933.shtml : GSM切换掉话的分析敖饩霭旆
http://www.af.shejis.com/new_lw/html/125934.shtml : GSM手机拨叫市话?榫钟没Ч收系钠饰
http://www.af.shejis.com/new_lw/html/125935.shtml : GSM手机到WCDMA终端的演变
http://www.af.shejis.com/new_lw/html/125936.shtml : GSM手机的维修方法


Paul McGuire 01-27-2007 07:26 PM

Re: extracting from web pages but got disordered words sometimes
 

After looking at the pyparsing results, I think I see the problem with
your original code. You are selecting only the characters after the
rightmost "-" character, but you really want to select everything to
the right of "- -". In some of the titles, the encoded Chinese
includes a "-" character, so you are chopping off everything before
that.

Try changing your code to:
title=full_title.split("- -")[1]

I think then your original program will work.
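
To make the difference concrete, here is a minimal sketch comparing the two slicing approaches. It is written for Python 3 rather than the Python 2 used above, the function names are hypothetical, and the sample string is only an assumed shape for a "title - -title" page title that happens to contain a hyphen of its own. It uses str.partition instead of split("- -")[1] so that a title without the separator is returned unchanged rather than raising an IndexError.

```python
def title_after_last_hyphen(full_title):
    # Original approach: keep everything after the rightmost "-".
    # This truncates any title that itself contains a hyphen.
    return full_title[full_title.rfind('-') + 1:]


def title_after_separator(full_title, separator="- -"):
    # Suggested approach: keep everything after the first "- -" separator.
    # If the separator is missing, fall back to the whole title.
    head, found, tail = full_title.partition(separator)
    return tail if found else full_title


# Assumed sample in the "title - -title" shape; the title contains a "-".
sample = "GSM的网络演进-从GSM到GPRS到3G - -GSM的网络演进-从GSM到GPRS到3G"

print(title_after_last_hyphen(sample))
print(title_after_separator(sample))
```

With this sample, the first function keeps only the text after the hyphen inside the title, while the second returns the full title.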

-- Paul


Frank Potter 01-28-2007 02:33 AM

Re: extracting from web pages but got disordered words sometimes
 
Thank you, I tried again and figured it out.
The problem was with Beautiful Soup. I worked with it a year ago, also
on Chinese HTML pages, and no errors occurred. I read the
old code and found the difference: convert the page to unicode before
feeding it to Beautiful Soup, and then everything is OK.
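
For anyone finding this later, a rough sketch of the "decode first, then parse" idea follows. It is not the code from my old script. It assumes Python 3, uses the standard library's html.parser in place of Beautiful Soup so it runs without extra packages, and decodes with "gb18030", which I am assuming is acceptable because it is a superset of gb2312. The class name, function name, and sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def extract_title(raw_page, encoding="gb18030"):
    # Decode the raw bytes to text BEFORE handing them to the parser,
    # so the parser never has to guess the encoding.
    text = raw_page.decode(encoding, errors="replace")
    parser = TitleParser()
    parser.feed(text)
    parser.close()
    return "".join(parser.title_parts).strip()


# Made-up page standing in for the bytes returned by response.read().
raw_page = ("<html><head><title>GSM的数据业务</title></head>"
            "<body></body></html>").encode("gb2312")

print(extract_title(raw_page))
```

The same idea should carry over to Beautiful Soup: decode the bytes from response.read() yourself and pass the resulting text to the parser.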




