Velocity Reviews - Computer Hardware Reviews

Extracting text from a Webpage using BeautifulSoup

 
 
Magnus.Moraberg@gmail.com
05-27-2008
Hi,

I wish to extract all the words on a set of webpages and store them in
a large dictionary. I then wish to produce a list with the most common
words for the language under consideration. So, my code below reads
the page -

http://news.bbc.co.uk/welsh/hi/newsi...00/7420967.stm

a Welsh-language page. I hope then to establish the 1000 most commonly
used words in Welsh. The problem I'm having is that
soup.findAll(text=True) is returning the likes of -

u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"'

and -

<a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"

Any suggestions how I might overcome this problem?

Thanks,

Barry.


Here's my code -

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

# proxy_support = urllib2.ProxyHandler({"http": "http://999.999.999.999:8080"})
# opener = urllib2.build_opener(proxy_support)
# urllib2.install_opener(opener)

page = urllib2.urlopen('http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm')
soup = BeautifulSoup(page)

pageText = soup.findAll(text=True)
print pageText
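(The counting step the post describes can be sketched with `collections.Counter` from the standard library; the list below is a made-up stand-in for the strings `findAll(text=True)` returns, and the code uses Python 3 syntax rather than the thread's Python 2.)

```python
import re
from collections import Counter

# Stand-in for the list of strings soup.findAll(text=True) would return.
text_nodes = [
    "Newyddion o Gymru", "y newyddion diweddaraf",
    "mwy o newyddion", "y tywydd yn y gogledd",
]

counts = Counter()
for node in text_nodes:
    # Split on anything that is not a letter; lowercase before counting.
    counts.update(w.lower() for w in re.findall(r"[^\W\d_]+", node))

# The 1000 most common words (here the sample is much shorter than 1000).
most_common = counts.most_common(1000)
print(most_common)
```

`Counter.most_common(n)` does the sort-and-truncate step, so no manual dictionary sorting is needed.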

 
Marc 'BlackJack' Rintsch
05-27-2008
On Tue, 27 May 2008 03:01:30 -0700, Magnus.Moraberg wrote:

> I wish to extract all the words on a set of webpages and store them in
> a large dictionary. I then wish to produce a list with the most common
> words for the language under consideration. So, my code below reads
> the page -
>
> http://news.bbc.co.uk/welsh/hi/newsi...00/7420967.stm
>
> a welsh language page. I hope to then establish the 1000 most commonly
> used words in Welsh. The problem I'm having is that
> soup.findAll(text=True) is returning the likes of -
>
> u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"'


Just extract the text from the body of the document.

body_texts = soup.body(text=True)

> and -
>
> <a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"
>
> Any suggestions how I might overcome this problem?


Ask the BBC to produce HTML that's less buggy.

http://validator.w3.org/ reports bugs like "'body' tag not allowed here"
or closing tags without opening ones and so on.

Ciao,
Marc 'BlackJack' Rintsch
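(A stdlib-only variant of the same idea, for readers without BeautifulSoup: the sketch below uses Python 3's `html.parser` to collect text nodes while skipping the contents of `<script>` and `<style>`. The doctype never reaches `handle_data`, so it is dropped automatically. The sample HTML is invented for illustration.)

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect visible text, ignoring script/style contents and the doctype."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

html_doc = """<!DOCTYPE html><html><head>
<script>var url = 'x';</script><style>p { color: red }</style>
</head><body><p>Croeso i'r dudalen</p><p>newyddion</p></body></html>"""

parser = BodyTextExtractor()
parser.feed(html_doc)
print(parser.chunks)   # text nodes only, no doctype or script code
```

This tolerates sloppy markup reasonably well, since `HTMLParser` does not require the document to validate.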
 
Magnus.Moraberg@gmail.com
05-27-2008
On 27 May, 12:54, Marc 'BlackJack' Rintsch <(E-Mail Removed)> wrote:
> Just extract the text from the body of the document.
>
> body_texts = soup.body(text=True)
>
> Ask the BBC to produce HTML that's less buggy.
>
> http://validator.w3.org/ reports bugs like "'body' tag not allowed here"
> or closing tags without opening ones and so on.
>
> Ciao,
> Marc 'BlackJack' Rintsch


Great, thanks!
 
Paul McGuire
05-28-2008
On May 27, 5:01 am, (E-Mail Removed) wrote:
> I wish to extract all the words on a set of webpages and store them in
> a large dictionary. I then wish to produce a list with the most common
> words for the language under consideration. I hope then to establish
> the 1000 most commonly used words in Welsh.
>
> Any suggestions how I might overcome this problem?


As an alternative data point, you can try out the htmlStripper example
on the pyparsing wiki: http://pyparsing.wikispaces.com/spac...tmlStripper.py

-- Paul
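(The linked pyparsing example is not reproduced here; as a rough stdlib sketch of the same tag-stripping idea, a few regex passes work for simple pages, with the usual caveat that regexes are not a real HTML parser.)

```python
import re

def strip_html(markup: str) -> str:
    """Crude tag stripper: drop script/style blocks, comments, then any tags."""
    markup = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", markup)
    markup = re.sub(r"(?s)<!--.*?-->", " ", markup)   # HTML comments
    markup = re.sub(r"(?s)<[^>]+>", " ", markup)      # remaining tags
    return " ".join(markup.split())                   # collapse whitespace

print(strip_html("<p>Hello <b>world</b></p><script>x=1</script>"))
```

For anything beyond quick one-off scraping, a real parser (BeautifulSoup, `html.parser`, or pyparsing's htmlStripper) is the safer choice.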
 