python-parser running Beautiful Soup needs to be reviewed

 
 
Martin Kaspar
      12-11-2010
Hello community,

I am new to Python and to Beautiful Soup as well.
It is said to be a great tool for parsing and extracting content, so here I
am.

I want to extract the content of a <td> tag from a table in an HTML
document. For example, I have this table:

<table class="bp_ergebnis_tab_info">
<tr>
<td>
This is a sample text
</td>

<td>
This is the second sample text
</td>
</tr>
</table>

How can I use Beautiful Soup to get the text "This is a sample text"?

Should I use
soup.findAll('table', attrs={'class':'bp_ergebnis_tab_info'}) to get
the whole table?

See the target page: http://www.schulministerium.nrw.de/B...seMapDO=142323

Here is what I have so far.

The first step is to find the table. I use find rather than findAll,
because find returns the first match directly (rather than a list of
all matches, in which case we would have to add an extra [0] to take
the first element of the list):


table = soup.find('table', attrs={'class':'bp_ergebnis_tab_info'})

Then use find again, this time on the table, to find its first td:

first_td = table.find('td')

Then we use renderContents() to extract the textual contents:

text = first_td.renderContents()

... and the job is done (though we may also want to use strip() to
remove leading and trailing whitespace):

trimmed_text = text.strip()

This should give us:


print trimmed_text
This is a sample text

as desired.
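
For readers who want to try the steps above without installing anything, here is a minimal sketch that uses Python 3's standard-library html.parser instead of Beautiful Soup. It is a stand-in, not the Beautiful Soup API: the class name TableCellText, the extra "other" table and the variable names are made up for illustration. It shows why the td search should be limited to the table with the wanted class, and it does not try to handle cells that contain nested tables.

```python
from html.parser import HTMLParser


class TableCellText(HTMLParser):
    """Collect the text of every <td> inside tables with a given class."""

    def __init__(self, table_class):
        super().__init__()
        self.table_class = table_class
        self.table_depth = 0   # > 0 while inside a matching table
        self.in_cell = False   # True while inside a <td> of that table
        self.cells = []        # raw text of each <td>, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            # Enter a matching table, or count tables nested inside one.
            if self.table_depth or self.table_class in classes:
                self.table_depth += 1
        elif tag == "td" and self.table_depth:
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


# Hypothetical page: an unrelated table comes first, then the one we want.
html = """
<table class="other"><tr><td>Not this one</td></tr></table>
<table class="bp_ergebnis_tab_info">
<tr>
<td>
This is a sample text
</td>

<td>
This is the second sample text
</td>
</tr>
</table>
"""

parser = TableCellText("bp_ergebnis_tab_info")
parser.feed(html)
parser.close()

cells = [cell.strip() for cell in parser.cells]  # same idea as strip() above
first_text = cells[0]
print(first_text)  # prints: This is a sample text
```

Because the cells of the "other" table are skipped, the first collected cell is the one from bp_ergebnis_tab_info, which is what searching inside the table (rather than the whole document) is meant to guarantee.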


What do you think about the code? I would love to hear from you!

greetings
matze
 
 
Stef Mientki
      12-11-2010
On 11-12-2010 17:24, Martin Kaspar wrote:
> [snip]

I've no opinion.
I'm just struggling with BeautifulSoup myself; I find it one of the toughest libs I've seen.

So here is the simplest solution I came up with:

from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 import

Text = """
<table class="bp_ergebnis_tab_info">
<tr>
<td>
This is a sample text
</td>

<td>
This is the second sample text
</td>
</tr>
</table>
"""
Content = BeautifulSoup(Text)
# find() returns the first <td>; contents[0] is its text node
print Content.find('td').contents[0].strip()
# prints: This is a sample text


And now I wonder how to get the contents of the next cell!

cheers,
Stef


 
 
Peter Pearson
      12-11-2010
On Sat, 11 Dec 2010 22:38:43 +0100, Stef Mientki wrote:
[snip]
> So the simplest solution I came up with:
>
> Text = """
> <table class="bp_ergebnis_tab_info">
> <tr>
> <td>
> This is a sample text
> </td>
>
> <td>
> This is the second sample text
> </td>
> </tr>
> </table>
> """
> Content = BeautifulSoup ( Text )
> print Content.find('td').contents[0].strip()
> >>> This is a sample text
>
> And now I wonder how to get the next contents !!


Here's a suggestion:

peter@eleodes:~$ python
Python 2.5.2 (r252:60911, Jul 22 2009, 15:35:03)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup
>>> Text = """
... <table class="bp_ergebnis_tab_info">
... <tr>
... <td>
... This is a sample text
... </td>
...
... <td>
... This is the second sample text
... </td>
... </tr>
... </table>
... """
>>> Content = BeautifulSoup ( Text )
>>> for xx in Content.findAll('td'):
...     print xx.contents[0].strip()
...
This is a sample text
This is the second sample text
>>>


--
To email me, substitute nowhere->spamcop, invalid->net.
 
 
Alexander Kapps
      12-11-2010
On 11.12.2010 22:38, Stef Mientki wrote:
> [snip]

> I've no opinion.
> I'm just struggling with BeautifulSoup myself, finding it one of the toughest libs I've seen


Really? While I'm by no means an expert, I find it very easy to work
with. It's very well structured IMHO.

> So the simplest solution I came up with:
>
> Text = """
> <table class="bp_ergebnis_tab_info">
> <tr>
> <td>
> This is a sample text
> </td>
>
> <td>
> This is the second sample text
> </td>
> </tr>
> </table>
> """
> Content = BeautifulSoup ( Text )
> print Content.find('td').contents[0].strip()
> >>> This is a sample text
>
> And now I wonder how to get the next contents !!


Content = BeautifulSoup ( Text )
for td in Content.findAll('td'):
    print td.string.strip() # or td.renderContents().strip()
 
 
Stef Mientki
      12-12-2010
>> I've no opinion.
>> I'm just struggling with BeautifulSoup myself, finding it one of the toughest libs I've seen
>
> Really? While I'm by no means an expert, I find it very easy to work with. It's very well
> structured IMHO.

I think the cause lies in the documentation.
The PySide documentation is much easier to understand (at least for me):

http://www.pyside.org/docs/pyside/Py...ebElement.html

cheers,
Stef
 