# Beautiful Soup iterator question....

cjl

 04-20-2007
P:

I am screen-scraping a table. The table has an unknown number of rows,
but each row has exactly 8 cells. I would like to extract the data
from the cells, but the first three cells in each row have their data
nested inside other tags.

So I have the following code:

    for row in table.findAll("tr"):
        for cell in row.findAll("td"):
            print cell.contents[0]

This code prints out all the data, but of course the first three cells
still contain their unwanted tags.

I would like to do something like this:

    for cell1, cell2, cell3, cell4, cell5, cell6, cell7, cell8 in row.findAll("td"):

Then treat each cell differently.

I can't figure this out. Can anyone point me in the right direction?

-CJL

Steve Holden

 04-20-2007
cjl wrote:
> P:
>
> I am screen-scraping a table. The table has an unknown number of rows,
> but each row has exactly 8 cells. I would like to extract the data
> from the cells, but the first three cells in each row have their data
> nested inside other tags.
>
> So I have the following code:
>
>     for row in table.findAll("tr"):
>         for cell in row.findAll("td"):
>             print cell.contents[0]
>
> This code prints out all the data, but of course the first three cells
> still contain their unwanted tags.
>
> I would like to do something like this:
>
>     for cell1, cell2, cell3, cell4, cell5, cell6, cell7, cell8 in row.findAll("td"):
>
> Then treat each cell differently.
>
> I can't figure this out. Can anyone point me in the right direction?
>

Did you try something like this (untested)?

    cell1, cell2, cell3, cell4, cell5, \
        cell6, cell7, cell8 = row.findAll("td")

There's no need for the "for" if you want to handle each cell
differently, since you won't be iterating over them. And, as you saw,
the "for" version doesn't work unless row.findAll(...) returns a
sequence of eight-item containers.
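
A minimal sketch of the difference between the two forms. It uses a plain list as a stand-in for what row.findAll("td") returns (an assumption, so the snippet runs without Beautiful Soup installed); the cell values are made up for illustration.

```python
# Hypothetical stand-in for row.findAll("td"): one row of eight cells.
cells = ["a", "b", "c", "d", "e", "f", "g", "h"]

# Direct unpacking works because the sequence has exactly eight items.
cell1, cell2, cell3, cell4, cell5, cell6, cell7, cell8 = cells

# The "for" form tries to unpack each *item* into eight names. Here each
# item is a one-character string, not an eight-item container, so the
# unpacking raises ValueError on the first iteration.
try:
    for c1, c2, c3, c4, c5, c6, c7, c8 in cells:
        pass
    loop_unpack_failed = False
except ValueError:
    loop_unpack_failed = True
```

Direct unpacking also raises ValueError if a row has more or fewer than eight cells, so it only suits tables where the cell count is guaranteed.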

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com

Paul McGuire

 04-20-2007
On Apr 20, 2:05 pm, Steve Holden wrote:
<snip>
>
> did you try something like (untested)
>
>     cell1, cell2, cell3, cell4, cell5, \
>         cell6, cell7, cell8 = row.findAll("td")
>
> No need for the "for" if you want to handle each cell differently, you
> won't be iterating over them. And, as you saw, it doesn't work unless
> row.findAll(...) returns a sequence of eight-item containers.
>

One defensive approach for handling rows that might have too few or too
many elements is to construct a larger list and then slice the right
number of elements from it:

    cell1, cell2, cell3, cell4, cell5, \
        cell6, cell7, cell8 = (row.findAll("td") + [None]*8)[:8]
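
The pad-then-slice idea can be sketched as a small helper. The name first_n and the sample rows are hypothetical, and plain lists again stand in for the result of row.findAll("td") so the snippet runs on its own.

```python
def first_n(items, n=8, fill=None):
    """Return exactly n items: drop any extras, pad a short list with fill."""
    # Appending n fillers guarantees at least n items before slicing.
    return (list(items) + [fill] * n)[:n]

short_row = ["a", "b", "c"]      # too few cells
long_row = list(range(10))       # too many cells

padded = first_n(short_row)
trimmed = first_n(long_row)

# Unpacking is now safe regardless of the original row length.
cell1, cell2, cell3, cell4, cell5, cell6, cell7, cell8 = padded
```

Padding with None means later code has to check for None before touching a cell, which is the price of never raising on a malformed row.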

-- Paul