Velocity Reviews - Computer Hardware Reviews

how can i use lxml with win32com?

 
 
Michiel Overtoom
      10-25-2009
elca wrote:

> http://news.search.naver.com/search....+times&x=0&y=0
> That is a Korean portal site, and I searched for the keyword 'korea times'.
> I want to scrape the results into a text file named 'blogscrap_save.txt'.


Aha, now we're getting somewhere.

Getting and parsing that page is no problem, and doesn't need JavaScript
or Internet Explorer.

import urllib2
import BeautifulSoup

# Fetch the search result page and parse it (Python 2, BeautifulSoup 3).
url = "http://news.search.naver.com/search.naver?sm=tab_hty&where=news&query=korea+times&x=0&y=0"
doc = urllib2.urlopen(url)
soup = BeautifulSoup.BeautifulSoup(doc)
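
For readers on Python 3, where urllib2 was split into urllib.request and urllib.parse, the fetch step might look roughly like the sketch below. The helper names build_search_url and fetch are made up for illustration, and the parameter list is simply copied from the URL above; whether the site still accepts these parameters is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address taken from the URL in the post above.
BASE_URL = "http://news.search.naver.com/search.naver"


def build_search_url(keyword):
    """Return the news search URL for a keyword (hypothetical helper).

    urlencode escapes the keyword, so a space becomes '+'.
    """
    params = [
        ("sm", "tab_hty"),
        ("where", "news"),
        ("query", keyword),
        ("x", "0"),
        ("y", "0"),
    ]
    return BASE_URL + "?" + urlencode(params)


def fetch(keyword):
    """Download the result page as bytes (hypothetical helper, needs network access)."""
    with urlopen(build_search_url(keyword)) as response:
        return response.read()
```

Building the query string with urlencode rather than typing it by hand avoids stray spaces or unescaped characters in the keyword.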


By analyzing the structure of that page you can see that the articles
are presented in an unordered list which has class "type01". The
interesting bit in each list item is encapsulated in a <dd> tag with
class "sh_news_passage". So, to parse the articles:

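# Each article is an <li> inside <ul class="type01">.
ul = soup.find("ul", "type01")
for li in ul.findAll("li"):
    dd = li.find("dd", "sh_news_passage")
    if dd is not None:  # skip list items that have no passage
        print dd.renderContents()
        print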

This example prints them, but you could also save them to a file (or a
database, whatever).
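
A minimal sketch of the "save them to a file" variant, written for Python 3. The function name save_passages is hypothetical, the default file name comes from the original question, and UTF-8 is an assumed encoding chosen so that Korean text survives on any platform.

```python
def save_passages(passages, path="blogscrap_save.txt"):
    """Write each passage to `path`, separated by blank lines.

    `passages` is any iterable of strings; returns the number written.
    """
    count = 0
    # UTF-8 is an assumption, not something the original post specifies.
    with open(path, "w", encoding="utf-8") as out:
        for text in passages:
            out.write(text.strip() + "\n\n")
            count += 1
    return count
```

In the loop above, the strings passed in would be the contents of each <dd> element, collected into a list instead of printed.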

Greetings,



--
"The ability of the OSS process to collect and harness
the collective IQ of thousands of individuals across
the Internet is simply amazing." - Vinod Valloppillil
http://www.catb.org/~esr/halloween/halloween4.html
 
Dennis Lee Bieber
      10-26-2009
On Sun, 25 Oct 2009 14:50:22 +0100, Irmen de Jong
declaimed the following in
gmane.comp.python.general:

> Michiel Overtoom wrote:
> > elca wrote:
> >
> >> I'm sorry, I'm also not familiar with newsgroups.
> >
> > It's not a newsgroup, but a mailing list. And if you're new to a certain
> > community you're not familiar with, it's best to lurk a few days to see
> > how it is used.
>
> Pot. Kettle. Black.
> comp.lang.python really is a Usenet newsgroup. There is a mailing list that mirrors the
> newsgroup though.
>

And the mailing list is then also available via NNTP on gmane as
gmane.comp.python.general...

comp.lang.python (via NNTP)
    <-> mailing list (via SMTP/POP3)
        <-> gmane.comp.python.general (via NNTP)


I'm deliberately not defining what Google does with it...
--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/

elca
      10-26-2009

Hi, thanks for your help.
This thread is getting too long, so I will open a new post.
Thanks a lot.

Paul
