how can i use lxml with win32com?

 
 
Stefan Behnel
 
      10-25-2009
elca, 25.10.2009 08:46:
> I'm very sorry about my English.


It's fairly common in this newsgroup that people do not have a good level
of English, so that's perfectly ok. But you should try to provide more
information in your posts. Be explicit about what you tried and what failed
(and how!), and provide short code examples and exact copies of failure
messages whenever possible. That will help others in understanding what is
going on on your side. Remember that we can't look at your screen, nor read
your mind.

Oh, and please don't top-post in replies.

Stefan
 
 
 
 
 
elca
 
      10-25-2009

Hello,
thanks for your reply.
Actually, the website I want to parse is in a different language,
so I quoted a common English website to make it easier to understand.
By the way, is it possible to make PAMIE and BeautifulSoup work
together?
Thanks a lot



motoom wrote:
>
> elca wrote:
>
>> Yes, I want to extract the text 'CNN Shop' and the linked page
>> 'http://www.turnerstoreonline.com'.

>
> Well then.
> First, we'll get the page using urllib2:
>
> doc=urllib2.urlopen("http://www.cnn.com")
>
> Then we'll feed it into the HTML parser:
>
> soup=BeautifulSoup(doc)
>
> Next, we'll look at all the links in the page:
>
> for a in soup.findAll("a"):
>
> and when a link has the text 'CNN Shop', we have a hit,
> and print the URL:
>
>     if a.renderContents()=="CNN Shop":
>         print a["href"]
>
>
> The complete program is thus:
>
> import urllib2
> from BeautifulSoup import BeautifulSoup
>
> doc=urllib2.urlopen("http://www.cnn.com")
> soup=BeautifulSoup(doc)
> for a in soup.findAll("a"):
>     if a.renderContents()=="CNN Shop":
>         print a["href"]
>
>
> The example above can be condensed because BeautifulSoup's find function
> can also look for texts:
>
> print soup.find("a",text="CNN Shop")
>
> and since that's a navigable string, we can ascend to its parent and
> display the href attribute:
>
> print soup.find("a",text="CNN Shop").findParent()["href"]
>
> So eventually the whole program could be collapsed into one line:
>
> print BeautifulSoup(urllib2.urlopen("http://www.cnn.com")).find("a",text="CNN Shop").findParent()["href"]
>
> ...but I think this is very ugly!
>
>
> > I'm very sorry about my English.

>
> Your English is quite understandable. The hard part is figuring out what
> exactly you wanted to achieve.
>
> I have a question too. Why did you think JavaScript was necessary to
> arrive at this result?
>
> Greetings,
> --
> http://mail.python.org/mailman/listinfo/python-list
>
>
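
For readers on Python 3, where urllib2 and the old BeautifulSoup 3 import are not available, the link lookup described in the quoted message can be sketched with the standard library alone. This is only an illustration, not the code from the thread: the LinkFinder class, the find_links helper and the SAMPLE markup are made up for the example, and the HTML is an inline string rather than the live CNN page.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect the href of every <a> element whose text equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.hrefs = []      # matching URLs, in document order
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while we are inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == self.wanted:
                self.hrefs.append(self._href)
            self._href = None


def find_links(html, wanted):
    """Return the hrefs of all links in `html` whose text is `wanted`."""
    finder = LinkFinder(wanted)
    finder.feed(html)
    finder.close()
    return finder.hrefs


# Made-up sample markup standing in for the downloaded page.
SAMPLE = """
<ul>
  <li><a href="http://www.cnn.com/weather">Weather</a></li>
  <li><a href="http://www.turnerstoreonline.com">CNN Shop</a></li>
</ul>
"""

print(find_links(SAMPLE, "CNN Shop"))
```

The parser walks the page once and compares the text of each link, which mirrors the explicit findAll loop in the quoted message rather than the condensed one-liner.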


--
View this message in context: http://www.nabble.com/how-can-i-use-...p26045979.html
Sent from the Python - python-list mailing list archive at Nabble.com.

 
 
 
 
 
Michiel Overtoom
 
      10-25-2009
elca wrote:

> Actually, the website I want to parse is in a different language,


A different website? What website? What text? Please show your actual
use case, instead of smokescreens.


> so I quoted a common English website to make it easier to understand.


And, did you learn something from it? Were you able to apply the
technique to the other website?


> By the way, is it possible to make PAMIE and BeautifulSoup work
> together?


If you define 'working together' as 'PAMIE produces HTML text and
BeautifulSoup parses it', then maybe yes.

Greetings,

--
"The ability of the OSS process to collect and harness
the collective IQ of thousands of individuals across
the Internet is simply amazing." - Vinod Valloppillil
http://www.catb.org/~esr/halloween/halloween4.html
 
 
elca
 
      10-25-2009

Hello,
here is what I actually want.
If you run my script, you reach this page:
'http://news.search.naver.com/search.naver?sm=tab_hty&where=news&query=korea+times&x=0&y=0'
That is a Korean portal site, where I searched for the keyword 'korea times'.
I want to scrape the results into a text file named 'blogscrap_save.txt'.
If you run the script, you can see
the following article:

"Yesan County: How do you like them apples?
코리아헤럴드 |
carp fishing at the Yedang Reservoir -
Korea`s biggest - taking a nice stroll...
During the curator`s recitation of Yun`s life and times as a resistance
and freedom fighter,
he would emphsize random ...
"

and you can also see the following article, and so on:
"
10,000 Nepalese Diaspora Emerging in Korea
코리아타임스 세계 | 2009.10.23 (금) 오후 9:31
Although the Nepalese community in Korea is worker dominated,
there are... yoga is popular among Nepalese. These festivals are the
times when expatriate Nepalese feel nostalgic for their... "

So the actual scraping process is this:
first I want to search for a keyword, and then save the resulting articles as
text only.


I have attached the script I am currently writing, but it is not very good and
does not work well.
The extraction part is especially hard for a novice like me.
Thanks in advance.




http://www.nabble.com/file/p26046215/untitled-1.py untitled-1.py


 
paul
 
      10-25-2009
elca schrieb:
> Hello,

Hi,

> The following script is supposed to make BeautifulSoup and PAMIE work together,
> but when I run it, an error occurs:
>
> AttributeError: PAMIE instance has no attribute 'pageText'
> File "C:\test12.py", line 7, in <module>
> bs = BeautifulSoup(ie.pageText())

You could execute the script line by line in the python console, then
after the line "ie = PAMIE(url)" look at the "ie" object with "dir(ie)"
to check if it really looks like a healthy instance. ...got bored, just
tried it -- looks like pageText() has been renamed to getPageText().
Try:
text = PAMIE('http://www.cnn.com').getPageText()

cheers
Paul
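
The inspection step described above can be sketched as follows in Python 3. The Browser class here is a hypothetical stand-in so the example runs without Internet Explorer or PAMIE installed; with the real library you would inspect the object returned by PAMIE(url) in the same way.

```python
class Browser:
    """Hypothetical stand-in for a third-party object such as a PAMIE instance."""

    def navigate(self, url):
        self.url = url

    def getPageText(self):
        return "<html>...</html>"


ie = Browser()

# List the public attributes, as dir(ie) would show at the interactive prompt.
public = [name for name in dir(ie) if not name.startswith("_")]
print(public)

# Look for anything that resembles the method name from the traceback.
candidates = [name for name in public if "pagetext" in name.lower()]
print(candidates)

# hasattr() confirms which spelling exists before you call it.
print(hasattr(ie, "pageText"), hasattr(ie, "getPageText"))
```

Filtering the dir() output by a fragment of the missing name is often the quickest way to spot a renamed method.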

>
> And the following is the original source, as I found it on the internet.
>
> from BeautifulSoup import BeautifulSoup
> from PAM30 import PAMIE
> url = 'http://www.cnn.com'
> ie = PAMIE(url)
> bs = BeautifulSoup(ie.pageText())
>
> If possible, I really want to make PAMIE work together with BeautifulSoup or
> lxml.
> Sorry for my bad English.
> Thanks in advance.
>
>
>
>
>
>
> Stefan Behnel-3 wrote:
>> Hi,
>>
>> elca, 25.10.2009 02:35:
>>> Hello...
>>> If anyone knows, please help me!
>>> I really want to know. I have searched Google many times
>>> but couldn't find a clear solution, partly because of my lack of Python
>>> knowledge.
>>> I want to use the IE.navigate function with BeautifulSoup or lxml.
>>> If anyone knows about this or has a sample,
>>> please help me!
>>> Thanks in advance.

>> You wrote a message with nine lines, only one of which gives a tiny hint on
>> what you actually want to do. What about providing an explanation of what
>> you want to achieve instead? Try to answer questions like: Where does your
>> data come from? Is it XML or HTML? What do you want to do with it?
>>
>> This might help:
>>
>> http://www.catb.org/~esr/faqs/smart-questions.html
>>
>> Stefan
>> --
>> http://mail.python.org/mailman/listinfo/python-list
>>
>>

>


 
 
elca
 
      10-25-2009

Hi,
thanks a lot.
Studying alone is tough.
How can I improve my skills?


 
paul
 
      10-25-2009
elca schrieb:
> Hi,
> thanks a lot.
> Studying alone is tough.
> How can I improve my skills?

1. Stop top-posting.
2. Read documentation
3. Use the interactive prompt

cheers
Paul

 
elca
 
      10-25-2009



paul kölle wrote:
>
> elca schrieb:
>> Hi,
>> thanks a lot.
>> Studying alone is tough.
>> How can I improve my skills?

> 1. Stop top-posting.
> 2. Read documentation
> 3. Use the interactive prompt
>
> cheers
> Paul
>


Hello,
I'm sorry, I'm also not familiar with newsgroups.
Is this the right position for bottom-posting?
If it is wrong, please correct me.
Thanks. In addition, just before you replied I was testing

text = PAMIE('http://www.naver.com').getPageText()
and I have a question:
how can I keep only one window open, instead of opening several?
Here is my scenario:
after opening www.cnn.com I want to go to
http://www.cnn.com/2009/US/10/24/tee...doe/index.html
while keeping only one window open.

text = PAMIE('http://www.cnn.com').getPageText()
sleep(5)
text = PAMIE('http://www.cnn.com/2009/US/10/24/teen.jane.doe/index.html')
Thanks in advance


 
Michiel Overtoom
 
      10-25-2009
elca wrote:

> I'm sorry, I'm also not familiar with newsgroups.


It's not a newsgroup, but a mailing list. And if you're new to a certain
community you're not familiar with, it's best to lurk a few days to see
how it is used.


> Is this the right position for bottom-posting?


It is, but you should also cut away any quoted text that is not directly
related to the answer.
Otherwise people have to scroll many screens full of text before they
can see the answer.


> How can I keep only one window open, instead of opening several?


The trick is to instantiate a single PAMIE object and reuse it, rather
than creating a new one for every page.
Like:

import time
import PAM30
ie=PAM30.PAMIE( )

ie.navigate("http://www.cnn.com")
text1=ie.getPageText()

ie.navigate("http://www.nu.nl")
text2=ie.getPageText()

ie.quit()
print len(text1), len(text2)


But still I think it's unnecessary to use Internet Explorer to get
simple web pages.
The standard library "urllib2.urlopen()" works just as well, and doesn't
rely on Internet Explorer to be present.
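
In Python 3, urllib2.urlopen() lives on as urllib.request.urlopen(). A minimal sketch of fetching a page without a browser follows; the fetch_text helper and the sample data: URL are assumptions made for this example (the data: URL only keeps it runnable offline, a real call would pass an http:// or https:// address).

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_text(url, default_charset="utf-8"):
    """Download `url` and return the response body decoded to str."""
    with urlopen(url) as response:
        # Use the charset from the Content-Type header when one is given.
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset, errors="replace")


# A data: URL stands in for a real address such as "http://www.cnn.com"
# so that the sketch needs no network connection.
sample_url = "data:text/html;charset=utf-8," + quote("<p>Hello from urllib</p>")
text = fetch_text(sample_url)
print(len(text), text)
```

The returned string can be handed to an HTML parser in the same way as the text obtained from getPageText().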

Greetings,


--
"The ability of the OSS process to collect and harness
the collective IQ of thousands of individuals across
the Internet is simply amazing." - Vinod Valloppillil
http://www.catb.org/~esr/halloween/halloween4.html
 
 
Irmen de Jong
 
      10-25-2009
Michiel Overtoom wrote:
> elca wrote:
>
>> I'm sorry, I'm also not familiar with newsgroups.

>
> It's not a newsgroup, but a mailing list. And if you're new to a certain
> community you're not familiar with, it's best to lurk a few days to see
> how it is used.


Pot. Kettle. Black.
comp.lang.python really is a Usenet newsgroup. There is a mailing list that mirrors the
newsgroup, though.

-irmen
 