RE: Parsing/Crawler Questions - solution

 
 
bruce
      03-07-2009
.....

and this solution will somehow allow a user to create a web parsing/scraping
app for parsing links and javascript from a web page?


-----Original Message-----
From: python-list-bounces+bedouglas=(E-Mail Removed)
[mailto:python-list-bounces+bedouglas=(E-Mail Removed)] On Behalf
Of lkcl
Sent: Saturday, March 07, 2009 2:34 AM
To: (E-Mail Removed)
Subject: Re: Parsing/Crawler Questions - solution


On Mar 7, 12:19 am, (E-Mail Removed) wrote:
> So, it sounds like your update means that it is related to a specific
> url.
>
> I'm curious about this issue myself. I've often wondered how one
> could properly crawl an AJAX-ish site when you're not sure how quickly
> the data will be returned after the page has been loaded.


you want to look at the webkit engine - no not the graphical browser
- the ParseTree example - and combine it with pywebkitgtk - no not the
"original" version, the one which has DOM-manipulation bindings
through webkit-glib.

the webkit parse tree example, despite being based on the GTK
"port" as they like to call it in webkit (which just means that it
links with GTK, not Qt4 or wxWidgets), is a console-based application.

in other words, despite it being GTK, it still does NOT output
graphical crap to the screen, yet it still *executes* the javascript
on the page.

dummy functions for "mouse", "keyboard", "console errors" are given as
examples and are left as an exercise for the application writer to
fill-in-the-blanks.

combining this parse tree example with pywebkitgtk (see
demobrowser.py) would provide a means by which web pages can be
executed AT THE CONSOLE NOT AS A GUI APP, then, thanks to the glib /
gobject bindings, a python app will be able to walk the DOM tree as
expected.
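
On the timing question quoted above (not knowing how quickly the AJAX data will arrive), one common approach is to poll a snapshot of the DOM until it stops changing. The sketch below is an assumption of mine, not something the ParseTree example or pywebkitgtk provides: `wait_until_stable` and its parameters are hypothetical names, and `snapshot` stands for any callable that returns the current state of the page (for instance a serialised DOM string).

```python
import time


def wait_until_stable(snapshot, settle=0.5, timeout=10.0, interval=0.1):
    """Poll snapshot() until its value has not changed for `settle` seconds.

    Returns the last value seen, or raises TimeoutError if the value
    keeps changing for longer than `timeout` seconds.
    """
    start = time.monotonic()
    last = snapshot()
    last_change = start
    while True:
        now = time.monotonic()
        if now - last_change >= settle:
            return last  # nothing changed for `settle` seconds
        if now - start >= timeout:
            raise TimeoutError("page did not settle within %.1fs" % timeout)
        time.sleep(interval)
        current = snapshot()
        if current != last:
            # The page changed (e.g. an AJAX response arrived): restart the clock.
            last = current
            last_change = time.monotonic()
```

In a real webkit/GTK program the plain `time.sleep` would block the main loop that runs the javascript, so the wait would have to be driven from that loop instead; that part is deliberately left out here.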

i _just_ fixed pyjamas-desktop's iterators in the pyjamas.DOM module
for someone, on the pyjamas-dev mailing list.


http://github.com/lkcl/pyjamas-deskt...51c3e3ced662a2dd014540

so, actually, you may be better off starting from pyjamas-desktop and
then cutting out the "fire up the GTK window" bit, from pyjd.py.

pyjd.py is based on pywebkitgtk's demobrowser.py

the alternative to webkit is to use python-hulahop - it will do the
same thing, but just using python bindings to gecko instead of
python-bindings-to-glib-bindings-to-webkit.


l.
--
http://mail.python.org/mailman/listinfo/python-list

 
lkcl
      03-08-2009
On Mar 7, 9:56 pm, "bruce" <(E-Mail Removed)> wrote:
> ....
>
> and this solution will somehow allow a user to create a web parsing/scraping
> app for parsing links and javascript from a web page?



not just parsing the links and the "static" javascript, but:

* actually executing the javascript, giving the "page" a
chance to actually _look_ like it would if it was being viewed in a
"real" web browser.

so any XMLHttpRequests will _actually_ get executed, and the content
of the web page will _actually_ be modified as a result.

so, e.g., instead of seeing a "Loader" page on gmail you would
_actually_ see the user's email and the adverts (assuming you went to
the trouble of putting in the username/password) because the AJAX
would _actually_ get executed by the WebKit engine, and the DOM model
accessed thereafter.


* giving the user the opportunity to call DOM methods such as
getElementsByTagName and the opportunity to access properties such as
document.anchors.

in webkit-glib "gdom" bindings, that would be:

* anchor_list = gdom_document_get_elements_by_tag_name(doc, "a");

or

* g_object_get(doc, "anchors", &anchor_list, NULL);

which in pywebkitgtk (thanks to python-pygobject auto-generation of
python bindings from gobject bindings) translates into:

* doc.get_elements_by_tag_name("a")

or

* doc.props.anchors

which in pyjamas-desktop, a high-level abstraction on top of _that_,
turns into:

* from pyjamas import DOM
  anchor_list = DOM.getElementsByTagName(doc, "a")

or

* from pyjamas import DOM
  anchor_list = DOM.getAttribute(doc, "anchors")
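
To see the shape of the getElementsByTagName call without installing any of the bindings above, here is a minimal sketch that uses the standard library's xml.dom.minidom as a stand-in. It does not execute javascript, so it covers only the DOM-walking half of the story; `PAGE` and `extract_links` are hypothetical names made up for the illustration, and the markup has to be well-formed XML for minidom to accept it.

```python
from xml.dom.minidom import parseString

# Hypothetical, well-formed sample page (minidom is an XML parser,
# so real-world tag soup would need a more forgiving parser).
PAGE = """<html><body>
<a href="http://example.com/one">one</a>
<p>no link here</p>
<a name="top">named anchor, no href</a>
<a href="/two">two</a>
</body></html>"""


def extract_links(markup):
    """Return the href of every <a> element that has one, in document order."""
    doc = parseString(markup)
    anchors = doc.getElementsByTagName("a")
    return [a.getAttribute("href") for a in anchors if a.hasAttribute("href")]


print(extract_links(PAGE))
```

With one of the javascript-executing engines described above, the same kind of walk would be run against the live document after the scripts have finished, rather than against the raw markup.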

answer: yes.

l.



 