[ann] CGI Link Checker 0.1

 
 
Adayapalam Appaiah Kumaraswamy
 
      07-13-2004
Dear Python users,
I am new to Python. As I learnt a bit more on coding in Python, I
decided to try out a simple project: to write a CGI script in Python to
check links on a single HTML page on the web. Although I am just a hobby
programmer, I thought I could show it to others and ask for their
comments and suggestions. It is my first CGI script as well as my first
Python application, so you might find the coding immature. Please do
correct me wherever necessary.

I looked around the net, but found only a few link-checking
details related to Python. So, I thought I could write a no-frills one
myself.

BTW the W3C Link Checker is written in Perl. I don't know Perl, so I
couldn't look at it for ideas.

I had to face the following problems:

1.Delayed responses for large pages: I worked around this by flushing
sys.stdout after every three links checked; that might lead to
inefficiency, but it does throw the results three at a time to the
impatient user. Otherwise, the Python interpreter would wait until the
output buffer is full before dumping it to the web server's output.

2.Slow: I don't know how to make the script perform better. I've tried
to look into the code to make it run faster, but I couldn't do so. Also,
I think the hosting server's bandwidth may contribute to this. Still,
it takes only about 5 to 10 seconds more than the W3C validator for very
large pages, and 2 to 3 seconds more for smaller ones. Your results may
vary; I'd love to hear them.

3.HTML parsing: I have made no attempt to (and I do not propose to)
check pages with incorrect HTML/XHTML. This means that if the Python
HTMLParser fails, my script exits gracefully. An example of invalid HTML
is http://www.yahoo.com/.
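
A minimal sketch of the parsing step described in item 3, not the actual cgilink source. It assumes Python 3, where the parser lives in html.parser (the 2004-era module was named HTMLParser). The names LinkCollector and extract_links are made up for illustration, and collecting only href attributes is an assumption about what the script checks.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from href attributes (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return the links found in html_text, or None if parsing fails."""
    parser = LinkCollector(base_url)
    try:
        parser.feed(html_text)
        parser.close()
    except Exception as exc:  # broad on purpose: report and give up cleanly
        print("Could not parse page:", exc)
        return None
    return parser.links
```

The same urljoin call also shows why the trailing slash on a directory URL matters: urljoin("http://myserver/my/dir/", "b.html") gives http://myserver/my/dir/b.html, while urljoin("http://myserver/my/dir", "b.html") gives http://myserver/my/b.html. As far as I know, the Python 3 parser is more tolerant of malformed markup than the one available in 2004, so the except branch may rarely fire today.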

Finally, since this is my first Python program, I might not have
properly adapted to the style of programming experienced Python users
may be accustomed to. So, I request you to please correct me in this
regard as well.

In all, it was a good experience, and gave me more than a glimpse of
the power offered by Python.

Please read the instructions on the page before entering your URL to
test the script. Remember to enter the link starting with http:// and
don't forget to add the trailing slash (/) for links which end in a
directory, like

http://myserver/my/dir/

You can spawn the script from:

http://kumar.travisbsd.org/pyprogs/example.html

Personally, I have tried the following sites with this script:
http://www.w3.org/ - Works 100% perfect.
http://www.yahoo.com/ - Invalid HTML. Exits gracefully.

Source code only (meaning without the fancy images and CSS I have used):
http://kumar.travisbsd.org/pyprogs/cgilink.txt

If you want to try hosting the script on your own server, get this and
see the README (This includes all the images and fancy CSS):
http://kumar.travisbsd.org/pyprogs/cgilink-0.1.tar.gz

Thank you.
Kumar

--
Adayapalam Appaiah Kumaraswamy
(Kumar Appaiah)

Web: http://www.ee.iitm.ac.in/~ee03b091/



 
 
 
 
 
Christopher T King
 
      07-13-2004
On Tue, 13 Jul 2004, Adayapalam Appaiah Kumaraswamy wrote:

> Dear Python users,
> I am new to Python. As I learnt a bit more on coding in Python, I
> decided to try out a simple project: to write a CGI script in Python to
> check links on a single HTML page on the web. Although I am just a hobby
> programmer, I thought I could show it to others and ask for their
> comments and suggestions. It is my first CGI script as well as my first
> Python application, so you might find the coding immature. Please do
> correct me wherever necessary.


First off, good job for a first script! The interface looks very
professional, and your code is very clean.

> I had to face the following problems:
>
> 1.Delayed responses for large pages: I worked around this by flushing
> sys.stdout after every three links checked; that might lead to
> inefficiency, but it does throw the results three at a time to the
> impatient user. Otherwise, the Python interpreter would wait until the
> output buffer is full before dumping it to the web server's output.


You could probably flush the buffer after each link is checked; this
shouldn't cause any noticeable overhead (the time spent checking the links
will greatly overshadow the time spent flushing the buffer), but that's
assuming the web server doesn't do any per-buffer-flush processing (which
it might, if you are using server-side-includes).
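
A minimal sketch of flushing after every link rather than every third one. The function name report_results and the (url, status) pair format are assumptions made for illustration; they are not taken from the posted script.

```python
import sys


def report_results(results, out=sys.stdout):
    """Write one line per checked link and flush after each one."""
    for url, status in results:
        out.write("%s: %s\n" % (url, status))
        out.flush()  # hand this line to the web server right away
```

If results is a generator that checks each link lazily, every line reaches the browser as soon as that link's check completes, and the cost of the extra flush calls should be small next to the network round trips.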

> 2.Slow: I don't know how to make the script perform better. I've tried
> to look into the code to make it run faster, but I couldn't do so.


For the same reason as above (time is spent mostly checking the links) I
don't think tweaking the code will help much in this case. I was going to
suggest checking if urllib2 uses read-ahead buffering, but a quick check
reveals it doesn't do any... perhaps the culprit is in the HTML parsing?
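
One way to find out whether parsing or fetching dominates is to time the two steps separately. This is only a sketch: the timed helper, the totals dictionary, and the labels are hypothetical names, and the real script would need its own parse and fetch functions passed in.

```python
import time
from collections import defaultdict

# Accumulated wall-clock seconds per label, e.g. "parse" and "fetch".
totals = defaultdict(float)


def timed(label, func, *args):
    """Call func(*args), adding the elapsed time to totals[label]."""
    start = time.perf_counter()
    try:
        return func(*args)
    finally:
        totals[label] += time.perf_counter() - start
```

Wrapping the parser call as timed("parse", ...) and each link check as timed("fetch", ...) and then printing totals at the end would show which side the time goes to.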

> 3.HTML parsing: I have made no attempt to (and I do not propose to)
> check pages with incorrect HTML/XHTML. This means that if the Python
> HTMLParser fails, my script exits gracefully. An example of invalid HTML
> is http://www.yahoo.com/.


I've seen the BeautifulSoup module recommended before as a parser that
will gracefully handle malformed HTML. It may even be faster than
HTMLParser (but this is just a guess). The homepage is
http://www.crummy.com/software/BeautifulSoup/, but it doesn't seem to be
up right now.

> Finally, since this is my first Python program, I might not have
> properly adapted to the style of programming experienced Python users
> may be accustomed to. So, I request you to please correct me in this
> regard as well.


No corrections needed.

 
 
 
 
 
Christopher T King
 
      07-13-2004
On Tue, 13 Jul 2004, Christopher T King wrote:

> > 2.Slow: I don't know how to make the script perform better. I've tried
> > to look into the code to make it run faster, but I couldn't do so.

>
> For the same reason as above (time is spent mostly checking the links) I
> don't think tweaking the code will help much in this case. I was going to
> suggest checking if urllib2 uses read-ahead buffering, but a quick check
> reveals it doesn't do any... perhaps the culprit is in the HTML parsing?


A further thought on the issue... the W3C's link checker might be
multithreaded, allowing it to check multiple links at the same time,
rather than waiting for each server to respond in turn. This may or may
not help in Python; Python doesn't play well with multithreading (due to a
global interpreter lock), so whether or not you see a speedup using this
method is dependent on whether the socket module is smart enough to
release the interpreter lock (my guess is it is). Otherwise, to get the
same effect, you'd have to use the socket module directly for link
checking, in concert with the select module, which will likely get quite
messy.

 
 
Neil Hodgson
 
      07-13-2004
Christopher T King:

> A further thought on the issue... the W3C's link checker might be
> multithreaded, allowing it to check multiple links at the same time,
> rather than waiting for each server to respond in turn. This may or may
> not help in Python; Python doesn't play well with multithreading (due to a
> global interpreter lock), so whether or not you see a speedup using this
> method is dependent on whether the socket module is smart enough to
> release the interpreter lock (my guess is it is).


Multithreading works well with sockets as the GIL is released during
blocking calls. I have used multithreading for link checking and host load
testing.

Neil


 
 
 
 