Velocity Reviews


Hongyi Zhao 01-29-2009 09:38 AM

Want to extract the proxy list by using regexp.
 
Hi all,

I want to extract the proxy list given in the following url:

http://www.cybersyndrome.net/pla5.html

which is in the following form:

---------------
[snipped]

202.99.29.27:80
221.11.27.110:8080
ip-72-55-191-6.static.privatedns.com:3128
114.30.47.10:80
116.52.155.237:80
204.73.37.112:80
220.227.90.154:8080
211.136.253.234:80
host04.wilsonareasdips.w.subnet.rcn.com:8080

[snipped]
-----------------

First, I use wget to obtain the above webpage:

wget -c http://www.cybersyndrome.net/pla5.html -O pla5

Then I want to use some regular expressions to extract the proxy list.
Can anyone give me some hints?

Regards,

--
..: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

Tad J McClellan 01-29-2009 12:50 PM

Re: Want to extract the proxy list by using regexp.
 
Hongyi Zhao <hongyi.zhao@gmail.com> wrote:


> I want to extract the proxy list given in the following url:
>
> http://www.cybersyndrome.net/pla5.html



> Then I want to use some regular expressions to extract the proxy list,
> who can give me some hints?



Regular expressions are most often not the Right Tool for processing
HTML data.

A module that understands HTML is best for processing HTML data.


------------------------------
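#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;
use LWP::Simple;

# Fetch the page; LWP::Simple's get() returns undef on failure.
my $url  = 'http://www.cybersyndrome.net/pla5.html';
my $html = get($url);
die "Could not fetch $url\n" unless defined $html;

my $tree = HTML::TreeBuilder->new_from_content($html);

# On this page, each proxy entry is an element with onmouseout="d()".
foreach my $elem ( $tree->find_by_attribute('onmouseout', 'd()') ) {
    print $elem->as_text, "\n";
}

$tree->delete;    # free the parse tree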
------------------------------
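
Once the text has been pulled out of the HTML, a regular expression is a
reasonable tool for checking each extracted line. What follows is only a
minimal sketch, assuming every entry has the host:port shape shown in the
original post. The sub name parse_proxy and the sample array are made up
for illustration, and the host pattern is deliberately loose (letters,
digits, dots and hyphens), not a full hostname or IP validator.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: split one extracted line into (host, port).
# Returns an empty list if the line does not look like host:port.
sub parse_proxy {
    my ($line) = @_;
    return unless defined $line;

    # Host: letters, digits, dots, hyphens; must start and end alphanumeric.
    # Port: 1 to 5 digits, range-checked below.
    if ( $line =~ /^\s*([A-Za-z0-9](?:[A-Za-z0-9.-]*[A-Za-z0-9])?):(\d{1,5})\s*$/ ) {
        my ( $host, $port ) = ( $1, $2 );
        return ( $host, $port ) if $port >= 1 && $port <= 65535;
    }
    return;
}

# Sample lines copied from the list in the original post, plus one bad line.
my @lines = (
    '202.99.29.27:80',
    'ip-72-55-191-6.static.privatedns.com:3128',
    'not a proxy',
);

foreach my $line (@lines) {
    # List assignment in boolean context is false when nothing was returned.
    my ( $host, $port ) = parse_proxy($line) or next;
    print "$host $port\n";
}
```

In the script above, the same check could be applied to $elem->as_text
inside the loop before printing.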


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

Hongyi Zhao 01-29-2009 01:26 PM

Re: Want to extract the proxy list by using regexp.
 
On Thu, 29 Jan 2009 06:50:36 -0600, Tad J McClellan
<tadmc@seesig.invalid> wrote:

>[snipped]


Very good, thanks a lot.

--
..: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

