Velocity Reviews - Computer Hardware Reviews

strip all html but links

 
 
Felix Smith
 
      01-11-2004
How would you go about removing all HTML tags from a Web page's source
code, except for links? I've been successfully using the function
below to get rid of *all* HTML tags, but I need to keep the links. Any
code you can post to help will be much appreciated.

Felix.

function I've been using:

use strict;
use HTML::TreeBuilder;
use HTML::FormatText;

sub html_to_ascii {
    my $document = $_[0];
    my $html = HTML::TreeBuilder->new();
    $html->parse($document);
    $html->eof();      # tell the parser the input is complete
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
    my $return = $formatter->format($html);
    $html->delete();   # free the parse tree
    return $return;
}
 
 
A. Sinan Unur
 
      01-11-2004
(E-Mail Removed) (Felix Smith) wrote in news:901f024b.0401101704.51858e29@posting.google.com:

> How would you go about removing all html tags from a Web page's source
> code, except for links?


See the hanchors example that comes with the HTML::Parser module:

http://search.cpan.org/src/GAAS/HTML-Parser-3.35/eg/
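For readers who cannot reach that directory, here is a rough sketch of the handler-based style HTML::Parser supports. It is not the hanchors script itself, only an illustration in the same spirit; the function name extract_links and the choice to return [href, text] pairs are assumptions made for this example.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: collect [href, link text] pairs from an HTML string.
sub extract_links {
    my ($html) = @_;
    my ( @links, $current );
    my $p = HTML::Parser->new(
        api_version => 3,
        # On <a href="...">, start collecting text for a new link.
        start_h => [ sub {
            my ( $tag, $attr ) = @_;
            $current = [ $attr->{href}, '' ] if $tag eq 'a' && defined $attr->{href};
        }, 'tagname, attr' ],
        # Append decoded text while we are inside a link.
        text_h => [ sub { $current->[1] .= $_[0] if $current }, 'dtext' ],
        # On </a>, store the finished link.
        end_h => [ sub {
            if ( $_[0] eq 'a' && $current ) { push @links, $current; undef $current }
        }, 'tagname' ],
    );
    $p->parse($html);
    $p->eof;
    return @links;
}

for my $link ( extract_links('<p><a href="/a">One</a> and <a href="/b">Two <i>2</i></a></p>') ) {
    print "$link->[1] => $link->[0]\n";
}
```

Nested markup inside a link (the <i> above) is dropped while its text is kept, because only text events are appended.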


--
A. Sinan Unur
(E-Mail Removed) (reverse each component for email address)
 
 
dominix
 
      01-11-2004
Felix Smith wrote:
> How would you go about removing all html tags from a Web page's source
> code, except for links ? I've been successfully using the function
> below to get rid of *all* html tags. But I need to keep links. Any
> code you can post to help will be much appreciated.
>
> Felix.
>
> function I've been using:
>
> sub html_to_ascii {
> use HTML::TreeBuilder;
> use HTML::FormatText;
> $document = $_[0];
> $html = HTML::TreeBuilder->new();
> $html->parse($document);
> $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
> $return = $formatter->format($html);
> return $return;
> }



use strict;
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( shift );

while ( my $token = $p->get_token ) {
print $token->as_is if $token->is_text;
print $token->return_attr->{"href"} if $token->is_start_tag( 'a' )
}
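
Note that the loop above prints only the URL from each href attribute. If the goal is to keep the link markup itself, a variation along these lines might do. It is a sketch, not tested against real-world pages; the name strip_all_but_links is made up for the example, and the string is passed by reference so that it is parsed as HTML rather than opened as a file name.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: drop every tag except <a ...> and </a>.
sub strip_all_but_links {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new( \$html );   # scalar ref = parse this string
    my $out = '';
    while ( my $token = $p->get_token ) {
        # Keep plain text plus the opening and closing anchor tags, verbatim.
        $out .= $token->as_is
            if $token->is_text
            or $token->is_start_tag('a')
            or $token->is_end_tag('a');
    }
    return $out;
}

print strip_all_but_links('<p>See <a href="http://example.com/">this <b>page</b></a>.</p>'), "\n";
```

Comments and declarations are not text tokens, so they are dropped along with the other tags.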

--
dominix


 
 
Felix
 
      01-11-2004
Thanks so much for helping with this. Can you tell me how to change
the code below so I can use it via a function called, say,
remove_tags, like this:

$stripped_content = remove_tags ($content_with_tags);

Thank you very much again!

"dominix" <dominix@(E-Mail Removed)> wrote in message news:<4001015a$0$7143
>
> use strict;
> use HTML::TokeParser::Simple;
> my $p = HTML::TokeParser::Simple->new( shift );
>
> while ( my $token = $p->get_token ) {
> print $token->as_is if $token->is_text;
> print $token->return_attr->{"href"} if $token->is_start_tag( 'a' )
> }

 
 
dominix
 
      01-11-2004
Felix wrote:
> Thanks so much for helping with this. Can you tell me how to change
> the code below so I can use it via a function called, say,
> remove_tags, like this:
>
> $stripped_content = remove_tags ($content_with_tags);
>
> Thank you very much again!
>
> "dominix" <dominix@(E-Mail Removed)> wrote in message
> news:<4001015a$0$7143
>>
>> use strict;
>> use HTML::TokeParser::Simple;
>> my $p = HTML::TokeParser::Simple->new( shift );
>>
>> while ( my $token = $p->get_token ) {
>> print $token->as_is if $token->is_text;
>> print $token->return_attr->{"href"} if $token->is_start_tag(
>> 'a' ) }


well, try something like (untested)

use strict;
use HTML::TokeParser::Simple;

sub whatever_you_want_the_name {
    my $html = shift;
    # Pass a reference: a plain string would be treated as a file name.
    my $p = HTML::TokeParser::Simple->new( \$html );
    my $result = '';
    while ( my $token = $p->get_token ) {
        # Keep the text, and the URL of every <a href="...">.
        $result .= $token->as_is if $token->is_text;
        $result .= $token->return_attr->{"href"} if $token->is_start_tag( 'a' );
    }
    return $result;
}


 
 
Robin
 
      01-13-2004


"Felix Smith" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) om...
> How would you go about removing all html tags from a Web page's source
> code, except for links ? I've been successfully using the function
> below to get rid of *all* html tags. But I need to keep links. Any
> code you can post to help will be much appreciated.


instead use tr// or s//

> Felix.
>
> function I've been using:
>
> sub html_to_ascii {
> use HTML::TreeBuilder;
> use HTML::FormatText;
> $document = $_[0];
> $html = HTML::TreeBuilder->new();
> $html->parse($document);
> $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
> $return = $formatter->format($html);
> return $return;
> }


that's a little slower than what I mentioned earlier...


--
Regards,
Robin
--
(E-Mail Removed)
--



 
 
Uri Guttman
 
      01-13-2004
>>>>> "R" == Robin <(E-Mail Removed)> writes:

R> "Felix Smith" <(E-Mail Removed)> wrote in message
R> news:(E-Mail Removed) om...
>> How would you go about removing all html tags from a Web page's source
>> code, except for links ? I've been successfully using the function
>> below to get rid of *all* html tags. But I need to keep links. Any
>> code you can post to help will be much appreciated.


R> instead use tr// or s//

ok, explain how you can remove any html with tr///?

and then explain how you can accurately remove html with s///? did you
read the FAQ on this? NOT!
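
To make the point concrete, here is a small sketch of the usual naive substitution and an input it gets wrong. The function name naive_strip and the sample strings are invented for illustration; the regex is the common "delete everything from < to the next >" attempt, not code anyone posted in this thread.

```perl
use strict;
use warnings;

# Naive tag stripper: delete everything from '<' up to the next '>'.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

# Works on trivially simple markup ...
print naive_strip('<p>Hello <b>world</b></p>'), "\n";

# ... but a '>' inside an attribute value or a comment ends the match too
# early, so fragments of the tag and the comment leak into the output.
print naive_strip('<img src="x.png" alt="a > b"> <!-- if (a > b) --> done'), "\n";
```

A real parser tracks quoting and comment state, which is why the module-based answers in this thread are the safer route.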

>> sub html_to_ascii {
>> use HTML::TreeBuilder;
>> use HTML::FormatText;
>> $document = $_[0];
>> $html = HTML::TreeBuilder->new();
>> $html->parse($document);
>> $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
>> $return = $formatter->format($html);
>> return $return;
>> }


R> that's a little slower than what I mentioned earlier...

and a whole lot more accurate. which is better: wrong and fast, or slow
and accurate? remember, your entire programming career depends on
your answer. think hard. then rethink what you answered above.

uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
 
 
Jürgen Exner
 
      01-13-2004
Robin wrote:
> "Felix Smith" <(E-Mail Removed)> wrote in message
> news:(E-Mail Removed) om...
>> How would you go about removing all html tags from a Web page's
>> source code, except for links ? I've been successfully using the
>> function below to get rid of *all* html tags. But I need to keep
>> links. Any code you can post to help will be much appreciated.

>
> instead use tr// or s//


How come it doesn't surprise me that such idiotic advice is coming from
you?

No, s// is absolutely not the right tool to parse/deal with HTML.

And suggesting tr// is just plain ridiculous. Please show me the code to
remove all HTML tags from a text but links using tr and I will send you a
$100 gift certificate for Barnes & Noble, so that you can buy yourself
some nice Perl books.

jue


 
 
Uri Guttman
 
      01-13-2004
>>>>> "JE" == Jürgen Exner <(E-Mail Removed)> writes:

JE> Robin wrote:
>>
>> instead use tr// or s//


JE> How come it doesn't surprise me that such idiotic advice is coming from
JE> you?

JE> And suggesting tr// is just plain ridiculous. Please show me the code to
JE> remove all HTML tags from a text but links using tr and I will send you a
JE> $100 gift certificate for Barnes & Noble, so that you can buy yourself
JE> some nice Perl books.

i will donate to that one. not a great risk

maybe like this:

<very rough pseudo code>

my $i = 0 ;
while ( $i < length $html ) {
    my $char = substr( $html, $i++, 1 ) ;

    if ( $char =~ tr/<>// ) {

        # $DEITY knows what code
    }
    else {

        # $DEITY knows what state
    }
}

ain't tr useful!



uri

--
Uri Guttman ------ (E-Mail Removed) -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
 
 
Tassilo v. Parseval
 
      01-13-2004
Also sprach Uri Guttman:

>>>>>> "R" == Robin <(E-Mail Removed)> writes:

>
> R> "Felix Smith" <(E-Mail Removed)> wrote in message
> R> news:(E-Mail Removed) om...
> >> How would you go about removing all html tags from a Web page's source
> >> code, except for links ? I've been successfully using the function
> >> below to get rid of *all* html tags. But I need to keep links. Any
> >> code you can post to help will be much appreciated.

>
> R> instead use tr// or s//
>
> ok, explain how you can remove any html with tr///?


With a state-machine of course. Tss, Uri, don't you know anything?
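
For anyone wondering what such a state machine might look like, here is a toy sketch. It is not a recommendation and certainly not a real HTML parser: it ignores comments, CDATA and script content, and every name in it is invented for the example. Note that tr only classifies a single character here; all the actual work is done by the surrounding state.

```perl
use strict;
use warnings;

# Toy state machine: copy text, drop tags, but keep <a ...> and </a>.
sub strip_with_state_machine {
    my ($html) = @_;
    my ( $out, $tag, $in_tag, $quote ) = ( '', '', 0, '' );
    for my $char ( split //, $html ) {
        if ( !$in_tag ) {
            if ( $char eq '<' ) { $in_tag = 1; $tag = '<' }   # a tag begins
            else                { $out .= $char }             # ordinary text
            next;
        }
        $tag .= $char;
        if ($quote) {                     # inside a quoted attribute value
            $quote = '' if $char eq $quote;
        }
        elsif ( $char =~ tr/"'// ) {      # tr merely tests "is this a quote?"
            $quote = $char;
        }
        elsif ( $char eq '>' ) {          # tag complete: keep it only if it is a link
            $out .= $tag if $tag =~ m{^</?a[\s>]}i;
            $in_tag = 0;
        }
    }
    return $out;
}

print strip_with_state_machine('<p>Go <a href="x?a>b">here</a> <img alt="1 > 0"> now</p>'), "\n";
```

Even this small version needs a quote state to survive a '>' inside an attribute, which rather supports the argument for using a parser module.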

Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus}) !JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexi ixesixeseg;y~\n~~dddd;eval
 
 
 
 