Velocity Reviews


Sam Kong 05-09-2005 07:02 PM

How to extract texts from html source?
 
Hi, all!

Quite often, when I need to read a list of web pages, I download the
HTML sources and save them in a single file such as a.html.
If they are mostly text, I open the file in a web browser, select all,
copy it into an editor and save it.
I want to shorten this process.
How can I extract the text from HTML source?
I'm sure there are many parsers for this.
Which is the most convenient one?

Thanks.
Sam


James Britt 05-09-2005 07:22 PM

Re: How to extract texts from html source?
 
Sam Kong wrote:
> Hi, all!
>
> Quite often, when I need to read a list of web pages, I download the
> html sources and save them in a single file like a.html.
> If they are mostly texts, I open the html using web browser, select all
> and copy it to an editor and save it.
> I want to make the process shorter.
> How can I extract the text from html source?
> I'm sure there're many parsers for it.
> What is the most convenient one?



Take a look at Michael Neumann's WWW::Mechanize

http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
http://rubyforge.org/frs/?group_id=427&release_id=2014

Or install the gem


James




--

http://www.ruby-doc.org
http://www.rubyxml.com
http://catapult.rubyforge.com
http://orbjson.rubyforge.com
http://ooo4r.rubyforge.com
http://www.jamesbritt.com



Brian Schröder 05-09-2005 07:37 PM

Re: How to extract texts from html source?
 
On 09/05/05, James Britt <james_b@neurogami.com> wrote:
> Sam Kong wrote:
> > Hi, all!
> >
> > Quite often, when I need to read a list of web pages, I download the
> > html sources and save them in a single file like a.html.
> > If they are mostly texts, I open the html using web browser, select all
> > and copy it to an editor and save it.
> > I want to make the process shorter.
> > How can I extract the text from html source?
> > I'm sure there're many parsers for it.
> > What is the most convenient one?

>
> Take a look at Michael Neumann's WWW::Mechanize
>
> http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
> http://rubyforge.org/frs/?group_id=427&release_id=2014
>
> Or install the gem
>
> James
>


You don't need ruby for this:

$ apt-cache show w3m
Package: w3m
[snip]
Description: WWW browsable pager with excellent tables/frames support
w3m is a text-based World Wide Web browser with IPv6 support.
It features excellent support for tables and frames. It can be used
as a standalone file pager, too.

Sam Kong 05-09-2005 07:49 PM

Re: How to extract texts from html source?
 

James Britt wrote:
> Sam Kong wrote:
> > Hi, all!
> >
> > Quite often, when I need to read a list of web pages, I download the
> > html sources and save them in a single file like a.html.
> > If they are mostly texts, I open the html using web browser, select all
> > and copy it to an editor and save it.
> > I want to make the process shorter.
> > How can I extract the text from html source?
> > I'm sure there're many parsers for it.
> > What is the most convenient one?

>
>
> Take a look at Michael Neumann's WWW::Mechanize
>
> http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
> http://rubyforge.org/frs/?group_id=427&release_id=2014
>
> Or install the gem


Thanks, James.
That looks cool.
However, it doesn't seem to have a function for extracting text from HTML.
(Or did I miss it?)
What I want is...

<table><tr><td>TEST</td></tr></table> => TEST

Is there a module that does this?

Regards,
Sam




Sam Kong 05-09-2005 08:00 PM

Re: How to extract texts from html source?
 

Brian Schröder wrote:
> On 09/05/05, James Britt <james_b@neurogami.com> wrote:
> > Sam Kong wrote:
> > > Hi, all!
> > >
> > > Quite often, when I need to read a list of web pages, I download the
> > > html sources and save them in a single file like a.html.
> > > If they are mostly texts, I open the html using web browser, select all
> > > and copy it to an editor and save it.
> > > I want to make the process shorter.
> > > How can I extract the text from html source?
> > > I'm sure there're many parsers for it.
> > > What is the most convenient one?

> >
> > Take a look at Michael Neumann's WWW::Mechanize
> >
> > http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
> > http://rubyforge.org/frs/?group_id=427&release_id=2014
> >
> > Or install the gem
> >
> > James
> >

>
> You don't need ruby for this:
>
> $ apt-cache show w3m
> Package: w3m
> [snip]
> Description: WWW browsable pager with excellent tables/frames support
> w3m is a text-based World Wide Web browser with IPv6 support.
> It features excellent support for tables and frames. It can be used
> as a standalone file pager, too.
> .
> * You can follow links and/or view images in HTML.
> * Internet message preview mode, you can browse HTML mail.
> * You can follow links in plain text if it includes URL forms.
> * With w3m-img, you can view image inline.
> .
> For more information,
> see http://sourceforge.net/projects/w3m
>
> $ w3m -dump http://ruby.brian-schroeder.de/quiz/mazes/ | head
> A ruby a day!


Oh, thanks.
I just realized that even lynx can do that.

Regards,
Sam

>
> Ruby Quiz Solutions (Amazing Mazes)
>
> Amazing Mazes
>
> For a full description see: (Amazing Mazes on Ruby Quiz Homepage)[http://www.rubyquiz.com/quiz31.html]
>
> Another graph algorithm. Create a maze that is fully connected and has only one
> $
>
> regards,
>
> Brian
>
> --
> http://ruby.brian-schroeder.de/
>
> multilingual _non rails_ ruby based vocabulary trainer:
> http://www.vocabulaire.org/ | http://www.gloser.org/ | http://www.vokabeln.net/


Tom Reilly 05-10-2005 02:07 AM

Re: How to extract texts from html source?
 
Several years ago, a member of this group gave me this routine, which
does a pretty good job of extracting the text from an HTML page.

#--------------------------------------------------------------------
# Strip HTML Tags from Line
#--------------------------------------------------------------------

def striphtml(line)
  line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
end



James Britt 05-10-2005 02:49 AM

Re: How to extract texts from html source?
 
Sam Kong wrote:
> Thanks, James.
> That looks cool.
> However, it doesn't seem to have a function to extract texts from html.
> (Or did I miss it?)


No, it is a library for the (fairly) easy creation of HTML munging code.

Some coding is required, but it allows complete control (so you get just
the text of interest).


James
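
To illustrate what "just the text of interest" could look like in code, here is a minimal sketch. It does not use WWW::Mechanize or its API; it is a plain-regex stand-in, and the method name text_inside is made up for this example. It assumes the element you want is not nested inside another element of the same name.

```ruby
# Return the text content of every <tag>...</tag> element in html.
# Plain-regex stand-in for illustration; this is NOT the WWW::Mechanize API.
# Assumption: no element of this name is nested inside another one.
def text_inside(html, tag)
  name = Regexp.escape(tag)
  # /m lets "." span newlines, /i ignores tag case.
  pattern = /<#{name}\b[^>]*>(.*?)<\/#{name}\s*>/mi
  html.scan(pattern).map do |(inner)|
    # Remove any markup inside the element, then tidy the whitespace.
    inner.gsub(/<[^>]*>/, '').gsub(/\s+/, ' ').strip
  end
end
```

For Sam's example, text_inside('<table><tr><td>TEST</td></tr></table>', 'td') picks out the single cell. For anything less regular than that, a real parser is the safer choice.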



daz 05-10-2005 11:52 AM

Re: How to extract texts from html source?
 

Sam Kong wrote:
>
> [...] If they are mostly texts, I open the html using
> web browser, select all and copy it to an editor and save it.
>


Save As ... [text file].txt

- Removes all tags.
(Verified with Opera, Firefox & IE6, so I guess most browsers do this)
( e.g. test page: http://www.qurl.net/ )


daz



Sam Kong 05-10-2005 03:55 PM

Re: How to extract texts from html source?
 
Yes, that's right...:)
I just want to do it all with my ruby program...hehe
Thanks anyway.

Sam
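
If the goal is to stay inside a Ruby program but still let a text browser do the rendering, one option is to call w3m or lynx from Ruby. This is only a sketch: the method names dump_command and render_text are invented here, and it assumes the chosen tool is installed and on the PATH.

```ruby
# Build the argument list for a text-mode browser that renders a local
# HTML file as plain text. The tool itself must be installed separately.
def dump_command(path, tool = 'w3m')
  case tool
  when 'w3m'  then ['w3m', '-dump', '-T', 'text/html', path]
  when 'lynx' then ['lynx', '-dump', '-nolist', path]
  else raise ArgumentError, "unsupported tool: #{tool}"
  end
end

# Run the browser and return its rendering of the page as a String.
# Raises Errno::ENOENT if the tool is not installed.
def render_text(path, tool = 'w3m')
  IO.popen(dump_command(path, tool), &:read)
end
```

With a saved page such as a.html, render_text('a.html') would return the same text that w3m -dump prints, tables included. Passing the command as an array avoids going through a shell, so odd characters in the file name need no quoting.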


Sam Kong 05-10-2005 03:59 PM

Re: How to extract texts from html source?
 

Tom Reilly wrote:
> Several years ago, one of the members of the group offered me this
> routine which does a pretty good job of
> extracting the text from a html page.
>
> #--------------------------------------------------------------------
> # Strip HTML Tags from Line
> #--------------------------------------------------------------------
>
> def striphtml(line)
>   line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
> end


Thank you for sharing the code.
However, this code only works on a single line, right?
When I tested it on a page of HTML by looping over it line by line, the
result was not what I expected.
I probably need a DOM parser...:-(

Sam
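
One likely reason the line-by-line loop misbehaves is that a tag (or a script or style block) can start on one line and end on another, so no single line contains the whole thing. Short of a DOM parser, a sketch that works on the whole document as one string might look like this. The method name html_to_text and the small entity table are assumptions made for the example, and it is still a regex approach, so unusual markup (for instance a ">" inside an attribute value) will confuse it.

```ruby
# Strip markup from a whole HTML document held in one string.
# A regex-based sketch, not a real HTML parser.
def html_to_text(html)
  text = html.dup
  # Drop script and style elements together with their contents.
  # /m lets "." cross newlines, /i ignores tag case.
  text.gsub!(/<(script|style)\b.*?<\/\1\s*>/mi, ' ')
  # Drop comments, which may also span several lines.
  text.gsub!(/<!--.*?-->/m, ' ')
  # Drop every remaining tag; [^>] already matches newlines.
  text.gsub!(/<[^>]*>/, ' ')
  # Decode a handful of common entities (illustrative, far from complete).
  entities = { '&nbsp;' => ' ', '&lt;' => '<', '&gt;' => '>',
               '&quot;' => '"', '&amp;' => '&' }
  text.gsub!(/&(?:nbsp|lt|gt|quot|amp);/) { |e| entities[e] }
  # Collapse the whitespace left behind by the removed markup.
  text.gsub(/\s+/, ' ').strip
end
```

Reading the file in one go, as in html_to_text(File.read('a.html')), avoids the line-by-line problem. Each tag is replaced by a space rather than by nothing, so neighbouring table cells do not run together; the price is a stray space where an inline tag sits in the middle of a word.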


