How to extract texts from html source?

 
 
Sam Kong
      05-09-2005
Hi, all!

Quite often, when I need to read a list of web pages, I download the
html sources and save them in a single file like a.html.
If they are mostly texts, I open the html using web browser, select all
and copy it to an editor and save it.
I want to make the process shorter.
How can I extract the text from html source?
I'm sure there're many parsers for it.
What is the most convenient one?

Thanks.
Sam

 
James Britt
      05-09-2005
Sam Kong wrote:
> Hi, all!
>
> Quite often, when I need to read a list of web pages, I download the
> html sources and save them in a single file like a.html.
> If they are mostly texts, I open the html using web browser, select all
> and copy it to an editor and save it.
> I want to make the process shorter.
> How can I extract the text from html source?
> I'm sure there're many parsers for it.
> What is the most convenient one?



Take a look at Michael Neumann's WWW::Mechanize

http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
http://rubyforge.org/frs/?group_id=427&release_id=2014

Or install the gem


James




--

http://www.ruby-doc.org
http://www.rubyxml.com
http://catapult.rubyforge.com
http://orbjson.rubyforge.com
http://ooo4r.rubyforge.com
http://www.jamesbritt.com


 
Brian Schröder
      05-09-2005
On 09/05/05, James Britt wrote:
> Take a look at Michael Neumann's WWW::Mechanize
>
> http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
> http://rubyforge.org/frs/?group_id=427&release_id=2014
>
> Or install the gem


You don't need ruby for this:

$ apt-cache show w3m
Package: w3m
[snip]
Description: WWW browsable pager with excellent tables/frames support
w3m is a text-based World Wide Web browser with IPv6 support.
It features excellent support for tables and frames. It can be used
as a standalone file pager, too.
 
Sam Kong
      05-09-2005

James Britt wrote:
> Take a look at Michael Neumann's WWW::Mechanize
>
> http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
> http://rubyforge.org/frs/?group_id=427&release_id=2014
>
> Or install the gem


Thanks, James.
That looks cool.
However, it doesn't seem to have a function to extract text from HTML.
(Or did I miss it?)
What I want is...

<table><tr><td>TEST</td></tr></table> => TEST

Is there a module that does this?

Regards,
Sam



 
Sam Kong
      05-09-2005

Brian Schröder wrote:
> You don't need ruby for this:
>
> $ apt-cache show w3m
> Package: w3m
> [snip]
> Description: WWW browsable pager with excellent tables/frames support
> w3m is a text-based World Wide Web browser with IPv6 support.
> It features excellent support for tables and frames. It can be used
> as a standalone file pager, too.
> .
> * You can follow links and/or view images in HTML.
> * Internet message preview mode, you can browse HTML mail.
> * You can follow links in plain text if it includes URL forms.
> * With w3m-img, you can view image inline.
> .
> For more information,
> see http://sourceforge.net/projects/w3m
>
> $ w3m -dump http://ruby.brian-schroeder.de/quiz/mazes/ | head
> A ruby a day!
>
> Ruby Quiz Solutions (Amazing Mazes)
>
> Amazing Mazes
>
> For a full description see: (Amazing Mazes on Ruby Quiz Homepage)[http://
> www.rubyquiz.com/quiz31.html]
>
> Another graph algorithm. Create a maze that is fully connected and has only one
> $
>
> regards,
>
> Brian
>
> --
> http://ruby.brian-schroeder.de/
>
> multilingual _non rails_ ruby based vocabulary trainer:
> http://www.vocabulaire.org/ | http://www.gloser.org/ | http://www.vokabeln.net/

Oh, thanks.
I just realized that even lynx can do that.

Regards,
Sam

 
Tom Reilly
      05-10-2005
Several years ago, one of the members of the group offered me this
routine, which does a pretty good job of extracting the text from an
HTML page.

#--------------------------------------------------------------------
# Strip HTML Tags from Line
#--------------------------------------------------------------------

def striphtml(line)
  line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
end


 
James Britt
      05-10-2005
Sam Kong wrote:
> Thank James.
> That looks cool.
> However, it doesn't seem to have a function to extract texts from html.
> (Or did I miss it?)


No, it is a library for the (fairly) easy creation of HTML munging code.

Some coding is required, but it allows complete control (so you get just
the text of interest).


James
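
To illustrate the "some coding is required, but you get just the text of interest" idea, here is a minimal sketch that keeps only the text found inside one kind of element. It does not use WWW::Mechanize; it is a hand-rolled tokenizer built on Ruby's standard StringScanner, and the method name text_inside is made up for this example. It assumes reasonably tidy markup and does not try to handle CDATA, self-closing tags, or comments that contain ">".

```ruby
require 'strscan'

# Return the text found inside each <tag_name> element of +html+,
# one array entry per outermost element. Hypothetical helper, not a
# Mechanize API.
def text_inside(html, tag_name)
  wanted  = tag_name.downcase
  scanner = StringScanner.new(html)
  depth   = 0            # number of <tag_name> elements currently open
  buffer  = String.new   # text collected for the element being read
  results = []

  until scanner.eos?
    if scanner.scan(%r{<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>})
      closing = scanner[1] == '/'
      next unless scanner[2].downcase == wanted  # other tags are dropped

      if !closing
        depth += 1
      elsif depth > 0
        depth -= 1
        if depth.zero?
          results << buffer.gsub(/\s+/, ' ').strip
          buffer = String.new
        end
      end
    elsif scanner.scan(/<[^>]*>/)
      # Comment, doctype or similar markup: skip it.
    elsif (text = scanner.scan(/[^<]+/))
      buffer << text if depth > 0
    else
      # A stray "<" with no closing ">": consume it so the loop advances.
      char = scanner.getch
      buffer << char if depth > 0
    end
  end

  results
end

# text_inside('<table><tr><td>TEST</td></tr></table>', 'td')  # => ["TEST"]
```

Text outside the chosen element is ignored, which is the kind of control a plain "strip every tag" pass cannot give.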


 
daz
      05-10-2005

Sam Kong wrote:
>
> [...] If they are mostly texts, I open the html using
> web browser, select all and copy it to an editor and save it.
>


Save As ... [text file].txt

- Removes all tags.
(Verified with Opera, Firefox & IE6, so I guess most browsers do this)
( e.g. test page: http://www.qurl.net/ )


daz


 
Sam Kong
      05-10-2005
Yes, that's right...
I just want to do it all with my ruby program...hehe
Thanks anyway.

Sam
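
One way to stay inside a Ruby program while still reusing the w3m approach from earlier in the thread is to run w3m as a child process. This is a minimal sketch, assuming w3m is installed and on the PATH; the constant W3M_COMMAND and the method name html_to_text are made up for this example. The -dump option prints the rendered page as plain text, and -T text/html tells w3m to treat its standard input as HTML.

```ruby
require 'open3'

# Command used to render HTML from standard input as plain text.
W3M_COMMAND = %w[w3m -dump -T text/html].freeze

# Feed +html+ to w3m and return the rendered text.
# Raises Errno::ENOENT if w3m is not installed.
def html_to_text(html)
  output, status = Open3.capture2(*W3M_COMMAND, stdin_data: html)
  raise "w3m exited with status #{status.exitstatus}" unless status.success?

  output
end

# Example call (a.html is the saved file from the original question):
#   puts html_to_text(File.read('a.html'))
```

The rendering is delegated to the browser, so tables and entities are handled the same way as in the w3m -dump output quoted above.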

 
Sam Kong
      05-10-2005

Tom Reilly wrote:
> Several years ago, one of the members of the group offered me this
> routine which does a pretty good job of
> extracting the text from a html page.
>
> #--------------------------------------------------------------------
> # Strip HTML Tags from Line
> #--------------------------------------------------------------------
>
> def striphtml(line)
> line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
> end


Thank you for sharing the code.
However, this code works only for a simple line, right?
When I tested it with a page of html by looping line by line, the
result was not what I expected.
Probably, I need to get a DOM parser...

Sam
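
One likely cause of the unexpected result: when the page is processed line by line, a tag (or a script block) that starts on one line and ends on another never matches within a single line. Below is a minimal sketch that treats the whole document as one string instead. It is still regex-based and only an approximation of what a real parser does; the names strip_html and ENTITIES are made up for this example, and only a handful of entities are decoded.

```ruby
# Entities decoded by this sketch; anything else is left untouched.
# &nbsp; is mapped to an ordinary space for simplicity.
ENTITIES = {
  '&amp;' => '&', '&lt;' => '<', '&gt;' => '>',
  '&quot;' => '"', '&nbsp;' => ' '
}.freeze

# Strip tags from a whole HTML document held in one string.
def strip_html(html)
  # Remove comments, then script and style elements with their content.
  text = html.gsub(/<!--.*?-->/m, ' ')
  text = text.gsub(%r{<(script|style)\b[^>]*>.*?</\1\s*>}mi, ' ')
  # Remove the remaining tags. [^>] also matches newlines, so a tag
  # split across lines is removed in one piece. Each tag becomes a
  # space so that adjacent table cells do not run together.
  text = text.gsub(/<[^>]*>/, ' ')
  # Decode the known entities in a single pass, then tidy whitespace.
  text = text.gsub(/&(?:amp|lt|gt|quot|nbsp);/) { |entity| ENTITIES[entity] }
  text.gsub(/\s+/, ' ').strip
end

# strip_html('<table><tr><td>TEST</td></tr></table>')  # => "TEST"
# For a saved page: puts strip_html(File.read('a.html'))
```

Replacing each tag with a space keeps cells apart but also splits words that contain inline markup, so a proper parser remains the more robust choice for anything beyond mostly-text pages.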

 