need script: convert html-text to text

 
 
keal
      01-04-2006
I have HTML text. I need to convert it to plain text without the
HTML tags.

--
Posted via http://www.ruby-forum.com/.


 
Gene Tani
      01-04-2006

keal wrote:
> I have HTML text. I need to convert it to plain text without the
> HTML tags.
>
> --
> Posted via http://www.ruby-forum.com/.


Path of least resistance:

lynx -dump www.myurl

or use links2, or

w3m -dump www.myurl

Or the high-falutin solution:
http://groups.google.com/group/comp....cd5e35a1ffb8d7

 
Ross Bamford
      01-04-2006
On Wed, 04 Jan 2006 10:30:03 -0000, keal wrote:

> I have HTML text. I need to convert it to plain text without the
> HTML tags.
>


It's tricky; there's more to it than you'd think. The best way is probably
to let Lynx, or another text-mode browser, do it for you, e.g.:

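# Requires the lynx binary on your PATH.
def plain(url)
  # Pass the arguments as an array so the URL never goes through a shell.
  IO.popen(['lynx', '-dump', url], &:read)
end

text = plain('http://www.google.com/')
puts text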

Outputs:

[1]Personalised Home | [2]Sign in

[3]A picture of the Braille letters spelling out "Google." Happy Birthday
Louis Braille!

Web [4]Images [5]Groups [6]News [7]Froogle [8]more »

> ... [snip] ...


Of course, you'll need lynx installed for that to work, but other
text-mode browsers will do the job too. Try a Google search.
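
If you can't count on lynx being installed, one option is to probe PATH and fall back to another dumper. What follows is only a rough sketch: the two command lines are the `lynx -dump` and `w3m -dump` invocations quoted in this thread, while `DUMPERS`, `on_path?` and `pick_dumper` are hypothetical names made up for illustration.

```ruby
# Candidate text-mode browsers, in order of preference. Both invocations
# are the ones quoted in this thread; nothing else is assumed about them.
DUMPERS = [
  %w[lynx -dump],
  %w[w3m -dump]
].freeze

# Hypothetical helper: true if an executable file called +name+ is on PATH.
def on_path?(name)
  ENV.fetch('PATH', '').split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Hypothetical helper: first command whose binary is available, or nil.
# The availability check is injectable so it can be tried without any
# browser installed.
def pick_dumper(available = method(:on_path?))
  DUMPERS.find { |cmd| available.call(cmd.first) }
end

def plain(url)
  cmd = pick_dumper or raise 'no text-mode browser (lynx or w3m) found on PATH'
  # Array form: the URL is passed as a single argument, with no shell involved.
  IO.popen(cmd + [url], &:read)
end
```

Calling `plain('http://www.google.com/')` then behaves like the version above when lynx is present, and uses w3m otherwise. The output formatting will differ between browsers, so don't rely on the two producing identical text.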

Cheers,

--
Ross Bamford
 
Robert Klemme
      01-04-2006
keal wrote:
> I have HTML text. I need to convert it to plain text without the
> HTML tags.


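This is a very low-cost variant. I guess the lynx approach is much more
effective and complete, since this one works line by line (a tag split
across two lines slips through) and leaves entities such as &amp; untouched:

ruby -pe '$_.gsub!(%r{</?.*?>}, "")' index.html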
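
If you'd rather stay in pure Ruby than shell out, the one-liner's idea can be stretched a little by working on the whole document at once. This is a crude sketch under stated assumptions, not a real HTML parser: `strip_tags` and `ENTITIES` are hypothetical names, only a handful of common entities are decoded, and malformed markup (a stray `>` inside an attribute value, say) will confuse it.

```ruby
# Small, deliberately incomplete entity table (an assumption for this sketch).
ENTITIES = {
  '&lt;' => '<', '&gt;' => '>', '&quot;' => '"',
  '&#39;' => "'", '&nbsp;' => ' ', '&amp;' => '&'
}.freeze

# Hypothetical helper: regex-based HTML-to-text conversion.
def strip_tags(html)
  text = html.gsub(/<!--.*?-->/m, '')                       # drop comments
  text = text.gsub(%r{<(script|style)\b.*?</\1\s*>}mi, '')  # drop script/style bodies
  text = text.gsub(/<[^>]*>/, '')                           # drop remaining tags, multi-line ones included
  # Decode entities in a single pass so "&amp;lt;" becomes "&lt;", not "<".
  text = text.gsub(/&(?:lt|gt|quot|amp|nbsp|#39);/, ENTITIES)
  # Collapse runs of spaces/tabs and squeeze long runs of blank lines.
  text.gsub(/[ \t]+/, ' ').gsub(/\n{3,}/, "\n\n").strip
end

html = "<html><head><style>p { color: red }</style></head>\n" \
       "<body><p class=\"x\"\n>Fish &amp; chips</p><!-- note --></body></html>"
puts strip_tags(html)   # prints: Fish & chips
```

For a file, `puts strip_tags(File.read('index.html'))` would do. Unlike lynx, this keeps no layout and no link list, so it is only a fit when you want the raw words.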

Kind regards

robert

 