character encoding question

 
 
Amishera Amishera
03-26-2010
I have an HTML file encoded in UTF-8. The file contains the
following text:

It&#39;s a wonderful life

Character code 39 is the apostrophe in UTF-8. Suppose I pull
the 39 out of the text using:

s="It&#39;s a wonderful life"

s.gsub(/&#(\d+);/, '\1')

The output is

It39s a wonderful life

First, I am having trouble turning that into

It\39s a wonderful life

Second, I wrote this by hand in test_utf8.rb:

puts "It\39s a wonderful life"

and ran it

ruby test_utf8.rb > utf8.txt

but when I open the file in OpenOffice with the encoding set to UTF-8,
the output is

It#9s a wonderful life

So how do I correctly parse HTML character references, convert them
to UTF-8 encoded characters, and then save the file?

Thanks.
--
Posted via http://www.ruby-forum.com/.

 
David Springer
03-26-2010

> s="It&#39;s a wonderful life"


I stumbled across this:
-----------------------

require 'cgi'
s=CGI.unescapeHTML("It&#39;s a wonderful life")


-----------------------
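To cover the "then save the file" part of the question, here is a rough sketch that decodes the references and writes the result as UTF-8. The helper name decode_to_file is made up for illustration, and I am assuming Ruby 1.9 or later so that the "w:UTF-8" file mode is available.

```ruby
require 'cgi'

# Decode HTML character references in html_text and write the result
# to path as UTF-8. Returns the decoded string.
# (decode_to_file is a hypothetical helper name, not part of any library.)
def decode_to_file(html_text, path)
  decoded = CGI.unescapeHTML(html_text)
  # "w:UTF-8" sets the external encoding of the output file.
  File.open(path, "w:UTF-8") { |f| f.puts decoded }
  decoded
end
```

Calling decode_to_file("It&#39;s a wonderful life", "utf8.txt") should leave the line with a real apostrophe in utf8.txt. Redirecting puts output with > as in the original post should work as well, as long as the string is decoded first.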
David

 
David Springer
03-26-2010
try something like this:
-------------------------------------
require 'cgi'
s="UPPERCASE Russian Alphabet\n".encode('utf-8')
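s+=CGI.unescapeHTML("&#1040;&#1041;&#1042;&#1043;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1044;&#1045;&#1046;&#1047;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1048;&#1049;&#1050;&#1051;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1052;&#1053;&#1054;&#1055;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1056;&#1057;&#1058;&#1059;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1060;&#1061;&#1062;&#1063;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1064;&#1065;&#1066;&#1067;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1068;&#1069;&#1070;&#1071;".encode('utf-8'))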
puts s
-------------------------------------
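If you would rather stay with the gsub approach from the original question, something like the following might work. A replacement string such as '\1' can only paste the matched digits back in, so the sketch uses the block form of gsub to turn the number into a character. The helper name decode_numeric_refs is made up, and it only handles numeric references, not named entities like &amp;.

```ruby
# Replace decimal (&#39;) and hexadecimal (&#x27;) character references
# with the characters they name. Named entities are left untouched.
# (decode_numeric_refs is a hypothetical helper name.)
def decode_numeric_refs(text)
  text.gsub(/&#(?:[xX]([0-9a-fA-F]+)|(\d+));/) do
    code_point = $1 ? $1.hex : $2.to_i
    [code_point].pack('U') # 'U' packs a code point as a UTF-8 character
  end
end

puts decode_numeric_refs("It&#39;s a wonderful life")

# "\39" in a double-quoted string is the octal escape \3 (byte 0x03)
# followed by the digit 9, not character 39.
p "It\39s".bytes # => [73, 116, 3, 57, 115]
```

That octal escape is the likely reason the file in the question showed a stray control character followed by a 9 instead of an apostrophe.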
David