[ENCODING] UTF8 hell

 
 
Xavier Noëlle
Guest
Posts: n/a
 
      02-02-2010
Hello,
I'm trying to deal with Ruby flaws with encoding, which I thought
would be almost past with Ruby 1.9. I managed to find a solution for
Ruby 1.8 and thought I did for Ruby 1.9...but in fact, no !

I fetch rows from an UTF8 database and try to work with the string. To
do so, I would like it to be UTF8 encoded.

"str.encoding()" gives me "ASCII-8BIT"...so, I thought one of these
lines would solve the problem
str.replace(Iconv.iconv("UTF8", "ascii", self).join())
OR
self.encode!('UTF-8')

But they don't !
First one: in `iconv': "\xE8te pour luth" (Iconv::IllegalSequence)
Second one: in `encode!': "\xE8" from ASCII-8BIT to UTF-8
(Encoding::UndefinedConversionError)

The base string is "Oeuvre complète pour luth" and displays well in
PHPMyAdmin.

Any idea ?
TIA,

--
Xavier NOELLE

 
Stefano Crocco
Guest
Posts: n/a
 
      02-02-2010
On Tuesday 02 February 2010, Xavier Noëlle wrote:
> |Hello,
> |I'm trying to deal with Ruby flaws with encoding, which I thought
> |would be almost past with Ruby 1.9. I managed to find a solution for
> |Ruby 1.8 and thought I did for Ruby 1.9...but in fact, no !
> |
> |I fetch rows from an UTF8 database and try to work with the string. To
> |do so, I would like it to be UTF8 encoded.
> |
> |"str.encoding()" gives me "ASCII-8BIT"...so, I thought one of these
> |lines would solve the problem
> |str.replace(Iconv.iconv("UTF8", "ascii", self).join())
> |OR
> |self.encode!('UTF-8')
> |
> |But they don't !
> |First one: in `iconv': "\xE8te pour luth" (Iconv::IllegalSequence)
> |Second one: in `encode!': "\xE8" from ASCII-8BIT to UTF-8
> |(Encoding::UndefinedConversionError)
> |
> |The base string is "Oeuvre complète pour luth" and displays well in
> |PHPMyAdmin.
> |
> |Any idea ?
> |TIA,


I'm not sure, but based on my experience, it may be that the strings are
indeed stored as UTF-8, but the library you use to read from the database
doesn't inform Ruby of the fact, so Ruby assumes each one is a
generic array of bytes (which means Ruby thinks the string has encoding
ASCII-8BIT, which is the same as BINARY).

If this is the case, you don't need to transcode the string (which is what
encode does), but simply tell Ruby which encoding is the correct one, using
the force_encoding method.
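
To make the difference concrete, here is a minimal sketch. The helper name tag_as_utf8 and the sample bytes are assumptions made for illustration, not anything from the original poster's code.

```ruby
# Hypothetical helper: relabel bytes that are already UTF-8. Nothing is converted.
def tag_as_utf8(bytes)
  bytes.dup.force_encoding('UTF-8')
end

# Sample value (an assumption): UTF-8 bytes tagged as ASCII-8BIT, as a
# database driver might return them.
raw = "Oeuvre compl\xC3\xA8te pour luth".dup.force_encoding('ASCII-8BIT')

fixed = tag_as_utf8(raw)
puts fixed.encoding                      # the label changed...
puts fixed.bytes.to_a == raw.bytes.to_a  # ...but the bytes did not

# encode, by contrast, tries to convert, and ASCII-8BIT has no mapping
# to UTF-8 for bytes above 127.
begin
  raw.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  puts "encode failed: #{e.message}"
end
```

Relabelling only helps when the bytes really are UTF-8; if they are not, later string operations can still fail.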

I hope this helps

Stefano

 
David Palm
Guest
Posts: n/a
 
      02-02-2010
> I fetch rows from an UTF8 database and try to work with the string. To
> do so, I would like it to be UTF8 encoded.


There are several pieces to this. Even if the DB encoding and collation is utf8, doublecheck that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app I think).

> self.encode!('UTF-8')


str.force_encoding('UTF-8') is what you want to use I think.



 
Xavier Noëlle
Guest
Posts: n/a
 
      02-02-2010
2010/2/2 David Palm <(E-Mail Removed)>:
> There are several pieces to this. Even if the DB encoding and collation is utf8, doublecheck that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app I think).


Not a Rails app

> str.force_encoding('UTF-8') is what you want to use I think.


I already tried this method, but it led me to the following error: in
`downcase!': invalid byte sequence in UTF-8 (ArgumentError).

This is due to a call to str.downcase!() later in the application.

Any idea to solve this ?

--
Xavier NOELLE

 
Robert Klemme
Guest
Posts: n/a
 
      02-02-2010
2010/2/2 Xavier Noëlle <(E-Mail Removed)>:
> 2010/2/2 David Palm <(E-Mail Removed)>:
>> There are several pieces to this. Even if the DB encoding and collation is utf8, doublecheck that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app I think).
>
> Not a Rails app
>
>> str.force_encoding('UTF-8') is what you want to use I think.

>
> I already tried this method, but it led me to the following error: in
> `downcase!': invalid byte sequence in UTF-8 (ArgumentError).
>
> This is due to a call to str.downcase!() later in the application.
>
> Any idea to solve this ?


You probably first want to find out whether the byte sequence is valid
UTF-8 or not. For that you would need to look at the bytes in the
String. I guess chances are that your String's byte sequence is NOT
valid UTF-8 OR you have a character in the string that has no
lowercase representation.
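
A minimal sketch of that check, assuming Ruby 1.9 or later. The helper name utf8_report is hypothetical; String#valid_encoding? and String#bytes are standard methods.

```ruby
# Hypothetical helper: say whether a string's bytes are well-formed UTF-8
# and list them in decimal so they can be compared with an encoding table.
def utf8_report(str)
  candidate = str.dup.force_encoding('UTF-8')
  { valid: candidate.valid_encoding?, bytes: str.bytes.to_a }
end

p utf8_report("m\xC3\xA9dicals")  # e acute stored as the two bytes 195 169
p utf8_report("m\xE9dicals")      # e acute stored as the single byte 233
```

A false value for :valid means the bytes are not UTF-8, whatever label the string carries.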

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

 
Xavier Noëlle
Guest
Posts: n/a
 
      02-23-2010
2010/2/2 Robert Klemme <(E-Mail Removed)>:
> You probably first want to find out whether the byte sequence is valid
> UTF-8 or not.  For that you would need to look at the bytes in the
> String.  I guess chances are that your String's byte sequence is NOT
> valid UTF-8 OR you have a character in the string that has no
> lowercase representation.
>
> Kind regards
>
> robert


I dug into the problem and ended up with this line: self.force_encoding('UTF-8')
Trusting the string's #encoding turned out to be the wrong choice, so
I assumed instead that the database provided valid UTF8 strings.

BUT (because, there's a but...), for some reason I don't understand,
some strings are unwilling to work:

Example:
puts self => médicals
self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115

233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
byte sequence in UTF-8 (ArgumentError).

Where am I wrong ?

TIA,

--
Xavier NOELLE

 
Marc Heiler
Guest
Posts: n/a
 
      02-23-2010
How does python solve this?
--
Posted via http://www.ruby-forum.com/.

 
Rick DeNatale
Guest
Posts: n/a
 
      02-23-2010
On Tue, Feb 23, 2010 at 9:41 AM, Yukihiro Matsumoto <(E-Mail Removed)> wrote:
> Hi,
>
> In message "Re: [ENCODING] UTF8 hell"
>    on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle <xavier.noelle@gmail.com> writes:
>
> |self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
> |
> |233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
> |self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
> |byte sequence in UTF-8 (ArgumentError).
>
> 233 is not a valid UTF-8 character.  The byte sequence for médicals is
> <109 195 169 100 105 99 97 108 115>.


233 for e acute (é) would be valid in the ISO-8859-1 encoding, not in UTF-8.
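
The two byte sequences can be reproduced with a short sketch. The helper name bytes_in is hypothetical; the é is written as a Unicode escape so the snippet does not depend on the encoding of the source file.

```ruby
# Hypothetical helper: the decimal bytes of a string in a given encoding.
def bytes_in(str, encoding)
  str.encode(encoding).bytes.to_a
end

word = "m\u00E9dicals"  # "médicals"

puts bytes_in(word, 'UTF-8').join(' ')       # 109 195 169 100 105 99 97 108 115
puts bytes_in(word, 'ISO-8859-1').join(' ')  # 109 233 100 105 99 97 108 115
```

The second line matches the bytes printed earlier in the thread, which suggests that row was stored as ISO-8859-1.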


--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/pers...-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale

 
Xavier Noëlle
Guest
Posts: n/a
 
      02-23-2010
2010/2/23 Yukihiro Matsumoto <(E-Mail Removed)>:
> 233 is not a valid UTF-8 character.  The byte sequence for médicals is
> <109 195 169 100 105 99 97 108 115>.


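Indeed. In the meantime, I changed the code to this:

# True if the bytes parse as UTF-8 (unpack('U*') raises otherwise).
def isUTF8()
  begin
    self.unpack('U*')
  rescue
    return false
  end
  return true
end

if isUTF8()
  # The bytes are already UTF-8: only relabel them.
  self.force_encoding('UTF-8')
else
  # Otherwise assume ISO-8859-1 and transcode to UTF-8.
  self.force_encoding('ISO-8859-1')
  self.encode!('UTF-8')
end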

This (ugly) quickfix works for what I need, but I don't know whether
the problem can be solved some other way. The problem is that
my SQL database has a VARBINARY column with an unknown encoding. Is
there a way to deal with the various possible encodings, or to ask MySQL
to return data converted to UTF8, or is it necessary to clean the data
before inserting it?
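
For comparison, the same fallback can be sketched with String#valid_encoding? instead of rescuing from unpack. The helper name to_utf8 is hypothetical, and the sketch keeps the same assumption as the quickfix: anything that is not well-formed UTF-8 is treated as ISO-8859-1.

```ruby
# Hypothetical helper: return a UTF-8 copy of bytes whose encoding is unknown.
# Assumption: bytes that are not well-formed UTF-8 are ISO-8859-1.
def to_utf8(bytes)
  candidate = bytes.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  # Relabel as ISO-8859-1, then transcode to UTF-8.
  bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

latin1 = "m\xE9dicals".dup.force_encoding('ASCII-8BIT')
utf8   = "m\xC3\xA9dicals".dup.force_encoding('ASCII-8BIT')

puts to_utf8(latin1) == to_utf8(utf8)  # both normalise to the same string
puts to_utf8(latin1).encoding
```

This is still a guess per string: ISO-8859-1 text whose bytes happen to form valid UTF-8 would be left untouched, and data in any third encoding would be converted wrongly.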

--
Xavier NOELLE

 
Perry Smith
Guest
Posts: n/a
 
      02-23-2010
> A general hint for debugging encoding troubles: the UTF-8 encoding
> *guarantees* that every Unicode codepoint is *either* encoded into a
> *single* octet with its most significant bit cleared to 0 (i.e. a
> decimal value between 0 and 127) *or* into a *sequence* of 2 to 6
> octets, *all* of which have their MSB set to 1 (i.e. a decimal value
> between 128 and 255).


Question: The sequence of 2 to 6 octets: is it always even? i.e. 2, 4,
or 6 but not 3 nor 5 octets?

--
Posted via http://www.ruby-forum.com/.
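
On the question of sequence lengths, a small sketch shows that odd lengths do occur. The helper name utf8_length is hypothetical. Note also that the current UTF-8 definition (RFC 3629) stops at 4 octets; the 5- and 6-octet forms belong to the older definition.

```ruby
# Hypothetical helper: number of bytes UTF-8 uses for a single character.
def utf8_length(char)
  char.encode('UTF-8').bytesize
end

puts utf8_length("e")          # 1 byte  (U+0065)
puts utf8_length("\u00E9")     # 2 bytes (U+00E9, e acute)
puts utf8_length("\u20AC")     # 3 bytes (U+20AC, euro sign)
puts utf8_length("\u{1D11E}")  # 4 bytes (U+1D11E, musical symbol G clef)
```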

 