Forcing a string to valid UTF-8

 
 
Phrogz
      04-26-2010
I have some legacy text data that's gone through several databases and
web services in its life, playing promiscuously with dirty web
servers, browsers, and encodings.

It's coming out of the source database as ASCII-8BIT. I'm trying to
bring it all into UTF-8. I've found ways to coerce many of the bad
entries into compliance, but now I've hit one that is simply bad. I
want to just delete the minimum necessary to make it valid UTF-8. What
I'm trying isn't working. Here's my code:

if new_value.is_a? String
  begin
    utf8 = new_value.force_encoding('UTF-8')
    if utf8.valid_encoding?
      new_value = utf8
    else
      new_value.encode!( 'UTF-8', 'Windows-1252' )
    end
  rescue EncodingError => e
    puts "Bad encoding: #{old_table}.#{pk}:#{old_row[pk]} - #{new_value.inspect}"
    new_value.encode!( 'UTF-8', invalid: :replace, undef: :replace, replace: '' )
    p new_value.encoding unless new_value.valid_encoding?
  end
end

When I fall into the rescue clause, I'm getting out:
Bad encoding: bugs.id:2469 - "Indexing C:\\\\コピ\xE3\x81E \x81E \x81EZCa_zu5.264"
#<Encoding:UTF-8>
The conversion resulted in an invalid UTF-8 string (that happens to be
the same as the original, as far as I can tell.) I'm surprised,
because I thought the purpose of invalid/undef replace was to clean
things up.

How do I force it into a valid UTF-8 encoding, losing as little data
as possible but happily throwing out the senseless bits?
 
Brian Candler
      04-27-2010
Gavin Kistner wrote:
> How do I force it into a valid UTF-8 encoding, losing as little data
> as possible but happily throwing out the senseless bits?


AFAICS, the trouble with your rescue clause is that the conversion from
Windows-1252 failed, so the string keeps the UTF-8 tag that force_encoding
gave it, and so an attempt to "re-encode" as UTF-8 is silently ignored
because it's already UTF-8, even though it contains invalid characters.

For example, this doesn't do anything:

>> a = "abc\xffdef".force_encoding("UTF-8")
=> "abc\xFFdef"
>> b = a.encode("UTF-8", :invalid=>:replace, :replace=>"?")
=> "abc\xFFdef"

but this does:

>> b = a.encode("UTF-16BE", :invalid=>:replace, :replace=>"?").encode("UTF-8")
=> "abc?def"

Proviso: ruby 1.9 string handling is undocumented and subject to
continuous change. I tested the above with

>> RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"

so it may or may not work with your version, or with future versions of
Ruby.
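
On later Ruby versions (2.1 and up), String#scrub does the same cleanup without the UTF-16 detour. A minimal sketch, reusing the sample string from the example above; the variable names are only for illustration and nothing here was tested on the 1.9 builds mentioned in this thread:

```ruby
# Assumes Ruby 2.1 or later, where String#scrub is available.
a = "abc\xFFdef".dup.force_encoding("UTF-8")

# The string is tagged UTF-8 but contains an invalid byte.
puts a.valid_encoding?          # false

# Round trip through UTF-16BE, as in the example above.
round_trip = a.encode("UTF-16BE", invalid: :replace, replace: "?").encode("UTF-8")

# scrub replaces invalid byte sequences without changing the encoding tag.
scrubbed = a.scrub("?")

puts round_trip                 # abc?def
puts scrubbed                   # abc?def
puts scrubbed.valid_encoding?   # true
```

Passing an empty string to scrub deletes the bad bytes instead of substituting a marker.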
--
Posted via http://www.ruby-forum.com/.

 
Phrogz
      04-27-2010
On Apr 27, 4:19 am, Brian Candler wrote:
> Gavin Kistner wrote:
> > How do I force it into a valid UTF-8 encoding, losing as little data
> > as possible but happily throwing out the senseless bits?
>
> AFAICS, the trouble with your rescue clause is that the conversion from
> Windows-1252 failed, so the string keeps the UTF-8 tag that force_encoding
> gave it, and so an attempt to "re-encode" as UTF-8 is silently ignored
> because it's already UTF-8, even though it contains invalid characters.


Excellent point. Fixing that led me to a similar error earlier: I had
assumed that
s2 = s1.force_encoding(...)
left s1 intact. In fact, it modifies and returns s1. Thank you very
much, Brian.
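
A small sketch of that behavior, in case it helps someone else. The sample bytes and variable names are made up for illustration, and String#b assumes Ruby 2.0 or later:

```ruby
raw = "caf\xC3\xA9".b                    # binary (ASCII-8BIT) copy of the bytes
same = raw.force_encoding("UTF-8")       # retags the receiver itself

puts same.equal?(raw)                    # true: same object comes back
puts raw.encoding                        # UTF-8

raw2 = "caf\xC3\xA9".b
copy = raw2.dup.force_encoding("UTF-8")  # retag a copy instead

puts raw2.encoding == Encoding::BINARY   # true: original tag untouched
puts copy.encoding                       # UTF-8
```

Calling dup first is what keeps the original around for a second conversion attempt.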

For those that care or stumble upon this via Google, here's a modified
version that works:

# Converting ASCII-8BIT to UTF-8 based on domain-specific guesses
if new_value.is_a? String
  begin
    # Try it as UTF-8 directly
    cleaned = new_value.dup.force_encoding('UTF-8')
    unless cleaned.valid_encoding?
      # Some of it might be old Windows code page
      cleaned = new_value.encode( 'UTF-8', 'Windows-1252' )
    end
    new_value = cleaned
  rescue EncodingError
    # Force it to UTF-8, replacing invalid or undefined bytes
    new_value.encode!( 'UTF-8', invalid: :replace, undef: :replace )
  end
end
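
For readers on Ruby 2.1 or later, the same three steps can be sketched with String#scrub as the last resort. This is only a sketch under that version assumption; the helper name to_valid_utf8 is hypothetical and does not come from the code above:

```ruby
# Hypothetical helper, not from the thread. Assumes Ruby 2.1+ for String#scrub.
def to_valid_utf8(raw)
  # Try the bytes as UTF-8 directly, on a copy so raw keeps its own tag.
  utf8 = raw.dup.force_encoding('UTF-8')
  return utf8 if utf8.valid_encoding?

  begin
    # Some of it might be an old Windows code page.
    raw.encode('UTF-8', 'Windows-1252')
  rescue EncodingError
    # Drop only the invalid byte sequences; valid characters survive.
    utf8.scrub('')
  end
end

puts to_valid_utf8("abc\xFFdef".b)     # abcÿdef (0xFF is ÿ in Windows-1252)
puts to_valid_utf8("コピ\xE3\x81E".b)  # コピE (0x81 is undefined in Windows-1252)
```

The fallback works on the UTF-8-tagged copy, so multibyte characters that were already valid are kept and only the broken sequences are removed.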

> Proviso: ruby 1.9 string handling is undocumented and subject to
> continuous change. I tested the above with


FWIW my new code works on ruby 1.9.1p243 (2009-07-16 revision 24175)
[i386-mingw32]

Thanks again!
 