Encoding problems .. ruby 1.9.2

 
 
Bhay Zone
      09-26-2010
I am pretty new to ruby and am trying to read text data coming from a
backend which can only be queried using proprietary Command Line
Interface commands.

The problem is that this text data contains non-ASCII characters. I
don't know what these characters are, nor do I know the encoding.

Earlier, when we were using ruby 1.8.7 we had some code that handled
these characters pretty well. Now after switching to ruby 1.9.2, the
same code breaks with encoding errors like "invalid multibyte sequence"
in gsub.

Here is the code we were using to replace the non-ASCII characters, which
is breaking now. It breaks at the first line.

content.gsub!( "\221", '')
content.gsub!( "\222", '')
content.gsub!( "\223", '')
content.gsub!( "\224", '')
content.gsub!( "\246", '')
content.gsub!( "\247", '')
content.gsub!( "\237", '')
content.gsub!( "\377", '')
content.gsub!( "\226", '')
content.gsub!( "\227", '')
content.gsub!( "\\000", "?")
content.gsub!( "\\001", "?")
content.gsub!( "\FB01", "")
content.gsub!(/[\x80-\xFF]/,'')
content.gsub!(/[\x00-\x08]/,'')
content.gsub!(/[\x0B-\x0C]/,'')
content.gsub!(/[\x0E-\x1F]/,'')

I just cannot figure how to fix this problem and any help would be
greatly appreciated.
--
Posted via http://www.ruby-forum.com/.

 
 
Caleb Clausen
 
      09-26-2010

In 1.9, every string (and regular expression) has an encoding attached
to it. If there are any byte sequences in your string that don't match
the encoding, it causes errors. 1.8 was much more permissive about its
strings, allowing arbitrary binary data in any string, which is why it
worked better for you. You can get back the 1.8 behavior under 1.9 by
setting the encoding of your string objects to 'binary'.

My first suggestion would be to set the encoding of the string in the
variable content to binary before doing any of the gsub!s:
content.force_encoding('binary')
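
A minimal sketch of that first suggestion. The sample bytes are made up for illustration, and the `/n` flag is used so that the regexp itself is binary as well, whatever the source file's encoding is.

```ruby
# Hypothetical sample standing in for the tool's output: two high bytes and
# one control byte around "Hi", tagged as UTF-8 (which makes it invalid).
content = "\x91Hi\x92\x01".dup.force_encoding("UTF-8")
puts content.valid_encoding?        # false

# Re-tag the same bytes as binary; nothing is converted or removed here.
content.force_encoding("binary")

# /n makes the regexp binary too, so both sides of gsub! agree.
content.gsub!(/[\x80-\xFF]/n, "")
content.gsub!(/[\x00-\x08]/n, "")

puts content                        # Hi
```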

However, a better way would be to set the encoding of the IO object
the strings are read from. That way you don't need to force_encoding
each string as it comes in.
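
As a rough sketch of setting the encoding on the IO object itself: the child Ruby process below is only a stand-in for the real external tool (an assumption for this example), and `"r:binary"` is the mode string that tags everything read from the pipe as binary.

```ruby
require "rbconfig"

# Stand-in for the real external tool (an assumption for this sketch):
# a child Ruby process that writes two non-ASCII bytes around "ok".
child = 'STDOUT.binmode; STDOUT.write([0x91, 0x6F, 0x6B, 0x92].pack("C*"))'
cmd = [RbConfig.ruby, "-e", child]

# "r:binary" sets the external encoding of the pipe, so every string read
# from it is already tagged as binary (ASCII-8BIT).
output = IO.popen(cmd, "r:binary") { |io| io.read }

puts output.encoding
puts output.bytes.inspect
```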

Even better is to figure out what encoding this external tool is
using and set the IO's encoding to that. Then perhaps a lot of this
hacky string mangling could go away.
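
For illustration only, here is what that could look like if the tool turned out to emit Windows-1252. That is purely a guess, based on the bytes 0x91 to 0x97 being curly quotes and dashes in that code page; the sample bytes are invented.

```ruby
# Assumption: the data is Windows-1252. This is a guess, not something the
# thread confirms. The bytes below are an invented sample.
raw = [0x91, 0x68, 0x69, 0x92, 0x20, 0x6F, 0x6B].pack("C*")

# Tag the bytes with their (assumed) real encoding, then transcode to UTF-8.
# The replace options keep undefined bytes from raising an exception.
text = raw.force_encoding("Windows-1252")
          .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")

puts text
puts text.valid_encoding?
```

With the right encoding declared, the curly quotes survive as real characters instead of being stripped.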

But this is still only half the story. You also have to consider the
encoding of the strings and regexps which get passed as the first
argument to gsub. Those string (and regexp) literals default to the
same encoding as the source file they're contained in. If no explicit
encoding is declared for a specific source file, ruby guesses an
encoding based on your environment (using the LOCALE env var and some
others that I can't remember right now). Often, this means ruby
assumes your sources are utf-8 encoded.

You can declare a specific encoding explicitly by putting something
like this as the very first line in your source:
#encoding: binary
(or the second line if the first line is a shebang line).

I used the binary encoding in the example line above because that's
probably the one which will work best for you under the circumstances.
Declaring the source encoding to be binary is a bit hackish, but
probably the easiest way to get you where you want to go. If you
figure out what encoding your data is in, you're probably better off
declaring the source encoding to be the same thing, but there may be
more work involved there.

PS: there is some redundancy in the sequence of gsub!s you posted. The
first 10 (for "\221" thru "\227") are special cases of the 14th (for
/[\x80-\xFF]/) and can safely be deleted. Also, "\FB01" is the same
thing as "FB01" in both ruby 1.8 and 1.9 and probably not what you
wanted. (Maybe "\xFB\x01" is what you actually meant?)
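
A quick check of what those literals actually contain, followed by one way the whole sequence might be collapsed. `strip_unwanted_bytes` is a hypothetical helper name, and the single character class is just the posted ranges merged together.

```ruby
# An unknown escape such as \F is simply the letter itself.
puts("\FB01" == "FB01")          # true

# "\\000" is a backslash followed by "000", not a NUL byte.
puts "\\000".bytes.inspect       # [92, 48, 48, 48]
puts "\000".bytes.inspect        # [0]
puts "\xFB\x01".bytes.inspect    # [251, 1]

# Hypothetical helper: the posted byte ranges merged into one binary regexp.
def strip_unwanted_bytes(content)
  content.dup.force_encoding("binary")
         .gsub(/[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\xFF]/n, "")
end

puts strip_unwanted_bytes("a\x91b\x01c\n").inspect   # "abc\n"
```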

HTH

 
 
Brian Candler
 
      09-27-2010
Bhay Zone wrote:
> I am pretty new to ruby and am trying to read text data coming from a
> backend which can only be queried using proprietary Command Line
> Interface commands.
>
> The problem is that this text data contains non-ascii characters...I
> don't know what these characters are .. and nor do I know the encoding.


How are you talking to this backend: a TCP socket? IO.popen?
Backticks? Something else? If you show the code which opens the
connection, we can show how to fix it.

TCP sockets default to "ASCII-8BIT" encoding, but for other methods,
unless you tell ruby what encoding to use, it will guess based on
environment variables on your PC. That is, the same program may work
fine on one PC but fail on another.
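
To see what ruby has guessed on a given machine, something along these lines can help. It is only a diagnostic sketch, and the sample bytes are invented.

```ruby
# The encoding ruby assumes for external data when nothing else is specified.
# This depends on the environment, so it can differ from one PC to another.
puts "default external: #{Encoding.default_external}"

# Invented sample: a lone 0x91 byte followed by "A", tagged as UTF-8.
suspect = [0x91, 0x41].pack("C*").force_encoding("UTF-8")

puts suspect.encoding
puts suspect.valid_encoding?   # false: 0x91 alone is not valid UTF-8
```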

To avoid these problems, there are magic incantations you can add to
force ruby not to guess. e.g.

IO.popen: add "b" to the mode string

Backticks or %x: res = `foo`; res.force_encoding("ASCII-8BIT")

Or try running ruby with -Kn flag.

> I just cannot figure how to fix this problem and any help would be
> greatly appreciated.


It's probably possible to fix your code, as above. However, sticking
with ruby 1.8.7 is also a reasonable solution if you don't want to have
to deal with this sort of nonsense.

I had a go at reverse-engineering the string encoding behaviour of ruby
1.9. I gave up after documenting about 200 behaviours:
http://github.com/candlerb/string19/...er/string19.rb

I'm sticking with 1.8, because 1.9 makes my brain hurt.
 
Bhay Zone
 
      09-27-2010
Caleb, Brian - Thank you for your replies.

The source of this data is a bug tracking tool known as GNATS. Now this
tool also comes with a client which provides a command line util known
as query-pr to query GNATS. The output of query-pr is delimited text. If
you run query-pr from the linux shell, it prints the output on the
screen.

I invoke query-pr from my ruby program as follows (note the opening and
closing backtick (`) characters):

result=`query-pr --expr 'Status="closed"'`
# parse the result and take appropriate action.

I am not very sure, but my guess is that the GNATS client uses TCP
sockets to interface with the GNATS DB.

Thanks for pointing out the redundancy, I'll fix that in my code.

Right now I have "# coding: utf-8" as the first line in the ruby file. I
found that while trying to figure out this problem and hoped it would
make magic ... but well ...

I'll also try out the "# coding: binary" to see if that works for my
case.

I'm not sure if going back to ruby 1.8.7 is an option .. will keep that
as a last option.
 
Brian Candler
 
      09-27-2010
Bhay Zone wrote:
> I invoke query-pr from my ruby program as follows (note the opening and
> closing backtick (`) characters):
>
> result=`query-pr --expr 'Status="closed"'`
> # parse the result and take appropriate action.


That's backticks. Follow that line with:

result.force_encoding("ASCII-8BIT")

when running with ruby 1.9, before you start doing your substitutions.
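
Put together, the sequence might look like the sketch below. The child Ruby command is only a stand-in for query-pr (an assumption, so the example can run anywhere), and `/n` keeps the regexp binary to match the string.

```ruby
require "rbconfig"

# Stand-in for `query-pr ...` (an assumption for this sketch): a child Ruby
# process that prints "PR" wrapped in two non-ASCII bytes.
result = `"#{RbConfig.ruby}" -e "STDOUT.binmode; STDOUT.write([0x91, 0x50, 0x52, 0x92].pack('C*'))"`

# Re-tag the captured output as binary before any substitution runs.
result.force_encoding("ASCII-8BIT")

# The substitutions can now run without encoding errors.
result.gsub!(/[\x80-\xFF]/n, "")

puts result   # PR
```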

> I am not very sure, but my guess is that the GNATS client uses TCP
> sockets to interface with the GNATS DB.


Maybe, but that's irrelevant here. Ruby is reading the output of
query-pr, as a string, and has decided to give it some arbitrary guessed
encoding.

> Right now I have "# coding: utf-8" as the first line in the ruby file. I
> found that while trying to figure out this problem and hoped it would
> make magic ... but well ...
>
> I'll also try out the "# coding: binary" to see if that works for my
> case.


It won't. It will only affect the coding of quoted string literals
within your code.
 
Bhay Zone
 
      09-27-2010
After result.force_encoding("ASCII-8BIT"), are the gsubs necessary?
 
Brian Candler
 
      09-28-2010
Bhay Zone wrote:
> After result.force_encoding("ASCII-8BIT"), are the gsubs necessary?


Why do you do them in the ruby 1.8.7 version? If they served a purpose
there, then presumably they still serve a purpose.

All the force_encoding business is doing is preventing these lines from
crashing ruby 1.9. The bytes in the string from query-pr will still be
the same.
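
A small sketch of that point, using invented sample bytes: force_encoding only changes the label on the string, so the unwanted bytes are still there until a gsub removes them.

```ruby
# Invented sample: "A" between two non-ASCII bytes.
raw = [0x91, 0x41, 0x92].pack("C*")

as_utf8   = raw.dup.force_encoding("UTF-8")
as_binary = raw.dup.force_encoding("ASCII-8BIT")

# Same bytes either way; only the encoding label differs.
puts(as_utf8.bytes == as_binary.bytes)   # true
puts as_utf8.valid_encoding?             # false
puts as_binary.valid_encoding?           # true

# The cleanup is still what actually removes the bytes.
cleaned = as_binary.gsub(/[\x80-\xFF]/n, "")
puts cleaned                             # A
```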