encoding problem with tr() and hash keys

 
 
Do One
      02-21-2009
Please help me understand and solve this problem (Ruby 1.9.1):

In a UTF-8 environment I do:

irb(main):121:0> h = {"a" => 1, "\u0101" => 2}
=> {"a"=>1, "ā"=>2}
irb(main):122:0> h.key? "a".tr("z", "\u0101")
=> false <--- wrong!
irb(main):123:0> h.key? "\u0101".tr("z", "\u0101")
=> true

So after I run tr() on a UTF-8 string that contains no extended
characters, where the second character set does contain extended
characters, the new string is not found in the hash.

Both strings are the same when dumped with Marshal:

irb(main):124:0> Marshal.dump "a".tr("\u0101", "\u0101")
=> "\x04\bI\"\x06a\x06:\rencoding\"\nUTF-8"
irb(main):126:0> Marshal.dump "a"
=> "\x04\bI\"\x06a\x06:\rencoding\"\nUTF-8"


The question is: how should I write code using tr() so that the new
string is found in the hash?

I also think this is a bug in Ruby, because it is completely unexpected
behavior.
--
Posted via http://www.ruby-forum.com/.

 
7stud --
      02-21-2009
Do One wrote:
> Please help to understand solution to this problem (ruby 1.9.1):
>
> In utf-8 environment I do:
>
> irb(main):121:0> h = {"a" => 1, "\u0101" => 2}
> => {"a"=>1, "ā"=>2}
> irb(main):122:0> h.key? "a".tr("z", "\u0101")
> => false <--- wrong!
>



h = {"a" => 1, "b" => 2}

p "a".tr("z", "\u0101") #"a"

puts h.key?("a".tr("z", "x")) #true

ruby 1.8.2

7stud --
      02-21-2009
7stud -- wrote:
> h = {"a" => 1, "b" => 2}
>
> p "a".tr("z", "\u0101") #"a"
>
> puts h.key?("a".tr("z", "x")) #true
>
> ruby 1.8.2


Whoops. Make that:


h = {"a" => 1, "\u0101" => 2}

p "a".tr("z", "\u0101") #=>"a"

puts h.key?("a".tr("z", "\u0101")) #=>true

Do One
      02-22-2009
The problem described occurs under the current Ruby 1.9.1 in a UTF-8 environment.

7stud -- wrote:
>> ruby 1.8.2

>
> Whoops. Make that:


ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-linux]
irb(main):001:0> h = {"a" => 1, "\u0101" => 2}
=> {"a"=>1, "u0101"=>2}

See? It doesn't even understand the Unicode escape sequence \uXXXX.


Do One wrote:
> Please help to understand solution to this problem (ruby 1.9.1):
>
> In utf-8 environment I do:
>
> irb(main):121:0> h = {"a" => 1, "\u0101" => 2}
> => {"a"=>1, "ā"=>2}
> irb(main):122:0> h.key? "a".tr("z", "\u0101")
> => false <--- wrong!
> irb(main):123:0> h.key? "\u0101".tr("z", "\u0101")
> => true
>
> So after I change utf-8 string without extended chars in it with tr(),
> where second character set is having extended chars, new string is not
> found in hash.
>
> Boths string are same in Marshal encoding:
>
> irb(main):124:0> Marshal.dump "a".tr("\u0101", "\u0101")
> => "\x04\bI\"\x06a\x06:\rencoding\"\nUTF-8"
> irb(main):126:0> Marshal.dump "a"
> => "\x04\bI\"\x06a\x06:\rencoding\"\nUTF-8"
>
>
> Question is how I should code using tr() that new string will be found
> in hash?
>
> And I think this is bug in ruby, because it is completely not expected
> behavior.


Brian Candler
      02-23-2009
Do One wrote:
> Please help to understand solution to this problem (ruby 1.9.1):
>
> In utf-8 environment I do:
>
> irb(main):121:0> h = {"a" => 1, "\u0101" => 2}
> => {"a"=>1, "ā"=>2}
> irb(main):122:0> h.key? "a".tr("z", "\u0101")
> => false <--- wrong!
> irb(main):123:0> h.key? "\u0101".tr("z", "\u0101")
> => true


Perhaps describe your environment in more detail? It works for me:

$ irb19
irb(main):001:0> h = {"a" => 1, "\u0101" => 2}
=> {"a"=>1, "ā"=>2}
irb(main):002:0> h.key?("a")
=> true
irb(main):003:0> h.key?("\u0101")
=> true
irb(main):004:0> h.key?("a".tr("z", "\u0101"))
=> true
irb(main):005:0> h.key? "a".tr("z", "\u0101")
=> true
irb(main):006:0> h.key? "z".tr("z", "\u0101")
=> true
irb(main):007:0>

This is Ubuntu Hardy, ruby 1.9.1 (2008-12-01 revision 2043
[i686-linux] compiled from source. I think this is 1.9.1-preview2 rather
than 1.9.1-p0.

To eliminate problems with encoding, maybe try writing this as a script
and running it from the command line:

p h = {"a" => 1, "\u0101" => 2}
p h.key?("a")
p h.key?("\u0101")
p h.key?("a".tr("z", "\u0101"))
p h.key? "a".tr("z", "\u0101")
p h.key? "z".tr("z", "\u0101")

ruby19 test.rb
ruby19 -Ku test.rb
ruby19 --encoding UTF-8:UTF-8 test.rb

to see if this makes any difference. On my machine at least, the -K and
--encoding flags are not recognised by irb.
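
A minimal sketch of a script that reports the encoding-related settings in one go, which may make it easier to compare environments. The helper name `encoding_report` is hypothetical; the constants and methods it calls are standard in Ruby 1.9 and later.

```ruby
# Hypothetical helper: gather the settings that influence how string
# literals are encoded in the running interpreter.
def encoding_report
  {
    ruby_version:     RUBY_VERSION,
    source_encoding:  __ENCODING__,               # from the magic comment or -K
    default_external: Encoding.default_external,  # from the locale or --encoding
    default_internal: Encoding.default_internal,  # nil unless set explicitly
    locale_charmap:   Encoding.locale_charmap,    # what the locale resolves to
    literal_encoding: "a".encoding                # encoding of a plain literal
  }
end

encoding_report.each { |name, value| puts "#{name}: #{value.inspect}" }
```

Running it once per invocation style (plain, -Ku, --encoding) should show which settings actually change.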
 
Brian Candler
      02-23-2009
Just built 1.9.1p0, and there's no difference here (all 'true'
responses)
 
Tom Link
      02-23-2009
> Just built 1.9.1p0, and there's no difference here (all 'true'
> responses)


AFAIK it was fixed here:
http://groups.google.com/group/ruby-...4ea46905c1c6af

 
 
Do One
      02-24-2009
Yes, it was fixed yesterday with two consecutive patches. The first one
did not fix it completely, but before I had worked out how to reproduce
the bug it was fixed a second time (ruby 1.9.2 svn trunk).

> Perhaps describe your environment in more detail? It works for me:


How to reproduce the bug (to understand its traps):

1. UTF-8 environment:

$ ruby -v
ruby 1.9.1p0 (2009-01-30 revision 21907) [i686-linux]
$ export LC_CTYPE=en_US.utf-8
$ irb
irb(main):001:0> {"a" => 1}.key? "a".tr("z", "\u0101")
=> false

Reproduced. Without a UTF-8 environment you just don't see it:

$ export LC_CTYPE=en_US
$ irb
irb(main):001:0> {"a" => 1}.key? "a".tr("z", "\u0101")
=> true

2. Even if your environment is not UTF-8, the bug is still there when
your script has the "encoding: utf-8" magic comment:

$ cat a.rb
#encoding: utf-8
p ({"a" => 1}).key?("a".tr("z", "\u0101"))
$ ruby a.rb
false

3. Or when you are using the -KU switch:

$ ruby -KU -e 'p ({"a" => 1}).key?("a".tr("z", "\u0101"))'
false


I got stuck on this while parsing word lists in which some words have
diacritical marks. Some words were processed differently than others
even though the code was correct, and it was just plain crazy.
 
Brian Candler
      02-24-2009
Do One wrote:
> Yes it was fixed yesterday with two consecutive patches, first one was
> not fixing it completely, but before I found how to reproduce a bug it
> is got fixed second time. (ruby 1.9.2 svn trunk)


It looks like this craziness is core behaviour for ruby 1.9,
unfortunately.

Notice that in your script which reproduces the problem, the encodings
of the two strings match. Results shown are for ruby 1.9.1p0 (2009-01-30
revision 21907) [i686-linux]

#encoding: utf-8
a = "a"
b = a.tr("z", "\u0101")
h = {a => 1}
p h.key?(a) #true
p h.key?(b) #false !!

p a #"a"
p b #"a"
p a.encoding #<Encoding:UTF-8>
p b.encoding #<Encoding:UTF-8>

p a == b #true
p a.hash #137519702
p b.hash #137519703 AHA!

So two strings, with identical byte sequences and identical encodings,
calculate different hashes. So there must be some hidden internal state
in the string which affects the calculation of the hash. I presume this
is the flag ENC_CODERANGE_7BIT.

It's hard to test whether this flag has been set correctly, if
String#encoding doesn't show it, so you have to use indirect methods
like String#hash.
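
A rough sketch of such an indirect check, gathering every publicly observable property of the two strings and listing the ones that differ. The helper name `string_probe` is hypothetical, and including String#ascii_only? is only a guess that it might expose the same cached state; I have not confirmed that on an affected build.

```ruby
# encoding: utf-8
# Hypothetical helper: collect what can be observed about a string from Ruby code.
def string_probe(str)
  {
    bytes:      str.bytes.to_a,
    encoding:   str.encoding,
    ascii_only: str.ascii_only?,     # assumption: may reflect the cached flag
    valid:      str.valid_encoding?,
    hash:       str.hash
  }
end

a = "a"
b = a.tr("z", "\u0101")

probe_a = string_probe(a)
probe_b = string_probe(b)

# Names of the properties on which the two strings disagree.
differing = probe_a.keys.reject { |key| probe_a[key] == probe_b[key] }
p differing   # an empty list means the strings are indistinguishable
```

On an interpreter without the bug the list should be empty; on the affected builds above I would expect at least :hash to show up.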

Now that I think I understand the problem, it's easy to find more
examples of the same brokenness. Here's one:

#encoding: utf-8
a = "a"
b = "aß"
b = b.delete("ß")
h = {a => 1}
p h.key?(a) #true
p h.key?(b) #false !!

p a #"a"
p b #"a"
p a.encoding #<Encoding:UTF-8>
p b.encoding #<Encoding:UTF-8>

p a == b #true
p a.hash #-590825394
p b.hash #-590825393


I wonder just how many other string methods are broken in this way? And
how many extension writers are going to set this hidden flag correctly
in their strings, if even the ruby core developers don't always do it?

It looks like this flag is a bad optimisation.

* It needs recalculating every time a string is modified (thus negating
the benefits of the optimisation)

* It introduces hidden state, which affects behaviour but cannot be
directly tested

* If the state is not set correctly *every* time a string is generated
or modified - and this includes in all extension modules - then things
break.
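
One way to look for further cases is to check the invariant directly: strings that compare equal with == should also hash equally. The sketch below does that for a handful of String methods. The list in `DERIVATIONS` and the helper name `hash_mismatches` are my own illustrative choices, not an exhaustive audit.

```ruby
# encoding: utf-8
# Illustrative, non-exhaustive list: each lambda derives the ASCII-only
# string "a" from a UTF-8 string through a different String method.
DERIVATIONS = {
  "tr"     => lambda { "a".tr("z", "\u0101") },
  "delete" => lambda { "a\u00DF".delete("\u00DF") },
  "sub"    => lambda { "a\u00DF".sub("\u00DF", "") },
  "gsub"   => lambda { "a\u00DF".gsub("\u00DF", "") },
  "chop"   => lambda { "a\u00DF".chop },
  "slice"  => lambda { "a\u00DF"[0, 1] }
}

# Returns the names of derivations whose result is == to the reference
# string but hashes differently, i.e. would be missed by Hash#key?.
def hash_mismatches(reference, derivations)
  derivations.select do |_name, build|
    derived = build.call
    derived == reference && derived.hash != reference.hash
  end.keys
end

p hash_mismatches("a", DERIVATIONS)   # an empty list means no mismatch was detected
```

Extending the table with more methods is cheap, so it could double as a regression check.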

Regards,

Brian.
 
Do One
      02-24-2009
Brian Candler wrote:
> But now I think I understand the problem, it's easy to find more
> examples of the same brokenness. Here's one:
>
> #encoding: utf-8
> a = "a"
> b = "aß"
> b = b.delete("ß")
> h = {a => 1}
> p h.key?(a) #true
> p h.key?(b) #false !!


This is still false even in the "fixed" 1.9.2dev. You should probably
report it.
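
Until the remaining cases are fixed, a possible stopgap is to look keys up by String#== instead of by hash value, since the output earlier in this thread shows that a == b is still true. This is only a sketch: the helper name `fetch_by_equality` is hypothetical, and the scan is linear in the size of the hash, so it is not suitable for large tables.

```ruby
# encoding: utf-8
# Hypothetical helper: find a value by comparing keys with ==,
# bypassing the hash value that the bug makes unreliable.
def fetch_by_equality(hash, key)
  pair = hash.find { |candidate, _value| candidate == key }
  pair && pair[1]   # nil when no key compares equal
end

h = {"a" => 1, "\u0101" => 2}
p fetch_by_equality(h, "a".tr("z", "\u0101"))
```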


> I wonder just how many other string methods are broken in this way? And
> how many extension writers are going to set this hidden flag correctly
> in their strings, if even the ruby core developers don't always do it?


Scary.