Unicode in irb on windows (respectively script/console in instantrails)

 
 
michael.raidel@gmail.com
Guest
Posts: n/a
 
      11-07-2006
Hi everyone!

I have a problem with Unicode in irb on Windows. I noticed it when
trying to save an attribute of an ActiveRecord model containing an umlaut
(for example "ü") in script/console. If the database connection is
encoded in utf8, everything after the umlaut gets truncated; with the
default encoding I get garbled characters back. It doesn't matter whether
$KCODE is set to UTF8 or NONE, the character count stays the same
(also in plain irb)!
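
One way to narrow a problem like this down is to look at the raw bytes of the string that irb receives. The sketch below is not from the thread: it assumes Ruby 1.9 or later, where String#encode and String#bytes exist (Ruby 1.8 relied on $KCODE instead), and the helper names hex_bytes and umlaut_in are made up for illustration. It prints the bytes of "ü" in UTF-8, in the Windows ANSI code page 1252, and in the OEM code page 850.

```ruby
# Assumption: Ruby 1.9 or later (String#encode is not available in Ruby 1.8).
UMLAUT = "\u00FC" # "ü" as a UTF-8 string

# Format the bytes of a string as space-separated hex pairs.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

# Bytes of "ü" after transcoding to the named encoding.
def umlaut_in(encoding_name)
  hex_bytes(UMLAUT.encode(encoding_name))
end

%w[UTF-8 Windows-1252 IBM850].each do |name|
  puts "#{name}: #{umlaut_in(name)}"
end
# UTF-8: C3 BC
# Windows-1252: FC
# IBM850: 81
```

If a string typed at the prompt shows a single byte (FC or 81) rather than C3 BC, the console is not delivering UTF-8, which would be consistent with a UTF-8 database connection cutting the value off at that byte.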

Does anyone have a hint on how to solve this? Of course I could try
things such as Cygwin, but I am trying to find an elegant solution for
Windows users, which could eventually be merged into the next
InstantRails release, if Curt agrees.

Thanks a lot,

Michael

 
Austin Ziegler
Guest
Posts: n/a
 
      11-07-2006
On 11/7/06, michael.raidel@gmail.com wrote:
> I have a problem with Unicode in irb on Windows. I noticed it when
> trying to save an attribute of an ActiveRecord model containing an umlaut
> (for example "ü") in script/console. If the database connection is
> encoded in utf8, everything after the umlaut gets truncated; with the
> default encoding I get garbled characters back. It doesn't matter whether
> $KCODE is set to UTF8 or NONE, the character count stays the same
> (also in plain irb)!


The Windows console -- also used by Cygwin -- doesn't recognise UTF-8.
(That is, it's not possible to properly display UTF-8 in cmd.exe, at
least so far as I can tell.)

-austin
--
Austin Ziegler * http://www.halostatue.ca/
               * http://www.halostatue.ca/feed/

 
Chilkat Software
Guest
Posts: n/a
 
      11-07-2006

A DOS console displays characters according to the OEM code page. Here is
an example showing how to properly display a string with 8-bit chars
(e.g. characters with diacritics, or accent marks)...

# file: oemCodePage.rb

require 'chilkat'

# (The CkString class is freeware)
myStr = Chilkat::CkString.new()

# Note: this file must be saved in the ANSI code page (e.g. Windows-1252)
# so that each accented character below is a single byte.

# A DOS console does NOT display this correctly:
print "é ô à ç\n"

# What we need is the OEM (DOS) code page...
# OEM code pages are listed here:
# http://msdn.microsoft.com/library/de...-us/intl/unicode_81rn.asp
myStr.appendAnsi("é ô à ç\n")

# Emit the string in the character encoding of your choice:
# ibm850 is the OEM code page for Latin1
print myStr.getEnc("ibm850")

# Chilkat supports these:
# us-ascii
# unicode
# unicodefffe
# iso-8859-1
# iso-8859-2
# iso-8859-3
# iso-8859-4
# iso-8859-5
# iso-8859-6
# iso-8859-7
# iso-8859-8
# iso-8859-9
# iso-8859-13
# iso-8859-15
# windows-874
# windows-1250
# windows-1251
# windows-1252
# windows-1253
# windows-1254
# windows-1255
# windows-1256
# windows-1257
# windows-1258
# utf-7
# utf-8
# utf-32
# utf-32be
# shift_jis
# gb2312
# ks_c_5601-1987
# big5
# iso-2022-jp
# iso-2022-kr
# euc-jp
# euc-kr
# macintosh
# x-mac-japanese
# x-mac-chinesetrad
# x-mac-korean
# x-mac-arabic
# x-mac-hebrew
# x-mac-greek
# x-mac-cyrillic
# x-mac-chinesesimp
# x-mac-romanian
# x-mac-ukrainian
# x-mac-thai
# x-mac-ce
# x-mac-icelandic
# x-mac-turkish
# x-mac-croatian
# asmo-708
# dos-720
# dos-862
# ibm037
# ibm437
# ibm500
# ibm737
# ibm775
# ibm850
# ibm852
# ibm855
# ibm857
# ibm00858
# ibm860
# ibm861
# ibm863
# ibm864
# ibm865
# cp866
# ibm869
# ibm870
# cp875
# koi8-r
# koi8-u
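
For comparison, roughly the same ANSI-to-OEM conversion can be sketched without Chilkat. This is only an illustration under stated assumptions: it needs Ruby 1.9 or later (String#encode), it assumes the "ANSI" bytes are Windows-1252, and the method name ansi_to_oem is hypothetical.

```ruby
# Assumptions: Ruby 1.9+, ANSI code page is Windows-1252, OEM code page is 850.

# Reinterpret raw ANSI bytes as Windows-1252 and transcode them to an OEM code page.
def ansi_to_oem(ansi_bytes, oem_name = "IBM850")
  ansi_bytes.dup.force_encoding("Windows-1252").encode(oem_name)
end

# The bytes E9 F4 E0 E7 are "é ô à ç" in Windows-1252.
ansi = [0xE9, 0x20, 0xF4, 0x20, 0xE0, 0x20, 0xE7, 0x0A].pack("C*")
oem  = ansi_to_oem(ansi)

# Show the converted bytes in hex instead of writing them to the console.
puts oem.bytes.map { |b| format("%02X", b) }.join(" ")
# 82 20 93 20 85 20 87 0A
```

Writing the resulting bytes to a console that is set to code page 850 should then show the accented characters; on a console with a different OEM code page the target encoding name would need to change accordingly.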



Austin Ziegler
Guest
Posts: n/a
 
      11-08-2006
On 11/7/06, Austin Ziegler <(E-Mail Removed)> wrote:
> On 11/7/06, (E-Mail Removed) <(E-Mail Removed)> wrote:
> > I have a problem with Unicode in irb on Windows. I noticed it when
> > trying to save an attribute of an ActiveRecord model containing an umlaut
> > (for example "ü") in script/console. If the database connection is
> > encoded in utf8, everything after the umlaut gets truncated; with the
> > default encoding I get garbled characters back. It doesn't matter whether
> > $KCODE is set to UTF8 or NONE, the character count stays the same
> > (also in plain irb)!

> The Windows console -- also used by Cygwin -- doesn't recognise UTF-8.
> (That is, it's not possible to properly display UTF-8 in cmd.exe, at
> least so far as I can tell.)


Ack my bad. I had forgotten: you can specify the UTF-8 codepage (CP_UTF8) with:

chcp 65001

There are some caveats, of course:

http://blogs.msdn.com/michkap/archiv...06/544251.aspx
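
To confirm from inside Ruby which code page the console is actually using, one could run chcp and map the number it reports to a Ruby encoding. What follows is a rough sketch, not something from the thread: it assumes Ruby 1.9 or later, the helper names are hypothetical, and the parser only assumes that the chcp output contains the number (English Windows prints "Active code page: 65001").

```ruby
# Extract the code page number from chcp output, e.g. "Active code page: 850".
# Returns nil when the text contains no digits.
def parse_code_page(chcp_output)
  digits = chcp_output[/\d+/]
  digits && digits.to_i
end

# Map a code page number to a Ruby Encoding via its "CP<number>" name,
# or return nil if Ruby does not know that name.
def encoding_for_code_page(number)
  Encoding.find("CP#{number}")
rescue ArgumentError
  nil
end

# Only meaningful on Windows: ask the console for its current code page.
if RUBY_PLATFORM =~ /mswin|mingw|cygwin/
  page = parse_code_page(`chcp`)
  puts "console code page #{page}: #{encoding_for_code_page(page).inspect}"
end
```

Checking the reported page before and after running chcp 65001 gives a quick indication of whether the switch took effect in the session that Ruby is running in.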

-austin
--
Austin Ziegler * http://www.halostatue.ca/
               * http://www.halostatue.ca/feed/

 
David Vallner
Guest
Posts: n/a
 
      11-08-2006

Austin Ziegler wrote:
>
> Ack my bad. I had forgotten: you can specify the UTF-8 codepage
> (CP_UTF8) with:
>
> chcp 65001
>
> There are some caveats, of course:
>
> http://blogs.msdn.com/michkap/archiv...06/544251.aspx
>


Also the good old combo of "mode con codepage select=65001".

http://msdn.microsoft.com/library/de.../en-us/intl/unicode_81rn.asp
lists pretty much all the numbers you can use. (The pain of navigating
to that on the MSDN website.)

Amusingly enough, none of those are even present anymore on WinXP Pro
x64. For yet more hilarity, the console is by default set to the DOS OEM
codepage of the given locale, instead of the newer ANSI ones that are
ISO extensions, which causes great fun when trying to use software
that's ever so smart and autodetects my locale as my preferred language
(Postgres, assorted GNU stuff being too clever by half) instead of using
the OS language version.

And "there are some caveats" is an understatement, the UTF-8 support in
the console is a sham - I couldn't get a trivial C program using
arbitrary combinations of tchar.h, wchar.h, -DUNICODE, cmd.exe, the
Windows console, a Cygwin and an MSYS rxvt to do something as daunting
as inputting random characters that aren't shared between Latin1 and Latin2
codepages, store them as multibyte internally, and then write them out
to a text file and to the console successfully without one step
breaking. The fact that the whole of CMD broke down in tears from changing
that setting is also worth noting - IIRC, it had problems doing output
redirection to a file and whatnot (I can't play around with this without
setting up a virtual machine with a 32bit XP). Basically, the Path Less
Annoying is to only use the console for working in your "native"
codepage, and use a non-console tool for everything else.

end # of rant

David Vallner


michael.raidel@gmail.com
Guest
Posts: n/a
 
      11-08-2006
> Ack my bad. I had forgotten: you can specify the UTF-8 codepage (CP_UTF8) with:
>
> chcp 65001


Thank you Austin for the nice hint!

The problem is that as soon as I switch the codepage, irb (and also
script/console) stops working (it doesn't even start anymore, it just
quits immediately without an error message).

Michael

 
Austin Ziegler
Guest
Posts: n/a
 
      11-09-2006
On 11/8/06, (E-Mail Removed) <(E-Mail Removed)> wrote:
> > Ack my bad. I had forgotten: you can specify the UTF-8 codepage (CP_UTF8) with:
> >
> > chcp 65001

>
> Thank you Austin for the nice hint!
>
> The problem is that as soon as I switch the codepage, irb (and also
> script/console) stops working (it doesn't even start anymore, it just
> quits immediately without an error message).


That's one of the caveats mentioned: batch files no longer work.
I don't know why. However, if you have Ruby installed in C:\Ruby, you can do:

copy C:\Ruby\bin\irb C:\Ruby\bin\irb.rb
irb.rb

Or:

ruby C:\Ruby\bin\irb

And you'll get a working irb.
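
Another way to sidestep the batch wrapper, sketched here as an assumption rather than something tested in this thread, is a small Ruby script that loads irb as a library and starts the session itself. require "irb" and IRB.start are irb's standard library entry points; the file name start_irb.rb and the method name start_console are hypothetical.

```ruby
# file: start_irb.rb (hypothetical name)
require "irb"

# Start an interactive irb session in the current Ruby process,
# so no irb.bat wrapper is involved.
def start_console
  IRB.start
end

# To launch the session when this file is run with `ruby start_irb.rb`,
# call the method at the end of the file:
# start_console
```

Whether this behaves any better than the other two commands after chcp 65001 would still need to be checked on the affected machine.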

-austin
--
Austin Ziegler * http://www.halostatue.ca/
               * http://www.halostatue.ca/feed/

 