Converting to UCS-2 or UTF-16 for use by a C extension

Wincent Colaiuta
I'm working on a C extension that embeds an ANTLR parser, and I need
to convert a Ruby input string into UCS-2 or possibly UTF-16 encoding.

I've got a working implementation but I suspect that it is flawed and
just wanted to ask if this is the right way to do it. The basic idea
is as follows (in pseudo-code):

// 1. unpack to array of UTF8 characters
utf8 = input.unpack("C*");

// 2. repack
packed = utf8.pack("U*");

// 3. convert using Iconv
ucs2 = Iconv.iconv("UCS-2", "UTF-8", packed).first

// 4. freeze
ucs2.freeze;

// 5. get pointer, and length (in 16 bit words)
pointer = StringValuePtr(ucs2); // this bit in C
count = ucs2.length / 2;

// 6. hand off to the parser...
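
For reference, here is a small sketch of what steps 1 and 2 do to a multibyte string. It assumes Ruby 1.9 or later (for the "\u20AC" escape and String#bytesize), which is newer than the Ruby this post was written against, and the variable names are illustrative only.

```ruby
# Sketch only: assumes Ruby 1.9+.
euro = "\u20AC"                     # the euro sign, three bytes in UTF-8

# "C*" unpacks unsigned bytes, not characters.
bytes = euro.unpack("C*")

# "U*" packs each integer as a Unicode code point in UTF-8,
# so every original byte becomes a character of its own.
repacked = bytes.pack("U*")

# "U*" on the unpack side reads code points instead of bytes.
code_points = euro.unpack("U*")
round_trip  = code_points.pack("U*")

puts bytes.inspect                  # the three UTF-8 bytes
puts repacked.bytesize              # larger than euro.bytesize
puts round_trip == euro
```

If this matches the Ruby in use, the "C*" then "U*" combination re-encodes each byte as a separate character instead of leaving the string unchanged, while a "U*" then "U*" round trip preserves a string that is already valid UTF-8.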

My doubts are basically as follows:

- I'm doing the unpack/repack because I am not sure that my string is
encoded internally as UTF-8... it *seems* to be, because if I type a
string like "€" in irb then I can see that it's composed of three
bytes in UTF-8 ("\342\202\254")

- Is it in UTF-8 only because my system's locale is set that way?
Might it be different on other people's machines? (And if so, how
would I find out what the encoding is?)

- In the case that the encoding is *not* UTF-8, does my "round-trip"
unpack/pack trick actually get it into UTF-8? (I don't think it will,
in which case the round-trip is a waste of time.)

- And once I've got the String in UCS-2, does StringValuePtr give me
access to the raw UCS-2 encoded data like I think it does? (seems to)

- Does calling length on the UCS-2 encoded string always give the
result in bytes? (I am almost certain that it does)

- Is there some more elegant way to get an arbitrary Ruby string into
UCS-2 so that it can be handed off to the C parser?
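
One possible shape for the conversion, as a sketch rather than a definitive answer: on Ruby 1.9 or later (an assumption, since that is newer than the Ruby this post targets) each String records its own encoding, and String#encode can transcode without Iconv. The helper name to_utf16_units and the choice of little-endian UTF-16 are hypothetical; the byte order the parser expects is not stated above.

```ruby
# Sketch only: assumes Ruby 1.9+ and a parser that wants UTF-16LE.
# to_utf16_units is a hypothetical helper, not part of any library.
def to_utf16_units(input)
  # Transcodes from whatever input.encoding says; raises if the
  # bytes are not valid in that encoding.
  utf16 = input.encode("UTF-16LE").freeze
  # bytesize counts bytes; length counts characters on Ruby 1.9+.
  [utf16, utf16.bytesize / 2]
end

utf16, count = to_utf16_units("a\u20AC")
puts utf16.encoding                 # UTF-16LE
puts utf16.bytes.inspect            # raw bytes the C side would see
puts count                          # number of 16-bit units

# A character outside the Basic Multilingual Plane needs a surrogate
# pair, so it takes two 16-bit units and cannot be expressed in UCS-2.
astral, astral_count = to_utf16_units("\u{1F600}")
puts astral.length                  # characters
puts astral_count                   # 16-bit units
```

Under these assumptions, input.encoding answers the "what encoding is this string in" question, and bytesize is the safer call for a byte count because length reports characters on newer Rubies. On the C side, StringValuePtr together with RSTRING_LEN should then expose the same bytes and byte length.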

