utf-8 and ctypes

 
 
Brendan Miller
      09-28-2010
I'm using python 2.5.

Currently I have some python bindings written in ctypes. On the C
side, my strings are in utf-8. On the python side I use
ctypes.c_char_p to convert my strings to python strings. However, this
seems to break for non-ascii characters.

It seems that characters not in the ascii subset of UTF-8 are
discarded by c_char_p during the conversion, or at least they don't
print out when I go to print the string.

Does python not support utf-8 strings? Is there some other way I
should be doing the conversion?

Thanks,
Brendan
 
 
Lawrence D'Oliveiro
      09-29-2010
In message <(E-Mail Removed)>, Brendan
Miller wrote:

> It seems that characters not in the ascii subset of UTF-8 are
> discarded by c_char_p during the conversion ...


Not a chance.

> ... or at least they don't print out when I go to print the string.


So it seems there’s a problem on the printing side. What happens when you
construct a UTF-8-encoded string directly in Python and try printing it the
same way?
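
The check suggested above can be sketched roughly as follows. This is a minimal illustration, assuming Python 3 (where the encoded value is a `bytes` object rather than the `str` of the thread's Python 2.5); the sample text is only an example. It inspects the encoded bytes through `repr`, which stays ASCII and so does not depend on what the terminal can display.

```python
# -*- coding: utf-8 -*-
# Assumption: Python 3. In Python 2.5 the encoded value would be a plain str.
import sys

text = u"日本語のテスト"          # sample text, chosen only for illustration
encoded = text.encode("utf-8")   # the kind of byte string a C library returns

# repr() of bytes is pure ASCII, so it shows every byte on any terminal.
print(repr(encoded))

# The encoding the terminal expects; may be None when output is piped.
print(sys.stdout.encoding)

# If the bytes decode back to the same text, nothing was lost in the data
# and any remaining problem is on the display side.
roundtrip = encoded.decode("utf-8")
print(roundtrip == text)
```

If the bytes survive the round trip but the text still does not appear on screen, the terminal encoding is the more likely suspect.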
 
 
Brendan Miller
      09-29-2010
2010/9/29 Lawrence D'Oliveiro <(E-Mail Removed)_zealand>:
> So it seems there’s a problem on the printing side. What happens when you
> construct a UTF-8-encoded string directly in Python and try printing it the
> same way?


Doing this seems to confirm something is broken in ctypes w.r.t. UTF-8...

if I enter:
str = "日本語のテスト"

Then:
print str
日本語のテスト

However, when I create a string buffer, pass it into my c++ code, and
write the same UTF-8 string into it, python seems to discard pretty
much all the text. The same code works for pure ascii strings.

Python code:
_std_string_size = _lib_mbxclient.std_string_size
_std_string_size.restype = c_long
_std_string_size.argtypes = [c_void_p]

_std_string_copy = _lib_mbxclient.std_string_copy
_std_string_copy.restype = None
_std_string_copy.argtypes = [c_void_p, POINTER(c_char)]

# This function works for ascii, but breaks on strings with UTF-8!
def std_string_to_string(str_ptr):
    buf = create_string_buffer(_std_string_size(str_ptr))
    _std_string_copy(str_ptr, buf)
    return buf.raw

C++ code:

extern "C"
long std_string_size(string* str)
{
    return str->size();
}

extern "C"
void std_string_copy(string* str, char* buf)
{
    std::copy(str->begin(), str->end(), buf);
}
 
 
MRAB
      09-29-2010
On 29/09/2010 19:33, Brendan Miller wrote:
> However, when I create a string buffer, pass it into my c++ code, and
> write the same UTF-8 string into it, python seems to discard pretty
> much all the text. The same code works for pure ascii strings.


It might have something to do with the character encoding of your
source files.

Also, try printing out the character codes of the string, and the size
of the string's characters, in the C++ code.
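
On the Python side, the same kind of inspection might look like the sketch below. It assumes Python 3 and a sample string picked only for illustration; `bytearray` is used so that iterating yields integer byte values.

```python
# -*- coding: utf-8 -*-
# Assumption: Python 3. The sample text is illustrative only.
data = u"日本語".encode("utf-8")

# One hex code per byte; non-ASCII UTF-8 bytes are all 0x80 or above.
codes = [hex(b) for b in bytearray(data)]

# Three characters become nine bytes, because each one takes three bytes in UTF-8.
print(len(data), codes)

# Quick check for whether a byte string contains anything outside ASCII.
has_non_ascii = any(b > 0x7F for b in bytearray(data))
print(has_non_ascii)
```

Comparing these codes with what the C++ side prints shows whether the bytes change in transit or only when displayed.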
 
 
Mark Tolonen
      09-30-2010

"Brendan Miller" <(E-Mail Removed)> wrote in message
news:AANLkTi=(E-Mail Removed)...
> However, when I create a string buffer, pass it into my c++ code, and
> write the same UTF-8 string into it, python seems to discard pretty
> much all the text. The same code works for pure ascii strings.


I didn't see what OS you are using, but I fleshed out your example code and
have a working example for Windows. Below is the code for the DLL and
script:

--------- x.cpp [cl /LD /EHsc /W4 x.cpp] ---------
#include <string>
#include <algorithm>
using namespace std;

extern "C" __declspec(dllexport) long std_string_size(string* str)
{
    return str->size();
}

extern "C" __declspec(dllexport) void std_string_copy(string* str, char* buf)
{
    std::copy(str->begin(), str->end(), buf);
}

extern "C" __declspec(dllexport) void* make(const char* s)
{
    return new string(s);
}

extern "C" __declspec(dllexport) void destroy(void* s)
{
    delete (string*)s;
}
---- x.py ---------------------------------------------------------
# coding: utf8
from ctypes import *
_lib_mbxclient = CDLL('x')

_std_string_size = _lib_mbxclient.std_string_size
_std_string_size.restype = c_long
_std_string_size.argtypes = [c_void_p]

_std_string_copy = _lib_mbxclient.std_string_copy
_std_string_copy.restype = None
_std_string_copy.argtypes = [c_void_p, c_char_p]

make = _lib_mbxclient.make
make.restype = c_void_p
make.argtypes = [c_char_p]

destroy = _lib_mbxclient.destroy
destroy.restype = None
destroy.argtypes = [c_void_p]

# This function works for ascii, but breaks on strings with UTF-8!
def std_string_to_string(str_ptr):
    buf = create_string_buffer(_std_string_size(str_ptr))
    _std_string_copy(str_ptr, buf)
    return buf.raw

s = make(u'我是美国人。'.encode('utf8'))
print std_string_to_string(s).decode('utf8')
------------------------------------------------------

And output (in Pythonwin...US Windows console doesn't support Chinese):

我是美国人。

I used c_char_p instead of POINTER(c_char) and added functions to create and
destroy a std::string for Python's use, but it is otherwise the same as your
code.

Hope this helps you work it out,
-Mark
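
For readers without a compiler to hand, the buffer round trip in the script above can be approximated in pure Python. This is only a sketch, assuming Python 3: `ctypes.memmove` stands in for the DLL's `std_string_copy`, and `bytes_through_buffer` is a hypothetical helper name, not something from the thread.

```python
# -*- coding: utf-8 -*-
# Assumption: Python 3, where encoded text is a bytes object.
import ctypes


def bytes_through_buffer(data):
    """Copy data into a ctypes buffer and read it back unchanged."""
    # Same allocation as std_string_to_string: exactly len(data) bytes.
    buf = ctypes.create_string_buffer(len(data))
    # Stand-in for the DLL's std_string_copy; this is not the thread's C++ code.
    ctypes.memmove(buf, data, len(data))
    # .raw returns every byte in the buffer; .value would stop at the first NUL.
    return buf.raw


original = u"我是美国人。"
encoded = original.encode("utf-8")
result = bytes_through_buffer(encoded)

# Compare bytes rather than printing the text, so the terminal is not involved.
print(len(encoded), result == encoded)
```

The point of the sketch is that a ctypes string buffer carries non-ASCII bytes through untouched, which is consistent with the working example above.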




 
 
Diez B. Roggisch
      09-30-2010
Brendan Miller <(E-Mail Removed)> writes:

> Doing this seems to confirm something is broken in ctypes w.r.t. UTF-8...
>
> if I enter:
> str = "日本語のテスト"


What is this? Which encoding is used by your editor to produce this
byte-string?

If you want to be sure you have the right encoding, you need to do this:

- put a "# coding: utf-8" declaration (or whatever encoding your editor
actually uses) in the first or second line of the source file
- use unicode literals. Those are the funny little strings with a "u" in
front of them. They will be *decoded* using the declared encoding.
- when passing this to C, explicitly *encode* with utf-8 first.

Diez
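
The three steps listed above could be sketched like this. It is an illustration under assumptions, not code from the thread: it targets Python 3, uses a bare `ctypes.c_char_p` in place of a real C function taking a `char*`, and the sample text is arbitrary.

```python
# -*- coding: utf-8 -*-
# Step 1: the coding line above declares the encoding of this source file.
# Assumption: Python 3; no real C library is called here.
import ctypes

# Step 2: a unicode literal, decoded using the declared source encoding.
text = u"日本語のテスト"

# Step 3: encode explicitly before the data crosses into C.
encoded = text.encode("utf-8")

# A c_char_p holds the bytes a char* parameter would receive.
c_arg = ctypes.c_char_p(encoded)

# Coming back from C, .value yields bytes again; decode them to get text.
decoded = c_arg.value.decode("utf-8")

# Character count and byte count differ for non-ASCII text.
print(len(text), len(encoded), decoded == text)
```

Keeping the encode and decode calls at the boundary means the rest of the Python code only ever handles text, and the C side only ever handles UTF-8 bytes.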
 