How should I handle the multibyte char set string in C++?

 
 
Dancefire
      04-29-2007
Hi, everyone,

I'm writing a program that uses wstring (wchar_t) as its internal
string type.

The problem arises when I convert multibyte strings in various
encodings to wstring (which is Unicode: UCS-2LE (BMP only) on Win32,
and UCS-4 on Linux?).

I see two ways to do the job:

1) Use std::locale: set std::locale::global() and call mbstowcs() and
wcstombs() to do the conversion.

2) Use platform-dependent functions, such as libiconv on Linux, or
MultiByteToWideChar() and WideCharToMultiByte() on Win32.

At first glance, option 1) looks like the obvious choice: it is the
C++ way and, if my understanding is correct, the codecvt facet merely
wraps libiconv on Linux and MultiByteToWideChar() /
WideCharToMultiByte() on Win32 (depending on the STL implementation)
to do the real work.

However, I have two problems.

First, I have to set the global locale before I do the conversion.

This has two side effects. The first is that in a multi-threaded
program, changing the global setting affects other threads that use a
different encoding for their conversions. Yes, I could lock around the
conversion, but that makes little sense and would hurt performance
badly.

The second is that calling std::locale::global() every time is
expensive: creating a locale object and installing it as the global
locale is not cheap, and it does slow things down.

The second problem is that the system-dependent conversion functions
seem to support many more encodings than std::locale does in each STL
implementation. For example, libiconv supports the UCS-2LE encoding,
but g++'s locale() doesn't. MultiByteToWideChar() supports UTF-8
conversion, but std::locale() in the MSVC 8.0 STL doesn't accept
".65001" for code page 65001, which is UTF-8.

That the locale names differ between platforms might be a third
problem, but I can easily work around it with #ifdef/#endif.

So, back to the original question: how should I handle MBCS strings in
C++?

Thanks.

 
James Kanze
      04-29-2007
On Apr 29, 4:40 pm, Dancefire <Dancef...@gmail.com> wrote:

> I'm writing a program using wstring(wchar_t) as internal string.


> The problem is raised when I convert the multibyte char set string
> with different encoding to wstring(which is Unicode, UCS-2LE(BMP) in
> Win32, and UCS4 in Linux?).


> I have 2 ways to do the job:


> 1) use std::locale, set std::locale::global() and use mbstowcs() and
> wcstombs() do the conversion.


Why not std::codecvt? A facet which you can obtain from a
locale.

> 2) use platform dependent functions to do the job, such as libiconv in
> Linux, or MultiByteToWideChar() and WideCharToMultiByte() in Win32.


> At first glance, it might be definitely to choose the solution 1) to
> do the job. Since it's really C++ favor, and in details, the codecvt
> facet is actually wrap the function by calling libiconv in Linux, and
> MultiByteToWideChar() or WideCharToMultiByte() in Win32 (by different
> STL implementation) to do the real job.(if my understanding is
> correct).


> However, I have 2 problems.


> First, I have to set the global locale before I do the conversion.


Why? You can get a facet from any locale. That's the one
advantage C++ locales have over the C stuff.
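
A minimal sketch of that point, assuming a hypothetical helper named
widen(): the codecvt facet is fetched from a locale object passed in by
the caller, so std::locale::global() is never touched and other threads
are unaffected. The sizing rule (one wide character per input byte as
an upper bound) and the error handling are illustrative choices, not
something taken from the posts above.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: widen `src` with the codecvt facet of `loc`.
// The global locale is neither read nor modified.
std::wstring widen(const std::string& src, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> CodeCvt;
    const CodeCvt& cvt = std::use_facet<CodeCvt>(loc);

    if (src.empty())
        return std::wstring();

    // Every wide character consumes at least one byte of input, so
    // src.size() wide characters is an upper bound for the output.
    std::wstring dst(src.size(), L'\0');

    std::mbstate_t state = std::mbstate_t();   // initial shift state
    const char* from_next = 0;
    wchar_t* to_next = 0;

    const std::codecvt_base::result res =
        cvt.in(state,
               src.data(), src.data() + src.size(), from_next,
               &dst[0], &dst[0] + dst.size(), to_next);

    // Treat error, partial (truncated input) and noconv as failures.
    if (res != std::codecvt_base::ok)
        throw std::runtime_error("widen: conversion failed");

    dst.resize(to_next - &dst[0]);
    return dst;
}
```

A call such as widen("abc", std::locale::classic()) works everywhere;
passing a named locale (for example std::locale("zh_CN.GBK"), a name
that is only an assumption here) throws std::runtime_error from the
locale constructor if that locale is not installed.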

[...]
> Second problem, looks like the system dependent conversion functions
> support much more encoding than std::locale() by each STL
> implementation.


That's a problem with the C++ library implementation. A quality
implementation will support all of the code sets that are
installed on the system.

> For example, libiconv support UCS-2LE encoding, but g++'s
> locale() doesn't support it. MultiByteToWideChar() support
> UTF8 conversion, but MSVC(8.0)'s STL std::locale() doesn't
> support ".65001" for code page 65001 which is UTF8.


Finding what locales are available and work can be a bit of a
game. And how they are named, if you're not under Unix.

> The locale string is not same on different platform might be the third
> problem, but I can easily ignore it by #ifdef #endif.


> So, back to beginning question, how should I handle the MBCS string in
> C++?


The official answer is std::codecvt. In practice, I roll my
own.

--
James Kanze (Gabi Software) email:
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

 
Dancefire
      04-30-2007
> Why not std::codecvt? A facet which you can obtain from a
> locale.


Oops, I missed std::codecvt. Thank you.

After trying std::codecvt, I have two more questions.

1) Should we initialize the mbstate_t variable? And how do we
initialize mbstate_t portably, in the C++ way?

Much of the sample code I found on the net doesn't initialize the
mbstate_t variable. For example:

http://incubator.apache.org/stdcxx/d...cvt.html#sec12

std::mbstate_t state;

And the sample in the MSDN shipped with Visual Studio 2005:

mbstate_t state;

They just declare it and use it, never assigning an initial value to
the state. And I did hit a problem in VC80 when I didn't zero the
state: in debug mode the first character is always messed up, while
the following ones are fine.

But the online version of MSDN does initialize the mbstate_t variable:
http://msdn2.microsoft.com/en-us/lib...58(VS.80).aspx

mbstate_t state = {0};

I also found code that uses memset() to zero the whole object, but I
don't think that's the C++ way.
How should I do the initialization portably?

2) I can find the wchar_t* buffer length for codecvt.in() with
codecvt.length(), but how do I find the char* buffer length for
codecvt.out()?

I can pass a null pointer to mbstowcs() or wcstombs() to get the
length of the output buffer I need, but I don't know how to do the
same thing with codecvt<>.

> > For example, libiconv support UCS-2LE encoding, but g++'s
> > locale() doesn't support it. MultiByteToWideChar() support
> > UTF8 conversion, but MSVC(8.0)'s STL std::locale() doesn't
> > support ".65001" for code page 65001 which is UTF8.

>
> Finding what locales are available and work can be a bit of a
> game. And how they are named, if you're not under Unix.
>


I use "locale -a" to list all the locale names supported on Linux, and
the following link to find the locale strings on Windows:

http://msdn2.microsoft.com/en-us/lib...78(vs.80).aspx

However, I still cannot handle "UCS-2"/"UTF16" on Linux or
"UTF8"/"UTF16" on Windows through std::locale. Do you know how I can
do this?

>
> The official answer is std::codecvt. In practice, I roll my
> own.
>



Thanks again, you've been a great help.

 
sebor@roguewave.com
      04-30-2007
On Apr 30, 4:56 am, Dancefire <Dancef...@gmail.com> wrote:
[...]
> 1) Should we initialize mbstate_t variable? And how to initialize the
> mbstate_t portable and in C++ way?
>
> Many sample code I saw on the net, didn't initialize the mbstate_t
> variable. Such as:
>
> http://incubator.apache.org/stdcxx/d...cvt.html#sec12
>
> std::mbstate_t state;


Strictly speaking you should zero-initialize the state. It doesn't
matter in the trivial example shown in the Apache stdcxx documentation,
but in general the state must be either zeroed out (i.e., to represent
the initial shift state) or be the result of a prior conversion.

I have corrected the example program to initialize the state variable,
see: http://svn.apache.org/viewvc?view=rev&revision=533806. I'll fix
the docs next.

>

[...]
> mbstate_t state = {0};
>
> And I do find a code using memset() to set all range to zero, but I
> don't think it's c++'s way.
> How should I make the initial portable?


Like so:

mbstate_t state = mbstate_t ();
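
As a small illustration of why this form is enough: value-initializing
the object zeroes it, and the standard std::mbsinit() function from
<cwchar> can be used to confirm that a state object describes the
initial conversion state. The function name make_initial_state() below
is hypothetical and exists only for the example.

```cpp
#include <cassert>
#include <cwchar>

// Hypothetical helper: returns a conversion state that is
// value-initialized, i.e. zeroed, which denotes the initial state.
std::mbstate_t make_initial_state()
{
    return std::mbstate_t();
}

// True when `state` describes the initial conversion state.
bool is_initial(const std::mbstate_t& state)
{
    return std::mbsinit(&state) != 0;
}
```

Unlike memset(), this relies only on the language's initialization
rules and needs no knowledge of the size or layout of mbstate_t.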

>
> 2) I can know the wchar_t* buf length for codecvt.in() by
> codecvt.length(), but how should I know the char * buffer length for
> codecvt.out()?


codecvt::length() returns the number of extern_type characters (i.e.,
narrow chars for codecvt<wchar_t, char>).

>

[...]
> However, I still cannot handle "UCS-2"/"UTF16" in Linux or
> "UTF8"/"UTF16" in Windows by std::locale. Do you know how can I do
> this?


In the Apache C++ Standard Library you can do it using
a codecvt_byname facet constructed with the name "UTF-8@UCS"
as an argument, although it's not mentioned on the documentation page:
http://incubator.apache.org/stdcxx/d...vt-byname.html
Let me look into adding it.

 
Dancefire
      05-01-2007
> > 1) Should we initialize mbstate_t variable? And how to initialize the
> > mbstate_t portable and in C++ way?

>
> > Many sample code I saw on the net, didn't initialize the mbstate_t
> > variable. Such as:

>
> >http://incubator.apache.org/stdcxx/d...cvt.html#sec12

>
> > std::mbstate_t state;

>
> Strictly speaking you should zero-initialize the state. It doesn't
> matter
> in the trivial example shown in the Apache stdcxx documentation but
> in general the state must be either zeroed out (i.e., to represent the
> initial shift state) or be the result of a prior conversion.
>
> I have corrected the example program to initialize the state variable,
> see:http://svn.apache.org/viewvc?view=rev&revision=533806. I'll fix
> the docs next.
>
>


Yes, the example in the Apache stdcxx documentation works, since it
doesn't try to handle MBCS text in a CJK encoding. If the state is not
zeroed, the code has trouble with MBCS strings: the first one or two
bytes are parsed incorrectly if they are greater than 0x80 (the
following bytes may still be parsed correctly), and if the first one
or two chars are < 0x80, the conversion may simply return an error.

Thank you very much for correcting the code and the docs; it will make
things much clearer for others and help them avoid the problem I
faced.

> [...]
> > mbstate_t state = {0};

>
> > And I do find a code using memset() to set all range to zero, but I
> > don't think it's c++'s way.
> > How should I make the initial portable?

>
> Like so:
>
> mbstate_t state = mbstate_t ();
>


I get it, thank you very much.

>
>
> > 2) I can know the wchar_t* buf length for codecvt.in() by
> > codecvt.length(), but how should I know the char * buffer length for
> > codecvt.out()?

>
> codecvt::length() returns the number of extern_type characters (i.e.,
> narrow chars for codecvt<wchar_t, char>).
>


I'm a little confused here, even after reading the documentation.
Could you give me a piece of code showing how to do the same thing as
the code below:

===================================
string str("\xba\xba\xd6\xd7");
size_t len = mbstowcs(0, str.c_str(), str.length());
wchar_t* wstr = new wchar_t[len+1];
mbstowcs(wstr, str.c_str(), len);
===================================
And the reverse version:

===================================
wstring wstr(L"\xbaba\xd6d7");
size_t len = wcstombs(0, wstr.c_str(), wstr.length());
char* str = new char[len+1];
wcstombs(str, wstr.c_str(), len);
===================================

The point is that I need the length of the output buffer so that I can
allocate it safely. How can I get the buffer length for both
codecvt::in() and codecvt::out()?

BTW, is the code above correct? I mean the second call to wcstombs()
or mbstowcs(), which passes "len" as the length, whereas the first
call passes "wstr.length()" or "str.length()".

>
> [...]
> > However, I still cannot handle "UCS-2"/"UTF16" in Linux or
> > "UTF8"/"UTF16" in Windows by std::locale. Do you know how can I do
> > this?

>
> In the Apache C++ Standard Library you can do it using
> a codecvt_byname facet constructed with the name "UTF-8@UCS"
> as an argument, although it's not mentioned on the documentation page:http://incubator.apache.org/stdcxx/d...vt-byname.html
> Let me look into adding it.


Thank you, now I know how to handle this in the Apache C++ Standard
Library. I will try that.
Do you know how I can do this with g++'s STL? I mean conversion
between wchar_t*, which contains a UCS-4 string, and char*, which
contains a UCS-2 or UTF-16 string.

The problem arose when I tried to make a project portable between
Windows and Linux. I am trying to write Unicode strings to a file.

When I choose UTF-8 for writing, I get two problems:

1) VC80's STL doesn't support a UTF-8 locale (the Win32 API does, but
using the Win32 API makes part of the code non-portable).
2) All of the strings are CJK characters, so UTF-8 costs at least 3
bytes per character, 50% more storage than the 2 bytes UCS-2 would
need. And I'm sure all the characters are in the BMP of ISO 10646, so
I'd rather use just 16 bits per character in the file.

However, if I choose UCS-2LE, which is what wchar_t holds in VC, I
have trouble reading the file on Linux: g++'s STL doesn't seem to
support a UCS-2LE locale, and wchar_t on Linux is UCS-4 rather than
UCS-2, so I cannot read the content directly. (Same story as before:
libiconv supports UCS-2LE, but using it makes part of the code
non-portable and makes my code depend on libiconv.)

So, what should I do in this case?

 
P.J. Plauger
      05-01-2007
"Dancefire" <> wrote in message
>
> So, What should I do in this case?


Everything you need is included in our Compleat Libraries, for both
VC++ and gcc. But they cost $.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


 
Dancefire
      05-01-2007
On May 1, 7:46 pm, "P.J. Plauger" <p...@dinkumware.com> wrote:
>
> Everything you need is included in our Compleat Libraries, for both
> VC++ and gcc. But they cost $.
>
> P.J. Plauger
> Dinkumware, Ltd.http://www.dinkumware.com



Yes, the Compleat Libraries are cool, but before I pay for them, I
need to make sure there is no easy way to do it myself.
I'm developing an open source project, so for portability reasons I'd
rather depend on the existing STL in VC80 Express on Windows, and
libstdc++ on Linux (or other platforms).
I'm trying to find a common encoding for Unicode in both the VC80
Express STL and libstdc++.

 
P.J. Plauger
      05-01-2007
"Dancefire" <> wrote in message
>
> Yes, the Compleat Libraries is cool. but before I pay it, I need to
> make sure there is no way to do it easily.
> I'm developing an open source project, for portability reason, I'd
> better depends on existing STL in VC80 Express for windows, and libstdc
> ++ for Linux(or other).
> I'm trying to find the common encoding for Unicode in both VC80
> Express STL and libstdc++.


Well, you can encode Unicode as:

-- UTF-8 in an array of char

-- UTF-16 in an array of short (or wchar_t under VC++)

-- UCS-2 in an array of short (if you're willing to settle for the common
65K Unicode subset)

-- UTF-32 or UCS-4 in an array of long (or wchar_t under gcc)

We supply a whole slew of interconversions between these forms, and
the appropriate endian versions in files, in our Code Conversions
library (part of the Compleat Libraries). See:

file:///C:/htm_cplt/temp/index_cvt.html

for an essay on code conversions and the list of facets we supply.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


 
Dancefire
      05-02-2007
>
> Well, you can encode Unicode as:
>
> -- UTF-8 in an array of char
>
> -- UTF-16 in an array of short (or wchar_t under VC++)
>
> -- UCS-2 in an array of short (if you're willing to settle for the common
> 65K Unicode subset)
>
> -- UTF-32 or UCS-4 in an array of long (or wchar_t under gcc)
>
> We supply a whole slew of interconversions between these forms, and
> the appropriate endian versions in files, in our Code Conversions
> library (part of the Compleat Libraries). See:
>
> file:///C:/htm_cplt/temp/index_cvt.html
>
> for an essay on code conversions and the list of facets we supply.
>
> P.J. Plauger
> Dinkumware, Ltd.http://www.dinkumware.com


Thanks, but I can't open the link; it's a local file path.

One more question about codecvt: I'm not familiar with it, so I need
some help here.

> > 2) I can know the wchar_t* buf length for codecvt.in() by
> > codecvt.length(), but how should I know the char * buffer length for
> > codecvt.out()?


> codecvt::length() returns the number of extern_type characters (i.e.,
> narrow chars for codecvt<wchar_t, char>).


I'm a little confused here, even after reading the documentation.
Could you give me a piece of code showing how to do the same thing as
the code below:

===================================
string str("\xba\xba\xd6\xd7");
size_t len = mbstowcs(0, str.c_str(), str.length());
wchar_t* wstr = new wchar_t[len+1];
mbstowcs(wstr, str.c_str(), len);
===================================
And the reverse version:

===================================
wstring wstr(L"\xbaba\xd6d7");
size_t len = wcstombs(0, wstr.c_str(), wstr.length());
char* str = new char[len+1];
wcstombs(str, wstr.c_str(), len);
===================================

The point is that I need the length of the output buffer so that I can
allocate it safely. How can I get the buffer length for both
codecvt::in() and codecvt::out()?

BTW, is the code above correct? I mean the second call to wcstombs()
or mbstowcs(), which passes "len" as the length, whereas the first
call passes "wstr.length()" or "str.length()".

Thanks

 
sebor@roguewave.com
      05-04-2007
On May 1, 1:18 am, Dancefire <Dancef...@gmail.com> wrote:
[...]
> I'm a little confuse here, even after read the document. Could you
> give me a piece of code as example how to do same thing as below's
> code:


I don't blame you for being confused. You can't use length() for this
(or for much else, I'm afraid). It's really not a very useful
function.

>
> ===================================
> string str("\xba\xba\xd6\xd7");
> size_t len = mbstowcs(0, str, str.length());
> wchar_t* wstr = new wchar_t[len+1];
> mbstowcs(wstr, str, len);


Here's an implementation of mbstowcs() using codecvt. I'll probably
put it up on the Apache stdcxx site or include it in the documentation
but I'm pasting it here for reference (let me know if you run into any
problems with it). The reverse (i.e., wcstombs()) is analogous and
I'll leave its implementation as an exercise for interested
readers.

#include <cstddef>   // std::size_t
#include <cstring>   // std::strlen
#include <cwchar>    // std::mbstate_t
#include <locale>    // std::locale, std::codecvt, std::use_facet

std::size_t
my_mbstowcs (std::mbstate_t *pstate,
             wchar_t        *dst,
             const char     *src,
             std::size_t     size)
{
    // a default-constructed locale is a copy of the global locale
    const std::locale global;

    typedef std::codecvt<wchar_t, char, std::mbstate_t> CodeCvt;

    // retrieve the codecvt facet from the global locale
    const CodeCvt &cvt = std::use_facet<CodeCvt>(global);

    // use local shift state when pstate is null
    std::mbstate_t state = std::mbstate_t ();
    if (0 == pstate)
        pstate = &state;

    // use a small local buffer when dst is null and ignore size
    wchar_t buf [32];
    if (0 == dst) {
        dst  = buf;
        size = sizeof buf / sizeof *buf;
    }

    const char *from      = src;
    const char *from_end  = from + std::strlen (from);
    const char *from_next = from;

    wchar_t *to      = dst;
    wchar_t *to_end  = to + size;
    wchar_t *to_next = to;   // initialized: it is read by the loop test

    // number of non-NUL wide characters stored in destination buffer
    std::size_t nconv = 0;

    for ( ; from_next != from_end && to_next != to_end; ) {

        const std::codecvt_base::result res =
            cvt.in (*pstate,
                    from, from_end, from_next,
                    to, to_end, to_next);

        switch (res) {

        case std::codecvt_base::error:
            return std::size_t (-1);

        case std::codecvt_base::noconv:
            // should not happen (bad codecvt facet)
            return std::size_t (-1);

        case std::codecvt_base::ok:
        case std::codecvt_base::partial:

            nconv += to_next - to;

            // done when no input was consumed, or when writing to the
            // caller's buffer (a single pass fills as much as fits)
            if (from_next == from || dst != buf)
                return nconv;

            // counting only: reuse the scratch buffer for the rest
            from    = from_next;
            to      = dst;
            to_end  = dst + size;
            to_next = to;   // so a full buffer does not end the loop

            break;
        }
    }

    return nconv;
}
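
For the reverse direction left as an exercise above, one possible
sketch follows. It sizes the output with codecvt::max_length(), which
is one way to answer the "how big must the char buffer for out() be"
question. The helper name narrow() is hypothetical, and the sizing rule
is an assumption that fits stateless encodings; a stateful encoding
would also need room for shift sequences and a final call to unshift(),
which this sketch does not attempt.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert `src` to the narrow encoding of `loc`
// with codecvt::out().
std::string narrow(const std::wstring& src, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> CodeCvt;
    const CodeCvt& cvt = std::use_facet<CodeCvt>(loc);

    if (src.empty())
        return std::string();

    // Assumption: max_length() narrow chars per wide character is
    // enough room (true for stateless encodings).
    std::string dst(src.size() * cvt.max_length(), '\0');

    std::mbstate_t state = std::mbstate_t();   // initial shift state
    const wchar_t* from_next = 0;
    char* to_next = 0;

    const std::codecvt_base::result res =
        cvt.out(state,
                src.data(), src.data() + src.size(), from_next,
                &dst[0], &dst[0] + dst.size(), to_next);

    // Treat error, partial and noconv as failures in this sketch.
    if (res != std::codecvt_base::ok)
        throw std::runtime_error("narrow: conversion failed");

    dst.resize(to_next - &dst[0]);   // trim to the bytes written
    return dst;
}
```

Allocating the worst case up front and trimming afterwards avoids a
separate counting pass, at the cost of a temporarily larger buffer.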

[...]
> BTW, am I correct in above code? I mean at the second time call for
> wcstombs() or mbstowcs() which use "len" as the length rather than as
> the first call which are use "wstr.length()" or "str.length()" as the
> length?


I don't think that's correct. The last argument specifies the size of
the destination buffer.
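
A sketch of the two-pass pattern with the sizes passed that way. The
helper name to_wide() is hypothetical. Getting the required length by
passing a null destination is behaviour specified by POSIX (and
documented by Microsoft) rather than something ISO C promises, so treat
that part as an assumption about the platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert `src` with mbstowcs() in the current
// C locale. Input is taken up to its first NUL character.
std::wstring to_wide(const std::string& src)
{
    // Pass 1: with a null destination the count argument is ignored
    // and the number of wide characters needed (without the
    // terminating NUL) is returned.
    const std::size_t len = std::mbstowcs(0, src.c_str(), 0);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("to_wide: invalid multibyte sequence");

    // Pass 2: the last argument is the capacity of the destination in
    // wide characters, so len + 1 leaves room for the terminating NUL.
    std::vector<wchar_t> buf(len + 1);
    std::mbstowcs(&buf[0], src.c_str(), buf.size());

    return std::wstring(&buf[0], len);
}
```

Using std::vector instead of new[] keeps the buffer from leaking if
anything throws between allocation and use.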

>

[...]
> Thank you, I know how to handle this in Apache C++ Standard Library
> now. I will try that.
> Do you know the how can I use g++'s STL do this? I mean, conversion
> between wchar_t*, which contain UCS-4 string, and char*, which contain
> UCS-2 or UTF16 string.


You should be able to use the same code to convert between UCS and
UTF-8 across all implementations. The only thing that may be different
is the name of the locale. I don't know of a portable way to do UTF-16
(not to be confused with UCS-2), or UCS-2 on platforms where wchar_t
isn't 2 bytes wide (or, conversely, UCS-4 where wchar_t is 2 bytes).
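
For the BMP-only file format discussed earlier in the thread, one
option in the "roll my own" spirit is to do the UCS-2LE byte packing by
hand, which depends on neither locales nor libiconv. The sketch below
is only that: the names to_ucs2le() and from_ucs2le() are hypothetical,
and it assumes each wchar_t holds one Unicode code point, which holds
for UCS-4 on Linux and for BMP-only text on Windows.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: pack BMP code points as UCS-2LE bytes.
std::string to_ucs2le(const std::wstring& src)
{
    std::string out;
    out.reserve(src.size() * 2);
    for (std::size_t i = 0; i != src.size(); ++i) {
        const unsigned long cp = static_cast<unsigned long>(src[i]);
        // Reject anything UCS-2 cannot represent: code points above
        // U+FFFF and the surrogate range.
        if (cp > 0xFFFFul || (cp >= 0xD800ul && cp <= 0xDFFFul))
            throw std::range_error("to_ucs2le: not a BMP code point");
        out += static_cast<char>(cp & 0xFF);          // low byte first
        out += static_cast<char>((cp >> 8) & 0xFF);   // then high byte
    }
    return out;
}

// Hypothetical helper: unpack UCS-2LE bytes into wide characters.
// Surrogate values in the input are passed through unchecked.
std::wstring from_ucs2le(const std::string& src)
{
    if (src.size() % 2 != 0)
        throw std::range_error("from_ucs2le: odd number of bytes");
    std::wstring out;
    out.reserve(src.size() / 2);
    for (std::size_t i = 0; i != src.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(src[i]);
        const unsigned hi = static_cast<unsigned char>(src[i + 1]);
        out += static_cast<wchar_t>(lo | (hi << 8));
    }
    return out;
}
```

Because the byte order is fixed by the code rather than by the host,
a file written this way reads back the same on either platform; it
should be opened in binary mode so that no newline translation touches
the bytes.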

 