> > 1) Should we initialize an mbstate_t variable? And how can we
> > initialize mbstate_t portably, in the C++ way?
>
> > Much of the sample code I have seen on the net doesn't initialize
> > the mbstate_t variable. For example:
>
> >http://incubator.apache.org/stdcxx/d...cvt.html#sec12
>
> > std::mbstate_t state;
>
> Strictly speaking you should zero-initialize the state. It doesn't
> matter in the trivial example shown in the Apache stdcxx documentation
> but in general the state must be either zeroed out (i.e., to represent
> the initial shift state) or be the result of a prior conversion.
>
> I have corrected the example program to initialize the state variable,
> see: http://svn.apache.org/viewvc?view=rev&revision=533806. I'll fix
> the docs next.
>
>
Yes, the example in the Apache stdcxx documentation works, since it
doesn't try to handle MBCS text in a CJK encoding. If the state is not
zeroed, the code has trouble with MBCS strings: when the first one or
two bytes are greater than 0x80 they are parsed to a wrong result
(although the following bytes might be parsed correctly), and when the
first one or two chars are below 0x80 the call might simply return
with an error.
Thank you very much for correcting the code and the docs. That will
make things much clearer for others and help them avoid the problem I
faced.
> [...]
> > mbstate_t state = {0};
>
> > I did find code that uses memset() to zero the whole object, but I
> > don't think that is the C++ way.
> > How should I make the initialization portable?
>
> Like so:
>
> mbstate_t state = mbstate_t ();
>
I get it, thank you very much.
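
For reference, the value-initialization suggested above can be checked
with std::mbsinit(), which reports whether a state object describes the
initial shift state. This is only a minimal sketch; the helper names
make_initial_state and is_initial are made up for illustration.

```cpp
#include <cwchar>

// Hypothetical helper: returns a state in the initial shift state.
// Value-initialization zeroes every member, which is the portable
// C++ spelling of "zero the whole object".
inline std::mbstate_t make_initial_state() {
    return std::mbstate_t();
}

// Hypothetical helper: true if `st` describes the initial shift state.
inline bool is_initial(const std::mbstate_t& st) {
    return std::mbsinit(&st) != 0;
}
```

A state obtained this way can be passed to codecvt::in(), codecvt::out()
or the mbsrtowcs() family as the starting state of a new conversion.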
>
>
> > 2) I can know the wchar_t* buf length for codecvt.in() by
> > codecvt.length(), but how should I know the char * buffer length for
> > codecvt.out()?
>
> codecvt::length() returns the number of extern_type characters (i.e.,
> narrow chars for codecvt<wchar_t, char>).
>
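
One way to size the buffers without calling length() first is to
allocate for the worst case: for out(), every wide character may need
up to max_length() narrow chars; for in(), every wide character
consumes at least one narrow char, so the input byte count is an upper
bound. The sketch below assumes that approach. The names Codecvt,
narrow and widen are invented for illustration, and the code ignores
the unshift() step that a stateful encoding would need after out().

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

typedef std::codecvt<wchar_t, char, std::mbstate_t> Codecvt;  // hypothetical alias

// wchar_t -> char through the codecvt facet installed in `loc`.
std::string narrow(const std::wstring& in, const std::locale& loc) {
    const Codecvt& cvt = std::use_facet<Codecvt>(loc);
    std::mbstate_t state = std::mbstate_t();
    // Worst case: max_length() bytes per wide character.
    // The +1 only keeps the vector non-empty for empty input.
    std::vector<char> buf(in.size() * cvt.max_length() + 1);
    const wchar_t* from_next = 0;
    char* to_next = 0;
    std::codecvt_base::result r = cvt.out(
        state, in.data(), in.data() + in.size(), from_next,
        &buf[0], &buf[0] + buf.size(), to_next);
    if (r == std::codecvt_base::noconv)   // facet says nothing to convert
        return std::string(in.begin(), in.end());
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("narrow: conversion failed");
    return std::string(&buf[0], to_next);
}

// char -> wchar_t through the codecvt facet installed in `loc`.
std::wstring widen(const std::string& in, const std::locale& loc) {
    const Codecvt& cvt = std::use_facet<Codecvt>(loc);
    std::mbstate_t state = std::mbstate_t();
    // Upper bound: one wide character per input byte.
    std::vector<wchar_t> buf(in.size() + 1);
    const char* from_next = 0;
    wchar_t* to_next = 0;
    std::codecvt_base::result r = cvt.in(
        state, in.data(), in.data() + in.size(), from_next,
        &buf[0], &buf[0] + buf.size(), to_next);
    if (r == std::codecvt_base::noconv)   // facet says nothing to convert
        return std::wstring(in.begin(), in.end());
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("widen: conversion failed");
    return std::wstring(&buf[0], to_next);
}
```

If an exact size is wanted for in(), cvt.length(state, from, from_end,
max) can be called on a copy of the state first; it returns how many
narrow chars would be consumed to produce at most max wide characters.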
I'm a little confused here, even after reading the documentation. Could
you give me a piece of code as an example of how to do the same thing
as the code below:
===================================
string str("\xba\xba\xd6\xd7");
size_t len = mbstowcs(0, str.c_str(), str.length());
wchar_t* wstr = new wchar_t[len+1];
mbstowcs(wstr, str.c_str(), len);
===================================
And the reverse version:
===================================
wstring wstr(L"\xbaba\xd6d7");
size_t len = wcstombs(0, wstr.c_str(), wstr.length());
char* str = new char[len+1];
wcstombs(str, wstr.c_str(), len);
===================================
The point is that I need the length of the output buffer so that I can
allocate it safely. How can I get the buffer length for both
codecvt::in() and codecvt::out()?
BTW, is the code above correct? I mean the second call to wcstombs()
or mbstowcs(), which uses "len" as the length, whereas the first call
uses "wstr.length()" or "str.length()".
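
On the length argument: when the destination is null, the count is
ignored and the return value is the number of output elements needed,
not counting the terminating null. That length query is documented
POSIX and MSVC behavior and is not promised by the C standard for these
two functions; mbsrtowcs() and wcsrtombs() are the standard way to ask.
On the second call the count is the capacity of the output buffer, so
passing len + 1 lets the function write the terminator, while passing
len leaves the buffer unterminated. A minimal sketch under those
assumptions, with the made-up names to_wide and to_narrow:

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Multibyte -> wide in the current C locale (stops at the first '\0').
std::wstring to_wide(const std::string& s) {
    // Pass 1: null destination, so the count argument is ignored.
    std::size_t len = std::mbstowcs(0, s.c_str(), 0);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("to_wide: invalid multibyte sequence");
    std::vector<wchar_t> buf(len + 1);
    // Pass 2: the count is the buffer capacity, terminator included.
    std::mbstowcs(&buf[0], s.c_str(), len + 1);
    return std::wstring(&buf[0], len);
}

// Wide -> multibyte in the current C locale (stops at the first L'\0').
std::string to_narrow(const std::wstring& ws) {
    std::size_t len = std::wcstombs(0, ws.c_str(), 0);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("to_narrow: unconvertible wide character");
    std::vector<char> buf(len + 1);
    std::wcstombs(&buf[0], ws.c_str(), len + 1);
    return std::string(&buf[0], len);
}
```

Both functions depend on the global C locale, so a CJK string such as
"\xba\xba\xd6\xd7" only converts after setlocale() has selected a
matching locale; in the default "C" locale only plain ASCII is
guaranteed to work.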
>
> [...]
> > However, I still cannot handle "UCS-2"/"UTF16" in Linux or
> > "UTF8"/"UTF16" in Windows by std::locale. Do you know how I can do
> > this?
>
> In the Apache C++ Standard Library you can do it using
> a codecvt_byname facet constructed with the name "UTF-8@UCS"
> as an argument, although it's not mentioned on the documentation page:
> http://incubator.apache.org/stdcxx/d...vt-byname.html
> Let me look into adding it.
Thank you, I now know how to handle this in the Apache C++ Standard
Library. I will try that.
Do you know how I can do this with g++'s STL? I mean conversion
between wchar_t*, which contains a UCS-4 string, and char*, which
contains a UCS-2 or UTF-16 string.
The problem came up while I was trying to make a project portable
between Windows and Linux. I am trying to write Unicode strings to a
file.
When I choose UTF-8 for writing, I get two problems:
1) VC80's STL doesn't support a UTF-8 locale (the Win32 API supports
it, but using the Win32 API would make some of the code non-portable).
2) All of the strings are CJK characters, so UTF-8 costs at least 3
bytes per character, which is 50% more storage than the 2 bytes that
UCS-2 needs. I'm sure all the characters are in the BMP of ISO-10646,
so I'd better just use 16 bits per character in the file.
However, if I choose UCS-2LE, which is what VC stores in wchar_t, I
have a problem reading the file on Linux: g++'s STL doesn't appear to
support a UCS-2LE locale, and wchar_t on Linux is UCS-4 rather than
UCS-2, so I cannot read the content directly. (It's the same kind of
story: libiconv supports UCS-2LE, but using libiconv would make that
part of the code non-portable and make my code depend on libiconv.)
So, what should I do in this case?
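
One possible way around the locale dependency is to skip the facets
for this file format and do the byte packing by hand, since UCS-2LE is
just each 16-bit code unit written low byte first. The sketch below
assumes that wchar_t holds Unicode code points on both platforms and
that all text stays inside the BMP, as stated above. The names
to_ucs2le and from_ucs2le are invented for illustration.

```cpp
#include <stdexcept>
#include <string>

// Encode BMP-only wide text as UCS-2LE bytes (low byte first).
std::string to_ucs2le(const std::wstring& ws) {
    std::string out;
    out.reserve(ws.size() * 2);
    for (std::wstring::size_type i = 0; i < ws.size(); ++i) {
        unsigned long c = static_cast<unsigned long>(ws[i]);
        // Reject anything UCS-2 cannot hold: non-BMP values and surrogates.
        if (c > 0xFFFFul || (c >= 0xD800ul && c <= 0xDFFFul))
            throw std::range_error("to_ucs2le: code point outside UCS-2");
        out += static_cast<char>(c & 0xFF);
        out += static_cast<char>((c >> 8) & 0xFF);
    }
    return out;
}

// Decode UCS-2LE bytes back into wide text.
std::wstring from_ucs2le(const std::string& bytes) {
    if (bytes.size() % 2 != 0)
        throw std::range_error("from_ucs2le: odd byte count");
    std::wstring out;
    out.reserve(bytes.size() / 2);
    for (std::string::size_type i = 0; i < bytes.size(); i += 2) {
        unsigned lo = static_cast<unsigned char>(bytes[i]);
        unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        out += static_cast<wchar_t>((hi << 8) | lo);
    }
    return out;
}
```

The resulting bytes would have to be written and read through a stream
opened in binary mode (std::ios::binary), otherwise newline translation
on Windows can corrupt any code unit that contains a 0x0A byte.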