Converting between Unicode and default locale

 
 
Keith MacDonald
 
      09-26-2003
Hello,

Is there a portable (at least for VC.Net and g++) method to convert text
between wchar_t and char, using the standard library? I may have missed something
obvious, but the section on codecvt, in Josuttis' "The Standard C++
Library", did not help, and I'm still awaiting delivery of Langer's
"Standard C++ IOStreams and Locales".

Thanks,
Keith MacDonald
[snip, before replying directly]


 
 
 
 
 
Mike Wahler
 
      09-26-2003

"Keith MacDonald" <(E-Mail Removed)> wrote in message
news:bl274i$9h8$1$(E-Mail Removed)...
> Hello,
>
> Is there a portable (at least for VC.Net and g++) method to convert text
> between
> wchar_t and char, using the standard library? I may have missed something
> obvious, but the section on codecvt, in Josuttis' "The Standard C++
> Library", did not help, and I'm still awaiting delivery of Langer's
> "Standard C++ IOStreams and Locales".


I read in my copy of L&K that there is no built-in support
for wide character streams. Type 'wchar_t' is only used
to implement multibyte stream i/o.

Also note that depending upon your platform's byte size,
not all Unicode values will necessarily fit into type
'char'.

-Mike


 
 
 
 
 
Ron Natalie
 
      09-26-2003

"Mike Wahler" <(E-Mail Removed)> wrote:

> I read in my copy of L&K that there is no built-in support
> for wide character streams. Type 'wchar_t' is only used
> to implement multibyte stream i/o.


Multibyte is using more than one char to encode a character.
wchar_t is fixed size wide characters. But I knew what you
meant.

Yes, it's a major defect in the internationalization support.
I have lobbied in comp.std.C++ to fix this (adding wchar_t
interfaces to the few places that are sorely lacking it
like the filenames in fstreams, etc...). Unfortunately,
I get a lot of bitching and moaning from the rest of the
standards community who haven't seriously dealt with
some of the more problematic character encodings such as Japanese.


 
 
Aaron Isotton
 
      09-26-2003
On Fri, 26 Sep 2003 21:21:38 +0100, Keith MacDonald wrote:

> Hello,
>
> Is there a portable (at least for VC.Net and g++) method to convert text
> between
> wchar_t and char, using the standard library? I may have missed something
> obvious, but the section on codecvt, in Josuttis' "The Standard C++
> Library", did not help, and I'm still awaiting delivery of Langer's
> "Standard C++ IOStreams and Locales".


Try mbstowcs/wcstombs.
--
Aaron Isotton
http://www.isotton.com/
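
For reference, a minimal sketch of what the mbstowcs/wcstombs route might look like. The helper names to_wide and to_narrow are made up for illustration, and the narrow encoding is assumed to be whatever the current C locale says it is, so a program would normally call std::setlocale(LC_ALL, "") from <clocale> first. Both functions stop at the first embedded '\0'.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: multibyte (current C locale encoding) -> wide string.
std::wstring to_wide(const std::string& s) {
    // One wchar_t per input byte is always enough, plus room for the terminator.
    std::vector<wchar_t> buf(s.size() + 1);
    std::size_t n = std::mbstowcs(&buf[0], s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for this locale");
    assert(n <= s.size());
    return std::wstring(&buf[0], n);
}

// Hypothetical helper: wide string -> multibyte (current C locale encoding).
std::string to_narrow(const std::wstring& ws) {
    // MB_CUR_MAX is the length of the longest multibyte character in this locale.
    std::vector<char> buf(ws.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(&buf[0], ws.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("wide character not representable in this locale");
    assert(n < buf.size());
    return std::string(&buf[0], n);
}
```

In the default "C" locale only plain ASCII is guaranteed to round-trip; anything else depends on the locale the program has selected.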

 
 
Gianni Mariani
 
      09-26-2003
Ron Natalie wrote:
> "Mike Wahler" <(E-Mail Removed)> wrote:
>
>> I read in my copy of L&K that there is no built-in support
>> for wide character streams. Type 'wchar_t' is only used
>> to implement multibyte stream i/o.
>
> Multibyte is using more than one char to encode a character.
> wchar_t is fixed size wide characters. But I knew what you
> meant.
>
> Yes, it's a major defect in the internationalization support.
> I have lobbied in comp.std.C++ to fix this (adding wchar_t
> interfaces to the few places that are sorely lacking it
> like the filenames in fstreams, etc...). Unfortunately,
> I get a lot of bitching and moaning from rest of the
> standard community who haven't seriously dealt with
> some of the more problematic character encodings such as Japanese.


Except that some vendors use utf-16 and some use ucs-4 as their wchar_t
type. UTF-16 usually breaks a whole bunch of assumptions on what a
wchar_t type is supposed to be.

On platforms that use utf-16, the complexity of processing ucs-4 or
utf-16 characters is equivalent so it makes sense to only support utf-8.

If you know your code is ONLY dealing with utf-8 characters, you can
make processing utf-8 characters very efficient by inlining some of the
code that deals with utf-8.
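
To make the UTF-16 point concrete, here is a small sketch, with to_utf16 being a made-up name and unsigned short assumed to be 16 bits wide. Any code point above U+FFFF needs two 16-bit units (a surrogate pair), so on a platform with a 16-bit wchar_t one wchar_t is not always one character.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (not a surrogate, not above U+10FFFF).
std::vector<unsigned short> to_utf16(unsigned long cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::vector<unsigned short> units;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit.
        units.push_back(static_cast<unsigned short>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        cp -= 0x10000;
        units.push_back(static_cast<unsigned short>(0xD800 + (cp >> 10)));
        units.push_back(static_cast<unsigned short>(0xDC00 + (cp & 0x3FF)));
    }
    return units;
}
```

For example, U+0041 comes out as one unit, while U+1F600 comes out as the pair 0xD83D 0xDE00.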


 
 
Mike Wahler
 
      09-26-2003
"Ron Natalie" <(E-Mail Removed)> wrote:
>
> "Mike Wahler" <(E-Mail Removed)> wrote:
>
> > I read in my copy of L&K that there is no built-in support
> > for wide character streams. Type 'wchar_t' is only used
> > to implement multibyte stream i/o.
>
> Multibyte is using more than one char to encode a character.


Right.

> wchar_t is fixed size wide characters.


Right.

> But I knew what you
> meant.


I meant what I said. (Actually I suppose L&K meant it,
I'm only repeating it).

What they were explaining is that of course a multibyte
file's contents cannot be stored with type 'char' objects
without losing information, so the multibyte characters
are converted (via a facet) to/from a wide character
encoding internally to the stream. The transport
layer actually accesses the file in 'char'-size
objects.

Ref: Langer & Kreft 2.3, p 113

If you feel I'm misunderstanding, please do clarify.


> Yes, it's a major defect in the internationalization support.


Yes, I agree. Didn't folks work hard to create a
standard character set which could accommodate virtually
all written languages?

> I have lobbied in comp.std.C++ to fix this (adding wchar_t
> interfaces to the few places that are sorely lacking it
> like the filenames in fstreams, etc...). Unfortunately,
> I get a lot of bitching and moaning from rest of the
> standard community who haven't seriously dealt with
> some of the more problematic character encodings such as Japanese.


I haven't had to deal with international issues yet, but I
know that it's only a matter of time, and I'd sure like
some Unicode support so I can practice ahead of time.

Any time I spend more than a few minutes with my nose
inside the L&K book, I come away with my head swimming.

-Mike


 
 
Ron Natalie
 
      09-26-2003

"Aaron Isotton" <(E-Mail Removed)> wrote:

> > Is there a portable (at least for VC.Net and g++) method to convert text
> > between
> > wchar_t and char, using the standard library? I may have missed something
> > obvious, but the section on codecvt, in Josuttis' "The Standard C++
> > Library", did not help, and I'm still awaiting delivery of Langer's
> > "Standard C++ IOStreams and Locales".

>
> Try mbstowcs/wcstombs.
> --

Unfortunately that is not adequate for the Windows environment.
In actuality, it is impossible to properly use Unicode filenames with
the standard C++ library on Windows.

I have not been able to make any inroads with the standardization people
about doing something about this.


 
 
Ron Natalie
 
      09-26-2003

"Gianni Mariani" <(E-Mail Removed)> wrote in message news:bl29u5$(E-Mail Removed)...

>
> Except that some vendors use utf-16 and some use ucs-4 as their wchar_t
> type. UTF-16 usually breaks a whole bunch of assumptions on what a
> wchar_t type is supposed to be.


Immaterial to the problem. The standard library is broken even if your
wchar_t is 32 bits.

> On platforms that use utf-16, the complexity of processing ucs-4 or
> utf-16 characters is equivalent so it makes sense to only support utf-8.


I do not agree. And Windows doesn't provide an implicit char to wchar_t
translation in the system interfaces (UTF-8 or otherwise). It's immaterial
to the fact that wchar_t might become a multi-wide-byte encoding. The
standard library does not provide the hooks necessary to fully support
wchar_t such as you might have.

> If you know your code is ONLY dealing with utf-8 characters, you can
> make processing utf-8 characters very efficient by inlining some of the
> code that deals with utf-8.


The WIN32 interfaces do not support utf-8. You have to feed them the
16 bit values if you want to use other than the base codetable. We've
had to write our own bloody fstreams that do a UTF-8 to wchar_t
conversion (essentially reimplementing fstream to work properly)
but that ought not to be necessary. It's a defect in the language.


 
 
Ron Natalie
 
      09-26-2003

"Mike Wahler" <(E-Mail Removed)> wrote:

> What they were explaining is that of course a multibyte
> file's contents cannot be stored with type 'char' objects
> without losing information, so the multibyte characters
> are converted (via a facet) to/from a wide character
> encoding internally to the stream. The transport
> layer actually accesses the file in 'char'-size
> objects.


I'm not understanding what you are saying. There's no reason
why a multibyte (in char) encoding of a wchar_t loses any information.
UTF-8 will encode a 32-bit Unicode value in some number between 1 and
6 chars.
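
A rough sketch of why nothing is lost: the value is spread over a lead byte and some continuation bytes, and can be reassembled exactly. The names encode_utf8 and decode_utf8 are invented for this example. It only handles code points up to U+10FFFF, which need at most 4 bytes; the longer 5- and 6-byte forms of the original UTF-8 design are left out, and the decoder assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one code point (up to U+10FFFF) as UTF-8.
std::string encode_utf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogates are not encodable");
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::invalid_argument("code point out of range");
    }
    return out;
}

// Hypothetical helper: decode the first code point of a well-formed
// UTF-8 string (no validation of continuation bytes is attempted).
unsigned long decode_utf8(const std::string& s) {
    assert(!s.empty());
    const unsigned char lead = static_cast<unsigned char>(s[0]);
    const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    if (s.size() < len)
        throw std::invalid_argument("truncated sequence");
    // Strip the length marker bits from the lead byte.
    unsigned long cp = (len == 1) ? lead : (lead & (0x3F >> (len - 1)));
    for (std::size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    return cp;
}
```

Decoding the output of encode_utf8 gives back the original code point, whether it took 1, 2, 3 or 4 chars.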


> Ref: Langer & Kreft 2.3, p 113


I don't have the book.

Don't even get me started that the "basic character type" and
the "smallest addressable unit of storage" really should be
distinct types and not overloaded on char. This is the
price we pay for working in an American-centric industry
I guess.


 
 
Mike Wahler
 
      09-26-2003
"Ron Natalie" <(E-Mail Removed)> wrote:
>
> "Mike Wahler" <(E-Mail Removed)> wrote:
>
> > What they were explaining is that of course a multibyte
> > file's contents cannot be stored with type 'char' objects
> > without losing information, so the multibyte characters
> > are converted (via a facet) to/from a wide character
> > encoding internally to the stream. The transport
> > layer actually accesses the file in 'char'-size
> > objects.

>
> I'm not understanding what you are saying.


I'm not sure I'm conveying the info correctly.
I've included a quote from L&K below.

> There's no reason
> why a multibyte (in char) encoding of a wchar_t loses any information.
> UTF-8 will encode 32 bit UNICODE in some number between 1 and
> 6 char's.
>
>
> > Ref: Langer & Kreft 2.3, p 113

>
> I don't have the book.


Angelika Langer & Klaus Kreft,
"Standard C++ IOStreams and Locales,"
Chapter 2, "The Architecture of IOStreams"
Section 2.3, "Character Types and Character Traits",
page 113:

<quote>

MULTIBYTE FILES

CHARACTER TYPE. Multibyte files contain characters in a
multibyte encoding. Different from one-byte or wide-character
encodings, multibyte characters do not have the same size.
A single multibyte character can have a length of 1, 2, 3, or
more bytes. Obviously, none of the built-in character types,
char or wchar_t, is large enough to hold any character of a
given multibyte encoding. For this reason, multibyte characters
contained in a multibyte file are chopped into units of one
byte each. The wide-character file stream extracts data from
the multibyte file byte by byte, interprets the byte sequence,
finds out which and how many bytes form a multibyte character,
identifies the character, and translates it to a wide-character <<===
encoding.

Due to the decomposition of the multibytes into one-byte
units, the type of characters exchanged between the transport
layer and a multibyte file is char.

CHARACTER ENCODING. The encoding of characters exchanged
between the transport layer and a multibyte file can be any
multibyte encoding. It depends wholly on the content of the
multibyte file. As wide-character file streams internally
represent characters as units of type wchar_t encoded in the
programming environment's wide-character encoding, a code
conversion is always necessary. The code conversion is
performed by the stream buffer's code conversion facet. There
is no default conversion defined. It all depends on the code
conversion facet contained in the stream buffer's locale object,
which initially is the current global locale.

In sum, the external character representation of wide-
character file streams is that of the units transferred to and
from a multibyte file. Its character type is char, and the
encoding depends on the stream's code conversion facet.
</quote>


The above implies to me that in order to access a multibyte
file, one needs to use a basic(i/o)stream<wchar_t>. Am I
missing something or assuming too much?

> Don't even get me started that the "basic character type" and
> the "smallest addressable unit of storage"


I don't think that's part of this issue. They describe
abstract 'character types', about which a stream obtains
pertinent information via 'character traits' types.

>really should be
> distinct types and not overloaded on char.


I don't know what you mean here. I don't see L&K
mention either "basic character type" or "smallest
addressable unit of storage," or "overloading on char."
They talk about how iostreams is templatized on a
'character type', which can be either of the built-in
types char or wchar_t, or some other invented character
type which meets the requirements imposed by iostreams
(defines EOF value, etc).

> This is the
> price we pay for working in an American-centric industry
> I guess.


What about this do you feel is "American-centric"?

Thanks for your input.

-Mike


 