Kaz Kylheku wrote:
> On 2011-11-21, Lauri Alanko wrote:
> >     for (i = (es->want == 0) ? 1 : 0; i < MIN(want, n); i++) {
> >         if (*s == '\0') {
> >             errno = EILSEQ;
> >             return ((size_t)-1);
> >         }
> >         wc = (wc << 8) | (unsigned char)*s++;
> >     }
>
> So it's obvious here that a wchar_t does not have an encoding. Some other
> encoding is being decoded, and that becomes the value of wchar_t.
That is a very strange way of putting it. Certainly wchar_t has _an_
encoding, that is, a mapping between abstract characters and integer
values. (In Unicode terminology, it's a "coded character set".)
The euc.c module is a bit of a complex example, since it is
parameterized (as there are many variants of EUC):
http://www.gsp.com/cgi-bin/man.cgi?section=5&topic=euc
Even the man page explicitly says that the encoding of wchar_t is
dependent on the precise definition of the locale. For instance, the
character for love (U+611B), which is encoded in EUC-JP as "\xb0\xa6",
is represented by the wchar_t value 0xb0a6 in an EUC-JP locale.
> > That is, in the EUC locale, the wchar_t value of a character consists
> > of just the bits of the variable-width encoding of that character in
> > EUC. From quick perusing of the source, other variable-width encodings
> > seem to work the same way, except for utf8.c, which decodes the code
> > point and stores that in wchar_t.
>
> But is that wrong?
No _single_ encoding is wrong; the problem is that these different
locales have different encodings for wchar_t. In the utf-8 locale, the
character for love is represented by the wchar_t value 0x611b. So now
if I want my library to input and output wchar_t values, _I need to
know which locale they were produced on_ in order to know how to
interpret them.
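
To make the difference concrete, here is a minimal sketch of the two
decoding strategies, written as standalone functions rather than the
actual libc code. pack_bytes mirrors the quoted euc.c loop, and
decode_utf8_3byte is a simplified decoder that handles three-byte UTF-8
sequences only. Both names are made up for illustration, and the
results are returned as unsigned long so that nothing depends on the
width of wchar_t.

```c
#include <stddef.h>

/* Mirrors the quoted euc.c loop: shift the raw bytes of the multibyte
 * sequence into one integer, without decoding them. */
unsigned long pack_bytes(const char *s, size_t n)
{
    unsigned long wc = 0;
    size_t i;

    for (i = 0; i < n; i++)
        wc = (wc << 8) | (unsigned char)s[i];
    return wc;
}

/* Simplified decoder for a three-byte UTF-8 sequence only; it returns
 * the code point, or 0 if the bytes do not have the expected form. */
unsigned long decode_utf8_3byte(const char *s)
{
    const unsigned char *u = (const unsigned char *)s;

    if ((u[0] & 0xF0) != 0xE0 || (u[1] & 0xC0) != 0x80 ||
        (u[2] & 0xC0) != 0x80)
        return 0;
    return ((unsigned long)(u[0] & 0x0F) << 12)
         | ((unsigned long)(u[1] & 0x3F) << 6)
         | (unsigned long)(u[2] & 0x3F);
}
```

With these, pack_bytes("\xb0\xa6", 2) gives 0xb0a6 and
decode_utf8_3byte("\xe6\x84\x9b") gives 0x611b: the same character,
two different integers.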
> This code is inside the mbrtowc function. Of course mbrtowc is
> locale-dependent, by design. It converts multibyte strings to wchar_t, and it
> has to do so according to an encoding! This function is locale-dependent,
> not the wchar_t type.
The standard library functions, and wide string literals, are what
imbue wchar_t values with an intended interpretation as characters.
Without the intended interpretation, wchar_t would just be a plain
integer type that wouldn't fulfill any function that other integer
types wouldn't already.
> Definitely, it's a good idea to do your own encoding and decoding, for
> portability, at least in some kinds of programs.
I'm not concerned with external encodings (other than UTF-8, which is
used by a certain file format I process). I can let the user of my
library worry about those. I'm concerned with the API, and the choice
of representation for strings. It's not only a question of choosing a
type; there must also be an interpretation for values of that type.
And for wchar_t, it seems, the interpretation can be quite volatile.
> If you don't want to do localization using the C library, just don't
> call setlocale, and do all your own converting from external formats.
I'm writing a _library_. As I explained earlier, a library cannot
control, or constrain, the current locale. Perhaps someone would like
to plug the library into a legacy application that needs to be run
in a certain locale. As a library writer, it's my job to make sure
that this is possible without pain.
> You can still use wchar_t. Just don't use wide streams, don't use
> mbstowcs, etc.
I indeed do not need to use those, but the user of the library
presumably might. Now suppose someone calls a function in my library,
and I wish to return the character for love as a wchar_t. Now how can
I know which wchar_t value I should return?
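
The only portable hint I know of is the __STDC_ISO_10646__ macro from
C99: when an implementation defines it, wchar_t values are ISO 10646
code points regardless of locale. A minimal sketch of how a library
might use it follows. The function name love_as_wchar is hypothetical,
and returning 0 to mean "cannot tell" is just a convention chosen for
this example.

```c
#include <wchar.h>

/* Returns the wchar_t for U+611B if the implementation promises that
 * wchar_t holds ISO 10646 code points, and 0 (meaning "cannot tell")
 * otherwise. */
wchar_t love_as_wchar(void)
{
#ifdef __STDC_ISO_10646__
    return (wchar_t)0x611B;
#else
    return 0;
#endif
}
```

On a platform that does not define the macro, the function has no
answer, which is exactly the problem.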
> I've handled the internationalization of the program by restricting
> all I/O to utf-8 and using wchar_t to store characters. On Cygwin and Win32,
> text is restricted to U+0000 through U+FFFF. Users who find that
> lacking can use a better OS. Problem solved.
It's curious that you find this particular limitation of Windows to be
significant. It's a nuisance, sure, but I don't see why it would be so
important to have a single wchar_t value represent a whole code point.
The only important operations on individual wchar_t's are those in
<wctype.h>, but if you need to classify code points at all, you are soon
likely to need more detailed access to Unicode character properties
that goes beyond what <wctype.h> provides.
And if you need to split a piece of text into discrete units, I don't
see why code points, especially of unnormalized or NFC-normalized
text, would be any more important units than, say, grapheme clusters.
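
For reference, with a 16-bit wchar_t a code point above U+FFFF is
stored as two UTF-16 code units (a surrogate pair). A minimal sketch
of that split, using uint16_t instead of wchar_t so that it behaves
the same on any platform; the function name to_utf16 is made up for
this example.

```c
#include <stdint.h>

/* Encodes one code point as UTF-16. Returns the number of 16-bit units
 * written to out (1 or 2), or 0 if cp is a surrogate or out of range. */
int to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

U+611B fits in one unit, while U+1F600 becomes the pair 0xD83D 0xDE00.
Code that already has to cope with combining sequences is not much
worse off for also having to cope with such pairs.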
> > Frankly, I cannot understand how platforms like these could support
> > C1X where wide string literals (whose encoding has to be decided at
> > compile time before any locale is selected) can contain unicode
> > escapes.
>
> Simply by treating all conversions to wchar_t as targeting a common
> representation (Unicode).
You mean, rewriting all those locale modules so that wchar_t always
has a consistent value (the unicode code point) for a given character,
regardless of the way it is encoded in the current module?
That's effectively what I was saying: those platforms, as they
currently stand, cannot have locale-independent unicode literals, so
they have to be modified.
But actually, I'm not quite sure if C1X really requires unicode
literals to be locale-independent. The text on character constants,
string literals and universal character names is really confusing, and
there's talk about "an implementation-dependent current locale", so it
might be that even C1X allows the meaning of wide string literals to
vary between locales. It'd be a shame if this is true.
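
The situation I have in mind can be sketched as follows. The literal
gets its value at compile time, while the value it is compared against
would typically come from mbrtowc() at run time under whatever locale
the application has selected. The name matches_literal is hypothetical.

```c
#include <wchar.h>

/* The value of this literal is fixed at compile time, before any call
 * to setlocale(). */
static const wchar_t love_literal[] = L"\u611b";

/* Returns 1 if wc, e.g. a value produced by mbrtowc() at run time,
 * equals the compile-time value of the literal, and 0 otherwise. */
int matches_literal(wchar_t wc)
{
    return wc == love_literal[0];
}
```

If the compiler stores the code point 0x611b in the literal, which is
my assumption here, then on a platform where the EUC-JP locale yields
0xb0a6 for the same character, the comparison fails for text read in
that locale.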
Lauri