Lauri Alanko 11-21-2011 02:14 PM

wchar_t is useless
 
I have recently written a number of posts regarding C's wide character
support. It now turns out that my investigation has been in vain:
wchar_t is useless in portable C programming, although I'm not quite
sure whether the standard or implementations are to blame for this. Most
likely both: the standard has sanctioned the implementations'
deficiencies.

I'm working on a library that deals with multilingual strings. The
library only does computation, and doesn't have need for very fancy I/O,
so I'm trying to avoid any unnecessary platform dependencies and make
the library as portable as possible.

One question I'm facing is what kind of representation to use for the
multilingual strings in the public API of the library. Internally, the
library reads some binary data containing UTF-8 strings, so the obvious
answer would be for the public library functions to accept and return
strings in a standard unicode format, either UTF-8 or UTF-32.

But this is not very C-ish. Since C has standard ways to represent
multilingual strings, it's more convenient for the API to use those
standard ways rather than introducing yet another string representation
type. I thought.

So I considered the options. Multibyte strings are not a viable choice,
since their encoding is locale-dependent. If the library communicated
via multibyte strings, then the locale would have to be set to something
that made it possible to represent all the strings that the library had
to deal with.

But a library cannot make requirements on the global locale: libraries
should be components that can be plugged together, and if they begin to
make any requirements on the locale, then they cannot be used together
if the requirements conflict.

I cannot understand why C still only has a global locale. C++ came up
with first-class locales ages ago, and surely nowadays everyone should
know that anything global wreaks havoc on interoperability and
re-entrancy.

So I looked at wchar_t. If __STDC_ISO_10646__ is defined and wchar_t
represents a unicode code point, this would be just perfect. But that's
not the case on all platforms. But that's okay, I thought, as long as I
can (with some platform-dependent magic) convert between unicode code
points and wchar_t.

On Windows, it turns out, wchar_t represents a UTF-16 code unit, so a
code point can require two wchar_t's. That's ugly (and makes <wctype.h>
useless), but not very crucial for my purposes. The important thing is
that sequences of code points can still be encoded to and from wide
_strings_. I could have lived with this.
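
(To make the ugliness concrete: here is a minimal sketch, mine and not
from any platform header, of how one code point maps to one or two
16-bit units under UTF-16. I use uint16_t so as not to assume anything
about wchar_t itself.)

#include <stddef.h>
#include <stdint.h>

/* Encode code point cp as UTF-16 into buf (room for 2 units).
   Returns the number of units written, or 0 if cp is not a valid
   Unicode scalar value. */
size_t to_utf16(uint32_t cp, uint16_t buf[2])
{
    if (cp < 0xD800 || (cp >= 0xE000 && cp <= 0xFFFF)) {
        buf[0] = (uint16_t)cp;                       /* one unit (BMP) */
        return 1;
    }
    if (cp >= 0x10000 && cp <= 0x10FFFF) {
        cp -= 0x10000;                               /* 20 bits left */
        buf[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        buf[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    return 0;                            /* lone surrogate or out of range */
}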

But then I found out about the killer: on FreeBSD (and Solaris?) the
encoding used by wchar_t is locale-dependent! That is, a single wchar_t
can represent any code point supported by the current locale, but the
same wchar_t value may be used to represent different code points in
different locales. So adopting wchar_t as the representation type would
again make the capabilities of the library dependent on the current
locale, which might be constrained by other parts of the application.
(Also, the locale-dependent wchar_t encodings are quite undocumented, so
the required platform-dependent magic would be magic indeed.)

To recap: C's multibyte strings are in a locale-dependent, possibly
variable-width encoding. On Windows, the wchar_t string encoding is
variable-width, on FreeBSD and Solaris it is locale-dependent. So for
portable C code, wchar_t doesn't provide any advantages over multibyte
strings.

So screw it all, I'll just use UTF-32 like I should have from the
beginning.
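
For concreteness, this is the sort of decoding step I mean (a
simplified sketch of my own; a real implementation would also reject
overlong forms and surrogate values):

#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *out.
   Returns the number of bytes consumed, or 0 on error. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t len;
    uint32_t cp;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *out = s[0];                                 /* ASCII */
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; cp = s[0] & 0x07;
    } else {
        return 0;                                    /* bad lead byte */
    }
    if (n < len)
        return 0;                                    /* truncated */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                                /* bad continuation */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *out = cp;
    return len;
}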


Lauri

Kaz Kylheku 11-21-2011 04:17 PM

Re: wchar_t is useless
 
On 2011-11-21, Lauri Alanko <la@iki.fi> wrote:
> I have recently written a number of posts regarding C's wide character
> support. It now turns out that my investigation has been in vain:
> wchar_t is useless in portable C programming, although I'm not quite


That is false; it is useful.

> So I looked at wchar_t. If __STDC_ISO_10646__ is defined and wchar_t
> represents a unicode code point, this would be just perfect. But that's
> not the case on all platforms. But that's okay, I thought, as long as I
> can (with some platform-dependent magic) convert between unicode code
> points and wchar_t.


wchar_t is an integral type which represents an integral value. It does not
represent a code point any more than "char" represents an ASCII value.

> On Windows, it turns out, wchar_t represents a UTF-16 code unit, so a
> code point can require two wchar_t's. That's ugly (and makes <wctype.h>


This is a limitation of Windows. The Windows API uses 16-bit wide characters,
so you can't get away from this no matter what language you write in on
Windows.

Redmond has decided that characters outside of the Unicode BMP (Basic
Multilingual Plane) are unimportant for its user base. So, if your program has
customers who are Windows users, you can safely assume that they have already
swallowed this pill.

You get a lot of internationalization mileage out of the BMP. Actually all the
mileage. Above U+FFFF there is only academic crap. Anyone who cares about those
characters is likely also going to be some kind of "freetard" who won't pay a
dime for software.

> But then I found out about the killer: on FreeBSD (and Solaris?) the
> encoding used by wchar_t is locale-dependent!


I would expect this "locale dependent" to mean that if, say, a Japanese user is
working with Shift-JIS files, then he or she can set that up in the locale such
that when these files are processed by your program, the characters being read
and written map to sane values of wchar_t (where sane == based on Unicode!).

wchar_t does not have an encoding; it's an integral type. The encoding
of wchar_t is binary enumeration: 000...0101 encodes 5, etc.

Do you have some quotes from FreeBSD or Solaris documentation on this matter
that are giving you concern? Post them.

> So screw it all, I'll just use UTF-32 like I should have from the
> beginning.


But that just means you have to write your own library instead of just using
C95 functions like wcsspn, wcscpy, etc. What if you want to do printf-like
formatting to a wide string? Can't use swprintf.

Here is a better idea: just use wchar_t, forget about U+1XXXX on Windows
because Microsoft has decided that one for your users already, and if
locale-dependent streams give you an allergic reaction, handle your own
decoding/encoding for doing I/O.
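
Doing your own encoding at the I/O boundary is not much code. A rough
sketch (assuming your wchar_t values are Unicode scalar values, which
holds where __STDC_ISO_10646__ is defined; error handling on the
individual fputc calls is omitted):

#include <stdio.h>

/* Write character code c to f as UTF-8; returns bytes written, 0 on error. */
static int put_utf8(unsigned long c, FILE *f)
{
    if (c < 0x80) {
        return fputc((int)c, f) == EOF ? 0 : 1;
    } else if (c < 0x800) {
        fputc((int)(0xC0 | (c >> 6)), f);
        fputc((int)(0x80 | (c & 0x3F)), f);
        return 2;
    } else if (c < 0x10000) {
        if (c >= 0xD800 && c < 0xE000)
            return 0;                      /* surrogate, not a scalar */
        fputc((int)(0xE0 | (c >> 12)), f);
        fputc((int)(0x80 | ((c >> 6) & 0x3F)), f);
        fputc((int)(0x80 | (c & 0x3F)), f);
        return 3;
    } else if (c < 0x110000) {
        fputc((int)(0xF0 | (c >> 18)), f);
        fputc((int)(0x80 | ((c >> 12) & 0x3F)), f);
        fputc((int)(0x80 | ((c >> 6) & 0x3F)), f);
        fputc((int)(0x80 | (c & 0x3F)), f);
        return 4;
    }
    return 0;                              /* out of Unicode range */
}

Call it as put_utf8((unsigned long) wc, stdout) for each wchar_t.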

James Kuyper 11-21-2011 05:27 PM

Re: wchar_t is useless
 
On 11/21/2011 11:17 AM, Kaz Kylheku wrote:
....
> You get a lot of internationalization mileage out of the BMP. Actually all the
> mileage. Above U+FFFF there is only academic crap.


Academics have a need for software support too. One of my old friends
has a husband who works mainly with dead languages; when I met him in
1990 he could read and write 14 of them; he's probably added more since
then. I suspect he would find software that supported Plane 1 useful.

The sources I checked didn't give any indication of how adequate the BMP
characters are for representing Chinese text. If the Unified Han
Ideographs in Plane 2 are in fact needed for some purpose, there's a
very large number of Chinese who would need them. That's hardly an
academic issue.

> ... Anyone who cares about those
> characters is likely also going to be some kind of "freetard" who won't pay a
> dime for software.


As a professional software developer myself, I fully agree with the idea
of paying people for their work. However, why should anyone buy software
when they have free software available that is of acceptable quality,
containing all the features they require?

The only reason I can make money developing software is that there's no
one willing to give away software that does what mine does, and that's
just the way it should be.

Ben Pfaff 11-21-2011 06:42 PM

Re: wchar_t is useless
 
Lauri Alanko <la@iki.fi> writes:

> To recap: C's multibyte strings are in a locale-dependent, possibly
> variable-width encoding. On Windows, the wchar_t string encoding is
> variable-width, on FreeBSD and Solaris it is locale-dependent. So for
> portable C code, wchar_t doesn't provide any advantages over multibyte
> strings.


I agree with you.

The libunistring manual has a section that says pretty much what
you did in your message, by the way:
http://www.gnu.org/software/libunist...har_005ft-mess
--
char a[]="\n .CJacehknorstu";int putchar(int);int main(void){unsigned long b[]
={0x67dffdff,0x9aa9aa6a,0xa77ffda9,0x7da6aa6a,0xa67f6aaa,0xaa9aa9f6,0x11f6},*p
=b,i=24;for(;p+=!*p;*p/=4)switch(0[p]&3)case 0:{return 0;for(p--;i--;i--)case+
2:{i++;if(i)break;else default:continue;if(0)case 1:putchar(a[i&15]);break;}}}

Jack McCue 11-21-2011 07:18 PM

Re: wchar_t is useless
 
Ben Pfaff <blp@cs.stanford.edu> wrote:
> Lauri Alanko <la@iki.fi> writes:
>
>> To recap: C's multibyte strings are in a locale-dependent, possibly

<snip>
>
> I agree with you.

ditto
>
> The libunistring manual has a section that says pretty much what
> you did in your message, by the way:
> http://www.gnu.org/software/libunist...har_005ft-mess


Thanks for the URL. I struggled with wchar_t
on AIX for a bit, then ended up writing a set of
small functions; my needs were simple at the time.
At least I know why I had a hard time; I thought I was
missing something. Maybe I still am :)

Regards,
Jack


Kaz Kylheku 11-21-2011 09:02 PM

Re: wchar_t is useless
 
On 2011-11-21, Ben Pfaff <blp@cs.stanford.edu> wrote:
> Lauri Alanko <la@iki.fi> writes:
>
>> To recap: C's multibyte strings are in a locale-dependent, possibly
>> variable-width encoding. On Windows, the wchar_t string encoding is
>> variable-width, on FreeBSD and Solaris it is locale-dependent. So for
>> portable C code, wchar_t doesn't provide any advantages over multibyte
>> strings.

>
> I agree with you.
>
> The libunistring manual has a section that says pretty much what
> you did in your message, by the way:
> http://www.gnu.org/software/libunist...har_005ft-mess


It probably pretty much says the same thing, because quite likely that text is
the source for Lauri's opinion, or both have some other common source. For
instance, look at this:

``On Solaris and FreeBSD, the wchar_t encoding is locale dependent and undocumented.''

Eerie similarity!

I don't agree with this libunistring manual. The wchar_t type is useful and
just fine.

They are right about the limitation of Windows, but nobody ever went wrong in
accepting the limitations of Microsoft Windows in order to write software for
users of Windows who have also accepted those limitations.

If you want to do processing with rare languages on Windows, install a virtual
machine running GNU/Linux and you have a 32-bit wchar_t. GNU/Linux is more
likely than Redmond to have fonts to display your rare languages, too.

Clearly the libunistring authors don't understand what Solaris and FreeBSD
mean by "encoding" (and they do not care whether they are right or wrong
because, after all, they have a library which will fix the FreeBSD or Solaris
problem, regardless of whether it is real or imagined). Hey, a user who
needlessly uses your library is still a user!

And undocumented, by the way? Uh, use the source, Luke?

Oh, and the Single UNIX Specification, Issue 6, says this about wchar_t:

wchar_t

Integer type whose range of values can represent distinct wide-character
codes for all members of the largest character set specified among the
locales supported by the compilation environment: the null character has
the code value 0 and each member of the portable character set has a code
value equal to its value when used as the lone character in an integer
character constant.

I very much doubt that FreeBSD and Solaris go against the grain on this one
in any way.

Lauri Alanko 11-21-2011 10:07 PM

Re: wchar_t is useless
 
In article <20111121125000.483@kylheku.com>,
Kaz Kylheku <kaz@kylheku.com> wrote:
> On 2011-11-21, Ben Pfaff <blp@cs.stanford.edu> wrote:
> http://www.gnu.org/software/libunist...har_005ft-mess
>
> It probably pretty much says the same thing, because quite likely that text is
> the source for Lauri's opinion, or both have some other common source.


That is indeed where I learned about the locale-dependency of wchar_t.
I found it hard to believe myself, so I checked.

http://svnweb.freebsd.org/base/head/...19&view=markup

Here we have the following:

199         for (i = (es->want == 0) ? 1 : 0; i < MIN(want, n); i++) {
200                 if (*s == '\0') {
201                         errno = EILSEQ;
202                         return ((size_t)-1);
203                 }
204                 wc = (wc << 8) | (unsigned char)*s++;
205         }

That is, in the EUC locale, the wchar_t value of a character consists
of just the bits of the variable-width encoding of that character in
EUC. From quick perusing of the source, other variable-width encodings
seem to work the same way, except for utf8.c, which decodes the code
point and stores that in wchar_t.

As for Solaris, I tried it out:


$ uname -a
SunOS kruuna.helsinki.fi 5.10 Generic_127111-05 sun4u sparc
$ cat wc.c
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(int argc, char* argv[]) {
    setlocale(LC_CTYPE, argv[1]);
    wchar_t wc = fgetwc(stdin);
    printf("%08lx\n", (unsigned long) wc);
    return 0;
}
$ echo -e '\xa4' | ./wc fi_FI.ISO8859-1 # U+00A4 CURRENCY SIGN
30000024
$ echo -e '\xa4' | ./wc fi_FI.ISO8859-15 # U+20AC EURO SIGN
30000024
$ echo -e '\xa4' | iconv -f iso-8859-1 -t utf-8 | ./wc fi_FI.UTF-8
000000a4
$ echo -e '\xa4' | iconv -f iso-8859-15 -t utf-8 | ./wc fi_FI.UTF-8
000020ac


Frankly, I cannot understand how platforms like these could support
C1X where wide string literals (whose encoding has to be decided at
compile time before any locale is selected) can contain unicode
escapes.


Lauri

Kaz Kylheku 11-21-2011 11:54 PM

Re: wchar_t is useless
 
On 2011-11-21, Lauri Alanko <la@iki.fi> wrote:
> In article <20111121125000.483@kylheku.com>,
> Kaz Kylheku <kaz@kylheku.com> wrote:
>> On 2011-11-21, Ben Pfaff <blp@cs.stanford.edu> wrote:
>> http://www.gnu.org/software/libunist...har_005ft-mess
>>
>> It probably pretty much says the same thing, because quite likely that text is
>> the source for Lauri's opinion, or both have some other common source.

>
> That is indeed where I learned about the locale-dependency of wchar_t.
> I found it hard to believe myself, so I checked.
>
> http://svnweb.freebsd.org/base/head/...19&view=markup
>
> Here we have the following:
>
> 199         for (i = (es->want == 0) ? 1 : 0; i < MIN(want, n); i++) {
> 200                 if (*s == '\0') {
> 201                         errno = EILSEQ;
> 202                         return ((size_t)-1);
> 203                 }
> 204                 wc = (wc << 8) | (unsigned char)*s++;
> 205         }


So it's obvious here that a wchar_t does not have an encoding. Some other
encoding is being decoded, and that becomes the value of wchar_t.

> That is, in the EUC locale, the wchar_t value of a character consists
> of just the bits of the variable-width encoding of that character in
> EUC. From quick perusing of the source, other variable-width encodings
> seem to work the same way, except for utf8.c, which decodes the code
> point and stores that in wchar_t.


But is that wrong?

Decoding the utf8 code point is certainly right.

Based on anything you know about EUC (I know nothing), is EUC being handled
properly above? (Furthermore, do you care about the EUC encoding?)

This code is inside the mbrtowc function. Of course mbrtowc is
locale-dependent, by design. It converts multibyte strings to wchar_t, and it
has to do so according to an encoding! This function is locale-dependent,
not the wchar_t type. (And you don't have to use this function.)

Definitely, it's a good idea to do your own encoding and decoding, for
portability, at least in some kinds of programs.

The ISO C standard gives us this:

"At program startup, the equivalent of:

setlocale(LC_ALL, "C");

is executed."

So in C, you are automatically in the safe "C" locale, which specifies the
"minimal environment for C translation". You're insulated from the effect of
the native environment locale until you explicitly call setlocale(LC_ALL, "").

If you don't want to do localization using the C library, just don't
call setlocale, and do all your own converting from external formats. You can
still use wchar_t. Just don't use wide streams, don't use mbstowcs, etc.

By the way, feel free to take any code (BSD licensed) from here:

http://www.kylheku.com/cgit/txr/tree/

I've handled the internationalization of the program by restricting
all I/O to UTF-8 and using wchar_t to store characters. On Cygwin and Win32,
text is restricted to U+0000 through U+FFFF. Users who find that
lacking can use a better OS. Problem solved.

> Frankly, I cannot understand how platforms like these could support
> C1X where wide string literals (whose encoding has to be decided at
> compile time before any locale is selected) can contain unicode
> escapes.


Simply by treating all conversions to wchar_t as targeting a common
representation (Unicode).

So for instance, suppose you have the character 野 in a literal, perhaps as a
UTF-8 multibyte character, or a \u sequence in ASCII. This maps to a wchar_t
which has the Unicode value.

The user is in a Shift-JIS locale, and inputs a string which contains
野 in Shift-JIS encoding. You convert that to wchar_t using the correct
locale, and blam: same character code as what came from your string
literal.

Lauri Alanko 11-22-2011 01:15 AM

Re: wchar_t is useless
 
In article <20111121150711.264@kylheku.com>,
Kaz Kylheku <kaz@kylheku.com> wrote:
> On 2011-11-21, Lauri Alanko <la@iki.fi> wrote:
> > 199         for (i = (es->want == 0) ? 1 : 0; i < MIN(want, n); i++) {
> > 200                 if (*s == '\0') {
> > 201                         errno = EILSEQ;
> > 202                         return ((size_t)-1);
> > 203                 }
> > 204                 wc = (wc << 8) | (unsigned char)*s++;
> > 205         }

>
> So it's obvious here that a wchar_t does not have an encoding. Some other
> encoding is being decoded, and that becomes the value of wchar_t.


That is a very strange way of putting it. Certainly wchar_t has _an_
encoding, that is, a mapping between abstract characters and integer
values. (In Unicode terminology, it's a "coded character set".)

The euc.c module is a bit of a complex example, since it is
parameterized (as there are many variants of EUC):

http://www.gsp.com/cgi-bin/man.cgi?section=5&topic=euc

Even the man page explicitly says that the encoding of wchar_t is
dependent on the precise definition of the locale. For instance, the
character for love (U+611B), which is encoded in EUC-JP as "\xb0\xa6",
is represented by the wchar_t value 0xb0a6.
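
If anyone wants to check this for themselves, here is a test in the
same spirit as my Solaris one (the locale name is whatever your system
calls EUC-JP, e.g. ja_JP.eucJP; on an __STDC_ISO_10646__ platform both
numbers should come out as 611b, while on FreeBSD I would expect the
decoded one to be b0a6):

#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>

int main(int argc, char* argv[]) {
    wchar_t lit = L'\u611b';               /* fixed at translation time */
    mbstate_t st;
    wchar_t wc;

    setlocale(LC_CTYPE, argv[1]);
    memset(&st, 0, sizeof st);
    if (mbrtowc(&wc, "\xb0\xa6", 2, &st) != 2) {
        fprintf(stderr, "mbrtowc failed\n");
        return 1;
    }
    printf("literal %08lx, decoded %08lx\n",
           (unsigned long) lit, (unsigned long) wc);
    return 0;
}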

> > That is, in the EUC locale, the wchar_t value of a character consists
> > of just the bits of the variable-width encoding of that character in
> > EUC. From quick perusing of the source, other variable-width encodings
> > seem to work the same way, except for utf8.c, which decodes the code
> > point and stores that in wchar_t.

>
> But is that wrong?


No _single_ encoding is wrong; the problem is that these different
locales have different encodings for wchar_t. In the utf-8 locale, the
character for love is represented by the wchar_t value 0x611b. So now,
if I want my library to input and output wchar_t values, _I need to
know which locale they were produced in_ in order to know how to
interpret them.

> This code is inside the mbrtowc function. Of course mbrtowc is
> locale-dependent, by design. It converts multibyte strings to wchar_t, and it
> has to do so according to an encoding! This function is locale-dependent,
> not the wchar_t type.


The standard library functions, and wide string literals, are what
imbue wchar_t values with an intended interpretation as characters.
Without the intended interpretation, wchar_t would just be a plain
integer type that wouldn't fulfill any function that other integer
types don't already.

> Definitely, it's a good idea to do your own encoding and decoding, for
> portability, at least in some kinds of programs.


I'm not concerned with external encodings (other than UTF-8, which is
used by a certain file format I process). I can let the user of my
library worry about those. I'm concerned with the API, and the choice
of representation for strings. It's not only a question of choosing a
type; there must also be an interpretation for values of that type.
And for wchar_t, it seems, the interpretation can be quite volatile.

> If you don't want to do localization using the C library, just don't
> call setlocale, and do all your own converting from external formats.


I'm writing a _library_. As I explained earlier, a library cannot
control, or constrain, the current locale. Perhaps someone would like
to plug the library into a legacy application that needs to be run
in a certain locale. As a library writer, it's my job to make sure
that this is possible without pain.

> You can still use wchar_t. Just don't use wide streams, don't use
> mbstowcs, etc.


I indeed do not need to use those, but the user of the library
presumably might. Now suppose someone calls a function in my library,
and I wish to return the character for love as a wchar_t. How can
I know which wchar_t value I should return?

> I've handled the internationalization of the program by restricting
> all I/O to UTF-8 and using wchar_t to store characters. On Cygwin and Win32,
> text is restricted to U+0000 through U+FFFF. Users who find that
> lacking can use a better OS. Problem solved.


It's curious that you find this particular limitation of Windows to be
significant. It's a nuisance, sure, but I don't see why it would be so
important to have a single wchar_t value represent a whole code point.
The only important operations on individual wchar_t's are those in
<wctype.h>, but if you need to classify code points at all, you are soon
likely to need more detailed access to Unicode character properties
that goes beyond what <wctype.h> provides.

And if you need to split a piece of text into discrete units, I don't
see why code points, especially of unnormalized or NFC-normalized
text, would be any more important units than, say, grapheme clusters.

> > Frankly, I cannot understand how platforms like these could support
> > C1X where wide string literals (whose encoding has to be decided at
> > compile time before any locale is selected) can contain unicode
> > escapes.

>
> Simply by treating all conversions to wchar_t as targeting a common
> representation (Unicode).


You mean, rewriting all those locale modules so that wchar_t always
has a consistent value (the unicode code point) for a given character,
regardless of the way it is encoded in the current module?

That's effectively what I was saying: those platforms, as they
currently stand, cannot have locale-independent unicode literals, so
they have to be modified.

But actually, I'm not quite sure if C1X really requires unicode
literals to be locale-independent. The text on character constants,
string literals and universal character names is really confusing, and
there's talk about "an implementation-dependent current locale", so it
might be that even C1X allows the meaning of wide string literals to
vary between locales. It'd be a shame if this is true.


Lauri

Kaz Kylheku 11-22-2011 01:57 AM

Re: wchar_t is useless
 
On 2011-11-22, Lauri Alanko <la@iki.fi> wrote:
> For instance, the
> character for love (U+611B), which is encoded in EUC-JP as "\xb0\xa6",
> is represented by the wchar_t value 0xb0a6.


Ah, if that's the case, that is pretty broken. \xb0\xa6 should decode into the
value 0x611B. They ducked out of doing it right, didn't they? Perhaps this
preserves the behavior of legacy programs which expect the mapping
to work that way.

>> Simply by treating all conversions to wchar_t as targeting a common
>> representation (Unicode).

>
> You mean, rewriting all those locale modules so that wchar_t always
> has a consistent value (the unicode code point) for a given character,


Or not using/supporting the locales that don't produce Unicode code points.
You can treat those as weird legacy cruft, like EBCDIC. Find out what works,
and document that as being supported. "This is a Unicode program, whose
embedded strings are in Unicode, and which requires a Unicode-compatible
locale."

Either way, you don't have to throw out wchar_t. It is handy because it's
supported in the form of string literals, and there are some useful functions
like wcsspn, wcschr, etc.
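
None of that needs a locale. A trivial sketch:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    const wchar_t *s = L"hello, world";
    const wchar_t *comma = wcschr(s, L',');  /* locate the comma */
    size_t run = wcsspn(s, L"ehlo");         /* leading run from a set */

    printf("length %zu, comma at %td, span %zu\n",
           wcslen(s), comma - s, run);
    return 0;
}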

I think you have to regard these two problems as being completely separate:

- writing software that is multilingual.
- targeting two or more incompatible ways of being multilingual,
  simultaneously in the same program (incompatible meaning that the
  internal representation for characters follows a different map).

I think you're taking too much into your scope: you want to solve both
problems, and so then when you look at this FreeBSD mess, it looks
intractable.

Solve the first problem, and forget the second.

The only people needing to solve the second problem are those who
are saddled with legacy support requirements, like having to continue
being able to read data from 20-year-old versions of the software.

