On Sun, 12 Oct 2003 13:29:25 -0700, Micah Cowan wrote:
> Sheldon Simms writes:
>
>> On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
>> name wrote:
>>
>> > in comp.lang.c i read:
>> >
>> >>Now if wchar_t is not forced to be able to contain a full character then
>> >>again we are stuck at our multibyte (multi-some-unit) character
>> >>sequence with all of its inconveniences. This IMHO defeats the whole
>> >>purpose of wchar_t.
>> >
>> > wchar_t is required to have a range that can handle all the code points
>> > which can arise from the use of any locale supported by the implementation.
>> > c99 takes this further: the implementation can indicate to the programmer
>> > if iso-10646 is directly supported (though the encoding is *not* required
>> > to be ucs-4)
>>
>> I guess you're saying the encoding is not required to be ucs-4 because
>> the standard doesn't explicitly say so:
>>
>> 6.10.8.2
>> ...
>> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
>> example, 199712L), intended to indicate that values of type wchar_t
>> are the coded representations of the characters defined by ISO/IEC
>> 10646, along with all amendments and technical corrigenda as of the
>> specified year and month.
>>
>> But if the encoding is not ucs-4, then what could it possibly be?
>> 7.17.2 says
>>
>> wchar_t which is an integer type whose range of values can represent
>> distinct codes for all members of the largest extended character set
>> specified among the supported locales;
>>
>> As I read this, it means that implementations implementing ISO 10646
>> must have a wchar_t capable of representing over 1 million distinct
>> values. Given this requirement, ucs-4 seems to be the only reasonable
>> encoding to use for ISO 10646 wide character strings.
>
> No; the ISO 10646 and Unicode standards are 16-bit
> encodings.
Unicode 4.0 p.1:
Unicode provides for three encoding forms: a 32-bit form (UTF-32),
a 16-bit form (UTF-16), and an 8-bit form (UTF-8).
> Some 16-bit codes work together (high/low surrogates)
> to produce the effect of a "single" character from two encoded
> characters; however, that does not change the fact that the
> standards themselves claim to present 16-bit encodings.
Unicode 4.0 p.1:
The Unicode Standard specifies a numeric value (code point) and a
name for each of its characters.
...
The Unicode Standard provides 1,114,112 code points,
Unicode 4.0 p.28:
UTF-32 is the simplest Unicode encoding form. Each Unicode code
point is represented directly by a single 32-bit code unit.
Because of this, UTF-32 has a one-to-one relationship between
encoded character and code unit;
...
In the UTF-16 encoding form, ... code points in the supplementary
planes, in the range U+10000..U+10FFFF, are instead represented
as pairs of 16-bit code units.
...
The distinction between characters represented with one versus
two 16-bit code units means that formally UTF-16 is a variable-
width encoding form.
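
To make the surrogate mechanism concrete, here is a minimal sketch of
how one code point maps to UTF-16 code units. The function name
utf16_encode and its interface are my own for illustration, not taken
from the Unicode Standard or from any library; only the arithmetic
follows the standard's definition of UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): encode one Unicode code
 * point as UTF-16. Writes 1 or 2 code units to out[] and returns how
 * many were written, or 0 if cp is not encodable (a surrogate code
 * point, or a value above 0x10FFFF). */
size_t utf16_encode(uint_least32_t cp, uint_least16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        /* Basic Multilingual Plane: one 16-bit code unit. */
        out[0] = (uint_least16_t)cp;
        return 1;
    }

    /* Supplementary planes: subtract 0x10000, leaving 20 bits, and
     * split them into a high and a low surrogate of 10 bits each. */
    cp -= 0x10000;
    out[0] = (uint_least16_t)(0xD800 + (cp >> 10));
    out[1] = (uint_least16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

For a code point in the "Musical Symbols" block such as U+1D11E, this
yields two code units, so a single 16-bit value cannot hold that
character by itself.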
> Not only this, but while support is in place for
> character codes 0x10000 and above, no character codes have
> actually been defined for these values, and so UCS-2/UTF-16 can
> safely be used to encode "all members of the largest extended
> character set".
Unicode 4.0 p.1:
The Unicode Standard, Version 4.0, contains 96,382 characters
from the world's scripts.
...
The unified Han subset contains 70,207 ideographic characters.

Examples of blocks whose characters have code points greater than or
equal to 0x10000 are "Musical Symbols", "Mathematical Alphanumeric
Symbols", and "CJK Unified Ideographs Extension B":
http://www.unicode.org/charts/
My conclusion is that 16-bit values can NOT in fact encode "all
members of the largest extended character set", if that character
set is Unicode. This means that a 16-bit wchar_t is NOT conforming
on implementations that claim to implement Unicode, and that
the only acceptable encoding for wide character strings in such
an implementation is UCS-4.
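
For anyone who wants to see what their own implementation claims, here
is a rough sketch of a check. The helper names (wide_enough,
claims_iso_10646 and so on) are made up for this post; only
__STDC_ISO_10646__ and WCHAR_MAX come from C99. It tests my reading of
the standard, nothing more.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>   /* WCHAR_MAX */

/* Highest code point provided by Unicode / ISO 10646. */
#define MAX_CODE_POINT 0x10FFFFUL

/* Hypothetical helper: 1 if a wchar_t whose largest value is wmax can
 * hold every code point as a single value, 0 otherwise. */
int wide_enough(unsigned long wmax)
{
    return wmax >= MAX_CODE_POINT;
}

/* 1 if the implementation defines __STDC_ISO_10646__, 0 otherwise. */
int claims_iso_10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* 1 unless the implementation claims ISO 10646 wchar_t values while
 * wchar_t is too narrow to hold all code points. */
int wchar_claim_is_consistent(void)
{
    return !claims_iso_10646() || wide_enough((unsigned long)WCHAR_MAX);
}

/* Aborts (unless NDEBUG is defined) if the claim is inconsistent. */
void require_consistent_wchar(void)
{
    assert(wchar_claim_is_consistent());
}
```

An implementation with a 16-bit wchar_t passes this check only if it
does not define __STDC_ISO_10646__.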
-Sheldon