Multibyte string length

 
 
Micah Cowan
 
      10-12-2003
Sheldon Simms writes:

> On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
> name wrote:
>
> > in comp.lang.c i read:
> >
> >>Now if wchar_t is not forced to be able to contain a full character then
> >>again we are stuck at our multibyte (multi-some-unit) character
> >>sequence with all of its inconveniences. This IMHO defeats the whole
> >>purpose of wchar_t.

> >
> > wchar_t is required to have a range that can handle all the code points
> > which can arise from the use of any locale supported by the implementation.
> > c99 takes this further: the implementation can indicate to the programmer
> > if iso-10646 is directly supported (though the encoding is *not* required
> > to be ucs-4)

>
> I guess you're saying the encoding is not required to be ucs-4 because
> the standard doesn't explicitly say so:
>
> 6.10.8.2
> ...
> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
> example, 199712L), intended to indicate that values of type wchar_t
> are the coded representations of the characters defined by ISO/IEC
> 10646, along with all amendments and technical corrigenda as of the
> specified year and month.
>
> But if the encoding is not ucs-4, then what could it possibly be?
> 7.17.2 says
>
> wchar_t which is an integer type whose range of values can represent
> distinct codes for all members of the largest extended character set
> specified among the supported locales;
>
> As I read this, it means that implementations implementing ISO 10646
> must have a wchar_t capable of representing over 1 million distinct
> values. Given this requirement, ucs-4 seems to be the only reasonable
> encoding to use for ISO 10646 wide character strings.


No; the ISO 10646 and Unicode standards are 16-bit
encodings. Some 16-bit codes work together (high/low surrogates)
to produce the effect of a "single" character from two encoded
characters; however, that does not change the fact that the
standards themselves claim to present 16-bit encodings (Actually,
for ISO 10646 I'm making some assumptions, as I've not read it;
only Unicode). Not only this, but while support is in place for
character codes 0x10000 and above, no character codes have
actually been defined for these values, and so UCS-2/UTF-16 can
safely be used to encode "all members of the largest extended
character set".

> Would an implementation that used utf-8 encoding in wide character
> strings composed of 32-bit wchar_t be conforming?


I don't think so, no.

-Micah
 
Sheldon Simms
 
      10-13-2003
On Sun, 12 Oct 2003 13:29:25 -0700, Micah Cowan wrote:

> Sheldon Simms <> writes:
>
>> On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
>> name wrote:
>>
>> > in comp.lang.c i read:
>> >
>> >>Now if wchar_t is not forced to be able to contain a full character then
>> >>again we are stuck at our multibyte (multi-some-unit) character
>> >>sequence with all of its inconveniences. This IMHO defeats the whole
>> >>purpose of wchar_t.
>> >
>> > wchar_t is required to have a range that can handle all the code points
>> > which can arise from the use of any locale supported by the implementation.
>> > c99 takes this further: the implementation can indicate to the programmer
>> > if iso-10646 is directly supported (though the encoding is *not* required
>> > to be ucs-4)

>>
>> I guess you're saying the encoding is not required to be ucs-4 because
>> the standard doesn't explicitly say so:
>>
>> 6.10.8.2
>> ...
>> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
>> example, 199712L), intended to indicate that values of type wchar_t
>> are the coded representations of the characters defined by ISO/IEC
>> 10646, along with all amendments and technical corrigenda as of the
>> specified year and month.
>>
>> But if the encoding is not ucs-4, then what could it possibly be?
>> 7.17.2 says
>>
>> wchar_t which is an integer type whose range of values can represent
>> distinct codes for all members of the largest extended character set
>> specified among the supported locales;
>>
>> As I read this, it means that implementations implementing ISO 10646
>> must have a wchar_t capable of representing over 1 million distinct
>> values. Given this requirement, ucs-4 seems to be the only reasonable
>> encoding to use for ISO 10646 wide character strings.

>
> No; the ISO 10646 and Unicode standards are 16-bit
> encodings.


Unicode 4.0 p.1:
Unicode provides for three encoding forms: a 32-bit form (UTF-32),
a 16-bit form (UTF-16), and an 8-bit form (UTF-8).

> Some 16-bit codes work together (high/low surrogates)
> to produce the effect of a "single" character from two encoded
> characters; however, that does not change the fact that the
> standards themselves claim to present 16-bit encodings.


Unicode 4.0 p.1:
The Unicode Standard specifies a numeric value (code point) and a
name for each of its characters.
...
The Unicode Standard provides 1,114,112 code points,

Unicode 4.0 p.28:
UTF-32 is the simplest Unicode encoding form. Each Unicode code
point is represented directly by a single 32-bit code unit.
Because of this, UTF-32 has a one-to-one relationship between
encoded character and code unit;
...
In the UTF-16 encoding form, ... code points in the supplementary
planes, in the range U+10000..U+10FFFF, are instead represented
as pairs of 16-bit code units.
...
The distinction between characters represented with one versus
two 16-bit code units means that formally UTF-16 is a variable-
width encoding form.
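
To make the surrogate-pair arithmetic in the passage quoted above concrete, here is a minimal C sketch of encoding one code point as UTF-16. The function name utf16_encode is hypothetical (it comes from neither the Unicode Standard nor the C library), and unsigned short is assumed only to be at least 16 bits wide, which the C standard guarantees.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit code units written (1 or 2),
 * or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(unsigned long cp, unsigned short out[2])
{
    assert(out != NULL);

    /* Reject values above U+10FFFF and the surrogate range itself. */
    if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
        return 0;

    if (cp < 0x10000UL) {
        out[0] = (unsigned short)cp;          /* BMP: one code unit */
        return 1;
    }

    cp -= 0x10000UL;                          /* 20 significant bits remain */
    out[0] = (unsigned short)(0xD800UL | (cp >> 10));     /* high surrogate */
    out[1] = (unsigned short)(0xDC00UL | (cp & 0x3FFUL)); /* low surrogate  */
    return 2;
}
```

For U+1D11E (in the Musical Symbols block) this yields the pair 0xD834 0xDD1E, so a single 16-bit unit cannot serve as a distinct code for that character.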

> Not only this, but while support is in place for
> character codes 0x10000 and above, no character codes have
> actually been defined for these values, and so UCS-2/UTF-16 can
> safely be used to encode "all members of the largest extended
> character set".


Unicode 4.0 p.1:
The Unicode Standard, Version 4.0, contains 96,382 characters
from the world's scripts.
...
The unified Han subset contains 70,207 ideographic characters

Examples of characters at code points greater than or equal to
0x10000 are "Musical Symbols", "Mathematical Alphanumeric Symbols",
and "CJK Unified Ideographs Extension B"

http://www.unicode.org/charts/

My conclusion is that 16 bit values can NOT in fact encode "all
members of the largest extended character set", if that character
set is Unicode. This means that 16 bit wchar_t is NOT conforming
on implementations that claim to implement Unicode, and that
the only acceptable encoding for wide character strings in such
an implementation is UCS-4.

-Sheldon

 
Dan Pop
 
      10-13-2003
Sheldon Simms writes:

>On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
>name wrote:
>
>> in comp.lang.c i read:
>>
>>>Now if wchar_t is not forced to be able to contain a full character then
>>>again we are stuck at our multibyte (multi-some-unit) character
>>>sequence with all of its inconveniences. This IMHO defeats the whole
>>>purpose of wchar_t.

>>
>> wchar_t is required to have a range that can handle all the code points
>> which can arise from the use of any locale supported by the implementation.
>> c99 takes this further: the implementation can indicate to the programmer
>> if iso-10646 is directly supported (though the encoding is *not* required
>> to be ucs-4)

>
>I guess you're saying the encoding is not required to be ucs-4 because
>the standard doesn't explicitly say so:
>
> 6.10.8.2
> ...
> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
> example, 199712L), intended to indicate that values of type wchar_t
> are the coded representations of the characters defined by ISO/IEC
> 10646, along with all amendments and technical corrigenda as of the
                                                            ^^^^^^^^^
> specified year and month.
  ^^^^^^^^^^^^^^^^^^^^^^^^
>But if the encoding is not ucs-4, then what could it possibly be?
>7.17.2 says
>
> wchar_t which is an integer type whose range of values can represent
> distinct codes for all members of the largest extended character set
> specified among the supported locales;


Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
as being "the largest extended character set specified among the
supported locales" and, therefore, having wchar_t defined as char?

>As I read this, it means that implementations implementing ISO 10646
>must have a wchar_t capable of representing over 1 million distinct
>values.


It depends on the actual value of the __STDC_ISO_10646__, which could
point to an earlier version of ISO 10646, or not be defined at all,
as in my ASCII example above.

>Given this requirement, ucs-4 seems to be the only reasonable
>encoding to use for ISO 10646 wide character strings.


If the implementation chooses to support a recent enough version of the
ISO 10646. Which the standard allows but doesn't require. The first
incarnation of ISO 10646 only specified 34203 characters, so a 16-bit
wchar_t would be enough for an implementation defining __STDC_ISO_10646__.
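
One way to see what a particular implementation actually promises is to look at both the macro and the range of wchar_t. The following is only a rough sketch, assuming a hosted C99 implementation; the function names iso10646_version and fits_in_wchar are made up for illustration.

```c
#include <assert.h>
#include <wchar.h>   /* WCHAR_MAX */

/* Hypothetical helper: the yyyymmL value of __STDC_ISO_10646__,
 * or 0 if the implementation does not define the macro. */
long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}

/* Hypothetical helper: 1 if the code point cp is within the range of
 * wchar_t on this implementation, 0 otherwise. */
int fits_in_wchar(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);   /* caller passes a Unicode code point */
    return cp <= (unsigned long)WCHAR_MAX;
}
```

A nonzero version together with fits_in_wchar(0x10FFFF) returning 0 would correspond to the case described above: an implementation tracking an ISO 10646 edition whose repertoire still fits in 16 bits.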

>Would an implementation that used utf-8 encoding in wide character
>strings composed of 32-bit wchar_t be conforming?


No way. utf-8 encodings need not fit in a 32-bit wchar_t (they take one
to six octets). They are clearly intended to be used in multibyte
character strings, which are composed of plain char's (e.g. printf's
format string).

Dan
--
Dan Pop
DESY Zeuthen, RZ group
 
Sheldon Simms
 
      10-13-2003
On Mon, 13 Oct 2003 14:18:31 +0000, Dan Pop wrote:

> In <> Sheldon Simms <> writes:
>
>>On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
>>name wrote:
>>
>> wchar_t which is an integer type whose range of values can represent
>> distinct codes for all members of the largest extended character set
>> specified among the supported locales;

>
> Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
> as being "the largest extended character set specified among the
> supported locales" and, therefore, having wchar_t defined as char?


Nothing. However, I was only talking about cases where "the largest
extended character set" is Unicode.

>>As I read this, it means that implementations implementing ISO 10646
>>must have a wchar_t capable of representing over 1 million distinct
>>values.

>
> It depends on the actual value of the __STDC_ISO_10646__, which could
> point to an earlier version of ISO 10646


All right. It might suck to know that your preferred implementation
is not capable of keeping up with ISO 10646 since it's stuck with a
16 bit wchar_t, but I guess that's a problem for the implementors and
users of such an implementation, and off topic here.

>>Given this requirement, ucs-4 seems to be the only reasonable
>>encoding to use for ISO 10646 wide character strings.

>
> If the implementation chooses to support a recent enough version of the
> ISO 10646. Which the standard allows but doesn't require.


That's what I thought.

>>Would an implementation that used utf-8 encoding in wide character
>>strings composed of 32-bit wchar_t be conforming?

>
> No way. utf-8 encodings need not fit in a 32-bit wchar_t (they take one
> to six octets). They are clearly intended to be used in multibyte
> character strings, which are composed of plain char's (e.g. printf's
> format string).


My intention was to express that each of the 32 bit wide characters
contains the value of one octet of the UTF-8 encoding. I didn't
think that would be conforming.

 
 
Dan Pop
 
      10-13-2003
Sheldon Simms writes:

>On Mon, 13 Oct 2003 14:18:31 +0000, Dan Pop wrote:
>
>> In <> Sheldon Simms <> writes:
>>
>>>On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
>>>name wrote:
>>>
>>> wchar_t which is an integer type whose range of values can represent
>>> distinct codes for all members of the largest extended character set
>>> specified among the supported locales;

>>
>> Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
>> as being "the largest extended character set specified among the
>> supported locales" and, therefore, having wchar_t defined as char?

>
>Nothing. However, I was only talking about cases where "the largest
>extended character set" is Unicode.
>
>>>As I read this, it means that implementations implementing ISO 10646
>>>must have a wchar_t capable of representing over 1 million distinct
>>>values.

>>
>> It depends on the actual value of the __STDC_ISO_10646__, which could
>> point to an earlier version of ISO 10646

>
>All right. It might suck to know that your preferred implementation
>is not capable of keeping up with ISO 10646 since it's stuck with a
>16 bit wchar_t, but I guess that's a problem for the implementors and
>users of such an implementation, and off topic here.


Once you're talking about cases where "the largest extended character
set" is Unicode *only*, you're off-topic here, anyway.

However, I can see no reason why a certain implementation would be stuck
with a 16 bit wchar_t, once its intended market is asking for more. For
the time being, however, there is little market pressure for a wider
wchar_t, since the 16-bit codes cover practically all locales of interest.

Widening wchar_t to 32-bit is not a no-cost decision: think about
programs manipulating huge amounts of wchar_t data.

>>>Would an implementation that used utf-8 encoding in wide character
>>>strings composed of 32-bit wchar_t be conforming?

>>
>> No way. utf-8 encodings need not fit in a 32-bit wchar_t (they take one
>> to six octets). They are clearly intended to be used in multibyte
>> character strings, which are composed of plain char's (e.g. printf's
>> format string).

>
>My intention was to express that each of the 32 bit wide characters
>contains the value of one octet of the UTF-8 encoding. I didn't
>think that would be conforming.


Of course it wouldn't: wchar_t objects are supposed to contain character
values, not *encoded* character values. Encoded character values can be
stored in multibyte character strings only.
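
The distinction between an encoded character and a character value can be illustrated by decoding UTF-8 by hand. This is only a sketch: utf8_decode is a hypothetical helper, not a library function, and it assumes the four-octet limit of RFC 3629 rather than the one-to-six-octet form mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with at
 * most n bytes available. Stores the code point in *cp and returns the
 * number of bytes consumed, or 0 if the sequence is malformed, truncated,
 * overlong, or outside the Unicode range. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;
    unsigned long v, min;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; min = 0x80UL; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; min = 0x800UL; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; min = 0x10000UL; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (n < len)
        return 0;                   /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (v < min || v > 0x10FFFFUL || (v >= 0xD800UL && v <= 0xDFFFUL))
        return 0;

    *cp = v;
    return len;
}
```

The octets 0xC3 0xA9 are the encoded form that belongs in a multibyte string; the single value 0xE9 they decode to is what one wchar_t element would hold on an implementation whose wide characters are ISO 10646 code points.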

Dan
--
Dan Pop
DESY Zeuthen, RZ group
 
Micah Cowan
 
      10-13-2003
Sheldon Simms writes:

> On Sun, 12 Oct 2003 13:29:25 -0700, Micah Cowan wrote:
>
> > Sheldon Simms <> writes:
> >
> >> On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
> >> name wrote:
> >>
> >> > in comp.lang.c i read:
> >> >
> >> >>Now if wchar_t is not forced to be able to contain a full character then
> >> >>again we are stuck at our multibyte (multi-some-unit) character
> >> >>sequence with all of its inconveniences. This IMHO defeats the whole
> >> >>purpose of wchar_t.
> >> >
> >> > wchar_t is required to have a range that can handle all the code points
> >> > which can arise from the use of any locale supported by the implementation.
> >> > c99 takes this further: the implementation can indicate to the programmer
> >> > if iso-10646 is directly supported (though the encoding is *not* required
> >> > to be ucs-4)
> >>
> >> I guess you're saying the encoding is not required to be ucs-4 because
> >> the standard doesn't explicitly say so:
> >>
> >> 6.10.8.2
> >> ...
> >> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
> >> example, 199712L), intended to indicate that values of type wchar_t
> >> are the coded representations of the characters defined by ISO/IEC
> >> 10646, along with all amendments and technical corrigenda as of the
> >> specified year and month.
> >>
> >> But if the encoding is not ucs-4, then what could it possibly be?
> >> 7.17.2 says
> >>
> >> wchar_t which is an integer type whose range of values can represent
> >> distinct codes for all members of the largest extended character set
> >> specified among the supported locales;
> >>
> >> As I read this, it means that implementations implementing ISO 10646
> >> must have a wchar_t capable of representing over 1 million distinct
> >> values. Given this requirement, ucs-4 seems to be the only reasonable
> >> encoding to use for ISO 10646 wide character strings.

> >
> > No; the ISO 10646 and Unicode standards are 16-bit
> > encodings.

>
> Unicode 4.0 p.1:
> Unicode provides for three encoding forms: a 32-bit form (UTF-32),
> a 16-bit form (UTF-16), and an 8-bit form (UTF-8).


I didn't mean quite what I wrote: What I meant was "Unicode
character codes have a width of 16 bits". This was true
regardless of the number of encodings available (Unicode 3.0 plus
addenda had UTF-32), yet sect. 2.2 still said "Unicode character
codes have a width of 16 bits". This appears to have been removed
from Unicode 4.0.

> > Some 16-bit codes work together (high/low surrogates)
> > to produce the effect of a "single" character from two encoded
> > characters; however, that does not change the fact that the
> > standards themselves claim to present 16-bit encodings.

>
> Unicode 4.0 p.1:
> The Unicode Standard specifies a numeric value (code point) and a
> name for each of its characters.
> ...
> The Unicode Standard provides 1,114,112 code points,


Hm. The same area in Unicode 3.0 said "Using a 16-bit encoding
means that code values are available for more than 65,000
characters." They clearly supported more than that; sloppy
wording on their part.

> Unicode 4.0 p.28:
> UTF-32 is the simplest Unicode encoding form. Each Unicode code
> point is represented directly by a single 32-bit code unit.
> Because of this, UTF-32 has a one-to-one relationship between
> encoded character and code unit;
> ...
> In the UTF-16 encoding form, ... code points in the supplementary
> planes, in the range U+10000..U+10FFFF, are instead represented
> as pairs of 16-bit code units.
> ...
> The distinction between characters represented with one versus
> two 16-bit code units means that formally UTF-16 is a variable-
> width encoding form.


Okay. Here's the chief difference then. In Unicode 3.0, UTF-16
was formally considered the one-to-one representation (which was
kind of sticky when you deal with surrogates; having to pretend
that they're really two separate characters...).

> My conclusion is that 16 bit values can NOT in fact encode "all
> members of the largest extended character set", if that character
> set is Unicode. This means that 16 bit wchar_t is NOT conforming
> on implementations that claim to implement Unicode, and that
> the only acceptable encoding for wide character strings in such
> an implementation is UCS-4.


Alright, then: but it *is* conforming provided that they claim to
conform to a Unicode standard preceding 4.0 whose entire
character set could be represented in 16 bits.

I hadn't gotten around to reading 4.0 yet; I'm pleased to see
that they've eschewed all the "pay no attention to the man behind
the curtain; Unicode *is* a 16-bit character set..." wording that
seemed to be present in 3.0. Perhaps they had already remedied some of
this in their addenda: I didn't read many of those except some of
the new character codespaces.

-Micah
 
 
Sheldon Simms
 
      10-13-2003
On Mon, 13 Oct 2003 18:25:04 +0000, Dan Pop wrote:

> In <> Sheldon Simms <> writes:
>
>>On Mon, 13 Oct 2003 14:18:31 +0000, Dan Pop wrote:
>>
>>> In <> Sheldon Simms <> writes:
>>>
>>>>Would an implementation that used utf-8 encoding in wide character
>>>>strings composed of 32-bit wchar_t be conforming?
>>>
>>> No way. utf-8 encodings need not fit in a 32-bit wchar_t (they take one
>>> to six octets). They are clearly intended to be used in multibyte
>>> character strings, which are composed of plain char's (e.g. printf's
>>> format string).

>>
>>My intention was to express that each of the 32 bit wide characters
>>contains the value of one octet of the UTF-8 encoding. I didn't
>>think that would be conforming.

>
> Of course it wouldn't: wchar_t objects are supposed to contain character
> values, not *encoded* character values. Encoded character values can be
> stored in multibyte character strings only.


This gets back to the problem the original poster had. He seemed to
be confronted with an implementation that used 16 bit wchar_t and
encoded wide character strings (including characters outside of
Unicode's Basic Multilingual Plane) in UTF-16, a variable length
encoding.

I expressed the view that such an implementation would be non-conforming.
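
For illustration, here is roughly what code has to do when 16-bit units hold UTF-16: the number of array elements and the number of characters diverge as soon as a surrogate pair appears. utf16_code_points is a hypothetical name, and counting an unpaired surrogate as one character is an arbitrary choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the code points in n UTF-16 code units.
 * A well-formed surrogate pair counts once; an unpaired surrogate is
 * counted as one character (an arbitrary choice for this sketch). */
size_t utf16_code_points(const unsigned short *s, size_t n)
{
    size_t i = 0, count = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i += 2;   /* high + low surrogate: one supplementary character */
        else
            i += 1;   /* BMP character, or an unpaired surrogate */
        count++;
    }
    return count;
}
```

With the four units 0x0041 0xD834 0xDD1E 0x0042 the function reports three characters, so indexing the array no longer gives the n-th character directly.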

 
 
Dingo
 
      10-14-2003
(Dan Pop) wrote in message news:<bmec7n$jsh$>...
> In <> Sheldon Simms <> writes:
>
> >On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
> >name wrote:
> >
> >> in comp.lang.c i read:
> >>
> >>>Now if wchar_t is not forced to be able to contain a full character then
> >>>again we are stuck at our multibyte (multi-some-unit) character
> >>>sequence with all of its inconveniences. This IMHO defeats the whole
> >>>purpose of wchar_t.
> >>
> >> wchar_t is required to have a range that can handle all the code points
> >> which can arise from the use of any locale supported by the implementation.
> >> c99 takes this further: the implementation can indicate to the programmer
> >> if iso-10646 is directly supported (though the encoding is *not* required
> >> to be ucs-4)

> >
> >I guess you're saying the encoding is not required to be ucs-4 because
> >the standard doesn't explicitly say so:
> >
> > 6.10.8.2
> > ...
> > __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
> > example, 199712L), intended to indicate that values of type wchar_t
> > are the coded representations of the characters defined by ISO/IEC
> > 10646, along with all amendments and technical corrigenda as of the
>                                                             ^^^^^^^^^
> > specified year and month.
>   ^^^^^^^^^^^^^^^^^^^^^^^^
> >But if the encoding is not ucs-4, then what could it possibly be?
> >7.17.2 says
> >
> > wchar_t which is an integer type whose range of values can represent
> > distinct codes for all members of the largest extended character set
> > specified among the supported locales;

>
> Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
> as being "the largest extended character set specified among the
> supported locales" and, therefore, having wchar_t defined as char?
>
> >As I read this, it means that implementations implementing ISO 10646
> >must have a wchar_t capable of representing over 1 million distinct
> >values.

>
> It depends on the actual value of the __STDC_ISO_10646__, which could
> point to an earlier version of ISO 10646, or not be defined at all,
> as in my ASCII example above.


The way I read it, __STDC_ISO_10646__ doesn't indicate the Unicode
version that defines the extended character set. It just states
the version where wchar_t encodings may be found.

A seven-bit ASCII implementation with wchar_t defined as char could
define the most recent value for __STDC_ISO_10646__ and be conforming.
ASCII encodings map directly to the most recent version of ISO 10646.
And a char is wide enough to hold "the largest extended character set
among the supported locales."
 
 
Dan Pop
 
      10-14-2003
Dingo writes:

> (Dan Pop) wrote in message news:<bmec7n$jsh$>...
>> In <> Sheldon Simms <> writes:
>>
>> >On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
>> >name wrote:
>> >
>> >> in comp.lang.c i read:
>> >>
>> >>>Now if wchar_t is not forced to be able to contain a full character then
>> >>>again we are stuck at our multibyte (multi-some-unit) character
>> >>>sequence with all of its inconveniences. This IMHO defeats the whole
>> >>>purpose of wchar_t.
>> >>
>> >> wchar_t is required to have a range that can handle all the code points
>> >> which can arise from the use of any locale supported by the implementation.
>> >> c99 takes this further: the implementation can indicate to the programmer
>> >> if iso-10646 is directly supported (though the encoding is *not* required
>> >> to be ucs-4)
>> >
>> >I guess you're saying the encoding is not required to be ucs-4 because
>> >the standard doesn't explicitly say so:
>> >
>> > 6.10.8.2
>> > ...
>> > __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
>> > example, 199712L), intended to indicate that values of type wchar_t
>> > are the coded representations of the characters defined by ISO/IEC
>> > 10646, along with all amendments and technical corrigenda as of the
>>                                                             ^^^^^^^^^
>> > specified year and month.
>>   ^^^^^^^^^^^^^^^^^^^^^^^^
>> >But if the encoding is not ucs-4, then what could it possibly be?
>> >7.17.2 says
>> >
>> > wchar_t which is an integer type whose range of values can represent
>> > distinct codes for all members of the largest extended character set
>> > specified among the supported locales;

>>
>> Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
>> as being "the largest extended character set specified among the
>> supported locales" and, therefore, having wchar_t defined as char?
>>
>> >As I read this, it means that implementations implementing ISO 10646
>> >must have a wchar_t capable of representing over 1 million distinct
>> >values.

>>
>> It depends on the actual value of the __STDC_ISO_10646__, which could
>> point to an earlier version of ISO 10646, or not be defined at all,
>> as in my ASCII example above.

>
>The way I read it, __STDC_ISO_10646__ doesn't indicate the Unicode
>version that defines the extended character set. It just states
>the version where wchar_t encodings may be found.
>
>A seven-bit ASCII implementation with wchar_t defined as char could
>define the most recent value for __STDC_ISO_10646__ and be conforming.
>ASCII encodings map directly to the most recent version of ISO 10646.
>And a char is wide enough to hold "the largest extended character set
>among the supported locales."


As I read it, it is the whole ISO/IEC 10646 specification that must be
supported by wchar_t, once this macro is defined. The words "along
with all amendments and technical corrigenda as of the specified year
and month" clearly suggest this interpretation to me. Of course, only
comp.std.c can say which interpretation is the intended one.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
 
Dan Pop
 
      10-14-2003
Sheldon Simms writes:

>This gets back to the problem the original poster had. He seemed to
>be confronted with an implementation that used 16 bit wchar_t and
>encoded wide character strings (including characters outside of
>Unicode's Basic Multilingual Plane) in UTF-16, a variable length
>encoding.


Couldn't find anything suggesting this in OP's post:

From: "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org>
Organization: Customers chello Poland
Date: Thu, 09 Oct 2003 12:54:00 GMT
Subject: Multibyte string length

Hello
I've browsed the FAQ but apparently it lacks any questions concerning wide
character strings. I'd like to calculate the length of a multibyte string
without converting the whole string.

Zygmunt

PS: The whole multibyte string vs wide character string concept is broken
IMHO since it allows wchar_t not to be large enough to contain a full
character (rendering both types virtually the same). What's the point of
standardizing wide characters if the standard makes portable usage of such
mechanism a programming hell? Feel free to disagree.

PS2: On my implementation wchar_t is 'big enough' so I might overcome the
problem in some other way but I'd like to see some fully portable approach.

He seemed to be worried about wchar_t not being wide enough for its
intended purpose, but the C standard makes it quite clear that this cannot
be the case, by definition, for the simple reason that it is the
implementor who decides what the extended character set actually is.
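
For the question in the post quoted above (the length of a multibyte string without converting the whole string), one portable approach is to step through the string with mbrlen from <wchar.h>. A minimal sketch, assuming the string is valid in the current LC_CTYPE locale; mb_char_count is a hypothetical name:

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in the multibyte string s
 * under the current LC_CTYPE locale, without building a wide-character
 * copy. Returns (size_t)-1 on an invalid or incomplete sequence. */
size_t mb_char_count(const char *s)
{
    mbstate_t state;
    size_t count = 0, left, len;

    assert(s != NULL);
    memset(&state, 0, sizeof state);   /* initial conversion state */
    left = strlen(s);

    while (left > 0) {
        len = mbrlen(s, left, &state); /* bytes in the next character */
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;         /* invalid or truncated character */
        if (len == 0)
            break;                     /* null character; kept defensively */
        s += len;
        left -= len;
        count++;
    }
    return count;
}
```

In the default "C" locale this returns the same value as strlen for plain ASCII text; after a successful setlocale(LC_CTYPE, "") in a locale with a multibyte encoding it counts characters rather than bytes. Calling mbsrtowcs with a null destination pointer should give the same count in one call.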

Dan
--
Dan Pop
DESY Zeuthen, RZ group