Velocity Reviews

Zygmunt Krynicki 10-09-2003 12:54 PM

Multibyte string length
 
Hello
I've browsed the FAQ, but apparently it lacks any questions concerning wide
character strings. I'd like to calculate the length of a multibyte string
without converting the whole string.

Zygmunt

PS: The whole multibyte string vs wide character string concept is broken
IMHO since it allows wchar_t not to be large enough to contain a full
character (rendering both types virtually the same). What's the point of
standardizing wide characters if the standard makes portable usage of such
mechanism a programming hell? Feel free to disagree.

PS2: On my implementation wchar_t is 'big enough' so I might overcome the
problem in some other way but I'd like to see some fully portable approach.

Dan Pop 10-09-2003 03:08 PM

Re: Multibyte string length
 
In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:

>I've browsed the FAQ but apparently it lacks any questions concerning wide
>character strings. I'd like to calculate the length of a multibyte string
>without converting the whole string.


Use the mblen function from the standard C library in a loop, until it
returns 0 (or -1, which signals an invalid multibyte sequence). The number
of mblen calls returning a positive value is the number of multibyte
characters in that string.
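
A minimal sketch of such a loop, assuming the string is NUL-terminated and encoded in the current LC_CTYPE locale. The name count_mb_chars is made up for illustration, and returning (size_t)-1 on an invalid sequence is a choice of this sketch, not something the standard prescribes:

```c
#include <assert.h>
#include <stddef.h>   /* size_t */
#include <stdlib.h>   /* mblen, MB_CUR_MAX */

/* Count the multibyte characters in the NUL-terminated string s,
   interpreted in the current LC_CTYPE locale.
   Returns (size_t)-1 if mblen reports an invalid sequence. */
size_t count_mb_chars(const char *s)
{
    size_t count = 0;
    int n;

    assert(s != NULL);

    (void)mblen(NULL, 0);   /* reset mblen's internal shift state */

    /* mblen returns the byte length of the next character,
       0 at the terminating null, or -1 on an invalid sequence. */
    while ((n = mblen(s, MB_CUR_MAX)) > 0) {
        s += n;
        count++;
    }
    return n == 0 ? count : (size_t)-1;
}
```

If the string is in the user's encoding, call setlocale(LC_CTYPE, "") beforehand; in the default "C" locale plain ASCII text is simply one byte per character.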

>PS: The whole multibyte string vs wide character string concept is broken
>IMHO since it allows wchar_t not to be large enough to contain a full
>character (rendering both types virtually the same). What's the point of
>standardizing wide characters if the standard makes portable usage of such
>mechanism a programming hell? Feel free to disagree.


The bit you're missing is that the standard doesn't impose one character
set or another for wide characters. If the implementor decides to use
ASCII as the character set for wide characters, wchar_t need not be any
wider than char. But wchar_t is supposed to be wide enough for the
character set chosen by the implementor for wide characters.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan.Pop@ifh.de

Sheldon Simms 10-09-2003 04:24 PM

Re: Multibyte string length
 
On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:

> In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
>
>>PS: The whole multibyte string vs wide character string concept is broken
>>IMHO since it allows wchar_t not to be large enough to contain a full
>>character (rendering both types virtually the same). What's the point of
>>standardizing wide characters if the standard makes portable usage of such
>>mechanism a programming hell? Feel free to disagree.

>
> The bit you're missing is that the standard doesn't impose one character
> set or another for wide characters. If the implementor decides to use
> ASCII as the character set for wide characters, wchar_t need not be any
> wider than char. But wchar_t is supposed to be wide enough for the
> character set chosen by the implementor for wide characters.


I don't think he's missing that at all. He's simply pointing out that
the standard makes it pretty much impossible to use wide characters
portably (unless you only use wide characters with values between 0
and 127, of course).

Had the standard mandated, for instance, that wide characters be at
least 32 bits wide, then each wide character would be wide enough for
any character set and it would be possible to write portable code
using wide characters as long as the code had no character set
dependency.

The OP also seems to be griping about certain implementations that use
Unicode as a character set but have a 16-bit wchar_t. Since it is
impossible to represent every Unicode character in 16 bits, wide
character strings become 'multiwchar_t' encodings (UTF-16), which
defeats the whole purpose of wide characters and wide character strings.
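
To make the 'multiwchar_t' point concrete, here is a rough sketch of what code has to do when 16-bit units carry UTF-16. utf16_decode is a hypothetical helper, and unsigned short merely stands in for a 16-bit wchar_t:

```c
#include <assert.h>
#include <stddef.h>   /* size_t */

/* Decode one code point from the 16-bit units at u (len units available).
   Stores the code point in *cp and returns the number of units consumed
   (1 or 2), or 0 if the input is empty or malformed. */
size_t utf16_decode(const unsigned short *u, size_t len, unsigned long *cp)
{
    assert(u != NULL && cp != NULL);

    if (len == 0)
        return 0;

    if (u[0] >= 0xD800 && u[0] <= 0xDBFF) {          /* high surrogate */
        if (len < 2 || u[1] < 0xDC00 || u[1] > 0xDFFF)
            return 0;                                /* missing low half */
        *cp = 0x10000UL
            + (((unsigned long)(u[0] - 0xD800) << 10)
               | (unsigned long)(u[1] - 0xDC00));
        return 2;
    }
    if (u[0] >= 0xDC00 && u[0] <= 0xDFFF)
        return 0;                                    /* lone low surrogate */

    *cp = u[0];                                      /* ordinary BMP unit */
    return 1;
}
```

Any algorithm that indexes or counts characters has to go through a step like this, which is exactly the multibyte-style bookkeeping wide characters were meant to avoid.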

- Sheldon


NumLockOff 10-10-2003 06:25 AM

Re: Multibyte string length
 

"Sheldon Simms" <sheldonsimms@yahoo.com> wrote in message
news:pan.2003.10.09.16.24.36.286991@yahoo.com...
> On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:
>
> > In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
> >
> >>PS: The whole multibyte string vs wide character string concept is broken
> >>IMHO since it allows wchar_t not to be large enough to contain a full
> >>character (rendering both types virtually the same). What's the point of
> >>standardizing wide characters if the standard makes portable usage of such
> >>mechanism a programming hell? Feel free to disagree.

> >
> > The bit you're missing is that the standard doesn't impose one character
> > set or another for wide characters. If the implementor decides to use
> > ASCII as the character set for wide characters, wchar_t need not be any
> > wider than char. But wchar_t is supposed to be wide enough for the
> > character set chosen by the implementor for wide characters.

>
> I don't think he's missing that at all. He's simply pointing out that
> the standard makes it pretty much impossible to use wide characters
> portably (unless you only use wide characters with values between 0
> and 127, of course).
>
> Had the standard mandated, for instance, that wide characters be at
> least 32 bits wide, then each wide character would be wide enough for
> any character set and it would be possible to write portable code
> using wide characters as long as the code had no character set
> dependency.
>
> The OP also seems to be griping about certain implementations using
> unicode as a character set that have 16 bit wchar_t. Since it is
> impossible to represent every unicode character in 16 bits, wide
> character strings become 'multiwchar_t' encodings (UTF-16), which
> defeats the whole purpose of wide characters and wide character strings
>
> - Sheldon
>

It is just the evolution of the Unicode standard. Surrogates were added at
U+D800 to include more Far Eastern characters. It has now become similar to
the MBCS mess. Could they have originally specified 32-bit characters? Maybe,
but in the early 1990s even 16-bit characters were considered a major waste
and were opposed. UTF-8 was pretty much invented so that older 8-bit
character systems could read plain English text without code changes. With
memory and processing power costs plummeting, we now feel that 32 bits is
fine. At this point 32 bits seems to be enough! Who knows what will happen
once we make the "first contact" :-)



Sheldon Simms 10-10-2003 09:05 AM

Re: Multibyte string length
 
On Thu, 09 Oct 2003 23:25:44 -0700, NumLockOff wrote:

>
> "Sheldon Simms" <sheldonsimms@yahoo.com> wrote in message
> news:pan.2003.10.09.16.24.36.286991@yahoo.com...
>> On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:
>>
>> > In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
>> >
>> >>PS: The whole multibyte string vs wide character string concept is broken
>> >>IMHO since it allows wchar_t not to be large enough to contain a full
>> >>character (rendering both types virtually the same). What's the point of
>> >>standardizing wide characters if the standard makes portable usage of such
>> >>mechanism a programming hell? Feel free to disagree.
>> >
>> > The bit you're missing is that the standard doesn't impose one character
>> > set or another for wide characters. If the implementor decides to use
>> > ASCII as the character set for wide characters, wchar_t need not be any
>> > wider than char. But wchar_t is supposed to be wide enough for the
>> > character set chosen by the implementor for wide characters.

>>
>> I don't think he's missing that at all. He's simply pointing out that
>> the standard makes it pretty much impossible to use wide characters
>> portably (unless you only use wide characters with values between 0
>> and 127, of course).
>>
>> Had the standard mandated, for instance, that wide characters be at
>> least 32 bits wide, then each wide character would be wide enough for
>> any character set and it would be possible to write portable code
>> using wide characters as long as the code had no character set
>> dependency.
>>
>> The OP also seems to be griping about certain implementations using
>> unicode as a character set that have 16 bit wchar_t. Since it is
>> impossible to represent every unicode character in 16 bits, wide
>> character strings become 'multiwchar_t' encodings (UTF-16), which
>> defeats the whole purpose of wide characters and wide character strings
>>
>> - Sheldon
>>

> It is just the evolution of the Unicode standard. Surrogates were added at
> U+D800 to include more Far Eastern characters. It has become now similar to a
> mbcs mess.


Unicode is not the problem. 16 bit wchar_t is the problem.


Dan Pop 10-10-2003 11:49 AM

Re: Multibyte string length
 
In <pan.2003.10.09.16.24.36.286991@yahoo.com> Sheldon Simms <sheldonsimms@yahoo.com> writes:

>On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:
>
>> In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
>>
>>>PS: The whole multibyte string vs wide character string concept is broken
>>>IMHO since it allows wchar_t not to be large enough to contain a full
>>>character (rendering both types virtually the same). What's the point of
>>>standardizing wide characters if the standard makes portable usage of such
>>>mechanism a programming hell? Feel free to disagree.

>>
>> The bit you're missing is that the standard doesn't impose one character
>> set or another for wide characters. If the implementor decides to use
>> ASCII as the character set for wide characters, wchar_t need not be any
>> wider than char. But wchar_t is supposed to be wide enough for the
>> character set chosen by the implementor for wide characters.

>
>I don't think he's missing that at all. He's simply pointing out that
>the standard makes it pretty much impossible to use wide characters
>portably (unless you only use wide characters with values between 0
>and 127, of course).
>
>Had the standard mandated, for instance, that wide characters be at
>least 32 bits wide, then each wide character would be wide enough for
>any character set and it would be possible to write portable code
>using wide characters as long as the code had no character set
>dependency.


Nope, it wouldn't, as long as the standard doesn't specify a certain
character set for the wide characters. Imagine that you need to output
the character e with an acute accent. How do you do that *portably*, if
you have the additional guarantee that wchar_t is at least 32-bit wide?

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan.Pop@ifh.de

Sheldon Simms 10-10-2003 06:40 PM

Re: Multibyte string length
 
On Fri, 10 Oct 2003 11:49:19 +0000, Dan Pop wrote:

> In <pan.2003.10.09.16.24.36.286991@yahoo.com> Sheldon Simms <sheldonsimms@yahoo.com> writes:
>
>>On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:
>>
>>> In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
>>>
>>>>PS: The whole multibyte string vs wide character string concept is broken
>>>>IMHO since it allows wchar_t not to be large enough to contain a full
>>>>character (rendering both types virtually the same). What's the point of
>>>>standardizing wide characters if the standard makes portable usage of such
>>>>mechanism a programming hell? Feel free to disagree.
>>>

>>Had the standard mandated, for instance, that wide characters be at
>>least 32 bits wide, then each wide character would be wide enough for
>>any character set and it would be possible to write portable code
>>using wide characters as long as the code had no character set
>>dependency.

>
> Nope, it wouldn't, as long as the standard doesn't specify a certain
> character set for the wide characters. Imagine that you need to output
> the character e with an acute accent. How do you do that *portably*, if
> you have the additional guarantee that wchar_t is at least 32-bit wide?


I never meant to say that sort of thing could be done portably.

I was going on the assumption that the OP's assertion "it allows wchar_t
not to be large enough to contain a full character" was true, and thinking
about two implementations using the same execution character set where
one implementation used a wchar_t that was too small for the character
set.

It seems to me now, however, that an implementation in which wchar_t is
not "large enough to contain a full character" would be non-conforming,
since 7.17.2 states:

wchar_t which is an integer type whose range of values can represent
distinct codes for all members of the largest extended character set
specified among the supported locales;

In any case, my statement was based on the assumption of multiple
implementations using a common (but arbitrary) character set, and that
is an unportable assumption by itself, so I retract my assertion.

-Sheldon


Zygmunt Krynicki 10-10-2003 07:02 PM

Re: Multibyte string length
 
On Fri, 10 Oct 2003 11:49:19 +0000, Dan Pop wrote:

> Nope, it wouldn't, as long as the standard doesn't specify a certain
> character set for the wide characters. Imagine that you need to output
> the character e with an acute accent. How do you do that *portably*, if
> you have the additional guarantee that wchar_t is at least 32-bit wide?
>
> Dan


To clarify

Not my problem really, and not a real one either, as any specific program
most probably knows its output encoding. However, imagine I wish to write
portable code for wide character regular expressions. Now the whole purpose
of wide characters is obvious: to be able to address all sorts of
characters and encodings, not just plain ASCII, in a portable way.

Without naming names, it is common for the INTERNAL encoding used inside
program routines to differ from the EXTERNAL encoding used to
store/transfer text.

Now we know that many external encodings use multibyte sequences for
various reasons which are not important here. We also know how inefficient
or uncomfortable it is to develop algorithms for multibyte sequence
character strings. It is much easier to assume that any single character
can fit into some data type. Whether it's wchar_t or foo_t is not
important.

Now if wchar_t is not forced to be able to contain a full character then
again we are stuck at our multibyte (multi-some-unit) character
sequence with all of its inconveniences. This IMHO defeats the whole
purpose of wchar_t.

Of course it is not clear which character encoding is the best one (or rather,
since there is no perfect encoding, which one should be made the standard).
Unicode seems to help a lot, providing UTF-8 as the external and 32-bit Unicode
as the internal encoding. This has all sorts of benefits and drawbacks that
are not important here.
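
The external-to-internal step can be sketched with the standard mbsrtowcs function. This assumes the external text is in the current locale's multibyte encoding; mb_to_wide is a hypothetical helper name, and whether one wchar_t really holds one full character still depends on the implementation:

```c
#include <assert.h>
#include <stdlib.h>   /* malloc, free */
#include <string.h>   /* memset */
#include <wchar.h>    /* mbsrtowcs, mbstate_t */

/* Convert the multibyte string s (current LC_CTYPE locale) to a newly
   allocated wide string. Returns NULL on an invalid sequence or if
   allocation fails. The caller frees the result. */
wchar_t *mb_to_wide(const char *s)
{
    mbstate_t state;
    const char *src = s;
    size_t n;
    wchar_t *w;

    assert(s != NULL);

    /* First pass: with a null destination, mbsrtowcs only counts the
       wide characters needed (not including the terminator). */
    memset(&state, 0, sizeof state);
    n = mbsrtowcs(NULL, &src, 0, &state);
    if (n == (size_t)-1)
        return NULL;

    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    /* Second pass: convert, including the terminating null. */
    src = s;
    memset(&state, 0, sizeof state);
    if (mbsrtowcs(w, &src, n + 1, &state) == (size_t)-1) {
        free(w);
        return NULL;
    }
    return w;
}
```

The first pass is also a way to get the character count of a multibyte string without storing the converted text.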

Also, hardware doesn't need to have 32-bit-wide data types, so it
would be problematic to create conforming implementations.

BTW: Thank you all for participating in this discussion :-)

Regards
Zygmunt Krynicki

those who know me have no need of my name 10-11-2003 07:42 PM

Re: Multibyte string length
 
in comp.lang.c i read:

>Now if wchar_t is not forced to be able to contain a full character then
>again we are stuck at our multibyte (multi-some-unit) character
>sequence with all of its inconveniences. This IMHO defeats the whole
>purpose of wchar_t.


wchar_t is required to have a range that can handle all the code points
which can arise from the use of any locale supported by the implementation.
c99 takes this further: the implementation can indicate to the programmer
if iso-10646 is directly supported (though the encoding is *not* required
to be ucs-4), and the creation of the \U and \u escapes so that iso-10646
code points can be used directly.
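
A small sketch of both C99 features. The function names are hypothetical, and the code assumes a C99 compiler; the value of the \u constant is only promised to be the ISO 10646 code when __STDC_ISO_10646__ is defined:

```c
#include <assert.h>
#include <wchar.h>    /* wchar_t, WCHAR_MAX */

/* Nonzero if the implementation promises that wchar_t values are
   ISO 10646 code points. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* LATIN SMALL LETTER E WITH ACUTE, written as a universal character
   name so the source file itself stays plain ASCII. */
wchar_t e_acute(void)
{
    /* Assumption of this sketch: wchar_t can hold at least 0xE9. */
    assert(WCHAR_MAX >= 0xE9);
    return L'\u00e9';
}
```

Where wchar_is_iso10646() returns nonzero, e_acute() compares equal to 0xE9; elsewhere the numeric value is whatever the implementation's wide character set assigns.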

>Also hardware doesn't need to have 32 bit wide data types so it
>would be problematic to create conforming implementations


hardware may not necessarily have a 32 bit wide integer type, but the
standard mandates that long be at least 32 value bits wide (sign + 31 for
signed long). so, there *is* always a 32 bit type available.
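
As a sketch of that guarantee, assuming one wants a code point type that works whatever the width of wchar_t (ucs4_char and make_code_point are made-up names):

```c
#include <assert.h>
#include <limits.h>   /* ULONG_MAX */

/* unsigned long has at least 32 value bits on every conforming
   implementation, so it can hold any 32-bit code. */
typedef unsigned long ucs4_char;

/* Build a code point from a plane number (0..16) and a 16-bit offset
   within that plane. */
ucs4_char make_code_point(unsigned plane, unsigned offset)
{
    assert(plane <= 16 && offset <= 0xFFFFu);
    return ((ucs4_char)plane << 16) | offset;
}
```

The price is that the wide character library functions take wchar_t, not this type, so such code has to bring its own classification and I/O routines.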

--
a signature

Sheldon Simms 10-12-2003 12:30 AM

Re: Multibyte string length
 
On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
name wrote:

> in comp.lang.c i read:
>
>>Now if wchar_t is not forced to be able to contain a full character then
>>again we are stuck at our multibyte (multi-some-unit) character
>>sequence with all of its inconveniences. This IMHO defeats the whole
>>purpose of wchar_t.

>
> wchar_t is required to have a range that can handle all the code points
> which can arise from the use of any locale supported by the implementation.
> c99 takes this further: the implementation can indicate to the programmer
> if iso-10646 is directly supported (though the encoding is *not* required
> to be ucs-4)


I guess you're saying the encoding is not required to be ucs-4 because
the standard doesn't explicitly say so:

6.10.8.2
...
__STDC_ISO_10646__ An integer constant of the form yyyymmL (for
example, 199712L), intended to indicate that values of type wchar_t
are the coded representations of the characters defined by ISO/IEC
10646, along with all amendments and technical corrigenda as of the
specified year and month.

But if the encoding is not ucs-4, then what could it possibly be?
7.17.2 says

wchar_t which is an integer type whose range of values can represent
distinct codes for all members of the largest extended character set
specified among the supported locales;

As I read this, it means that implementations implementing ISO 10646
must have a wchar_t capable of representing over 1 million distinct
values. Given this requirement, ucs-4 seems to be the only reasonable
encoding to use for ISO 10646 wide character strings.
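
The range requirement can be checked against WCHAR_MAX from C99's <stdint.h>. A minimal sketch, with wchar_can_hold as a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>   /* WCHAR_MAX, uintmax_t */

/* Nonzero if a single wchar_t can represent the value code. */
int wchar_can_hold(unsigned long code)
{
    /* C99 requires WCHAR_MAX to be at least 127. */
    assert(WCHAR_MAX >= 127);
    return (uintmax_t)WCHAR_MAX >= code;
}
```

On an implementation that defines __STDC_ISO_10646__, one would expect wchar_can_hold(0x10FFFFUL) to be nonzero; an implementation with a 16-bit wchar_t would return 0 for it.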

Would an implementation that used utf-8 encoding in wide character
strings composed of 32-bit wchar_t be conforming?

-Sheldon


