Multibyte string length
Hello
I've browsed the FAQ, but apparently it lacks any questions concerning wide character strings. I'd like to calculate the length of a multibyte string without converting the whole string.

Zygmunt

PS: The whole multibyte string vs. wide character string concept is broken IMHO, since it allows wchar_t not to be large enough to contain a full character (rendering both types virtually the same). What's the point of standardizing wide characters if the standard makes portable usage of such a mechanism a programming hell? Feel free to disagree.

PS2: On my implementation wchar_t is 'big enough', so I might overcome the problem in some other way, but I'd like to see some fully portable approach.
Re: Multibyte string length
In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:

>I've browsed the FAQ but apparently it lacks any questions concerning wide
>character strings. I'd like to calculate the length of a multibyte string
>without converting the whole string.

Use the mblen function from the standard C library in a loop, until it returns 0. The number of mblen calls returning a positive value is the number of multibyte characters in that string.

>PS: The whole multibyte string vs wide character string concept is broken
>IMHO since it allows wchar_t not to be large enough to contain a full
>character (rendering both types virtually the same). What's the point of
>standardizing wide characters if the standard makes portable usage of such
>mechanism a programming hell? Feel free to disagree.

The bit you're missing is that the standard doesn't impose one character set or another for wide characters. If the implementor decides to use ASCII as the character set for wide characters, wchar_t need not be any wider than char. But wchar_t is supposed to be wide enough for the character set chosen by the implementor for wide characters.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan.Pop@ifh.de
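For readers following along, here is a minimal sketch of the mblen loop described above, assuming a null-terminated string in the current locale's encoding. The name `mb_count` is made up for illustration. Besides stopping at 0, the sketch also stops when mblen returns -1 (an invalid or incomplete sequence) and reports that case as `(size_t)-1`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts the multibyte characters in s using the
 * current locale. Returns (size_t)-1 if an invalid sequence is found. */
size_t mb_count(const char *s)
{
    size_t count = 0;
    size_t left;
    int n;

    assert(s != NULL);

    /* Bytes mblen may examine, including the terminating null byte. */
    left = strlen(s) + 1;

    /* Reset mblen's internal shift state (matters for stateful encodings). */
    (void)mblen(NULL, 0);

    /* Each positive return value is the byte length of one character. */
    while ((n = mblen(s, left)) > 0) {
        s += n;
        left -= (size_t)n;
        count++;
    }

    /* n == 0: reached the terminator; n == -1: invalid sequence. */
    return n < 0 ? (size_t)-1 : count;
}
```

A program would normally call `setlocale(LC_CTYPE, "")` before using this, so that mblen interprets the bytes in the user's encoding; in the default "C" locale only the basic single-byte characters are guaranteed to be recognized.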
Re: Multibyte string length
On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:

> In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
>
>>PS: The whole multibyte string vs wide character string concept is broken
>>IMHO since it allows wchar_t not to be large enough to contain a full
>>character (rendering both types virtually the same). What's the point of
>>standardizing wide characters if the standard makes portable usage of such
>>mechanism a programming hell? Feel free to disagree.
>
> The bit you're missing is that the standard doesn't impose one character
> set or another for wide characters. If the implementor decides to use
> ASCII as the character set for wide characters, wchar_t need not be any
> wider than char. But wchar_t is supposed to be wide enough for the
> character set chosen by the implementor for wide characters.

I don't think he's missing that at all. He's simply pointing out that the standard makes it pretty much impossible to use wide characters portably (unless you only use wide characters with values between 0 and 127, of course).

Had the standard mandated, for instance, that wide characters be at least 32 bits wide, then each wide character would be wide enough for any character set, and it would be possible to write portable code using wide characters as long as the code had no character set dependency.

The OP also seems to be griping about certain implementations that use Unicode as a character set but have a 16-bit wchar_t. Since it is impossible to represent every Unicode character in 16 bits, wide character strings become 'multiwchar_t' encodings (UTF-16), which defeats the whole purpose of wide characters and wide character strings.

- Sheldon
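To make the 'multiwchar_t' point concrete, here is a small sketch of how UTF-16 splits a code point above U+FFFF into two 16-bit units (a surrogate pair). The function name `utf16_encode` and the use of `unsigned int` for the units are choices made for this illustration only; nothing here is part of the C standard library.

```c
#include <assert.h>

/* Hypothetical helper: encodes code point cp as UTF-16 into out[0..1].
 * Returns the number of 16-bit units used (1 or 2). */
int utf16_encode(unsigned long cp, unsigned int out[2])
{
    assert(cp <= 0x10FFFFUL);               /* highest Unicode code point */
    assert(cp < 0xD800UL || cp > 0xDFFFUL); /* surrogates are not characters */

    if (cp < 0x10000UL) {
        /* Fits in a single 16-bit unit. */
        out[0] = (unsigned int)cp;
        return 1;
    }

    /* Split the remaining 20 bits into a high and a low surrogate. */
    cp -= 0x10000UL;
    out[0] = (unsigned int)(0xD800UL + (cp >> 10));
    out[1] = (unsigned int)(0xDC00UL + (cp & 0x3FFUL));
    return 2;
}
```

With a 16-bit wchar_t, any character that needs the two-unit branch can no longer be indexed, compared, or classified as a single array element, which is the complaint made above.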
Re: Multibyte string length
"Sheldon Simms" <sheldonsimms@yahoo.com> wrote in message news:pan.2003.10.09.16.24.36.286991@yahoo.com...

> On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:
>
> > In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
> >
> >>PS: The whole multibyte string vs wide character string concept is broken
> >>IMHO since it allows wchar_t not to be large enough to contain a full
> >>character (rendering both types virtually the same). What's the point of
> >>standardizing wide characters if the standard makes portable usage of such
> >>mechanism a programming hell? Feel free to disagree.
> >
> > The bit you're missing is that the standard doesn't impose one character
> > set or another for wide characters. If the implementor decides to use
> > ASCII as the character set for wide characters, wchar_t need not be any
> > wider than char. But wchar_t is supposed to be wide enough for the
> > character set chosen by the implementor for wide characters.
>
> I don't think he's missing that at all. He's simply pointing out that
> the standard makes it pretty much impossible to use wide characters
> portably (unless you only use wide characters with values between 0
> and 127, of course).
>
> Had the standard mandated, for instance, that wide characters be at
> least 32 bits wide, then each wide character would be wide enough for
> any character set and it would be possible to write portable code
> using wide characters as long as the code had no character set
> dependency.
>
> The OP also seems to be griping about certain implementations using
> unicode as a character set that have 16 bit wchar_t. Since it is
> impossible to represent every unicode character in 16 bits, wide
> character strings become 'multiwchar_t' encodings (UTF-16), which
> defeats the whole purpose of wide characters and wide character strings
>
> - Sheldon

It is just the evolution of the Unicode standard. Surrogates were added at U+D800 to include more Far Eastern characters. It has now become similar to an MBCS mess.

Could they have originally specified 32-bit characters? Maybe, but in the early 1990s even 16-bit characters were considered a major waste and were opposed. UTF-8 was pretty much invented so that older 8-bit character systems could read plain English text without code changes. With memory and processing power costs plummeting, we now feel that 32 bits is fine.

At this point 32 bits seems to be enough! Who knows what will happen once we make the "first contact" :-)
Re: Multibyte string length
On Thu, 09 Oct 2003 23:25:44 -0700, NumLockOff wrote:

> "Sheldon Simms" <sheldonsimms@yahoo.com> wrote in message
> news:pan.2003.10.09.16.24.36.286991@yahoo.com...
>> On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:
>>
>> > In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
>> >
>> >>PS: The whole multibyte string vs wide character string concept is broken
>> >>IMHO since it allows wchar_t not to be large enough to contain a full
>> >>character (rendering both types virtually the same). What's the point of
>> >>standardizing wide characters if the standard makes portable usage of such
>> >>mechanism a programming hell? Feel free to disagree.
>> >
>> > The bit you're missing is that the standard doesn't impose one character
>> > set or another for wide characters. If the implementor decides to use
>> > ASCII as the character set for wide characters, wchar_t need not be any
>> > wider than char. But wchar_t is supposed to be wide enough for the
>> > character set chosen by the implementor for wide characters.
>>
>> I don't think he's missing that at all. He's simply pointing out that
>> the standard makes it pretty much impossible to use wide characters
>> portably (unless you only use wide characters with values between 0
>> and 127, of course).
>>
>> Had the standard mandated, for instance, that wide characters be at
>> least 32 bits wide, then each wide character would be wide enough for
>> any character set and it would be possible to write portable code
>> using wide characters as long as the code had no character set
>> dependency.
>>
>> The OP also seems to be griping about certain implementations using
>> unicode as a character set that have 16 bit wchar_t. Since it is
>> impossible to represent every unicode character in 16 bits, wide
>> character strings become 'multiwchar_t' encodings (UTF-16), which
>> defeats the whole purpose of wide characters and wide character strings
>>
>> - Sheldon
>
> It is just the evolution of the Unicode standard. Surrogates were added at
> U+D800 to include more FarEastern characters. It has become now similar to a
> mbcs mess.

Unicode is not the problem. 16-bit wchar_t is the problem.
Re: Multibyte string length
In <pan.2003.10.09.16.24.36.286991@yahoo.com> Sheldon Simms <sheldonsimms@yahoo.com> writes:

>On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:
>
>> In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
>>
>>>PS: The whole multibyte string vs wide character string concept is broken
>>>IMHO since it allows wchar_t not to be large enough to contain a full
>>>character (rendering both types virtually the same). What's the point of
>>>standardizing wide characters if the standard makes portable usage of such
>>>mechanism a programming hell? Feel free to disagree.
>>
>> The bit you're missing is that the standard doesn't impose one character
>> set or another for wide characters. If the implementor decides to use
>> ASCII as the character set for wide characters, wchar_t need not be any
>> wider than char. But wchar_t is supposed to be wide enough for the
>> character set chosen by the implementor for wide characters.
>
>I don't think he's missing that at all. He's simply pointing out that
>the standard makes it pretty much impossible to use wide characters
>portably (unless you only use wide characters with values between 0
>and 127, of course).
>
>Had the standard mandated, for instance, that wide characters be at
>least 32 bits wide, then each wide character would be wide enough for
>any character set and it would be possible to write portable code
>using wide characters as long as the code had no character set
>dependency.

Nope, it wouldn't, as long as the standard doesn't specify a certain character set for the wide characters. Imagine that you need to output the character e with an acute accent. How do you do that *portably*, if you have the additional guarantee that wchar_t is at least 32 bits wide?

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan.Pop@ifh.de
Re: Multibyte string length
On Fri, 10 Oct 2003 11:49:19 +0000, Dan Pop wrote:

> In <pan.2003.10.09.16.24.36.286991@yahoo.com> Sheldon Simms <sheldonsimms@yahoo.com> writes:
>
>>On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:
>>
>>> In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
>>>
>>>>PS: The whole multibyte string vs wide character string concept is broken
>>>>IMHO since it allows wchar_t not to be large enough to contain a full
>>>>character (rendering both types virtually the same). What's the point of
>>>>standardizing wide characters if the standard makes portable usage of such
>>>>mechanism a programming hell? Feel free to disagree.
>>>
>>Had the standard mandated, for instance, that wide characters be at
>>least 32 bits wide, then each wide character would be wide enough for
>>any character set and it would be possible to write portable code
>>using wide characters as long as the code had no character set
>>dependency.
>
> Nope, it wouldn't, as long as the standard doesn't specify a certain
> character set for the wide characters. Imagine that you need to output
> the character e with an acute accent. How do you do that *portably*, if
> you have the additional guarantee that wchar_t is at least 32-bit wide?

I never meant to say that sort of thing could be done portably. I was going on the assumption that the OP's assertion "it allows wchar_t not to be large enough to contain a full character" was true, and thinking about two implementations using the same execution character set where one implementation used a wchar_t that was too small for the character set.

It seems to me now, however, that an implementation in which wchar_t is not "large enough to contain a full character" would be non-conforming, since 7.17.2 states:

    wchar_t
    which is an integer type whose range of values can represent
    distinct codes for all members of the largest extended character
    set specified among the supported locales;

In any case, my statement was based on the assumption of multiple implementations using a common (but arbitrary) character set, and that is an unportable assumption by itself, so I retract my assertion.

-Sheldon
Re: Multibyte string length
On Fri, 10 Oct 2003 11:49:19 +0000, Dan Pop wrote:

> Nope, it wouldn't, as long as the standard doesn't specify a certain
> character set for the wide characters. Imagine that you need to output
> the character e with an acute accent. How do you do that *portably*, if
> you have the additional guarantee that wchar_t is at least 32-bit wide?
>
> Dan

To clarify: that is not really my problem, and not a real one either, as any specific program most probably knows its output encoding. However, imagine I wish to write portable code for wide character regular expressions.

Now, the whole purpose of wide characters is obvious: to be able to address all sorts of characters and encodings, not just plain ASCII, in a portable way. Without naming names, it is common for the INTERNAL encoding used inside program routines to differ from the EXTERNAL encoding used to store or transfer text.

We know that many external encodings use multibyte sequences, for various reasons which are not important here. We also know how inefficient or uncomfortable it is to develop algorithms for strings of multibyte character sequences. It is much easier to assume that any single character fits into some data type. Whether it's wchar_t or foo_t is not important.

Now, if wchar_t is not forced to be able to contain a full character, then again we are stuck with our multibyte (multi-some-unit) character sequence and all of its inconveniences. This IMHO defeats the whole purpose of wchar_t.

Of course it is not clear which character encoding is the best one (or rather, since there is no perfect encoding, which one should be made the standard). Unicode seems to help a lot, providing UTF-8 as the external and 32-bit Unicode as the internal encoding. This has all sorts of benefits and drawbacks that are not important here. Also, hardware doesn't need to have 32-bit-wide data types, so it would be problematic to create conforming implementations.

BTW: Thank you all for participating in this discussion :-)

Regards
Zygmunt Krynicki
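The external/internal split described above is what the standard conversion functions are for. Below is a rough sketch, assuming the external text is a null-terminated multibyte string in the current locale's encoding; the name `mb_to_wide` is invented for this example. It relies on the documented behaviour of mbsrtowcs that a null destination pointer makes the function only count the wide characters it would produce.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts the multibyte string src into a newly
 * allocated wide string. Returns NULL on an invalid sequence or when
 * allocation fails. The caller frees the result. */
wchar_t *mb_to_wide(const char *src)
{
    mbstate_t state;
    const char *p = src;
    size_t len;
    wchar_t *dst;

    assert(src != NULL);

    /* First pass: a null destination means "count only". */
    memset(&state, 0, sizeof state);
    len = mbsrtowcs(NULL, &p, 0, &state);
    if (len == (size_t)-1)
        return NULL;

    dst = malloc((len + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Second pass: convert, leaving room for the terminating L'\0'. */
    memset(&state, 0, sizeof state);
    p = src;
    if (mbsrtowcs(dst, &p, len + 1, &state) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;
}
```

After the conversion, algorithms such as a regular expression matcher can index the result one wchar_t at a time, provided the implementation's wchar_t really holds one full character per element, which is the point under dispute in this thread.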
Re: Multibyte string length
in comp.lang.c i read:

>Now if wchar_t is not forced to able to contain a full character then
>again we are stuck at our multibyte (multi-some-unit) character
>sequence with all of its inconveniences. This IMHO defeats the whole
>purpose of wchar_t.

wchar_t is required to have a range that can handle all the code points which can arise from the use of any locale supported by the implementation. c99 takes this further: the implementation can indicate to the programmer if iso-10646 is directly supported (though the encoding is *not* required to be ucs-4), and the \U and \u escapes were created so that iso-10646 code points can be used directly.

>Also hardware doesn't need to have 32 bit wide data types so it
>would be problematic to create conforming implementations

hardware may not necessarily have a 32 bit wide integer type, but the standard mandates that long be at least 32 bits wide (sign + 31 value bits for signed long). so, there *is* always a 32 bit type available.

--
a signature
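The facilities mentioned in this post can be probed from code. The following is only a sketch: the function names are invented for illustration, and what the probes report depends entirely on the implementation being used. It relies on `__STDC_ISO_10646__`, `WCHAR_MAX` from `<wchar.h>`, and a `\u` universal character name, all of which are C99 features.

```c
#include <assert.h>
#include <limits.h>
#include <wchar.h>

/* Hypothetical probe: 1 if the implementation states that wchar_t values
 * are ISO/IEC 10646 code points, 0 otherwise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical probe: 1 if code point cp lies within wchar_t's range.
 * unsigned long is at least 32 bits wide, so it can carry any code point. */
int fits_in_wchar(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);
    return cp <= (unsigned long)WCHAR_MAX;
}

/* e with an acute accent, written with a C99 universal character name
 * instead of an implementation-specific numeric value. */
const wchar_t *e_acute(void)
{
    return L"\u00E9";
}
```

On an implementation with a 16-bit wchar_t, `fits_in_wchar(0x1F600UL)` would be expected to return 0, while `LONG_MAX >= 2147483647L` holds everywhere, as the post says.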
Re: Multibyte string length
On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my name wrote:

> in comp.lang.c i read:
>
>>Now if wchar_t is not forced to able to contain a full character then
>>again we are stuck at our multibyte (multi-some-unit) character
>>sequence with all of its inconveniences. This IMHO defeats the whole
>>purpose of wchar_t.
>
> wchar_t is required to have a range that can handle all the code points
> which can arise from the use of any locale supported by the implementation.
> c99 takes this further: the implementation can indicate to the programmer
> if iso-10646 is directly supported (though the encoding is *not* required
> to be ucs-4)

I guess you're saying the encoding is not required to be ucs-4 because the standard doesn't explicitly say so:

    6.10.8.2 ...
    __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
    example, 199712L), intended to indicate that values of type wchar_t
    are the coded representations of the characters defined by ISO/IEC
    10646, along with all amendments and technical corrigenda as of the
    specified year and month.

But if the encoding is not ucs-4, then what could it possibly be? 7.17.2 says:

    wchar_t
    which is an integer type whose range of values can represent
    distinct codes for all members of the largest extended character
    set specified among the supported locales;

As I read this, it means that implementations implementing ISO 10646 must have a wchar_t capable of representing over 1 million distinct values. Given this requirement, ucs-4 seems to be the only reasonable encoding to use for ISO 10646 wide character strings.

Would an implementation that used utf-8 encoding in wide character strings composed of 32-bit wchar_t be conforming?

-Sheldon