wchar_t is useless

 
 
Lauri Alanko
Guest
Posts: n/a
 
      11-22-2011
In article <>,
Kaz Kylheku <> wrote:
> On 2011-11-22, Lauri Alanko <> wrote:
> > For instance,
> > character for love (U+611B), which is encoded in EUC-JP as "\xb0\xa6"
> > is represented by the wchar_t value 0xb0a6.

>
> Ah, if that's the case, that is pretty broken. \xb0\xa6 should decode into the
> value 0x611B.


It should, if __STDC_ISO_10646__ were defined. The standard doesn't
require it to be.
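
To make that concrete, here is a minimal sketch of how portable code can test for the guarantee at compile time. The macro `__STDC_ISO_10646__` is standard C99; the function names `wchar_is_unicode` and `codepoint_to_wchar` are hypothetical helpers invented for this illustration, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Returns 1 if the implementation promises that wchar_t values are
   ISO 10646 (Unicode) code points, 0 if it makes no such promise. */
int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: store code point cp in *out only when the
   mapping is guaranteed and the value fits in wchar_t.
   Returns 1 on success, 0 otherwise. */
int codepoint_to_wchar(uint32_t cp, wchar_t *out)
{
    assert(out != NULL);
    if (!wchar_is_unicode())
        return 0;               /* encoding of wchar_t is unspecified */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;               /* not a Unicode scalar value */
    if ((uintmax_t)cp > (uintmax_t)WCHAR_MAX)
        return 0;               /* does not fit in this wchar_t */
    *out = (wchar_t)cp;
    return 1;
}
```

On a platform that does not define the macro, `codepoint_to_wchar(0x611B, &w)` simply refuses, which is the whole problem: there is no portable fallback.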

> They ducked out of doing it right, didn't they.


BSD predates Unicode. I can quite well understand that old locale code
doesn't map characters into their unicode values. What I didn't expect
is that there would be no attempt to keep the non-standard mappings
consistent between locales.

> Or not using/supporting the locales that don't produce Unicode code points.
> You can treat those as weird legacy cruft, like EBCDIC.


I have worked on an EBCDIC platform. They are real.

> Find out what works,
> and document that as being supported. "This is a Unicode program, whose
> embedded strings are in Unicode, and which requires a Unicode-compatible
> locale."


As I was saying, wchar_t is useless in portable C programming. You
seem to be concurring, although in a roundabout way.

> Either way, you don't have to throw out the wchar_t. It is handy because it's
> supported in the form of string literals, and some useful functions like
> wcsspn, wcschr, etc.


None of those are particularly useful for me. I wouldn't be using
wchar_t* for data storage or processing anyway, just for interchange.

And again, I don't see a single code point as being a very meaningful
unit of text. If you need to search for a piece of text, you most
likely need to search for a substring.

The only use wchar_t could have had for me was if it had been an
established, well-defined way of representing multilingual text. As it
stands, it is far too underdefined, so it has no real use for me.
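
For comparison, a well-defined interchange form is easy to pin down without wchar_t. The following is only a sketch, assuming UTF-8 input and `uint32_t` as the UTF-32 code unit type; the names `utf8_decode` and `utf8_to_utf32` are made up for this example and the error handling is deliberately minimal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (at most n bytes) into *cp.
   Returns the number of bytes consumed, or 0 on malformed input. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    uint32_t c = s[0];
    uint32_t min_cp;   /* smallest value allowed for this length */
    size_t len;

    if (c < 0x80) { *cp = c; return 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; min_cp = 0x80;    c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; min_cp = 0x800;   c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; min_cp = 0x10000; c &= 0x07; }
    else return 0;                       /* stray continuation or bad lead */

    if (n < len)
        return 0;                        /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                    /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (c < min_cp || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/* Decode n bytes of UTF-8 into out (capacity cap code points).
   Returns the number of code points, or (size_t)-1 on error. */
size_t utf8_to_utf32(const char *src, size_t n, uint32_t *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t count = 0;
    while (n > 0) {
        uint32_t cp;
        size_t used = utf8_decode(s, n, &cp);
        if (used == 0 || count == cap)
            return (size_t)-1;
        out[count++] = cp;
        s += used;
        n -= used;
    }
    return count;
}
```

Unlike a wchar_t string, the result means the same thing in every locale and on every platform.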

> I think you have to regard these two problems as being completely separate:
>
> - writing software that is multilingual.
> - targeting two or more incompatible ways of being multilingual,
> simultaneously in the same program. (incompatible meaning that the
> internal representation for characters follows a different map.)
>
> I think you're taking too much into your scope: you want to solve both
> problems, and so then when you look at this FreeBSD mess, it looks
> intractable.
>
> Solve the first problem, and forget the second.


That's what I'm doing, and that's why I'm going to forget about wchar_t.


Lauri
 
 
 
 
 
Dann Corbit
Guest
Posts: n/a
 
      11-22-2011
In article <>, says...
>
> On 2011-11-21, Lauri Alanko <> wrote:

[snip]
>
> > So screw it all, I'll just use UTF-32 like I should have from the
> > beginning.



Or you could use this, which is what every sensible person does:
http://www-01.ibm.com/software/globalization/icu/
 
 
 
 
 
Joe keane
Guest
Posts: n/a
 
      11-24-2011
No one should work on I18N unless they're Finnish, Hungarian, or Japanese.
 
 
Kaz Kylheku
Guest
Posts: n/a
 
      11-24-2011
On 2011-11-24, Joe keane <> wrote:
> No one should work on I18N unless they're Finnish, Hungarian, or Japanese.


It's no longer realistic to write programs that do any text processing, but
handle only 8 bit text, even if those programs are not actually multilingual.

(I mean programs for the world to use, not just for use by the author and maybe
a few of his ISO-latin-character-using colleagues.)
 
 
Keith Thompson
Guest
Posts: n/a
 
      11-24-2011
(Joe keane) writes:
> No one should work on I18N unless they're Finnish, Hungarian, or Japanese.


Do you mean that speakers of those languages have additional insight
into internationalization issues? If so, you have a point, but
it's vastly overstated (deliberately so for effect, I presume).

--
Keith Thompson (The_Other_Keith) kst- <http://www.ghoti.net/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
 
 
James Kuyper
Guest
Posts: n/a
 
      11-24-2011
On 11/24/2011 04:16 PM, Kaz Kylheku wrote:
> On 2011-11-24, Joe keane <> wrote:
>> No one should work on I18N unless they're Finnish, Hungarian, or Japanese.

>
> It's no longer realistic to write programs that do any text processing, but
> handle only 8 bit text, even if those programs are not actually multilingual.
>
> (I mean programs for the world to use, not just for use by the author and maybe
> a few of his ISO-latin-character-using colleagues.)


Not even if those "few colleagues" number in the hundreds of millions?
I'm in favor of I18N, but I also work in one of the largest markets in
the world where it's quite feasible to make a decent profit on software
that has support for only one language.
--
James Kuyper
 
 
88888 Dihedral
Guest
Posts: n/a
 
      11-25-2011
On Monday, November 21, 2011 10:14:35 PM UTC+8, Lauri Alanko wrote:
> I have recently written a number of posts regarding C's wide character
> support. It now turns out that my investigation has been in vain:
> wchar_t is useless in portable C programming, although I'm not quite
> sure whether the standard or implementations are to blame for this. Most
> likely both: the standard has sanctioned the implementations'
> deficiencies.
>
> I'm working on a library that deals with multilingual strings. The
> library only does computation, and doesn't have need for very fancy I/O,
> so I'm trying to avoid any unnecessary platform dependencies and make
> the library as portable as possible.
>
> One question I'm facing is what kind of representation to use for the
> multilingual strings in the public API of the library. Internally, the
> library reads some binary data containing UTF-8 strings, so the obvious
> answer would be for the public library functions to accept and return
> strings in a standard unicode format, either UTF-8 or UTF-32.
>
> But this is not very C-ish. Since C has standard ways to represent
> multilingual strings, it's more convenient for the API to use those
> standard ways rather than introducing yet another string representation
> type. I thought.
>
> So I considered the options. Multibyte strings are not a viable choice,
> since their encoding is locale-dependent. If the library communicated
> via multibyte strings, then the locale would have to be set to something
> that made it possible to represent all the strings that the library had
> to deal with.
>
> But a library cannot make requirements on the global locale: libraries
> should be components that can be plugged together, and if they begin to
> make any requirements on the locale, then they cannot be used together
> if the requirements conflict.
>
> I cannot understand why C still only has a global locale. C++ came up
> with first-class locales ages ago, and surely nowadays everyone should
> know that anything global wreaks havoc on interoperability and
> re-entrancy.
>
> So I looked at wchar_t. If __STDC_ISO_10646__ is defined and wchar_t
> represents a unicode code point, this would be just perfect. But that's
> not the case on all platforms. But that's okay, I thought, as long as I
> can (with some platform-dependent magic) convert between unicode code
> points and wchar_t.
>
> On Windows, it turns out, wchar_t represents a UTF-16 code unit, so a
> code point can require two wchar_t's. That's ugly (and makes <wctype.h>
> useless), but not very crucial for my purposes. The important thing is
> that sequences of code points can still be encoded to and from wide
> _strings_. I could have lived with this.
>
> But then I found out about the killer: on FreeBSD (and Solaris?) the
> encoding used by wchar_t is locale-dependent! That is, a single wchar_t
> can represent any code point supported by the current locale, but the
> same wchar_t value may be used to represent different code points in
> different locales. So adopting wchar_t as the representation type would
> again make the capabilities of the library dependent on the current
> locale, which might be constrained by other parts of the application.
> (Also, the locale-dependent wchar_t encodings are quite undocumented, so
> the required platform-dependent magic would be magic indeed.)
>
> To recap: C's multibyte strings are in a locale-dependent, possibly
> variable-width encoding. On Windows, the wchar_t string encoding is
> variable-width, on FreeBSD and Solaris it is locale-dependent. So for
> portable C code, wchar_t doesn't provide any advantages over multibyte
> strings.
>
> So screw it all, I'll just use UTF-32 like I should have from the
> beginning.
>
>
> Lauri
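
The Windows point in the quoted post (a code point can require two wchar_t's) follows directly from how UTF-16 works. A minimal sketch, using `uint16_t` as a stand-in for a 16-bit wchar_t; the function name `utf16_encode` is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode scalar value as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written,
   or 0 if cp is not a valid scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;            /* BMP: one code unit */
        return 1;
    }
    cp -= 0x10000;                        /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

U+611B fits in one unit, but anything above U+FFFF becomes a surrogate pair, which is why per-wchar_t classification functions cannot see such characters as a whole.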


C strings are too slow for many applications nowadays, because by default they carry no length tag and are only marked by a terminator. That design was good for teaching pointers, and it opened the door to mixing assembly and C on many platforms, back when those platforms first had to support a C compiler a long, long time ago!
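
A rough sketch of the difference, assuming a hypothetical `struct lstr` (not a standard C type) that carries the length next to the pointer:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-tagged view of a byte string. */
struct lstr {
    const char *data;
    size_t len;
};

/* Build a view from a terminator-marked C string.
   This is the one place that pays for the O(n) scan. */
struct lstr lstr_from_cstr(const char *s)
{
    assert(s != NULL);
    struct lstr r = { s, strlen(s) };
    return r;
}

/* O(1): the length is stored, so no search for a terminator. */
size_t lstr_length(struct lstr s)
{
    return s.len;
}

/* Unequal lengths are rejected before any bytes are compared. */
int lstr_equal(struct lstr a, struct lstr b)
{
    return a.len == b.len &&
           (a.len == 0 || memcmp(a.data, b.data, a.len) == 0);
}
```

With a plain `char *`, every length query is another `strlen` walk over the whole string.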


 
 
Rui Maciel
Guest
Posts: n/a
 
      11-25-2011
Joe keane wrote:

> No one should work on I18N unless they're Finnish, Hungarian, or Japanese.


Why is that?


Rui Maciel
 
 
Jorgen Grahn
Guest
Posts: n/a
 
      11-27-2011
On Thu, 2011-11-24, Keith Thompson wrote:
> (Joe keane) writes:
>> No one should work on I18N unless they're Finnish, Hungarian, or Japanese.

>
> Do you mean that speakers of those languages have additional insight
> into internationalization issues? If so, you have a point, but
> it's vastly overstated (deliberately so for effect, I presume).


AFAICT, the only thing special about Finland compared to the rest of
the Latin-1 world is that Finnish has very long words.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
 
 
 
 