C++ - Understanding UNICODE

 
Old 11-20-2009, 05:43 PM   #1
Default Understanding UNICODE


Hi there,

I am trying to understand Unicode in C++, but I fear this is really
something I do not understand. Where can I find good documentation
regarding portability (I am targeting UNIX/gcc and Win32/cl)? In
particular, I'd like to know how I can open a std::ifstream when user
input is Unicode.

Does the following line make any sense (I know this is not legal)?

const char alpha[] = "á.dcm";

Is there a way to say, when I share my C++ file, that my file is in
UTF-8 ?

Thanks,


mathieu
Old 11-20-2009, 05:58 PM   #2
Victor Bazarov
 
Posts: n/a
Default Re: Understanding UNICODE
mathieu wrote:
> I am trying to understand UNICODE in C++, but I fear this is really
> something I do not understand. Where can I find good documentation
> regarding portability (I am targeting UNIX/gcc and Win32/cl) ? Esp.
> I'd like know how I can open a std::ifstream when user input is
> UNICODE.
>
> Does the following line makes any sense (I know this is not legal) ?
>
> const char alpha[] = "á.dcm";
>
> Is there a way to say, when I share my C++ file, that my file is in
> UTF-8 ?


Here is a 30000ft view.

Unicode is a way to interpret the sequence of octets. Any file contains
octets (essentially) when stored on a device. Now, the meaning those
octets have depends on the program that writes or reads them. Unicode
is just a convention. Just like ASCII is a convention. If you take a
JPG file (just another convention) and try reading it as ASCII, you're
not likely to get much from it.

So, if you have a file written by some program using the Unicode
convention (one of its predefined encoding schemes), you can try
extracting the information from that file using the same convention. If
the information you manage to extract (and the process is essentially an
interpretation of the stored data) makes sense to you, you can call that
file a Unicode file with encoding <blah>.

Hard-coding Unicode symbols in C++ source code is possible. You need to
use the \u or \U notation (universal character names) for that.
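As a minimal sketch of that notation (the names a_acute, alpha_utf8 and check_literals are made up for illustration): \u takes four hex digits and \U takes eight. In a char32_t literal the value is the code point itself. In a narrow string literal the resulting bytes depend on the compiler's execution character set, so the sketch spells the UTF-8 bytes out with hex escapes instead, assuming UTF-8 is the encoding you want in the string.

```cpp
#include <cassert>
#include <cstring>

// U+00E1 LATIN SMALL LETTER A WITH ACUTE as a universal character name.
// \U000000E1 would be the eight-digit spelling of the same character.
const char32_t a_acute = U'\u00E1';

// The same character followed by ".dcm", with the UTF-8 bytes written
// explicitly so the result does not depend on the execution character set.
const char alpha_utf8[] = "\xc3\xa1.dcm";

// Illustrative check only.
void check_literals() {
    assert(a_acute == 0xE1);               // the code point value
    assert(std::strlen(alpha_utf8) == 6);  // 2 bytes for the letter, 4 for ".dcm"
}
```

Writing "\u00E1.dcm" in a plain char literal is also allowed, but which bytes end up in the array is then up to the compiler and its settings.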

Documentation? Well, have you tried your local library or a bookstore?

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask


Old 11-21-2009, 03:01 AM   #3
dragan
 
Posts: n/a
Default Re: Understanding UNICODE
mathieu wrote:
> Hi there,
>
> I am trying to understand UNICODE in C++, but I fear this is really
> something I do not understand. Where can I find good documentation
> regarding portability (I am targeting UNIX/gcc and Win32/cl) ?


try:

site.icu-project.org




Old 11-21-2009, 05:50 AM   #4
White Wolf
 
Posts: n/a
Default Re: Understanding UNICODE
Victor Bazarov wrote:
> Documentation? Well, have you tried your local library or a bookstore?


Do you have any particular book in mind? It seems I will have to write
code that should deal with Simplified Chinese text in UTF-8. I mean
something that covers the C++ aspects of such things. BTW it does not
have to be a book; a website with examples or a tutorial would be great
as well. I have never had to do anything like that, so I wouldn't even
know if I have to start using wide strings and streams or what...

I guess as long as I just copy them around and display it on the screen,
there is no need for anything special. I assume I get into trouble if I
want to change them, or ask for the size...
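To illustrate the point about size, here is a small sketch (count_code_points is a hypothetical helper, and the input is assumed to be well-formed UTF-8). std::string::size() reports bytes, while the number of characters has to be counted by skipping continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold well-formed UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Illustrative check: two Chinese characters, U+4E2D and U+6587.
void example() {
    const std::string text = "\xe4\xb8\xad\xe6\x96\x87";
    assert(text.size() == 6);              // bytes
    assert(count_code_points(text) == 2);  // code points
}
```

Copying and printing such a string works on the bytes and needs nothing special. Truncating it at an arbitrary byte offset can cut a character in half.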

--
BR, WW


Old 11-21-2009, 09:53 AM   #5
mathieu
 
Posts: n/a
Default Re: Understanding UNICODE
On Nov 21, 12:35 am, "Daniel T." <danie...@earthlink.net> wrote:
> > Esp. I'd like know how I can open a std::ifstream when user input is
> > UNICODE.

>
> Open the ifstream as binary. You will also have to decide what unicode
> encodings you are willing to accept, there are something like 5
> different encodings (UTF-8, UTF-16BE/LE, UTF-32BE/LE.)


Oops, I realize my question was poorly formulated. I meant to say:
how do I specify the filename? The C++ standard only accepts
std::ifstream::open(const char *). But I have been told that on
Windows, you also have access to std::ifstream::open(const wchar_t *).

Reading the other post, it looks like if my pet project wants to be
portable I need to offer a UTF-8 interface to specify the filename on
Linux, while I should offer a wchar_t/UTF-16 interface on Win32
machines.
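One way this split could look, as a rough sketch rather than a recommendation: a single UTF-8 interface for callers, with the conversion to UTF-16 hidden on Windows. The names open_utf8 and utf8_to_wide are hypothetical. MultiByteToWideChar is the Win32 conversion function, and the wchar_t overload of open() is assumed to be available as a library extension (it is in Visual C++).

```cpp
#include <cassert>
#include <fstream>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Converts UTF-8 to UTF-16. Returns an empty string if the input is
// empty or is not valid UTF-8.
static std::wstring utf8_to_wide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                        utf8.data(), static_cast<int>(utf8.size()),
                                        nullptr, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &wide[0], len);
    return wide;
}
#endif

// Hypothetical wrapper: the caller always passes a UTF-8 file name.
bool open_utf8(std::ifstream& in, const std::string& utf8_path) {
#ifdef _WIN32
    // The wchar_t overload is a library extension, not standard C++.
    in.open(utf8_to_wide(utf8_path).c_str(), std::ios::in | std::ios::binary);
#else
    // File names are passed to the system as plain bytes here.
    in.open(utf8_path.c_str(), std::ios::in | std::ios::binary);
#endif
    return in.is_open();
}

// Illustrative check: a file that does not exist cannot be opened.
void example() {
    std::ifstream in;
    assert(!open_utf8(in, "no_such_dir/\xc3\xa1.dcm"));
}
```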

Thanks,


mathieu
Old 11-22-2009, 10:39 AM   #6
James Kanze
 
Posts: n/a
Default Re: Understanding UNICODE
On Nov 21, 12:35 am, "Daniel T." <danie...@earthlink.net> wrote:
> mathieu <mathieu.malate...@gmail.com> wrote:
> > I am trying to understand UNICODE in C++, but I fear this is
> > really something I do not understand. Where can I find good
> > documentation regarding portability (I am targeting UNIX/gcc
> > and Win32/cl) ?


> Consider http://unicode.org/


I found _Fonts and Encodings_, by Yannis Haralambous, excellent
for explaining most of the issues, but it may be more than you
need, if all you want is to read files written in Unicode.

> > Esp. I'd like know how I can open a std::ifstream when user
> > input is UNICODE.


> Open the ifstream as binary. You will also have to decide what
> unicode encodings you are willing to accept, there are
> something like 5 different encodings (UTF-8, UTF-16BE/LE,
> UTF-32BE/LE.)


The canonical solution is to use a wifstream, and imbue it with
a locale which supports the encoding format you're using. If
the encoding format is UTF-8, however, I rather agree with you:
open in binary (and even that may not be necessary), ensure that
the file is imbued with the "C" locale, or at least that the
codecvt facet comes from the "C" locale, and just read the
UTF-8.
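A minimal sketch of that second approach, with hypothetical helper names open_raw and read_all: the "C" locale is imbued before the file is opened, the file is opened in binary mode, and the UTF-8 bytes are read unchanged into a std::string.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Opens a file so that bytes reach the program untouched: the "C" locale
// is imbued before opening, and binary mode disables newline translation.
bool open_raw(std::ifstream& in, const char* path) {
    in.imbue(std::locale::classic());
    in.open(path, std::ios::in | std::ios::binary);
    return in.is_open();
}

// Reads everything that is left in the stream into a string of bytes.
std::string read_all(std::istream& in) {
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Illustrative check on an in-memory stream holding C3 A9 74 C3 A9 0D 0A.
void example() {
    std::istringstream in("\xc3\xa9t\xc3\xa9\r\n");
    assert(read_all(in).size() == 7);
}
```

With a real file the two helpers would be combined: call open_raw on a std::ifstream, then read_all on it if the open succeeded.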

> > Does the following line makes any sense (I know this is not legal) ?


> > const char alpha[] = "á.dcm";


> The above (AFAIK) is not defined in the language, but it may
> be acceptable in some compilers. The "middle dot" character is
> code 00B7 or in UTF-8 C2 B7. You could do something like this:


> const unsigned char alpha[] = "\xc2\xb7.dcm";


That would be "\u00C2\u00B7.dcm" (except that I don't see any
middle dot---I see a "LATIN SMALL LETTER A WITH GRAVE", Unicode
0xE0). And even that isn't guaranteed; until we know what he
actually wants in the string (in terms of encoding), we can't
really make any concrete recommendations.

> (Note, I strongly suggest you use an unsigned representation
> when working with unicode to avoid sign extension problems.
> Either unsigned char, or some unsigned integral type that is 2
> bytes long.)


I'd stick with char. Using unsigned char sounds logical, and
does avoid one or two minor issues, but in practice, it will
cause no end of problems with type checking, since nothing else
expects unsigned char. But again, a lot depends on what you are
doing with the text.
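One possible compromise, sketched here with a hypothetical to_hex helper: keep the text in plain char and std::string, and convert to unsigned char only at the point where a byte's numeric value is used. That avoids the sign extension problem without changing the type of the string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Formats each byte as two uppercase hex digits separated by spaces.
// The cast to unsigned char comes first: where plain char is signed,
// the byte 0xC2 would otherwise be a negative value and a bad index.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

// Illustrative check using the literal quoted above.
void example() {
    assert(to_hex("\xc2\xb7.dcm") == "C2 B7 2E 64 63 6D");
}
```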

> > Is there a way to say, when I share my C++ file, that my
> > file is in UTF-8 ?


> Not as far as the standard is concerned, but particular
> compilers may have some means of doing so.


The standard provides the codecvt facet for handling different
input and output encodings. Whether this helps in your
situation depends; I've found it useful in some cases, less so
in others.

Note that UTF-8 is a multi-byte encoding, which means that a lot
(most?) of the traditional text manipulations won't work with
it. (I ended up designing my own UTF-8 iterators, so that ++
advanced a character, and not just a byte, * returned an
uint32_t with the Unicode encoding, and the iterator itself had
begin and end functions which returned a char iterator over just
the single character. I also ended up implementing things like
islower which took the two char iterators, but they're typically
less essential---in a lot of applications, the only UTF-8
characters which interest you in particular are the ones common
with ASCII.)
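The iterators themselves are not shown in this thread. As a rough sketch of the decoding step such an iterator would need (next_code_point and decode are hypothetical names, and the input is assumed to be well-formed UTF-8):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decodes the code point starting at s[pos] and advances pos past it.
// Assumes well-formed UTF-8; a production decoder would also reject
// truncated, overlong and surrogate sequences.
char32_t next_code_point(const std::string& s, std::size_t& pos) {
    const unsigned char lead = static_cast<unsigned char>(s[pos++]);
    int extra;
    char32_t cp;
    if (lead < 0x80)      { extra = 0; cp = lead; }         // 0xxxxxxx
    else if (lead < 0xE0) { extra = 1; cp = lead & 0x1F; }  // 110xxxxx
    else if (lead < 0xF0) { extra = 2; cp = lead & 0x0F; }  // 1110xxxx
    else                  { extra = 3; cp = lead & 0x07; }  // 11110xxx
    while (extra-- > 0 && pos < s.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    }
    return cp;
}

// Decodes a whole string, one code point per step.
std::vector<char32_t> decode(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t pos = 0; pos < s.size(); ) {
        out.push_back(next_code_point(s, pos));
    }
    return out;
}

// Illustrative check: the bytes C3 A9 74 C3 A9 are three code points.
void example() {
    const std::vector<char32_t> cps = decode("\xc3\xa9t\xc3\xa9");
    assert(cps.size() == 3);
    assert(cps[0] == 0xE9 && cps[1] == 0x74 && cps[2] == 0xE9);
}
```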

--
James Kanze


Old 11-22-2009, 10:57 AM   #7
James Kanze
 
Posts: n/a
Default Re: Understanding UNICODE
On Nov 21, 1:38 am, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
> mathieu <mathieu.malate...@gmail.com> wrote in news:8e3fda98-b476-4d82-
> af39-ee2e6229b...@x31g2000yqx.googlegroups.com:


> > Hi there,


> > I am trying to understand UNICODE in C++, but I fear this is
> > really something I do not understand. Where can I find good
> > documentation regarding portability (I am targeting UNIX/gcc
> > and Win32/cl) ? Esp. I'd like know how I can open a
> > std::ifstream when user input is UNICODE.


> > Does the following line makes any sense (I know this is
> > not legal) ?


> > const char alpha[] = "á.dcm";


> > Is there a way to say, when I share my C++ file, that my
> > file is in UTF-8 ?


> Not generally, but some implementations may support it. For
> example, current Linux implementations use UTF-8 encoding as a
> default locale, which is supported by many Linux applications
> (but not by the Linux kernel itself, which does not care about
> the encodings AFAIK).


> Still, I would keep Unicode strings out of the source code for
> now for portability, and put them into data files instead. The
> data files would be read by the programs knowing more about
> their encoding.


> As you mentioned Windows, it would be handy to know that
> Windows does not support UTF-8 locales.


Really? I've used them under Windows, with no problems.

More accurately: locales do not support UTF-8, or any multibyte
encodings, in general---they are designed with the idea that
everything internal is single byte, and that large character
sets would be handled by a wchar_t (which must be at least 21
bits to handle full Unicode). The only place the encoding
enters into play is in the codecvt facet (or in the single byte
encodings for char---islower will depend on the encoding, for
example---but these do not work for multibyte encodings). So
depending on what you are doing, there are several alternatives:

-- Use wchar_t and imbue your input and output with a locale
which has a UTF-8 codecvt facet. This will work on systems
which have a wchar_t which supports Unicode, or at least all
of the Unicode characters which are of interest to you.
(This is the case for Linux, for example, and some locales
of Solaris. It's also the case for Windows IF you don't
need any characters outside of the basic encoding plane; if,
for example, you're really only concerned with European
languages.)

-- Use char, imbue input and output with the "C" locale, at
least for the codecvt facet, and open them in binary (which
means you can't read standard in or write to standard
error). Do the rest yourself. This is what works best for
me most of the time: typically, I only need to iterate over
the strings, looking for specific characters which are all
single byte in UTF-8, and UTF-8 was designed with support
for this as a goal. It does mean that you don't have
functions like isupper, but if you don't need them, this is
clearly a good solution.
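The second alternative can be sketched as follows (extension_of is a hypothetical helper). Looking for a single-byte character such as '.' works on the raw bytes, because every byte of a multi-byte UTF-8 sequence has its high bit set and so never compares equal to an ASCII character.

```cpp
#include <cassert>
#include <string>

// Returns the part of a UTF-8 file name after the last '.', or an empty
// string if there is none. The search is done byte by byte.
std::string extension_of(const std::string& utf8_name) {
    const std::string::size_type dot = utf8_name.rfind('.');
    return dot == std::string::npos ? std::string()
                                    : utf8_name.substr(dot + 1);
}

// Illustrative check with a non-ASCII base name.
void example() {
    assert(extension_of("\xc3\xa1.dcm") == "dcm");
}
```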

> This means that if the strings in the program are internally
> UTF-8, then you have to translate them back and forth all the
> time you are calling Windows SDK functions. The good news is
> that Windows fully supports Unicode, but only in UTF-16
> encoding, with the wchar_t/*W versions of the SDK functions.
> Linux on the other hand does not support UTF-16, all the SDK
> interface functions are defined in terms of char only.


Yes. The interface with the system and other software you might
be using must also be considered in your choice.

--
James Kanze


Old 11-23-2009, 09:33 AM   #8
James Kanze
 
Posts: n/a
Default Re: Understanding UNICODE
On Nov 22, 2:41 pm, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
> James Kanze <james.ka...@gmail.com> wrote in news:2a1ef89e-14dd-4c4d-bb68-
> 2f5f8dae9...@b15g2000yqd.googlegroups.com:


> > On Nov 21, 1:38 am, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
> >> As you mentioned Windows, it would be handy to know that
> >> Windows does not support UTF-8 locales.


> > Really? I've used them under Windows, with no problems.


> It seems I used the term "locale" too sloppily. What I meant is that
> Windows has built-in support for 8-bit based (potentially multibyte)
> encodings, which it calls "active codepage" (ACP). All SDK functions
> having string parameters come in two variants - narrow and wide, and
> narrow strings are automatically converted to wide strings (UTF-16) by
> Windows SDK. It would be logical to set this 8-bit codepage setting to
> UTF-8 and let Windows do all the translation transparently. However,
> when I did an experiment and set the ACP and OEMCP values in the
> registry to 65001 (UTF-8), Windows did not boot up any more.
> That's what I called "not supported".


I didn't fiddle with the registry, but in fact: after executing chcp
65001 in a command window, the output from a program which outputs
C3 A9 74 C3 A9 0D 0A disappears completely---not even the 't' appears.
There's definitely something wrong there. (This was under Windows XP
Professional. Perhaps under some later version...)
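The test program itself was not posted. A sketch of what such a program might do (sample_line and write_sample are hypothetical names, and SetConsoleOutputCP is used here as a programmatic stand-in for typing chcp 65001):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// "ete" with acute accents in UTF-8, followed by a newline. On Windows a
// text-mode stdout expands '\n' to 0D 0A, which gives the byte sequence
// C3 A9 74 C3 A9 0D 0A.
std::string sample_line() { return "\xc3\xa9t\xc3\xa9\n"; }

// Writes the sample to stdout and returns the number of bytes handed to
// fwrite.
std::size_t write_sample() {
#ifdef _WIN32
    // Whether the console then renders the bytes correctly depends on
    // the Windows version.
    SetConsoleOutputCP(CP_UTF8);
#endif
    const std::string line = sample_line();
    return std::fwrite(line.data(), 1, line.size(), stdout);
}

// Illustrative check: 2 + 1 + 2 bytes of text plus the newline.
void example() {
    assert(sample_line().size() == 6);
}
```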

> It appears you are considering more the file content encoding. This is
> another issue, not directly related to OS, and C++ locales might
> probably sometimes be helpful here.


I was considering principally file content encoding; that is really
the only thing C++ locales address (with regards to encoding). I've
had no problem reading and writing UTF-8 files under Windows, once the
correct locale was installed. I've also had no problem using UTF-8
internally, but reading and writing files with some different encoding,
again using the appropriate locales. What locales don't (and can't)
address, of course, is what the system does with the bytes if they
reach someplace where the system interprets them (e.g. the display
buffer of a console window).

--
James Kanze


Old 11-23-2009, 11:07 AM   #9
Alf P. Steinbach
 
Posts: n/a
Default Re: Understanding UNICODE
* James Kanze:
>
> I didn't fiddle with the registry, but in fact: after executing chcp
> 65001 in a command window, the output from a program which outputs
> C3 A9 74 C3 A9 0D 0A disappears completely---not even the 't' appears.
> There's definitely something wrong there. (This was under Windows XP
> professional. Perhaps under some later version...)


In Windows XP program invocation commands issued in the command interpreter with
codepage 65001 generally have no effect: the programs don't even appear to be run.

So the support for UTF-8 is definitely very much broken.


Cheers & hth.,

- Alf


Old 11-23-2009, 09:29 PM   #10
Victor Bazarov
 
Posts: n/a
Default Re: Understanding UNICODE
White Wolf wrote:
> Victor Bazarov wrote:
>> Documentation? Well, have you tried your local library or a bookstore?

>
> Do you have any particular book in mind?


No. It just seems like too well explored a topic at this point for
there to be no books at all explaining its various aspects. I am
fairly certain something exists, perhaps even in the form of a website.

> [..]


V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask

