#1 |
|
Hi there,
I am trying to understand UNICODE in C++, but I fear this is really something I do not understand. Where can I find good documentation regarding portability (I am targeting UNIX/gcc and Win32/cl)? Esp. I'd like to know how I can open a std::ifstream when user input is UNICODE.

Does the following line make any sense (I know this is not legal)?

const char alpha[] = "á.dcm";

Is there a way to say, when I share my C++ file, that my file is in UTF-8?

Thanks,
mathieu |
|
|
|
|
#2 |
|
|
mathieu wrote:
> I am trying to understand UNICODE in C++, but I fear this is really
> something I do not understand. Where can I find good documentation
> regarding portability (I am targeting UNIX/gcc and Win32/cl) ? Esp.
> I'd like to know how I can open a std::ifstream when user input is
> UNICODE.
>
> Does the following line make any sense (I know this is not legal) ?
>
> const char alpha[] = "á.dcm";
>
> Is there a way to say, when I share my C++ file, that my file is in
> UTF-8 ?

Here is a 30000ft view.

Unicode is a way to interpret a sequence of octets. Any file contains octets (essentially) when stored on a device. What those octets mean depends on the program that writes or reads them. Unicode is just a convention, just like ASCII is a convention. If you take a JPG file (just another convention) and try reading it as ASCII, you're not likely to get much from it.

So, if you have a file written by some program using the Unicode convention (one of its predefined encoding schemes), you can try extracting the information from that file using the same convention. If the information you manage to extract (and the process is essentially an interpretation of the stored data) makes sense to you, you can call that file a Unicode file with encoding <blah>.

Hard-coding Unicode symbols in C++ source code is possible. You need to use the \u or \U notation (universal character names) for that.

Documentation? Well, have you tried your local library or a bookstore?

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask

Victor Bazarov |
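To make the remark about hard-coding Unicode symbols concrete, here is a minimal sketch. It assumes the character wanted is U+00E1 (á) and that the string should hold UTF-8. The two UTF-8 bytes are spelled out with hex escapes, which sidesteps the question of how the source file itself is encoded. `hex_dump` is a hypothetical helper written for this illustration, not a library function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+00E1 (LATIN SMALL LETTER A WITH ACUTE) written as its two UTF-8
// bytes, C3 A1, so the result does not depend on the source encoding.
const char alpha[] = "\xc3\xa1.dcm";

// Hypothetical helper: returns the bytes of s as space-separated hex.
std::string hex_dump(const std::string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}
```

A universal character name such as "\u00e1" is the other option, but the bytes it produces in a narrow string depend on the compiler's execution character set, so the hex form is the more predictable of the two when UTF-8 is specifically what is wanted.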
|
|
|
#3 |
|
|
mathieu wrote:
> Hi there,
>
> I am trying to understand UNICODE in C++, but I fear this is really
> something I do not understand. Where can I find good documentation
> regarding portability (I am targeting UNIX/gcc and Win32/cl) ?

try: site.icu-project.org

dragan |
|
|
|
#4 |
|
|
Victor Bazarov wrote:
> Documentation? Well, have you tried your local library or a bookstore?

Do you have any particular book in mind? It seems I will have to write code that deals with Simplified Chinese text in UTF-8. I mean something that covers the C++ aspects of such things. BTW it does not have to be a book; a website with examples or a tutorial would be great as well.

I have never had to do anything like that, so I wouldn't even know whether I have to start using wide strings and streams or what... I guess as long as I just copy the strings around and display them on the screen, there is no need for anything special. I assume I get into trouble if I want to change them, or ask for the size...

--
BR, WW

White Wolf |
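The worry about asking for the size can be illustrated with a small sketch. It assumes the text is well-formed UTF-8 held in a plain std::string; `utf8_code_points` is a hypothetical name chosen for this example. `size()` reports bytes, while the number of code points is the number of bytes that are not continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold well-formed UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new code point.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the two Chinese characters U+4E2D U+6587, stored as the six bytes E4 B8 AD E6 96 87, `size()` gives 6 and `utf8_code_points` gives 2. Copying and printing such a string needs nothing special; truncating or indexing it by byte offset is where care is needed.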
|
|
|
#5 |
|
|
On Nov 21, 12:35 am, "Daniel T." <danie...@earthlink.net> wrote:
> > Esp. I'd like to know how I can open a std::ifstream when user input is
> > UNICODE.
>
> Open the ifstream as binary. You will also have to decide what unicode
> encodings you are willing to accept, there are something like 5
> different encodings (UTF-8, UTF-16BE/LE, UTF-32BE/LE.)

Oops, I realize my question was poorly formulated. I meant to say: how do I specify the filename? The C++ standard only accepts std::ifstream: Windows, you also have access to std::ifstream:

Reading the other post, it looks like if my pet project wants to be portable, I need to offer a UTF-8 interface for specifying the filename on Linux, while I should offer a wchar_t/UTF-16 interface on Win32 machines.

Thanks,
mathieu |
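One way to reconcile the two interfaces is to accept UTF-8 everywhere and convert only at the Windows boundary. The following is a sketch under stated assumptions, not a definitive implementation: `open_utf8` is a hypothetical helper, the Windows branch relies on the Win32 function MultiByteToWideChar and on the wchar_t overload of `open` that Microsoft's standard library provides as an extension (other Windows toolchains may lack it), and the other branch assumes the file system accepts UTF-8 byte strings as names, as is usual on Linux.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: opens, in binary mode, a file whose name is
// given as UTF-8 bytes. Returns true if the stream is open afterwards.
bool open_utf8(std::ifstream& in, const std::string& utf8_name)
{
    assert(!in.is_open());  // precondition: the stream is not yet in use
#ifdef _WIN32
    // First call asks for the required length, including the final NUL.
    const int n = MultiByteToWideChar(CP_UTF8, 0, utf8_name.c_str(), -1,
                                      NULL, 0);
    if (n <= 0) return false;
    std::wstring wide(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8_name.c_str(), -1, &wide[0], n);
    // wchar_t overload: a Microsoft library extension, not standard C++.
    in.open(wide.c_str(), std::ios::in | std::ios::binary);
#else
    // The bytes are passed through unchanged.
    in.open(utf8_name.c_str(), std::ios::in | std::ios::binary);
#endif
    return in.is_open();
}
```

With this shape the rest of the program only ever sees UTF-8 in `std::string`, and the platform difference stays inside one function.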
|
|
|
#6 |
|
|
On Nov 21, 12:35 am, "Daniel T." <danie...@earthlink.net> wrote:
> mathieu <mathieu.malate...@gmail.com> wrote:
> > I am trying to understand UNICODE in C++, but I fear this is
> > really something I do not understand. Where can I find good
> > documentation regarding portability (I am targeting UNIX/gcc
> > and Win32/cl) ?

> Consider http://unicode.org/

I found _Fonts and Encodings_, by Yannis Haralambous, excellent for explaining most of the issues, but it may be more than you need, if all you want is to read files written in Unicode.

> > Esp. I'd like to know how I can open a std::ifstream when user
> > input is UNICODE.

> Open the ifstream as binary. You will also have to decide what
> unicode encodings you are willing to accept, there are
> something like 5 different encodings (UTF-8, UTF-16BE/LE,
> UTF-32BE/LE.)

The canonical solution is to use a wifstream, and imbue it with a locale which supports the encoding format you're using. If the encoding format is UTF-8, however, I rather agree with you: open in binary (and even that may not be necessary), ensure that the file is imbued with the "C" locale, or at least that the codecvt facet comes from the "C" locale, and just read the UTF-8.

> > Does the following line make any sense (I know this is not legal) ?
> > const char alpha[] = "á.dcm";

> The above (AFAIK) is not defined in the language, but it may
> be acceptable in some compilers. The "middle dot" character is
> code 00B7 or in UTF-8 C2 B7. You could do something like this:

> const unsigned char alpha[] = "\xc2\xb7.dcm";

That would be "\u00B7.dcm" (except that I don't see any middle dot---I see a "LATIN SMALL LETTER A WITH ACUTE", Unicode 0xE1). And even that isn't guaranteed; until we know what he actually wants in the string (in terms of encoding), we can't really make any concrete recommendations.

> (Note, I strongly suggest you use an unsigned representation
> when working with unicode to avoid sign extension problems.
> Either unsigned char, or some unsigned integral type that is 2
> bytes long.)

I'd stick with char. Using unsigned char sounds logical, and does avoid one or two minor issues, but in practice, it will cause no end of problems with type checking, since nothing else expects unsigned char. But again, a lot depends on what you are doing with the text.

> > Is there a way to say, when I share my C++ file, that my
> > file is in UTF-8 ?

> Not as far as the standard is concerned, but particular
> compilers may have some means of doing so.

The standard provides the codecvt facet for handling different input and output encodings. Whether this helps in your situation depends; I've found it useful in some cases, less so in others.

Note that UTF-8 is a multi-byte encoding, which means that a lot (most?) of the traditional text manipulations won't work with it. (I ended up designing my own UTF-8 iterators, so that ++ advanced a character, and not just a byte, * returned a uint32_t holding the Unicode code point, and the iterator itself had begin and end functions which returned a char iterator over just the single character. I also ended up implementing things like islower which took the two char iterators, but they're typically less essential---in a lot of applications, the only UTF-8 characters which interest you in particular are the ones common with ASCII.)

--
James Kanze

James Kanze |
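The iterator idea described above, where one step advances a whole character and dereferencing yields a 32-bit code point, can be approximated with a pair of free functions. This is only a sketch of the technique, not the poster's actual iterator class: `next_code_point` and `decode_utf8` are hypothetical names, and the code assumes well-formed UTF-8 (it does not reject overlong forms, surrogates, or stray continuation bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes the code point that starts at s[pos] and advances pos past
// it. Assumes well-formed UTF-8; no validation is attempted.
std::uint32_t next_code_point(const std::string& s, std::size_t& pos)
{
    assert(pos < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[pos++]);
    std::uint32_t cp;
    int extra;  // number of continuation bytes that follow the lead byte
    if (lead < 0x80)      { cp = lead;        extra = 0; }  // 0xxxxxxx
    else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
    else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
    else                  { cp = lead & 0x07; extra = 3; }  // 11110xxx
    while (extra-- > 0 && pos < s.size()) {
        // Each continuation byte contributes its low six bits.
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    }
    return cp;
}

// Decodes a whole string into a sequence of code points.
std::vector<std::uint32_t> decode_utf8(const std::string& s)
{
    std::vector<std::uint32_t> out;
    for (std::size_t pos = 0; pos < s.size(); ) {
        out.push_back(next_code_point(s, pos));
    }
    return out;
}
```

Keeping the text in `char` and decoding only where a code point is really needed matches the advice to stick with char: the bytes stay compatible with everything that expects `std::string`.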
|
|
|
#7 |
|
|
On Nov 21, 1:38 am, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
> mathieu <mathieu.malate...@gmail.com> wrote in
> news:8e3fda98-b476-4d82-af39-ee2e6229b...@x31g2000yqx.googlegroups.com:
> > Hi there,
> > I am trying to understand UNICODE in C++, but I fear this is
> > really something I do not understand. Where can I find good
> > documentation regarding portability (I am targeting UNIX/gcc
> > and Win32/cl) ? Esp. I'd like to know how I can open a
> > std::ifstream when user input is UNICODE.
> > Does the following line make any sense (I know this is
> > not legal) ?
> > const char alpha[] = "á.dcm";
> > Is there a way to say, when I share my C++ file, that my
> > file is in UTF-8 ?

> Not generally, but some implementations may support it. For
> example, current Linux implementations use UTF-8 encoding as a
> default locale, which is supported by many Linux applications
> (but not by the Linux kernel itself, which does not care about
> the encodings AFAIK).

> Still, I would keep Unicode strings out of the source code for
> now for portability, and put them into data files instead. The
> data files would be read by the programs knowing more about
> their encoding.

> As you mentioned Windows, it would be handy to know that
> Windows does not support UTF-8 locales.

Really? I've used them under Windows, with no problems.

More accurately: locales do not support UTF-8, or any multibyte encodings, in general---they are designed with the idea that everything internal is single byte, and that large character sets would be handled by a wchar_t (which must be at least 21 bits to handle full Unicode). The only place the encoding enters into play is in the codecvt facet (or in the single byte encodings for char---islower will depend on the encoding, for example---but these do not work for multibyte encodings).

So depending on what you are doing, there are several alternatives:

-- Use wchar_t and imbue your input and output with a locale which has a UTF-8 codecvt facet. This will work on systems which have a wchar_t which supports Unicode, or at least all of the Unicode characters which are of interest to you. (This is the case for Linux, for example, and some locales of Solaris. It's also the case for Windows IF you don't need any characters outside of the Basic Multilingual Plane; if, for example, you're really only concerned with European languages.)

-- Use char, imbue input and output with the "C" locale, at least for the codecvt facet, and open them in binary (which means you can't read standard input or write to standard error). Do the rest yourself. This is what works best for me most of the time: typically, I only need to iterate over the strings, looking for specific characters which are all single byte in UTF-8, and UTF-8 was designed with support for this as a goal. It does mean that you don't have functions like isupper, but if you don't need them, this is clearly a good solution.

> This means that if the strings in the program are internally
> UTF-8, then you have to translate them back and forth all the
> time you are calling Windows SDK functions. The good news is
> that Windows fully supports Unicode, but only in UTF-16
> encoding, with the wchar_t/*W versions of the SDK functions.
> Linux on the other hand does not support UTF-16, all the SDK
> interface functions are defined in terms of char only.

Yes. The interface with the system and other software you might be using must also be considered in your choice.

--
James Kanze

James Kanze |
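The second alternative, plain char with the "C" locale and a search for single-byte characters, might look roughly like the sketch below. `read_bytes` and `split_ascii` are hypothetical names for this illustration. It relies on a property of UTF-8: every byte of a multi-byte sequence has its high bit set, so a search for an ASCII separator can never match in the middle of a character.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <locale>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>
#include <vector>

// Reads every remaining byte of the stream, untranslated. For a file
// stream, imbue the locale before opening it, and open it in binary.
std::string read_bytes(std::istream& in)
{
    in.imbue(std::locale::classic());  // the "C" locale
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Splits UTF-8 text on an ASCII separator. Safe because no byte below
// 0x80 ever occurs inside a multi-byte UTF-8 sequence.
std::vector<std::string> split_ascii(const std::string& utf8, char sep)
{
    assert(static_cast<unsigned char>(sep) < 0x80);
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type end = utf8.find(sep, start);
        if (end == std::string::npos) {
            fields.push_back(utf8.substr(start));
            return fields;
        }
        fields.push_back(utf8.substr(start, end - start));
        start = end + 1;
    }
}
```

Fed the bytes C3 A9 2C 74 2C C3 A9 through a std::istringstream, `split_ascii(read_bytes(in), ',')` yields three fields, and the two-byte sequences come back intact.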
|
|
|
#8 |
|
|
On Nov 22, 2:41 pm, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
> James Kanze <james.ka...@gmail.com> wrote in
> news:2a1ef89e-14dd-4c4d-bb68-2f5f8dae9...@b15g2000yqd.googlegroups.com:
> > On Nov 21, 1:38 am, Paavo Helde <myfirstn...@osa.pri.ee> wrote:
> >> As you mentioned Windows, it would be handy to know that
> >> Windows does not support UTF-8 locales.
> > Really? I've used them under Windows, with no problems.

> It seems I used the term "locale" too sloppily. What I meant was that
> Windows has built-in support for 8-bit based (potentially multibyte)
> encodings, which it calls "active codepage" (ACP). All SDK functions
> having string parameters come in two variants - narrow and wide, and
> narrow strings are automatically converted to wide strings (UTF-16) by
> Windows SDK. It would be logical to set this 8-bit codepage setting to
> UTF-8 and let Windows do all the translation transparently. However,
> when I did an experiment and set the ACP and OEMCP values in the
> registry to 65001 (UTF-
> That's what I called "not supported".

I didn't fiddle with the registry, but in fact: after executing chcp 65001 in a command window, the output from a program which outputs C3 A9 74 C3 A9 0D 0A disappears completely---not even the 't' appears. There's definitely something wrong there. (This was under Windows XP professional. Perhaps under some later version...)

> It appears you are considering more the file content encoding. This is
> another issue, not directly related to OS, and C++ locales might
> probably sometimes be helpful here.

I was considering principally file content encoding; that is really the only thing C++ locales address (with regards to encoding). I've had no problem reading and writing UTF-8 files under Windows, once the correct locale was installed. I've also had no problem using UTF-8 internally, but reading and writing files with some different encoding, again using the appropriate locales.

What locales don't (and can't) address, of course, is what the system does with the bytes if they reach someplace where the system interprets them (e.g. the display buffer of a console window).

--
James Kanze

James Kanze |
|
|
|
#9 |
|
|
* James Kanze:
> I didn't fiddle with the registry, but in fact: after executing chcp
> 65001 in a command window, the output from a program which outputs
> C3 A9 74 C3 A9 0D 0A disappears completely---not even the 't' appears.
> There's definitely something wrong there. (This was under Windows XP
> professional. Perhaps under some later version...)

In Windows XP, program invocation commands issued in the command interpreter with codepage 65001 generally have no effect: the programs don't even appear to be run. So the support for UTF-8 is definitely very much broken.

Cheers & hth.,
- Alf

Alf P. Steinbach |
|
|
|
#10 |
|
|
White Wolf wrote:
> Victor Bazarov wrote:
>> Documentation? Well, have you tried your local library or a bookstore?
>
> Do you have any particular book in mind?

No. It just seems like too well explored a topic at this point for there to be no books at all explaining its various aspects. I am fairly certain something exists, perhaps even in the form of a website.

> [..]

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask

Victor Bazarov |
|