Unicode character in C++

 
 
liveshell
 
      04-07-2008
Hi all,
In my application, I read a file and store its contents in an
array of char. The data is normally ASCII, but in certain situations I
get Unicode characters (or, let's say, junk characters). How can I tell
whether the data is plain ASCII or Unicode?

Thanks,
LiveShell
 
 
 
 
 
Michael DOUBEZ
 
      04-07-2008
liveshell wrote:
> Hi all,
> In my application, I read a file and store its contents in an
> array of char. The data is normally ASCII, but in certain situations I
> get Unicode characters (or, let's say, junk characters). How can I tell
> whether the data is plain ASCII or Unicode?


Supposing your junk is UTF-8, you have to look for bytes whose MSB is
equal to 1. This is how it is done in UTF-8: bytes 0-127 are the
historical ASCII characters, and in a multi-byte sequence the number of
leading ones in the first byte gives the total number of bytes in the
encoding:
US-ASCII: 0xxxxxxx
2 bytes: 110xxxxx xxxxxxxx
3 bytes: 1110xxxx xxxxxxxx xxxxxxxx
4 bytes: 11110xxx xxxxxxxx xxxxxxxx xxxxxxxx
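
The MSB test can be sketched roughly as follows. This is only an illustration: the function name `is_plain_ascii` is made up here, the buffer is assumed to be held in a `std::string`, and the range-based loop assumes a C++11 or later compiler.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the thread): returns true when every
// byte in buf has its most significant bit clear, i.e. is in 0-127.
bool is_plain_ascii(const std::string& buf) {
    for (unsigned char c : buf) {
        if (c & 0x80) {
            return false;  // MSB set: this byte is not 7-bit ASCII
        }
    }
    return true;
}
```

A `false` result only says the data is not plain ASCII. It does not prove the data is UTF-8, since Latin-1 and other 8-bit encodings also set the MSB.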

Michael
 
 
 
 
 
James Kanze
 
      04-08-2008
On Apr 7, 2:50 pm, Michael DOUBEZ wrote:
> liveshell wrote:
> > In my application, I read a file and store its contents in an
> > array of char. The data is normally ASCII, but in certain situations I
> > get Unicode characters (or, let's say, junk characters). How can I tell
> > whether the data is plain ASCII or Unicode?


> Supposing your junk is UTF-8, you have to look for bytes whose MSB is
> equal to 1. This is how it is done in UTF-8: bytes 0-127 are the
> historical ASCII characters, and in a multi-byte sequence the number of
> leading ones in the first byte gives the total number of bytes in the
> encoding:
> US-ASCII: 0xxxxxxx
> 2 bytes: 110xxxxx xxxxxxxx
> 3 bytes: 1110xxxx xxxxxxxx xxxxxxxx
> 4 bytes: 11110xxx xxxxxxxx xxxxxxxx xxxxxxxx


Note too that the following bytes will always have 10 in their
upper bits, so that should be something like:
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
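
A structural check along these lines might look like the sketch below. The names `utf8_sequence_length` and `looks_like_utf8` are invented for illustration, and the check is deliberately incomplete: it only verifies the lead-byte and continuation-byte bit patterns, and does not reject overlong encodings, surrogates, or code points above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: total number of bytes in the sequence started by
// `lead`, or 0 if `lead` cannot start a sequence (for example because it
// is a 10xxxxxx continuation byte).
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Hypothetical helper: true when buf consists only of well-formed lead
// bytes, each followed by the right number of 10xxxxxx bytes.
bool looks_like_utf8(const std::string& buf) {
    std::size_t i = 0;
    while (i < buf.size()) {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(buf[i]));
        if (len == 0 || i + len > buf.size()) {
            return false;  // bad lead byte or truncated sequence
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(buf[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;  // continuation byte must be 10xxxxxx
            }
        }
        i += len;
    }
    return true;
}
```

Plain ASCII passes this check too, because every ASCII byte is a valid one-byte sequence. A lone Latin-1 byte such as 0xE9 fails, because it looks like a 3-byte lead with no continuation bytes after it.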

--
James Kanze (GABI Software)
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
 
 
 
 