c++ support for unicode, utf-8, encode/decode, ifstream, wstream?

 
 
Rafał Maj Raf256
      01-20-2006

Hi,
I have a Unicode text file encoded in UTF-8.

I should store the Unicode strings in my program in, for example,
std::wstring, right? That way I can work on them normally, so that
std::wstring foo; foo[5] would mean the 5th _character_, and not the 5th
byte of the encoded Unicode string.

How do I read text from a UTF-8 file into std::wstring? I need to do
some conversion, right? From UTF-8 to the internal format used by
std::wstring (probably UCS-2 or UCS-4, right?)

Also, how do I save the string back, and how do I manipulate it (like,
replace the 4th character, just str[4]=(wchar)'x' ?)

Thanks







 
TB
      01-20-2006
Rafał Maj Raf256 said:
> Hi,
> I have a Unicode text file encoded in UTF-8.
>
> I should store the Unicode strings in my program in, for example,
> std::wstring, right? That way I can work on them normally, so that
> std::wstring foo; foo[5] would mean the 5th _character_, and not the 5th
> byte of the encoded Unicode string.
>
> How do I read text from a UTF-8 file into std::wstring? I need to do
> some conversion, right? From UTF-8 to the internal format used by
> std::wstring (probably UCS-2 or UCS-4, right?)
>
> Also, how do I save the string back, and how do I manipulate it (like,
> replace the 4th character, just str[4]=(wchar)'x' ?)
>


Upon reading the UTF-8 data convert it internally to UTF-32 for
easier parsing. The conversion process is quite easy to write.
The problem with std::wstring is that it's templatized with
wchar_t, and that primitive is at least on my machine only 2 bytes,
and therefore not practical to use with unicode (unless you actually
wish to use the abnormal UTF-16 variant in such a case).
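
The width of wchar_t is implementation-defined, so it is worth checking on each compiler. A minimal sketch, assuming C++11 or later; the helper names wchar_bits and wchar_holds_any_code_point are made up for illustration:

```cpp
#include <climits>
#include <limits>

// Hypothetical helper: number of bits in one wchar_t on this platform.
constexpr unsigned wchar_bits() {
    return static_cast<unsigned>(sizeof(wchar_t) * CHAR_BIT);
}

// Hypothetical helper: true when a single wchar_t can hold every Unicode
// code point (the largest one is U+10FFFF).
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max())
           >= 0x10FFFFul;
}

// char32_t is at least 32 bits wide, so one std::u32string element always
// has room for one code point, whatever wchar_t looks like.
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t is too narrow");
```

With a 2-byte wchar_t, as is usual with Windows compilers, wchar_holds_any_code_point() is false; with a 4-byte wchar_t, as is typical on Linux, it is true.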

--
TB @ SWEDEN
 
Rafał Maj Raf256
      01-20-2006
TB wrote:

> Upon reading the UTF-8 data convert it internally to UTF-32 for
> easier parsing.


How? Aren't there ready-to-use functions/classes that do that? In std,
or perhaps in Boost?

> The conversion process is quite easy to write.


> The problem with std::wstring is that it's templatized with
> wchar_t, and that primitive is at least on my machine only 2 bytes,
> and therefore not practical to use with unicode (unless you actually
> wish to use the abnormal UTF-16 variant in such a case).


Hm.. so which class is best to store any-language text string then?



 
P.J. Plauger
      01-20-2006
"Rafal Maj Raf256" <> wrote in
message news:dqqid4$dn6$...

> TB wrote:
>
>> Upon reading the UTF-8 data convert it internally to UTF-32 for
>> easier parsing.

>
> How? Aren't there ready-to-use functions/classes that do that? In std,
> or perhaps in Boost?


You'll find a few codecvt facets (the critters you need) in various places,
but for a complete set of all that you're likely to need -- ready made,
tested, and supported -- see our CoreX library.

>> The conversion process is quite easy to write.


No it isn't. At least not correctly and robustly.
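
To give a feel for what "correctly and robustly" involves, here is a minimal sketch of a strict UTF-8 to UTF-32 decoder, assuming C++11 (std::u32string did not exist in 2006). decode_utf8 is a hypothetical name, not a function from CoreX, the standard library, or Boost. Besides the bit shuffling, it has to reject stray or missing continuation bytes, truncated input, overlong encodings, UTF-16 surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points,
// throwing std::runtime_error on the first malformed sequence.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;      // code point being assembled
        std::size_t len = 0;  // total bytes in this sequence
        char32_t min = 0;     // smallest value allowed for this length
        if (lead < 0x80)                { cp = lead;        len = 1; min = 0x0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; min = 0x10000; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i < len)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte must look like 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            throw std::runtime_error("overlong UTF-8 encoding");
        if (cp > 0x10FFFF)
            throw std::runtime_error("code point above U+10FFFF");
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::runtime_error("UTF-16 surrogate in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

This sketch throws on the first bad sequence. A production converter might substitute U+FFFD instead, and would also have to decide what to do with a leading byte order mark.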

>> The problem with std::wstring is that it's templatized with
>> wchar_t, and that primitive is at least on my machine only 2 bytes,
>> and therefore not practical to use with unicode (unless you actually
>> wish to use the abnormal UTF-16 variant in such a case).

>
> Hm.. so which class is best to store any-language text string then?


Depends on your goals. In truth and reality, you can still get away quite
nicely with UCS-2. Effectively, you ignore the exotic characters with
code values above 0xffff more recently added. Your input converter
then treats as erroneous any UTF-8 sequence that specifies a code
value that's too big. But if you feel the need to support the complete
Unicode set in its current form, you need to convert UTF-8 to UTF-16
internally, and accept the fact that characters can occupy either one or
two storage elements. Whatever your choice, CoreX has the
conversion tools you need to carry it out.
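
The difference between the two choices shows up in how a single code point is stored. A minimal sketch, assuming C++11 char16_t strings; append_ucs2 and append_utf16 are illustrative names only, not CoreX functions:

```cpp
#include <stdexcept>
#include <string>

// UCS-2 policy: one storage element per character, so anything above
// U+FFFF (and any surrogate value) is treated as an error.
void append_ucs2(std::u16string& out, char32_t cp) {
    if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::range_error("code point not representable in UCS-2");
    out.push_back(static_cast<char16_t>(cp));
}

// UTF-16 policy: code points above U+FFFF become a surrogate pair,
// so a character occupies either one or two storage elements.
void append_utf16(std::u16string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::range_error("not a Unicode scalar value");
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // leaves a 20-bit value
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
}
```

With UTF-16, str[n] is the n-th storage element, which is only the n-th character as long as no surrogate pair comes before it.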

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


 
TB
      01-20-2006
Rafał Maj Raf256 said:
> TB wrote:
>
>> Upon reading the UTF-8 data convert it internally to UTF-32 for
>> easier parsing.

>
> How? Aren't there ready-to-use functions/classes that do that? In std,
> or perhaps in Boost?
>
>> The conversion process is quite easy to write.

>
>> The problem with std::wstring is that it's templatized with
>> wchar_t, and that primitive is at least on my machine only 2 bytes,
>> and therefore not practical to use with unicode (unless you actually
>> wish to use the abnormal UTF-16 variant in such a case).

>
> Hm.. so which class is best to store any-language text string then?


If 'unsigned int' is 4 bytes on your machine, write a unicode
implementation based on that primitive, or use an already available
framework; hm, perhaps this 'CoreX'-thingy advocated by P.J.
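
With C++11 or later, std::u32string already is a string over a 4-byte primitive (an assumption here, since it was not available when this was written). A minimal sketch of editing by index and writing the result back out as UTF-8; encode_utf8 and demo are hypothetical names:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
std::string encode_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::range_error("not a Unicode scalar value");
        if (cp < 0x80) {               // 1 byte: 0xxxxxxx
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {       // 2 bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {     // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {                       // 4 bytes
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Example: replace the code point at index 4, then serialize.
std::string demo() {
    std::u32string s = U"Rafa\u0142";  // "Rafał", 5 code points
    s[4] = U'x';                       // index 4 is the 5th code point
    return encode_utf8(s);             // "Rafax"
}
```

One element here is one code point, which is not always one user-perceived character: a base letter followed by a combining mark is two elements.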

--
TB @ SWEDEN
 
P.J. Plauger
      01-20-2006
"TB" <> wrote in message
news:43d10b59$0$8259$.. .

> Rafal Maj Raf256 said:
>> TB wrote:
>>
>>> Upon reading the UTF-8 data convert it internally to UTF-32 for
>>> easier parsing.

>>
>> How? Aren't there ready-to-use functions/classes that do that? In std,
>> or perhaps in Boost?
>>
>>> The conversion process is quite easy to write.

>>
>>> The problem with std::wstring is that it's templatized with
>>> wchar_t, and that primitive is at least on my machine only 2 bytes,
>>> and therefore not practical to use with unicode (unless you actually
>>> wish to use the abnormal UTF-16 variant in such a case).

>>
>> Hm.. so which class is best to store any-language text string then?

>
> If 'unsigned int' is 4 bytes on your machine, write a unicode
> implementation based on that primitive, or use an already available
> framework; hm, perhaps this 'CoreX'-thingy advocated by P.J.


Yep. It includes UTF-8 to UCS-4 too. And it's templatized on the
internal character type. Forgot to mention that.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


 
JustBoo
      01-21-2006
On Fri, 20 Jan 2006 17:13:37 +0100, TB <> wrote:

>If 'unsigned int' is 4 bytes on your machine, write a unicode
>implementation based on that primitive, or use an already available
>framework; hm, perhaps this 'CoreX'-thingy advocated by P.J.


Oh, it isn't just advocated by Mr. Plauger it's SOLD ($) by Mr.
Plauger. Bit of a difference I think.

"If you have ten thousand regulations you destroy
all respect for the law." - Winston Churchill
 
P.J. Plauger
      01-21-2006
"JustBoo" <> wrote in message
news:...

> On Fri, 20 Jan 2006 17:13:37 +0100, TB <> wrote:
>
>>If 'unsigned int' is 4 bytes on your machine, write a unicode
>>implementation based on that primitive, or use an already available
>>framework; hm, perhaps this 'CoreX'-thingy advocated by P.J.

>
> Oh, it isn't just advocated by Mr. Plauger it's SOLD ($) by Mr.
> Plauger. Bit of a difference I think.


Really? In what way? I certainly *advocate* using an already
available framework, as did TB. If you can get a free one that
does the job (and it's still sufficiently "free" after you locate
it, download it, figure out how to build it, integrate it into
your product, deal with the surprises, and test it to your
satisfaction) by all means do so. I also *advocate* using CoreX,
if you're sufficiently professional that USD 90 is cheaper than
the above parenthetical exercise costs you in your time and
peace of mind.

But if you think I *advocate* something just because I make
ninety bucks off it, then by all means avoid anything that's
$OLD and stick with open sour¢e. Just don't measure me by
your standards.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


 
JustBoo
      01-22-2006
On Sat, 21 Jan 2006 18:32:22 -0500, "P.J. Plauger"
<> wrote:
>"JustBoo" <> wrote in message
>news:.. .
>> Oh, it isn't just advocated by Mr. Plauger it's SOLD ($) by Mr.
>> Plauger. Bit of a difference I think.


>Really? In what way? I certainly *advocate* using an already
>available framework, as did TB.


In what way? Well, from the simple *fact* you make money selling
your products here. I make that the point of this paragraph. Can you
deny that? It is a fact. Leave emotion out of it. Leave capitalism out
of it. Leave your perception that this is an insult out of it, and all
the rest. You sell your products here. I truly *couldn't care less*
whether you do or not. But you do. Please do not enumerate all the
good you do for the free and not-so-free world by doing this. You sell
them here on a consistent basis. Period. As the sun comes up every
morning, it's just the obvious truth. Now please pay attention; in the
*context* of this thread, I thought it important to point this out to
the poster. That's it.

Before getting your hackles up please read further.

>If you can get a free one that
>does the job (and it's still sufficiently "free" after you locate
>it, download it, figure out how to build it, integrate it into
>your product, deal with the surprises, and test it to your
>satisfaction) by all means do so. I also *advocate* using CoreX,
>if you're sufficiently professional that USD 90 is cheaper than
>the above parenthetical exercise costs you in your time and
>peace of mind.


Fairly ironic that. I'm certain you don't remember, but *I have
recommended* people look at your products on a regular basis, in
this ng and others. Even in the real world. And ready for this, I have
used precisely the exact same logic to justify the recommendation
when *attacked* for doing so. That is the very definition of irony.

[Note: I usually leave out the snarky remark about "sufficiently
professional" though.]

>But if you think I *advocate* something just because I make
>ninety bucks off it, then by all means avoid anything that's
>$OLD and stick with open sour¢e.


Wow, an ocean's worth of assumption and presumptions to boot. You
think me a socialist? Bwha. I'm a stone-cold capitalist. You've
assumed far too much. <chuckling> You've read far too much into my
simple statement of fact.

And yes, I do think you advocate it because you make money from it.
Welcome to the commerce of the human race. It's just human nature.
I'll leave it up to you to decide if that is an insult or not.

Trend your own posts, seriously. Look at what you respond to and what
you always recommend. I believe you to be of a scientific mentality
and if you are honest with yourself you will see the truth. Nothing more,
nothing less.

And in the end, so what. As I'm sure one of your arguments would/will
be: people are free to buy it or not, and you're making them aware of
its existence. And there you have it. Try to read this post without
emotion and perhaps you'll see my intent.

>Just don't measure me by
>your standards.

[Insult acknowledged but not accepted; like a refused package]

Once again, you assume far too much. Especially given that I simply
pointed out that you sell products, which is true. Does being a
capitalist bother you? Guilt perhaps? Note those are questions, not
assumptions.

"I didn't fight my way to the top of the food chain to be a
vegetarian."

Have a *prosperous* week.
 
P.J. Plauger
      01-22-2006
"JustBoo" <> wrote in message
news:...

> On Sat, 21 Jan 2006 18:32:22 -0500, "P.J. Plauger"
> <> wrote:
>>"JustBoo" <> wrote in message
>>news:. ..
>>> Oh, it isn't just advocated by Mr. Plauger it's SOLD ($) by Mr.
>>> Plauger. Bit of a difference I think.

>
>>Really? In what way? I certainly *advocate* using an already
>>available framework, as did TB.

>
> In what way? Well, from the simple *fact* you make money selling
> your products here. I make that the point of this paragraph. Can you
> deny that?


Uh, no.

> [extensive rant elided]


Got it. Now chill out.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


 