Re: Best way to handle UTF-8 in C++

 
 
Marek Borowski
 
      05-08-2010
On 08-05-2010 16:05, Sam wrote:
> Peter Olcott writes:
>
>> I want the exact std::string interface, but, the underlying
>> representation would be UTF-8. This means that substring would work on
>> the basis of Unicode CodePoints, instead of bytes.

>
> The point that you consistently seem to be missing is that UTF-8 /is/ a
> byte-oriented representation of Unicode. If you're asking for something
> that handles unicode codepoints, what you're asking has absolutely
> nothing to do, whatsoever, with UTF-8, or any other encoding. UTF-8 is
> just a byte-oriented encoding of the full Unicode set.
>

NO. Every other 8-bit encoding has 1 byte per char.
UTF-8 is not the same! Have you ever tried what you proposed?

> std::string is perfectly capable of handling UTF-8-encoded text, as in
> this very own news client, running on a UTF-8 platform, accepting
> UTF-8-encoded input from the keyboard, composing a UTF-8-encoded
> message, and posting it.
>

Assuming that "gęś" is UTF-8 text, substr(0,2) doesn't produce "gę"
as it should.

> ¿entienda?
>

You don't "entiendies".


Regards

Marek
 

James Kanze
 
      05-08-2010
On May 8, 6:15 am, "DaveB" <(E-Mail Removed)> wrote:
> Victor Bazarov wrote:


[...]
> Victor, why not just "see" what he is asking and give him what
> he needs: the answer!


Because he can't give an answer if the question isn't clear.

> Most things do not need a "dancing around it" style. Assess it
> as best you can in the first message, then blurt out an
> "answer"!


That would be rather irresponsible, don't you think? Make a
random guess as to what is being asked, and then answer that?

The "obvious" answer is that std::string does support UTF-8.
Until we know what is actually meant by "support" UTF-8, that's
the only possible answer. I rather suspect that the original
poster wanted more support than std::string gives, but until he
specifies what, it's impossible to give an answer.

--
James Kanze
 

James Kanze
 
      05-08-2010
On May 8, 7:48 pm, "Peter Olcott" <(E-Mail Removed)> wrote:
> "James Kanze" <(E-Mail Removed)> wrote in message


> news:(E-Mail Removed)...


> I have decided that if I want a utf8string that implements a
> subset of the std::string interface I must implement this
> myself.


Probably. Or pick it up off the net. (There's a good deal of
UTF-8 support at my site, but it's still pretty experimental.
Supporting UTF-8 is very non-trivial.)

> See [Is this UTF-8 Regular Expression semantically
> correct?] to follow the details of this. As long as the
> regex is correct then my utf8string design will easily
> provide the capabilities that you referred to above.


Safely? (I ask because I know that it's quite difficult to
handle UTF-8 correctly. At least with the classical std
syntax---as usual, the standard iterator idiom causes no end of
problems.)

--
James Kanze
 

Thomas J. Gritzan
 
      05-08-2010
On 08.05.2010 20:48, Peter Olcott wrote:
> I have decided that if I want a utf8string that implements a
> subset of the std::string interface I must implement this
> myself.


Take a look at Glib::ustring. It's part of gtkmm, has a similar
interface to std::string and stores UTF-8:
http://library.gnome.org/devel/gtkmm...string.html.en

The entire gtkmm library seems to be a bit more STL-friendly than any
other GUI library I've seen. However, I haven't used it myself.

--
Thomas
 

James Kanze
 
      05-08-2010
On May 8, 10:19 pm, Marek Borowski <(E-Mail Removed)> wrote:
> On 08-05-2010 16:05, Sam wrote:
> > Peter Olcott writes:


> >> I want the exact std::string interface, but, the underlying
> >> representation would be UTF-8. This means that substring
> >> would work on the basis of Unicode CodePoints, instead of
> >> bytes.


> > The point that you consistently seem to be missing is that
> > UTF-8 /is/ a byte-oriented representation of Unicode. If
> > you're asking for something that handles unicode codepoints,
> > what you're asking has absolutely nothing to do, whatsoever,
> > with UTF-8, or any other encoding. UTF-8 is just a
> > byte-oriented encoding of the full Unicode set.


> NO. Every other 8bit encoding has 1 byte per char.


Bullshit. There are any number of multibyte encodings, many of
them older than UTF-8.

> UTF-8 is not the same! Have you ever tried what you proposed?


Until we know what Peter wants to do, it's impossible to say
whether std::string can be used "as is", or not.

> > std::string is perfectly capable of handling UTF-8-encoded
> > text, as in this very own news client, running on a UTF-8
> > platform, accepting UTF-8-encoded input from the keyboard,
> > composing a UTF-8-encoded message, and posting it.


> Assuming that "gęś" is UTF-8 text, substr(0,2) doesn't produce
> "gę" as it should.


Should it? (In practice, I've not found much use for
std::string::substr. And something like std::string(s.begin(),
std::search(s.begin(), s.end(), target.begin(), target.end()))
does work as expected. But that's probably linked to my
particular type of application; I don't think my experience
would hold in an editor, for example.)
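
A minimal sketch of that expression wrapped in a function, assuming both
strings hold valid UTF-8. The name prefix_before is made up for this
example. It searches bytes, not code points, but since UTF-8 lead bytes and
continuation bytes occupy disjoint ranges, a match on a valid UTF-8 target
cannot start or end in the middle of a character:

```cpp
#include <algorithm>
#include <string>

// Returns everything in `s` that precedes the first occurrence of `target`,
// or all of `s` when `target` does not occur. The search runs over bytes.
// (prefix_before is a hypothetical helper, not part of any library.)
std::string prefix_before(const std::string& s, const std::string& target)
{
    return std::string(s.begin(),
                       std::search(s.begin(), s.end(),
                                   target.begin(), target.end()));
}
```

With s holding the bytes of "gęś" (67 C4 99 C5 9B) and target holding "ś"
(C5 9B), the result is the three bytes of "gę", with no character index
involved at any point.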

Depending on what your application is doing, std::string and the
standard library might provide all you need. Or you might need
a few additional functions. Or you might be better off
transcoding on input and output (which you probably have to do
anyway) and using UTF-32 internally.
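
As a rough illustration of the transcoding option, here is a sketch of a
UTF-8 to UTF-32 decoder. It assumes a C++11 compiler (for std::u32string),
the name utf8_to_utf32 is invented for this example, and the validation is
deliberately incomplete: it catches truncated sequences and bad
continuation bytes, but not overlong forms or surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 bytes into one char32_t per code point.
// (utf8_to_utf32 is a hypothetical helper; validation is minimal.)
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;   // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);   // append the low 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Once the text is in a std::u32string, substr(0, 2) counts code points, so
the "gęś" example yields the two elements 'g' and U+0119. Grapheme clusters
(a base letter plus combining marks, say) are still not handled by this.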

--
James Kanze
 

James Kanze
 
      05-08-2010
On May 8, 6:21 am, "DaveB" <(E-Mail Removed)> wrote:
> Joshua Maurice wrote:


> > Let me try to explain. std::string has member functions like
> > find and substring. When used to store UTF-8 for an 8 bit
> > char, the indexes are in terms of 8 bit encoding units.
> > However, generally a user does not want to work with indexes
> > in terms of encoding units. They want to work with indexes
> > in terms of encoded Unicode code points, or more probably
> > Unicode grapheme clusters.


> Good thing I'm not relevant to evaluating your resume. JK,
> your corporate-coding "experience" shows up in the stock
> tickers though. I dunno what to think anymore. I'll be cliche:
> you get what you pay for, and easy come/easy go.


Not sure what this is about. Usually, when someone says "JK",
they're referring to me, but when this was posted, I hadn't made
a single contribution to this thread. Still, I don't understand
what the poster is trying to say.

--
James Kanze
 

James Kanze
 
      05-09-2010
On May 9, 4:14 am, "Peter Olcott" <(E-Mail Removed)> wrote:
> "Thomas J. Gritzan" <(E-Mail Removed)> wrote in
> message news:hs4r9r$bg2$(E-Mail Removed) ...


[...]
> Now that I know how to do this myself very easily I won't
> bother looking at alternatives.
> I will be precisely implementing the subset of the
> std::string that I need:
> operator[]()


With what return type?

> substr()
> operator+=()
> operator=()
> length() in characters
> reserve() in bytes
> capacity() in bytes
> size() in bytes
> resize() in bytes
> relational operators
> operator>>()
> operator<<()


With the exception of length and substr (assuming you want to
use character indexes), these all already work for UTF-8 in
std::string.

Given that all you apparently need is substr, length and some
sort of indexing, the simplest solution would seem to be some
sort of free functions. In practice, however, I think you'll
find that you also need some sort of mechanism to support
iterators, so that you can use the STL.
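
A sketch of what such free functions might look like, assuming the input is
valid UTF-8 and a C++11 compiler. All names here (starts_code_point,
utf8_length, utf8_advance, utf8_substr) are invented for the example. The
idea is simply that every byte which is not a 10xxxxxx continuation byte
starts a new code point:

```cpp
#include <cstddef>
#include <string>

// True when `c` is not a 10xxxxxx continuation byte, i.e. it starts a
// code point. (All names in this block are hypothetical.)
inline bool starts_code_point(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
}

// Number of code points in `s`.
std::size_t utf8_length(const std::string& s)
{
    std::size_t n = 0;
    for (char c : s)
        if (starts_code_point(c)) ++n;
    return n;
}

// Byte offset reached after skipping `n` code points, starting at byte
// offset `from` (which must be a code point boundary). Stops at s.size().
std::size_t utf8_advance(const std::string& s, std::size_t from, std::size_t n)
{
    std::size_t i = from;
    while (i < s.size() && n > 0) {
        ++i;                                            // the lead byte
        while (i < s.size() && !starts_code_point(s[i]))
            ++i;                                        // its continuation bytes
        --n;
    }
    return i;
}

// Like std::string::substr, but `pos` and `count` are in code points.
// Unlike std::string::substr, an out-of-range `pos` gives an empty string
// instead of throwing.
std::string utf8_substr(const std::string& s, std::size_t pos,
                        std::size_t count = std::string::npos)
{
    std::size_t first = utf8_advance(s, 0, pos);
    std::size_t last  = utf8_advance(s, first, count);
    return s.substr(first, last - first);
}
```

For the bytes of "gęś", utf8_length gives 3 and utf8_substr(s, 0, 2) gives
the three bytes of "gę". Both are linear in the length of the string, which
is the price of indexing by character in a variable-width encoding.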

Where things get complicated, of course, is what operator[] and
iterator::operator* should return. (A uint32_t is an obvious
choice. Except that this doesn't allow using these results as
an lvalue.)
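
To illustrate why an lvalue is awkward here: assigning to one character of
a UTF-8 string can change the string's length in bytes, so there is nothing
a plain reference could refer to. The sketch below shows the splice that an
assignment would have to perform. The names encode_utf8 and
replace_code_point are invented for the example, char32_t assumes C++11,
and the encoder does not reject surrogates or values above 0x10FFFF.

```cpp
#include <cstddef>
#include <string>

// Encodes one code point as 1 to 4 UTF-8 bytes.
// (Hypothetical helper; assumes cp <= 0x10FFFF and is not a surrogate.)
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Replaces the code point that starts at byte offset `pos` with `cp`.
// `pos` must be less than s.size() and must be a code point boundary.
// The string may grow or shrink as a result.
void replace_code_point(std::string& s, std::size_t pos, char32_t cp)
{
    std::size_t end = pos + 1;
    while (end < s.size() &&
           (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80)
        ++end;                              // skip continuation bytes
    s.replace(pos, end - pos, encode_utf8(cp));
}
```

Replacing the 'g' in "gęś" with U+015B turns a 5-byte string into a 6-byte
one, which shifts the byte offsets of every later character. A proxy object
that calls something like this from its operator= would be one way around
the problem, with the usual costs of proxies.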

--
James Kanze
 

Marek Borowski
 
      05-09-2010
On 09-05-2010 00:50, Sam wrote:
> Marek Borowski writes:
>
>>> std::string is perfectly capable of handling UTF-8-encoded text, as in
>>> this very own news client, running on a UTF-8 platform, accepting
>>> UTF-8-encoded input from the keyboard, composing a UTF-8-encoded
>>> message, and posting it.
>>>

>> Assuming that "gęś" is UTF-8 text, substr(0,2) doesn't produce "gę"

>
> Since "gęś" is 6 bytes in UTF-8, I'm not sure why you would expect that
> it would.
>

UTF-8 "gęś" is 5 bytes long, not 6.

> "gęś" is 6 bytes, and substr(0, 2) would give you the first two bytes,
> as expected, nothing more, nothing less.
>

It should give 2 charT elements. When you use wstring, substr(0,2)
produces "gę".

wstring uses 16-bit characters. string uses 8-bit characters. Neither is
suitable for variable-size characters such as UTF-8 characters.


Regards

Marek






 

Marek Borowski
 
      05-09-2010
On 09-05-2010 13:50, Sam wrote:
> Marek Borowski writes:
>


>> UTF-8 "gęś" is 5 bytes long, not 6.
>>
>> It should give 2 charT elements.

>
> And, it does. std::string is a typedef for std::basic_string<char>, so
> your charT is a char.
>

And a char is not a UTF-8 character. It doesn't work as one would expect.
This is what I am trying to explain to you: string is not suitable for
UTF-8 strings.

> The first two charTs of this UTF-8 string are 0x67 0xc4, and that's what
> you'll get with your substr() call.
>

That is not valid text in UTF-8 encoding. I would expect 0x67, 0xC4, 0x99,
as this is the representation of "gę" in UTF-8
(2 characters, 3 bytes).
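
To make the byte counts concrete, here is a small sketch. The word is
spelled out as escaped bytes so the result does not depend on the encoding
of the source file, and the variable names are made up for the example:

```cpp
#include <string>

// "gęś" as UTF-8 bytes: g = 67, ę = C4 99, ś = C5 9B (5 bytes in total).
const std::string goose = "g\xC4\x99\xC5\x9B";

// substr counts chars, i.e. bytes: this is 'g' plus only the first byte
// of 'ę', which is not valid UTF-8 on its own.
const std::string two_bytes = goose.substr(0, 2);

// Getting the first two characters means asking for three bytes.
const std::string two_chars = goose.substr(0, 3);
```

Here two_bytes holds 67 C4 and two_chars holds 67 C4 99, which matches both
descriptions above: substr does return two charT elements, and those two
elements are not the first two characters.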

>> When you use wstring, substr(0,2) produces "gę".
>>
>> wstring uses 16bit characters.

>
> That's great. But what does std::wstring have to do with UTF-8, and
> whether or not std::string "works" (for some unspecified definition of
> "works") with UTF-8-encoded strings?
>

It was an example of how the substr function works. The second parameter
is the length of the substring in characters; it doesn't matter what those are.

Regards

Marek


 

Paul Bibbings
 
      05-09-2010
"Peter Olcott" <(E-Mail Removed)> writes:

> My original question was sufficiently complete. I said that
> I wanted a string class that provided the std::string
> interface and had an underlying utf-8 representation. It
> doesn't take psychic powers to know that substr() must be
> implemented differently. I took Victor's request for more
> information to simply be head games so I ignored the
> request.


As a moderately-interested `lurker' in this thread I think that there
*are* head games here, but I don't see them in Victor's replies.
Rather, I would classify them as `power games' of the "I will
pose a question and then shoot down all responses because I know the
answer already" variety.

Regards

Paul Bibbings
 