Re: New utf8string design may make UTF-8 the superior encoding

 
 
Tiib
Guest
Posts: n/a
 
      05-16-2010
On 16 mai, 15:34, "Peter Olcott" <(E-Mail Removed)> wrote:
> Since the reason for using other encodings than UTF-8 is
> speed and ease of use, a string that is as fast and easy to
> use (as the strings of other encodings) that often takes
> less space would be superior to alternative strings.


If you care so much ... perhaps throw together your utf8string and let
us see it. Perhaps test and profile it first to compare it with
Glib::ustring. http://library.gnome.org/devel/glibm..._1ustring.html

I suspect UTF-8 will gradually fade into history, for reasons similar
to why 256-color video modes and raster-graphic formats went. GUIs are
already often made with Java or C# (for lack of C++ devs), and these
use UTF-16 internally. Notice that modern processor architectures are
already optimized in such a way that byte-level operations are often
slower.
 
 
 
 
 
Tiib
Guest
Posts: n/a
 
      05-16-2010
On 16 mai, 17:46, Peter Olcott <(E-Mail Removed)> wrote:
>
> UTF-8 is the best Unicode data-interchange format because it works
> exactly the same way across every machine architecture without the need
> for separate adaptations. It also stores the entire ASCII character set
> in a single byte per code point.


Similarly, Portable Network Graphics is a good format for
interchanging raster graphics. GIMP, Photoshop etc., however, do not
use such a packed format for graphics manipulation internally. They
use their own internal formats to achieve manipulation speed and
flexibility. You insist on using an interchange format for
manipulation. Whether that is a good or bad idea depends on context.

> I will put it together because it will become one of my standard tools.
> The design is now essentially complete. Coding this updated design will
> go very quickly. I will put it on my website and provide a free license
> for any use as long as the copyright notice remains in the source code.


Great.
 
 
 
 
 
James Kanze
Guest
Posts: n/a
 
      05-18-2010
On 16 May, 14:51, Tiib <(E-Mail Removed)> wrote:
> On 16 mai, 15:34, "Peter Olcott" <(E-Mail Removed)> wrote:


> I suspect UTF8 fades gradually into history. Reasons are
> similar like 256 color video-modes and raster-graphic formats
> went. GUI-s are already often made with java or C# (for lack
> of C++ devs) and these use UTF16 internally. Notice that
> modern processor architectures are already optimized in the
> way that byte-level operations are often slower.


The network is still 8 bits UTF-8. As are the disks; using
UTF-16 on an external support simply doesn't work.

Also, UTF-8 may result in less memory use, and thus less paging.

If all you're doing are simple operations, searching for a few
ASCII delimiters and copying the delimited substrings, for
example, UTF-8 will probably be significantly faster: the CPU
will always read a word at a time, even if you access it byte by
byte, and you'll usually get more characters per word using
UTF-8.
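
A minimal sketch of the delimiter case described above, assuming the delimiter is a plain ASCII byte. The helper name split_on_ascii is hypothetical and does not come from this thread. It relies only on the UTF-8 property that bytes below 0x80 never occur inside a multi-byte sequence, so a byte-wise scan cannot cut a character in half:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): split UTF-8 text on an ASCII
// delimiter by scanning bytes. No decoding is needed, because every byte of
// a multi-byte UTF-8 sequence has its high bit set and therefore can never
// compare equal to an ASCII delimiter.
std::vector<std::string> split_on_ascii(const std::string& utf8, char delim) {
    // Precondition: the delimiter itself must be ASCII (0x00..0x7F).
    assert(static_cast<unsigned char>(delim) < 0x80);

    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = utf8.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(utf8.substr(start));  // last (possibly empty) field
            break;
        }
        parts.push_back(utf8.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}
```

The fields come back as UTF-8 byte strings, untouched. Anything that needs to know where one character ends and the next begins still has to decode.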

If you need full and complete support, as in an editor, for
example, UTF-32 is the best general solution. For a lot of
things in between, UTF-16 is a good compromise.

But the trade-offs only concern internal representation.
Externally, the world is 8 bits, and UTF-8 is the only solution.

--
James Kanze
 
 
Tiib
Guest
Posts: n/a
 
      05-18-2010
On 18 mai, 17:18, James Kanze <(E-Mail Removed)> wrote:
> The network is still 8 bits UTF-8. As are the disks; using
> UTF-16 on an external support simply doesn't work.
>
> Also, UTF-8 may result in less memory use, and thus less paging.
>
> If all you're doing are simple operations, searching for a few
> ASCII delimiters and copying the delimited substrings, for
> example, UTF-8 will probably be significantly faster: the CPU
> will always read a word at a time, even if you access it byte by
> byte, and you'll usually get more characters per word using
> UTF-8.
>
> If you need full and complete support, as in an editor, for
> example, UTF-32 is the best general solution. For a lot of
> things in between, UTF-16 is a good compromise.
>
> But the trade-offs only concern internal representation.
> Externally, the world is 8 bits, and UTF-8 is the only solution.


I would honestly be extremely glad if it were the only solution.
Real-life applications throw in texts in all possible forms, and they
also expect responses in all possible forms. For example, texts in
financial transactions done in most of Northern Europe assume that
"/\{}[]" means something like "" (I do not remember the correct order,
but something like that).

I prefer to convert incoming texts into std::wstring. Outgoing texts I
convert back to whatever the receiver expects (UTF-8 is really
relaxing news there, true). All I need is a set of conversion
functions. If the text is going to the user interface, then
std::wstring goes there, and it is the business of the UI to convert
it further into CString or QString or whatever they enjoy there and
sort it out for the user.
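
A minimal sketch of one such conversion function, for the UTF-8 input case only. The name decode_utf8 is hypothetical, and the sketch decodes into std::u32string rather than std::wstring so that it does not depend on the platform's wchar_t width; both choices are assumptions, not something stated in this thread:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the thread): decode UTF-8 bytes into UTF-32
// code points. Throws std::invalid_argument on malformed input.
std::u32string decode_utf8(const std::string& in) {
    // Smallest code point that legitimately needs a sequence of each length;
    // anything below it is an overlong encoding.
    static const char32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, UTF-16 surrogates and out-of-range values.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    assert(i == in.size());  // every input byte was consumed exactly once
    return out;
}
```

Converters for other incoming encodings would sit next to this one behind the same kind of signature, which is all the layering described above requires.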

Perhaps I have too little experience with sophisticated text
processing. Simple std::sort(), the wide char literals of C++ and
boost::wformat plus a full set of conversion functions are really all
I need. Peter Olcott raises a lot of noise around it, and so it makes
me a bit interested.
 
 
Mihai N.
Guest
Posts: n/a
 
      05-19-2010

> I perhaps have too low experience with sophisticated text processing.
> Simple std::sort(), wide char literals of C++ and boost::wformat plus
> full set of conversion functions is all i need really.


It depends a lot on what you need.

Sorting is locale-sensitive (German, Swedish, French, Spanish, all
have different sorting rules).
The CRT (and STL, and boost) are pretty dumb when dealing with things
in a locale sensitive way (meaning that they usually don't


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

 
 
Tiib
Guest
Posts: n/a
 
      05-19-2010
On May 19, 8:24 am, "Mihai N." <(E-Mail Removed)> wrote:
> > I perhaps have too low experience with sophisticated text processing.
> > Simple std::sort(), wide char literals of C++ and boost::wformat plus
> > full set of conversion functions is all i need really.

>
> It depends a lot what you need.
>
> Sorting is locale-sensitive (German, Swedish, French, Spanish, all
> have different sorting rules).
> The CRT (and STL, and boost) are pretty dumb when dealing with things
> in a locale sensitive way (meaning that they usualy don't


Yes, sorting in real alphabetic order for the user is perhaps the
business of the GUI. The GUI has to display it. The GUI, however,
usually has its WxStrings or FooStrings anyway. I hate it when someone
leaks these weirdos into the application mechanics layer. Internal
application logic is often best made totally locale-agnostic, not
caring about positioning in the GUI or whether the end users write
from top to bottom or from right to left.

So text in the electronic interfaces layer is bytes, text in the
application layer is wchar_t, and text in the user interface layer
follows whatever weirdo rules apply there. If a maintainer forgets to
convert at the interface between layers, he gets compiler warnings or
errors. That makes life easy, but I suspect my problems with texts are
more trivial than those of some others.
 
 
James Kanze
Guest
Posts: n/a
 
      05-19-2010
On May 19, 12:01 am, Tiib <(E-Mail Removed)> wrote:
> On 18 mai, 17:18, James Kanze <(E-Mail Removed)> wrote:


[...]
> > But the trade-offs only concern internal representation.
> > Externally, the world is 8 bits, and UTF-8 is the only solution.


> I would be honestly extremely glad if it was the only solution. Real
> life applications throw in texts in all possible forms also they await
> responses in all possible forms.


Yes. I meant it is the only solution if you are choosing
yourself. In practice, there are a lot of other solutions being
used; they don't work, except in limited environments, but they
are being widely used.

> For example texts in financial transactions done in most
> Northern Europe assume that "/\{}[]" means something like
> "" (i do not remember correct order, but something like
> that).


> I prefer to convert incoming texts into std::wstring. Outgoing
> texts i convert back to whatever they await (UTF-8 is really
> relaxing news there, true). All what i need is a set of
> conversion functions. If it is going to user interface then
> std::wstring goes and it is business of UI to convert it
> further into CString or QString or whatever they enjoy there
> and sort it out for user.


In theory, the conversion should take place in the filebuf,
using the imbued locale.

> I perhaps have too low experience with sophisticated text processing.
> Simple std::sort(), wide char literals of C++ and boost::wformat plus
> full set of conversion functions is all i need really. Peter Olcott
> raises lot of noise around it and so it makes me a bit
> interested.


There can be advantages to using UTF-8 internally, as well as at
the interface level, and if you're not doing too complicated
things, it can work quite nicely. But only as long as your
manipulations aren't too complicated.

--
James Kanze
 
 
Tiib
Guest
Posts: n/a
 
      05-19-2010
On May 19, 1:21 pm, James Kanze <(E-Mail Removed)> wrote:
> On May 19, 12:01 am, Tiib <(E-Mail Removed)> wrote:
>
> > On 18 mai, 17:18, James Kanze <(E-Mail Removed)> wrote:

>
> [...]
>
> > > But the trade-offs only concern internal representation.
> > > Externally, the world is 8 bits, and UTF-8 is the only solution.

> > I would be honestly extremely glad if it was the only solution. Real
> > life applications throw in texts in all possible forms also they await
> > responses in all possible forms.

>
> Yes. I meant it is the only solution if you are choosing
> yourself. In practice, there are a lot of other solutions being
> used; they don't work, except in limited environments, but they
> are being widely used.
>
> > For example texts in financial transactions done in most
> > Northern Europe assume that "/\{}[]" means something like
> > "" (i do not remember correct order, but something like
> > that).
> > I prefer to convert incoming texts into std::wstring. Outgoing
> > texts i convert back to whatever they await (UTF-8 is really
> > relaxing news there, true). All what i need is a set of
> > conversion functions. If it is going to user interface then
> > std::wstring goes and it is business of UI to convert it
> > further into CString or QString or whatever they enjoy there
> > and sort it out for user.

>
> In theory, the conversion should take place in the filebuf,
> using the imbued locale.


Yes, if there is a good wfilebuf, then my problems do not exist at
all. Often that is not the case in practice; instead there are strange
protocol layers and security by obscurity.

> > I perhaps have too low experience with sophisticated text processing.
> > Simple std::sort(), wide char literals of C++ and boost::wformat plus
> > full set of conversion functions is all i need really. Peter Olcott
> > raises lot of noise around it and so it makes me a bit
> > interested.

>
> There can be advantages to using UTF-8 internally, as well as at
> the interface level, and if you're not doing too complicated
> things, it can work quite nicely. But only as long as your
> manipulations aren't too complicated.


My major advantage from using wstring is that ...

Bytes are often too ambiguous as information, even if in an exception
like UTF-8 the information is fully sufficient. The compiler does not
distinguish between a byte (char) in a UTF-8 string and a byte in a
string in some other encoding. wstring ensures that compilers and
tools can easily frown upon bytes that sneak into the application
layer, in whatever encoding they are and wherever they come from. That
draws attention to the right place and for the right reason.

For example there is:
basic_fstream::basic_fstream(const char* s, ios_base::openmode
mode);

If I give the result of wstring::c_str() as parameter s to that
constructor, I get an error. So the compiler draws my attention to the
right place. If I get no error, then there is most likely an extension
to the STL that most likely works correctly. Giving it the result of
string::c_str() (that contains UTF-8) most likely creates a
garbage-filled file name.
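
A small sketch of the compile-time separation described above, restricted to the standard string types so that it does not depend on any vendor's stream extensions. The function name application_text_length is hypothetical and only stands in for an application-layer interface:

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Narrow and wide strings (and their pointers) do not convert into each
// other, so a forgotten conversion between layers fails to compile.
static_assert(!std::is_constructible<std::string, const wchar_t*>::value,
              "a wide C string must not build a narrow std::string");
static_assert(!std::is_constructible<std::wstring, const char*>::value,
              "a narrow C string must not build a std::wstring");
static_assert(!std::is_constructible<std::wstring, std::string>::value,
              "std::string must not build a std::wstring");
static_assert(!std::is_assignable<std::wstring&, std::string>::value,
              "std::string must not be assignable to std::wstring");

// Hypothetical application-layer function: it accepts wide text only.
// Passing a std::string or a const char* here is rejected by the compiler.
std::size_t application_text_length(const std::wstring& text) {
    return text.size();
}
```

The check works at the string and pointer level only. A single char still converts implicitly to wchar_t, so element-by-element copies are not caught this way.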

 
 
Joshua Maurice
Guest
Posts: n/a
 
      05-19-2010
On May 19, 1:50 am, Tiib <(E-Mail Removed)> wrote:
> On May 19, 8:24 am, "Mihai N." <(E-Mail Removed)> wrote:
>
> > > I perhaps have too low experience with sophisticated text processing.
> > > Simple std::sort(), wide char literals of C++ and boost::wformat plus
> > > full set of conversion functions is all i need really.

>
> > It depends a lot what you need.

>
> > Sorting is locale-sensitive (German, Swedish, French, Spanish, all
> > have different sorting rules).
> > The CRT (and STL, and boost) are pretty dumb when dealing with things
> > in a locale sensitive way (meaning that they usualy don't

>
> Yes, sorting in real alphabetic order for user is perhaps business of
> GUI. GUI has to display it. GUI however usually has its WxStrings or
> FooStrings anyway. I hate when someone leaks these weirdos to
> application mechanics layer. Internal application logic is often best
> made totally locale-agnostic and not caring about positioning in GUI
> and if the end-users write from up to down or from right to left.
>
> So text in electronic interfaces layer are bytes, text in application
> layer are wchar_t and text in user interface layer are whatever weirdo
> rules there. If maintainer forgets to convert in interface between
> layers he gets compiler warnings or errors. That makes life easy, but
> i suspect my problems with texts are more trivial than these of some
> others.


First, as I mentioned in the other current thread on Unicode, please
stop saying "wchar_t" and "wstring" as though that means something, or
is at all a useful portable tool. wchar_t is 16 bits on Windows, and
32 bits on most Unix-like systems IIRC. (Yes, the other thread listed
some more exceptions.) So, either you're suggesting an entirely
non-portable solution with wstring, or you are suggesting that it makes
sense to use UTF-32 on Unix-like computers and UTF-16 on Windows
computers, a quite silly statement.
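
To make the width difference concrete, here is a minimal sketch that counts code points in a std::wstring. It assumes wchar_t holds either UTF-16 code units (16 bits) or UTF-32 code points (32 bits), which matches the two cases named above but is not guaranteed by the language; the name count_code_points is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread). Assumes the wstring holds
// UTF-32 where wchar_t is at least 32 bits wide and UTF-16 otherwise.
std::size_t count_code_points(const std::wstring& s) {
    if (sizeof(wchar_t) >= 4) {
        return s.size();  // one element per code point
    }
    std::size_t n = 0;
    for (wchar_t c : s) {
        const unsigned long u = static_cast<unsigned long>(c) & 0xFFFFUL;
        // In UTF-16, a low (trailing) surrogate is the second half of a
        // pair, so it does not start a new code point.
        if (u < 0xDC00UL || u > 0xDFFFUL) {
            ++n;
        }
    }
    assert(n <= s.size());  // never more code points than elements
    return n;
}
```

The same wide literal has a different size() on the two kinds of platform once it contains a character outside the Basic Multilingual Plane, which is why code that indexes a wstring by position is not portable without a helper of this sort.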

Then, locales in my experience have not been terribly portable, not
portable enough for my company's product which runs on nearly all
computer OSs known to man, including Windows, Win x64, the soon to be
"desupported by Windows" Windows Itanium, Linux, z Linux, OS 2, HPUX
IPF, and so on. Moreover, it's not terribly practical to tell our
customers "you have to install these 'x' locales". Moreover, the
locales of the same name on different OSs have been known to have
subtly different behavior.

Finally, I can't think of a useful example off the top of my head
where sorting based on locale would be required except when
"printing", to the screen, file, etc., but this doesn't convince me
that there is no use for it. As a potential example, should you have
to bring in an entire GUI framework just to implement the Unix utility
"sort" except with an additional locale option? That seems silly to
me.
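
For what it is worth, a locale-aware sort does not need a GUI framework in standard C++: std::locale has a function-call operator that compares two strings through its collate facet, so a locale object can be passed to std::sort as the comparator. The sketch below is hedged accordingly. The name sort_lines is hypothetical, the fallback to the classic "C" locale is my own choice, and whether a locale name such as "sv_SE.UTF-8" exists on a given machine is exactly the portability problem described above:

```cpp
#include <algorithm>
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): sort lines with the collation
// rules of the named locale. If that locale is not installed, fall back to
// the classic "C" locale, which compares by character code.
std::vector<std::string> sort_lines(std::vector<std::string> lines,
                                    const char* locale_name) {
    assert(locale_name != nullptr);

    std::locale loc = std::locale::classic();
    try {
        loc = std::locale(locale_name);
    } catch (const std::runtime_error&) {
        // Unknown or uninstalled locale: keep the classic locale.
    }

    // std::locale::operator() compares two strings via the collate facet.
    std::sort(lines.begin(), lines.end(), loc);
    return lines;
}
```

With "C" the result is plain code-unit order, so uppercase ASCII sorts before lowercase. What a named locale produces depends on the platform's locale data, and for UTF-8 input it also depends on the locale's own encoding matching the data.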
 
 
Tiib
Guest
Posts: n/a
 
      05-19-2010
On May 20, 12:02 am, Joshua Maurice <(E-Mail Removed)> wrote:
> On May 19, 1:50 am, Tiib <(E-Mail Removed)> wrote:
> > On May 19, 8:24 am, "Mihai N." <(E-Mail Removed)> wrote:

>
> > > > I perhaps have too low experience with sophisticated text processing.
> > > > Simple std::sort(), wide char literals of C++ and boost::wformat plus
> > > > full set of conversion functions is all i need really.

>
> > > It depends a lot what you need.

>
> > > Sorting is locale-sensitive (German, Swedish, French, Spanish, all
> > > have different sorting rules).
> > > The CRT (and STL, and boost) are pretty dumb when dealing with things
> > > in a locale sensitive way (meaning that they usualy don't

>
> > Yes, sorting in real alphabetic order for user is perhaps business of
> > GUI. GUI has to display it. GUI however usually has its WxStrings or
> > FooStrings anyway. I hate when someone leaks these weirdos to
> > application mechanics layer. Internal application logic is often best
> > made totally locale-agnostic and not caring about positioning in GUI
> > and if the end-users write from up to down or from right to left.

>
> > So text in electronic interfaces layer are bytes, text in application
> > layer are wchar_t and text in user interface layer are whatever weirdo
> > rules there. If maintainer forgets to convert in interface between
> > layers he gets compiler warnings or errors. That makes life easy, but
> > i suspect my problems with texts are more trivial than these of some
> > others.

>
> First, as I mentioned in the other current thread on Unicode, please
> stop saying "wchar_t" and "wstring" as though that means something, or
> is at all a useful portable tool. wchar_t is 16 bits on windows, and
> 32 bits on most Unix-like systems IIRC. (Yes, the other thread listed
> some more exceptions.) So, either you're suggesting an entirely not
> portable solution with wstring, or you are suggesting that it makes
> sense to use UTF32 on Unix-like computers and UTF16 on windows
> computers, a quite silly statement.


Now ... it seems that there is a strange misunderstanding. For anyone
converting from whatever char sequence to whatever wchar_t sequence,
it is a highly platform-dependent operation anyway. I have in no way
said that such operations are portable. Since wstring is used for
holding texts internally, sizeof(wchar_t) does not affect anything.
The major property of wchar_t for me is that it is different from char
on all platforms I know, and so I get warnings/errors from tools on
attempts to mechanically assign one to the other.

> Then, locales in my experience have not been terribly portable, not
> portable enough for my company's product which runs on nearly all
> computer OSs known to man, including windows, win x64, the so to be
> "desupported by windows" windows itanium, Linux, z Linux, OS 2, HPUX
> IPF, and so on.


You somehow managed to have portability in string-to-string
conversions? Congrats. I have abandoned all hope there. Different code
is used for conversions platform by platform. The platform makers (and
not only they) seemingly fight with each other to make their data
incompatible, so why should I hope there will be peace and portability
any day? Is there something new? The same goes on with dates, values
with measurement units and even plain floating point numbers ... you
name it. Plain text is no different.

> Moreover, it's not terribly practical to tell our
> customers "you have to install these 'x' locales". Moreover, the
> locales of the same name on different OSs have been known to have
> subtly different behavior.


Exactly! So portability and localization are possible only by having
a converter for each platform that knows the quirks of that platform.
Whether sizeof(wchar_t) is 2 or 4 does not matter at all, since the
code that produces it is different anyway.

> Finally, I can't think of a useful example off the top of my head
> where sorting based on locale would be required except when
> "printing", to the screen, file, etc., but this doesn't convince me
> that there is no use for it.


No need to nail me. I only confirm that I have not met a need for it,
but I cannot prove that it does not exist. I fight problems that I
meet in the field, not theoretical possibilities.

> As a potential example, should you have
> to bring in an entire GUI framework just to implement the Unix utility
> "sort" except with an additional locale option? That seems silly to
> me.


No. The GUI sorts if there is a GUI, and printing is part of the GUI
(if it really deserves to be named a GUI, that is). If the text goes
elsewhere, then it is not a GUI, so why should I sort without a user
to see it? As for the GUI, I am optimistic there. The GUI sorts based
on the things it uses. For example:

bool QString::operator< ( const QString & other ) const {}

In case of a theoretical failure on a particular case/platform/locale
I would get a defect report, could forward a bug to Nokia and
meanwhile write some custom operator to be used instead:

bool hack::broken_platform_name_here::less( const QString & one,
const QString & another);

In practice, however, it seems to work, or the problem is classified
as cosmetic or minor. Such problems do not affect success.
 