Why No Supplemental Characters In Character Literals?

 
 
Roedy Green
02-04-2011
On Fri, 04 Feb 2011 18:59:30 +1300, Lawrence D'Oliveiro
<(E-Mail Removed)_zealand> wrote, quoted or indirectly quoted
someone who said :

>Why was it decreed in the language spec that characters beyond U+FFFF are
>not allowed in character literals, when they are allowed everywhere else (in
>string literals, in the program text, in character and string values etc)?


Because they did not exist at the time Java was invented. Extended
literals were tacked onto the 16-bit internal scheme in a somewhat
half-hearted way. Going to full 32 bits internally would gobble RAM
hugely.

Java does not have 32-bit string literals like C-style code points,
e.g. \U0001d504 (note the capital U versus the usual \ud504). I wrote
the SurrogatePair applet (see
http://mindprod.com/applet/surrogatepair.html)
to convert C-style code points to the arcane surrogate pairs, so you
can use 32-bit Unicode glyphs in your programs.
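
A minimal sketch of that conversion in plain Java, using the standard
Character.toChars method (the class name here is invented for
illustration):

/* SurrogateDemo -- illustrative only */
public class SurrogateDemo
{
    public static void main( String[] args )
    {
        // U+1D504 (MATHEMATICAL FRAKTUR CAPITAL A) lies beyond U+FFFF,
        // so there is no char literal for it.
        int codePoint = 0x1D504;
        char[] pair = Character.toChars( codePoint );   // { 0xD835, 0xDD04 }
        System.out.printf( "\\u%04X\\u%04X%n", (int) pair[0], (int) pair[1] );

        // The same character written in a String literal as a surrogate pair:
        String frakturA = "\uD835\uDD04";
        System.out.println( frakturA.codePointAt( 0 ) == codePoint );  // true
    }
}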


Personally, I don’t see the point of any great rush to support 32-bit
Unicode. The new symbols will be rarely used. Consider what’s there.
The only ones I would conceivably use are musical symbols and
Mathematical Alphanumeric symbols (especially the German black letters
so favoured in real analysis). The rest I can’t imagine ever using
unless I took up a career in anthropology, e.g. Linear B syllabary (I
have not a clue what it is), Linear B ideograms (looks like symbols
for categorising cave petroglyphs), Aegean Numbers (counting with
stones and sticks), Old Italic (looks like Phoenician), Gothic
(medieval script), Ugaritic (cuneiform), Deseret (Mormon), Shavian
(George Bernard Shaw’s phonetic script), Osmanya (Somalian), Cypriot
syllabary, Byzantine music symbols (looks like Arabic), Musical
Symbols, Tai Xuan Jing Symbols (truncated I-Ching), CJK
extensions (Chinese, Japanese, Korean) and tags (letters with blank
“price tags”).


--
Roedy Green Canadian Mind Products
http://mindprod.com
To err is human, but to really foul things up requires a computer.
~ Farmer's Almanac
It is breathtaking how a misplaced comma in a computer program can
shred megabytes of data in seconds.
 
Roedy Green
02-04-2011
On Fri, 04 Feb 2011 08:04:23 -0500, Joshua Cranmer
<(E-Mail Removed)> wrote, quoted or indirectly quoted someone
who said :

>The JLS clearly states that a char is an unsigned 16-bit value.


Perhaps char will be redefined as 32 bits, or a new unsigned 32-bit
echar type will be invented.

It is an intractable problem. Consider the logic that uses indexOf and
substring with character-index arithmetic. Most of it would go insane
if you threw a few 32-bit chars in there. You need something that
simulates an array of 32-bit chars to the programmer.
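
A small sketch of that breakage (the string and class names are invented
for illustration); char-index arithmetic works until a supplementary
character lands in the middle of it:

/* IndexDemo -- illustrative only */
public class IndexDemo
{
    public static void main( String[] args )
    {
        // "key:" + U+1D504 + "value": 11 chars but only 10 code points,
        // because the supplementary character occupies two chars.
        String s = "key:\uD835\uDD04value";

        System.out.println( s.length() );                          // 11
        System.out.println( s.codePointCount( 0, s.length() ) );   // 10

        // char-index arithmetic is fine for the ASCII part...
        System.out.println( s.substring( s.indexOf( ':' ) + 1 ) ); // fraktur A + "value"

        // ...but the "character after the colon", read with charAt,
        // is only half of the supplementary character:
        char broken = s.charAt( s.indexOf( ':' ) + 1 );
        System.out.println( Character.isHighSurrogate( broken ) );  // true

        // Code-point-aware stepping over that character works:
        int next = s.offsetByCodePoints( s.indexOf( ':' ) + 1, 1 );
        System.out.println( s.substring( next ) );                  // value
    }
}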

--
Roedy Green Canadian Mind Products
http://mindprod.com
 
Joshua Cranmer
02-04-2011
On 02/04/2011 12:10 PM, Mike Schilling wrote:
> "Arne Vajhøj" <(E-Mail Removed)> wrote in message
>> But since codepoints above U+FFFF were added after the String
>> class was defined, the options on how to handle them were
>> pretty limited.

>
> The sticky issue is, I think, that chars were defined as 16-bit. If that
> had been left undefined, they could have been extended to 24 bits, which
> would make things nice and regular again.


Well, the real problem is that Unicode swore that 16 bits were enough
for everybody, so people opted for the UTF-16 encoding in Unicode-aware
platforms (e.g., Windows uses 16-bit char values for wchar_t). When they
backtracked and widened the code space to 21 bits, every system that did
UTF-16 was now screwed, because UTF-16 "kind of" becomes a
variable-width format like UTF-8... but not really. Instead you get a
mess with surrogate characters, this distinction between UTF-16 and
UCS-2, and, in short, anything not in the Basic Multilingual Plane is a
recipe for disaster.
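
A small illustration of that mess (the class name is invented); code
written under the one-char-per-character assumption quietly corrupts
supplementary text:

/* Ucs2TrapDemo -- illustrative only */
public class Ucs2TrapDemo
{
    public static void main( String[] args )
    {
        // "a" + U+1D504 + "b": three code points, four chars.
        String s = "a\uD835\uDD04b";

        // Naive char-by-char reversal splits the surrogate pair,
        // producing an invalid sequence:
        StringBuilder naive = new StringBuilder();
        for ( int i = s.length() - 1; i >= 0; i-- ) {
            naive.append( s.charAt( i ) );
        }
        System.out.println( naive );

        // StringBuilder.reverse() treats surrogate pairs as single
        // characters, so the supplementary character survives:
        System.out.println( new StringBuilder( s ).reverse() );
    }
}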

Extending to 24 bits is problematic because 24 bits opens you up to
unaligned memory access on most, if not all, platforms, so you'd have to
go fully up to 32 bits (this is what the codePoint methods in String et
al. do). But considering the sheer number of Strings in memory, going to
32-bit storage for Strings doubles the size of that data... and can
increase overall memory consumption in some cases by 30-40%.

To make a long story short: Unicode made a very, very big mistake, and
everyone who designed their systems to be particularly i18n-aware before
that is now really smarting as a result.

It actually is possible to change the internal storage of String to a
UTF-8 representation (while keeping UTF-16/UTF-32 API access) and still
get good performance--people mostly use direct indexes into strings in
consistent access patterns (e.g., str.substring(str.indexOf(":") + 1) ),
so you can cache index lookup tables for a few values. It's ugly as hell
to code properly, taking into account proper multithreading, etc., but
it is not impossible.
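
A toy sketch of that idea (not an actual proposal; every name here is
invented, and it assumes valid UTF-8 with no error handling): store the
UTF-8 bytes and cache the last code-point-index to byte-offset lookup,
so sequential scans stay cheap.

import java.nio.charset.StandardCharsets;

/* Utf8String -- illustrative only */
public final class Utf8String
{
    private final byte[] utf8;
    // Cache of the most recent lookup: code point index and its byte offset.
    private int cachedCpIndex = 0;
    private int cachedByteOffset = 0;

    public Utf8String( String s )
    {
        this.utf8 = s.getBytes( StandardCharsets.UTF_8 );
    }

    /** Returns the code point at the given code point index (not char index). */
    public synchronized int codePointAt( int cpIndex )
    {
        // Restart from the beginning if the caller jumped backwards.
        if ( cpIndex < cachedCpIndex ) {
            cachedCpIndex = 0;
            cachedByteOffset = 0;
        }
        int offset = cachedByteOffset;
        for ( int i = cachedCpIndex; i < cpIndex; i++ ) {
            offset += sequenceLength( utf8[offset] );
        }
        cachedCpIndex = cpIndex;
        cachedByteOffset = offset;
        return decodeAt( offset );
    }

    private static int sequenceLength( byte lead )
    {
        int b = lead & 0xFF;
        if ( b < 0x80 ) return 1;   // ASCII
        if ( b < 0xE0 ) return 2;   // 110xxxxx
        if ( b < 0xF0 ) return 3;   // 1110xxxx
        return 4;                   // 11110xxx
    }

    private int decodeAt( int offset )
    {
        int len = sequenceLength( utf8[offset] );
        return new String( utf8, offset, len, StandardCharsets.UTF_8 ).codePointAt( 0 );
    }
}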

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
 
markspace
02-04-2011
On 2/4/2011 10:36 AM, Roedy Green wrote:
> On Fri, 04 Feb 2011 08:04:23 -0500, Joshua Cranmer
> <(E-Mail Removed)> wrote, quoted or indirectly quoted someone
> who said :
>
>> The JLS clearly states that a char is an unsigned 16-bit value.

>
> Perhaps char will be redefined as 32 bits, or a new unsigned 32-bit
> echar type will be invented.



An int is currently used for this purpose. For example,
Character.codePointAt(CharSequence,int) returns an int.


<http://download.oracle.com/javase/6/docs/api/java/lang/Character.html>


Also, from that same page, this explains the whole story in one go:


"Unicode Character Representations

"The char data type (and therefore the value that a Character object
encapsulates) are based on the original Unicode specification, which
defined characters as fixed-width 16-bit entities. The Unicode standard
has since been changed to allow for characters whose representation
requires more than 16 bits. The range of legal code points is now U+0000
to U+10FFFF, known as Unicode scalar value. (Refer to the definition of
the U+n notation in the Unicode standard.)

"The set of characters from U+0000 to U+FFFF is sometimes referred to as
the Basic Multilingual Plane (BMP). Characters whose code points are
greater than U+FFFF are called supplementary characters. The Java 2
platform uses the UTF-16 representation in char arrays and in the String
and StringBuffer classes. In this representation, supplementary
characters are represented as a pair of char values, the first from the
high-surrogates range, (\uD800-\uDBFF), the second from the
low-surrogates range (\uDC00-\uDFFF).

"A char value, therefore, represents Basic Multilingual Plane (BMP) code
points, including the surrogate code points, or code units of the UTF-16
encoding. An int value represents all Unicode code points, including
supplementary code points. The lower (least significant) 21 bits of int
are used to represent Unicode code points and the upper (most
significant) 11 bits must be zero.


....etc....
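
A short usage sketch of that int-as-code-point approach, iterating a
CharSequence with the static Character methods mentioned above (the
class name is invented for illustration):

/* CodePointLoopDemo -- illustrative only */
public class CodePointLoopDemo
{
    public static void main( String[] args )
    {
        CharSequence text = "B\uD835\uDD04C";   // B, U+1D504, C

        for ( int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt( text, i );  // an int holds any code point
            System.out.printf( "U+%04X%n", cp );
            i += Character.charCount( cp );             // 1 for BMP, 2 for supplementary
        }
    }
}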

 
markspace
02-04-2011
On 2/4/2011 9:37 AM, Daniele Futtorovic wrote:

> Pity they haven't touched upon java.lang.CharSequence. Probably out of
> concerns about compatibility.



You know that Character has static methods for pulling code points out
of a CharSequence, right?



 
Lew
02-04-2011
Lawrence D'Oliveiro wrote:
>>> Why was it decreed in the language spec that characters beyond U+FFFF are
>>> not allowed in character literals, when they are allowed everywhere else
>>> (in string literals, in the program text, in character and string values
>>> etc)?

>


Lew wrote:
>> Because a 'char' type holds only 16 bits.

>


Lawrence D'Oliveiro wrote:
> No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters in
> character and string values. Which you are.
>


/* DemoChar */
package eg;

public class DemoChar
{
    public static void main( String[] args )
    {
        // char is an unsigned 16-bit type; the addition promotes to int,
        // so this prints 65536.
        System.out.println( "Character.MAX_VALUE + 1 = "
                + (Character.MAX_VALUE + 1) );

        char foo1, foo2;
        foo1 = (char) (Character.MAX_VALUE - 1);
        foo2 = (char) (foo1 / 2);
        System.out.println( "foo1 = " + (int) foo1 + ", foo2 = " + (int) foo2 );

        foo1 = '§';
        foo2 = '@';
        char sum = (char) (foo1 + foo2);
        System.out.println( "foo1 + foo2 = " + sum );
    }
}

--
Lew
 
Daniele Futtorovic
02-04-2011
On 04/02/2011 20:27, markspace allegedly wrote:
> On 2/4/2011 9:37 AM, Daniele Futtorovic wrote:
>
>> Pity they haven't touched upon java.lang.CharSequence. Probably out of
>> concerns about compatibility.

>
>
> You know that Character has static methods for pulling code points out
> of a CharSequence, right?


Yeah. But that's not quite the same thing, is it? What with OOP and all.

 
Daniele Futtorovic
02-04-2011
On 04/02/2011 19:26, Roedy Green allegedly wrote:
> [...]
>
> The rest I can’t imagine ever using unless I took up a career in
> anthropology, e.g. Linear B syllabary, Linear B ideograms, Aegean
> Numbers, Old Italic, Gothic, Ugaritic, Deseret, Shavian, Osmanya,
> Cypriot syllabary, Byzantine music symbols, Musical Symbols, Tai Xuan
> Jing Symbols, CJK extensions and tags.


And Klingon!

--
DF.

 
Tom Anderson
02-04-2011
On Fri, 4 Feb 2011, Joshua Cranmer wrote:

>> "Arne Vajhřj" <(E-Mail Removed)> wrote in message
>>
>>> But since codepoints above U+FFFF was added after the String class was
>>> defined, then the options on how to handle it were pretty limited.

>
> Extending to 24 bits is problematic because 24 bits opens you up to
> unaligned memory access on most, if not all, platforms, so you'd have to
> go fully up to 32 bits (this is what the codePoint methods in String et
> al. do). But considering the sheer amount of Strings in memory, going to
> 32-bit memory storage for Strings now doubles the size of that data...
> and can increase memory consumption in some cases by 30-40%.


This is something I ponder quite a lot.

It's essential that computers be able to represent characters from any
living human script. The astral planes include some such characters,
notably in the CJK extensions, without which it is impossible to write
some people's names correctly. The necessity of supporting more than 2**16
codepoints is simply beyond question.

The problem is how to do it efficiently.

Going to strings of 24- or 32-bit characters would indeed be prohibitive
in its effect on memory. But isn't 16-bit already an eye-watering waste?
Most characters currently sitting in RAM around the world are, I would
wager, in the ASCII range: the great majority of characters in almost any
text in a Latin script will be ASCII, in that they won't have diacritics
[1] (and most text is still in Latin script), and almost all characters in
non-natural-language text (HTML and XML markup, configuration files,
filesystem paths) will be ASCII. A sizeable fraction of non-Latin text is
still encodable in one byte per character, using a national character set.
Forcing all users of programs written in Java (or any other platform which
uses a UTF-16 encoding) to spend two bytes on each of those characters to
ease the lives of the minority of users who store a lot of CJK text seems
wildly regressive.

I am, however, at a loss to suggest a practical alternative!

A question to the house, then: has anyone ever invented a data structure
for strings which allows space-efficient storage for strings in different
scripts, but also allows time-efficient implementation of the common
string operations?

Upthread, Joshua mentions the idea of using UTF-8 strings, and caching
codepoint-to-bytepoint mappings. That's certainly an approach that would
work, although I worry about the performance effect of generating so many
writes, the difficulty of making it correct in multithreaded systems, and
the dependency on a good cache hit rate to make it pay off.

Anyone else?

For extra credit, give a representation which also makes it simple and
efficient to do normalisation, reversal, and "find the first occurrence of
this character, ignoring diacritics".

tom

[1] I would be interested to hear of a language (more properly, an
orthography) using latin script in which a majority of characters, or even
an unusually large fraction, do have diacritics. The pinyin romanisation
of Mandarin uses a lot of accents. Hawaiian uses quite a lot. Some ways of
writing ancient Greek use a lot of diacritics, for breathings and accents
and in verse, for long and short syllables.

--
Understand the world we're living in
 
Lawrence D'Oliveiro
02-04-2011
In message <iigcva$90q$(E-Mail Removed)-september.org>, Mike Schilling wrote:

> Yes, it does (contain 16 bits).


Yeah, I didn’t realize it was spelled out that way in the original language
spec. What a short-sighted decision.

> It was defined to do so before there were supplemental characters ...


Why was there a need to define the size of a character at all? Even in the
early days of the unification of Unicode and ISO-10646, there was already
provision for UCS-4. Did they really think that could safely be ignored?

 