Why No Supplemental Characters In Character Literals?

 
 
Joshua Cranmer
02-04-2011
On 02/04/2011 04:30 PM, Tom Anderson wrote:
> A question to the house, then: has anyone ever invented a data structure
> for strings which allows space-efficient storage for strings in
> different scripts, but also allows time-efficient implementation of the
> common string operations?


I think the real answer is that maybe we need to rethink traditional
string APIs. In particular, we have the issue of diacritics, since "A
[combining diacritic `]" is essentially one character stored in 3, 4, or 8
bytes, depending on the storage format.
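
A quick sketch of what I mean (untested beyond eyeballing, and assuming
java.text.Normalizer and the usual charsets are available):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class CombiningDemo {
    public static void main(String[] args) {
        // 'A' followed by U+0300 COMBINING GRAVE ACCENT: one perceived character.
        String decomposed = "A\u0300";

        System.out.println(decomposed.length());                                // 2 UTF-16 code units
        System.out.println(decomposed.codePointCount(0, decomposed.length()));  // 2 code points
        System.out.println(decomposed.getBytes(StandardCharsets.UTF_8).length); // 3 bytes in UTF-8
        System.out.println(decomposed.getBytes(StandardCharsets.UTF_16BE).length); // 4 bytes in UTF-16
        // UTF-32 is not a required charset, but mainstream JDKs ship it:
        System.out.println(decomposed.getBytes(Charset.forName("UTF-32BE")).length); // 8 bytes

        // NFC normalization collapses the pair to the precomposed U+00C0.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.length());                                  // 1
    }
}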

I would be surprised if there weren't already some studies on the impact
of using UTF-8-based strings in UTF-16/-32-ish contexts.

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
 
Lawrence D'Oliveiro
02-04-2011
In message <iihikc$dpj$(E-Mail Removed)-september.org>, markspace wrote:

> <http://download.oracle.com/javase/6/docs/api/java/lang/Character.html>
>
> "The char data type (and therefore the value that a Character object
> encapsulates) are based on the original Unicode specification, which
> defined characters as fixed-width 16-bit entities.


When did the unification with ISO-10646 happen? That was already talking
about 32-bit characters.

> "A char value, therefore, represents Basic Multilingual Plane (BMP) code
> points, including the surrogate code points, or code units of the UTF-16
> encoding. An int value represents all Unicode code points, including
> supplementary code points.


Why was there even a need to spell out the size of a char? If you wanted
types with explicit sizes, there was already byte, short, int and long.
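
For concreteness, the distinction the quoted Javadoc is drawing looks like
this in code (a rough sketch using only standard java.lang methods):

public class SupplementaryDemo {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so there is
        // no char literal for it; an int holds the code point directly.
        int cp = 0x1D11E;

        String s = new String(Character.toChars(cp)); // stored as a surrogate pair
        System.out.println(s.length());                              // 2 chars (code units)
        System.out.println(s.codePointCount(0, s.length()));         // 1 code point
        System.out.println(s.codePointAt(0) == cp);                  // true
        System.out.println(Character.isSupplementaryCodePoint(cp));  // true
        System.out.println(Character.charCount(cp));                 // 2
    }
}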
 
 
Lawrence D'Oliveiro
02-04-2011
In message <(E-Mail Removed)>, Roedy Green wrote:

> Personally, I don’t see the point of any great rush to support 32-bit
> Unicode. ... The rest I can’t imagine ever using unless I took up a career
> in anthropology ...


But you, or another programmer, might work for an anthropologist. The
computer is a universal machine, after all. If a programming language can’t
support that universality, what good is it?

 
 
Arne Vajhøj
02-04-2011
On 04-02-2011 17:43, Lawrence D'Oliveiro wrote:
> In message<iihikc$dpj$(E-Mail Removed)-september.org>, markspace wrote:
>> <http://download.oracle.com/javase/6/docs/api/java/lang/Character.html>
>>
>> "The char data type (and therefore the value that a Character object
>> encapsulates) are based on the original Unicode specification, which
>> defined characters as fixed-width 16-bit entities.

>
> When did the unification with ISO-10646 happen? That was already talking
> about 32-bit characters.
>
>> "A char value, therefore, represents Basic Multilingual Plane (BMP) code
>> points, including the surrogate code points, or code units of the UTF-16
>> encoding. An int value represents all Unicode code points, including
>> supplementary code points.

>
> Why was there even a need to spell out the size of a char? If you wanted
> types with explicit sizes, there was already byte, short, int and long.


It provides well-defined semantics.

Nobody wanted to repeat C89's undefined/implementation-specific
behavior.

Arne
 
 
Roedy Green
02-04-2011
On Sat, 05 Feb 2011 11:43:18 +1300, Lawrence D'Oliveiro
<(E-Mail Removed)_zealand> wrote, quoted or indirectly quoted
someone who said :

>
>Why was there even a need to spell out the size of a char? If you wanted
>types with explicit sizes, there was already byte, short, int and long.


I think it is because Java's designers thought at the byte-code level.
There, chars are unsigned 16-bit values. That they are used to hold
characters was not really of interest to them. Much of Java is just a thin
wrapper around byte code; it has no high-level features of its own.
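
A small illustration of that unsigned 16-bit nature (plain language-level
Java, nothing beyond what the JLS guarantees):

public class CharWidthDemo {
    public static void main(String[] args) {
        // char is an unsigned 16-bit integral type: casting -1 wraps to 0xFFFF.
        char c = (char) -1;
        System.out.println((int) c);                    // 65535
        System.out.println(Character.SIZE);             // 16
        System.out.println((int) Character.MAX_VALUE);  // 65535

        // Arithmetic on char promotes to int, just as at the byte-code level.
        char a = 'A';
        int code = a + 1;               // 66, an int
        char next = (char) (a + 1);     // 'B' needs an explicit narrowing cast
        System.out.println(code + " " + next);
    }
}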

--
Roedy Green Canadian Mind Products
http://mindprod.com
To err is human, but to really foul things up requires a computer.
~ Farmer's Almanac
It is breathtaking how a misplaced comma in a computer program can
shred megabytes of data in seconds.
 
 
Roedy Green
02-04-2011
On Sat, 05 Feb 2011 11:26:53 +1300, Lawrence D'Oliveiro
<(E-Mail Removed)_zealand> wrote, quoted or indirectly quoted
someone who said :

>Why was there a need to define the size of a character at all?


Because C worked that way, and that led to non-WORA (write once, run anywhere) code.
--
Roedy Green Canadian Mind Products
http://mindprod.com
To err is human, but to really foul things up requires a computer.
~ Farmer's Almanac
It is breathtaking how a misplaced comma in a computer program can
shred megabytes of data in seconds.
 
 
Arne Vajhøj
02-04-2011
On 04-02-2011 17:26, Lawrence D'Oliveiro wrote:
> In message<iigcva$90q$(E-Mail Removed)-september.org>, Mike Schilling wrote:
>> Yes, it does (contain 16 bits.)

>
> Yeah, I didn’t realize it was spelled out that way in the original language
> spec.


It is. And given that you talk about problems in the JLS in another
thread, I think you should have read it.

It should also be in most Java beginners' books.

It is also in the Java tutorial:

http://download.oracle.com/javase/tu...datatypes.html

> What a short-sighted decision.


Back then Unicode was 16-bit.

The increase beyond 16 bits came in 1996, after the release of Java 1.0.

>> It was defined to do so before there were supplemental characters ...

>
> Why was there a need to define the size of a character at all?


Well-defined data types are a very good thing.

> Even in the
> early days of the unification of Unicode and ISO-10646, there was already
> provision for UCS-4.


Java decided to do Unicode, and at that time 16 bits were sufficient
for that.

> Did they really think that could safely be ignored?


Apparently yes.

Given that 16 bits had just replaced 8 bits, I think it
is understandable.

Arne



 
 
Roedy Green
02-04-2011
On Fri, 04 Feb 2011 13:44:08 -0500, Joshua Cranmer
<(E-Mail Removed)> wrote, quoted or indirectly quoted someone
who said :

>Well, the real problem is that Unicode swore that 16 bits were enough
>for everybody,


Be fair. I bought a book showing thousands of Unicode glyphs, including
pages and pages of Han ideographs. There were plenty of holes for
future growth. At the time I thought it was overkill. When I started
my career, character sets had 64 glyphs, including control chars.
Later it was considered "extravagant" to use lower case since it took
so much longer to print. In the very early days, each installation
designed its own local character set. I recall sitting in one such
meeting, with Vern Detwiler (later of MacDonald Detwiler) explaining
the virtues of a new code called ASCII.
--
Roedy Green Canadian Mind Products
http://mindprod.com
To err is human, but to really foul things up requires a computer.
~ Farmer's Almanac
It is breathtaking how a misplaced comma in a computer program can
shred megabytes of data in seconds.
 
 
Arne Vajhøj
02-04-2011
On 04-02-2011 13:26, Roedy Green wrote:
> On Fri, 04 Feb 2011 18:59:30 +1300, Lawrence D'Oliveiro
> <(E-Mail Removed)_zealand> wrote, quoted or indirectly quoted
> someone who said :
>> Why was it decreed in the language spec that characters beyond U+FFFF are
>> not allowed in character literals, when they are allowed everywhere else (in
>> string literals, in the program text, in character and string values etc)?

>
> because they did not exist at the time Java was invented. Extended
> literals were tacked on to the 16-bit internal scheme in a somewhat
> half-hearted way. To go to full 32-bit internally would gobble RAM
> hugely.
>
> Java does not have 32-bit String literals, like C-style code points,
> e.g. \U0001d504. Note the capital U vs the usual \ud504. I wrote the
> SurrogatePair applet (see
> http://mindprod.com/applet/surrogatepair.html)
> to convert C-style code points to the arcane surrogate pairs, to let you
> use 32-bit Unicode glyphs in your programs.
>
> Personally, I don't see the point of any great rush to support 32-bit
> Unicode. The new symbols will be rarely used. Consider what's there.
> The only ones I would conceivably use are musical symbols and
> Mathematical Alphanumeric symbols (especially the German black letters
> so favoured in real analysis). The rest I can't imagine ever using
> unless I took up a career in anthropology, i.e. Linear B syllabary (I
> have not a clue what it is), Linear B ideograms (looks like symbols
> for categorising cave petroglyphs), Aegean Numbers (counting with
> stones and sticks), Old Italic (looks like Phoenician), Gothic
> (medieval script), Ugaritic (cuneiform), Deseret (Mormon), Shavian
> (George Bernard Shaw's phonetic script), Osmanya (Somalian), Cypriot
> syllabary, Byzantine music symbols (looks like Arabic), Musical
> Symbols, Tai Xuan Jing Symbols (truncated I-Ching), CJK
> extensions (Chinese Japanese Korean) and tags (letters with blank
> price tags).


Most western people never use them.

But that does not mean much, as we got our stuff in the low code points.

The relevant question is whether Chinese/Japanese/Korean text uses the
code points at or above 64K.
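
As an aside, the code-point-to-surrogate-pair conversion that Roedy's
SurrogatePair applet handles is only a couple of lines of arithmetic. A
rough sketch of the idea (my own illustration, not the applet's actual
code; Character.toChars does the same thing for you):

public class SurrogateMath {
    public static void main(String[] args) {
        int cp = 0x1D504; // U+1D504 MATHEMATICAL FRAKTUR CAPITAL A

        // UTF-16 encoding of a supplementary code point:
        int offset = cp - 0x10000;                       // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // lead surrogate
        char low  = (char) (0xDC00 + (offset & 0x3FF));  // trail surrogate

        System.out.printf("\\u%04X\\u%04X%n", (int) high, (int) low); // prints \uD835\uDD04

        // Sanity check against the library's conversion.
        char[] viaLibrary = Character.toChars(cp);
        System.out.println(viaLibrary[0] == high && viaLibrary[1] == low); // true
    }
}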


Arne
 
 
Arne Vajhøj
02-04-2011
On 04-02-2011 18:08, Roedy Green wrote:
> On Fri, 04 Feb 2011 13:44:08 -0500, Joshua Cranmer
> <(E-Mail Removed)> wrote, quoted or indirectly quoted someone
> who said :
>> Well, the real problem is that Unicode swore that 16 bits were enough
>> for everybody,

>
> be fair. I bought a book showing thousands of Unicode glyphs including
> pages and pages of HAN ideographs. There were plenty of holes for
> future growth. At the time I thought it was overkill. When I started
> my career, character sets had 64 glyphs, including control chars.
> Later it was considered "extravagant" to use lower case since it took
> so much longer to print. In the very early days, each installation
> designed its own local character set. I recall sitting in one such
> meeting, and Vern Detwiler (later of MacDonald Detwiler) explaining
> the virtues of new code called ASCII.


Impressive that he wanted to discuss that with a 12-year-old.

Arne
 