Re: languages with full unicode support

 
 
Tim Roberts
06-28-2006
"Xah Lee" <(E-Mail Removed)> wrote:

>Languages with Full Unicode Support
>
>As far as i know, Java and JavaScript are languages with full, complete
>unicode support. That is, they allow names to be defined using unicode.
>(the JavaScript engine used by FireFox supports this)
>
>As far as i know, here's a few other langs' status:
>
>C ? No.


This is implementation-defined in C. A compiler is allowed to accept
variable names with alphabetic Unicode characters outside of ASCII.
--
- Tim Roberts, (E-Mail Removed)
Providenza & Boekelheide, Inc.
 
 
 
 
 
Joachim Durchholz
06-28-2006
Tim Roberts schrieb:
> "Xah Lee" <(E-Mail Removed)> wrote:
>> C ? No.

>
> This is implementation-defined in C. A compiler is allowed to accept
> variable names with alphabetic Unicode characters outside of ASCII.


Hmm... that code would be nonportable, so C support for Unicode is
half-baked at best.

Regards,
Jo
 
 
 
 
 
David Hopwood
06-28-2006
Tim Roberts wrote:
> "Xah Lee" <(E-Mail Removed)> wrote:
>
>>Languages with Full Unicode Support
>>
>>As far as i know, Java and JavaScript are languages with full, complete
>>unicode support. That is, they allow names to be defined using unicode.
>>(the JavaScript engine used by FireFox supports this)
>>
>>As far as i know, here's a few other langs' status:
>>
>>C ? No.

>
> This is implementation-defined in C. A compiler is allowed to accept
> variable names with alphabetic Unicode characters outside of ASCII.


It is not implementation-defined in C99 whether Unicode characters are
accepted; only how they are encoded directly in the source multibyte character
set.

Characters escaped using \uHHHH or \U00HHHHHH (H is a hex digit), and that
are in the sets of characters defined by Unicode for identifiers, are required
to be supported, and should be mangled in some consistent way by a platform's
linker. There are Unicode text editors which encode/decode \u and \U on the fly,
so you can treat this essentially like a Unicode transformation format (it
would have been nicer to require support for UTF-8, but never mind).


C99 6.4.2.1:

# 3 Each universal character name in an identifier shall designate a character
# whose encoding in ISO/IEC 10646 falls into one of the ranges specified in
# annex D. 59) The initial character shall not be a universal character name
# designating a digit. An implementation may allow multibyte characters that
# are not part of the basic source character set to appear in identifiers;
# which characters and their correspondence to universal character names is
# implementation-defined.
#
# 59) On systems in which linkers cannot accept extended characters, an encoding
# of the universal character name may be used in forming valid external
# identifiers. For example, some otherwise unused character or sequence of
# characters may be used to encode the \u in a universal character name.
# Extended characters may produce a long external identifier.
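
Java, for comparison, does the same kind of decoding as a fixed pre-lexing
step: \uHHHH escapes are translated before tokenization, so they can appear
inside identifiers there too. A minimal sketch (the class and variable names
are mine):

public class UnicodeEscapeDemo {
    public static void main(String[] args) {
        // The escape below decodes to GREEK SMALL LETTER ALPHA (U+03B1)
        // before tokenization, so this declares a variable whose name is
        // a single non-ASCII letter.
        int \u03B1 = 42;
        System.out.println(\u03B1);   // prints 42
    }
}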

--
David Hopwood <(E-Mail Removed)>
 
 
Chris Uppal
06-28-2006
Joachim Durchholz wrote:

> > This is implementation-defined in C. A compiler is allowed to accept
> > variable names with alphabetic Unicode characters outside of ASCII.

>
> Hmm... that code would be nonportable, so C support for Unicode is
> half-baked at best.


Since the interpretation of characters which are yet to be added to
Unicode is undefined (will they be digits, "letters", operators, symbols,
punctuation...?), there doesn't seem to be any sane way that a language could
allow an unrestricted choice of Unicode in identifiers. Hence, it must define
a specific allowed sub-set. C certainly defines an allowed subset of Unicode
characters -- so I don't think you could call its Unicode support "half-baked"
(not in that respect, anyway). A case -- not entirely convincing, IMO -- could
be made that it would be better to allow a wider range of characters.

And no, I don't think Java's approach -- where there /is no defined set of
allowed identifier characters/ -- makes any sense at all.

-- chris




 
 
David Hopwood
06-28-2006
Note Followup-To: comp.lang.java.programmer

Chris Uppal wrote:
> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators, symbols,
> punctuation...?), there doesn't seem to be any sane way that a language could
> allow an unrestricted choice of Unicode in identifiers. Hence, it must define
> a specific allowed sub-set. C certainly defines an allowed subset of Unicode
> characters -- so I don't think you could call its Unicode support "half-baked"
> (not in that respect, anyway). A case -- not entirely convincing, IMO -- could
> be made that it would be better to allow a wider range of characters.
>
> And no, I don't think Java's approach -- where there /is no defined set of
> allowed identifier characters/ -- makes any sense at all.


Java does have a defined set of allowed identifier characters. However, you
certainly have to go around the houses a bit to work out what that set is:


<http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.8>

# An identifier is an unlimited-length sequence of Java letters and Java digits,
# the first of which must be a Java letter. An identifier cannot have the same
# spelling (Unicode character sequence) as a keyword (§3.9), boolean literal
# (§3.10.3), or the null literal (§3.10.7).
[...]
# A "Java letter" is a character for which the method
# Character.isJavaIdentifierStart(int) returns true. A "Java letter-or-digit"
# is a character for which the method Character.isJavaIdentifierPart(int)
# returns true.
[...]
# Two identifiers are the same only if they are identical, that is, have the
# same Unicode character for each letter or digit.

For Java 1.5.0:

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html>

# Character information is based on the Unicode Standard, version 4.0.

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isJavaIdentifierStart(int)>

# A character may start a Java identifier if and only if one of the following
# conditions is true:
#
# * isLetter(codePoint) returns true
# * getType(codePoint) returns LETTER_NUMBER
# * the referenced character is a currency symbol (such as "$")

[This means that getType(codePoint) returns CURRENCY_SYMBOL, i.e. Unicode
General Category Sc.]

# * the referenced character is a connecting punctuation character (such as "_").

[This means that getType(codePoint) returns CONNECTOR_PUNCTUATION, i.e. Unicode
General Category Pc.]

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isJavaIdentifierPart(int)>

# A character may be part of a Java identifier if any of the following are true:
#
# * it is a letter
# * it is a currency symbol (such as '$')
# * it is a connecting punctuation character (such as '_')
# * it is a digit
# * it is a numeric letter (such as a Roman numeral character)

[General Category Nl.]

# * it is a combining mark

[General Category Mc (see <http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf>).]

# * it is a non-spacing mark

[General Category Mn (ditto).]

# * isIdentifierIgnorable(codePoint) returns true for the character

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isDigit(int)>

# A character is a digit if its general category type, provided by
# getType(codePoint), is DECIMAL_DIGIT_NUMBER.

[General Category Nd.]

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isIdentifierIgnorable(int)>

# The following Unicode characters are ignorable in a Java identifier or a Unicode
# identifier:
#
# * ISO control characters that are not whitespace
# o '\u0000' through '\u0008'
# o '\u000E' through '\u001B'
# o '\u007F' through '\u009F'
# * all characters that have the FORMAT general category value

[FORMAT is General Category Cf.]

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isLetter(int)>

# A character is considered to be a letter if its general category type, provided
# by getType(codePoint), is any of the following:
#
# * UPPERCASE_LETTER
# * LOWERCASE_LETTER
# * TITLECASE_LETTER
# * MODIFIER_LETTER
# * OTHER_LETTER

====

To cut a long story short, the syntax of identifiers in Java 1.5 is therefore:

Keyword ::= one of
    abstract   continue   for          new         switch
    assert     default    if           package     synchronized
    boolean    do         goto         private     this
    break      double     implements   protected   throw
    byte       else       import       public      throws
    case       enum       instanceof   return      transient
    catch      extends    int          short       try
    char       final      interface    static      void
    class      finally    long         strictfp    volatile
    const      float      native       super       while

Identifier        ::= IdentifierChars butnot (Keyword | "true" | "false" | "null")
IdentifierChars   ::= JavaLetter | IdentifierChars JavaLetterOrDigit
JavaLetter        ::= Lu | Ll | Lt | Lm | Lo | Nl | Sc | Pc
JavaLetterOrDigit ::= JavaLetter | Nd | Mn | Mc |
                      U+0000..0008 | U+000E..001B | U+007F..009F | Cf

where the two-letter terminals refer to General Categories in Unicode 4.0.0
(exactly).
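
For what it's worth, this definition can be exercised directly through the
Character API. A small sketch (the helper name isJavaIdentifier is mine;
keywords and the boolean/null literals are deliberately not excluded):

public class IdentCheck {
    // True if s is a well-formed sequence of Java identifier characters.
    static boolean isJavaIdentifier(String s) {
        if (s.length() == 0) return false;
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) return false;
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isJavaIdentifier("caf\u00E9"));  // true: U+00E9 is Ll
        System.out.println(isJavaIdentifier("3colors"));    // false: Nd cannot start
        System.out.println(isJavaIdentifier("a\u200Bb"));   // true: U+200B is Cf
    }
}

The third case anticipates the point below: the "ignorable" Cf character is
accepted as an identifier character, not skipped.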

Note that the so-called "ignorable" characters (for which
isIdentifierIgnorable(codePoint) returns true) are not ignorable; they are
treated like any other identifier character. This quote from the API spec:

# The following Unicode characters are ignorable in a Java identifier [...]

should be ignored (no pun intended). It is contradicted by:

# Two identifiers are the same only if they are identical, that is, have the
# same Unicode character for each letter or digit.

in the language spec. Unicode does have a concept of ignorable characters in
identifiers, which is probably where this documentation bug crept in.

The inclusion of U+0000 and various control characters in the set of valid
identifier characters is also a dubious decision, IMHO.

Note that I am not defending in any way the complexity of this definition; there's
clearly no excuse for it (or for the "ignorable" documentation bug). The language
spec should have been defined directly in terms of the Unicode General Categories,
and then the API in terms of the language spec. The way it is done now is
completely backwards.

--
David Hopwood <(E-Mail Removed)>
 
 
Joachim Durchholz
07-01-2006
Chris Uppal schrieb:
> Joachim Durchholz wrote:
>
>>> This is implementation-defined in C. A compiler is allowed to accept
>>> variable names with alphabetic Unicode characters outside of ASCII.

>> Hmm... that code would be nonportable, so C support for Unicode is
>> half-baked at best.

>
> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators, symbols,
> punctuation...?), there doesn't seem to be any sane way that a language could
> allow an unrestricted choice of Unicode in identifiers.


I don't think this is a problem in practice. E.g. if a language uses the
usual definition for identifiers (first letter, then letters/digits),
you end up with a language that changes its definition on the whims of
the Unicode consortium, but that's less of a problem than one might
think at first.

I'd expect two kinds of changes in character categorization: additions
and corrections. (Any others?)

Additions are relatively unproblematic. Existing code will remain valid
and retain its semantics. The new characters will be available for new
programs.
There's a slight technological complication: the compiler needs to be
able to look up the newest definition. In other words, for a compiler to
run, it needs to be able to access http://unicode.org, or the language
infrastructure needs a way to carry around various revisions of the
Unicode tables and select the newest one.
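
The JDK, for what it's worth, takes the latter route: each release bundles a
snapshot of the Unicode tables, observable through Character.getType. A
minimal sketch (the class name is mine; U+1900 is a Limbu letter that was
only added in Unicode 4.0):

public class CategoryProbe {
    public static void main(String[] args) {
        // On a runtime with pre-4.0 tables this reports UNASSIGNED (0);
        // on a 4.0-based runtime it reports OTHER_LETTER (5), which also
        // makes the character a valid identifier start.
        char c = '\u1900';
        System.out.println("getType: " + Character.getType(c));
        System.out.println("identifier start? " + Character.isJavaIdentifierStart(c));
    }
}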

Corrections are technically more problematic, but then we can rely on
the common sense of the programmers. If the Unicode consortium
miscategorized a character as a letter, the programmers that use that
character set will probably know it well enough to avoid its use. It
will probably not even occur to them that that character could be a
letter.


Actually I'm not sure that Unicode is important for long-lived code.
Code tends to not survive very long unless it's written in English, in
which case anything outside of strings is in 7-bit ASCII. So the
majority of code won't ever be affected by Unicode problems - Unicode is
more a way of lowering entry barriers.

Regards,
Jo
 
 
Dr.Ruud
07-01-2006
Chris Uppal schreef:

> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators,
> symbols, punctuation...?), there doesn't seem to be any sane way
> that a language could allow an unrestricted choice of Unicode in
> identifiers.


The Perl code below prints:

xdigit
22 /194522 = 0.011% (lower: 6, upper: 6)
ascii
128 /194522 = 0.066% (lower: 26, upper: 26)
\d
268 /194522 = 0.138%
digit
268 /194522 = 0.138%
IsNumber
612 /194522 = 0.315%
alpha
91183 /194522 = 46.875% (lower: 1380, upper: 1160)
alnum
91451 /194522 = 47.013% (lower: 1380, upper: 1160)
word
91801 /194522 = 47.193% (lower: 1380, upper: 1160)
graph
102330 /194522 = 52.606% (lower: 1380, upper: 1160)
print
102349 /194522 = 52.616% (lower: 1380, upper: 1160)
blank
18 /194522 = 0.009%
space
24 /194522 = 0.012%
punct
374 /194522 = 0.192%
cntrl
6473 /194522 = 3.328%


Especially look at 'word', the same as \w, which for ASCII is
[0-9A-Za-z_].


==8<===================
#!/usr/bin/perl
# Program-Id: unicount.pl
# Subject: show Unicode statistics

use strict ;
use warnings ;

use Data::Alias ;

binmode STDOUT, ':utf8' ;

my @table =
# +--Name-------+---qRegexp--------+-C-+-L-+-U-+
(
  [ 'xdigit'   , qr/[[:xdigit:]]/  , 0 , 0 , 0 ] ,
  [ 'ascii'    , qr/[[:ascii:]]/   , 0 , 0 , 0 ] ,
  [ '\\d'      , qr/\d/            , 0 , 0 , 0 ] ,
  [ 'digit'    , qr/[[:digit:]]/   , 0 , 0 , 0 ] ,
  [ 'IsNumber' , qr/\p{IsNumber}/  , 0 , 0 , 0 ] ,
  [ 'alpha'    , qr/[[:alpha:]]/   , 0 , 0 , 0 ] ,
  [ 'alnum'    , qr/[[:alnum:]]/   , 0 , 0 , 0 ] ,
  [ 'word'     , qr/[[:word:]]/    , 0 , 0 , 0 ] ,
  [ 'graph'    , qr/[[:graph:]]/   , 0 , 0 , 0 ] ,
  [ 'print'    , qr/[[:print:]]/   , 0 , 0 , 0 ] ,
  [ 'blank'    , qr/[[:blank:]]/   , 0 , 0 , 0 ] ,
  [ 'space'    , qr/[[:space:]]/   , 0 , 0 , 0 ] ,
  [ 'punct'    , qr/[[:punct:]]/   , 0 , 0 , 0 ] ,
  [ 'cntrl'    , qr/[[:cntrl:]]/   , 0 , 0 , 0 ] ,
) ;

my @codepoints =                   # the assigned planes, skipping the
(                                  # surrogates and the noncharacters
  0x0000 .. 0xD7FF,
  0xE000 .. 0xFDCF,
  0xFDF0 .. 0xFFFD,
  0x10000 .. 0x1FFFD,
  0x20000 .. 0x2FFFD,
# 0x30000 .. 0x3FFFD,              # etc.
) ;

for my $row ( @table )
{
    alias my ($name, $qrx, $count, $lower, $upper) = @$row ;

    printf "\n%s\n", $name ;

    my $n = 0 ;

    for ( @codepoints )
    {
        local $_ = chr ;           # int-to-char conversion
        $n++ ;

        if ( /$qrx/ )
        {
            $count++ ;
            $lower++ if / [[:lower:]] /x ;
            $upper++ if / [[:upper:]] /x ;
        }
    }

    my $show_lower_upper =
        ($lower || $upper)
        ? sprintf( " (lower:%6d, upper:%6d)"
                 , $lower
                 , $upper
                 )
        : '' ;

    printf "%6d /%6d =%7.3f%%%s\n"
         , $count
         , $n
         , 100 * $count / $n
         , $show_lower_upper ;
}
__END__

--
Affijn, Ruud

"Gewoon is een tijger." ("Ordinary is a tiger.")


 
 
David Hopwood
07-01-2006
Joachim Durchholz wrote:
> Chris Uppal schrieb:
>> Joachim Durchholz wrote:
>>
>>>> This is implementation-defined in C. A compiler is allowed to accept
>>>> variable names with alphabetic Unicode characters outside of ASCII.
>>>
>>> Hmm... that code would be nonportable, so C support for Unicode is
>>> half-baked at best.

>>
>> Since the interpretation of characters which are yet to be added to
>> Unicode is undefined (will they be digits, "letters", operators, symbols,
>> punctuation...?), there doesn't seem to be any sane way that a
>> language could allow an unrestricted choice of Unicode in identifiers.

>
> I don't think this is a problem in practice. E.g. if a language uses the
> usual definition for identifiers (first letter, then letters/digits),
> you end up with a language that changes its definition on the whims of
> the Unicode consortium, but that's less of a problem than one might
> think at first.


It is not a problem at all. See the stability policies in
<http://www.unicode.org/reports/tr31/tr31-2.html>.

> Actually I'm not sure that Unicode is important for long-lived code.
> Code tends to not survive very long unless it's written in English, in
> which case anything outside of strings is in 7-bit ASCII. So the
> majority of code won't ever be affected by Unicode problems - Unicode is
> more a way of lowering entry barriers.


Unicode in identifiers has certainly been less important than some thought
it would be -- and not at all important for open source projects, for example,
which essentially have to use English to get the widest possible participation.

--
David Hopwood <(E-Mail Removed)>
 
 
Dale King
07-05-2006
Tim Roberts wrote:
> "Xah Lee" <(E-Mail Removed)> wrote:
>
>> Languages with Full Unicode Support
>>
>> As far as i know, Java and JavaScript are languages with full, complete
>> unicode support. That is, they allow names to be defined using unicode.
>> (the JavaScript engine used by FireFox supports this)
>>
>> As far as i know, here's a few other langs' status:
>>
>> C ? No.

>
> This is implementation-defined in C. A compiler is allowed to accept
> variable names with alphabetic Unicode characters outside of ASCII.


I don't think it is implementation-defined. I believe it is actually
required by the spec. The trouble is that so few compilers actually
comply with the spec. A few years ago I asked for someone to point to a
fully compliant compiler, and no one could.

--
Dale King
 
 
Tim Roberts
07-06-2006
Dale King <(E-Mail Removed)> wrote:
>Tim Roberts wrote:
>> "Xah Lee" <(E-Mail Removed)> wrote:
>>
>>> Languages with Full Unicode Support
>>>
>>> As far as i know, Java and JavaScript are languages with full, complete
>>> unicode support. That is, they allow names to be defined using unicode.
>>> (the JavaScript engine used by FireFox supports this)

>>
>> This is implementation-defined in C. A compiler is allowed to accept
>> variable names with alphabetic Unicode characters outside of ASCII.

>
>I don't think it is implementation-defined. I believe it is actually
>required by the spec.


C99 does have a list of Unicode codepoints that are required to be accepted
in identifiers, although implementations are free to accept other
characters as well. For example, few people realize that Visual C++
accepts the dollar sign $ in an identifier.
--
- Tim Roberts, (E-Mail Removed)
Providenza & Boekelheide, Inc.
 
 
 
 