Re: Isn't java.lang.Character.html#{ isLetterFromLang(int codePoint,String ISOLangDef) missing from the spec?

 
 
Joshua Cranmer
Guest
Posts: n/a
 
      12-05-2010
On 12/04/2010 07:16 PM, (E-Mail Removed) wrote:
> One could possibly (and easily) check the Unicode code point ranges
> for each language, but I think a built-in method would be
> very useful for people parsing text from different languages.


Language is not so simple. First of all, code points don't necessarily
map to a `character' in a language--you can represent `è' both as the
single code point "Latin small letter e with grave" (U+00E8) and as
"Latin small letter e" followed by a "combining grave accent" (U+0300).
Second of all, what would you say makes a
character in a language? For the most part, é does not exist in English,
but, e.g., résumé is the proper spelling. Then you get complicated cases
like Japanese, which can write in hiragana, katakana, kanji, or rōmaji.
Technically, rōmaji is merely Latin transliteration of Japanese, so it's
debatable how much it is or isn't Japanese.

Finally, you run into the ambiguities of Unicode codepoints. Are
fullwidth roman letters valid for en-US, even though English typography
doesn't distinguish between fullwidth and halfwidth? English also
borrows the characters of other languages for various purposes: remember
that the abbreviation for micrometer is `μm', so is `μ' in en-US or not?

In my opinion, this is not generally useful enough to be worth having in
the standard library. Actually, I don't think Java even has Unicode
normalization functions, which are much more useful than divining
languages from code points.

> Do you know of any java packages to address these NLP issues? or, if
> you don't, is there something like that for text processing in ANSI C
> or C++? ~ Thanks lbrtchx


What are you really trying to do? If you are trying to detect languages
based on codepoints, that is not going to work that well. You would be
far better off trying to guess the language based on letter frequency, or
even just parsing the text as different languages and seeing which language
has the fewest "misspelled" words.
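
As a rough sketch of the second idea, the code below scores a text against a few word lists and picks the language with the most hits. Everything here is an assumption for illustration: the class `ToyLanguageGuesser`, the method `guess`, and especially the tiny hand-picked word lists, which stand in for real dictionaries. It uses `Set.of`, so it needs Java 9 or later.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class ToyLanguageGuesser {
    // Hypothetical, deliberately tiny word lists; a real tool would load full dictionaries.
    static final Map<String, Set<String>> COMMON_WORDS = new LinkedHashMap<>();
    static {
        COMMON_WORDS.put("en", Set.of("the", "and", "is", "of", "to", "in"));
        COMMON_WORDS.put("fr", Set.of("le", "la", "et", "est", "de", "un"));
        COMMON_WORDS.put("de", Set.of("der", "die", "und", "ist", "das", "ein"));
    }

    /** Returns the language tag whose word list matches the most tokens, or "unknown". */
    static String guess(String text) {
        // Split on anything that is not a letter, so punctuation and digits are dropped.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : COMMON_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            // Strictly greater: on a tie the language listed first wins.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("The cat is in the garden"));
        System.out.println(guess("Le chat est dans le jardin"));
    }
}
```

With lists this small the guess is fragile (short texts and shared words such as "die" will fool it), but the shape is the same with real dictionaries: count recognised words per language and take the best score.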

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
 
 
 
 
 
Roedy Green
Guest
Posts: n/a
 
      12-05-2010
On Sat, 04 Dec 2010 21:00:37 -0500, Joshua Cranmer
<(E-Mail Removed)> wrote, quoted or indirectly quoted someone
who said :

>Language is not so simple. First of all, code points don't necessarily
>map to a `character' in a language--you can represen


Then there is Arabic, where the Unicode is just a hint as to what needs to
be rendered. It was originally designed to be written cursively, so
there are special forms for starting and ending, and bits can shift
around. It is more like a 2D tessellation problem.

From the little I learned about it, I am impressed anyone ever figured
out how to use computers to typeset books. The results are
aesthetically quite pleasing, though I can only read a few words.

If anyone speaks Arabic, I would like to know how close what you see
on computer screens when programming comes to the classical form used
in books.
--
Roedy Green Canadian Mind Products
http://mindprod.com

In programming, and documenting programs, keep vocabulary consistent and precisely defined! Variation in vocabulary to relieve the tedium is for novels.
 
 
 
 
 
Owen Jacobson
Guest
Posts: n/a
 
      12-05-2010
On 2010-12-04 21:00:37 -0500, Joshua Cranmer said:

> [...] I don't think Java even has Unicode normalization functions,
> which are much more useful than divining languages from code points.


java.text.Normalizer - Hope that helps.
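
A minimal sketch of how that class handles the `è' case from earlier in the thread, assuming Java 6 or later (where `java.text.Normalizer` is available). The class name `NormalizeDemo` and the helper `sameAfterNfc` are made up for illustration.

```java
import java.text.Normalizer;

class NormalizeDemo {
    /** Compares two strings after normalizing both to NFC (canonical composition). */
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E8"; // e with grave as a single code point
        String decomposed = "e\u0300"; // plain e followed by a combining grave accent

        System.out.println(precomposed.equals(decomposed));        // false
        System.out.println(sameAfterNfc(precomposed, decomposed)); // true
    }
}
```

The raw strings differ, but they compare equal once both are brought to the same normalization form, which is usually what you want before any per-character inspection.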

-o

 
 
Tom Anderson
Guest
Posts: n/a
 
      12-05-2010
On Sat, 4 Dec 2010, Roedy Green wrote:

> On Sat, 04 Dec 2010 21:00:37 -0500, Joshua Cranmer
> <(E-Mail Removed)> wrote, quoted or indirectly quoted someone
> who said :
>
>> Language is not so simple. First of all, code points don't necessarily
>> map to a `character' in a language--you can represen

>
> Then there is Arabic, where the Unicode is just a hint as to what needs to
> be rendered. It was originally designed to be written cursively, so
> there are special forms for starting and ending, and bits can shift
> around. It is more like a 2D tessellation problem.


I know that the Unicode consortium did cock up Arabic quite badly by
starting off with a lot of precomposed characters, when they should have
gone down a more base-plus-combining route, but my impression was that it
was now possible to encode all Arabic text. The typesetting may not be
easy (indeed, I understand it's still really rather hard), but it's very
much a matter of typesetting rather than encoding.
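
To illustrate the encoding-versus-typesetting split, here is a small sketch, assuming Java 6 or later for `java.text.Normalizer`. It takes the four legacy presentation-form code points for the letter beh (isolated, final, initial, medial) and folds them with NFKC; all four come back as the one base letter U+0628. The class name `ArabicFormsDemo` and the helper `foldForms` are made up for illustration.

```java
import java.text.Normalizer;

class ArabicFormsDemo {
    /** Folds compatibility characters, including Arabic presentation forms, to base letters. */
    static String foldForms(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Legacy presentation forms of beh: isolated, final, initial, medial.
        String forms = "\uFE8F\uFE90\uFE91\uFE92";
        String folded = foldForms(forms);
        for (int i = 0; i < folded.length(); i++) {
            // Each line shows the code point left after folding.
            System.out.printf("U+%04X%n", (int) folded.charAt(i));
        }
    }
}
```

So the text itself only needs to store the base letter; choosing the positional shape is left to the rendering layer.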

A good explanation of the situation is hard to come across, but most of it
is in here:

http://www.paktribune.com/pforums/posts.php?t=7389

> From the little I learned about it, I am impressed anyone ever figured
> out how to use computers to typeset books. The results are aesthetically
> quite pleasing, though I can only read a few words.
>
> If anyone speaks Arabic, I would like to know how close what you see on
> computer screens when programming comes to the classical form used in
> books.


Another - more leading! - question would be how computer-set text compares
to the typewritten text that people have been using for day-to-day work
for the immediately preceding decades. Another would be how it compares to
newspaper typesetting, which again accounts for a large amount of the text
people read, and I suspect is not set as carefully as book text. If modern
computers are better than typewriters, that's a huge amount of utility
right there; if they're better than manual newspaper typesetting, even
better.

tom

--
Now I am thoroughly confused. -- Colin Brace sums up RT3090 support
in Linux
 