Java Newbie Question: Character Sets, Unicode, et al

 
 
BLG
      10-17-2003
Greetings!

I am studying several books on the Java programming language and have
come across references to the fact that the JDK uses Unicode as the
default character set. I am not convinced that I fully comprehend
what the author is telling me.

I understand that Unicode is a 16-bit character set and that true
ASCII is a 7-bit representation. When I look at my source files in a
hex editor, they appear to be extended ASCII 8-bit format (which I
assume is the Windows default for a text file). OK - I assume then
that the JRE uses Unicode character sets, but javac uses some 8-bit
character set. Is this correct?

But beyond that, should I even care what the character set is?
Assuming, of course, internationalization is not a priority for me.

Also, how do I determine what character set Windows is using? How do
I change character sets in Windows?

And lastly, what is the relationship between a character set and a
font?

I hope these questions aren't too off the wall. I am trying to
clarify in my mind this character set concept. In the past, my only
concern has been ASCII vs EBCDIC.

Regards!

 
Chris Smith
      10-17-2003
BLG wrote:
> I understand that Unicode is a 16-bit character set and that true
> ASCII is a 7-bit representation.


Actually, that's a more accurate statement than most people would have
made. If you want to be really picky, though, I'd make one correction:
Unicode and ASCII are both character sets. ASCII is *also* a character
encoding (which may be what you meant by representation), which Unicode
is not (instead, there are several common encodings for the Unicode
character set, including UTF-8, UTF-16LE and UTF-16BE).
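To make the difference between a character set and its encodings concrete, here is a minimal Java sketch. The class name EncodingDemo and the helper hex are made up for illustration, and StandardCharsets assumes Java 7 or later, which is newer than this thread. It encodes the same two characters under several encodings and prints the bytes each one produces.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Formats a byte array as space-separated, two-digit uppercase hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'A' (U+0041) followed by e-acute (U+00E9), written with an escape
        // so that this source file stays pure ASCII.
        String s = "A\u00e9";

        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // 41 C3 A9
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 00 41 00 E9
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE)));   // 41 00 E9 00
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // 41 E9
        // ASCII has no e-acute, so getBytes substitutes '?' (0x3F).
        System.out.println(hex(s.getBytes(StandardCharsets.US_ASCII)));   // 41 3F
    }
}
```

The characters are the same in every case; only the byte representation changes with the encoding.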

> When I look at my source files in a
> hex editor, they appear to be extended ASCII 8-bit format (which I
> assume is the Windows default for a text file).


No. There is no such thing as "extended ASCII 8-bit format" as some
specific entity. There are, actually, quite a large number of different
8-bit character encodings, including Cp1252, ISO8859-1, ISO8859-2,
ISO8859-3, and so on and so forth, and practically all of them are
different extensions to ASCII. ASCII itself, as a character encoding,
is in fact an 8-bit encoding as well, with the high-order bit always
being set to zero.

> OK - I assume then
> that the JRE uses Unicode character sets, but javac uses some 8-bit
> character set. Is this correct?


Not necessarily. The javac compiler uses the platform default character
encoding. What that is depends on what platform you're developing on.

More to the point, though, you're looking to the wrong place for that
answer. The javac utility is a *consumer* of your source files; it
doesn't create them. The format of your source files comes from
whatever tool you've used to write them, and I don't know what tool that
is. You're fortunate that it happens to be compatible with what javac
expects to read (which will generally happen if you stick to ASCII
characters in your source code, but can be a problem if not).

> But beyond that, should I even care what the character set is?


Sure. If you expect to write robust character-based I/O code, that is
(in any language, not just Java).

> Assuming, of course, internationalization is not a priority for me.


That's not necessarily relevant. Encodings vary between platforms,
languages and language configurations, and applications, among other
things. They are often specified in protocol and file format
descriptions. You don't have to be doing i18n code to care about the
definition of a character encoding.

> Also, how do I determine what character set Windows is using? How do
> I change character sets in Windows?


Actually, I have no idea. I've never needed to do it.

> And lastly, what is the relationship between a character set and a
> font?


A font provides glyphs (visual appearances) for some set of characters.
The relationship, I suppose, is that if you want to reliably display
content in a certain character set, your font had better have the
appropriate glyphs for at least the common characters in that character
set. In Java, fonts map their glyphs directly to Unicode characters, so
there's no direct relationship between the smaller character sets like
ASCII and fonts.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Michael Borgwardt
      10-17-2003
BLG wrote:
> I understand that Unicode is a 16-bit character set and that true
> ASCII is a 7-bit representation.


Actually, Unicode is not really a character set in the way ASCII is,
and it is not restricted to 16 bits.

Unicode is a standard that assigns glyphs (characters) to numeric codes.
How these codes are concretely represented as bytes is what an encoding
or charset specifies, which is what ASCII is. There are encodings where
the number of bits used varies depending on each character, like UTF-8.
There are even stateful encodings.

> When I look at my source files in a
> hex editor, they appear to be extended ASCII 8-bit format (which I
> assume is the Windows default for a text file).


Namely Windows Codepage 1252, which is nearly the same as
ISO-8859-1, aka Latin 1, the most common encoding for western
European languages.

> OK - I assume then
> that the JRE uses Unicode character sets, but javac uses some 8-bit
> character set. Is this correct?


Nearly. How the JRE internally represents Strings is not really
specified, but the usual way is to use 16 bits per character
in a straightforward way. javac, on the other hand, uses the
platform standard encoding (unless otherwise specified on the
command line with the -encoding option), with an additional
capability to use Unicode escape sequences (\uxxxx, with a
lowercase u) when reading in source files. The class files
contain Strings encoded in a modified form of UTF-8.
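A small sketch of the escape mechanism, in case it helps. The class name EscapeDemo is made up for illustration. A Unicode escape (a backslash, a lowercase u, and four hex digits) lets a pure-ASCII source file contain any 16-bit character, so the result does not depend on which encoding javac assumes for the file.

```java
class EscapeDemo {
    // Written with a Unicode escape, so the source file itself stays pure ASCII.
    static final String CAFE = "caf\u00e9";

    public static void main(String[] args) {
        // Four characters: c, a, f and e-acute.
        System.out.println(CAFE.length());                     // 4
        // The escape produced the single character U+00E9.
        System.out.println((int) CAFE.charAt(3));              // 233
        System.out.println(CAFE.equals("caf" + (char) 0xE9));  // true
    }
}
```

One caveat: javac translates these escapes before anything else, including inside comments, so a malformed escape in a comment is still a compile error.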

> But beyond that, should I even care what the character set is?
> Assuming, of course, internationalization is not a priority for me.


Yes, it still is important when writing text out to or reading
from a file or network socket. It's quite likely that at some point
you'll use *some* non-ASCII character, and in fact it is not even
guaranteed that all encodings represent even pure ASCII text
identically.
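As a rough illustration of why this matters for I/O, the sketch below writes text through an OutputStreamWriter and reads it back through an InputStreamReader, naming the encoding explicitly each time. The class and method names (CharsetIoDemo, write, read) are invented for the example, in-memory streams stand in for a file or socket, and StandardCharsets and UncheckedIOException assume Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetIoDemo {
    // Encodes text to bytes through a Writer that uses the given charset.
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decodes bytes to text through a Reader that uses the given charset.
    static String read(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = write("caf\u00e9", StandardCharsets.UTF_8);

        // Same encoding on both sides: the text survives the round trip.
        System.out.println(read(bytes, StandardCharsets.UTF_8).equals("caf\u00e9")); // true

        // Mismatched encoding: the two UTF-8 bytes of e-acute are decoded as
        // two separate Latin-1 characters, so the string grows to 5 characters.
        System.out.println(read(bytes, StandardCharsets.ISO_8859_1).length());       // 5
    }
}
```

If the charset argument is left out, these classes fall back to the platform default, which is how the same program can behave differently on two machines.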

> Also, how do I determine what character set Windows is using?


More recent Windows versions (since 2000 I think) also use Unicode
internally as far as possible, but older applications that can't
do so use a "traditional encoding" that differs between languages.
This is the platform default encoding.
In Java, it is exposed through the system property file.encoding.
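For what it's worth, a minimal way to see what the JVM considers the default on a given machine is sketched below. The class name DefaultCharsetDemo and its helper method are made up for illustration; Charset.defaultCharset() assumes Java 5 or later, and the values printed depend entirely on the platform and JVM settings.

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Returns the canonical name of the charset the JVM uses when none is given.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The system property mentioned above; the value varies by platform.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        // The charset actually used by APIs that take no charset argument.
        System.out.println("default charset = " + defaultCharsetName());
    }
}
```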

> How do I change character sets in Windows?


There's an option in the country & language settings somewhere that
changes the default encoding used for older apps.

> And lastly, what is the relationship between a character set and a
> font?


An encoding defines relationships between numeric codes or byte
representations thereof and glyphs. A font defines how the glyphs
are drawn on the screen. Different abstract glyphs can be (and
sometimes are) assigned the same shape in a font, and nearly all
fonts contain only shapes for a subset of the glyphs defined in
Unicode.

 
Roedy Green
      10-17-2003
On Sat, 18 Oct 2003 01:08:14 +0200, Michael Borgwardt
wrote or quoted :

>Unicode is a standard that assigns glyphs (characters) to numeric codes.
>How these codes are concretely represented as bytes is what an encoding
>or charset specifies, which is what ASCII is. There are encodings where
>the number of bits used varies depending on each character, like UTF-8.
>There are even stateful encodings.


There is only one way you can encode ASCII as bytes, but there are
several variants for encoding Unicode with combinations of big/little
endian, marked/unmarked, 8-bit/16-bit encoding.
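A quick Java sketch of those variants, for the curious. The class name BomDemo and the helper hex are invented for this example, and StandardCharsets assumes Java 7 or later. It encodes the single letter A several ways; Java's plain "UTF-16" charset is the marked form, writing a big-endian byte-order mark before the data.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // Formats a byte array as space-separated, two-digit uppercase hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A".getBytes(StandardCharsets.US_ASCII))); // 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));    // 41
        // Unmarked 16-bit forms: byte order is fixed by the charset name.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // Marked form: a byte-order mark (FE FF) precedes the big-endian data.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
    }
}
```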

see http://mindprod.com/jgloss/encoding.html


--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
 
brougham5@yahoo.com
      10-18-2003
BLG wrote:

>I understand that Unicode is a 16-bit character set and that true
>ASCII is a 7-bit representation.


That is incorrect.

This is a recent article on unicode that serves as a good introduction:

http://www.joelonsoftware.com/articles/Unicode.html
 
Roedy Green
      10-18-2003
On Sat, 18 Oct 2003 12:23:56 -0500, (E-Mail Removed) wrote or
quoted :

>>I understand that Unicode is a 16-bit character set and that true
>>ASCII is a 7-bit representation.

>
>That is incorrect.


Unicode is a 16 bit character set allowing 64K different glyphs/codes.


ASCII is a 7-bit character set allowing 128 different glyphs/codes.

ASCII is written in octets, usually with the high bit off.

Unicode is written many different ways. See
http://mindprod.com/jgloss/encoding.html


--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
 
brougham5@yahoo.com
      10-18-2003
Roedy Green wrote:

>Unicode is a 16 bit character set allowing 64K different glyphs/codes.


Nope.

But don't take my word as gospel. You might wish to start browsing here, to
get the information straight from the source:

http://www.unicode.org/faq/
 
Mark Thornton
      10-18-2003
Roedy Green wrote:

> On Sat, 18 Oct 2003 12:23:56 -0500, (E-Mail Removed) wrote or
> quoted :
>
>
>>>I understand that Unicode is a 16-bit character set and that true
>>>ASCII is a 7-bit representation.

>>
>>That is incorrect.

>
>
> Unicode is a 16 bit character set allowing 64K different glyphs/codes.


Not any more; it hasn't been 16 bit for some time. The current
incarnation of Unicode requires at least 20 bits. See
http://www.unicode.org/versions/Unicode4.0.0/
Note that there are 96248 'graphic' characters defined.
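To show what "more than 16 bits" means in Java terms, here is a small sketch. It assumes the code point methods added in Java 5 (Character.toChars, String.codePointCount), which postdate this thread, and the class name CodePointDemo is made up for illustration. A character above U+FFFF occupies two 16-bit char values, a surrogate pair, in a String.

```java
class CodePointDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF, a code point outside the 16-bit range.
    static final int G_CLEF = 0x1D11E;

    // Builds a String holding just that one character.
    static String gClef() {
        return new String(Character.toChars(G_CLEF));
    }

    public static void main(String[] args) {
        String s = gClef();

        // One character as far as Unicode is concerned...
        System.out.println(s.codePointCount(0, s.length()));        // 1
        // ...but two 16-bit char values in the String.
        System.out.println(s.length());                             // 2
        // The two chars are the high and low surrogates.
        System.out.println(Integer.toHexString(s.charAt(0)));       // d834
        System.out.println(Integer.toHexString(s.charAt(1)));       // dd1e
        System.out.println(Character.isSupplementaryCodePoint(G_CLEF)); // true
    }
}
```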

Mark Thornton

 
Roedy Green
      10-19-2003
On Sat, 18 Oct 2003 15:45:38 -0500, (E-Mail Removed) wrote or
quoted :

>>Unicode is a 16 bit character set allowing 64K different glyphs/codes.

>
>Nope.


In what sense nope? I presume you are being picky about the precise
meanings of "encoding", "character set" and "glyph". Am I wrong in
any sense that would make a difference to anyone but a linguist?


--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
 
Roedy Green
      10-19-2003
On Sat, 18 Oct 2003 22:15:29 +0100, Mark Thornton
wrote or quoted :

>> Unicode is a 16 bit character set allowing 64K different glyphs/codes.

>
>Not any more; it hasn't been 16 bit for some time.


Unicode without a trailing number means Unicode-16 does it not? or has
that changed?

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
 