On Wed, 4 Jan 2006, Jürgen Exner wrote:
> wrote:
> >
> > Unicode character is double-byte
>
> Not necessarily.
"Unicode character" is an abstract concept, which associates the
character with an integer value between 0 and 0x10FFFF.
It's impossible to talk about that abstract concept in practical terms
without considering a specific "Character Encoding Form", which
specifies how to represent that integer value using different sized
units. There exist definitions for how to use 8-bit units (utf-8),
16-bit units (utf-16), and 32-bit units (utf-32).
See Chapter 2 of the Unicode specification, in particular sections
2.5 and 2.6 where the terms "Character Encoding Form" and "Character
Encoding Scheme" are elucidated.
e.g. at
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf
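
To make the point about encoding forms concrete, here is a small sketch
in Python (my choice of language; the helper name code_units and the
sample characters are just for illustration). It counts how many code
units each of the three encoding forms needs for the same character:

```python
# Count the code units each encoding form needs for one character.
# The "-le" codec variants are used so that no byte-order mark is
# prepended and the byte counts divide cleanly by the unit size.
def code_units(ch):
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    }

# U+0041, U+00E9, U+20AC, and U+1D11E (outside the 16-bit range)
for ch in ("A", "\u00e9", "\u20ac", "\U0001d11e"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

Only the first of those characters is a single byte in utf-8, and the
last one needs two 16-bit units in utf-16, so "double-byte" holds for
neither form in general.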
> UTF-8 uses anything from 1 to 4(?) bytes.
Indeed. The original utf-8 encoding scheme included definitions of
how to represent integers up to 32 bits, using sequences of up to 6
octets (8-bit bytes). But Unicode has now firmly set its upper
limit at 0x10FFFF (an odd-looking endpoint, but it is the largest value
that utf-16 surrogate pairs can reach), meaning that utf-8 sequences of
more than 4 octets won't be needed in practice.
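
As a rough illustration of the 1-to-4 octet rule, the following Python
sketch (utf8_length is a hypothetical helper of mine, not a library
function) computes the sequence length from the code point value and
checks it against Python's own encoder at the range boundaries:

```python
# Number of octets utf-8 needs for a code point, by value range.
# Values above 0x10FFFF are rejected, matching the Unicode limit.
def utf8_length(cp):
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode range 0..0x10FFFF")
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

# Compare against the built-in encoder at each range boundary
# (surrogate values 0xD800..0xDFFF are deliberately not sampled,
# since they cannot be encoded as characters).
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X} -> {utf8_length(cp)} octet(s)")
```

Python's chr() enforces the same upper limit: chr(0x110000) raises
ValueError.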
hth