Velocity Reviews


06-09-2013 10:44 AM

A few questiosn about encoding
 
A few questions about encoding please:

>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>> values up to 256?


>Because then how do you tell when you need one byte, and when you need
>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>characters, with ordinal values 0x4C and 0xFA, or one character with
>ordinal value 0x4CFA?


I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.


>> UTF-8 and UTF-16 and UTF-32
>> I thought the number beside of UTF- was to declare how many bits the
>> character set was using to store a character into the hdd, no?


>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>values to make a surrogate pair.


A surrogate pair is like hitting, for example, Ctrl-A, meaning a combination character that consists of 2 different characters?
Is this what a surrogate is? A pair of 2 chars?


>UTF-8 uses 8-bit values, but sometimes
>it combines two, three or four of them to represent a single code-point.


'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 97)
'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since ordinal > 65000 )

The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?

Fábio Santos 06-09-2013 12:18 PM

Re: A few questiosn about encoding
 
On 9 Jun 2013 11:49, "Νικόλαος Κούρας" <nikos.gr33k@gmail.com> wrote:
>
> A few questions about encoding please:
>
> >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
> >> values up to 256?
>
> >Because then how do you tell when you need one byte, and when you need
> >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
> >characters, with ordinal values 0x4C and 0xFA, or one character with
> >ordinal value 0x4CFA?
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant
> up to 256, not above 256.
>
> >> UTF-8 and UTF-16 and UTF-32
> >> I thought the number beside of UTF- was to declare how many bits the
> >> character set was using to store a character into the hdd, no?
>
> >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
> >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
> >values to make a surrogate pair.
>
> A surrogate pair is like hitting, for example, Ctrl-A, meaning a
> combination character that consists of 2 different characters?
> Is this what a surrogate is? A pair of 2 chars?
>
> >UTF-8 uses 8-bit values, but sometimes
> >it combines two, three or four of them to represent a single code-point.
>
> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 97)
> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
> 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ?
> (since ordinal > 65000 )
>
> The amount of bytes needed to store a character solely depends on the
> character's ordinal value in the Unicode table?


In short, a UTF-8 character takes 1 to 4 bytes. A UTF-16 character takes 2
or 4 bytes. A UTF-32 character always takes 4 bytes.

The process of converting characters to bytes is called encoding. The
opposite, converting bytes back to characters, is decoding. Python handles
both with the encode() and decode() methods. You normally don't need to
care about this kind of thing.
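
The sizes above can be checked directly. This is only a minimal sketch, assuming Python 3; the four sample characters and the variable names are my own choice, not something from the thread:

```python
# Count the bytes each encoding produces for a few sample characters.
# The "-le" codec names are used so that no byte-order mark is prepended.
samples = [
    "a",            # U+0061, ASCII
    "\u03b1",       # U+03B1, GREEK SMALL LETTER ALPHA
    "\u4e2d",       # U+4E2D, a common CJK ideograph
    "\U0001d401",   # U+1D401, MATHEMATICAL BOLD CAPITAL B
]

sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    for ch in samples
}
for ch, row in sizes.items():
    print(f"U+{ord(ch):04X}", row)

# encode() turns str into bytes; decode() turns bytes back into str.
round_trip = "\u03b1".encode("utf-8").decode("utf-8")
```

Only the last sample needs 4 bytes in UTF-8 and UTF-16; UTF-32 uses 4 bytes for all of them.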


Nobody 06-09-2013 05:01 PM

Re: A few questiosn about encoding
 
On Sun, 09 Jun 2013 03:44:57 -0700, Νικόλαος Κούρας wrote:

>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>>> values up to 256?

>
>>Because then how do you tell when you need one byte, and when you need
>>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>>characters, with ordinal values 0x4C and 0xFA, or one character with
>>ordinal value 0x4CFA?

>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I
> meant up to 256, not above 256.


But then you've used up all 256 possible bytes for storing the first 256
characters, and there aren't any left for use in multi-byte sequences.

You need some means to distinguish between a single-byte character and an
individual byte within a multi-byte sequence.

UTF-8 does that by allocating specific ranges to specific purposes.
0x00-0x7F are single-byte characters, 0x80-0xBF are continuation bytes of
multi-byte sequences, 0xC0-0xFF are leading bytes of multi-byte sequences.
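
As a rough illustration of those three ranges, here is a small sketch. The helper name classify and the sample string are hypothetical, chosen only for this example:

```python
def classify(byte):
    """Return the role of one byte value (0-255) under the three ranges above."""
    if byte <= 0x7F:
        return "single"        # 0x00-0x7F: a complete one-byte character
    if byte <= 0xBF:
        return "continuation"  # 0x80-0xBF: second or later byte of a sequence
    return "lead"              # 0xC0-0xFF: first byte of a multi-byte sequence

# 'a' (1 byte), GREEK SMALL LETTER ALPHA (2 bytes), U+4E2D (3 bytes)
data = "a\u03b1\u4e2d".encode("utf-8")
roles = [classify(b) for b in data]
print(roles)
```

Every character starts at a byte that is either "single" or "lead", which is what lets a reader find character boundaries without any other context.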

This scheme has the advantage of making UTF-8 non-modal, i.e. if a byte is
corrupted, added or removed, it will only affect the character containing
that particular byte; the decoder can re-synchronise at the beginning of
the following character.

OTOH, with encodings such as UTF-16, UTF-32 or ISO-2022, adding or
removing a byte will result in desynchronisation, with all subsequent
characters being corrupted.
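
A small experiment makes the difference visible. This is a sketch assuming Python 3 and its built-in codecs; the four Greek letters are just a sample I picked:

```python
text = "\u03b1\u03b2\u03b3\u03b4"   # alpha, beta, gamma, delta

# Drop the first byte of the UTF-8 data: only the first character is damaged,
# and decoding picks up again at the next lead byte.
damaged_utf8 = text.encode("utf-8")[1:]
from_utf8 = damaged_utf8.decode("utf-8", errors="replace")
print(from_utf8)

# Drop the first byte of the UTF-16 data: every 16-bit unit is now misaligned,
# so none of the original characters come back.
damaged_utf16 = text.encode("utf-16-be")[1:]
from_utf16 = damaged_utf16.decode("utf-16-be", errors="replace")
print(from_utf16)
```

With errors="replace" the decoder substitutes U+FFFD for whatever it cannot interpret instead of raising an exception.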

> A surrogate pair is like hitting, for example, Ctrl-A, meaning a
> combination character that consists of 2 different characters? Is this
> what a surrogate is? A pair of 2 chars?


A surrogate pair is a pair of 16-bit codes used to represent a single
Unicode character whose code is greater than 0xFFFF.

The 2048 codepoints from 0xD800 to 0xDFFF inclusive aren't used to
represent characters, but "surrogates". Unicode characters with codes
in the range 0x10000-0x10FFFF are represented in UTF-16 as a pair of
surrogates. First, 0x10000 is subtracted from the code, giving a value in
the range 0-0xFFFFF (20 bits). The top ten bits are added to 0xD800 to
give a value in the range 0xD800-0xDBFF, while the bottom ten bits are
added to 0xDC00 to give a value in the range 0xDC00-0xDFFF.
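
The arithmetic in that paragraph can be sketched in a few lines. The function name to_surrogates and the sample code point U+1D401 are assumptions made for illustration only:

```python
def to_surrogates(cp):
    """Split a code point in 0x10000-0x10FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    v = cp - 0x10000              # 20-bit value in the range 0-0xFFFFF
    high = 0xD800 + (v >> 10)     # top ten bits    -> 0xD800-0xDBFF
    low = 0xDC00 + (v & 0x3FF)    # bottom ten bits -> 0xDC00-0xDFFF
    return high, low

high, low = to_surrogates(0x1D401)   # MATHEMATICAL BOLD CAPITAL B
print(hex(high), hex(low))           # 0xd835 0xdc01

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
expected = chr(0x1D401).encode("utf-16-be")
```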

Because the codes used for surrogates aren't valid as individual
characters, scanning a string for a particular character won't
accidentally match part of a multi-word character.

> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 97)
> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is
> > 127 ) 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be
> stored ? (since ordinal > 65000 )


Most Chinese, Japanese and Korean (CJK) characters have codepoints within
the BMP (i.e. <= 0xFFFF), so they only require 3 bytes in UTF-8. The
codepoints above the BMP are mostly for archaic ideographs (those no
longer in normal use), mathematical symbols, dead languages, etc.

> The amount of bytes needed to store a character solely depends on the
> character's ordinal value in the Unicode table?


Yes. UTF-8 is essentially a mechanism for representing 31-bit unsigned
integers such that smaller integers require fewer bytes than larger
integers (subsequent revisions of Unicode cap the range of possible
codepoints to 0x10FFFF, as that's all that UTF-16 can handle).
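
To make the "smaller integers need fewer bytes" idea concrete, here is a hand-written encoder for the current range only (up to 0x10FFFF). It is a sketch, not a replacement for the built-in codec: the name utf8_encode is hypothetical, and unlike a strict encoder it does not reject the surrogate range 0xD800-0xDFFF.

```python
def utf8_encode(cp):
    """Encode one code point (0-0x10FFFF) as UTF-8 bytes by packing bits by hand."""
    if cp < 0x80:                       # 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                  # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")

# Compare with the built-in encoder on one code point from each size class.
checks = {cp: utf8_encode(cp) == chr(cp).encode("utf-8")
          for cp in (0x61, 0x3B1, 0x4E2D, 0x1D401)}
print(checks)
```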


Chris “Kwpolska” Warrick 06-09-2013 05:12 PM

Re: A few questiosn about encoding
 
On Sun, Jun 9, 2013 at 12:44 PM, Νικόλαος Κούρας <nikos.gr33k@gmail.com> wrote:
> A few questions about encoding please:
>
>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>>> values up to 256?

>
>>Because then how do you tell when you need one byte, and when you need
>>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>>characters, with ordinal values 0x4C and 0xFA, or one character with
>>ordinal value 0x4CFA?

>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.


It is required so the computer can know where characters begin.
0x0080 (first non-ASCII character) becomes 0xC280 in UTF-8. Further
details here: http://en.wikipedia.org/wiki/UTF-8#Description
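
A quick check of that claim, as a sketch assuming Python 3. The comparison with Latin-1 is my own addition, not something raised in the thread: it shows what a "one byte for the first 256 characters" encoding looks like and what it gives up.

```python
# U+0080 is the first code point that no longer fits in one UTF-8 byte.
two_bytes = "\u0080".encode("utf-8")
print(two_bytes)                      # b'\xc2\x80'

# Latin-1 does store U+0080 in one byte, at the cost of having no way
# to represent anything above U+00FF.
one_byte = "\u0080".encode("latin-1")
try:
    "\u03b1".encode("latin-1")        # GREEK SMALL LETTER ALPHA
    alpha_fits = True
except UnicodeEncodeError:
    alpha_fits = False
print(one_byte, alpha_fits)
```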

>>> UTF-8 and UTF-16 and UTF-32
>>> I thought the number beside of UTF- was to declare how many bits the
>>> character set was using to store a character into the hdd, no?

>
>>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>>values to make a surrogate pair.

>
> A surrogate pair is like hitting, for example, Ctrl-A, meaning a combination character that consists of 2 different characters?
> Is this what a surrogate is? A pair of 2 chars?


http://en.wikipedia.org/wiki/UTF-16#..._to_U.2B10FFFF

Long story short: codepoint - 0x10000 (up to 20 bits) → two 10-bit
numbers → 0xD800 + first_half, 0xDC00 + second_half. Rephrasing:

We take MATHEMATICAL BOLD CAPITAL B (U+1D401). If you have UTF-8: 𝐁

It is over 0xFFFF, so we need to use a surrogate pair. Subtracting
0x10000, we end up with 0xD401, or 0b1101010000000001. Neither
representation is usable yet, as we have a 16-bit number, not a 20-bit
one. We throw in some leading zeroes and end up with
0b00001101010000000001. Split it in half and we get 0b0000110101 and
0b0000000001, which we can now shorten to 0b110101 and 0b1, or translate
to hex as 0x0035 and 0x0001. 0xD800 + 0x0035 and 0xDC00 + 0x0001 →
0xD835 0xDC01. Type it into python and:

>>> b'\xD8\x35\xDC\x01'.decode('utf-16be')

'𝐁'

And before you ask: that “BE” stands for Big-Endian. Little-Endian
would mean reversing the two bytes in each 16-bit code unit, which would
make it '\x35\xD8\x01\xDC' (the name refers to which end of the number is
stored first: 'a', U+0061, is stored as the bytes 0x61 0x00 in a
little-endian encoding).
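
The byte order can be compared side by side. A minimal sketch, assuming Python 3 and its 'utf-16-be' and 'utf-16-le' codecs; the variable names are mine:

```python
ch = "\U0001d401"                 # MATHEMATICAL BOLD CAPITAL B

be = ch.encode("utf-16-be")       # most significant byte of each unit first
le = ch.encode("utf-16-le")       # least significant byte of each unit first
print(be.hex(), le.hex())         # d835dc01 35d801dc

# The order of the two surrogates stays the same; only the bytes inside
# each 16-bit unit are swapped.
a_le = "a".encode("utf-16-le")
print(a_le)                       # b'a\x00'
```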

Another thing you may ask about: 0xD800…0xDFFF are reserved in Unicode
for the purposes of UTF-16, so there are no conflicts.

>>UTF-8 uses 8-bit values, but sometimes
>>it combines two, three or four of them to represent a single code-point.

>
> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 97)
> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )


yup. α is at 0x03B1, or 945 decimal.

> 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since ordinal > 65000 )


Not necessarily, as CJK characters start at U+2E80, which is in the
3-byte range (0x0800 through 0xFFFF) — the table is here:
http://en.wikipedia.org/wiki/UTF-8#Description

--
Kwpolska <http://kwpolska.tk> | GPG KEY: 5EAAEA16
stop html mail | always bottom-post
http://asciiribbon.org | http://caliburn.nl/topposting.html

Steven D'Aprano 06-12-2013 09:24 AM

Re: A few questiosn about encoding
 
On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:

> Isn't 14 bits way too many to store a character ?


No.

There are 1114111 possible characters in Unicode. (And in Japan, they
sometimes use TRON instead of Unicode, which has even more.)

If you list out all the combinations of 14 bits:

0000 0000 0000 00
0000 0000 0000 01
0000 0000 0000 10
0000 0000 0000 11
[...]
1111 1111 1111 10
1111 1111 1111 11

you will see that there are only 32767 (2**15-1) such values. You can't
fit 1114111 characters with just 32767 values.



--
Steven

Νικόλαος Κούρας 06-12-2013 11:23 AM

Re: A few questiosn about encoding
 
On 12/6/2013 12:24 μμ, Steven D'Aprano wrote:
> On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:
>
>> Isn't 14 bits way too many to store a character ?

>
> No.
>
> There are 1114111 possible characters in Unicode. (And in Japan, they
> sometimes use TRON instead of Unicode, which has even more.)
>
> If you list out all the combinations of 14 bits:
>
> 0000 0000 0000 00
> 0000 0000 0000 01
> 0000 0000 0000 10
> 0000 0000 0000 11
> [...]
> 1111 1111 1111 10
> 1111 1111 1111 11
>
> you will see that there are only 32767 (2**15-1) such values. You can't
> fit 1114111 characters with just 32767 values.
>
>
>

Thanks Steven,
So, how many bytes does UTF-8 store for codepoints > 127 ?

example for codepoint 256, 1345, 16474 ?

Dave Angel 06-12-2013 12:43 PM

Re: A few questiosn about encoding
 
On 06/12/2013 05:24 AM, Steven D'Aprano wrote:
> On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:
>
>> Isn't 14 bits way too many to store a character ?

>
> No.
>
> There are 1114111 possible characters in Unicode. (And in Japan, they
> sometimes use TRON instead of Unicode, which has even more.)
>
> If you list out all the combinations of 14 bits:
>
> 0000 0000 0000 00
> 0000 0000 0000 01
> 0000 0000 0000 10
> 0000 0000 0000 11
> [...]
> 1111 1111 1111 10
> 1111 1111 1111 11
>
> you will see that there are only 32767 (2**15-1) such values. You can't
> fit 1114111 characters with just 32767 values.
>
>


Actually, it's worse. There are 16384 such values (2**14), assuming you
include null, which you did in your list.
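
The numbers are easy to verify in the interpreter. A small sketch; the variable names are mine, and the code point count assumes the current Unicode range of U+0000 through U+10FFFF:

```python
values_in_14_bits = 2 ** 14
print(values_in_14_bits)           # 16384

max_code_point = 0x10FFFF
code_points = max_code_point + 1   # counting U+0000 as well
print(code_points)                 # 1114112

# Number of bits needed to hold the largest code point.
bits_needed = max_code_point.bit_length()
print(bits_needed)                 # 21
```

So 14 bits fall well short; 21 bits are the minimum for the full range.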

--
DaveA

Ulrich Eckhardt 06-12-2013 12:52 PM

Re: A few questiosn about encoding
 
Am 12.06.2013 13:23, schrieb Νικόλαος Κούρας:
> So, how many bytes does UTF-8 store for codepoints > 127 ?


What has your research turned up? I personally consider it lazy and
disrespectful to be given lots of pointers that you could use for further
research and then ask for more info before you have even followed those links.


> example for codepoint 256, 1345, 16474 ?


Yes, examples exist. Gee, if there only was an information network that
you could access and where you could locate information on various
programming-related topics somehow. Seriously, someone should invent
this thing! But still, even without it, you have all the tools (i.e.
Python) in your hand to generate these examples yourself! Check out ord,
bin, encode, decode for a start.


Uli


Nobody 06-12-2013 08:30 PM

Re: A few questiosn about encoding
 
On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

> So, how many bytes does UTF-8 store for codepoints > 127 ?


U+0000..U+007F 1 byte
U+0080..U+07FF 2 bytes
U+0800..U+FFFF 3 bytes
>=U+10000 4 bytes


So, 1 byte for ASCII, 2 bytes for other Latin characters, Greek, Cyrillic,
Arabic, and Hebrew, 3 bytes for Chinese/Japanese/Korean, 4 bytes for dead
languages and mathematical symbols.
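
The table translates directly into a lookup. This is a sketch with a hypothetical helper name, utf8_len, applied to the three code points asked about (256, 1345, 16474) and checked against Python's own encoder:

```python
def utf8_len(cp):
    """Number of UTF-8 bytes for a code point, following the table above."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4

lengths = {cp: utf8_len(cp) for cp in (256, 1345, 16474)}
print(lengths)   # {256: 2, 1345: 2, 16474: 3}

# The built-in encoder agrees with the table for these values.
agrees = all(len(chr(cp).encode("utf-8")) == n for cp, n in lengths.items())
```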

The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a total
of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than 20 bits).


Steven D'Aprano 06-13-2013 12:13 AM

Re: A few questiosn about encoding
 
On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

> So, how many bytes does UTF-8 store for codepoints > 127 ?


Two, three or four, depending on the codepoint.


> example for codepoint 256, 1345, 16474 ?


You can do this yourself. I have already given you enough information in
previous emails to answer this question on your own, but here it is again:

Open an interactive Python session, and run this code:

c = chr(16474)
len(c.encode('utf-8'))


That will tell you how many bytes are used for that example.



--
Steven

