# A few questions about encoding

Νικόλαος Κούρας
Guest
Posts: n/a

 06-13-2013
On 13/6/2013 11:20 πμ, Chris Angelico wrote:
> On Thu, Jun 13, 2013 at 6:08 PM, Νικόλαος Κούρας <(E-Mail Removed)> wrote:
>> On 13/6/2013 10:58 πμ, Chris Angelico wrote:
>>>
>>> On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας <(E-Mail Removed)>
>>> wrote:
>>>
>>>> On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
>>>>>
>>>>> No! That creates a string from 16474 in base two:
>>>>> '0b100000001011010'
>>>>
>>>>
>>>> I disagree here.
>>>> 16474 is a number in base 10. Doing bin(16474) we get the binary
>>>> representation of number 16474 and not a string.
>>>> Why you say we receive a string while python presents a binary number?
>>>
>>>
>>> You can disagree all you like. Steven cited a simple point of fact,
>>> one which can be verified in any Python interpreter. Nikos, you are
>>> flat wrong here; bin(16474) creates a string.

>>
>>
>> Indeed python embraced it in single quoting '0b100000001011010' and not as
>> 0b100000001011010 which in fact makes it a string.
>>
>> But since bin(16474) seems to create a string rather than the number I
>> expected (at least in my mind), how do we get the binary representation of
>> the number 16474 as a number?

>
> In Python 2:
>>>> 16474

typing 16474 in interactive session both in python 2 and 3 gives back
the number 16474

while we want the binary representation of the number 16474

Nobody
Guest
Posts: n/a

 06-13-2013
On Thu, 13 Jun 2013 12:01:55 +1000, Chris Angelico wrote:

> On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano
> <(E-Mail Removed)> wrote:
>> The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
>> that's not UTF-8, that's UTF-8-plus-extra-codepoints.

>
> And a proper UTF-8 decoder will reject "\xC0\x80" and "\xed\xa0\x80", even
> though mathematically they would translate into U+0000 and U+D800
> respectively. The UTF-16 *mechanism* is limited to no more than Unicode
> has currently used, but I'm left wondering if that's actually the other
> way around - that Unicode planes were deemed to stop at the point where
> UTF-16 can't encode any more.

Indeed. 5-byte and 6-byte sequences were originally part of the UTF-8
specification, allowing for 31 bits. Later revisions of the standard
imposed the UTF-16 limit on Unicode as a whole.
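
For what it's worth, this is easy to check in an interpreter. A minimal sketch, assuming CPython 3 and its built-in strict 'utf-8' codec; the helper name `is_valid_utf8` is made up for the illustration:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the built-in strict UTF-8 codec accepts data."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

# Overlong two-byte form that would "mathematically" be U+0000.
print(is_valid_utf8(b'\xc0\x80'))
# Three bytes that would "mathematically" be the surrogate U+D800.
print(is_valid_utf8(b'\xed\xa0\x80'))
# A sequence starting with 0xF8, the old 5-byte lead byte.
print(is_valid_utf8(b'\xf8\x88\x80\x80\x80'))
# U+10FFFF, the highest code point, as four bytes.
print(is_valid_utf8(b'\xf4\x8f\xbf\xbf'))
```

On a Python 3 interpreter the first three should come back False and the last one True.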

Steven D'Aprano
Guest
Posts: n/a

 06-13-2013
On Thu, 13 Jun 2013 12:41:41 +0300, Νικόλαος Κούρας wrote:

>> In Python 2:
>>>>> 16474

> typing 16474 in interactive session both in python 2 and 3 gives back
> the number 16474
>
> while we want the binary representation of the number 16474

Python does not work that way. Ints *always* display in decimal,
regardless of whether you enter the int in binary:

py> 0b100000001011010
16474

octal:

py> 0o40132
16474

or hexadecimal:

py> 0x405A
16474

ints always display in decimal. The only way to display in another base
is to build a string showing what the int would look like in a different
base:

py> hex(16474)
'0x405a'

Notice that the return values of bin, oct and hex are all strings. If they
were ints, then they would display in decimal, defeating the purpose!
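
A small sketch of the same point, assuming Python 3. The variable names are only for illustration:

```python
n = 16474

as_binary = bin(n)    # a str holding the base-2 digits, prefixed with '0b'
print(type(n).__name__, type(as_binary).__name__)   # int str

# The text form can be parsed back into the very same int.
assert int(as_binary, 2) == n

# format() gives the digits without the prefix, optionally zero-padded.
print(format(n, 'b'))      # 100000001011010
print(format(n, '016b'))   # 0100000001011010
print(format(n, 'x'))      # 405a
```

So the int itself has no base; only its written form does.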

--
Steven

Νικόλαος Κούρας
Guest
Posts: n/a

 06-13-2013
On 13/6/2013 2:49 μμ, Steven D'Aprano wrote:

Please confirm these are true statements:

A code-point and the code-point's ordinal value are associated into a
Unicode charset. They have the so called 1:1 mapping.

So, i was under the impression that by encoding the code-point into
utf-8 was the same as encoding the code-point's ordinal value into utf-8.

So, now i believe they are two different things.
The code-point *is what actually* needs to be encoded and *not* its
ordinal value.

> The leading 0b is just syntax to tell you "this is base 2, not base 8
> (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.

But byte objects are represented as '\x' instead of the aforementioned
'0x'. Why is that?

> ints always display in decimal. The only way to display in another base
> is to build a string showing what the int would look like in a different
> base:
>
> py> hex(16474)
> '0x405a'
>
> Notice that the return value of bin, oct and hex are all strings. If they
> were ints, then they would display in decimal, defeating the purpose!

Thank you, I didn't know that! Indeed it works like this.

To encode a number we have to turn it into a string first.

"16474".encode('utf-8')
b'16474'

That 'b' stands for bytes.
How can I view this bytes object's representation with hex() or bin()?

============
Also:
>>> len('0b100000001011010')
17

You said this string consists of 17 chars.
Why does the leading '0b' syntax count as bits as well? Shouldn't it be 15?

Dennis Lee Bieber
Guest
Posts: n/a

 06-13-2013
On Wed, 12 Jun 2013 09:09:05 +0000 (UTC), Νικόλαος Κούρας
<(E-Mail Removed)> declaimed the following:

>>> (*) infact UTF8 also indicates the end of each character

>
>> Up to a point. The initial byte encodes the length and the top few
>> bits, but the subsequent octets aren't distinguishable as final in
>> isolation. 0x80-0xBF can all be either medial or final.

>
>
>So, the first high-bits are a directive that UTF-8 uses to know how many
>bytes each character is being represented as.
>
>0-127 codepoints(characters) use 1 bit to signify they need 1 bit for
>storage and the rest 7 bits to actually store the character ?
>

Not quite... The leading bit is a 0 -> which means 0..127 are sent
as-is, no manipulation.

>while
>
>128-256 codepoints(characters) use 2 bit to signify they need 2 bits for
>storage and the rest 14 bits to actually store the character ?
>

128..255 -- in what encoding? These all have the leading bit with a
value of 1. In 8-bit encodings (ISO-Latin-1) the meaning of those values is
inherent in the specified encoding and they are sent as-is.

BUT, in UTF-8, a byte with a leading 1-bit signals that the byte
identifies a multi-byte sequence. CF:
https://en.wikipedia.org/wiki/UTF-8#Description

So anything that starts with bits 110 is a two byte sequence (and the

1110 starts a three byte sequence, 11110 starts a four byte sequence...
Basically, count the number of leading 1-bits before a 0 bit, and that
tells you how many bytes are in the multi-byte sequence -- and all bytes
that start with 10 are supposed to be the continuations of a multibyte set
(and not a signal that this is a 1-byte entry -- those only have a leading
0)
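
The counting rule above can be sketched in a few lines. This is only an illustration, assuming Python 3; `classify_byte` is a made-up helper, and it merely looks at the leading bits without validating anything else (modern UTF-8 stops at 4-byte sequences):

```python
def classify_byte(b: int) -> str:
    """Describe a single byte (0..255) by its leading bits, UTF-8 style."""
    if b & 0b10000000 == 0:
        return 'single byte (ASCII)'
    if b & 0b11000000 == 0b10000000:
        return 'continuation'
    # Count the leading 1-bits before the first 0-bit.
    ones = 0
    mask = 0b10000000
    while b & mask:
        ones += 1
        mask >>= 1
    return 'lead byte of a %d-byte sequence' % ones

# 'a', e-acute (U+00E9) and the euro sign (U+20AC), encoded as UTF-8.
for b in 'a\u00e9\u20ac'.encode('utf-8'):
    print(format(b, '08b'), classify_byte(b))

# The same e-acute in ISO-Latin-1 is one byte, sent as-is.
print('\u00e9'.encode('latin-1'))
```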

>Isn't 14 bits way to many to store a character ?

Original UTF-8 allowed for 31-bits to specify a character in the Unicode
set. It used 6 bytes -- 48 bits total, but 7 bits of the first byte were
the flag (6 leading 1 bits and a 0 bit), and two bits (leading 10) of each
continuation.
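
As a rough check of that arithmetic, here is a sketch that counts the bits left over for the character at each sequence length of the original 1-to-6-byte scheme. `payload_bits` is a hypothetical helper name, not anything from a library:

```python
def payload_bits(n_bytes: int) -> int:
    """Bits available for the code point in an n-byte sequence (1..6)."""
    if n_bytes == 1:
        return 7                       # 0xxxxxxx
    # The lead byte spends n 1-bits plus one 0-bit on the flag.
    lead = 8 - (n_bytes + 1)
    # Each continuation byte (10xxxxxx) carries 6 bits.
    continuation = 6 * (n_bytes - 1)
    return lead + continuation

for n in range(1, 7):
    print(n, 'bytes ->', payload_bits(n), 'bits')
```

This should print 7, 11, 16, 21, 26 and 31 bits for 1 to 6 bytes respectively.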

--
Wulfraed Dennis Lee Bieber AF6VN
(E-Mail Removed) HTTP://wlfraed.home.netcom.com/

Cameron Simpson
Guest
Posts: n/a

 06-14-2013
On 13Jun2013 17:19, Nikos as SuperHost Support <(E-Mail Removed)> wrote:
| A code-point and the code-point's ordinal value are associated into
| a Unicode charset. They have the so called 1:1 mapping.
|
| So, i was under the impression that by encoding the code-point into
| utf-8 was the same as encoding the code-point's ordinal value into
| utf-8.
|
| So, now i believe they are two different things.
| The code-point *is what actually* needs to be encoded and *not* its
| ordinal value.

Because there is a 1:1 mapping, these are the same thing: a code
point is directly _represented_ by the ordinal value, and the ordinal
value is encoded for storage as bytes.
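
To make that concrete, here is a hand-rolled sketch that takes the ordinal (an int) and produces the UTF-8 bytes, compared against what str.encode() gives for the corresponding character. It assumes Python 3; `utf8_encode` is a made-up name, and it does no validation (no surrogate check, no upper limit):

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point, given as its ordinal, as 1 to 4 UTF-8 bytes."""
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

ch = chr(16474)
assert ord(ch) == 16474                           # the 1:1 mapping
assert utf8_encode(ord(ch)) == ch.encode('utf-8')
print(utf8_encode(16474))                         # b'\xe4\x81\x9a'
```

str.encode() starts from the character and this function starts from its ordinal, but both arrive at the same bytes.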

| > The leading 0b is just syntax to tell you "this is base 2, not base 8
| > (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
|
| But byte objects are represented as '\x' instead of the
| aforementioned '0x'. Why is that?

You're confusing a "string representation of a single number in
some base (eg 2 or 16)" with the "string-ish representation of a
bytes object".

The former is just notation for writing a number in different bases, eg:

27 base 10
1b base 16
33 base 8
11011 base 2

A common convention, and the one used by hex(), oct() and bin() in
Python, is to prefix the non-base-10 representations with "0x" for
base 16, "0o" for base 8 ("o"ctal) and "0b" for base 2 ("b"inary):

27
0x1b
0o33
0b11011

This allows the human reader or a machine lexer to decide what base
the number is written in, and therefore to figure out what the
underlying numeric value is.
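
A quick sketch of the "machine lexer" side, assuming Python 3: passing base 0 to int() asks it to work out the base from the prefix, the way Python source code is read.

```python
texts = ['27', '0x1b', '0o33', '0b11011']

for text in texts:
    # Base 0 means: infer the base from the '0x' / '0o' / '0b' prefix.
    print(text, '->', int(text, 0))

# Written as literals in source code they are all the same int.
assert 27 == 0x1b == 0o33 == 0b11011
```

All four lines should show 27 on the right-hand side.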

Conversely, consider the bytes object consisting of the values [97,
98, 99, 27, 10]. In ASCII (and UTF-8 and the iso-8859-x encodings)
these may all represent the characters ['a', 'b', 'c', ESC, NL].
So when "printing" a bytes object, which is a sequence of small integers representing
values stored in bytes, it is compact to print:

b'abc\x1b\n'

which is ['a', 'b', 'c', chr(27), newline].

The slosh (\) is the common convention in C-like languages and many
others for representing special characters not directly represented
by themselves. So "\\" for a slosh, "\n" for a newline and "\x1b"
for character 27 (ESC).

The bytes object is still just a sequence of integers, but because
it is very common to have those integers represent text, and very
common to have some text one wants represented as bytes in a direct
1:1 mapping, this compact text form is useful and readable. It is
also legal Python syntax for making a small bytes object.

To demonstrate that this is just a _representation_, run this:

>>> [i for i in b'abc\x1b\n']
[97, 98, 99, 27, 10]

at an interactive Python 3 prompt. See? Just numbers.
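
The round trip works in the other direction too. A short sketch, assuming Python 3 (where indexing a bytes object gives an int):

```python
data = b'abc\x1b\n'

values = list(data)
print(values)                  # [97, 98, 99, 27, 10]

# Building a bytes object from the same integers gives back the same thing.
assert bytes(values) == data

# One element is just a small integer; '\x1b' is only how it is written.
print(data[3], hex(data[3]))   # 27 0x1b
```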

| To encode a number we have to turn it into a string first.
|
| "16474".encode('utf-8')
| b'16474'
|
| That 'b' stand for bytes.

http://docs.python.org/3/reference/l...bytes-literals

| How can i view this byte's object representation as hex() or as bin()?

See above. A bytes is a _sequence_ of values. hex() and bin() print
individual values in hexadecimal or binary respectively. You could
do this:

    for value in b'16474':
        print(value, hex(value), bin(value))
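
If you want the whole object in one go rather than one value per line, something like this sketch should do. It assumes Python 3.5 or later for bytes.hex():

```python
data = "16474".encode('utf-8')

# All bytes as one run of hex digits, two digits per byte.
print(data.hex())                                   # 3136343734

# All bytes as 8-bit binary groups, separated by spaces.
print(' '.join(format(b, '08b') for b in data))
```

Each character of "16474" is an ASCII digit, so each one becomes a single byte in the range 0x30 to 0x39.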

Cheers,
--
Cameron Simpson <(E-Mail Removed)>

Uhlmann's Razor: When stupidity is a sufficient explanation, there is no need
to have recourse to any other.
- Michael M. Uhlmann, assistant attorney general
for legislation in the Ford Administration

Nick the Gr33k
Guest
Posts: n/a

 06-14-2013
On 14/6/2013 1:46 πμ, Dennis Lee Bieber wrote:
> On Wed, 12 Jun 2013 09:09:05 +0000 (UTC), Νικόλαος Κούρας
> <(E-Mail Removed)> declaimed the following:
>
>>>> (*) infact UTF8 also indicates the end of each character

>>
>>> Up to a point. The initial byte encodes the length and the top few
>>> bits, but the subsequent octets aren't distinguishable as final in
>>> isolation. 0x80-0xBF can all be either medial or final.

>>
>>
>> So, the first high-bits are a directive that UTF-8 uses to know how many
>> bytes each character is being represented as.
>>
>> 0-127 codepoints(characters) use 1 bit to signify they need 1 bit for
>> storage and the rest 7 bits to actually store the character ?
>>

> Not quite... The leading bit is a 0 -> which means 0..127 are sent
> as-is, no manipulation.

So, in utf-8, the leading bit which is a zero 0, its actually a flag to
tell that the code-point needs 1 byte to be stored and the rest 7 bits
is for the actual value of 0-127 code-points ?

>> 128-256 codepoints(characters) use 2 bit to signify they need 2 bits for
>> storage and the rest 14 bits to actually store the character ?
>>

> 128..255 -- in what encoding? These all have the leading bit with a
> value of 1. In 8-bit encodings (ISO-Latin-1) the meaning of those values is
> inherent in the specified encoding and they are sent as-is.

So, latin-iso or greek-iso, the leading 0 is not a flag like it is in
utf-8 encoding because latin-iso and greek-iso and all *-iso use all 8
bits for storage?

But, in utf-8, the leading bit, which is 1, is to tell that the
code-point needs 2 bytes to be stored and the rest 7 bits is for the
actual value of 128-255 code-points ?

But why 2 bytes? leading 1 is a flag and the rest 7 bits can hold the
encoded value.

But that is not the case since we know that utf-8 needs 2 bytes to store
code-points 128-255

> 1110 starts a three byte sequence, 11110 starts a four byte sequence...
> Basically, count the number of leading 1-bits before a 0 bit, and that
> tells you how many bytes are in the multi-byte sequence -- and all bytes
> that start with 10 are supposed to be the continuations of a multibyte set
> (and not a signal that this is a 1-byte entry -- those only have a leading
> 0)

Why doesn't it work like this?

leading 0 = 1 byte flag
leading 1 = 2 bytes flag
leading 00 = 3 bytes flag
leading 01 = 4 bytes flag
leading 10 = 5 bytes flag
leading 11 = 6 bytes flag

Wouldn't it be more logical?

> Original UTF-8 allowed for 31-bits to specify a character in the Unicode
> set. It used 6 bytes -- 48 bits total, but 7 bits of the first byte were
> the flag (6 leading 1 bits and a 0 bit), and two bits (leading 10) of each
> continuation.

utf8 with 6 bytes = 48 bits - 7 bits (from the first byte) - 2 bits (for
each continuation) * 5 = 48 - 7 - 10 = 31 bits indeed to store the actual
code-point. But 2^31 is still a huge number to store any kind of
character, isn't it?

--
What is now proved was at first only imagined!

Zero Piraeus
Guest
Posts: n/a

 06-14-2013
:

On 14 June 2013 01:34, Nick the Gr33k <(E-Mail Removed)> wrote:
> Why doesn't it work like this?
>
> leading 0 = 1 byte flag
> leading 1 = 2 bytes flag
> leading 00 = 3 bytes flag
> leading 01 = 4 bytes flag
> leading 10 = 5 bytes flag
> leading 11 = 6 bytes flag
>
> Wouldn't it be more logical?

indicates "1 byte" (as is indeed the case in UTF8). What things could
00 or 01 for other numbers of bytes?

... okay, you're obviously going to need to be spoon-fed a little more
than that. Here's a byte:

01010101

Is that a single byte representing a code point in the 0-127 range, or
the first of 4 bytes representing something else, in your proposed
scheme? How can you tell?

Now look at the way UTF8 does it:
<http://en.wikipedia.org/wiki/Utf-8#Description>

Really, follow the link and study the table carefully. Don't continue
reading this until you believe you understand the choices that the

Pay particular attention to the possible values for byte 1. Do you
notice the difference between that scheme, and yours:

0xxxxxxx
1xxxxxxx
00xxxxxx
01xxxxxx
10xxxxxx
11xxxxxx

If you don't see it, keep looking until you do ... this email gives
you more than enough hints to work it out. Don't ask someone here to
explain it to you. If you want to become competent, you must use your
brain.

-[]z.

Nick the Gr33k
Guest
Posts: n/a

 06-14-2013
On 14/6/2013 4:00 πμ, Cameron Simpson wrote:
> On 13Jun2013 17:19, Nikos as SuperHost Support <(E-Mail Removed)> wrote:
> | A code-point and the code-point's ordinal value are associated into
> | a Unicode charset. They have the so called 1:1 mapping.
> |
> | So, i was under the impression that by encoding the code-point into
> | utf-8 was the same as encoding the code-point's ordinal value into
> | utf-8.
> |
> | So, now i believe they are two different things.
> | The code-point *is what actually* needs to be encoded and *not* its
> | ordinal value.
>
> Because there is a 1:1 mapping, these are the same thing: a code
> point is directly _represented_ by the ordinal value, and the ordinal
> value is encoded for storage as bytes.

So, you are saying that:

chr(16474).encode('utf-8') #being the code-point encoded

ord(chr(16474)).encode('utf-8') #being the code-point's ordinal
encoded which gives an error.

that shows us that a character is what is being encoded to utf-8 but
the character's ordinal cannot.

So, why do you say "....and the ordinal value is encoded for storage as
bytes." ?

> | > The leading 0b is just syntax to tell you "this is base 2, not base 8
> | > (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
> |
> | But byte objects are represented as '\x' instead of the
> | aforementioned '0x'. Why is that?
>
> You're confusing a "string representation of a single number in
> some base (eg 2 or 16)" with the "string-ish representation of a
> bytes object".

>>> bin(16474)
'0b100000001011010'
that is a binary format string representation of number 16474, yes?

>>> hex(16474)
'0x405a'
that is a hexadecimal format string representation of number 16474, yes?

WHILE:

b'abc\x1b\n' = a string representation of a byte, which in turn is a
series of integers, so that makes this a string representation of
integers, is this correct?

\x1b = ESC character

\ = for separating bytes
x = to flag that the following bytes are going to be represented as hex
values? what exactly does 'x' mean here? character perhaps?

Still it's not clear in my head what the difference between '0x1b' and
'\x1b' is:

i think:
0x1b = an integer represented in hex format

\x1b = a character represented in hex format

is this true?

> | How can i view this byte's object representation as hex() or as bin()?
>
> See above. A bytes is a _sequence_ of values. hex() and bin() print
> individual values in hexadecimal or binary respectively.

>>> for value in b'\x97\x98\x99\x27\x10':
...     print(value, hex(value), bin(value))
...
151 0x97 0b10010111
152 0x98 0b10011000
153 0x99 0b10011001
39 0x27 0b100111
16 0x10 0b10000

>>> for value in b'abc\x1b\n':
...     print(value, hex(value), bin(value))
...
97 0x61 0b1100001
98 0x62 0b1100010
99 0x63 0b1100011
27 0x1b 0b11011
10 0xa 0b1010

Why do these two give different values when printed?
--
What is now proved was at first only imagined!

Nick the Gr33k
Guest
Posts: n/a

 06-14-2013
On 14/6/2013 9:00 πμ, Zero Piraeus wrote:
> :
>
> On 14 June 2013 01:34, Nick the Gr33k <(E-Mail Removed)> wrote:
>> Why doesn't it work like this?
>>
>> leading 0 = 1 byte flag
>> leading 1 = 2 bytes flag
>> leading 00 = 3 bytes flag
>> leading 01 = 4 bytes flag
>> leading 10 = 5 bytes flag
>> leading 11 = 6 bytes flag
>>
>> Wouldn't it be more logical?

>
> indicates "1 byte" (as is indeed the case in UTF8). What things could
> 00 or 01 for other numbers of bytes?
>
> ... okay, you're obviously going to need to be spoon-fed a little more
> than that. Here's a byte:
>
> 01010101
>
> Is that a single byte representing a code point in the 0-127 range, or
> the first of 4 bytes representing something else, in your proposed
> scheme? How can you tell?

Indeed.

You cannot tell if it stands for 1 byte or a 4 byte sequence:

0 + 1010101 = leading 0 stands for 1byte representation of a code-point

01 + 010101 = leading 01 stands for 4byte representation of a code-point

the problem here in my scheme of how utf8 encoding works is that you
cannot tell whether the flag is '0' or '01'

The same happens with leading '1' and '11'. You cannot tell what the flag is,
so you cannot know if the Unicode code-point is being represented as a
2-byte sequence or a 6-byte sequence

Understood
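
The ambiguity can be checked mechanically. A small sketch, assuming Python 3: it lists every pair of flags where one flag is the beginning of another, first for the scheme I proposed and then for the lead-byte and continuation flags of the original 1-to-6-byte UTF-8. `ambiguous_pairs` is a made-up helper name:

```python
def ambiguous_pairs(flags):
    """Return pairs (a, b) where flag a is a proper prefix of flag b."""
    return [(a, b) for a in flags for b in flags
            if a != b and b.startswith(a)]

proposed = ['0', '1', '00', '01', '10', '11']
utf8_original = ['0', '10', '110', '1110', '11110', '111110', '1111110']

print(ambiguous_pairs(proposed))
# [('0', '00'), ('0', '01'), ('1', '10'), ('1', '11')]
print(ambiguous_pairs(utf8_original))
# []
```

In the real scheme no flag is the start of another one, so reading bits from the left always gives exactly one answer.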

> Now look at the way UTF8 does it:
> <http://en.wikipedia.org/wiki/Utf-8#Description>
>
> Really, follow the link and study the table carefully. Don't continue
> reading this until you believe you understand the choices that the
>
> Pay particular attention to the possible values for byte 1. Do you
> notice the difference between that scheme, and yours:
>
> 0xxxxxxx
> 1xxxxxxx
> 00xxxxxx
> 01xxxxxx
> 10xxxxxx
> 11xxxxxx
>
> If you don't see it, keep looking until you do ... this email gives
> you more than enough hints to work it out. Don't ask someone here to
> explain it to you. If you want to become competent, you must use your
> brain.

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

1. '110' is the flag for a 2-byte code-point
2. why does the leading flag in the 2nd byte and every subsequent byte
have to be '10'?

--
What is now proved was at first only imagined!