byte count unicode string

 
 
willie
      09-20-2006
Martin v. Löwis:

>willie wrote:
>
>> Thank you for your patience and for educating me.
>> (Though I still have a long way to go before enlightenment)
>> I thought Python might have a small weakness in
>> lacking an efficient way to get the number of bytes
>> in a "UTF-8 encoded Python string object" (proper?),
>> but I've been disabused of that notion.

>
>Well, to get to the enlightenment, you have to understand
>that Unicode and UTF-8 are *not* synonyms.
>
>A Python Unicode string is an abstract sequence of
>characters. It does have an in-memory representation,
>but that is irrelevant and depends on what microprocessor
>you use. A byte string is a sequence of quantities with
>8 bits each (called bytes).
>
>For each of them, the notion of "length" exists: For
>a Unicode string, it's the number of characters; for
>a byte string, the number of bytes.
>
>UTF-8 is a character encoding; it is only meaningful
>to say that byte strings have an encoding (where
>"UTF-8", "cp1252", "iso-2022-jp" are really very
>similar). For a character encoding, "what is the
>number of bytes?" is a meaningful question. For
>a Unicode string, this question is not meaningful:
>you have to specify the encoding first.
>
>Now, there is no len(unicode_string, encoding) function:
>len takes a single argument. To specify both the string
>and the encoding, you have to write
>len(unicode_string.encode(encoding)). This, as a
>side effect, actually computes the encoding.
>
>While it would be possible to answer the question
>"how many bytes has Unicode string S in encoding E?"
>without actually encoding the string, doing so would
>require codecs to implement their algorithm twice:
>once to count the number of bytes, and once to
>actually perform the encoding. Since this operation
>is not that frequent, it was decided not to burden
>codec authors with implementing the algorithm twice
>(actually, doing so was never even considered).
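
To make the len(unicode_string.encode(encoding)) idiom concrete, here is a minimal sketch. The helper name encoded_length and the sample string are made up for illustration and are not from the thread. It uses only u"" literals and .encode(), so it should behave the same under Python 2 and Python 3.3+.

```python
def encoded_length(text, encoding):
    """Return how many bytes `text` occupies once encoded with `encoding`.

    Hypothetical helper: it simply encodes the string and measures the result,
    which is what the thread describes.
    """
    return len(text.encode(encoding))


sample = u"L\u00f6wis"  # 5 characters; the second one is o-umlaut

char_count = len(sample)                            # number of characters
utf8_count = encoded_length(sample, "utf-8")        # o-umlaut needs 2 bytes
cp1252_count = encoded_length(sample, "cp1252")     # 1 byte per character here
utf16_count = encoded_length(sample, "utf-16-le")   # 2 bytes per character here
```

The same five characters give 6, 5 and 10 bytes respectively, which is why "how many bytes?" only has an answer once an encoding is named.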



Thanks for the thorough explanation. One last question
about terminology, then I'll go away.
What is the proper way to describe "ustr" below?

>>> ustr = buf.decode('UTF-8')
>>> type(ustr)
<type 'unicode'>


Is it a "unicode object that contains a UTF-8 encoded
string object?"

 
John Machin
 
      09-20-2006

willie wrote:
>
> Thanks for the thorough explanation. One last question
> about terminology, then I'll go away.
> What is the proper way to describe "ustr" below?
>
> >>> ustr = buf.decode('UTF-8')
> >>> type(ustr)
> <type 'unicode'>
>
>
> Is it a "unicode object that contains a UTF-8 encoded
> string object?"


No. It is a Python unicode object, period.

1. If it did contain another object you would be (quite justifiably)
screaming your peripherals off about the waste of memory.
2. You don't need to concern yourself with the internals of a unicode
object; however, rest assured that it is *not* stored as UTF-8 -- so if
you are hoping for a quick "number of UTF-8 bytes without actually
producing a str object" method, you are out of luck.

Consider this example: you have a str object which contains some
Russian text, encoded in cp1251.

str1 = russian_text                 # byte string holding cp1251-encoded text
unicode1 = str1.decode('cp1251')    # cp1251 bytes -> unicode
str2 = unicode1.encode('utf-8')     # unicode -> UTF-8 bytes
unicode2 = str2.decode('utf-8')     # UTF-8 bytes -> unicode

Then unicode2 == unicode1 and repr(unicode2) == repr(unicode1). There is
no way (without the above history) of determining how either object was
created, and you don't need to care how it was created.
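
A runnable variant of the round trip above, as a sketch only: the sample word (built from \u escapes) stands in for russian_text, which the post leaves undefined, and the extra names same_text and same_repr are illustrative. It should run unchanged under Python 2 and Python 3.3+.

```python
# Six Cyrillic characters, written with escapes so the source file stays ASCII.
original = u"\u041f\u0440\u0438\u0432\u0435\u0442"

str1 = original.encode("cp1251")     # stand-in for russian_text: cp1251 bytes
unicode1 = str1.decode("cp1251")     # cp1251 bytes -> unicode
str2 = unicode1.encode("utf-8")      # unicode -> UTF-8 bytes
unicode2 = str2.decode("utf-8")      # UTF-8 bytes -> unicode

same_text = (unicode2 == unicode1)               # the unicode objects are equal
same_repr = (repr(unicode2) == repr(unicode1))   # and print identically
byte_lengths = (len(str1), len(str2))            # the byte strings differ in size
```

The two byte strings have different lengths (one byte per character in cp1251, two in UTF-8 for these characters), yet the unicode objects decoded from them are indistinguishable.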

HTH,
John

 
Paul Rubin
 
      09-22-2006
willie writes:

> >>> ustr = buf.decode('UTF-8')
> >>> type(ustr)
> <type 'unicode'>
> Is it a "unicode object that contains a UTF-8 encoded
> string object?"


No, it's just a unicode string, i.e. a string over the Unicode character set.
UTF-8 is one way to encode unicode strings as byte strings.

You should read the Wikipedia article about Unicode; it will help you
understand.

http://en.wikipedia.org/wiki/Unicode
 