Velocity Reviews - Computer Hardware Reviews

How to find number of characters in a unicode string?

 
 
Preben Randhol
Guest
Posts: n/a
 
      09-18-2006
Hi

If I use len() on a string containing unicode letters I get the number
of bytes the string uses. This means that len() can report size 6 when
the unicode string only contains 3 characters (that one would write by
hand or see on the screen). Is there a way to count the characters
themselves rather than the bytes used to represent them?

The reason for asking is that PyGTK needs number of characters to set
the width of Entry widgets to a certain length, and it expects viewable
characters and not number of bytes to represent them.


Thanks in advance


Preben
 
 
 
 
 
Marc 'BlackJack' Rintsch
Guest
Posts: n/a
 
      09-18-2006
In <20060918221814.08625ea2.randhol+valid_for_reply_f (E-Mail Removed)>,
Preben Randhol wrote:

> If I use len() on a string containing unicode letters I get the number
> of bytes the string uses. This means that len() can report size 6 when
> the unicode string only contains 3 characters (that one would write by
> hand or see on the screen). Is there a way to calculate in characters
> and not in bytes to represent the characters.


Yes and you already seem to know the answer: Decode the byte string and
use `len()` on the unicode string.
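
A minimal sketch of the decode-then-count step, written so that it also runs on Python 3 (where the byte string is a `bytes` object). The three-letter sample and the assumption that the input is UTF-8 are illustrative and not taken from the original post:

```python
# Hypothetical sample: three non-ASCII letters, each two bytes in UTF-8.
raw = u"\u00e6\u00f8\u00e5".encode("utf-8")  # byte string, e.g. as read from a file

# Decoding turns the bytes into a unicode string of code points.
text = raw.decode("utf-8")

byte_count = len(raw)    # counts bytes: 6
char_count = len(text)   # counts code points: 3
print(byte_count, char_count)
```

If the bytes are in a different encoding, the codec name passed to decode() has to match it.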

Ciao,
Marc 'BlackJack' Rintsch
 
 
 
 
 
faulkner
Guest
Posts: n/a
 
      09-18-2006
Are you sure you're using unicode objects?
len(u'\uffff') == 1
The encodings module should help you turn '\xff\xff' into u'\uffff'
(those two bytes decode to it as UTF-16).

Preben Randhol wrote:
> Hi
>
> If I use len() on a string containing unicode letters I get the number
> of bytes the string uses. This means that len() can report size 6 when
> the unicode string only contains 3 characters (that one would write by
> hand or see on the screen). Is there a way to calculate in characters
> and not in bytes to represent the characters.
>
> The reason for asking is that PyGTK needs number of characters to set
> the width of Entry widgets to a certain length, and it expects viewable
> characters and not number of bytes to represent them.
>
>
> Thanks in advance
>
>
> Preben


 
 
Preben Randhol
Guest
Posts: n/a
 
      09-19-2006
On Mon, 18 Sep 2006 22:29:20 +0200
Marc 'BlackJack' Rintsch <(E-Mail Removed)> wrote:

> Yes and you already seem to know the answer: Decode the byte string
> and use `len()` on the unicode string.


.decode("utf-8") did the trick. Thanks!

Preben
 
 
Lawrence D'Oliveiro
Guest
Posts: n/a
 
      09-29-2006
In message <(E-Mail Removed)>, Marc 'BlackJack'
Rintsch wrote:

> In <20060918221814.08625ea2.randhol+valid_for_reply_f (E-Mail Removed)>,
> Preben Randhol wrote:
>
>> Is there a way to calculate in characters
>> and not in bytes to represent the characters.

>
> Decode the byte string and use `len()` on the unicode string.


Hmmm, for some reason

len(u"C\u0327")

returns 2.
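
Listing the code points shows where the 2 comes from. This is only an inspection sketch using the standard unicodedata module:

```python
import unicodedata

s = u"C\u0327"

# One entry per code point: len() counts these, not what is drawn on screen.
names = [unicodedata.name(ch) for ch in s]
print(len(s), names)
```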
 
 
Marc 'BlackJack' Rintsch
Guest
Posts: n/a
 
      09-29-2006
In <efija1$357$(E-Mail Removed)>, Lawrence D'Oliveiro wrote:

> In message <(E-Mail Removed)>, Marc 'BlackJack'
> Rintsch wrote:
>
>> In <20060918221814.08625ea2.randhol+valid_for_reply_f (E-Mail Removed)>,
>> Preben Randhol wrote:
>>
>>> Is there a way to calculate in characters
>>> and not in bytes to represent the characters.

>>
>> Decode the byte string and use `len()` on the unicode string.

>
> Hmmm, for some reason
>
> len(u"C\u0327")
>
> returns 2.


Okay, decode and normalize and then use `len()` on the unicode string.

Ciao,
Marc 'BlackJack' Rintsch

 
 
Gabriel Genellina
Guest
Posts: n/a
 
      09-29-2006
At Friday 29/9/2006 04:52, Lawrence D'Oliveiro wrote:

> >> Is there a way to calculate in characters
> >> and not in bytes to represent the characters.

> >
> > Decode the byte string and use `len()` on the unicode string.

>
>Hmmm, for some reason
>
> len(u"C\u0327")
>
>returns 2.


That's correct: these are two Unicode characters,
C and combining cedilla, which display together as Ç. From
<http://en.wikipedia.org/wiki/Unicode>:

"Unicode takes the role of providing a unique
code point — a number, not a glyph — for each
character. In other words, Unicode represents a
character in an abstract way, and leaves the
visual rendering (size, shape, font or style) to
other software [...] This simple aim becomes
complicated, however, by concessions made by
Unicode's designers, in the hope of encouraging a
more rapid adoption of Unicode. [...] A lot of
essentially identical characters were encoded
multiple times at different code points to
preserve distinctions used by legacy encodings
and therefore allow conversion from those
encodings to Unicode (and back) without losing
any information. [...] Also, while Unicode allows
for combining characters, it also contains
precomposed versions of most letter/diacritic
combinations in normal use. These make conversion
to and from legacy encodings simpler and allow
applications to use Unicode as an internal text
format without having to implement combining
characters. For example é can be represented in
Unicode as U+0065 (Latin small letter e) followed
by U+0301 (combining acute) but it can also be
represented as the precomposed character U+00E9
(Latin small letter e with acute)."
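
The é example from the quoted passage can be checked directly. A short sketch using unicodedata.normalize from the standard library (the variable names are only for illustration):

```python
import unicodedata

decomposed = u"e\u0301"    # U+0065 followed by U+0301 (combining acute)
precomposed = u"\u00e9"    # U+00E9 (Latin small letter e with acute)

# Same rendering, different code point sequences, so they compare unequal.
same_as_is = (decomposed == precomposed)

# NFC composes where a precomposed form exists; NFD decomposes again.
composed = unicodedata.normalize("NFC", decomposed)
split_again = unicodedata.normalize("NFD", precomposed)

print(len(decomposed), len(precomposed), same_as_is)
print(composed == precomposed, split_again == decomposed)
```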

Gabriel Genellina
Softlab SRL






 
 
Leif K-Brooks
Guest
Posts: n/a
 
      09-29-2006
Lawrence D'Oliveiro wrote:
> Hmmm, for some reason
>
> len(u"C\u0327")
>
> returns 2.


Is len(unicodedata.normalize('NFC', u"C\u0327")) what you want?
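
Putting the thread's advice together (decode, normalize, then len()), a sketch might look like the following. The helper name char_count and the UTF-8 default are assumptions made for illustration:

```python
import unicodedata

def char_count(raw, encoding="utf-8"):
    """Count code points in a byte string after decoding and NFC normalization."""
    text = raw.decode(encoding)
    return len(unicodedata.normalize("NFC", text))

raw = u"C\u0327".encode("utf-8")   # 'C' + combining cedilla, encoded as UTF-8
print(len(raw), len(raw.decode("utf-8")), char_count(raw))
```

NFC only shortens sequences that have a precomposed equivalent; a base letter carrying a mark with no precomposed form still counts as two.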
 
 
Leo Kislov
Guest
Posts: n/a
 
      10-11-2006

Lawrence D'Oliveiro wrote:
> In message <(E-Mail Removed)>, Marc 'BlackJack'
> Rintsch wrote:
>
> > In <20060918221814.08625ea2.randhol+valid_for_reply_f (E-Mail Removed)>,
> > Preben Randhol wrote:
> >
> >> Is there a way to calculate in characters
> >> and not in bytes to represent the characters.

> >
> > Decode the byte string and use `len()` on the unicode string.

>
> Hmmm, for some reason
>
> len(u"C\u0327")
>
> returns 2.


If Python ever provides this functionality, I guess it would be
u"C\u0327".width() == 1. But it's not clear when unicode.org will
provide recommended fixed-font character width information for *all*
characters. I recently stumbled upon the Tamil language, where for example
u'\u0b95\u0bcd', u'\u0b95\u0bbe', u'\u0b95\u0bca', u'\u0b95\u0bcc'
look like they are 1, 2, 3 and 4 columns wide. To add insult to injury,
these 4 symbols are all considered *single* letter symbols. If your
email reader is able to show them, here they are in all their glory:
க், கா, கொ, கௌ.
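
Normalization does not help with syllables like these, since there is no single precomposed code point for them. One rough approximation, which is not full grapheme-cluster segmentation and says nothing about display width, is to skip code points in the mark categories (Mn, Mc, Me). The function name below is hypothetical:

```python
import unicodedata

def count_base_characters(text):
    """Count code points that are not combining marks (categories Mn, Mc, Me)."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))

tamil = [u"\u0b95\u0bcd", u"\u0b95\u0bbe", u"\u0b95\u0bca", u"\u0b95\u0bcc"]
counts = [count_base_characters(s) for s in tamil]

# len() sees two code points per syllable; the mark-skipping count sees one.
print([len(s) for s in tamil], counts)
```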

 
 
 
 