textwrap and combining diacritical marks

 
 
Berteun Damman
      06-28-2007
Hello,

The textwrap module always uses len() to determine the length of the
string being wrapped. That is sensible in many circumstances, but I
think there are cases where it does not lead to the desired result.

I assume this module is mostly used where text is formatted for
presentation to a user, e.g. in a console application. The number of
characters in the string, as determined by len(), might not be the
number of columns occupied. Some of the characters might be combining
diacritical marks, which go on top of the previous character; e.g. the
string de'ge'ne're' (where each ' stands for a combining acute accent)
has a len() of 12 but displays with a width of only 8 columns.
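
To make the mismatch concrete, here is a minimal sketch in Python 3. The helper name visible_columns is made up for this illustration, and it only accounts for combining marks, not for wide or control characters.

```python
import unicodedata

def visible_columns(text):
    """Count code points that are not combining marks (illustrative helper)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# "degenere" with a combining acute accent (U+0301) after every "e"
word = "de\u0301ge\u0301ne\u0301re\u0301"

print(len(word))              # 12 code points
print(visible_columns(word))  # 8 columns, if the terminal renders combining marks
```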

The string might also include character sequences that switch the
console to bold or underline mode, which have zero display width. If
this happens a lot, the resulting text might look very badly formatted
because of all these zero-width character strings.
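
As a rough illustration of the second case, assuming the bold/underline switches are ANSI SGR escape sequences (ESC [ ... m), their effect on len() can be removed with a regular expression before measuring. The names SGR_RE and width_without_sgr are invented for this sketch, and other kinds of escape sequences are not handled.

```python
import re

# Matches SGR ("Select Graphic Rendition") sequences such as ESC[1m or ESC[0m.
# Cursor movement and other escape sequences are deliberately not covered.
SGR_RE = re.compile(r"\x1b\[[0-9;]*m")

def width_without_sgr(text):
    """Length of text after dropping SGR sequences (illustrative helper)."""
    return len(SGR_RE.sub("", text))

bold = "\x1b[1mbold\x1b[0m text"
print(len(bold))                # 17
print(width_without_sgr(bold))  # 9
```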

It is of course impossible to handle all the scenarios in which some
characters might influence the width of the displayed string, but
wouldn't it be convenient to have a 'chunk_width' method or something
similar which can be overridden in a derived class, so that a user can
supply a custom implementation? The default chunk_width could simply be
'len()'.
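
The idea could look roughly like the sketch below. Note that this is not textwrap.TextWrapper and textwrap has no chunk_width hook; WidthAwareWrapper and CombiningAwareWrapper are hypothetical classes with a deliberately simple greedy algorithm, written only to show how an overridable width method would be used. Words wider than the line are placed on a line of their own rather than broken.

```python
import unicodedata

class WidthAwareWrapper:
    """Greedy word wrapper with an overridable width hook (illustration only)."""

    def __init__(self, width=70):
        self.width = width

    def chunk_width(self, chunk):
        # Default: one column per code point, i.e. plain len().
        return len(chunk)

    def wrap(self, text):
        lines, current, current_width = [], [], 0
        for word in text.split():
            w = self.chunk_width(word)
            # The joining space costs one column if the line is non-empty.
            needed = w + (1 if current else 0)
            if current and current_width + needed > self.width:
                lines.append(" ".join(current))
                current, current_width = [word], w
            else:
                current.append(word)
                current_width += needed
        if current:
            lines.append(" ".join(current))
        return lines

class CombiningAwareWrapper(WidthAwareWrapper):
    def chunk_width(self, chunk):
        # Combining marks occupy no column of their own.
        return sum(1 for ch in chunk if unicodedata.combining(ch) == 0)

text = "de\u0301ge\u0301ne\u0301re\u0301 abc"
print(len(WidthAwareWrapper(12).wrap(text)))      # 2 lines
print(len(CombiningAwareWrapper(12).wrap(text)))  # 1 line
```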

And that leads to another question: does Python have a function akin to
wcwidth(), which gives the number of column positions a Unicode
character needs?

Berteun
 
Berteun Damman
      06-28-2007
On Thu, 28 Jun 2007 09:19:20 +0000 (UTC), Berteun Damman
<berteun@NO_SPAMdds.nl> wrote:
> And that leads to another question: does Python have a function akin to
> wcwidth(), which gives the number of column positions a Unicode
> character needs?


After playing around a bit with unicodedata.normalize, and seeing how
it fails when there is no precomposed form, I decided to take
Markus Kuhn's implementation [1] and make a Python version [2].
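
The limitation of normalization can be seen in a small Python 3 example: NFC composes "e" plus a combining acute accent into a single code point, but "q" plus the same accent has no precomposed form, so len() still counts two.

```python
import unicodedata

e_acute = unicodedata.normalize("NFC", "e\u0301")  # composes to U+00E9
q_acute = unicodedata.normalize("NFC", "q\u0301")  # no precomposed form exists

print(len(e_acute))  # 1
print(len(q_acute))  # 2, the combining mark remains a separate code point
```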

This will try to guess the column width of a character. Non-printable
characters report a width of -1 (this includes '\n' and '\t', for
example), except for \0, which has width 0. Combining characters
report 0, normal Latin characters 1, and full-width forms, for
example, 2.
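
For readers who cannot reach the linked file, the behaviour described above can be approximated with the standard unicodedata module. This is only a sketch under that assumption: it is neither the code in [2] nor a port of the tables in [1], the names approx_wcwidth and approx_wcswidth are invented here, and it will disagree with wcwidth.c for some characters (format characters, for instance).

```python
import unicodedata

def approx_wcwidth(ch):
    """Rough column width of a single character (-1, 0, 1 or 2)."""
    if ch == "\0":
        return 0
    category = unicodedata.category(ch)
    if category == "Cc":
        # C0/C1 control characters, including '\n' and '\t'
        return -1
    if unicodedata.combining(ch) or category in ("Mn", "Me"):
        # Combining and enclosing marks sit on the previous character.
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        # East Asian Wide and Fullwidth forms take two columns.
        return 2
    return 1

def approx_wcswidth(text):
    """Sum of the widths, or -1 if any character is non-printable."""
    total = 0
    for ch in text:
        w = approx_wcwidth(ch)
        if w < 0:
            return -1
        total += w
    return total

print(approx_wcswidth("de\u0301ge\u0301ne\u0301re\u0301"))  # 8
print(approx_wcwidth("\uff21"))                              # 2 (FULLWIDTH LATIN CAPITAL LETTER A)
```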

Of course, real output depends on the capabilities of the display
device. xterm is capable of handling combining characters, whereas OS
X's Terminal.app cannot do it for Greek or Russian characters, for
example.

All in all, I think it is a reasonable start. There is one issue though,
involving Plane 1 characters. On 64-bit systems, so it seems, these
are stored as one character, on 32-bit systems as a surrogate pair. I
don't know exactly how this works, but the code should basically ignore
Plane 1 characters on 32-bit systems (i.e. always report display width
'1' even though they're combining or full-width).
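
As a hedged side note on the surrogate issue: as far as I know, in Python 2 this depended on whether the interpreter was a "narrow" or "wide" Unicode build rather than on the word size as such, and since Python 3.3 a string always holds whole code points. Where surrogate pairs do turn up, the two halves can be combined into one code point before a width lookup. The function name combine_surrogates is made up for this sketch.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair (given as integers) into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1D11E MUSICAL SYMBOL G CLEF is D834 DD1E in UTF-16
print(hex(combine_surrogates(0xD834, 0xDD1E)))  # 0x1d11e
```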

Berteun

[1] http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
[2] http://berteun.nl/tmp/wcwidth.py
 