textwrap and combining diacritical marks
When using the textwrap module, the wrap will always use len() to
determine the length of the string being wrapped. This might be a
sensible thing to do in many circumstances, but I think there are
circumstances where this does not lead to the desired result.
I assume this module is mostly used where text is formatted to be
presented to a user, e.g. in a console application. The number of
characters in the string, as determined by len(), might not be the
number of columns occupied. Some of the
characters might be combining diacritical marks, which go on top of the
previous character, e.g. the string de'ge'ne're' (where each ' indicates
a combining acute accent) will only display with a width of 8 columns.
The string might also include some characters that'll switch the console
to bold or underline mode, which have zero display width. If this
happens a lot, the resulting text might seem very badly formatted because
of all these zero-width character sequences.
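To make the mismatch concrete, here is a small sketch (my own illustration; the sample word and the width of 8 are arbitrary choices). It builds the word with explicit U+0301 combining acute accents and wraps it to 8 columns:

```python
import textwrap

# "degenere" with a combining acute accent (U+0301) after each e.
word = "de\u0301ge\u0301ne\u0301re\u0301"

# len() counts code points, including the four combining marks.
print(len(word))

# textwrap measures with len(), so it treats the word as too long for
# 8 columns and breaks it, although it would display 8 columns wide.
lines = textwrap.wrap(word, width=8)
print(len(lines))
```

On a terminal that renders combining marks the word fits in 8 columns, yet the wrapper splits it, and the second piece starts with a stray accent.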
It is of course impossible to handle all these scenarios in which some
characters might influence the width of the displayed string, but
wouldn't it be convenient to have a 'chunk_width' method or something
which can be overridden in a derived class, so that a user might give a
custom implementation? The default of this chunk_width might just be
And that leads to another question, does Python have a function akin to
wcwidth() which gives the number of column positions a unicode character
Re: textwrap and combining diacritical marks
On Thu, 28 Jun 2007 09:19:20 +0000 (UTC), Berteun Damman
> And that leads to another question, does Python have a function akin to
> wcwidth() which gives the number of column positions a unicode character
After playing around a bit with unicodedata.normalize, and seeing how
it fails when there is no precomposed form, I decided to take
Markus Kuhn's implementation and make a Python version of it.
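For the record, this is roughly what I mean by normalize failing (a minimal sketch; the two sample strings are just examples). NFC normalization only shortens the string when Unicode has a precomposed character for the combination:

```python
import unicodedata

# e + combining acute: a precomposed e-acute (U+00E9) exists, so NFC merges them.
e_acute = unicodedata.normalize("NFC", "e\u0301")
print(len(e_acute))

# q + combining acute: there is no precomposed form, so nothing is merged
# and len() still counts the accent as a separate character.
q_acute = unicodedata.normalize("NFC", "q\u0301")
print(len(q_acute))
```

So normalizing first and then calling len() only helps for the combinations that happen to have a precomposed code point.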
This will try to guess the column width of a character. Non-printable
characters will report a width of -1 (this includes '\n' and '\t', for
example), except for \0, which has width 0. Combining characters will
report 0, normal Latin characters 1 and full-width forms for example
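The return values described above can be approximated with nothing but the standard unicodedata module. This is not the port of Kuhn's tables, only a rough stand-in under my own assumptions: the names char_width and string_width are made up for this sketch, and classifying by general category and East Asian width will disagree with his tables in places (the soft hyphen, for one).

```python
import unicodedata

def char_width(ch):
    """Guess the column width of one character (hypothetical helper)."""
    if ch == "\0":
        return 0                      # NUL is defined to have width 0
    category = unicodedata.category(ch)
    if category == "Cc":
        return -1                     # control characters such as '\n', '\t'
    if category in ("Mn", "Me", "Cf"):
        return 0                      # combining marks and format characters
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                      # wide and full-width forms
    return 1

def string_width(text):
    """Sum the widths, or return -1 if any character is non-printable."""
    total = 0
    for ch in text:
        width = char_width(ch)
        if width < 0:
            return -1
        total += width
    return total

print(string_width("de\u0301ge\u0301ne\u0301re\u0301"))
```

Returning -1 for the whole string as soon as one character is non-printable mirrors what the C function wcswidth() does.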
Of course, real output depends on the capabilities of the display
device. xterm is capable of handling combining characters, whereas OS
X's Terminal.app cannot do it for Greek or Russian characters for
All in all, I think it is a reasonable start. There is one issue though,
involving Plane 1 characters. On 64-bit systems, so it seems, these
are stored as one character, on 32-bit systems as a surrogate pair. I
don't know how this works exactly, but the code should basically ignore
Plane 1 characters on 32-bit systems (i.e. always report display width
1 even though they're combining or full-width).
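A quick way to see what a given interpreter does with a Plane 1 character (a sketch; the G clef is an arbitrary example). My understanding is that the split follows how Python was built, narrow or wide, and that Python 3.3 and later always count such a character as one:

```python
import struct

# U+1D11E MUSICAL SYMBOL G CLEF lies in Plane 1, outside the BMP.
clef = "\U0001D11E"

# Number of characters as seen by this interpreter.
print(len(clef))

# In UTF-16 the same character needs two 16-bit units: a surrogate pair.
encoded = clef.encode("utf-16-le")
utf16_units = len(encoded) // 2
print(utf16_units)

# Recombine the pair into the code point it stands for.
high, low = struct.unpack("<2H", encoded)
combined = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(combined))
```

Where len() reports 2, a width function that looks at one element at a time only ever sees the two surrogate halves, which is why such characters end up with a default width.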