Text::Wrap and unicode

 
 
wing328hk@gmail.com
      01-04-2006
Hi,

I'm using Text::Wrap and Unicode and found that the function wrap
doesn't handle unicode properly.

Unicode character is double-byte and it seems that wrap basically uses
the function length, which basically returns the number of bytes stored
in a variable, to decide where to wrap the input.

For example, say column is 10 and consider the following
aXXXXX where X is a double-byte character

The last unicode X will be corrupted by wrap, which will split the last
unicode character, the 10th and 11th byte of the string, into two.

Does anyone know how to configure wrap such that it works properly with
unicode?

Thanks,
Wing

 
Paul Lalli
      01-04-2006
(E-Mail Removed) wrote:
> I'm using Text::Wrap and Unicode and found that the function wrap
> doesn't handle unicode properly.
>
> Unicode character is double-byte and it seems that wrap basically uses
> the function length, which basically returns the number of bytes stored
> in a variable,


No, it doesn't. length() returns the number of characters in a string.
perldoc -f length
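
A minimal sketch of that point, assuming the input arrives as raw UTF-8 octets and has not been decoded (the sample string and variable names are made up for illustration). On an undecoded string every octet counts as one character, which matches the symptom described above; after decoding with Encode, length() counts characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 octets: "a" followed by five copies of U+00E9 (2 octets each).
my $octets = "a" . ("\xC3\xA9" x 5);

# Undecoded, Perl treats each octet as a separate character.
my $octet_count = length $octets;

# Decoded, the string holds one "a" plus five two-octet characters.
my $text       = decode('UTF-8', $octets);
my $char_count = length $text;

print "undecoded: $octet_count, decoded: $char_count\n";
```

If the data is decoded on input (and encoded again on output), wrap should be working on characters rather than octets.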

> to decide where to wrap the input.
>
> For example, say column is 10 and consider the following
> aXXXXX where X is a double-byte character
>
> The last unicode X will be corrupted by wrap, which will split the last
> unicode character, the 10th and 11th byte of the string, into two.
>
> Does anyone know how to configure wrap such that it works properly with
> unicode?


Well, first, this should only be a problem for "words" that are longer
than the wrapping limit, and only if $Text::Wrap::huge is set to 'wrap'
(the default). You could consider setting it to 'overflow' instead.
(See perldoc Text::Wrap for examples)
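
A rough sketch of that setting, using the documented package variables $Text::Wrap::columns and $Text::Wrap::huge. The sample text and the name $long_word are invented for illustration.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

$Text::Wrap::columns = 10;          # output lines hold at most columns - 1 characters
$Text::Wrap::huge    = 'overflow';  # leave over-long words intact instead of splitting them

my $long_word = 'x' x 25;           # hypothetical word longer than the limit
my $wrapped   = wrap('', '', "short $long_word tail");

print $wrapped, "\n";
```

With 'overflow' the long word ends up on a line of its own that exceeds the column limit, so nothing is cut in the middle.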

However, perhaps a module that is meant to deal with Unicode
specifically would suit you better?
http://search.cpan.org/~nesting/Unic...p-0.03/Wrap.pm

(Disclaimer: I've never used Unicode::Wrap. It's just one of the first
results when I search CPAN for 'unicode wrap')

Paul Lalli

 
Jürgen Exner
      01-04-2006
(E-Mail Removed) wrote:
> I'm using Text::Wrap and Unicode and found that the function wrap
> doesn't handle unicode properly.
>
> Unicode character is double-byte


Not necessarily. UTF-8 uses anything from 1 to 4(?) bytes.

> and it seems that wrap basically uses
> the function length, which basically returns the number of bytes stored
> in a variable,


Wrong. length() returns the number of characters, not bytes.

jue


 
Dr.Ruud
      01-04-2006
(E-Mail Removed) schreef:

> I'm using Text::Wrap and Unicode and found that the function wrap
> doesn't handle unicode properly.


Where is your code?

--
Affijn, Ruud

"Gewoon is een tijger."
 
Alan J. Flavell
      01-04-2006
On Wed, 4 Jan 2006, Jürgen Exner wrote:

> (E-Mail Removed) wrote:
> >
> > Unicode character is double-byte

>
> Not necessarily.


"Unicode character" is an abstract concept, which associates the
character with an integer value between 0 and 0x10FFFF.

It's impossible to talk about that abstract concept in practical terms
without considering a specific "Character Encoding Form", which
specifies how to represent that integer value using different sized
units. There exist definitions for how to use 8-bit units (utf-8),
16-bit units (utf-16), and 32-bit units (utf-32).
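
To make the distinction concrete, here is a small sketch that encodes one character in each form using Encode. U+20AC is picked only as an example, and the big-endian names (UTF-16BE, UTF-32BE) are used so that no byte-order mark is added to the count.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+20AC EURO SIGN, chosen only as an illustration.
my $char = "\x{20AC}";

# Count the octets each encoding needs for this single character.
my %octets;
for my $form (qw(UTF-8 UTF-16BE UTF-32BE)) {
    $octets{$form} = length encode($form, $char);
}

printf "%-8s %d octet(s)\n", $_, $octets{$_} for sort keys %octets;
```

The character count stays at one throughout; only the number of octets changes with the encoding.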

See Chapter 2 of the Unicode specification, in particular sections
2.5 and 2.6 where the terms "Character Encoding Form" and "Character
Encoding Scheme" are elucidated.

e.g. at http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

> UTF-8 uses anything from 1 to 4(?) bytes.


Indeed. The original utf-8 encoding scheme included definitions of
how to represent integers up to 32 bits, using sequences of up to 6
octets (8-bit bytes). But Unicode has now firmly set their upper
limit at 0x10FFFF (for whatever reason they picked that rather odd
endpoint), meaning that utf-8 sequences of more than 4 octets won't be
needed in practice.
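
As a sketch of how the sequence length follows from the code point under the current 0x10FFFF limit: utf8_octets is a hypothetical helper written for this illustration, not part of any module, and it is cross-checked against what Encode produces for a few sample code points.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of octets UTF-8 needs for a code point in the range 0..0x10FFFF.
sub utf8_octets {
    my ($cp) = @_;
    return 1 if $cp <= 0x7F;
    return 2 if $cp <= 0x7FF;
    return 3 if $cp <= 0xFFFF;
    return 4 if $cp <= 0x10FFFF;
    die sprintf "U+%X is beyond the Unicode range\n", $cp;
}

# Compare the helper with the length of the encoded octet string.
for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    my $actual = length encode('UTF-8', chr $cp);
    printf "U+%04X: %d octet(s), Encode agrees: %s\n",
        $cp, utf8_octets($cp), ($actual == utf8_octets($cp) ? 'yes' : 'no');
}
```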

h t h
 