Velocity Reviews - Computer Hardware Reviews

Re: UTF-16-LE and split() under MS-Windows XP

 
 
Martin v. Löwis
      07-09-2003
"Colin S. Miller" <colinsm.spam-me-> writes:

> Where have I gone wrong, and what is the correct method
> to verify the BOM mark?


readline is not supported in the UTF-16 codec. You have to read the
entire file, and perform .split. Looking at the BOM should not be
necessary, as the UTF-16 codec will do so on its own.
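
The advice above can be sketched roughly as follows, in present-day Python 3 syntax rather than the 2003 interpreter the thread is about. The helper name read_utf16_lines and the temporary demo file are made up for illustration; the point is only that decoding with "utf-16" consumes the BOM, so no manual check is needed.

```python
import codecs
import os
import tempfile

def read_utf16_lines(path):
    """Read a whole UTF-16 file and return its lines without line endings."""
    with open(path, "rb") as f:
        # The "utf-16" codec inspects the BOM itself, picks the byte order
        # from it, and does not include the BOM in the decoded text.
        text = f.read().decode("utf-16")
    return text.splitlines()

# Demo on a throwaway file: BOM followed by little-endian data (hypothetical content).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + "first\r\nsecond\r\n".encode("utf-16-le"))

lines = read_utf16_lines(demo_path)
os.remove(demo_path)
```

This keeps the whole file in memory, which is the trade-off discussed further down the thread.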

Regards,
Martin

 
Colin S. Miller
      07-10-2003
Martin v. Löwis wrote:
> "Colin S. Miller" <colinsm.spam-me-> writes:
>
>
>>Where have I gone wrong, and what is the correct method
>>to verify the BOM mark?

>
>
> readline is not supported in the UTF-16 codec. You have to read the
> entire file, and perform .split. Looking at the BOM should not be
> necessary, as the UTF-16 codec will do so on its own.

Is there any reason why readline() isn't supported?
AFAIK,
the preferred Unicode standard line endings are
0x2028 (line separator) and
0x2029 (paragraph separator),
but 0x0A (line feed) and 0x0D (carriage return) are
also supported for legacy compatibility.
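
As a quick illustration of the characters involved: line feed and carriage return are code points U+000A and U+000D (decimal 10 and 13), and in Python 3 str.splitlines() treats them, along with U+2028 and U+2029, as line boundaries. The sample string below is made up.

```python
# One string mixing the separators mentioned above (hypothetical sample).
sample = "one\u2028two\u2029three\nfour\rfive\r\nsix"

# str.splitlines() breaks on all of them; "\r\n" counts as a single break.
pieces = sample.splitlines()

# Code points of the two legacy line-ending characters.
lf, cr = ord("\n"), ord("\r")
```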


I'm using
file.read().splitlines() now, but am slightly worried
about performance/memory when there are a few hundred lines.
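
For readers on Python 3, which postdates this thread: the built-in open() decodes text incrementally, so a UTF-16 file can be iterated line by line without reading it whole. A minimal sketch; iter_lines and the temporary demo file are hypothetical names used only for illustration.

```python
import codecs
import os
import tempfile

def iter_lines(path):
    """Yield the lines of a UTF-16 text file one at a time, without line endings."""
    # Text mode decodes incrementally; "utf-16" consumes the BOM, and the
    # default newline handling turns "\r\n" into "\n" before we see it.
    with open(path, encoding="utf-16") as f:
        for line in f:
            yield line.rstrip("\n")

# Demo on a throwaway file (hypothetical content).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + "alpha\r\nbeta\r\ngamma\r\n".encode("utf-16-le"))

streamed = list(iter_lines(demo_path))
os.remove(demo_path)
```

Note that text-mode iteration only splits on "\n", "\r" and "\r\n", not on U+2028 or U+2029.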

TIA,
Colin S. Miller


>
> Regards,
> Martin
>


 
Martin v. Löwis
      07-10-2003
"Colin S. Miller" <colinsm.spam-me-> writes:

> Is there any reason why readline() isn't supported?


Because it hasn't been implemented. The naive approach of calling the
readline of the underlying stream (as all other codecs do) does not
work for UTF-16.
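
The post does not spell out why the naive approach fails; one likely reason is that a byte-oriented readline looks for the single byte 0x0A, while UTF-16 encodes every character in two (or four) bytes. The following Python 3 sketch, using an in-memory stream as a stand-in for the underlying file, shows the effect.

```python
import io

# "ab\ncd\n" as UTF-16-LE: each character takes two bytes, so "\n" is b"\n\x00".
data = "ab\ncd\n".encode("utf-16-le")

# A byte-oriented readline stops right after the 0x0A byte ...
raw_line = io.BytesIO(data).readline()

# ... which leaves an odd number of bytes: the newline was cut in half,
# so the result cannot be decoded on its own.
try:
    raw_line.decode("utf-16-le")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# The 0x0A byte also occurs inside unrelated characters, e.g. U+010A,
# so a byte-level search can report line breaks that are not there.
false_newline = "\u010a".encode("utf-16-le")
```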

> AFAIK, the preferred Unicode standard line endings are 0x2028 (line
> separator) and 0x2029 (paragraph separator), but 0x0A (line feed) and
> 0x0D (carriage return) are also supported for legacy compatibility.


That is a further complication on top of it. One should support all
line-breaking characters for UTF-16, at least in Universal Newline (U) mode.

> I'm using file.read().splitlines() now, but am slightly worried
> about performance/memory when there are a few hundred lines.


Feel free to implement and contribute a patch. It has been that way
for some years now, and it likely will stay the same for the coming
years unless somebody contributes a patch.
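
For illustration only, here is a rough sketch of the idea such a patch would need: decode the stream in chunks and split on decoded characters rather than raw bytes. It is written against Python 3's codecs.getincrementaldecoder and is not the code of any actual patch; iter_utf16_lines and chunk_size are hypothetical names.

```python
import codecs
import io

def iter_utf16_lines(stream, chunk_size=4096):
    """Yield decoded lines (endings stripped) from a binary UTF-16 stream."""
    decoder = codecs.getincrementaldecoder("utf-16")()
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        final = not chunk  # an empty read means end of stream
        # The incremental decoder buffers half-read code units between calls.
        pending += decoder.decode(chunk, final)
        # splitlines(True) keeps the endings and knows "\n", "\r", "\r\n",
        # U+2028, U+2029 and a few other Unicode line boundaries.
        lines = pending.splitlines(True)
        if not final and lines:
            # Hold back the last piece: it may be an unfinished line, or end
            # in "\r" whose matching "\n" arrives with the next chunk.
            pending = lines.pop()
        else:
            pending = ""
        for line in lines:
            yield line.splitlines()[0]  # drop the line ending
        if final:
            break

# Demo with a deliberately tiny, odd chunk size (hypothetical data).
demo = codecs.BOM_UTF16_LE + "one\r\ntwo\u2028three\n".encode("utf-16-le")
result = list(iter_utf16_lines(io.BytesIO(demo), chunk_size=3))
```

Memory use is bounded by the chunk size plus the longest line, rather than by the size of the file.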

Regards,
Martin

 