Is there a unicode EOF mark like DOS ascii ctrl-z or unix ctrl-d?

 
 
Gerhard Häring
      09-08-2003
Bob Gailer wrote:
> [...] UnicodeError: UTF-16 decoding error: truncated data


If I remove the last character of the example line you posted, I can
successfully convert it to a Unicode string:

>>> s = '\xff\xfe"\x00T\x00a\x00s\x00k\x00 \x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00 \x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'
>>> unicode(s, "utf-16")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf16' codec can't decode byte 0xa in position 52: truncated data
>>> unicode(s[:-1], "utf-16")
u'"Task Scheduler Service"\r'
>>>


I'm using Python 2.3, which apparently gives more useful encoding errors
(including the position of the error).

-- Gerhard

 
Peter Hansen
      09-08-2003
Duncan Booth wrote:
>
> Bob Gailer <(E-Mail Removed)> wrote in
> news:(E-Mail Removed):
>
> > That's a good start. I presume I need to use codecs.open(filename,
> > mode[, encoding[, errors[, buffering]]]) to read the file. What is the
> > actual value of the "encoding[" parameter for "Little-endian UTF-16
> > Unicode character data, with CR line terminators"

>
> Try:
>
> myFile = codecs.open(filename, "r", "utf16")


I don't do unicode, but might you not want "rb" instead of just "r"
in the above? Does that argument apply to the low-level "open" or
to the codec open? In other words, when would CR-LF translation be
happening if you specified just "r"?
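
Just to make the question concrete (untested, and the filename here is only
a placeholder):

import codecs

# Duncan's suggestion, text mode:
f1 = codecs.open("SchedLgU.Txt", "r", "utf16")

# What I was wondering about: binary mode, so no CR-LF translation can
# happen underneath the codec.
f2 = codecs.open("SchedLgU.Txt", "rb", "utf16")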

-Peter
 
Colin S. Miller
      09-08-2003
Bob Gailer wrote:
> At 07:31 AM 9/8/2003, Duncan Booth wrote:
>
>> Bob Gailer <(E-Mail Removed)> wrote in
>> news:(E-Mail Removed):
>>
>> > That's a good start. I presume I need to use codecs.open(filename,
>> > mode[, encoding[, errors[, buffering]]]) to read the file. What is the
>> > actual value of the "encoding[" parameter for "Little-endian UTF-16
>> > Unicode character data, with CR line terminators"

>>
>> Try:
>>
>> myFile = codecs.open(filename, "r", "utf16")
>>
>> If the file starts with a UTF-16 marker (either little or big endian) it
>> will be read correctly. If it doesn't start with either marker, reading
>> from it will throw a UnicodeError.

>
>
> Interesting error:
>
> UnicodeError: UTF-16 decoding error: truncated data

Are you doing readline on the unicode file?
I bashed my head off this problem a few months ago, and ended up doing
codecs.open(...).read().splitlines()

I think what happens is that the codecs readline calls the underlying
readline code, which doesn't respect Unicode and instead splits at the
first \r or \n it finds; in little-endian UTF-16 this results in a string
with an odd number of bytes.
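
Roughly, something like this (from memory, untested here; the filename is
just an example):

import codecs

# Read and decode the whole file in one go, then split on the decoded
# line endings instead of letting readline() guess at the raw bytes.
f = codecs.open("SchedLgU.Txt", "r", "utf-16")
text = f.read()
f.close()

for line in text.splitlines():
    # each line is now a plain unicode string
    print line.encode("iso-8859-1", "replace")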

Colin Miller


 
Piet van Oostrum
      09-10-2003
>>>>> Bob Gailer <(E-Mail Removed)> (BG) wrote:

BG> On Win 2K the Task Scheduler writes a log file that appears to be encoded.
BG> The first line is:

BG> '\xff\xfe"\x00T\x00a\x00s\x00k\x00 \x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00 \x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'

BG> My goal is to read this file and process it using Python string
BG> processing.

BG> I am disappointed in the codecs module documentation. I had hoped to find
BG> the answer there, but can't.

BG> I presume this is an encoding, and that '\xff\xfe' defines the encoding.
BG> How does one map '\xff\xfe' to an "encoding".

It's Unicode, actually little-endian UTF-16, which is the standard encoding
on Win2K. The '\xff\xfe' is the Byte Order Mark (BOM), which identifies the
data as little-endian.

>>> import codecs
>>> codecs.BOM_UTF16_LE

'\xff\xfe'

But there is a trailing 0 byte missing (the line should have an even number
of bytes, as each character occupies two). This happens because you assume a
line ends with '\n', whereas in UTF-16LE it ends with '\n\x00'. It also means
you cannot read such a file with methods like readline().

>>> st = '\xff\xfe"\x00T\x00a\x00s\x00k\x00 \x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00 \x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n\x00'
>>> stu = unicode(st[2:], "utf_16le")   # skip the two-byte BOM
>>> stu
u'"Task Scheduler Service"\r\n'
>>> stu.encode('iso-8859-1')
'"Task Scheduler Service"\r\n'

--
Piet van Oostrum <(E-Mail Removed)>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: (E-Mail Removed)
 