Velocity Reviews - Computer Hardware Reviews


split() can help to read a UTF-16 encoded file without codecs support, why?

 
 
Zhongjian Lu
      03-17-2006
Hi Guys,

I was processing a UTF-16 encoded file with a BOM and was not aware of the
codecs module at first. I wrote the following code:
===== Code 1============================
for i in open("d:\python24\lzjtest.xml", 'r').readlines():
    i = i.decode("utf-16")
    print i
=======================================
Output was:
Traceback (most recent call last):
  File "D:\Python24\testutf-16.py", line 4, in -toplevel-
    i = i.decode("utf-16")
  File "D:\Python24\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 84: truncated data

I searched Google and found an article on a similar problem that said to use
split(). I did not quite understand the article, but I recoded it as:
==== Code 2==============================
for i in open("d:\python24\lzjtest.xml", 'r').read().split('\r\n'):
    i = i.decode("utf-16")
    print i
=======================================
Then it worked (it echoed the file).

Later I learned about codecs and wrote the following code:

==== Code 3 =============================
import codecs
for i in codecs.open("d:\python24\lzjtesttvs2.xml", 'r', 'utf-16').readlines():
    print i
=======================================
It worked and echoed the file.

I am wondering what the problem is with the first code and why the bug
is fixed in the second.

Thanks in advance.

-Zhongjian
 
 
 
 
 
Fuzzyman
      03-17-2006

Zhongjian Lu wrote:
> Hi Guys,
>
> I was processing a UTF-16 encoded file with a BOM and was not aware of the
> codecs module at first. I wrote the following code:
> ===== Code 1============================
> for i in open("d:\python24\lzjtest.xml", 'r').readlines():
>     i = i.decode("utf-16")
>     print i
> =======================================
> Output was:
> Traceback (most recent call last):
>   File "D:\Python24\testutf-16.py", line 4, in -toplevel-
>     i = i.decode("utf-16")
>   File "D:\Python24\lib\encodings\utf_16.py", line 16, in decode
>     return codecs.utf_16_decode(input, errors, True)
> UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 84: truncated data
>


UTF-16 is a 'two-byte encoding' (characters outside the Basic Multilingual
Plane take four bytes). This means that, in little-endian byte order,
'\r\n' is represented using :

'\r\x00\n\x00'

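When you use readlines on a file opened this way it splits on bytes, not
characters: each line ends right after a '\n' byte. That returns
something like :

'...\r\x00\n', '\x00...'

You can see how the first chunk is 'truncated': it has an odd number of
bytes, because the '\x00' that completes the newline character has been
pushed onto the start of the next chunk. That stray '\n' is the byte 0x0a
at position 84 in your traceback.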
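To make that concrete, here is a minimal sketch. It uses Python 3 syntax rather than the Python 2.4 of the thread, so that byte strings and text are separate types. The sample text and the try_decode helper are made up for illustration, and io.BytesIO stands in for a file opened without any encoding.

```python
import codecs
import io

# Hypothetical sample: two short lines with Windows line endings,
# stored as little-endian UTF-16 with a BOM.
text = "<a>\r\n<b>\r\n"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# readlines() on a byte stream cuts after every 0x0a byte.
chunks = io.BytesIO(data).readlines()


def try_decode(chunk):
    """Return the decoded text, or a short message if decoding fails."""
    try:
        return chunk.decode("utf-16")
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


# Show the length and raw bytes of every chunk.
for chunk in chunks:
    print(len(chunk), chunk)

# The first chunk has an odd length, so it cannot be decoded.
print(try_decode(chunks[0]))
```

The first chunk ends in '\r\x00\n' and has an odd number of bytes, so decoding it fails. The later chunks start with the orphaned '\x00' and are misaligned by one byte, so they cannot be relied on to decode to the right text either.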


> I searched Google and found an article on a similar problem that said to use
> split(). I did not quite understand the article, but I recoded it as:
> ==== Code 2==============================
> for i in open("d:\python24\lzjtest.xml", 'r').read().split('\r\n'):
>     i = i.decode("utf-16")
>     print i
> =======================================
> Then it worked (it echoed the file).
>


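You will probably find that the two-byte sequence '\r\n' never occurs in
the byte-string, because a '\x00' always sits between the '\r' and the
'\n'. So split() returns the whole file as a single string, and decoding
that in one go succeeds.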
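A quick way to check this, again as a Python 3 sketch with made-up sample text (not your actual file):

```python
import codecs

# Hypothetical sample: little-endian UTF-16 with a BOM.
text = "<a>\r\n<b>\r\n"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The bytes 0x0d and 0x0a are never adjacent here, so there is
# nothing to split on and pieces holds the whole file as one element.
pieces = data.split(b"\r\n")
print(len(pieces))

# Decoding that single piece works because it is the complete,
# correctly aligned byte string.
decoded = pieces[0].decode("utf-16")

# Splitting into lines is safe once the bytes have become text.
lines = decoded.splitlines()
print(lines)
```

So the split() in Code 2 is not really doing anything useful. Decoding the result of read() directly and splitting the decoded text afterwards gives real lines.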

HTH

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

> Later I learned about codecs and wrote the following code:
>
> ==== Code 3 =============================
> import codecs
> for i in codecs.open("d:\python24\lzjtesttvs2.xml", 'r', 'utf-16').readlines():
>     print i
> =======================================
> It worked and echoed the file.
>
> I am wondering what the problem is with the first code and why the bug
> is fixed in the second.
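
Code 3 works because the file object decodes the bytes first and only then splits the text into lines. A rough Python 3 equivalent is sketched below, assuming the built-in open() with an encoding argument in place of codecs.open(). The function name and the temporary sample file are made up for illustration.

```python
import codecs
import os
import tempfile


def read_utf16_lines(path):
    """Decode a UTF-16 file (BOM expected) and return its lines as text."""
    # newline="" keeps the original line endings instead of translating them.
    with open(path, "r", encoding="utf-16", newline="") as handle:
        return handle.readlines()


# Write a small sample file, read it back, then clean up.
fd, sample_path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as out:
    out.write(codecs.BOM_UTF16_LE + "<a>\r\n<b>\r\n".encode("utf-16-le"))
try:
    sample_lines = read_utf16_lines(sample_path)
    print(sample_lines)
finally:
    os.remove(sample_path)
```

Because the line splitting happens on decoded text, no line can end in the middle of a character.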
>
> Thanks in advance.
>
> -Zhongjian


 