Unicode string handling problem

 
 
Richard Schulman
 
      09-06-2006
The following program fragment works correctly with an ascii input
file.

But the file I actually want to process is Unicode (utf-16 encoding).
The file must be Unicode rather than ASCII or Latin-1 because it
contains mixed Chinese and English characters.

When I run the program below I get an attribute_count of zero, which
is incorrect for the input file, which should give a value of fifteen
or sixteen. In other words, the count function isn't recognizing the
", characters in the line being read. Here's the program:

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
# Skip the first line; make the second available for processing
in_file.readline()
in_line = readline()
attribute_count = in_line.count('",')
print attribute_count
finally:
in_file.close()

Any suggestions?

Richard Schulman
(For email reply, delete the 'xx' characters)
 
 
 
 
 
John Machin
 
      09-06-2006
Richard Schulman wrote:
> The following program fragment works correctly with an ascii input
> file.
>
> But the file I actually want to process is Unicode (utf-16 encoding).
> The file must be Unicode rather than ASCII or Latin-1 because it
> contains mixed Chinese and English characters.
>
> When I run the program below I get an attribute_count of zero, which
> is incorrect for the input file, which should give a value of fifteen
> or sixteen. In other words, the count function isn't recognizing the
> ", characters in the line being read. Here's the program:
>
> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
> try:
>     # Skip the first line; make the second available for processing
>     in_file.readline()
>     in_line = readline()


You mean in_line = in_file.readline(), I hope. Do please copy/paste
actual code, not what you think you ran.

>     attribute_count = in_line.count('",')
>     print attribute_count


Insert
print type(in_line)
print repr(in_line)
here [also make the appropriate changes to get the same info from the
first line], run it again, copy/paste what you get, show us what you
see.

If you're coy about that, then you'll have to find out yourself if it
has a BOM at the front, and if not whether it's little/big/endian.

> finally:
>     in_file.close()
>
> Any suggestions?
>


1. Read the Unicode HOWTO.
2. Read the docs on the codecs module ...

You'll need to use

in_file = codecs.open(filepath, mode, encoding="utf16???????")

It would also be a good idea to get into the habit of using unicode
constants like u'",'
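
Putting that together, something like this sketch (assuming the file
really is UTF-16 with a BOM, so plain "utf_16" will do; adjust the
encoding once you know what you've actually got):

import codecs

in_file = codecs.open("c:\\pythonapps\\in-graf1.my", "r", encoding="utf_16")
try:
    in_file.readline()              # skip the first line
    in_line = in_file.readline()    # second line, now a unicode object
    print in_line.count(u'",')      # count with a unicode constant
finally:
    in_file.close()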

HTH,
John

 
 
 
 
 
John Roth
 
      09-06-2006

Richard Schulman wrote:
> The following program fragment works correctly with an ascii input
> file.
>
> But the file I actually want to process is Unicode (utf-16 encoding).
> The file must be Unicode rather than ASCII or Latin-1 because it
> contains mixed Chinese and English characters.
>
> When I run the program below I get an attribute_count of zero, which
> is incorrect for the input file, which should give a value of fifteen
> or sixteen. In other words, the count function isn't recognizing the
> ", characters in the line being read. Here's the program:
>
> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
> try:
>     # Skip the first line; make the second available for processing
>     in_file.readline()
>     in_line = readline()
>     attribute_count = in_line.count('",')
>     print attribute_count
> finally:
>     in_file.close()
>
> Any suggestions?
>
> Richard Schulman
> (For email reply, delete the 'xx' characters)


You're not detecting the file encoding and then
using it in the open statement. If you know this is
utf-16le or utf-16be, you need to say so in the
open. If you don't, then you should read it into
a string, go through some autodetect logic, and
then decode it with the <string>.decode(encoding)
method.

A clue: a properly formatted utf-16 or utf-32
file MUST have a BOM as the first character.
That's mandated in the unicode standard. If
it doesn't have a BOM, then try ascii and
utf-8 in that order. The first
one that succeeds is correct. If neither succeeds,
you're on your own in guessing the file encoding.
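
A rough sketch of that autodetect logic (untested, and it assumes the
only candidates are the encodings just mentioned; guess_decode is my
own name for it, not a library routine):

import codecs

def guess_decode(raw):
    # BOM present? Let utf_16 work out the endianness and strip it.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode('utf_16')
    # No BOM: try ascii, then utf-8; the first success wins.
    for enc in ('ascii', 'utf_8'):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    raise ValueError("could not guess the file encoding")

raw = open("c:\\pythonapps\\in-graf1.my", "rb").read()
text = guess_decode(raw)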

John Roth

 
 
Richard Schulman
 
      09-06-2006
Thanks for your excellent debugging suggestions, John. See below for
my follow-up:

Richard Schulman:
>> The following program fragment works correctly with an ascii input
>> file.
>>
>> But the file I actually want to process is Unicode (utf-16 encoding).
>> The file must be Unicode rather than ASCII or Latin-1 because it
>> contains mixed Chinese and English characters.
>>
>> When I run the program below I get an attribute_count of zero, which
>> is incorrect for the input file, which should give a value of fifteen
>> or sixteen. In other words, the count function isn't recognizing the
>> ", characters in the line being read. Here's the program:
>>...


John Machin:
>Insert
> print type(in_line)
> print repr(in_line)
>here [also make the appropriate changes to get the same info from the
>first line], run it again, copy/paste what you get, show us what you
>see.


Here's the revised program, per your suggestion:

=====================================================

# This program processes a UTF-16 input file that is
# to be loaded later into a mySQL table. The input file
# is not yet ready for prime time. The purpose of this
# program is to ready it.

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
# The first line read is a SQL INSERT statement; no
# processing will be required.
in_line = in_file.readline()
print type(in_line) #For debugging
print repr(in_line) #For debugging

# The second line read is the first data row.
in_line = in_file.readline()
print type(in_line) #For debugging
print repr(in_line) #For debugging

# For this and subsequent rows, we must count all
# the < ", > character-pairs in a given line/row.
# This will provide an n-1 measure of the attributes
# for a SQL insert of this row. All rows must have
# sixteen attributes, but some don't yet.
attribute_count = in_line.count('",')
print attribute_count
finally:
in_file.close()

=====================================================

The output of this program, which I ran at the command line, had to
be copied by hand and abridged, but I think I have included the
relevant information:

C:\pythonapps>python graf_correction.py
<type 'str'>
'\xff\xfeI\x00N\x00S... [the beginning of a SQL INSERT statement]
....\x00U\x00E\x00S\x00\n' [the VALUES keyword at the end of the row,
followed by an end-of-line]
<type 'str'>
'\x00\n' [oh-oh! For the second row, all we're seeing
is an end-of-line character. Is that from
the first row? Wasn't the "rU" mode
supposed to handle that?]
0 [the counter value. It's hardly surprising
it's only zero, given that most of the row
never got loaded, just an eol mark]

J.M.:
>If you're coy about that, then you'll have to find out yourself if it
>has a BOM at the front, and if not whether it's little/big/endian.


The BOM is little-endian, I believe.

R.S.:
>> Any suggestions?


J.M.
>1. Read the Unicode HOWTO.
>2. Read the docs on the codecs module ...
>
>You'll need to use
>
>in_file = codecs.open(filepath, mode, encoding="utf16???????")


Right you are. Here is the output produced by so doing:

<type 'unicode'>
u'\ufeffINSERT INTO [...] VALUES\n'
<type 'unicode'>
u'\n'
0 [The counter value]

>It would also be a good idea to get into the habit of using unicode
>constants like u'",'


Right.

>HTH,
>John


Yes, it did. Many thanks! Now I've got to figure out the best way to
handle that \n\n at the end of each row, which the program is
interpreting as two rows. That represents two surprises: first, I
thought that Microsoft files ended as \n\r ; second, I thought that
Python mode "rU" was supposed to be the universal eol handler and
would handle the \n\r as one mark.

Richard Schulman
 
 
Richard Schulman
 
      09-06-2006
On 5 Sep 2006 19:50:27 -0700, "John Roth" <(E-Mail Removed)>
wrote:

>> [T]he file I actually want to process is Unicode (utf-16 encoding).
>>...
>> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
>>...


John Roth:
>You're not detecting the file encoding and then
>using it in the open statement. If you know this is
>utf-16le or utf-16be, you need to say so in the
>open. If you don't, then you should read it into
>a string, go through some autodetect logic, and
>then decode it with the <string>.decode(encoding)
>method.
>
>A clue: a properly formatted utf-16 or utf-32
>file MUST have a BOM as the first character.
>That's mandated in the unicode standard. If
>it doesn't have a BOM, then try ascii and
>utf-8 in that order. The first
>one that succeeds is correct. If neither succeeds,
>you're on your own in guessing the file encoding.


Thanks for this further information. I'm now using the codec with
improved results, but am still puzzled as to how to handle the row
termination of \n\n, which is being interpreted as two rows instead of
one.
 
 
Richard Schulman
 
      09-06-2006
On Wed, 06 Sep 2006 03:55:18 GMT, Richard Schulman
<(E-Mail Removed)> wrote:

>...I'm now using the codec with
>improved results, but am still puzzled as to how to handle the row
>termination of \n\n, which is being interpreted as two rows instead of
>one.


Of course, I could do a double read on each row and ignore the second
read, which merely fetches the final of the two u'\n' characters. But
that's not very elegant, and I'm sure there's a better way to do it
(hint, hint someone).
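
Something along these lines, I mean (a sketch of the inelegant
double-read; it assumes every real row is followed by a phantom
u'\n'):

import codecs

in_file = codecs.open("c:\\pythonapps\\in-graf1.my", "rU", encoding="utf-16LE")
try:
    in_file.readline()        # the SQL INSERT header ...
    in_file.readline()        # ... and its phantom twin
    while True:
        in_line = in_file.readline()
        if not in_line:
            break
        in_file.readline()    # throw away the phantom second read
        print in_line.count(u'",')
finally:
    in_file.close()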

Richard Schulman (for email, drop the 'xx' in the reply-to)
 
 
John Machin
 
      09-06-2006
Richard Schulman wrote:
[big snip]
>
> The BOM is little-endian, I believe.

Correct.

> >in_file = codecs.open(filepath, mode, encoding="utf16???????")

>
> Right you are. Here is the output produced by so doing:


You don't say which encoding you used, but I guess that you used
utf_16_le.

>
> <type 'unicode'>
> u'\ufeffINSERT INTO [...] VALUES\n'


Use utf_16 -- it will strip off the BOM for you.

> <type 'unicode'>
> u'\n'
> 0 [The counter value]
>

[snip]
> Yes, it did. Many thanks! Now I've got to figure out the best way to
> handle that \n\n at the end of each row, which the program is
> interpreting as two rows.


Well we don't know yet exactly what you have there. We need a byte dump
of the first few bytes of your file. Get into the interactive
interpreter and do this:

open('yourfile', 'rb').read(200)
(the 'b' is for binary, in case you are on Windows)
That will show us exactly what's there, without *any* EOL
interpretation at all.


> That represents two surprises: first, I
> thought that Microsoft files ended as \n\r ;


Nah. Wrong on two counts. In text mode, Microsoft *lines* end in \r\n
(not \n\r); *files* may end in ctrl-Z aka chr(26) -- an inheritance
from CP/M.

Ummmm ... are you saying the file has \n\r at the end of each row?? How
did you know that if you didn't know what if any BOM it had??? Who
created the file????

> second, I thought that
> Python mode "rU" was supposed to be the universal eol handler and
> would handle the \n\r as one mark.


Nah again. It contemplates only \n, \r, and \r\n as end of line. See
the docs. Thus \n\r becomes *two* newlines when read with "rU".

Having "\n\r" at the end of each row does fit with your symptoms:

| >>> bom = u"\ufeff"
| >>> guff = '\n\r'.join(['abc', 'def', 'ghi'])
| >>> guffu = unicode(guff)
| >>> import codecs
| >>> f = codecs.open('guff.utf16le', 'wb', encoding='utf_16_le')
| >>> f.write(bom+guffu)
| >>> f.close()

| >>> open('guff.utf16le', 'rb').read() #### see exactly what we've got

| '\xff\xfea\x00b\x00c\x00\n\x00\r\x00d\x00e\x00f\x00\n\x00\r\x00g\x00h\x00i\x00'

| >>> codecs.open('guff.utf16le', 'r', encoding='utf_16').read()
| u'abc\n\rdef\n\rghi' ######### Look, Mom, no BOM!

| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16').read()
| u'abc\n\ndef\n\nghi' #### U means \r -> \n

| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16_le').read()
| u'\ufeffabc\n\ndef\n\nghi' ######### reproduces your second experience

| >>> open('guff.utf16le', 'rU').readlines()
| ['\xff\xfea\x00b\x00c\x00\n', '\x00\n', '\x00d\x00e\x00f\x00\n',
| '\x00\n', '\x00g\x00h\x00i\x00']
| >>> f = open('guff.utf16le', 'rU')
| >>> f.readline()
| '\xff\xfea\x00b\x00c\x00\n'
| >>> f.readline()
| '\x00\n' ######### reproduces your first experience
| >>> f.readline()
| '\x00d\x00e\x00f\x00\n'
| >>>

If that file is a one-off, you can obviously fix it by
throwing away every second line. Otherwise, if it's an ongoing
exercise, you need to talk sternly to the file's creator.

HTH,
John

 
 
Richard Schulman
 
      09-07-2006
Many thanks for your help, John, in giving me the tools to work
successfully in Python with Unicode from here on out.

It turns out that the Unicode input files I was working with (from MS
Word and MS Notepad) were indeed creating eol sequences of \r\n, not
\n\n as I had originally thought. The file reading statement that I
was using, with unpredictable results, was

#in_file = codecs.open("c:\\pythonapps\\in-graf2.my","rU",encoding="utf-16LE")

This was reading to the \n on first read (outputting the whole line,
including the \n but, weirdly, not the preceding \r). Then, also
weirdly, the next readline would read the same \n again, interpreting
that as the entirety of a phantom second line. So each input file line
ended up producing two output lines.

Once the mode string "rU" was dropped, as in

in_file = codecs.open("c:\\pythonapps\\in-graf2.my",encoding="utf-16LE")

all suddenly became well: no more doubled readlines, and one could see
the \r\n termination of each line.
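
So now I simply strip the terminator myself. A sketch of what I'm
doing (it assumes every line really does end in u'\r\n'):

import codecs

in_file = codecs.open("c:\\pythonapps\\in-graf2.my", encoding="utf-16LE")
try:
    for in_line in in_file:
        in_line = in_line.rstrip(u'\r\n')    # drop the CR LF pair by hand
        attribute_count = in_line.count(u'",')
        # ... further processing of the row goes here ...
finally:
    in_file.close()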

This behavior of "rU" was not at all what I had expected from the
brief discussion of it in _Python Cookbook_. Which all goes to point
out how difficult it is to cook challenging dishes with sketchy
recipes alone. There is no substitute for the helpful advice of an
experienced chef.

-Richard Schulman
(remove "xx" for email reply)

 
John Machin
 
      09-07-2006

Richard Schulman wrote:
> It turns out that the Unicode input files I was working with (from MS
> Word and MS Notepad) were indeed creating eol sequences of \r\n, not
> \n\n as I had originally thought. The file reading statement that I
> was using, with unpredictable results, was
>
> #in_file = codecs.open("c:\\pythonapps\\in-graf2.my","rU",encoding="utf-16LE")
>
> This was reading to the \n on first read (outputting the whole line,
> including the \n but, weirdly, not the preceding \r). Then, also
> weirdly, the next readline would read the same \n again, interpreting
> that as the entirety of a phantom second line. So each input file line
> ended up producing two output lines.
>
> Once the mode string "rU" was dropped, as in
>
> in_file = codecs.open("c:\\pythonapps\\in-graf2.my",encoding="utf-16LE")
>
> all suddenly became well: no more doubled readlines, and one could see
> the \r\n termination of each line.


You are on Windows. I would *not* describe as "well" lines read in (the
default) text mode ending in u"\r\n". I would expect it to convert the
line endings to u"\n". At best, this should be documented. Perhaps
someone with some knowledge of the intended treatment of line endings
by codecs.open() in text mode could comment? The two problems are
succinctly described below:

File created in Windows Notepad and saved with "Unicode" encoding.
Results in UTF-16LE encoding, line terminator is CR LF, has BOM (LE) at
front -- as shown below.

| Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
| Type "help", "copyright", "credits" or "license" for more information.
| >>> open('notepad_uc.txt', 'rb').read()
| '\xff\xfea\x00b\x00c\x00\r\x00\n\x00d\x00e\x00f\x00\r\x00\n\x00g\x00h\x00i\x00\r\x00\n\x00'
| >>> import codecs
| >>> codecs.open('notepad_uc.txt', 'r', encoding='utf_16_le').readlines()
| [u'\ufeffabc\r\n', u'def\r\n', u'ghi\r\n']
| >>> codecs.open('notepad_uc.txt', 'r', encoding='utf_16').readlines()
| [u'abc\r\n', u'def\r\n', u'ghi\r\n']
### presence of u'\r' was *not* expected
| >>> codecs.open('notepad_uc.txt', 'rU', encoding='utf_16_le').readlines()
| [u'\ufeffabc\n', u'\n', u'def\n', u'\n', u'ghi\n', u'\n']
| >>> codecs.open('notepad_uc.txt', 'rU', encoding='utf_16').readlines()
| [u'abc\n', u'\n', u'def\n', u'\n', u'ghi\n', u'\n']
### 'U' flag does change the behaviour, but *not* as expected.
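
One workaround in the meantime -- a sketch only, not a claim about
what codecs.open() *should* do: read the whole file and let
unicode.splitlines() handle the line breaks itself:

import codecs

text = codecs.open('notepad_uc.txt', 'r', encoding='utf_16').read()
lines = text.splitlines()    # splitlines() treats \r\n, \r and \n alike
# lines == [u'abc', u'def', u'ghi'] -- no stray u'\r', no phantom blanks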

Cheers,
John

 