Ascii to Unicode.

 
 
Joe Goldthwaite
07-28-2010
Thanks to all of you who responded. I guess I was working from the wrong
premise. I was thinking that a file could write any kind of data and that
once I had my Unicode string, I could just write it out with a standard
file.write() operation.

What was actually happening is that the file.write() operation was generating the
error until I re-encoded the string as UTF-8. This is what worked:

import unicodedata

input = file('ascii.csv', 'rb')
output = file('unicode.csv', 'wb')

for line in input.xreadlines():
    unicodestring = unicode(line, 'latin1')
    output.write(unicodestring.encode('utf-8'))  # This second encode is what I was missing.

input.close()
output.close()

A number of you pointed out what I was doing wrong but I couldn't understand
it until I realized that the write operation didn't work until it was using
a properly encoded Unicode string. I thought I was getting the error on the
initial Latin-1 to Unicode conversion, not in the write operation.

This still seems odd to me. I would have thought that the unicode function
would return a properly encoded byte stream that could then simply be
written to disk. Instead it seems like you have to re-encode the byte stream
to some kind of escaped Ascii before it can be written back out.

Thanks to all of you who took the time to respond. I really do appreciate
it. I think with my mental block, I couldn't have figured it out without
your help.


 
Steven D'Aprano
07-29-2010
On Wed, 28 Jul 2010 15:58:01 -0700, Joe Goldthwaite wrote:

> This still seems odd to me. I would have thought that the unicode
> function would return a properly encoded byte stream that could then
> simply be written to disk. Instead it seems like you have to re-encode
> the byte stream to some kind of escaped Ascii before it can be written
> back out.


I'm afraid that's not even wrong. The unicode function returns a unicode
string object, not a byte-stream, just as the list function returns a
sequence of objects, not a byte-stream.

Perhaps this will help:

http://www.joelonsoftware.com/articles/Unicode.html


Summary:

ASCII is not a synonym for bytes, no matter what some English-speakers
think. ASCII is an encoding from bytes like \x41 to characters like "A".

Unicode strings are a sequence of code points. A code point is a number,
implemented in some complex fashion that you don't need to care about.
Each code point maps conceptually to a letter; for example, the English
letter A is represented by the code point U+0041 and the Arabic letter
Ain is represented by the code point U+0639.

You shouldn't make any assumptions about the size of each code-point, or
how they are put together. You shouldn't expect to write code points to a
disk and have the result make sense, any more than you could expect to
write a sequence of tuples or sets or dicts to disk in any sensible
fashion. You have to serialise it to bytes first, and that's what the
encode method does. Decode does the opposite, taking bytes and creating
unicode strings from them.

For historical reasons -- backwards compatibility with files already
created, back in the Bad Old Days before unicode -- there are a whole
slew of different encodings available. There is no 1:1 mapping between
bytes and strings. If all you have are the bytes, there is literally no
way of knowing what string they represent (although sometimes you can
guess). You need to know what the encoding used was, or take a guess, or
make repeated decodings until something doesn't fail and hope that's the
right one.

As a general rule, Python will try encoding/decoding using the ASCII
encoding unless you tell it differently.

Any time you are writing to disk, you need to serialise the objects,
regardless of whether they are floats, or dicts, or unicode strings.
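
To make that concrete, here's a minimal round trip (a Python 2 sketch; the
\xe1 byte is just sample data):

s = '\xe1'                      # one Latin-1 byte: a-acute
u = s.decode('latin1')          # bytes -> unicode: u'\xe1', i.e. code point U+00E1
b = u.encode('utf-8')           # unicode -> bytes: '\xc3\xa1', two bytes
assert b.decode('utf-8') == u   # decode with the right codec and you get it back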


--
Steven
 
Ulrich Eckhardt
07-29-2010
Joe Goldthwaite wrote:
> import unicodedata
>
> input = file('ascii.csv', 'rb')
> output = file('unicode.csv', 'wb')
>
> for line in input.xreadlines():
>     unicodestring = unicode(line, 'latin1')
>     output.write(unicodestring.encode('utf-8'))  # This second encode is what I was missing.


Actually, I see two problems here:
1. "ascii.csv" is not an ASCII file but a Latin-1 encoded file; that's the
first point of confusion.
2. "unicode.csv" is not a "Unicode" file, because Unicode is not a file
format. Rather, it is a UTF-8 encoded file, UTF-8 being one of several
encodings of Unicode. That's the second point of confusion.

> A number of you pointed out what I was doing wrong but I couldn't
> understand it until I realized that the write operation didn't work until
> it was using a properly encoded Unicode string.


The write function wants bytes! Encoding a string in your favourite encoding
yields bytes.

> This still seems odd to me. I would have thought that the unicode
> function would return a properly encoded byte stream that could then
> simply be written to disk.


No, unicode() takes a byte stream and decodes it according to the given
encoding. You then get an internal representation of the string, a unicode
object. This representation typically resembles UCS-2 or UCS-4, which are
more suitable for internal manipulation than UTF-8. This object is a string,
by the way, so the typical operations like concatenation are supported.
However, the internal representation is a sequence of Unicode code points,
not a guaranteed sequence of bytes, and bytes are what a file wants.

> Instead it seems like you have to re-encode the byte stream to some
> kind of escaped Ascii before it can be written back out.


As mentioned above, you have a string. For writing, that string needs to be
transformed to bytes again.


Note: You can also configure a file object to decode on read or encode on
write. You then get unicode objects from the input which you can feed
straight to the output. The important difference is that you specify each
encoding in exactly one place, and it will probably be more performant as
well. I don't remember the exact library calls offhand; http://docs.python.org
is the place to start.
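
From memory, though, a sketch of what I mean with the codecs module
(untested, so check it against the docs rather than treating it as a recipe):

import codecs

infile = codecs.open('ascii.csv', 'rb', encoding='latin1')    # reads yield unicode objects
outfile = codecs.open('unicode.csv', 'wb', encoding='utf-8')  # writes accept unicode objects

for line in infile:        # decoding happens inside the wrapped file
    outfile.write(line)    # encoding happens here too; no explicit encode()/decode()

infile.close()
outfile.close()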

Good luck!

Uli

--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

 
Joe Goldthwaite
07-29-2010
Hi Steven,

I read through the article you referenced. I understand Unicode better now.
I wasn't completely ignorant of the subject. My confusion is more about how
Python is handling Unicode than Unicode itself. I guess I'm fighting my own
misconceptions. I do that a lot. It's hard for me to understand how things
work when they don't function the way I *think* they should.

Here's the main source of my confusion. In my original sample, I had read a
line in from the file and used the unicode function to create a
unicodestring object:

unicodestring = unicode(line, 'latin1')

What I thought this step would do is translate the line to an internal
Unicode representation. The problem character \xe1 would have been
translated into a correct Unicode representation for the accented "a"
character.

Next I tried to write the unicodestring object to a file thusly:

output.write(unicodestring)

I would have expected the write function to request the byte string from the
unicodestring object and simply write that byte string to a file. I thought
that at this point, I should have had a valid Unicode latin1 encoded file.
Instead I get an error that the character \xe1 is invalid.

The fact that the \xe1 character is still in the unicodestring object tells
me it wasn't translated into whatever python uses for its internal Unicode
representation. Either that or the unicodestring object returns the
original string when it's asked for a byte stream representation.

Instead of just writing the unicodestring object, I had to do this:

output.write(unicodestring.encode('utf-8'))

This is doing what I thought the other steps were doing. It's translating
the internal unicodestring byte representation to utf-8 and writing it out.
It still seems strange and I'm still not completely clear as to what is
going on at the byte stream level for each of these steps.



 
Joe Goldthwaite
07-29-2010
Hi Ulrich,

Ascii.csv isn't really a latin-1 encoded file. It's an ascii file with a
few characters above the 128 range that are causing Postgresql Unicode
errors. Those characters work fine in the Windows world but they're not the
correct byte representation for Unicode. What I'm attempting to do is
translate those upper range characters into the correct Unicode
representations so that they look the same in the Postgresql database as
they did in the CSV file.

I wrote up the source of my confusion to Steven so I won't duplicate it
here. Your comment on defining the encoding of the file directly, instead
of using functions to encode and decode the data, led me to the codecs
module. Using it, I can define the encoding at file open time and then just
read and write the lines. I ended up with this:

import codecs

input = codecs.open('ascii.csv', encoding='cp1252')
output = codecs.open('unicode.csv', mode='wb', encoding='utf-8')

output.writelines(input.readlines())

input.close()
output.close()

This is doing exactly the same thing but it's much clearer to me. Readlines
translates the input using the cp1252 codec and writelines encodes it to
utf-8 and writes it out. And as you mentioned, it probably performs better.
I haven't tested that, but since both programs do the job in seconds,
performance isn't an issue.

Thanks again to everyone who posted. I really do appreciate it.


 
Ethan Furman
07-29-2010
Joe Goldthwaite wrote:
> Hi Steven,
>
> I read through the article you referenced. I understand Unicode better now.
> I wasn't completely ignorant of the subject. My confusion is more about how
> Python is handling Unicode than Unicode itself. I guess I'm fighting my own
> misconceptions. I do that a lot. It's hard for me to understand how things
> work when they don't function the way I *think* they should.
>
> Here's the main source of my confusion. In my original sample, I had read a
> line in from the file and used the unicode function to create a
> unicodestring object:
>
> unicodestring = unicode(line, 'latin1')
>
> What I thought this step would do is translate the line to an internal
> Unicode representation. The problem character \xe1 would have been
> translated into a correct Unicode representation for the accented "a"
> character.


Correct. At this point you have a unicode string.

> Next I tried to write the unicodestring object to a file thusly:
>
> output.write(unicodestring)
>
> I would have expected the write function to request the byte string from the
> unicodestring object and simply write that byte string to a file. I thought
> that at this point, I should have had a valid Unicode latin1 encoded file.
> Instead I get an error that the character \xe1 is invalid.


Here's the problem -- there is no single byte string representing the
unicode string; they are completely different things. There are dozens of
different possible encodings to go from unicode to a byte string, of which
UTF-8 is one.

> The fact that the \xe1 character is still in the unicodestring object tells
> me it wasn't translated into whatever python uses for its internal Unicode
> representation. Either that or the unicodestring object returns the
> original string when it's asked for a byte stream representation.


Wrong. It so happens that some of the Unicode code points have the same
numbers as some (but not all) of the ascii and upper-ascii values. When you
attempt to write a unicode string without specifying which encoding you
want, python falls back to ascii (not upper-ascii), so any character
outside the 0-127 range is going to raise an error.
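
A quick demonstration (a Python 2 sketch; the \xe1 byte stands in for your
problem character):

u = unicode('\xe1', 'latin1')     # byte 0xE1 -> code point U+00E1
print repr(u)                     # u'\xe1' -- same number, but a code point now
print repr(u.encode('utf-8'))     # '\xc3\xa1' -- the bytes UTF-8 actually writes
try:
    u.encode('ascii')             # effectively what an un-encoded write attempts
except UnicodeEncodeError, e:
    print e                       # 'ascii' codec can't encode character u'\xe1' ...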

> Instead of just writing the unicodestring object, I had to do this:
>
> output.write(unicodestring.encode('utf-8'))
>
> This is doing what I thought the other steps were doing. It's translating
> the internal unicodestring byte representation to utf-8 and writing it out.
> It still seems strange and I'm still not completely clear as to what is
> going on at the byte stream level for each of these steps.



Don't think of unicode as a byte stream. It's a bunch of numbers that
map to a bunch of symbols. The byte stream only comes into play when
you want to send unicode somewhere (file, socket, etc) and you then have
to encode the unicode into bytes.

Hope this helps!

~Ethan~
 
Carey Tilden
07-29-2010
On Thu, Jul 29, 2010 at 10:59 AM, Joe Goldthwaite <(E-Mail Removed)> wrote:
> Hi Ulrich,
>
> Ascii.csv isn't really a latin-1 encoded file. It's an ascii file with a
> few characters above the 128 range that are causing Postgresql Unicode
> errors. Those characters work fine in the Windows world but they're not the
> correct byte representation for Unicode. What I'm attempting to do is
> translate those upper range characters into the correct Unicode
> representations so that they look the same in the Postgresql database as
> they did in the CSV file.


Having bytes outside of the ASCII range means, by definition, that the
file is not ASCII encoded. ASCII only defines bytes 0-127. Bytes
outside of that range mean either the file is corrupt, or it's in a
different encoding. In this case, you've been able to determine the
correct encoding (latin-1) for those errant bytes, so the file itself
is known to be in that encoding.

Carey
 
Ethan Furman
07-29-2010
Joe Goldthwaite wrote:
> Hi Ulrich,
>
> Ascii.csv isn't really a latin-1 encoded file. It's an ascii file with a
> few characters above the 128 range . . .


It took me a while to get this point too (if you already have "gotten
it", I apologize, but the above comment leads me to believe you haven't).

*Every* file is an encoded file... even your UTF-8 file is encoded using
the UTF-8 format. Someone correct me if I'm wrong, but I believe
lower-ascii (0-127) matches up to the first 128 Unicode code points, so
while those first 128 code-points translate easily to ascii, ascii is
still an encoding, and if you have characters higher than 127, you don't
really have an ascii file -- you have (for example) a cp1252 file (which
also, not coincidentally, shares the first 128 characters/code points
with ascii).
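
For instance, the same character serialises to different bytes under
different encodings (a Python 2 sketch):

c = u'\xe1'                        # U+00E1, LATIN SMALL LETTER A WITH ACUTE
print repr(c.encode('latin1'))     # '\xe1' -- one byte
print repr(c.encode('cp1252'))     # '\xe1' -- cp1252 agrees with latin-1 here
print repr(c.encode('utf-8'))      # '\xc3\xa1' -- two bytes
print repr(u'A'.encode('ascii'))   # 'A' -- the first 128 code points match ascii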

Hopefully I'm not adding to the confusion.

~Ethan~
 
John Nagle
07-29-2010
On 7/28/2010 3:58 PM, Joe Goldthwaite wrote:
> This still seems odd to me. I would have thought that the unicode function
> would return a properly encoded byte stream that could then simply be
> written to disk. Instead it seems like you have to re-encode the byte stream
> to some kind of escaped Ascii before it can be written back out.


Here's what's really going on.

Unicode strings within Python have to be indexable. So the internal
representation of Unicode has (usually) two bytes for each character,
so they work like arrays.

UTF-8 is a stream format for Unicode. It's slightly compressed;
each character occupies 1 to 4 bytes, and the base ASCII characters
(0..127 only, not 128..255) occupy one byte each. The format is
described in "http://en.wikipedia.org/wiki/UTF-8". A UTF-8 file or
stream has to be parsed from the beginning to keep track of where each
Unicode character begins. So it's not a suitable format for
data being actively worked on in memory; it can't be easily indexed.

That's why it's necessary to convert to UTF-8 before writing
to a file or socket.
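
For example, a quick Python 2 sketch of the byte counts:

# One to four bytes per character, depending on the code point:
for ch in (u'A', u'\xe1', u'\u20ac', u'\U00010400'):
    print repr(ch), '->', len(ch.encode('utf-8')), 'byte(s)'
# prints 1, 2, 3 and 4 respectively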

John Nagle
 
MRAB
07-29-2010
John Nagle wrote:
> On 7/28/2010 3:58 PM, Joe Goldthwaite wrote:
>> This still seems odd to me. I would have thought that the unicode
>> function would return a properly encoded byte stream that could then
>> simply be written to disk. Instead it seems like you have to re-encode
>> the byte stream to some kind of escaped Ascii before it can be written
>> back out.

>
> Here's what's really going on.
>
> Unicode strings within Python have to be indexable. So the internal
> representation of Unicode has (usually) two bytes for each character,
> so they work like arrays.
>
> UTF-8 is a stream format for Unicode. It's slightly compressed;
> each character occupies 1 to 4 bytes, and the base ASCII characters
> (0..127 only, not 128..255) occupy one byte each. The format is
> described in "http://en.wikipedia.org/wiki/UTF-8". A UTF-8 file or
> stream has to be parsed from the beginning to keep track of where each
> Unicode character begins. So it's not a suitable format for
> data being actively worked on in memory; it can't be easily indexed.
>

Not entirely correct. The advantage of UTF-8 is that although different
code points may be encoded into different numbers of bytes, it's easy to
tell whether a particular byte is the first in its sequence, so you
don't have to parse from the start of the file. It is true, however,
that it can't be easily indexed.
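
A sketch of how that resynchronisation works (Python 2): any byte of the
form 10xxxxxx can only be a continuation byte, so a lead byte is always
recognisable.

data = u'a\xe1\u20ac'.encode('utf-8')   # 1-, 2- and 3-byte sequences
for byte in data:
    if ord(byte) & 0xC0 == 0x80:
        print hex(ord(byte)), 'continuation byte'
    else:
        print hex(ord(byte)), 'starts a character'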

> That's why it's necessary to convert to UTF-8 before writing
> to a file or socket.
>

 