Velocity Reviews


Mike Currie 06-27-2006 06:48 PM

Python UTF-8 and codecs
 
I'm trying to write out files that contain the UTF-8 characters 0x85 and 0x88.
Every configuration I try gives me a UnicodeError: 'ascii' codec can't
decode byte 0x85 in position 255: ordinal not in range(128)

I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict'),
which doesn't work, and I've also tried wrapping the file in a utf8_writer
obtained from codecs.lookup('utf8').

Any clues?

Thanks
Mike



Dennis Benzinger 06-27-2006 07:10 PM

Re: Python UTF-8 and codecs
 
Mike Currie wrote:
> I'm trying to write out files that contain the UTF-8 characters 0x85 and 0x88.
> Every configuration I try gives me a UnicodeError: 'ascii' codec can't
> decode byte 0x85 in position 255: ordinal not in range(128)
>
> I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict'),
> which doesn't work
> [...]


You want to write to a file, but you used the 'rU' mode. This should be
'wU'. I don't know whether that is the only reason it doesn't work. Could
you show more of your code?


Bye,
Dennis

Serge Orlov 06-27-2006 08:29 PM

Re: Python UTF-8 and codecs
 
On 6/27/06, Mike Currie <dev@null.com> wrote:
> I'm trying to write out files that contain the UTF-8 characters 0x85 and 0x88.
> Every configuration I try gives me a UnicodeError: 'ascii' codec can't
> decode byte 0x85 in position 255: ordinal not in range(128)
>
> I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict'),
> which doesn't work, and I've also tried wrapping the file in a utf8_writer
> obtained from codecs.lookup('utf8').
>
> Any clues?


Use unicode strings for non-ascii characters. The following program "works":

import codecs

c1 = unichr(0x85)
f = codecs.open('foo.txt', 'wU', 'utf-8')
f.write(c1)
f.close()

But unichr(0x85) is a control character. Are you sure you want it?
What is the encoding of your data?
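
For readers on Python 3, where every str is already a Unicode string, a rough equivalent of the program above might look like the sketch below. It is not the poster's code: the built-in open() with an encoding argument stands in for codecs.open, the temporary directory is a hypothetical location, and plain 'w' is used because 'U' (universal newlines) is a read-side flag. The read-back at the end is only there to show what lands on disk.

```python
import os
import tempfile

# U+0085 (NEXT LINE), the character unichr(0x85) produced in Python 2.
c1 = chr(0x85)

# Hypothetical location: a throwaway directory instead of the current one.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')

# newline='' switches off newline translation on write.
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write(c1)

# In UTF-8 the single code point U+0085 is stored as two bytes.
with open(path, 'rb') as raw:
    data = raw.read()

print(repr(data))  # b'\xc2\x85'
```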

Mike Currie 06-27-2006 08:38 PM

Re: Python UTF-8 and codecs
 
I did make a mistake, it should have been 'wU'.

The starting data is ASCII.

What I'm doing is data processing on files with new line and tab characters
inside quoted fields. The idea is to convert all the new line and tab
characters to 0x85 and 0x88 respectively, then process the files. Finally,
right before importing them into a database, convert them back to new lines
and tabs, thus preserving the field values.

Will python not handle the control characters correctly?
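
The placeholder scheme described above can be sketched in Python 3 with str.translate. The names hide_breaks and restore_breaks are hypothetical, and the sample field is made up for illustration; only the two code points 0x85 and 0x88 come from the post.

```python
# Placeholder code points taken from the post: 0x85 for newline, 0x88 for tab.
NEWLINE_MARK = chr(0x85)
TAB_MARK = chr(0x88)

# str.maketrans builds lookup tables usable with str.translate.
ENCODE = str.maketrans({'\n': NEWLINE_MARK, '\t': TAB_MARK})
DECODE = str.maketrans({NEWLINE_MARK: '\n', TAB_MARK: '\t'})


def hide_breaks(text):
    """Swap tabs and newlines for the placeholder characters."""
    return text.translate(ENCODE)


def restore_breaks(text):
    """Undo hide_breaks, e.g. just before the database import."""
    return text.translate(DECODE)


# Made-up sample field containing tabs and line breaks.
field = 'this\thas\t\ntabs\tand\tline\nbreaks'
hidden = hide_breaks(field)
restored = restore_breaks(hidden)
```

One thing to watch for in the processing step: str.splitlines() treats U+0085 as a line boundary, so splitting explicitly on '\n' is the safer choice while the placeholders are in the text.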


Mike Currie 06-27-2006 09:22 PM

Re: Python UTF-8 and codecs
 
Okay,

Here is a sample of what I'm doing:


Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(0,255):
...     filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)
>>> line = '''this has
... tabs and line
... breaks'''
>>> filteredLine = ''.join([ filterMap[a] for a in line])
>>> import codecs
>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>> print filteredLine
thisÍhasÍŗtabsÍandÍlineŗbreaks
>>> f.write(filteredLine)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\codecs.py", line 501, in write
    return self.writer.write(data)
  File "C:\Python24\lib\codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
>>>

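The traceback reports a decode error during a write because, in Python 2, a byte string handed to a codecs writer is first decoded implicitly with the ASCII codec before it is encoded to UTF-8. Python 3 has no such implicit step, so the sketch below makes it explicit. The short byte string is a stand-in for the start of filteredLine, and the latin-1 decode is just one possible way to get Unicode text, not something proposed in the thread.

```python
# Stand-in for the start of filteredLine: 'this', then byte 0x88, then 'has'.
filtered = b'this\x88has'

try:
    # Python 2 performed this step implicitly before encoding to UTF-8.
    filtered.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with latin-1 maps each byte to the code point of the same value,
# and the resulting text encodes to UTF-8 without complaint.
text = filtered.decode('latin-1')
encoded = text.encode('utf-8')
```

The printed message matches the last line of the traceback, including the position 4, which is the index of the first placeholder byte.
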

Serge Orlov 06-27-2006 10:11 PM

Re: Python UTF-8 and codecs
 
On 6/27/06, Mike Currie <dev@null.com> wrote:
> Okay,
>
> Here is a sample of what I'm doing:
>
>
> Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> filterMap = {}
> >>> for i in range(0,255):
> ...     filterMap[chr(i)] = chr(i)
> ...
> >>> filterMap[chr(9)] = chr(136)
> >>> filterMap[chr(10)] = chr(133)
> >>> filterMap[chr(136)] = chr(9)
> >>> filterMap[chr(133)] = chr(10)

This part is incorrect; it should be:

filterMap = {}
for i in range(0,128):
    filterMap[chr(i)] = chr(i)

filterMap[chr(9)] = unichr(136)
filterMap[chr(10)] = unichr(133)
filterMap[unichr(136)] = chr(9)
filterMap[unichr(133)] = chr(10)
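
A Python 3 rendering of the corrected map, taken through a full write and read-back, might look like the sketch below. It assumes the sample line contained real tab characters (the session output suggests so, but the post shows only spaces), and the temporary directory is a hypothetical location. In Python 3 the str/unicode split disappears, so chr() covers both chr and unichr.

```python
import os
import tempfile

# Identity map for ASCII, then the four overrides from the post.
filter_map = {chr(i): chr(i) for i in range(128)}
filter_map['\t'] = chr(136)
filter_map['\n'] = chr(133)
filter_map[chr(136)] = '\t'
filter_map[chr(133)] = '\n'

# Assumed sample line with real tabs and newlines.
line = 'this\thas\t\ntabs\tand\tline\nbreaks'
filtered_line = ''.join(filter_map[ch] for ch in line)

# Hypothetical location: a throwaway directory instead of the current one.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')

# newline='' switches off newline translation in both directions.
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write(filtered_line)

with open(path, 'r', encoding='utf-8', newline='') as f:
    read_back = f.read()

# The same map also reverses the substitution.
restored = ''.join(filter_map[ch] for ch in read_back)
```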

Mike Currie 06-27-2006 11:13 PM

Re: Python UTF-8 and codecs
 
Well, not really. It doesn't affect the result. I still get the error
message. Did you get a different result?


Serge Orlov 06-27-2006 11:34 PM

Re: Python UTF-8 and codecs
 
On 6/27/06, Mike Currie <dev@null.com> wrote:
> Well, not really. It doesn't affect the result. I still get the error
> message. Did you get a different result?


Yes, the program successfully wrote the text file. Without magic abilities
to read the screen of your computer, I guess you now get the exception in
the print statement. That is because you use the legacy Windows console (I
use the Unicode-capable console of Lightning Compiler
<http://www.python.org/pypi/Lightning%20Compiler> to run snippets of
code). You can either change console, comment out the print statement, or
change your program to print the repr of the string: print
repr(filteredLine)
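
The console problem guessed at above can be illustrated in Python 3 without a Windows console by encoding to a legacy code page directly. Code page 437 is an assumption here, since the thread never says which code page the console used, and ascii() plays the role that repr() has in the Python 2 suggestion.

```python
# Text containing the two placeholder characters from the thread.
text = 'this\x88has\x85breaks'

# Assumption: the console uses code page 437; the thread does not say.
try:
    text.encode('cp437')
    console_ok = True
except UnicodeEncodeError:
    # cp437 has no mapping for the control characters U+0085 and U+0088.
    console_ok = False

# ascii() escapes every non-ASCII character, so the result can be printed
# on any console.
safe = ascii(text)
print(safe)
```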

