Velocity Reviews - Computer Hardware Reviews

unicode compare errors

 
 
Ross
Guest
Posts: n/a
 
      12-10-2010
I've a character encoding issue that has stumped me (not that hard to
do). I am parsing a small text file with some possibility of various
currencies being involved, and want to handle them without messing up.

Initially I was simply doing:

currs = [u'$', u'£', u'€', u'¥']
aFile = open(thisFile, 'r')
for mline in aFile:  # mline might be "£5.50"
    item = mline.strip()  # assumption: item is a field taken from mline
    if item[0] in currs:
        item = item[1:]

But the problem was:
SyntaxError: Non-ASCII character '\xa3' in file

The remedy was of course to declare the file encoding for my Python
module, at the start of the file I used:

# -*- coding: UTF-8 -*-

That allowed me to progress. But now, when I come to a line item in
a non-$ currency, I get this warning:

views.py:3364: UnicodeWarning: Unicode equal comparison failed to
convert both arguments to Unicode - interpreting them as being
unequal.

…which I think means Python is unable to convert the characters in the
file I'm reading into Unicode to compare them with the items in the
list currs.

I think this is saying that u'£' == '£' is false.
(I hope those chars show up okay in my post here)
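
For readers following along, the suspected mismatch can be reproduced in a few lines. This is a minimal sketch, assuming Python 3 (where text and bytes are distinct types and the comparison is quietly False instead of producing the UnicodeWarning shown above) and assuming the bytes in the file are UTF-8. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# Sketch only, assuming Python 3 and UTF-8 encoded input bytes.
pound_text = u'£'                          # a text (Unicode) string
pound_bytes = pound_text.encode('utf-8')   # what a UTF-8 file would hold

# Text and raw bytes do not compare equal, even for the "same" character.
same_before = (pound_text == pound_bytes)

# Decoding the bytes with the right encoding makes the comparison meaningful.
same_after = (pound_text == pound_bytes.decode('utf-8'))

print(same_before, same_after)
```

The point is that the comparison only works once both sides are text, which requires knowing how the bytes were encoded.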

Since I can't control the encoding of the input file that users
submit, how do I get past this? How do I make such comparisons be
True?

Thanks in advance for any suggestions
Ross.



 
 
 
 
 
Ross
Guest
Posts: n/a
 
      12-10-2010
On Dec 10, 2:51 pm, Ross <ros...@gmail.com> wrote:

> Initially I was simply doing:
>
>   currs = [u'$', u'£', u'€', u'¥']
>   aFile = open(thisFile, 'r')
>   for mline in aFile:              # mline might be "£5.50"
>       if item[0] in currs:
>           item = item[1:]
>


Don't you love it when someone solves their own problem? Posting a
reply here so that other poor chumps like me can get around this...

I found I could import the codecs module, which lets me read the file
with my desired encoding. Huzzah!

Instead of opening the file with a standard
aFile = open(thisFile, 'r')

I make sure I've imported codecs:

import codecs

... and then I specify the encoding when opening the file:

aFile = codecs.open(thisFile, encoding='utf-8')

Then all my compares seem to work fine.
If I'm off-base and kludgey here and should be doing something
differently please give me a poke.
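
Put together, the approach described in this post might look like the sketch below. It assumes Python 3 and that the input really is UTF-8. The helper names `strip_currency` and `read_amounts`, and the sample data, are hypothetical and not from the original code.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

CURRS = [u'$', u'£', u'€', u'¥']

def strip_currency(item):
    """Drop one leading currency symbol, if present."""
    if item and item[0] in CURRS:
        return item[1:]
    return item

def read_amounts(path, encoding='utf-8'):
    """Decode the file while reading, then strip currency symbols."""
    amounts = []
    # codecs.open decodes each line to text using the given encoding.
    with codecs.open(path, encoding=encoding) as a_file:
        for mline in a_file:
            amounts.append(strip_currency(mline.strip()))
    return amounts

# Demo with a throwaway UTF-8 file (hypothetical sample data).
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as out:
    out.write(u'£5.50\n$3.00\n€7.25\n'.encode('utf-8'))
try:
    amounts = read_amounts(demo_path)
finally:
    os.remove(demo_path)

print(amounts)
```

In Python 3 the built-in `open` also accepts an `encoding` argument, so `codecs.open` is kept here only to mirror the post.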

Regards,
Ross.
 
 
 
 
 
Nobody
Guest
Posts: n/a
 
      12-10-2010
On Fri, 10 Dec 2010 11:51:44 -0800, Ross wrote:

> Since I can't control the encoding of the input file that users
> submit, how do I get past this? How do I make such comparisons be
> True?


On Fri, 10 Dec 2010 12:07:19 -0800, Ross wrote:

> I found I could import codecs that allow me to read the file with my
> desired encoding. Huzzah!


> If I'm off-base and kludgey here and should be doing something


Er, do you know the file's encoding or don't you? Using:

aFile = codecs.open(thisFile, encoding='utf-8')

is telling Python that the file /is/ in utf-8. If it isn't in utf-8,
you'll get decoding errors.
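
To illustrate the decoding error, here is a minimal sketch assuming Python 3 and a hypothetical input saved as Latin-1, where '£' is the single byte 0xA3. The sample string and variable names are made up for the demonstration.

```python
# -*- coding: utf-8 -*-
# Hypothetical input: "£5.50" saved as Latin-1, so '£' is the byte 0xA3.
latin1_bytes = u'£5.50'.encode('latin-1')

try:
    latin1_bytes.decode('utf-8')
    outcome = 'decoded'
except UnicodeDecodeError as exc:
    # A lone 0xA3 is not a valid way to start a character in UTF-8.
    outcome = 'failed: ' + exc.reason

print(outcome)
```

The same exception would surface while iterating over a file opened with the wrong encoding.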

If you are given a file with no known encoding, then you can't reliably
determine what /characters/ it contains, and thus can't reliably compare
the contents of the file against strings of characters, only against
strings of bytes.
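
Comparing against strings of bytes could be sketched as follows. This assumes Python 3 and a short, hypothetical list of candidate encodings; it is a heuristic, since the same byte can mean a different character in an encoding that is not on the list.

```python
# -*- coding: utf-8 -*-
# Assumption: the input is in one of these candidate encodings (hypothetical).
CANDIDATE_ENCODINGS = ('utf-8', 'cp1252')
CURRS = [u'$', u'£', u'€', u'¥']

# Every byte sequence that could represent a currency symbol.
CURRENCY_PREFIXES = set()
for symbol in CURRS:
    for enc in CANDIDATE_ENCODINGS:
        CURRENCY_PREFIXES.add(symbol.encode(enc))

def strip_currency_bytes(raw):
    """Remove a leading currency symbol from a byte string, if one matches."""
    # Try longer prefixes first so multi-byte UTF-8 sequences are preferred.
    for prefix in sorted(CURRENCY_PREFIXES, key=len, reverse=True):
        if raw.startswith(prefix):
            return raw[len(prefix):]
    return raw

print(strip_currency_bytes(u'£5.50'.encode('utf-8')))
```

This never decodes the input, so it cannot raise a decoding error, but it also cannot tell which character a matched byte was meant to be.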

About the best you can do is to use an autodetection library such as:

http://chardet.feedparser.org/
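
Where installing a detection library is not an option, a cruder substitute is to try a few likely encodings in order. The sketch below is that swapped-in fallback technique, not the library's detection algorithm. It assumes Python 3; the function name `decode_with_fallbacks` and the default encoding list are hypothetical choices.

```python
# -*- coding: utf-8 -*-
def decode_with_fallbacks(raw, encodings=('utf-8', 'cp1252')):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never fails.
    return raw.decode('latin-1'), 'latin-1'

text, used = decode_with_fallbacks(u'£5.50'.encode('utf-8'))
print(used)
```

Putting UTF-8 first is deliberate: text in a single-byte encoding rarely happens to be valid UTF-8, so a clean UTF-8 decode is a fairly strong signal. A clean decode in a single-byte encoding proves much less, which is why this remains a guess.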


 
 
Ross
Guest
Posts: n/a
 
      12-13-2010
On Dec 10, 4:09 pm, Nobody <nob...@nowhere.com> wrote:
> On Fri, 10 Dec 2010 11:51:44 -0800, Ross wrote:
> > Since I can't control the encoding of the input file that users
> > submit, how do I get past this? How do I make such comparisons be
> > True?
>
> On Fri, 10 Dec 2010 12:07:19 -0800, Ross wrote:
> > I found I could import codecs that allow me to read the file with my
> > desired encoding. Huzzah!
> > If I'm off-base and kludgey here and should be doing something
>
> Er, do you know the file's encoding or don't you? Using:
>
>     aFile = codecs.open(thisFile, encoding='utf-8')
>
> is telling Python that the file /is/ in utf-8. If it isn't in utf-8,
> you'll get decoding errors.
>
> If you are given a file with no known encoding, then you can't reliably
> determine what /characters/ it contains, and thus can't reliably compare
> the contents of the file against strings of characters, only against
> strings of bytes.
>
> About the best you can do is to use an autodetection library such as:
>
>     http://chardet.feedparser.org/


That's right, I don't know what encoding the user will have used.
Autodetection sounds good; I'll look into that. Thx.

R.
 