Re: Unicode support

 
 
Richy2004
      08-06-2004
code:
import sys,codecs
file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
print (file.readline())

output:
File "./test.py", line 5, in ?
print (file.readline())
File "C:\Python23\lib\codecs.py", line 384, in readline
return self.reader.readline(size)
File "c:\Python23\lib\encodings\utf_16.py", line 57, in readline
raise NotImplementedError, '.readline() is not implemented for
UTF-16'
NotImplementedError: .readline() is not implemented for UTF-16

======================================================
code:
import sys, codecs
file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
print (file.read())

output:
Traceback (most recent call last):
File "./test.py", line 5, in ?
print (file.read())
File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position
0-2: character maps to <undefined>

======================================================
code:
import sys, codecs
file = codecs.open("accountmgr_words_arb.txt", "rb", "utf-16")
lines = file.readlines()
print lines

this works !, output:
[u'\u0646\u0648\u0639 \u062d\u0633\u0627\u0628 \u062c\u062f\u064a\u062f
\u0645\u062e\u062a\u0627\u0631.\r\n']

if I add these lines:
line = lines[0]
tokens = line.split("\\u")
print tokens[0]

I get this:
Traceback (most recent call last):
File "./test.py", line 8, in ?
print tokens[0]
File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position
0-2: character maps to <undefined>

Thanks,
Richard

 
vincent wehren
      08-06-2004
Richy2004 wrote:

> code:
> import sys,codecs
> file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
> print (file.readline())
>
> output:
> File "./test.py", line 5, in ?
> print (file.readline())
> File "C:\Python23\lib\codecs.py", line 384, in readline
> return self.reader.readline(size)
> File "c:\Python23\lib\encodings\utf_16.py", line 57, in readline
> raise NotImplementedError, '.readline() is not implemented for
> UTF-16'
> NotImplementedError: .readline() is not implemented for UTF-16
>
> ======================================================
> code:
> import sys, codecs
> file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
> print (file.read())
>
> output:
> Traceback (most recent call last):
> File "./test.py", line 5, in ?
> print (file.read())
> File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
> return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode characters in position
> 0-2: character maps to <undefined>
>
> ======================================================
> code:
> import sys, codecs
> file = codecs.open("accountmgr_words_arb.txt", "rb", "utf-16")
> lines = file.readlines()
> print lines


> this works !, output:
> [u'\u0646\u0648\u0639 \u062d\u0633\u0627\u0628 \u062c\u062f\u064a\u062f
> \u0645\u062e\u062a\u0627\u0631.\r\n']


You understand this is just one line, and not multiple lines? Just
checking. It works because printing a list shows the repr() of each
element, and the repr of a unicode string is plain ASCII with \uXXXX
escapes, which any console can display.

> line = lines[0]
> tokens = line.split("\\u")

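This line doesn't make sense: the string contains no literal
backslashes; the \uXXXX escapes are only how the repr displays each
character. Do you want to split up the line into a list of individual
characters, as in: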
>> tokens = list(lines[0])
>> print tokens

[u'\u0646', u'\u0648', u'\u0639', u' ', u'\u062d', u'\u0633', u'\u0627',
u'\u0628', u' ', u'\u062c', u'\u062f', u'\u064a', u'\u062f', u' ',
u'\u0645', u'\u062e', u'\u062a', u'\u0627', u'\u0631', u'.', u'\r', u'\n']
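
To illustrate the difference, here is a small sketch. It assumes present-day Python 3 syntax rather than the Python 2.3 used in this thread, and the sample string is copied from the output quoted above. It shows that splitting on a literal backslash-u finds nothing to split on, while list() and split() operate on the real characters.

```python
# Sample line from the thread, written with escapes so the source stays ASCII.
line = ("\u0646\u0648\u0639 \u062d\u0633\u0627\u0628 "
        "\u062c\u062f\u064a\u062f \u0645\u062e\u062a\u0627\u0631.\r\n")

# The string holds Arabic letters, not backslashes, so nothing is split off.
parts = line.split("\\u")
print(len(parts))

# One list entry per character, including spaces and the line ending.
chars = list(line)
print(len(chars))

# Whitespace-separated words; split() also drops the trailing "\r\n".
words = line.split()
print(len(words))
```

Which of the two is wanted depends on whether the goal is per-character or per-word processing.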


> print tokens[0]
>
> I get this:
> Traceback (most recent call last):
> File "./test.py", line 8, in ?
> print tokens[0]
> File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
> return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode characters in position
> 0-2: character maps to <undefined>


Anyway, you are trying to print to the console window. AFAIK, Python 2.3
guesses the console encoding, which in your case is cp850, and uses
it as a single-byte encoding to encode your unicode characters before
writing them to stdout. Unfortunately, you cannot print what I believe
are Arabic characters to a CP850-encoded console (as a matter of fact,
you can't print any of the so-called 'complex scripts' to any Windows
console, but that is a different matter).
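
The encoding step described above can be reproduced without a real console. The following is only a sketch in present-day Python 3 (not the 2.3 of this thread); print_to_console is a hypothetical helper name, and an in-memory buffer stands in for the console.

```python
import io

def print_to_console(text, encoding, errors="strict"):
    """Hypothetical helper: print to a fake console, return the bytes written."""
    raw = io.BytesIO()
    # newline="\n" keeps the output identical on every platform.
    console = io.TextIOWrapper(raw, encoding=encoding, errors=errors,
                               newline="\n")
    print(text, file=console)   # text is encoded here, as with sys.stdout
    console.flush()
    return raw.getvalue()

word = "\u0646\u0648\u0639"  # first word of the sample line

# A cp850 "console" has no Arabic letters, so printing fails.
try:
    print_to_console(word, "cp850")
    cp850_ok = True
except UnicodeEncodeError:
    cp850_ok = False
print("cp850 can print it:", cp850_ok)

# With errors="replace" the print succeeds, but the text is lost.
print(print_to_console(word, "cp850", errors="replace"))
```

On a real system, sys.stdout.encoding shows which encoding print will use.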

If you run the same script in, let's say, IDLE, you won't have that
problem. In other words, if you need to print these characters, you have
to either print them as unicode characters to a unicode-savvy output, or
encode them in an appropriate single-byte encoding (e.g. "cp1256") and
send them to an output window that knows how to deal with it.
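
As a rough sketch of the second option (again assuming present-day Python 3 syntax, with a sample word taken from the output quoted earlier), encoding explicitly makes the choice of code page visible instead of leaving it to print:

```python
word = "\u0646\u0648\u0639"  # first word of the sample line

# Strict encoding to cp850 fails because the code page has no Arabic letters.
try:
    word.encode("cp850")
except UnicodeEncodeError as exc:
    print("cp850 failed:", exc.reason)

# cp1256 is an Arabic code page, so each letter maps to a single byte.
arabic_bytes = word.encode("cp1256")
print(len(arabic_bytes))

# If the output can only show ASCII, escapes keep the information readable.
escaped_bytes = word.encode("cp850", errors="backslashreplace")
print(escaped_bytes.decode("ascii"))
```

The cp1256 bytes are only useful if whatever displays them is actually set to that code page.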

--
Vincent Wehren
>
> Thanks,
> Richard
>

 
Martin v. Löwis
      08-06-2004
Richy2004 wrote:
> NotImplementedError: .readline() is not implemented for UTF-16


As it says: this is, unfortunately, not implemented. Use readlines
instead.
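
For readers on a current interpreter, a hedged aside: this limitation belongs to the Python 2.3 codecs of the thread. The sketch below assumes Python 3, where the built-in open() takes an encoding argument and readline() works on UTF-16 text; the file name and contents are made up for the demonstration.

```python
import os
import tempfile

# Hypothetical sample file, created only so the example is self-contained.
sample_lines = ["\u0646\u0648\u0639 \u062d\u0633\u0627\u0628\n",
                "\u062c\u062f\u064a\u062f\n"]
path = os.path.join(tempfile.mkdtemp(), "sample_utf16.txt")
with open(path, "w", encoding="utf-16", newline="\n") as handle:
    handle.writelines(sample_lines)

# The "utf-16" codec consumes the byte order mark and decodes the rest.
with open(path, "r", encoding="utf-16") as handle:
    first = handle.readline()    # one decoded line
    rest = handle.readlines()    # remaining lines as a list

print(ascii(first))
print(len(rest))
```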

> print (file.read())

[...]
> UnicodeEncodeError: 'charmap' codec can't encode characters in position
> 0-2: character maps to <undefined>


The .read works perfectly. Don't try to print it, though!
You can only print when the terminal actually supports the characters,
which your terminal doesn't. Try

print repr(file.read())

instead.
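
A note for later Python versions, offered as a sketch rather than as part of the advice above: in Python 3, repr() keeps printable non-ASCII characters as they are, so the closest equivalent of the Python 2 repr of a unicode string is the built-in ascii(). The sample text is taken from the output quoted earlier in the thread.

```python
import ast

text = "\u0646\u0648\u0639 \u062d\u0633\u0627\u0628\r\n"

# ascii() returns a quoted, escaped form that contains only ASCII characters,
# so it can be printed on any terminal.
escaped = ascii(text)
print(escaped)

# The escaped form is a valid string literal and converts back losslessly.
restored = ast.literal_eval(escaped)
print(restored == text)
```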

> print tokens[0]

[...]
> UnicodeEncodeError: 'charmap' codec can't encode characters in position
> 0-2: character maps to <undefined>


Same issue: As Vincent explains, you can't print ARABIC LETTER NOON
to your terminal, as your terminal simply cannot display that character.

Regards,
Martin
 
 
Hye-Shik Chang
      08-07-2004
On 6 Aug 2004 07:57:44 -0700, Richy2004 wrote:
> code:
> import sys,codecs
> file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
> print (file.readline())
>
> output:
> File "./test.py", line 5, in ?
> print (file.readline())
> File "C:\Python23\lib\codecs.py", line 384, in readline
> return self.reader.readline(size)
> File "c:\Python23\lib\encodings\utf_16.py", line 57, in readline
> raise NotImplementedError, '.readline() is not implemented for
> UTF-16'
> NotImplementedError: .readline() is not implemented for UTF-16
>


UTF-16 readline is supported by CJKCodecs 1.1.

>>> import codecs
>>> codecs.open("u16test", "r", "cjkcodecs.utf-16")

<open file 'u16test', mode 'rb' at 0x81ab7e0>
>>> _.readline()

u'\u25ce \ud30c\uc774\uc36c(Python)\uc740 \ubc30\uc6b0\uae30
\uc27d\uace0, \uac15\ub825\ud55c \ud504\ub85c\uadf8\ub798\ubc0d
\uc5b8\uc5b4\uc785\ub2c8\ub2e4. \ud30c\uc774\uc36c\uc740\n'


Hye-Shik
 