Velocity Reviews


Richy2004 08-06-2004 02:57 PM

Re: Unicode support
 
code:
import sys,codecs
file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
print (file.readline())

output:
File "./test.py", line 5, in ?
print (file.readline())
File "C:\Python23\lib\codecs.py", line 384, in readline
return self.reader.readline(size)
File "c:\Python23\lib\encodings\utf_16.py", line 57, in readline
raise NotImplementedError, '.readline() is not implemented for
UTF-16'
NotImplementedError: .readline() is not implemented for UTF-16

======================================================
code:
import sys, codecs
file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
print (file.read())

output:
Traceback (most recent call last):
File "./test.py", line 5, in ?
print (file.read())
File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position
0-2: character maps to <undefined>

======================================================
code:
import sys, codecs
file = codecs.open("accountmgr_words_arb.txt", "rb", "utf-16")
lines = file.readlines()
print lines

this works !, output:
[u'\u0646\u0648\u0639 \u062d\u0633\u0627\u0628 \u062c\u062f\u064a\u062f
\u0645\u062e\u062a\u0627\u0631.\r\n']

if I add these lines:
line = lines[0]
tokens = line.split("\\u")
print tokens[0]

I get this: :(
Traceback (most recent call last):
File "./test.py", line 8, in ?
print tokens[0]
File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position
0-2: character maps to <undefined>

Thanks,
Richard


vincent wehren 08-06-2004 05:16 PM

Re: Unicode support
 
Richy2004 wrote:

> code:
> import sys,codecs
> file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
> print (file.readline())
>
> output:
> File "./test.py", line 5, in ?
> print (file.readline())
> File "C:\Python23\lib\codecs.py", line 384, in readline
> return self.reader.readline(size)
> File "c:\Python23\lib\encodings\utf_16.py", line 57, in readline
> raise NotImplementedError, '.readline() is not implemented for
> UTF-16'
> NotImplementedError: .readline() is not implemented for UTF-16
>
> ======================================================
> code:
> import sys, codecs
> file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
> print (file.read())
>
> output:
> Traceback (most recent call last):
> File "./test.py", line 5, in ?
> print (file.read())
> File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
> return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode characters in position
> 0-2: character maps to <undefined>
>
> ======================================================
> code:
> import sys, codecs
> file = codecs.open("accountmgr_words_arb.txt", "rb", "utf-16")
> lines = file.readlines()
> print lines


> this works !, output:
> [u'\u0646\u0648\u0639 \u062d\u0633\u0627\u0628 \u062c\u062f\u064a\u062f
> \u0645\u062e\u062a\u0627\u0631.\r\n']


You understand this is just one line, and not multiple lines? Just
checking. The reason it works is that printing a list shows the repr()
of each element, and the repr of a unicode string uses only ASCII
\uXXXX escapes.

> line = lines[0]
> tokens = line.split("\\u")

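This line doesn't make sense: the \u sequences appear only in the
printed representation, so the string itself contains no literal "\u"
to split on. Do you want to split up the line into a
list of individual characters, as in:
>>> tokens = list(lines[0])
>>> print tokens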

[u'\u0646', u'\u0648', u'\u0639', u'\u062d', u'\u0633', u'\u0627',
u'\u0628', u'\u062c', u'\u062f', u'\u064a', u'\u062f', u'\u0645',
u'\u062e', u'\u062a', u'\u0627', u'\u0631', u'.', u'\r', u'\n']
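
To make the point about the escapes concrete, here is a small sketch. It assumes a current Python 3 interpreter rather than the Python 2.3 used in this thread, and it types the line in as a literal instead of reading it from the file. The variable names are only illustrative.

```python
# The line from the file, written as a literal. The \uXXXX escapes are
# source-code notation: each one produces a single character.
line = ("\u0646\u0648\u0639 \u062d\u0633\u0627\u0628 "
        "\u062c\u062f\u064a\u062f \u0645\u062e\u062a\u0627\u0631.\r\n")

# There is no backslash in the actual string, so splitting on "\\u"
# has nothing to split on.
has_backslash = "\\" in line

# Splitting into individual characters.
chars = list(line)

# Splitting into words on whitespace (this also drops the trailing \r\n).
words = line.split()

print(has_backslash, len(chars), len(words))
```

If the goal was tokens rather than characters, line.split() is probably closer to what was intended than list(line).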


> print tokens[0]
>
> I get this: :(
> Traceback (most recent call last):
> File "./test.py", line 8, in ?
> print tokens[0]
> File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
> return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode characters in position
> 0-2: character maps to <undefined>


Anyway, you are trying to print to the console window. AFAIK, Python 2.3
guesses the console encoding, which in your case is cp850, and uses
that single-byte encoding to encode your unicode characters before
writing them to stdout. Unfortunately, you cannot print what I believe
are Arabic characters to a CP850-encoded console (as a matter of fact,
you can't print any of the so-called 'complex scripts' to any Windows
console, but that is a different matter).

If you run the same script in, let's say, IDLE, you won't have that
problem. In other words, if you need to print these characters, you have
to either print them as unicode characters to a unicode-savvy output, or
encode them in an appropriate single-byte encoding (e.g. "cp1256") and
send them to an output window that knows how to deal with it.
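
A rough sketch of the encoding step described above, assuming a current Python 3 interpreter (not the Python 2.3 from this thread). The helper name encode_for_console is made up for illustration; the codec names and the "backslashreplace" error handler are standard ones.

```python
text = "\u0646\u0648\u0639"  # first word of the line quoted in this thread

def encode_for_console(text, encoding, errors="backslashreplace"):
    """Hypothetical helper: encode text, escaping what the codec lacks."""
    return text.encode(encoding, errors)

# Strict encoding to cp850 fails, which is the error in the traceback.
try:
    text.encode("cp850")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With an error handler the call succeeds, but yields escapes, not Arabic.
escaped = encode_for_console(text, "cp850")

# cp1256 does contain these letters, one byte per character.
arabic_bytes = text.encode("cp1256")

print(strict_ok, escaped, len(arabic_bytes))
```

Whether the cp1256 bytes display correctly still depends on the window they are written to, as noted above.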

--
Vincent Wehren


"Martin v. Löwis" 08-06-2004 07:22 PM

Re: Unicode support
 
Richy2004 wrote:
> NotImplementedError: .readline() is not implemented for UTF-16


As it says: this is, unfortunately, not implemented. Use readlines
instead.
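
Another way around the missing readline, sketched here under the assumption of a current Python 3 interpreter, is to decode the whole file in one go and split the result into lines. The function name read_utf16_lines and the temporary sample file are invented for the example; the real file from the original post is not used.

```python
import os
import tempfile

def read_utf16_lines(path):
    """Decode the whole file as UTF-16, then split it into lines.

    Line endings are kept, like readlines() would keep them.
    """
    with open(path, "rb") as handle:
        text = handle.read().decode("utf-16")
    return text.splitlines(True)

# Build a small UTF-16 sample file (with BOM) to read back.
sample = "\u0646\u0648\u0639 \u062d\u0633\u0627\u0628\r\nsecond line\r\n"
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(sample.encode("utf-16"))

lines = read_utf16_lines(sample_path)
os.remove(sample_path)

print(len(lines))
```

This reads everything into memory, so it is only a reasonable choice for files of modest size.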

> print (file.read())

[...]
> UnicodeEncodeError: 'charmap' codec can't encode characters in position
> 0-2: character maps to <undefined>


The .read works perfectly. Don't try to print it, though!
You can only print when the terminal actually supports the characters,
which your terminal doesn't. Try

print repr(file.read())

instead.
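
For readers on a current Python 3 interpreter (an assumption; the thread itself is about Python 2.3): repr() of a string there keeps printable non-ASCII characters as they are, so the closest equivalent of the suggestion above is the built-in ascii(). A minimal sketch, with the text typed in as a literal rather than read from the file:

```python
text = "\u0646\u0648\u0639 \u062d\u0633\u0627\u0628"

# ascii() returns a representation that contains only ASCII characters,
# with everything else written as \uXXXX escapes.
safe = ascii(text)

# Every character of the result is plain ASCII, so any terminal can show it.
only_ascii = all(ord(ch) < 128 for ch in safe)

print(safe)
```

The output is meant for inspection only: it shows the code points, not the Arabic letters themselves.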

> print tokens[0]

[...]
> UnicodeEncodeError: 'charmap' codec can't encode characters in position
> 0-2: character maps to <undefined>


Same issue: As Vincent explains, you can't print ARABIC LETTER NOON
to your terminal, as your terminal simply cannot display that character.

Regards,
Martin

Hye-Shik Chang 08-07-2004 10:31 AM

Re: Unicode support
 
On 6 Aug 2004 07:57:44 -0700, Richy2004 <richard.scothern@gmail.com> wrote:
> code:
> import sys,codecs
> file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
> print (file.readline())
>
> output:
> File "./test.py", line 5, in ?
> print (file.readline())
> File "C:\Python23\lib\codecs.py", line 384, in readline
> return self.reader.readline(size)
> File "c:\Python23\lib\encodings\utf_16.py", line 57, in readline
> raise NotImplementedError, '.readline() is not implemented for
> UTF-16'
> NotImplementedError: .readline() is not implemented for UTF-16
>


UTF-16 readline is supported by CJKCodecs 1.1. :)

>>> import codecs
>>> codecs.open("u16test", "r", "cjkcodecs.utf-16")

<open file 'u16test', mode 'rb' at 0x81ab7e0>
>>> _.readline()

u'\u25ce \ud30c\uc774\uc36c(Python)\uc740 \ubc30\uc6b0\uae30
\uc27d\uace0, \uac15\ub825\ud55c \ud504\ub85c\uadf8\ub798\ubc0d
\uc5b8\uc5b4\uc785\ub2c8\ub2e4. \ud30c\uc774\uc36c\uc740\n'


Hye-Shik


All times are GMT.
