Re: Unicode support
code:
import sys,codecs
file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
print (file.readline())

output:
  File "./test.py", line 5, in ?
    print (file.readline())
  File "C:\Python23\lib\codecs.py", line 384, in readline
    return self.reader.readline(size)
  File "c:\Python23\lib\encodings\utf_16.py", line 57, in readline
    raise NotImplementedError, '.readline() is not implemented for UTF-16'
NotImplementedError: .readline() is not implemented for UTF-16

======================================================
code:
import sys, codecs
file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
print (file.read())

output:
Traceback (most recent call last):
  File "./test.py", line 5, in ?
    print (file.read())
  File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>

======================================================
code:
import sys, codecs
file = codecs.open("accountmgr_words_arb.txt", "rb", "utf-16")
lines = file.readlines()
print lines

this works !, output:
[u'\u0646\u0648\u0639 \u062d\u0633\u0627\u0628 \u062c\u062f\u064a\u062f \u0645\u062e\u062a\u0627\u0631.\r\n']

if I add these lines:
line = lines[0]
tokens = line.split("\\u")
print tokens[0]

I get this: :(
Traceback (most recent call last):
  File "./test.py", line 8, in ?
    print tokens[0]
  File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>

Thanks,
Richard
Re: Unicode support
Richy2004 wrote:
> code:
> import sys,codecs
> file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
> print (file.readline())
>
> output:
>   File "./test.py", line 5, in ?
>     print (file.readline())
>   File "C:\Python23\lib\codecs.py", line 384, in readline
>     return self.reader.readline(size)
>   File "c:\Python23\lib\encodings\utf_16.py", line 57, in readline
>     raise NotImplementedError, '.readline() is not implemented for UTF-16'
> NotImplementedError: .readline() is not implemented for UTF-16
>
> ======================================================
> code:
> import sys, codecs
> file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
> print (file.read())
>
> output:
> Traceback (most recent call last):
>   File "./test.py", line 5, in ?
>     print (file.read())
>   File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>
>
> ======================================================
> code:
> import sys, codecs
> file = codecs.open("accountmgr_words_arb.txt", "rb", "utf-16")
> lines = file.readlines()
> print lines
> this works !, output:
> [u'\u0646\u0648\u0639 \u062d\u0633\u0627\u0628 \u062c\u062f\u064a\u062f \u0645\u062e\u062a\u0627\u0631.\r\n']

You understand this is just one line, and not multiple lines? Just checking. The reason it works is that you are printing a representation (repr) of the list, which contains only ASCII escape sequences.

> line = lines[0]
> tokens = line.split("\\u")

This line doesn't make sense. The \uXXXX sequences appear only in the representation; the string itself contains no literal backslash followed by "u", so split("\\u") has nothing to split on. Do you want to split up the line into a list of individual characters, as in:

>>> tokens = list(lines[0])
>>> print tokens
[u'\u0646', u'\u0648', u'\u0639', u'\u062d', u'\u0633', u'\u0627', u'\u0628', u'\u062c', u'\u062f', u'\u064a', u'\u062f', u'\u0645', u'\u062e', u'\u062a', u'\u0627', u'\u0631', u'.', u'\r', u'\n']

> print tokens[0]
>
> I get this: :(
> Traceback (most recent call last):
>   File "./test.py", line 8, in ?
>     print tokens[0]
>   File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>

Anyway, you are trying to print to the console window. AFAIK, Python 2.3 guesses the console encoding, which in your case is cp850, and uses it as a single-byte encoding to encode your Unicode characters before writing them to stdout. Unfortunately, you cannot print what I believe are Arabic characters to a CP850-encoded console (as a matter of fact, you can't print any of the so-called 'complex scripts' to any Windows console, but that is a different matter). If you run the same script in, let's say, IDLE, you won't have that problem.

In other words, if you need to print these characters, you have to either print them as Unicode characters to a Unicode-savvy output, or encode them in an appropriate single-byte encoding (e.g. "cp1256") and send them to an output window that knows how to deal with it.

--
Vincent Wehren

>
> Thanks,
> Richard
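Vincent's point about code pages can be checked without touching a console. The following is a minimal sketch in current Python 3 (an assumption; the thread itself uses Python 2.3). The helper name `can_encode` is made up for illustration, and the sample text is the first two words of the line from the original post.

```python
# Sketch only: cp850 stands in for the poster's console encoding,
# cp1256 is the Windows Arabic code page mentioned above.
text = u"\u0646\u0648\u0639 \u062d\u0633\u0627\u0628"

def can_encode(s, encoding):
    """Hypothetical helper: True if every character of s exists in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

cp850_ok = can_encode(text, "cp850")    # no Arabic letters in this code page
cp1256_ok = can_encode(text, "cp1256")  # Arabic letters are present here

# Fallback that never raises: unencodable characters become \uXXXX escapes.
safe = text.encode("cp850", "backslashreplace").decode("cp850")

print(cp850_ok, cp1256_ok)
print(safe)
```

The "backslashreplace" error handler is one way to get something on screen when the console encoding cannot be changed; whether escapes are acceptable output depends on what the script is for.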
Re: Unicode support
Richy2004 wrote:
> NotImplementedError: .readline() is not implemented for UTF-16

As it says: this is, unfortunately, not implemented. Use readlines instead.

> print (file.read())
[...]
> UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>

The .read() works perfectly. Don't try to print the result, though! You can only print when the terminal actually supports the characters, which your terminal doesn't. Try

print repr(file.read())

instead.

> print tokens[0]
[...]
> UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>

Same issue: As Vincent explains, you can't print ARABIC LETTER NOON to your terminal, as your terminal simply cannot display that character.

Regards,
Martin
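For readers on current Python 3 (an assumption; the thread uses Python 2.3), repr() of a string no longer escapes printable non-ASCII characters, so the closest counterpart of Martin's suggestion is the built-in ascii(). A small sketch, using the first word of the line from the original post:

```python
import unicodedata

text = u"\u0646\u0648\u0639"  # first word of the line from the thread

# ascii() escapes every non-ASCII character, so the result can be printed
# on a terminal that has no Arabic glyphs.
escaped = ascii(text)
print(escaped)

# Look up the official Unicode name of the first character.
first_name = unicodedata.name(text[0])
print(first_name)
```

This only makes the data inspectable; it does not make the terminal display Arabic.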
Re: Unicode support
On 6 Aug 2004 07:57:44 -0700, Richy2004 <richard.scothern@gmail.com> wrote:
> code:
> import sys,codecs
> file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
> print (file.readline())
>
> output:
>   File "./test.py", line 5, in ?
>     print (file.readline())
>   File "C:\Python23\lib\codecs.py", line 384, in readline
>     return self.reader.readline(size)
>   File "c:\Python23\lib\encodings\utf_16.py", line 57, in readline
>     raise NotImplementedError, '.readline() is not implemented for UTF-16'
> NotImplementedError: .readline() is not implemented for UTF-16
>

UTF-16 readline is being supported by CJKCodecs 1.1. :)

>>> import codecs
>>> codecs.open("u16test", "r", "cjkcodecs.utf-16")
<open file 'u16test', mode 'rb' at 0x81ab7e0>
>>> _.readline()
u'\u25ce \ud30c\uc774\uc36c(Python)\uc740 \ubc30\uc6b0\uae30 \uc27d\uace0, \uac15\ub825\ud55c \ud504\ub85c\uadf8\ub798\ubc0d \uc5b8\uc5b4\uc785\ub2c8\ub2e4. \ud30c\uc774\uc36c\uc740\n'

Hye-Shik
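As a point of comparison, here is a sketch of line-by-line UTF-16 reading with io.open in current Python 3. This is an assumption about the reader's environment, not the Python 2.3 setup in the thread, and it does not use CJKCodecs. The sample text and the temporary file are made up so that the example is self-contained.

```python
import io
import os
import tempfile

sample = u"\u0646\u0648\u0639 \u062d\u0633\u0627\u0628\r\nsecond line\r\n"

# Write a throwaway UTF-16 file (with BOM) so the sketch needs no input file.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-16", newline="") as f:
    f.write(sample)

# newline="" keeps the original \r\n endings instead of translating them.
with io.open(path, "r", encoding="utf-16", newline="") as f:
    first = f.readline()   # one decoded line, including its line ending
    rest = f.readlines()   # the remaining lines as a list
os.remove(path)

print(ascii(first))
print(ascii(rest))
```

The results are printed through ascii() so that the script itself does not hit the console encoding problem discussed earlier in the thread.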