Re: string to unicode

 
 
Chris Angelico
      08-15-2011
On Mon, Aug 15, 2011 at 4:20 PM, Artie Ziff <> wrote:
> if I am using the standard csv library to read contents of a csv file which
> contains Unicode strings (short example: '\xe8\x9f\x92\xe8\x9b\x87'), how do
> I use a python Unicode method such as decode or encode to transform this
> string type into a python unicode type? Must I know the encoding (byte
> groupings) of the Unicode? Can I get this from the file? Perhaps I need to
> open the file with particular attributes?
>


Start here:

http://www.joelonsoftware.com/articles/Unicode.html

The CSV file, being stored on disk, cannot contain Unicode strings; it
can only contain bytes. If you know the encoding (eg UTF-8, UCS-2,
etc), then you can decode it using that. If you don't, your best bet
is to ask the origin of the file; failing that, check the first few
bytes - if it's "\xFF\xFE" or "\xFE\xFF" or "\xEF\xBB\xBF", then it's
probably UTF-16LE, UTF-16BE, or UTF-8, respectively (those being the
encodings of the BOM). There may be other clues, too, but normally
it's best to get the encoding separately from the data rather than try
to decode it from the data itself.
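
A minimal sketch of the BOM check described above, assuming the whole file fits in memory. The names `sniff_bom` and `decode_bytes` are made up for illustration, and the UTF-8 fallback is only an assumption about the data; the sketch ignores UTF-32, whose little-endian BOM starts with the same two bytes as the UTF-16LE one.

```python
import codecs

# BOM byte sequences mapped to a codec that consumes the BOM while decoding.
# 'utf-16' reads the BOM to pick the byte order; 'utf-8-sig' strips it.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),    # b'\xef\xbb\xbf'
    (codecs.BOM_UTF16_LE, 'utf-16'),   # b'\xff\xfe'
    (codecs.BOM_UTF16_BE, 'utf-16'),   # b'\xfe\xff'
]


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None


def decode_bytes(raw, fallback='utf-8'):
    """Decode raw using its BOM if present, otherwise the fallback codec."""
    return raw.decode(sniff_bom(raw) or fallback)


# The bytes from the question carry no BOM; decoded as UTF-8 (assumed),
# they yield two characters, U+87D2 and U+86C7.
text = decode_bytes(b'\xe8\x9f\x92\xe8\x9b\x87')
```

A missing BOM proves nothing about the encoding, so the fallback should come from whoever produced the file wherever possible.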

Chris Angelico
 
 
Thomas 'PointedEars' Lahn
      08-15-2011
Chris Angelico wrote:

> On Mon, Aug 15, 2011 at 4:20 PM, Artie Ziff <> wrote:
>> if I am using the standard csv library to read contents of a csv file
>> which contains Unicode strings (short example:
>> '\xe8\x9f\x92\xe8\x9b\x87'), how do I use a python Unicode method such as
>> decode or encode to transform this string type into a python unicode
>> type? Must I know the encoding (byte groupings) of the Unicode? Can I get
>> this from the file? Perhaps I need to open the file with particular
>> attributes?
>
> Start here:
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> The CSV file, being stored on disk, cannot contain Unicode strings; it
> can only contain bytes. If you know the encoding (eg UTF-8, UCS-2,
> etc), then you can decode it using that. If you don't, your best bet
> is to ask the origin of the file; failing that, check the first few
> bytes - if it's "\xFF\xFE" or "\xFE\xFF" or "\xEF\xBB\xBF", then it's
> probably UTF-16LE, UTF-16BE, or UTF-8, respectively (those being the
> encodings of the BOM). There may be other clues, too, but normally
> it's best to get the encoding separately from the data rather than try
> to decode it from the data itself.


As this problem really is not a new one, there are several more – if I may
say so – pythonic approaches:

<http://stackoverflow.com/questions/4...here-a-way-to-determine-the-encoding-of-text-file>

Improving Billy Mays' "matching brackets" checker, chardet worked for me
(the test file was UTF-8-encoded). Watch for word-wrap:

-----------------------------------------------------------------------
# encoding: utf-8
'''
Created on 2011-07-18

@author: Thomas 'PointedEars' Lahn <>, based on an idea of
Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashis myemail.com>
in <news:j01ph6$knt$>
'''
import os
import sys

import chardet

# Map each closing character to the opening character it must match.
pairs = {u'}': u'{', u')': u'(', u']': u'[',
         u'”': u'“', u'›': u'‹', u'»': u'«',
         u'】': u'【', u'〉': u'〈', u'》': u'《',
         u'」': u'「', u'』': u'『'}
valid = set(v for pair in pairs.items() for v in pair)

if __name__ == '__main__':
    for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
        for name in filenames:
            file_path = os.path.join(dirpath, name)

            # Read the bytes once so that detection and decoding both
            # see the complete file contents.
            with open(file_path, 'rb') as f:
                raw = f.read()

            # chardet reports None when it cannot guess; UTF-8 is assumed then.
            encoding = chardet.detect(raw)['encoding'] or 'utf-8'
            lines = raw.decode(encoding, 'replace').splitlines()

            stack = []        # (char, line_no, col) of each unclosed opener
            first_bad = None  # first mismatched or unclosed character

            for line_no, line in enumerate(lines, 1):
                for col, c in enumerate(line, 1):
                    if c not in valid:
                        continue
                    if c not in pairs:
                        # Opening character: remember where it was seen.
                        stack.append((c, line_no, col))
                    elif stack and stack[-1][0] == pairs[c]:
                        stack.pop()
                    elif first_bad is None:
                        first_bad = (c, line_no, col)

            # Openers left on the stack were never closed.
            if first_bad is None and stack:
                first_bad = stack[0]

            print('%s: %s' % (name, "good" if first_bad is None
                              else "bad '%s' at %s:%s" % first_bad))
-----------------------------------------------------------------------

HTH

--
PointedEars

Bitte keine Kopien per E-Mail. / Please do not Cc: me.
 