Velocity Reviews - Computer Hardware Reviews

A Unicode problem -HELP

 
 
manstey
Guest
Posts: n/a
 
      05-12-2006
I am writing a program to translate a list of ascii letters into a
different language that requires unicode encoding. This is what I have
done so far:

1. I have # -*- coding: UTF-8 -*- as my first line.
2. In Wing IDE I have set Default Encoding to UTF-8
3. I have imported codecs and opened and written my file, which doesn't
have a BOM, as encoding=UTF-8
4. I have written a dictionary for translation, with entries such as
{'F':u'\u0254'} and a function to do the translation

Everything works fine, except that my output file, when loaded in
the Unicode-aware EmEditor, contains
(u'F', u'\u0254')

But I want to display it as:
('F', 'ɔ') # where the ɔ is a back-to-front 'c'

So my questions are:
1. How do I do this?
2. Do I need to change any of my steps above?

 
 
 
 
 
Martin v. Löwis
Guest
Posts: n/a
 
      05-12-2006
manstey wrote:
> 1. I have # -*- coding: UTF-8 -*- as my first line.
> 2. In Wing IDE I have set Default Encoding to UTF-8
> 3. I have imported codecs and opened and written my file, which doesn't
> have a BOM, as encoding=UTF-8
> 4. I have written a dictionary for translation, with entries such as
> {'F':u'\u0254'} and a function to do the translation
>
> Everything works fine, except that my output file, when loaded in
> the Unicode-aware EmEditor, contains
> (u'F', u'\u0254')


I couldn't quite follow this description: what is "your output file"
(in what step is it created?), and how does

(u'F', u'\u0254')

get into this file? What is the precise Python statement that
produces that line of output?

> So my questions are:
> 1. How do I do this?


Most likely, you use (directly or indirectly) the repr() function
to convert a tuple into that string. You shouldn't do that;
instead, you should format the elements of the tuple yourself, e.g.
through

print >>f, u"('%s', '%s')" % value

Regards,
Martin
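
A minimal sketch of the advice above: build the output text from the tuple elements themselves instead of going through repr(). It is written so that it should run on both Python 2.7 and Python 3. The helper name format_pair, the use of io.open and the temporary output file are assumptions made for illustration and do not come from the thread.

```python
import io
import os
import tempfile


def format_pair(pair):
    # Format the two elements explicitly; no repr() is involved,
    # so no u'' prefixes or \uXXXX escapes end up in the text.
    return u"('%s', '%s')" % pair


value = (u'F', u'\u0254')
line = format_pair(value)

# Hypothetical output location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# io.open encodes the text as UTF-8 while writing.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(line + u'\n')

with io.open(path, 'r', encoding='utf-8') as f:
    written = f.read()
```

Opened in a Unicode-aware editor, the file should show the open o itself rather than its escape sequence.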
 
 
 
 
 
manstey
Guest
Posts: n/a
 
      05-17-2006
Hi Martin,

Here is how I write:

input_file = open(input_file_loc, 'r')
output_file = open(output_file_loc, 'w')
for line in input_file:
    output_file.write(str(word_info + parse + gloss)) # = three functions that return tuples

(u'F', u'\u0254') are two of the many unicode tuple elements returned
by the three functions.

What am I doing wrong?

 
 
Ben Finney
Guest
Posts: n/a
 
      05-17-2006
"manstey" writes:

> input_file = open(input_file_loc, 'r')
> output_file = open(output_file_loc, 'w')
> for line in input_file:
>     output_file.write(str(word_info + parse + gloss)) # = three functions that return tuples


If you mean that 'word_info', 'parse' and 'gloss' are three functions
that return tuples, then you get that return value by calling them.

>>> def foo():
...     return "foo's return value"
...
>>> def bar(baz):
...     return "bar's return value (including '%s')" % baz
...
>>> print foo()
foo's return value
>>> print bar
<function bar at 0x401fe80c>
>>> print bar("orange")
bar's return value (including 'orange')

--
\ "A man must consider what a rich realm he abdicates when he |
`\ becomes a conformist." -- Ralph Waldo Emerson |
_o__) |
Ben Finney

 
 
manstey
Guest
Posts: n/a
 
      05-17-2006
I'm a newbie at python, so I don't really understand how your answer
solves my unicode problem.

I have done more reading on unicode and then tried my code in IDLE
rather than WING IDE, and discovered that it works fine in IDLE, so I
think WING has a problem with unicode. For example, in WING this code
returns an error:

a={'a':u'\u0254'}
print a['a']


UnicodeEncodeError: 'ascii' codec can't encode character u'\u0254' in
position 0: ordinal not in range(128)

but in IDLE it correctly prints open o

So, assuming I now work in IDLE, all I want help with is how to read in
an ascii string and convert its letters to various unicode values and
save the resulting 'string' to a utf-8 text file. Is this clear?

so in pseudo code
1. F is converted to \u0254, $ is converted to \u0283, C is converted
to \u02A6\u02C1, etc.
(i want to do this using a dictionary TRANSLATE={'F':u'\u0254', etc.})
2. I read in a file with lines like:
F$
FCF$
$$C$ etc
3. I convert this to
\u0254\u0283
\u0254\u02A6\u02C1\u0254 etc
4. i save the results in a new file

when i read the new file in a unicode editor (EmEditor), i don't see
\u0254\u02A6\u02C1\u0254, but I see the actual characters (open o, esh,
ts digraph, modifier letter reversed glottal stop, etc.)

I'm sure this is straightforward but I can't get it to work.

All help appreciated!
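
The four steps in the pseudo code above can be sketched roughly as follows. This is only an illustration under stated assumptions: the TRANSLATE table contains just the three mappings mentioned in the post, the file names and temporary directory are made up, and characters without an entry are simply passed through unchanged. It uses io.open so that it should run on Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Only the three mappings named in the post; a real table would have more.
TRANSLATE = {
    u'F': u'\u0254',        # open o
    u'$': u'\u0283',        # esh
    u'C': u'\u02a6\u02c1',  # ts digraph + modifier letter reversed glottal stop
}


def translate(text, table=TRANSLATE):
    # Replace each character by its mapping; unknown characters
    # (including the newline) pass through unchanged.
    return u''.join(table.get(ch, ch) for ch in text)


# Hypothetical input and output files for the sketch.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'in.txt')
out_path = os.path.join(workdir, 'out.txt')

with io.open(in_path, 'w', encoding='ascii') as f:
    f.write(u'F$\nFCF$\n$$C$\n')

# Read ASCII text, translate line by line, write UTF-8.
with io.open(in_path, 'r', encoding='ascii') as src:
    with io.open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(translate(line))

with io.open(out_path, 'r', encoding='utf-8') as f:
    result = f.read()
```

The translation works on text throughout, and the only encoding step is the one io.open performs when writing, so the output file holds the actual characters rather than \uXXXX escapes.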

 
 
Ben Finney
Guest
Posts: n/a
 
      05-17-2006
"manstey" writes:

> I'm a newbie at python, so I don't really understand how your answer
> solves my unicode problem.


Since your replies fail to give any context of the existing
discussion, I could only go by the content of what you'd written in
that message. I didn't see a problem with anything Unicode -- I saw
three objects being added together, which you told us were function
objects. That's the problem I pointed out.

--
\ "When a well-packaged web of lies has been sold to the masses |
`\ over generations, the truth will seem utterly preposterous and |
_o__) its speaker a raving lunatic." -- Dresden James |
Ben Finney

 
 
Martin v. Löwis
Guest
Posts: n/a
 
      05-17-2006
manstey wrote:
> input_file = open(input_file_loc, 'r')
> output_file = open(output_file_loc, 'w')
> for line in input_file:
>     output_file.write(str(word_info + parse + gloss)) # = three functions that return tuples
>
> (u'F', u'\u0254') are two of the many unicode tuple elements returned
> by the three functions.
>
> What am I doing wrong?


Well, the primary problem is that you don't tell us what you are really
doing. For example, it is very hard to believe that this is the actual
code that you are running.

If word_info, parse, and gloss are functions, the code should read

input_file = open(input_file_loc, 'r')
output_file = open(output_file_loc, 'w')
for line in input_file:
    output_file.write(str(word_info() + parse() + gloss()))

I.e. you need to call the functions for this code to make any sense.
You have probably chosen to edit the code in order to not show us
your real code. Unfortunately, since you are a newbie in Python,
you make errors in doing so, and omit important details. That makes
it very difficult to help you.

Regards,
Martin
 
 
manstey
Guest
Posts: n/a
 
      05-17-2006
OK, I apologise for not being clearer.

1. Here is my input data file, line 2:
gn1:1,1.2 R")$I73YT R")$IYT@ncfsa

2. Here is my output data file, line 2:
u'gn', u'1', u'1', u'1', u'2', u'-', u'R")$I73YT', u'R")$IYT',
u'R")$IYT', u'@', u'ncfsa', u'nc', '', '', '', u'f', u's', u'a', '',
'', '', '', '', '', '', '', u'B.:R")$I^YT', u'b.:cv)cv^yc', '\xc9\x94'

3. Here is my main program:
# -*- coding: UTF-8 -*-
import codecs

import splitFunctions
import surfaceIPA

# Constants for file location

# Working directory constants
dir_root = 'E:\\'
dir_relative = '2 Core\\2b Data\\Data Working\\'

# Input file constants
input_file_name = 'in.grab.txt'
input_file_loc = dir_root + dir_relative + input_file_name
# Initialise input file
input_file = codecs.open(input_file_loc, 'r', 'utf-8')

# Output file constants
output_file_name = 'out.grab.txt'
output_file_loc = dir_root + dir_relative + output_file_name
# Initialise output file
output_file = codecs.open(output_file_loc, 'w', 'utf-8') # unicode

i = 0
for line in input_file:
    if line[0] != '>': # Ignore headers
        i += 1
        if i != 1:
            word_info = splitFunctions.splitGrab(line, i)
            parse=splitFunctions.splitParse(word_info[10])
            gloss=surfaceIPA.surfaceIPA(word_info[6],word_info[8],word_info[9],parse)
            a=str(word_info + parse + gloss).encode('utf-8')
            a=a[1:len(a)-1]
            output_file.write(a)
            output_file.write('\n')

input_file.close()
output_file.close()

print 'done'


4. Here is my problem:
At the end of my output file, where my unicode character \u0254 (OPEN
O) appears, the file has '\xc9\x94'

What I want is an output file like:

'gn', '1', '1', '1', '2', '-', ..... 'ɔ'

where ɔ is an open O, and would display correctly in the appropriate
font.

Once I can get it to display properly, I will rewrite gloss so that it
returns a proper translation of 'R")$I73YT', which will be a string of
unicode characters.

Is this clearer? The other two functions are basic. splitGrab turns
'gn1:1,1.2 R")$I73YT R")$IYT@ncfsa' into 'gn 1 1 1 2 R")$I73YT R")$IYT
@ ncfsa' and splitParse turns the final piece of this 'ncfsa' into 'n c
f s a'. They have to be done separately as splitParse involves some
translation and program logic. SurfaceIPA reads in 'R")$I73YT' and
other data to produce the unicode string. At the moment it just returns
two dummy strings and u'\u0254'.encode('utf-8').

All help is appreciated!

Thanks

 
 
Martin v. Löwis
Guest
Posts: n/a
 
      05-17-2006
manstey wrote:
> a=str(word_info + parse + gloss).encode('utf-8')
> a=a[1:len(a)-1]
>
> Is this clearer?


Indeed. The problem is your usage of str() to "render" the output.
As word_info+parse+gloss is a list (or is it a tuple?), str() will
already produce "Python source code", i.e. an ASCII byte string
that can be read back into the interpreter; all Unicode is gone
from that string. If you want comma-separated output, you should
do this:

def comma_separated_utf8(items):
    result = []
    for item in items:
        result.append(item.encode('utf-8'))
    return ", ".join(result)

and then
a = comma_separated_utf8(word_info + parse + gloss)

Then you don't have to drop the parentheses from a anymore, as
it won't have parentheses in the first place.

As the encoding will be done already in the output file,
the following should also work:

a = u", ".join(word_info + parse + gloss)

This would make "a" a comma-separated unicode string, so that
the subsequent output_file.write(a) encodes it as UTF-8.

If that doesn't work, I would like to know what the exact
value of gloss is, do

print "GLOSS IS", repr(gloss)

to print it out.

Regards,
Martin
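
A small sketch of the second suggestion above, where the items stay as text and the file object does the encoding. The tuple contents are shortened stand-ins for the values in the thread, and io.open and the temporary file are assumptions chosen so that the sketch should run on Python 2.7 and Python 3. Note that gloss is kept as text here: the thread's dummy value was already encoded with .encode('utf-8'), and mixing such a byte string into a unicode join is likely to fail or to leave byte escapes in the output.

```python
import io
import os
import tempfile


def comma_separated(items):
    # Every item is expected to be text (unicode), not encoded bytes.
    return u", ".join(items)


# Shortened stand-ins for the tuples returned by the three functions.
word_info = (u'gn', u'1')
parse = (u'n', u'c')
gloss = (u'\u0254',)  # left as text, with no early .encode('utf-8')

a = comma_separated(word_info + parse + gloss)

# Hypothetical output location for the sketch.
path = os.path.join(tempfile.mkdtemp(), 'out.grab.txt')

# The only encoding step happens here, when the text is written.
with io.open(path, 'w', encoding='utf-8') as output_file:
    output_file.write(a + u'\n')

# Read the raw bytes back to see what actually landed in the file.
with io.open(path, 'rb') as f:
    raw = f.read()
```

The raw bytes contain the two-byte UTF-8 sequence for the open o, which an editor that reads the file as UTF-8 displays as a single character.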
 
 
Tim Roberts
Guest
Posts: n/a
 
      05-17-2006
"manstey" wrote:
>
>I have done more reading on unicode and then tried my code in IDLE
>rather than WING IDE, and discovered that it works fine in IDLE, so I
>think WING has a problem with unicode.


Rather, its output defaults to ASCII.
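
As a rough illustration of what an ASCII default means, the sketch below encodes the same one-character string with two codecs. It does not reproduce Wing IDE's or IDLE's actual output handling, which is not shown in the thread; it only demonstrates that the ASCII codec cannot represent U+0254 while UTF-8 can.

```python
text = u'\u0254'

# Encoding to ASCII fails, which matches the error quoted in the thread.
try:
    text.encode('ascii')
    ascii_ok = True
    failed_codec = None
except UnicodeEncodeError as err:
    ascii_ok = False
    failed_codec = err.encoding  # name of the codec that rejected the text

# Encoding to UTF-8 succeeds and yields two bytes.
utf8_bytes = text.encode('utf-8')
```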

>So, assuming I now work in IDLE, all I want help with is how to read in
>an ascii string and convert its letters to various unicode values and
>save the resulting 'string' to a utf-8 text file. Is this clear?
>
>so in pseudo code
>1. F is converted to \u0254, $ is converted to \u0283, C is converted
>to \u02A6\u02C1, etc.
>(i want to do this using a dictionary TRANSLATE={'F':u'\u0254', etc.})
>2. I read in a file with lines like:
>F$
>FCF$
>$$C$ etc
>3. I convert this to
>\u0254\u0283
>\u0254\u02A6\u02C1\u0254 etc
>4. i save the results in a new file
>
>when i read the new file in a unicode editor (EmEditor), i don't see
>\u0254\u02A6\u02C1\u0254, but I see the actual characters (open o, esh,
>ts digraph, modifier letter reversed glottal stop, etc.)


Of course. Isn't that exactly what you wanted? The Python string
u"\u0254" contains one character (Latin small letter open o). It does NOT
contain 6 characters. If you write that to a UTF-8 file, that file will
contain 1 character -- 2 bytes.

If you actually want the 6-character string \u0254 written to a file, then
you need to escape the \u special code: "\\u0254". However, I don't see
what good that would do you. The \u escape is a Python source code thing.
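
The difference between the one-character string and its six-character escaped spelling can be checked directly. This is a small sketch with made-up variable names; it should behave the same on Python 2.7 and Python 3.

```python
one_char = u"\u0254"   # a single character, the open o
escaped = u"\\u0254"   # six characters: backslash, u, 0, 2, 5, 4

# One character becomes two bytes when encoded as UTF-8.
utf8_bytes = one_char.encode('utf-8')

# The 'unicode_escape' codec produces the escaped spelling as bytes,
# should the literal \u0254 text ever be wanted in a file.
escape_bytes = one_char.encode('unicode_escape')
```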

>I'm sure this is straightforward but I can't get it to work.


I think it is working exactly as you want.
--
- Tim Roberts
Providenza & Boekelheide, Inc.
 