UTF-8 and stdin/stdout?

 
 
dave_140390@hotmail.com
      05-28-2008
Hi,

I have problems getting my Python code to work with UTF-8 encoding
when reading from stdin / writing to stdout.

Say I have a file, utf8_input, that contains a single character, é,
coded as UTF-8:

$ hexdump -C utf8_input
00000000 c3 a9
00000002

If I read this file by opening it in this Python script:

$ cat utf8_from_file.py
import codecs
file = codecs.open('utf8_input', encoding='utf-8')
data = file.read()
print "length of data =", len(data)

everything goes well:

$ python utf8_from_file.py
length of data = 1

The contents of utf8_input is one character coded as two bytes, so
UTF-8 decoding is working here.

Now, I would like to do the same with standard input. Of course, this:

$ cat utf8_from_stdin.py
import sys
data = sys.stdin.read()
print "length of data =", len(data)

does not work:

$ [/c/DiskCopy] python utf8_from_stdin.py < utf8_input
length of data = 2

Here, the contents of utf8_input is not interpreted as UTF-8, so
Python believes there are two separate characters.

The question, then:
How could one get utf8_from_stdin.py to work properly with UTF-8?
(And same question for stdout.)

I googled around, and found rather complex stuff (see, for example,
http://blog.ianbicking.org/illusive-...encoding.html), but even
that didn't work: I still get "length of data = 2" even after
successfully calling sys.setdefaultencoding('utf-8').

-- dave
 
Arnaud Delobelle
      05-28-2008
(E-Mail Removed) writes:

> Hi,
>
> I have problems getting my Python code to work with UTF-8 encoding
> when reading from stdin / writing to stdout.
>
> Say I have a file, utf8_input, that contains a single character, é,
> coded as UTF-8:
>
> $ hexdump -C utf8_input
> 00000000 c3 a9
> 00000002
>
> If I read this file by opening it in this Python script:
>
> $ cat utf8_from_file.py
> import codecs
> file = codecs.open('utf8_input', encoding='utf-8')
> data = file.read()
> print "length of data =", len(data)
>
> everything goes well:
>
> $ python utf8_from_file.py
> length of data = 1
>
> The contents of utf8_input is one character coded as two bytes, so
> UTF-8 decoding is working here.
>
> Now, I would like to do the same with standard input. Of course, this:
>
> $ cat utf8_from_stdin.py
> import sys
> data = sys.stdin.read()
> print "length of data =", len(data)


Shouldn't you do data = data.decode('utf8') ?

> does not work:
>
> $ [/c/DiskCopy] python utf8_from_stdin.py < utf8_input
> length of data = 2
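
A minimal sketch of that suggestion, for illustration only. The byte string below stands in for what `sys.stdin.read()` returns in Python 2 when utf8_input is redirected into the script; the variable names are made up for this example.

```python
# The two bytes shown in the hexdump: the UTF-8 encoding of U+00E9 (é).
# In the real script these would come from sys.stdin.read() (Python 2).
raw = b"\xc3\xa9"

# On an undecoded byte string, len() counts bytes.
byte_count = len(raw)

# After decoding, len() counts characters (code points).
text = raw.decode("utf-8")
char_count = len(text)

print("bytes = %d, characters = %d" % (byte_count, char_count))
```

The same lines should behave identically on Python 3, where the undecoded bytes would be read from `sys.stdin.buffer` instead.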


--
Arnaud

 
Chris
      05-28-2008
On May 28, 11:08 am, (E-Mail Removed) wrote:
> Hi,
>
> I have problems getting my Python code to work with UTF-8 encoding
> when reading from stdin / writing to stdout.
>
> Say I have a file, utf8_input, that contains a single character, é,
> coded as UTF-8:
>
>         $ hexdump -C utf8_input
>         00000000  c3 a9
>         00000002
>
> If I read this file by opening it in this Python script:
>
>         $ cat utf8_from_file.py
>         import codecs
>         file = codecs.open('utf8_input', encoding='utf-8')
>         data = file.read()
>         print "length of data =", len(data)
>
> everything goes well:
>
>         $ python utf8_from_file.py
>         length of data = 1
>
> The contents of utf8_input is one character coded as two bytes, so
> UTF-8 decoding is working here.
>
> Now, I would like to do the same with standard input. Of course, this:
>
>         $ cat utf8_from_stdin.py
>         import sys
>         data = sys.stdin.read()
>         print "length of data =", len(data)
>
> does not work:
>
>         $ [/c/DiskCopy] python utf8_from_stdin.py < utf8_input
>         length of data = 2
>
> Here, the contents of utf8_input is not interpreted as UTF-8, so
> Python believes there are two separate characters.
>
> The question, then:
> How could one get utf8_from_stdin.py to work properly with UTF-8?
> (And same question for stdout.)
>
> I googled around, and found rather complex stuff (see, for example,
> http://blog.ianbicking.org/illusive-...encoding.html), but even
> that didn't work: I still get "length of data = 2" even after
> successively calling sys.setdefaultencoding('utf-8').
>
> -- dave


Weird thing is, 'c3 a9' is é on my side... and copy/pasting the é
gives me 'e9', with the first script giving a result of zero and the
second script giving me 1.
 
dave_140390@hotmail.com
      05-28-2008
> Shouldn't you do data = data.decode('utf8') ?

Yes, that's it! Thanks.

-- dave
 
 
Ulrich Eckhardt
      05-28-2008
Chris wrote:
> On May 28, 11:08 am, (E-Mail Removed) wrote:
>> Say I have a file, utf8_input, that contains a single character, é,
>> coded as UTF-8:
>>
>> $ hexdump -C utf8_input
>> 00000000  c3 a9
>> 00000002

[...]
> weird thing is 'c3 a9' is é on my side... and copy/pasting the é
> gives me 'e9' with the first script giving a result of zero and second
> script gives me 1


Don't worry, it can be that those are equivalent. The point is that some
characters exist more than once and some exist in a composite form (e with
accent) and separately (e and combining accent).

Looking at http://unicode.org/charts I see that the letter above should have
codepoint 0xe9 (precomposed character) or 0x65 (e) followed by 0x301
(combining acute accent).

0xe9 = 1110 1001 (codepoint)
0xc3 0xa9 = 1100 0011 1010 1001 (UTF-8)

Anyhow, further looking at this shows that your editor simply doesn't
interpret the two bytes as UTF-8 but as Latin-1 or similar encoding, where
they represent the capital A with tilde and the copyright sign.
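
For illustration, the relationships described above can be checked in a few lines. This is only a sketch; the variable names are made up for the example, and it relies on nothing beyond the standard `unicodedata` module.

```python
import unicodedata

precomposed = u"\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = u"e\u0301"   # 'e' (U+0065) followed by COMBINING ACUTE ACCENT

# UTF-8 encodes the precomposed code point as the two bytes from the hexdump.
utf8_bytes = precomposed.encode("utf-8")    # b"\xc3\xa9"

# Latin-1 encodes the same character as the single byte 0xe9.
latin1_bytes = precomposed.encode("latin-1")    # b"\xe9"

# Misreading the UTF-8 bytes as Latin-1 yields two characters:
# capital A with tilde (U+00C3) and the copyright sign (U+00A9).
misread = utf8_bytes.decode("latin-1")

# NFC normalization turns the decomposed pair into the precomposed form.
composed = unicodedata.normalize("NFC", decomposed)

print(len(misread), composed == precomposed)
```

Without normalization the two spellings compare unequal as strings, even though they are meant to display the same letter.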

Uli

--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

 
 
Martin v. Löwis
      05-28-2008
> $ cat utf8_from_stdin.py
> import sys
> data = sys.stdin.read()
> print "length of data =", len(data)


sys.stdin is a byte stream in Python 2, not a character stream.
To make it a character stream, do

import codecs, sys
sys.stdin = codecs.getreader("utf-8")(sys.stdin)
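
A self-contained sketch of the same wrapping, using in-memory byte streams as stand-ins for stdin and stdout so it can be run without redirection. The `io.BytesIO` objects and variable names are assumptions made for the example; the `codecs.getwriter` half is the analogous approach for the stdout part of the question, not something stated in this thread.

```python
import codecs
import io

# Stand-in for a byte-oriented stdin holding the contents of utf8_input.
byte_input = io.BytesIO(b"\xc3\xa9")

# getreader() returns a StreamReader class; calling it wraps the byte stream.
reader = codecs.getreader("utf-8")(byte_input)
data = reader.read()          # decoded text, not raw bytes
length = len(data)

# Stand-in for a byte-oriented stdout.
byte_output = io.BytesIO()

# getwriter() is the mirror image: text written here is encoded to UTF-8.
writer = codecs.getwriter("utf-8")(byte_output)
writer.write(u"\u00e9")
encoded = byte_output.getvalue()

print("length of data = %d" % length)
```

In Python 3, `sys.stdin` and `sys.stdout` are already text streams, so this kind of wrapping would apply to `sys.stdin.buffer` and `sys.stdout.buffer` rather than to the streams themselves.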

HTH,
Martin
 