Velocity Reviews - Computer Hardware Reviews

Re: Python 3 encoding question: Read a filename from stdin,subsequently open that filename

 
 
Peter Otten
      11-30-2010
Albert Hopkins wrote:

> On Tue, 2010-11-30 at 11:52 +0100, Peter Otten wrote:
>> Dan Stromberg wrote:
>>
>> > I've got a couple of programs that read filenames from stdin, and then
>> > open those files and do things with them. These programs sort of do
>> > the *ix xargs thing, without requiring xargs.
>> >
>> > In Python 2, these work well. Irrespective of how filenames are
>> > encoded, things are opened OK, because it's all just a stream of
>> > single byte characters.
>>
>> I think you're wrong. The filenames' encoding as they are read from stdin
>> must be the same as the encoding used by the file system. If the file
>> system expects UTF-8 and you feed it ISO-8859-1 you'll run into errors.
>>

> I think this is wrong. In Unix there is no concept of filename
> encoding. Filenames can have any arbitrary set of bytes (except '/' and
> '\0'). But the filesystem itself neither knows nor cares about
> encoding.


I think you misunderstood what I was trying to say. If you write a list of
filenames into files.txt, and use an encoding (ISO-8859-1, say) other than
that used by the shell to display file names (on Linux typically UTF-8 these
days) and then write a Python script exist.py that reads filenames and
checks for the files' existence,

$ python3 exist.py < files.txt

will report that a file

b'\xe4\xf6\xfc.txt'

doesn't exist. The user looking at his editor with the encoding set to
ISO-8859-1 seeing the line

äöü.txt

and then going to the console typing

$ ls
äöü.txt

will be confused even though everything is working correctly.
The system may be shuffling bytes, but the user thinks in codepoints and
sometimes assumes that codepoints and bytes are the same.
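The mismatch described above can be reproduced with a short sketch. It is not the exist.py from the post, which was never shown; the helper name exists_in and the temporary directory are assumptions made for illustration. It creates a file under the UTF-8 byte name and then looks it up under both byte names:

```python
import os
import tempfile

name = "\u00e4\u00f6\u00fc.txt"          # the text the user sees: äöü.txt
latin1_name = name.encode("iso-8859-1")  # bytes as stored in files.txt
utf8_name = name.encode("utf-8")         # bytes a UTF-8 shell would have used


def exists_in(directory: bytes, filename: bytes) -> bool:
    """Hypothetical helper: look a file up by its raw byte name, no decoding."""
    return os.path.exists(os.path.join(directory, filename))


with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    # Create the file under its UTF-8 byte name only.
    with open(os.path.join(tmp_bytes, utf8_name), "wb"):
        pass
    found_utf8 = exists_in(tmp_bytes, utf8_name)
    found_latin1 = exists_in(tmp_bytes, latin1_name)

print(latin1_name, utf8_name)
print(found_utf8, found_latin1)
```

Both byte strings render as the same three letters in a suitably configured editor or terminal, yet only one of them names the file.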

>> You always have to know either
>>
>> (a) both the file system's and stdin's actual encoding, or
>> (b) that both encodings are the same.
>>

> If this is true, then I think that it is wrong to do in Python3. Any
> language should be able to deal with the filenames that the host OS
> allows.
>
> Anyway, going on with the OP.. can you open stdin so that you can accept
> arbitrary bytes instead of strings and then open using the bytes as the
> filename?


You can access the underlying stdin.buffer that feeds you the raw bytes with
no attempt to shoehorn them into codepoints. You can use filenames that are
not valid in the encoding that the system uses to display filenames:

$ ls
$ python3
Python 3.1.1+ (r311:74480, Nov 2 2009, 15:45:00)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> with open(b"\xe4\xf6\xfc.txt", "w") as f:
...     f.write("hello\n")
...
6
>>>

$ ls
???.txt

> I don't have that much experience with Python3 to say for sure.


Me neither.

Peter

 
Nobody
      12-01-2010
On Tue, 30 Nov 2010 18:53:14 +0100, Peter Otten wrote:

>> I think this is wrong. In Unix there is no concept of filename
>> encoding. Filenames can have any arbitrary set of bytes (except '/' and
>> '\0'). But the filesystem itself neither knows nor cares about
>> encoding.
>
> I think you misunderstood what I was trying to say. If you write a list of
> filenames into files.txt, and use an encoding (ISO-8859-1, say) other than
> that used by the shell to display file names (on Linux typically UTF-8 these
> days) and then write a Python script exist.py that reads filenames and
> checks for the files' existence,


I think you misunderstood.

In the Unix kernel, there aren't any encodings. Strings of bytes are
/just/ strings of bytes. A text file containing a list of filenames
doesn't /have/ an encoding. The filenames passed to API functions don't
/have/ an encoding.

This is why Unix filenames are case-sensitive: because there isn't any
"case". The number 65 has no more in common with the number 97 than it
does with the number 255. The fact that 65 is the ASCII code for "A" while
97 is the ASCII code for "a" doesn't come into it. Case-insensitive
filenames require knowledge of the encoding in order to determine when
filenames are "equivalent". DOS/Windows tried this and never really got it
right (it works fine on a standalone system, or within later versions of
a Windows-only ecosystem, but becomes a nightmare when files get
transferred between systems via older or non-Microsoft channels).

Python 3.x's decision to treat filenames (and environment variables) as
text even on Unix is, in short, a bug. One which, IMNSHO, will mean that
Python 2.x is still around when Python 4 is released.

 
MRAB
      12-01-2010
On 01/12/2010 01:28, Nobody wrote:
> On Tue, 30 Nov 2010 18:53:14 +0100, Peter Otten wrote:
>
>>> I think this is wrong. In Unix there is no concept of filename
>>> encoding. Filenames can have any arbitrary set of bytes (except '/' and
>>> '\0'). But the filesystem itself neither knows nor cares about
>>> encoding.
>>
>> I think you misunderstood what I was trying to say. If you write a list of
>> filenames into files.txt, and use an encoding (ISO-8859-1, say) other than
>> that used by the shell to display file names (on Linux typically UTF-8 these
>> days) and then write a Python script exist.py that reads filenames and
>> checks for the files' existence,
>
> I think you misunderstood.
>
> In the Unix kernel, there aren't any encodings. Strings of bytes are
> /just/ strings of bytes. A text file containing a list of filenames
> doesn't /have/ an encoding. The filenames passed to API functions don't
> /have/ an encoding.
>
> This is why Unix filenames are case-sensitive: because there isn't any
> "case". The number 65 has no more in common with the number 97 than it
> does with the number 255. The fact that 65 is the ASCII code for "A" while
> 97 is the ASCII code for "a" doesn't come into it. Case-insensitive
> filenames require knowledge of the encoding in order to determine when
> filenames are "equivalent". DOS/Windows tried this and never really got it
> right (it works fine on a standalone system, or within later versions of
> a Windows-only ecosystem, but becomes a nightmare when files get
> transferred between systems via older or non-Microsoft channels).
>
> Python 3.x's decision to treat filenames (and environment variables) as
> text even on Unix is, in short, a bug. One which, IMNSHO, will mean that
> Python 2.x is still around when Python 4 is released.
>

If the filenames are to be shown to a user then there needs to be a
mapping between bytes and glyphs. That's an encoding. If different
users use different encodings then exchange of textual data becomes
difficult. That's where encodings which can be used globally come in.
By the time Python 4 is released I'd be surprised if Unix hadn't
standardised on a single encoding like UTF-8.
 
Albert Hopkins
      12-01-2010
On Wed, 2010-12-01 at 02:14 +0000, MRAB wrote:
> If the filenames are to be shown to a user then there needs to be a
> mapping between bytes and glyphs. That's an encoding. If different
> users use different encodings then exchange of textual data becomes
> difficult.


That's presentation, and that's a separate matter. Indeed, I have my user
encoding set to UTF-8, and if there is a filename that's not valid UTF-8
then my GUI (GNOME) will show "(invalid encoding)" and even allow me to
rename it, and my shell (bash) will show '?' in place of the invalid
"characters" (and make it a little more challenging to rename). And I can
freely copy these "invalid" files across different (Unix) systems, because
the OS doesn't care about encoding.

But that's completely different from the actual name of the file. Unix
doesn't care about presentation in filenames. It just cares about the
data. There are no "glyphs" in Unix, only in the UI that runs on top
of it.

Or to put it another way, Unix's filename encoding is RAW-DATA. It's
not "textual" data. The fact that most filenames contain mainly
human-readable text is a convenient convention, but not required or
enforced by the OS.

> That's where encodings which can be used globally come in.
> By the time Python 4 is released I'd be surprised if Unix hadn't
> standardised on a single encoding like UTF-8.


I have serious doubts about that. At least in the Linux world the
kernel wants to stay out of encoding debates (except where it has to,
as with Windows filesystems). But the point is that:

The world does not revolve around Python. Unix filenames have been
encoding-agnostic long before Python was around. If Python3 does not
support this then it's a regression on Python's part.


 
Martin v. Loewis
      12-01-2010
> The world does not revolve around Python. Unix filenames have been
> encoding-agnostic long before Python was around. If Python3 does not
> support this then it's a regression on Python's part.


Fortunately, Python 3 does support that.

Regards,
Martin
 
Nobody
      12-01-2010
On Wed, 01 Dec 2010 02:14:09 +0000, MRAB wrote:

> If the filenames are to be shown to a user then there needs to be a
> mapping between bytes and glyphs. That's an encoding. If different
> users use different encodings then exchange of textual data becomes
> difficult.


OTOH, the exchange of binary data is unaffected. In the worst case, users
see a few wrong glyphs, but the software doesn't care.

> That's where encodings which can be used globally come in.
> By the time Python 4 is released I'd be surprised if Unix hadn't
> standardised on a single encoding like UTF-8.


That's probably not a serious option in parts of the world which don't use
a Latin-based alphabet, i.e. outside western Europe and its former
colonies. In countries with non-Latin alphabets, existing encodings are
often too heavily entrenched.

There's also a lot of legacy software which can only handle unibyte
encodings, and not much incentive to fix it if 98% of your market can get
by with an ISO-8859-<whatever> locale (making software work in e.g. CJK
locales often requires a lot more work than just dealing with encodings).

And it doesn't help that Windows has negligible support for UTF-8. It's
either UTF-16-LE (i.e. the in-memory format dumped directly to file) or
one of Microsoft's non-standard encodings. At least the latter are mostly
compatible with the corresponding ISO-8859-* encoding.

Finally, ISO-8859-* encoding/decoding can't fail. The result might
be complete gibberish, but converting to gibberish then back to bytes
won't lose information.
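A minimal check of that round-trip property in Python, shown for ISO-8859-1 only. Python's codecs for some other ISO-8859 parts leave a few byte values undefined, so the claim should not be assumed to hold for every part under Python's implementation. UTF-8 is included for contrast:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# ISO-8859-1 maps each byte to exactly one code point, so decoding always
# succeeds and encoding the result gives back the original bytes.
as_text = all_bytes.decode("iso-8859-1")
lossless = as_text.encode("iso-8859-1") == all_bytes

# The same filename bytes are not valid UTF-8, so a strict decode fails.
try:
    b"\xe4\xf6\xfc.txt".decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(lossless, utf8_ok)
```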

 
Antoine Pitrou
      12-01-2010
On Tue, 30 Nov 2010 22:22:01 -0500
Albert Hopkins wrote:
> And I can freely copy
> these "invalid" files across different (Unix) systems, because the OS
> doesn't care about encoding.


And so can Python, thanks to PEP 383.

> > That's where encodings which can be used globally come in.
> > By the time Python 4 is released I'd be surprised if Unix hadn't
> > standardised on a single encoding like UTF-8.
>
> I have serious doubts about that. At least in the Linux world the
> kernel wants to stay out of encoding debates (except where it has to,
> as with Windows filesystems).


That doesn't matter. Vendors (Linux distributions) have to make a
choice and that choice will probably standardize on UTF-8 in most
situations. The kernel won't have a say, since it doesn't care
about encodings anyway.

> The world does not revolve around Python. Unix filenames have been
> encoding-agnostic long before Python was around. If Python3 does not
> support this then it's a regression on Python's part.


Python 3 does support it, see other messages about using bytes
filenames.

Regards

Antoine.


 
Peter Otten
      12-01-2010
Nobody wrote:

> Python 3.x's decision to treat filenames (and environment variables) as
> text even on Unix is, in short, a bug. One which, IMNSHO, will mean that
> Python 2.x is still around when Python 4 is released.


For filenames in Python 3 the user has the choice between "text" (str) and
bytes. If the user chooses text, it will be converted to bytes using a
default encoding that hopefully matches that of the other tools on the
machine that manipulate filenames.

I see that you may run into problems with the text approach when you
encounter byte sequences that are illegal in the chosen encoding.
I therefore expect that low-level tools will use bytes to manipulate
filenames while end user scripts will choose text.
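To illustrate the str/bytes choice, here is a small sketch using os.listdir, which returns names of the same type as its argument. It assumes Python 3.2 or later (for os.fsencode) and a filesystem that accepts arbitrary byte names, as is typical on Linux:

```python
import os
import tempfile

raw_name = b"\xe4\xf6\xfc.txt"  # not valid UTF-8

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(os.fsencode(tmp), raw_name), "wb"):
        pass
    # A bytes argument gives bytes names back, untouched.
    names_as_bytes = os.listdir(os.fsencode(tmp))
    # A str argument gives str names back, decoded with the
    # filesystem encoding and its error handler.
    names_as_text = os.listdir(tmp)
    # os.fsencode reverses that decoding to recover the raw name.
    recovered = [os.fsencode(n) for n in names_as_text]

print(names_as_bytes, recovered)
```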

I don't see how a dogmatic bytes only restriction can improve the situation.

Also, you can already provide unicode filenames in Python 2.x (and a script
containing constant filenames becomes more portable if you do), so IMHO the
situation in Python 2 and 3 is similar enough as to not hinder adoption of
3.x.

Peter

 
Nobody
      12-01-2010
On Wed, 01 Dec 2010 10:34:24 +0100, Peter Otten wrote:

>> Python 3.x's decision to treat filenames (and environment variables) as
>> text even on Unix is, in short, a bug. One which, IMNSHO, will mean that
>> Python 2.x is still around when Python 4 is released.
>
> For filenames in Python 3 the user has the choice between "text" (str) and
> bytes. If the user chooses text that will be converted to bytes using a
> default encoding that hopefully matches that of the other tools on the
> machine that manipulate filenames.


However, sys.argv and os.environ are automatically converted to text. If
you want bytes, you have to convert them back explicitly.

Also, I'm unsure as to how far the choice between bytes and str will
extend beyond the core modules.

> I see that you may run into problems with the text approach when you
> encounter byte sequences that are illegal in the chosen encoding.


This was actually a critical flaw in Python 3.0, as it meant that
filenames which weren't valid in the locale's encoding simply couldn't be
passed via argv or environ. 3.1 fixed this using the "surrogateescape"
encoding, so now it's only an annoyance (i.e. you can recover the original
bytes once you've spent enough time digging through the documentation).
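For reference, the recovery step is short once the error handler's name is known. The sketch below applies "surrogateescape" explicitly with UTF-8, which is an assumption about the locale; for sys.argv and os.environ values, os.fsencode() (Python 3.2 and later) performs the same conversion using the actual filesystem encoding:

```python
raw = b"\xe4\xf6\xfc.txt"

# Undecodable bytes become lone surrogates U+DC80..U+DCFF instead of
# raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into the
# original bytes.
back = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogates, which is why such strings
# cannot be printed or written out without extra care.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(text), back, strict_ok)
```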

There could be a problem with encodings which aren't invertible (e.g.
ISO-2022), but those tend to be quite rare and Python flat-out doesn't
support those as system encodings anyhow.


 
Peter Otten
      12-02-2010
Nobody wrote:

> This was actually a critical flaw in Python 3.0, as it meant that
> filenames which weren't valid in the locale's encoding simply couldn't be
> passed via argv or environ. 3.1 fixed this using the "surrogateescape"
> encoding, so now it's only an annoyance (i.e. you can recover the original
> bytes once you've spent enough time digging through the documentation).


Is it just that you need to harden your scripts against these byte sequences
or do you actually encounter them? If the latter, can you give some
examples?
 