os.listdir, gets unicode, returns unicode... USUALLY?!?!?

 
 
gabor
      11-16-2006
hi,

from the documentation (http://docs.python.org/lib/os-file-dir.html) for
os.listdir:

"On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
will be a list of Unicode objects."

i'm on Unix. (linux, ubuntu edgy)

so it seems that it does not always return unicode filenames.

it seems that it tries to interpret the filenames using the filesystem's
encoding, and if that fails, it simply returns the filename as byte-string.

so you get back let's say an array of 21 filenames, from which 3 are
byte-strings, and the rest unicode strings.

after digging around, i found this in the source code:

> #ifdef Py_USING_UNICODE
>     if (arg_is_unicode) {
>         PyObject *w;
>
>         w = PyUnicode_FromEncodedObject(v,
>                 Py_FileSystemDefaultEncoding,
>                 "strict");
>         if (w != NULL) {
>             Py_DECREF(v);
>             v = w;
>         }
>         else {
>             /* fall back to the original byte string, as
>                discussed in patch #683592 */
>             PyErr_Clear();
>         }
>     }
> #endif


so if the to-unicode-conversion fails, it falls back to the original
byte-string. i went and read the patch-discussion.
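For readers who do not read C, the fallback described above can be sketched in Python. This is only an illustration, not the actual implementation: the helper name `decode_or_keep` and the sample file names are made up, and the sketch is written for Python 3, where the two types are `bytes` and `str` (in the Python 2.4 of this thread they would be `str` and `unicode`).

```python
def decode_or_keep(raw_name, encoding):
    """Return the decoded name if strict decoding works, else the raw bytes."""
    try:
        return raw_name.decode(encoding, "strict")
    except UnicodeDecodeError:
        # Same effect as PyErr_Clear() in the C snippet: the error is
        # swallowed and the original byte string is handed back.
        return raw_name


# Simulated directory entries: two valid UTF-8 names and one Latin-1 name.
raw_names = [b"plain.txt", u"\u732b.txt".encode("utf-8"), b"caf\xe9.txt"]

# The result is a mixed list: text for the first two, bytes for the last.
names = [decode_or_keep(n, "utf-8") for n in raw_names]
```

The last entry stays a byte string because `\xe9` followed by `.` is not valid UTF-8, which is exactly the situation that produces a mixed list.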

and now i'm not sure what to do.
i know that:

1. the documentation is completely wrong. it does not always return
unicode filenames
2. it's true that the documentation does not specify what happens if the
filename is not in the filesystem-encoding, but i simply expected that i
get an Unicode-exception, as everywhere else. you see, exceptions are
ok, i can deal with them. but this is just plain wrong. from now on,
EVERYWHERE where i use os.listdir, i will have to go through all the
filenames in it, and check if they are unicode-strings or not.

so basically i'd like to ask here: am i reading something incorrectly?
or am i using os.listdir the "wrong way"? how do other people deal with
this?

p.s: one additional note. if your code expects os.listdir to return
unicode, that usually means that all your code uses unicode strings.
which in turn means, that those filenames will somehow later interact
with unicode strings. which means that that byte-string-filename will
probably get auto-converted to unicode at a later point, and that
auto-conversion will VERY probably fail, because the auto-convert only
happens using 'ascii' as the encoding, and if it was not possible to
decode the filename inside listdir, it's quite probable that it also
will not work using 'ascii' as the charset.


gabor
 
Terry Reedy
      11-16-2006

"gabor" wrote:
> so if the to-unicode-conversion fails, it falls back to the original
> byte-string. i went and have read the patch-discussion.
>
> and now i'm not sure what to do.
> i know that:
>
> 1. the documentation is completely wrong. it does not always return
> unicode filenames


Unless someone says otherwise, report the discrepancy between the doc and
the code as a bug on the SF tracker. I have no idea what the resolution
should be.

tjr



 
Martin v. Löwis
      11-16-2006
gabor wrote:
> so basically i'd like to ask here: am i reading something incorrectly?


You are reading it correctly. This is how it behaves.

> or am i using os.listdir the "wrong way"? how do other people deal with
> this?


You didn't say why the behavior causes a problem for you - you only
explained what the behavior is.

Most people use os.listdir in a way like this:

    for name in os.listdir(path):
        full = os.path.join(path, name)
        attrib = os.stat(full)
        if some_condition:  # placeholder for whatever test you apply
            f = open(full)
            ...

All this code will typically work just fine with the current behavior,
so people typically don't see any problem.

Regards,
Martin
 
gabor
      11-16-2006
Martin v. Löwis wrote:
> gabor wrote:
>
>> or am i using os.listdir the "wrong way"? how do other people deal with
>> this?

>
> You didn't say why the behavior causes a problem for you - you only
> explained what the behavior is.
>
> Most people use os.listdir in a way like this:
>
>     for name in os.listdir(path):
>         full = os.path.join(path, name)
>         attrib = os.stat(full)
>         if some-condition:
>             f = open(full)
>             ...
>
> All this code will typically work just fine with the current behavior,
> so people typically don't see any problem.
>


i am sorry, but it will not work. actually this is exactly what i did,
and it did not work. it dies in the os.path.join call, where file_name
is converted into unicode. and python uses 'ascii' as the charset in
such cases. but, because listdir already failed to decode the file_name
with the filesystem-encoding, it usually also fails when tried with 'ascii'.

example:

>>> dir_name = u'something'
>>> unicode_file_name = u'\u732b.txt' # the japanese cat-symbol
>>> bytestring_file_name = unicode_file_name.encode('utf-8')
>>>
>>>
>>> import os.path
>>>
>>> os.path.join(dir_name,unicode_file_name)

u'something/\u732b.txt'
>>>
>>>
>>> os.path.join(dir_name,bytestring_file_name)

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/posixpath.py", line 65, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1:
ordinal not in range(128)
>>>



gabor
 
Martin v. Löwis
      11-16-2006
gabor wrote:
>> All this code will typically work just fine with the current behavior,
>> so people typically don't see any problem.
>>

>
> i am sorry, but it will not work. actually this is exactly what i did,
> and it did not work. it dies in the os.path.join call, where file_name
> is converted into unicode. and python uses 'ascii' as the charset in
> such cases. but, because listdir already failed to decode the file_name
> with the filesystem-encoding, it usually also fails when tried with
> 'ascii'.


Ah, right. So yes, it will typically fail immediately - just as you
wanted it to do, anyway; the advantage with this failure is that you
can also find out what specific file name is causing the problem
(whereas when listdir failed completely, you could not easily find
out the cause of the failure).

How would you propose listdir should behave?

Regards,
Martin
 
Fredrik Lundh
      11-17-2006
gabor wrote:

> get an Unicode-exception, as everywhere else. you see, exceptions are
> ok, i can deal with them.


> p.s: one additional note. if you code expects os.listdir to return
> unicode, that usually means that all your code uses unicode strings.
> which in turn means, that those filenames will somehow later interact
> with unicode strings. which means that that byte-string-filename will
> probably get auto-converted to unicode at a later point, and that
> auto-conversion will VERY probably fail


it will raise an exception, most likely. didn't you just say that
exceptions were ok?

</F>

 
Laurent Pointal
      11-17-2006
gabor wrote:
> hi,
>
> from the documentation (http://docs.python.org/lib/os-file-dir.html) for
> os.listdir:
>
> "On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
> will be a list of Unicode objects."


Maybe, for each filename, you can test if it is an unicode string, and
if not, convert it to unicode using the encoding indicated by
sys.getfilesystemencoding().

Have a try.

A+

Laurent.
 
Johan von Boisman
      11-17-2006
Laurent Pointal wrote:
> gabor wrote:
>> hi,
>>
>> from the documentation (http://docs.python.org/lib/os-file-dir.html) for
>> os.listdir:
>>
>> "On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
>> will be a list of Unicode objects."

>
> Maybe, for each filename, you can test if it is an unicode string, and
> if not, convert it to unicode using the encoding indicated by
> sys.getfilesystemencoding().
>
> Have a try.
>
> A+
>
> Laurent.


Strange coincidence, as I was wrestling with this problem only yesterday.

I found this most illuminating discussion on the topic with
contributions from Mr Löwis and others:

http://www.thescripts.com/forum/thread41954.html

/johan
 
gabor
      11-17-2006
Laurent Pointal wrote:
> gabor wrote:
>> hi,
>>
>> from the documentation (http://docs.python.org/lib/os-file-dir.html) for
>> os.listdir:
>>
>> "On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
>> will be a list of Unicode objects."

>
> Maybe, for each filename, you can test if it is an unicode string, and
> if not, convert it to unicode using the encoding indicated by
> sys.getfilesystemencoding().
>

i don't think it would work. because os.listdir already tried, and
failed (that's why we got a byte-string and not an unicode-string)

gabor
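A small sketch of the point made above, assuming Python 3 (`bytes` versus `str`) and a made-up Latin-1 file name: a second strict decode with the same encoding fails again. If a name is needed only for display, one option not proposed in the thread is to decode with the `"replace"` error handler. That is lossy, so the result can no longer be used to open the file. The helper name `to_text` is hypothetical.

```python
import sys


def to_text(name, encoding=None):
    """Return name as text; undecodable bytes become U+FFFD (lossy)."""
    if not isinstance(name, bytes):
        return name  # already text, nothing to do
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return name.decode(encoding, "replace")


bad = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# Retrying the strict decode that listdir already attempted fails again.
try:
    bad.decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lossy, display-only version of the name.
display = to_text(bad, "utf-8")
```

Keeping the original byte string alongside the display form is the safer design if the file still has to be opened later.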
 
Leo Kislov
      11-17-2006

Martin v. Löwis wrote:
> gabor wrote:
> >> All this code will typically work just fine with the current behavior,
> >> so people typically don't see any problem.
> >>

> >
> > i am sorry, but it will not work. actually this is exactly what i did,
> > and it did not work. it dies in the os.path.join call, where file_name
> > is converted into unicode. and python uses 'ascii' as the charset in
> > such cases. but, because listdir already failed to decode the file_name
> > with the filesystem-encoding, it usually also fails when tried with
> > 'ascii'.

>
> Ah, right. So yes, it will typically fail immediately - just as you
> wanted it to do, anyway; the advantage with this failure is that you
> can also find out what specific file name is causing the problem
> (whereas when listdir failed completely, you could not easily find
> out the cause of the failure).
>
> How would you propose listdir should behave?


How about returning two lists, first list contains unicode names, the
second list contains undecodable names:

files, troublesome = os.listdir(separate_errors=True)

and make separate_errors=True by default in python 3.0 ?

-- Leo
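The two-list idea can be approximated in user code today. The following is a rough sketch, assuming Python 3; `os.listdir` has no `separate_errors` parameter, and the helper names `split_names` and `listdir_separated` are invented for illustration. Passing a bytes path makes `os.listdir` return raw byte-string names, which are then decoded strictly one by one.

```python
import os
import sys


def split_names(raw_names, encoding):
    """Split byte-string names into (decoded text names, undecodable names)."""
    decoded, troublesome = [], []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding, "strict"))
        except UnicodeDecodeError:
            troublesome.append(raw)  # keep the raw bytes for the caller
    return decoded, troublesome


def listdir_separated(path, encoding=None):
    """List path and return (text names, undecodable byte-string names)."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    # A bytes path makes os.listdir return byte-string names.
    raw_names = os.listdir(os.fsencode(path))
    return split_names(raw_names, encoding)
```

A caller would write `files, troublesome = listdir_separated(some_dir)` and handle the second list explicitly, so that no list ever mixes the two types.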

 