Re: Unicode File Names

 
 
John Machin
      10-17-2008
On Oct 17, 11:43 am, Jordan <jordan.tayl...@gmail.com> wrote:
> I've got a bunch of files with Japanese characters in their names and
> os.listdir() replaces those characters with ?'s. I'm trying to open
> the files several steps later, and obviously Python isn't going to
> find '01-????.jpg' (formerly '01-ひらがな.jpg') because it doesn't exist.
> I'm not sure where in the process I'm able to stop that from
> happening. Thanks.


The Fine Manual says:
"""
listdir( path)

Return a list containing the names of the entries in the directory.
The list is in arbitrary order. It does not include the special
entries '.' and '..' even if they are present in the directory.
Availability: Macintosh, Unix, Windows.
Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a
Unicode object, the result will be a list of Unicode objects.
"""

Are you unsure whether your version of Python is 2.3 or later?
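
A minimal sketch of the rule quoted from the manual, written for Python 3 (an assumption, since the thread is about 2.5), where the same idea survives in a different spelling: a str path gives str names and a bytes path gives bytes names. The temporary directory, the helper list_names and the file name are made up for illustration, and the file system is assumed to accept the name.

```python
import os
import tempfile

def list_names(directory):
    """Return the directory's entries once as text and once as raw bytes."""
    text_names = sorted(os.listdir(directory))               # str in -> str out
    byte_names = sorted(os.listdir(os.fsencode(directory)))  # bytes in -> bytes out
    return text_names, byte_names

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file, named after the example in the question.
    with open(os.path.join(tmp, "01-\u3072\u3089\u304c\u306a.jpg"), "w"):
        pass
    text_names, byte_names = list_names(tmp)
    print(ascii(text_names))
    print(byte_names)
```

The text names can be passed straight back to open(); nothing has been replaced by question marks.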


 
John Machin
      10-17-2008
On Oct 17, 12:52 pm, Jordan <jordan.tayl...@gmail.com> wrote:
> On Oct 16, 9:20 pm, John Machin <sjmac...@lexicon.net> wrote:
>
>
>
> > On Oct 17, 11:43 am, Jordan <jordan.tayl...@gmail.com> wrote:

>
> > > I've got a bunch of files with Japanese characters in their names and
> > > os.listdir() replaces those characters with ?'s. I'm trying to open
> > > the files several steps later, and obviously Python isn't going to
> > > find '01-????.jpg' (formerly '01-ひらがな.jpg') because it doesn't exist.
> > > I'm not sure where in the process I'm able to stop that from
> > > happening. Thanks.

>
> > The Fine Manual says:
> > """
> > listdir( path)

>
> > Return a list containing the names of the entries in the directory.
> > The list is in arbitrary order. It does not include the special
> > entries '.' and '..' even if they are present in the directory.
> > Availability: Macintosh, Unix, Windows.
> > Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a
> > Unicode object, the result will be a list of Unicode objects.
> > """

>
> > Are you unsure whether your version of Python is 2.3 or later?

>
> *** Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32
> bit (Intel)] on win32. *** says my interpreter
>
> when it says "if path is a Unicode object...", does that mean the path
> name must have a Unicode char?


If path is a Unicode [should read unicode] object of length > 0, then
*all* characters in path are by definition unicode characters.

Where are you getting your path from? If you are doing
os.listdir(r'c:\test') then do os.listdir(ur'c:\test'). If you are
getting it from the command line or somehow else as a variable, instead
of os.listdir(path), try os.listdir(unicode(path)). If that fails with a
message like "UnicodeDecodeError: 'ascii' codec can't decode .....",
then you'll need something like
os.listdir(unicode(path, encoding='cp1252')) # cp1252 being the most likely suspect

I strongly suggest that you read this:
http://www.amk.ca/python/howto/unicode
which contains lots of useful information, including an answer to your
original question.
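
A rough sketch of the "try ascii, then fall back to an explicit code page" advice above. It is written for Python 3, where unicode(path, encoding=...) is spelled bytes.decode(...). The helper name to_text and the list of candidate encodings are assumptions for illustration only; cp1252 is just the "most likely suspect", not a universal answer.

```python
def to_text(raw_path, encodings=("ascii", "cp1252")):
    """Decode a byte-string path, trying each candidate encoding in turn."""
    for encoding in encodings:
        try:
            return raw_path.decode(encoding)
        except UnicodeDecodeError:
            continue  # this codec cannot decode the bytes; try the next one
    raise ValueError("none of %r could decode %r" % (encodings, raw_path))

plain = to_text(b"c:\\test")         # pure ASCII, decodes at the first attempt
accented = to_text(b"c:\\caf\xe9")   # 0xE9 is not ASCII, so cp1252 is used
print(plain)
print(ascii(accented))
```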
 
John Machin
      10-17-2008
On Oct 17, 2:56 pm, Jordan <jordan.tayl...@gmail.com> wrote:
> I'm not quite sure now if the problem is me, windows, or zipfile
> (which I kinda failed to mention before). Using
> os.listdir(unicode(os.listdir()))


You mean os.listdir(unicode(os.getcwd())), I presume.


> seems to have been a step in the
> right direction (thanks Chris and John). When testing things in the
> python interpreter, I don't seem to hit issues after using the above
> mentioned line.
>
>
> >>> l = os.listdir(unicode(os.getcwd()))
> >>> l
>
> u'01-\u3072\u3089\u304c\u306a.jpg'
> u'02-\u3072\u3089\u304c\u306a.jpg'
> u'03-\u3072\u3089\u304c\u306a.jpg'
>
> >>> for thing in l:
> ...     print thing
> 01-ひらがな.jpg
> 02-PLACEHOLDER
>
>
> Yay.
>
> Having a file that tries "for thing in l: print thing" fails with:


>
> File "C:\Python25\Lib\encodings\cp437.py", line 12, in encode
> return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode characters in
> position 13-16: character maps to <undefined>
>
> I'm perfectly willing to let command prompt refuse to print that (it's
> debugging only) if the next issue was resolved >_>:


use print repr(thing) for debugging.
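
Why this helps, as a small sketch: the escaped form is pure ASCII, so no console code page can choke on it, whereas encoding the name for a cp437 console is exactly what raises the error in the traceback above. This is written for Python 3 (an assumption), where ascii() plays the role that repr() plays for unicode objects in 2.x.

```python
name = "01-\u3072\u3089\u304c\u306a.jpg"  # the file name from the thread

# ascii() escapes every non-ASCII character, so any console can display it.
safe = ascii(name)
print(safe)

# Encoding for a cp437 console fails, because cp437 has no hiragana.
try:
    name.encode("cp437")
    encodable = True
except UnicodeEncodeError as exc:
    encodable = False
    print("cp437 cannot encode characters", exc.start, "to", exc.end - 1)
```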

>
> """
> Note: There is no official file name encoding for ZIP files. If you
> have unicode file names, please convert them to byte strings in your
> desired encoding before passing them to write(). WinZip interprets all
> file names as encoded in CP437, also known as DOS Latin.
> """
>
> I'm simply not sure what this means and how to deal with it.


Step 1:
Read appendix D of http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Step 2:
Note the change history at the start of that document:
"""
6.3.0 -Added tape positioning storage 09/29/2006
parameters
[snip]
-Added option for Unicode filename
storage
"""

Step 3: Read http://bugs.python.org/issue1734346

Step 4: Either wait for Python 2.7 or apply the patch to your own copy
of zipfile ...
 
Martin v. Löwis
      10-17-2008
> Step 4: Either wait for Python 2.7 or apply the patch to your own copy
> of zipfile ...


Actually, this is released in Python 2.6, see r62724.
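
For illustration only, a sketch of what the Unicode file name support looks like from the caller's side. It assumes a current Python 3 zipfile module rather than 2.6, and the helper roundtrip and the member names are invented for the example. The constant 0x800 is general purpose bit 11, the language encoding flag added in APPNOTE.TXT 6.3.0.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: file name is stored as UTF-8

def roundtrip(names):
    """Write empty members with the given names, then read the archive back."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, b"")
    with zipfile.ZipFile(buffer) as archive:
        return {info.filename: bool(info.flag_bits & UTF8_FLAG)
                for info in archive.infolist()}

flags = roundtrip(["plain.jpg", "01-\u3072\u3089\u304c\u306a.jpg"])
for name, uses_utf8 in flags.items():
    print(ascii(name), uses_utf8)
```

The Japanese name comes back intact and carries the UTF-8 flag; the ASCII name does not need it.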

Regards,
Martin
 
John Machin
      10-17-2008
On Oct 17, 6:32 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> > Step 4: Either wait for Python 2.7 or apply the patch to your own copy
> > of zipfile ...

>
> Actually, this is released in Python 2.6, see r62724.


Hi Martin,

That's good. I was led astray by the fact that the 2.6 docs still
contain the note that the OP asked about: "There is no official file
name encoding for ZIP files. If you have unicode file names, you must
convert them to byte strings in your desired encoding before passing
them to write(). WinZip interprets all file names as encoded in CP437,
also known as DOS Latin."

The first sentence was and is bafflegab, the second didn't mention the
portability issues arising from its suggestion (and is now not true),
and the third needs explanation or omission. I believe that WinZip has
supported utf8 since v11.2.

Should the note be removed, or should it say something like "Unicode
file names are supported. New in Python 2.6."? Is there anything else
that should be mentioned?

More on cp437: I see where you mentioned to the patch author that a
unicode string should be encoded in cp437 if possible, but this was
not done -- it first tries ascii. What are your views on what encoding
should be assumed if the utf8 flag is not set?

Cheers,
John
 
Mark Tolonen
      10-17-2008

"Jordan" <> wrote in message
news:311aa0af-6acd-45b6-b89b-...
> >>> l = os.listdir(unicode(os.getcwd()))


Other options to get the same result:

l = os.listdir(os.getcwdu())
l = os.listdir(u'.')

Oddly, os.getcwd() and os.getcwdu() both still exist in Python 3.0. Since
the behavior is now identical it seems os.getcwdu() should be dropped.

-Mark

 
Martin v. Löwis
      10-18-2008
> Should the note be removed, or should it say something like "Unicode
> file names are supported. New in Python 2.6."? Is there anything else
> that should be mentioned?


The note should be corrected, documenting the behaviour implemented.

> More on cp437: I see where you mentioned to the patch author that a
> unicode string should be encoded in cp437 if possible, but this was
> not done -- it first tries ascii. What are your views on what encoding
> should be assumed if the utf8 flag is not set?


There isn't any standard that is widely followed (just as the note that
you declared bafflegab says). While APPNOTE.TXT specifies it as cp437,
implementations often ignore that, because a) they didn't know, and b)
cp437 was too limited for what they want to do. So we see all kinds of
alternative implementations - often involving the locale's code page
(and on Windows, both OEMCP and ACP get used - often just as a side
effect of whatever internal representation the applications use).
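
To make the ambiguity concrete, a small sketch. It assumes cp437 stands in for the OEM code page and cp1252 for the ANSI code page, which is only true of some Windows installations; the file name is invented.

```python
name = "caf\u00e9.txt"

as_oem = name.encode("cp437")    # what an OEM-code-page writer would store
as_ansi = name.encode("cp1252")  # what an ANSI-code-page writer would store

# A reader that guesses the wrong code page gets a different name back.
misread_oem = as_oem.decode("cp1252")
misread_ansi = as_ansi.decode("cp437")
print(as_oem, as_ansi)
print(ascii(misread_oem), ascii(misread_ansi))
```

The two writers store different bytes for the same name, and nothing in the archive says which one was used.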

In 2.x, Python doesn't need to decide, so when opening a zip file, the
file names get reported as byte strings unless they have the UTF-8
bit set (in which case they get decoded). In 3.x, file names (in the
zipfile module) uniformly use the (unicode) character string type, hence
that version implements the spec, by decoding as 437.
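
The 3.x reading rule described in this paragraph can be sketched as follows. decode_member_name is a hypothetical helper written for this example, not a function of the zipfile module.

```python
UTF8_FLAG = 0x800  # general purpose bit 11 in the ZIP header

def decode_member_name(raw_name, flag_bits):
    """Decode a stored member name the way the paragraph describes for 3.x."""
    if flag_bits & UTF8_FLAG:
        return raw_name.decode("utf-8")
    return raw_name.decode("cp437")  # the APPNOTE.TXT default

stored = "caf\u00e9.txt".encode("utf-8")
with_flag = decode_member_name(stored, UTF8_FLAG)  # flag set: read as UTF-8
without_flag = decode_member_name(stored, 0)       # flag clear: read as cp437
print(ascii(with_flag))
print(ascii(without_flag))
```

Without the flag, the same UTF-8 bytes are read as cp437 and come back as a different, wrong name, which is why the flag matters to the receiver.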

Upon encoding, choosing between ASCII and CP437 has trade-offs. Notice
how both formally comply with the spec, as ASCII is a subset of
CP437 (i.e. even though it uses the ASCII codec, it *still* encodes
as CP437). The tradeoffs can be studied by looking at three groups
of file names:
- pure ASCII; choice does not matter (both ascii and cp437 can
  encode the file name, and both get the same result)
- arbitrary string containing non-CP437 characters; choice does
  not matter (neither ascii nor cp437 can encode, so the UTF-8
  bit must be used)
- others; here are the tradeoffs. Pro ASCII: receiver can unambiguously
  reproduce the original file name, as the UTF-8 bit will be set.
  Pro CP437: old software (unaware of the UTF-8 bit) has a chance
  of correctly guessing the file name (if it followed APPNOTE.TXT).
I (now) prefer the tradeoff being taken, as it's the one that
produces more reliable results in the long run (i.e. when more
and more zip readers support UTF-8).
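
The three groups above can be sketched like this. Both helpers are hypothetical and only mimic the ASCII-first rule discussed here; they are not the zipfile implementation, and the sample names are invented.

```python
def encode_member_name(name):
    """Return (bytes, utf8_flag) using the ASCII-first rule."""
    try:
        return name.encode("ascii"), False
    except UnicodeEncodeError:
        return name.encode("utf-8"), True

def cp437_could_encode(name):
    """Report whether the alternative choice, cp437, could have been used."""
    try:
        name.encode("cp437")
        return True
    except UnicodeEncodeError:
        return False

samples = ["plain.jpg", "01-\u3072\u3089\u304c\u306a.jpg", "caf\u00e9.txt"]
for name in samples:
    raw, utf8_flag = encode_member_name(name)
    print(ascii(name), utf8_flag, cp437_could_encode(name))
```

Only the last sample belongs to the "others" group: cp437 could have encoded it, but the ASCII-first rule stores it as UTF-8 with the flag set.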

Regards,
Martin
 
John Machin
      10-18-2008
On Oct 18, 5:57 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> > Should the note be removed, or should it say something like "Unicode
> > file names are supported. New in Python 2.6."? Is there anything else
> > that should be mentioned?

>
> The note should be corrected, documenting the behaviour implemented.
>
> > More on cp437: I see where you mentioned to the patch author that a
> > unicode string should be encoded in cp437 if possible, but this was
> > not done -- it first tries ascii. What are your views on what encoding
> > should be assumed if the utf8 flag is not set?

>


[lots of enlightenment snipped]

Thanks heaps, Martin.
Cheers,
John
 
Martin v. Löwis
      10-18-2008
> Oddly, os.getcwd() and os.getcwdu() both still exist in Python 3.0.
> Since the behavior is now identical it seems os.getcwdu() should be
> dropped.


It is dropped, and os.getcwdb() has been added.

Regards,
Martin
 
Mark Tolonen
      10-18-2008

""Martin v. Löwis"" <> wrote in message
news:48f9de43$0$5124$...
>> Oddly, os.getcwd() and os.getcwdu() both still exist in Python 3.0.
>> Since the behavior is now identical it seems os.getcwdu() should be
>> dropped.

>
> It is dropped, and os.getcwdb() has been added.


Must be changed post 3.0rc1, but I seem to remember reading about that now
in another thread:

Python 3.0rc1 (r30rc1:66507, Sep 18 2008, 14:47:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> [s for s in dir(os) if 'cwd' in s]

['getcwd', 'getcwdu']
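
For comparison, the same check as a script, assuming a current Python 3 release rather than 3.0rc1. The variable names are made up for the example.

```python
import os

cwd_functions = sorted(s for s in dir(os) if "cwd" in s)
print(cwd_functions)

text_cwd = os.getcwd()    # str, i.e. Unicode
byte_cwd = os.getcwdb()   # the same directory as bytes
print(type(text_cwd).__name__, type(byte_cwd).__name__)
```

On such a release getcwdu is gone and getcwdb is present.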

-Mark