Creating UNICODE filenames with PERL 5.8

 
 
Allan Yates
Guest
Posts: n/a
 
      11-17-2003
I have been having distinct trouble creating file names in PERL
containing UNICODE characters. I am running ActiveState PERL 5.8 on
Windows 2000.

For a simple test, I picked a UNICODE character that could be
displayed by Windows Explorer. I can select the character(U+0636) from
'charmap' and cut/paste into a filename on Windows Explorer and the
character displays the same as it does in 'charmap'. This proves that
I have the font available.

When I attempt to create the same filename with PERL, I end up with a
filename two characters long: Ø¶

If somebody could point me in the correct direction, I would very much
appreciate it. I have read the UNICODE documents included with PERL as
well as searching the newsgroups and the web, and everything appears to
indicate this should work.

Perl program:

$name = chr(0x0636);

if (!open(FILE,">uni_names/$name")) {
    print STDERR "Could not open ($!): $name\n";
}

close (FILE);


Thanks,

Allan.
a y a t e s a t s i g n i a n t d o t c o m
 
 
 
 
 
Alan J. Flavell
Guest
Posts: n/a
 
      11-17-2003
On Mon, 17 Nov 2003, Allan Yates wrote:

> I have been having distinct trouble creating file names in PERL
> containing UNICODE characters. I am running ActiveState PERL 5.8 on
> Windows 2000.


N.B. I have limited expertise in this specific area, but some of the
locals around here seem to look to me to answer Unicode questions of
any kind, so I'll give it a try, as long as you take the answers with
the necessary grains of salt...

First important question is - have you set the option for wide
character API in system calls?

> For a simple test, I picked a UNICODE character that could be
> displayed by Windows Explorer. I can select the character(U+0636) from


that'd be Arabic letter DAD, right?

Its utf-8 representation will be two octets: 0xd8, 0xb6.

> 'charmap' and cut/paste into a filename on Windows Explorer and the
> character displays the same as it does in 'charmap'. This proves that
> I have the font available.


(I think that's the least of your worries at the moment...)

> When I attempt to create the same filename with PERL, I end up with a
> filename two characters long: Ø¶


Those look like 0xd8 and 0xb6 to me...

At a quick glance, I suspect we are seeing the pair of octets that
represent the character in utf-8 (Perl's internal representation)
rather than what Win32 would use, which AIUI is utf-16LE (which in
this case would come out as the octets 0x36 0x06, IINM). However, I'm
not sure that (other than for diagnostic purposes) you should ever need
to tangle with it in that form, since Perl ought to know what to do in
a (wide) system call.

The system call is evidently treating them as two one-byte characters,
hence my question about wide system calls. Look for the reference to
wide system calls in the perlrun page, and the other references to
which it links.
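
A minimal sketch of the two octet sequences mentioned above, using the core Encode module. The helper name hex_octets is made up for illustration; the sketch only shows the encodings themselves and says nothing about what any particular system call does with them.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+0636 as a single Perl character.
my $name = chr(0x0636);

# Hypothetical helper: render a byte string as space-separated hex octets.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

my $utf8    = hex_octets(encode('UTF-8',    $name));
my $utf16le = hex_octets(encode('UTF-16LE', $name));

print "UTF-8:    $utf8\n";     # d8 b6
print "UTF-16LE: $utf16le\n";  # 36 06
```

Read as Latin-1 or Windows-1252, the two UTF-8 octets display as the two characters seen in the bad file name.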

> If somebody could point me in the correct direction, I would very much
> appreciate it. I have read the UNICODE documents included with PERL as


OK, but there are also some Win32-specific documents/web-pages that
come with the ActivePerl distribution. In some situations they might
be just what you need.

> well as searching the newsgroups and the web, and everything appears to
> indicate this should work.


If the above is not the answer, then maybe Win32API::File has
something for you - but I've never been there myself, so don't pay too
much attention to that.

> Perl program:


But did you start it with the -C option, or set the wide system calls
thingy? I think that may prove to be the key.

Good luck, and please report your findings.
 
 
 
 
 
Ben Morrow
Guest
Posts: n/a
 
      11-17-2003
(Allan Yates) wrote:
> I have been having distinct trouble creating file names in PERL


Perl or perl, not PERL.

> containing UNICODE


I'm not so sure about UNICODE...

> For a simple test, I picked a UNICODE character that could be
> displayed by Windows Explorer. I can select the character(U+0636) from
> 'charmap' and cut/paste into a filename on Windows Explorer and the
> character displays the same as it does in 'charmap'. This proves that
> I have the font available.
>
> When I attempt to create the same filename with PERL, I end up with a
> filename two characters long: Ø¶


OK, your problem here is that Win2k is being stupid about Unicode: any
sensible OS that understood UTF8 would be fine. My guess would be
that Windows stores filenames in utf16 with a BOM, and if it doesn't
find a BOM it assumes ASCII/'Windows ANSI'... so try this:

use Encode;

> $name = chr(0x0636);


$name = encode "utf16", $name;

> if (!open(FILE,">uni_names/$name")) {
> print STDERR "Could not open ($!): $name\n";
> }
>
> close (FILE);


If that works, then we could really do with an addition to the 'open'
pragma to do it for you: use open NAMES => "utf16";... hmmm.
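
For reference, a minimal sketch of what such an encode call yields, assuming the core Encode module and its canonical encoding name 'UTF-16' (the spelling used here is an assumption, not the one in the snippet above). With no BE/LE suffix, Encode writes a byte order mark followed by big-endian data. Whether Windows accepts those octets as a file name is not something this shows.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $name = chr(0x0636);

# 'UTF-16' without a BE/LE suffix: BOM first, then big-endian code units.
my $bytes = encode('UTF-16', $name);

my $hex = join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
print "$hex\n";    # fe ff 06 36
```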

If it fails, delete your file in uni_names and create one by
copy/pasting that character out of charmap. Then run

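#!/usr/bin/perl

use strict;
use warnings;
use bytes;    # work on raw octets rather than characters

# Print each entry of uni_names followed by the decimal value of its octets.
opendir my $U, "uni_names" or die "Cannot open uni_names: $!";
my @n = readdir $U;
closedir $U;
$, = $\ = "\n";
print map { "$_: " . join ' ', map { ord } split // } @n;

__END__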

and tell me what it says.

Ben

--
And if you wanna make sense / Whatcha looking at me for? (Fiona Apple)
* *
 
 
Allan Yates
Guest
Posts: n/a
 
      11-17-2003
The key was the missing "-C". I didn't clue in from the documentation
that this was important. Once I added that command line parameter, the
file was created with the correct name.

My next step was to read the file name from the directory. However, I
thought I read in some documentation somewhere that 'readdir' is not
UNICODE aware. I seemed to prove this by reading the directory
containing the file I just created. It comes back with a two character
file name whose characters 'ord' to 0xd8 and 0xb6, as you indicated.

Do you know of a method of reading directories to get the UNICODE file
names?
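
A minimal sketch of one way to turn such a two-octet name back into the intended character, assuming (as reported above) that readdir hands back the UTF-8 octets 0xd8 0xb6. The $raw value is a hypothetical stand-in for a real readdir result; this is not confirmed to match what every Perl build on Windows returns.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for one name returned by readdir:
# the two octets reported above.
my $raw = "\xd8\xb6";

# Interpret the octets as UTF-8 to recover the character string.
my $name = decode('UTF-8', $raw);

printf "length=%d codepoint=U+%04X\n", length($name), ord($name);
# length=1 codepoint=U+0636
```

In a real script the same decode call would be applied to each element of the list returned by readdir.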


Thanks,

Allan.


 
 
Allan Yates
Guest
Posts: n/a
 
      11-18-2003
But

You are correct that unicode is not an acronym and should not be
capitalised. My deepest apologies for offending you through the use of
my grammar. I was not aware that grammar police were covering this
newsgroup. PERL is an acronym, "Practical Extraction and Report
Language", and thus may be capitalised.


Allan.

P.S. Please don't even think of chastising me for top posting versus
bottom posting. Different people have different preferences.

P.P.S. For the people who have ignored my grammar and helped me in my
quest, I am very appreciative.

Abigail <> wrote in message news:<> ...
> Allan Yates () wrote on MMMDCCXXX September MCMXCIII in
> <URL:news: gle.com>:
> \\ I have been having distinct trouble creating file names in PERL
> \\ containing UNICODE characters. I am running ActiveState PERL 5.8 on
> \\ Windows 2000.
>
> Neither Perl, nor Unicode are acronyms, so they aren't spelled in
> all caps. If you do, it's like you are shouting. And that's rude.
>
>
> Abigail

 
 
Ben Morrow
Guest
Posts: n/a
 
      11-18-2003
(Allan Yates) wrote:
> You are correct that unicode is not an acronym and should not be
> capitalised. My deepest apologies for offending you through the use of
> my grammar. I was not aware that grammar police were covering this
> newsgroup.


'Grammar police' cover every ng worth having, the reason being that it
is very much easier to understand people when their spelling/grammar/
punctuation is correct.

> PERL is an acronym, "Practical Extraction and Report Language", and
> thus may be capitalised.


Nope, it isn't. From perlfaq1:

| But never write "PERL", because perl is not an acronym, apocryphal
| folklore and post-facto expansions notwithstanding.

> P.S. Please don't even think of chastising me for top posting versus
> bottom posting. Different people have different preferences.


No they don't. Only idiots prefer top-posting.

*PLONK*

Ben

--
If I were a butterfly I'd live for a day, / I would be free, just blowing away.
This cruel country has driven me down / Teased me and lied, teased me and lied.
I've only sad stories to tell to this town: / My dreams have withered and died.
<=>=<=>=<=>=<=>=<=>=<=>=<=>=<=>=<=>=<=>=<=> (Kate Rusby)
 
 
Malcolm Dew-Jones
Guest
Posts: n/a
 
      11-19-2003
Ben Morrow () wrote:
: (Allan Yates) wrote:
: > I have been having distinct trouble creating file names in PERL

: Perl or perl, not PERL.

: > containing UNICODE

: I'm not so sure about UNICODE...

: > For a simple test, I picked a UNICODE character that could be
: > displayed by Windows Explorer. I can select the character(U+0636) from
: > 'charmap' and cut/paste into a filename on Windows Explorer and the
: > character displays the same as it does in 'charmap'. This proves that
: > I have the font available.
: >
: > When I attempt to create the same filename with PERL, I end up with a
: > filename two characters long: Ø¶

: OK, your problem here is that Win2k is being stupid about Unicode: any
: sensible OS that understood UTF8 would be fine.

Hum, NT has been handling Unicode for at least ten years (3.5, 1993) by
the simple expedient of using 16 bit characters. It is hardware that is
stupid, by continuing to use ancient tiny 8 bit elementary units.

Imagine if all that hardware still used 16 or 24 bit memory addresses.
Imagine if all our communication and hardware backbones still actually
transmitted data in single digit bit sizes.

Character size was always a compromise between functionality and memory.
Character size continually increased from the first character-manipulating
electronic equipment (way, way back in the 1930s or so, believe it or
not) until the 1980s, when it suddenly solidified into a standard
elementary unit that was still a compromise in terms of size, but is now
clearly too small.

Character size remains frozen due to one of Murphy's laws regarding the
success of hardware first built using compromises that were appropriate
twenty years ago.

 
 
Ben Morrow
Guest
Posts: n/a
 
      11-19-2003
(Malcolm Dew-Jones) wrote:
> Ben Morrow () wrote:
> : OK, your problem here is that Win2k is being stupid about Unicode: any
> : sensible OS that understood UTF8 would be fine.
>
> Hum, NT has been handling Unicode for at least ten years (3.5, 1993) by
> the simple expedient of using 16 bit characters. It is hardware that is
> stupid, by continuing to use ancient tiny 8 bit elementary units.


OK, I invited that with gratuitous OS-bashing ... nevertheless:

1. Unicode is *NOT* a 16-bit character set. UTF16 is an evil bodge to
work around those who started assuming it was before the standards
were properly in place.

2. Given that the world does, in fact, use 8-bit bytes, any 16-bit
encoding has this small problem of endianness... again, solved
(IMHO) less-than-elegantly by the Unicode Consortium.

3. Given that the most widespread character set is likely to be either
ASCII or Chinese ideograms, and ideograms won't fit into less than
16 bits anyway, it seems pretty silly to encode a 7-bit charset
with 16 bits per character.

4. It also seems pretty silly to break everything in the world that
relies on a byte of 0 meaning end-of-string, not to mention '/'
being '/' (or '\', or whatever, as appropriate).

et cetera
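
Point 4 can be illustrated with a minimal sketch, assuming the core Encode module and a made-up three-character path. It only counts octets; it does not claim anything about how a given OS or C library treats them.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A short path containing only ASCII characters (hypothetical example).
my $path = "a/b";

my $utf8  = encode('UTF-8',    $path);
my $utf16 = encode('UTF-16LE', $path);

# tr/// with an empty replacement list counts matching octets
# without modifying the string.
my $nul_utf8  = ($utf8  =~ tr/\x00//);
my $nul_utf16 = ($utf16 =~ tr/\x00//);

printf "UTF-8:    %d octets, %d NUL\n", length($utf8),  $nul_utf8;
printf "UTF-16LE: %d octets, %d NUL\n", length($utf16), $nul_utf16;
# UTF-8:    3 octets, 0 NUL
# UTF-16LE: 6 octets, 3 NUL
```

Code that treats the first zero octet as end-of-string would see only "a" in the UTF-16LE form.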

Ben

--
And if you wanna make sense / Whatcha looking at me for? (Fiona Apple)
* *
 
 
Tad McClellan
Guest
Posts: n/a
 
      11-19-2003
Allan Yates <> wrote:

> PERL is an acronym,



No it isn't smarty pants.


> P.S. Please don't even think of chastising me for top posting versus
> bottom posting. Different people have different preferences.



No chastisement, just ignoration in perpetuity.

*plonk*


--
Tad McClellan SGML consulting
Perl programming
Fort Worth, Texas
 
 
Tassilo v. Parseval
Guest
Posts: n/a
 
      11-19-2003
Also sprach Allan Yates:

> P.S. Please don't even think of chastising me for top posting versus
> bottom posting. Different people have different preferences.


Right. And unless you write those articles solely for yourself, the
preferences of your readers count and not yours. So stop top-posting or
the regulars will stop reading your posts.

Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus}) !JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexi ixesixeseg;y~\n~~dddd;eval
 
 
 
 