Dr.Ruud wrote:
> P wrote:
>
> > I have one file, which is in UTF8, which contains a set
> > of strings. I want to determine whether any of the
> > strings matches any file name in a specified directory.
> >
> > Since there can be special characters in the file names
> > (and in the strings in the UTF8 file), sometimes I'll
> > get false negatives, because a simple eq on the strings
> > in the UTF8 file and on the file names in the directory
> > won't match (due to the different encodings).
> >
> > So I want to normalise the directory listing first (and
> > this should be dependent on the code page, because
> > different users might be using different code pages) and
> > compare the resulting list to the list in the UTF8 file.
> > Does that make sense? 
>
> Yes, that is much clearer. I'll assume that you have
> Windows and maybe Cygwin.
>
>
> Have you read perllocale, perluniintro, perlunicode,
> perlebcdic?

Yes, I have, and while I consider myself slightly more
intelligent than a garden gnome, I must admit that these
issues concerning character encoding are beyond my abilities
of comprehension (at least at present).

> Use the command:
>
> for /f "tokens=4" %w in ('chcp') do dir >text.%w
>
> to create a file called "text.437" (if your chcp is 437)
> with the dir-output for the current directory.

I assume this is a demonstration, rather than part of a
solution? Or are you saying I'll have to write a temporary
file in this way to solve my problem?

> Under cygwin, you can use the command:
>
> iconv -f CP437 -t UTF-8 text.437 > text.utf8
>
> to convert the file from cp437 to utf8.

I don't have iconv.

> But that second step can also be done with Perl.
>
> (Almost) platform-independent way to see all available
> encodings:
>
> perl -MEncode -e "print join $/, Encode->encodings(':all')" |more

OK, this and Mr King's reply tell me that Encode is capable
of doing this. I need 'cp437', 'cp850' and 'cp852'
(depending on which machine I'm using). For the rest of this
post I'll assume that I'll be using 'cp437'.
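
Since iconv is not available here, the conversion step quoted
above could presumably be done with Encode alone. What follows
is only a minimal sketch: codepage_to_utf8 is a hypothetical
helper name (not something from this thread), and 'cp437' is
simply the code page assumed in this post.

```perl
use strict;
use warnings;
use Encode qw(decode encode find_encoding);

# Hypothetical helper: convert a byte string from a legacy code page
# to UTF-8 bytes, roughly what "iconv -f CP437 -t UTF-8" does.
sub codepage_to_utf8 {
    my ($bytes, $codepage) = @_;
    # Fail early if this Perl's Encode does not know the code page.
    find_encoding($codepage)
        or die "Encode does not know the encoding '$codepage'\n";
    my $chars = decode($codepage, $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);          # characters -> UTF-8 bytes
}

# 0x82 is "e with acute accent" in cp437; in UTF-8 it is C3 A9.
my $utf8 = codepage_to_utf8("\x82", 'cp437');
print join(' ', map { sprintf('%02X', ord($_)) } split(//, $utf8)), "\n";   # prints "C3 A9"
```

The same call with 'cp850' or 'cp852' should work the same
way, provided Encode lists those names on the machine.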

> Now it is your turn to create some code and try to make it
> work.

Here's the script (stripped for the purposes of this post)
*before* tackling the encoding issues:
----------
#!/usr/bin/perl
use warnings;
use strict;

# Collect the names in the current directory, skipping "." and "..".
opendir(DIR, '.') or die "Can't open input directory: $!";
my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);

# Check each string from the list against the directory entries.
while (<DATA>) {
    chomp;
    if ( exists $files{$_} ) {
        print "$_ matches.\n";
    }
    else {
        print "$_ doesn't match.\n";
    }
}
__DATA__
Đorđe Balašević
----------
A file named "Đorđe Balašević" *does* exist in the CWD, yet
when I run this script I get:
Đor?e Bala-evi? doesn't match.
So I tried the following fix:
----------
use Encode;    # provides decode()

while (<DATA>) {
    chomp;
    my $key = decode('cp437', $_);
    if ( exists $files{$key} ) {
        print "$_ matches.\n";
    }
    else {
        print "$_ doesn't match.\n";
    }
}
----------
But this gives exactly the same result. What am I doing wrong?
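
For comparison, a minimal sketch of one possible direction,
under stated assumptions: the attempt above decodes the list
strings as cp437 although the list is UTF-8, and leaves the
file names undecoded. The sketch instead decodes both sides
into Perl character strings before comparing. decoded_set is a
hypothetical helper, the sample byte strings are stand-ins for
real input, and whether readdir really returns cp437 bytes on a
given machine is an assumption that would need checking.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode each byte string with the given
# encoding and return the results as the keys of a lookup hash.
sub decoded_set {
    my ($encoding, @byte_strings) = @_;
    my %set;
    $set{ decode($encoding, $_) } = 1 for @byte_strings;
    return %set;
}

# Stand-ins for real input: a name as readdir might return it in the
# assumed code page (0x82 is "e with acute accent" in cp437), and the
# same name as UTF-8 bytes, as it would appear in the list file.
my @dir_bytes  = ("caf\x82");
my @list_bytes = ("caf\xC3\xA9");

my %files = decoded_set('cp437', @dir_bytes);

for my $line (@list_bytes) {
    my $key = decode('UTF-8', $line);    # the list is UTF-8, not cp437
    print exists $files{$key} ? "matches\n" : "doesn't match\n";
}
```

With the stand-in data both byte strings decode to the same
four characters, so the lookup succeeds.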
--
Best regards,
Angela Druss