Converting codepages to UTF8

 
 
P
03-30-2006


Hello,

Is there a Perl module which implements converting of codepages
(such as you get when running "chcp" in a command prompt) to UTF8?
Something that allows me to specify, for example, codepage 437 and
then converting it to UTF8. I've looked through the documentation for
the module Encode, but it doesn't seem to deal with codepages at all.

Thank you for any information you can provide that will nudge me in
the right direction.


Best regards,
Angela Druss

 
Dr.Ruud
03-30-2006
P wrote:

> Is there a Perl module which implements converting of codepages
> (such as you get when running "chcp" in a command prompt) to UTF8?
> Something that allows me to specify, for example, codepage 437 and
> then converting it to UTF8. I've looked through the documentation for
> the module Encode, but it doesn't seem to deal with codepages at all.



chcp is a command to change the parameters of the display.

C:\>chcp /?
Displays or sets the active code page number.

CHCP [nnn]

nnn Specifies a code page number.

Type CHCP without a parameter to display the active code page number.


What do you want to do? If you want to convert a file from one encoding
to another, look for 'iconv'.

--
Affijn, Ruud

"Gewoon is een tijger."

 
P
03-30-2006
Dr.Ruud wrote:
> P wrote:
>
> > Is there a Perl module which implements converting of
> > codepages (such as you get when running "chcp" in a
> > command prompt) to UTF8? Something that allows me to
> > specify, for example, codepage 437 and then converting
> > it to UTF8. I've looked through the documentation for
> > the module Encode, but it doesn't seem to deal with
> > codepages at all.

>
>
> chcp is a command to change the parameters of the display.
>
> C:\>chcp /?
> Displays or sets the active code page number.
>
> CHCP [nnn]
>
> nnn Specifies a code page number.
>
> Type CHCP without a parameter to display the active code
> page number.



Yes, if you call chcp without a parameter you can establish
the code page. That information is necessary to know what
I'm converting from.
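
A minimal sketch of how the number reported by chcp could be turned into an encoding name such as 'cp437'. The helper name codepage_name is made up for illustration, and the sample text assumes the English output "Active code page: 437"; other Windows locales word this line differently, so only the trailing digits are used.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the text printed by chcp into a name like "cp437".
sub codepage_name {
    my ($chcp_output) = @_;

    # Take the last run of digits; the surrounding wording is localised.
    my ($number) = $chcp_output =~ /(\d+)\D*\z/
        or return;

    return "cp$number";
}

# On Windows one might pass the real output: codepage_name(scalar `chcp`)
print codepage_name("Active code page: 437\n"), "\n";
```

The helper returns nothing when no digits are found, so the caller can fall back to a default or stop with an error.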


> What do you want to do? If you want to convert a file from
> one encoding to another, look for 'iconv'.



That's not exactly what I want to do. I have one file, which
is in UTF8, which contains a set of strings. I want to
determine whether any of the strings matches any file name
in a specified directory. Since there can be special
characters in the file names (and in the strings in the UTF8
file), sometimes I'll get false negatives, because a simple
eq on the strings in the UTF8 file and on the file names in
the directory won't match (due to the different encodings).
So I want to normalise the directory listing first (and this
should be dependent on the code page, because different
users might be using different code pages) and compare the
resulting list to the list in the UTF8 file. Does that make
sense?


Thank you for your input.


--
Best regards,
Angela Druss

 
Donald King
03-30-2006
P wrote:
>
> Hello,
>
> Is there a Perl module which implements converting of codepages
> (such as you get when running "chcp" in a command prompt) to UTF8?
> Something that allows me to specify, for example, codepage 437 and
> then converting it to UTF8. I've looked through the documentation for
> the module Encode, but it doesn't seem to deal with codepages at all.
>
> Thank you for any information you can provide that will nudge me in
> the right direction.
>
>
> Best regards,
> Angela Druss
>


The Encode module should do what you want. As far as I know, Encode
supports all the codepages out there. Assuming that $filename has raw
octets in the native codepage, something like:

use Encode qw(decode);
my $unicodefn = decode("cp437", $filename);

... should do the trick. The resulting string will be in Perl's Unicode
format -- keep in mind that while Perl uses UTF-8 internally, Perl
treats Unicode strings differently from strings of raw UTF-8 octets.
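
To illustrate the difference between a character string and raw UTF-8 octets, here is a small sketch. The file name is invented for the example; it relies on byte 0x81 being U+00FC (u with diaeresis) in code page 437.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw cp437 octets: 0x81 is U+00FC (u with diaeresis) in code page 437.
my $filename = "M\x81ller.txt";

# Octets in the code page -> Perl character string.
my $unicodefn = decode('cp437', $filename);

# Character string -> UTF-8 octets, e.g. for writing to a file.
my $utf8_octets = encode('UTF-8', $unicodefn);

# The character string and the octet string differ in length,
# because U+00FC takes two octets in UTF-8.
printf "%d characters, %d UTF-8 octets\n",
    length $unicodefn, length $utf8_octets;
```

Comparing $unicodefn with $utf8_octets using eq would fail, which is why both sides of a comparison should be decoded to character strings first.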

--
Donald King, a.k.a. Chronos Tachyon
http://chronos-tachyon.net/
 
Dr.Ruud
03-30-2006
P wrote:

> I have one file, which
> is in UTF8, which contains a set of strings. I want to
> determine whether any of the strings matches any file name
> in a specified directory.
>
> Since there can be special
> characters in the file names (and in the strings in the UTF8
> file), sometimes I'll get false negatives, because a simple
> eq on the strings in the UTF8 file and on the file names in
> the directory won't match (due to the different encodings).
>
> So I want to normalise the directory listing first (and this
> should be dependent on the code page, because different
> users might be using different code pages) and compare the
> resulting list to the list in the UTF8 file. Does that make
> sense?


Yes, that is much clearer. I'll assume that you have Windows and maybe
Cygwin.


Have you read perllocale, perluniintro, perlunicode, perlebcdic?


Use the command:

for /f "tokens=4" %w in ('chcp') do dir >text.%w

to create a file called "text.437" (if your chcp is 437)
with the dir-output for the current directory.


Under cygwin, you can use the command:

iconv -f CP437 -t UTF-8 text.437 > text.utf8

to convert the file from cp437 to utf8.


But that second step can also be done with Perl.

(Almost) platform-independent way to see all available encodings:

perl -MEncode -e "print join $/, Encode->encodings(':all')" |more
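
Instead of scanning the whole list, one can ask Encode about specific names. A short sketch, assuming the three code pages of interest are cp437, cp850 and cp852; find_encoding returns an encoding object, or undef when the name is unknown.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Check whether Encode knows each code page we might need.
for my $cp (qw(cp437 cp850 cp852)) {
    my $enc = find_encoding($cp);
    print $cp, ': ',
        ( $enc ? 'available as ' . $enc->name : 'not available' ), "\n";
}
```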

Now it is your turn to create some code and try to make it work.

--
Affijn, Ruud

"Gewoon is een tijger."

 
P
03-31-2006
Dr.Ruud wrote:
> P wrote:
>
> > I have one file, which is in UTF8, which contains a set
> > of strings. I want to determine whether any of the
> > strings matches any file name in a specified directory.
> >
> > Since there can be special characters in the file names
> > (and in the strings in the UTF8 file), sometimes I'll
> > get false negatives, because a simple eq on the strings
> > in the UTF8 file and on the file names in the directory
> > won't match (due to the different encodings).
> >
> > So I want to normalise the directory listing first (and
> > this should be dependent on the code page, because
> > different users might be using different code pages) and
> > compare the resulting list to the list in the UTF8 file.
> > Does that make sense?

>
> Yes, that is much clearer. I'll assume that you have
> Windows and maybe Cygwin.
>
>
> Have you read perllocale, perluniintro, perlunicode,
> perlebcdic?


Yes, I have, and while I consider myself slightly more
intelligent than a garden gnome, I must admit that these
issues concerning character encoding are beyond my abilities
of comprehension (at least at present).


> Use the command:
>
> for /f "tokens=4" %w in ('chcp') do dir >text.%w
>
> to create a file called "text.437" (if your chcp is 437)
> with the dir-output for the current directory.



I assume this is a demonstration, rather than part of a
solution? Or are you saying I'll have to write a temporary
file in this way to solve my problem?


> Under cygwin, you can use the command:
>
> iconv -f CP437 -t UTF-8 text.437 > text.utf8
>
> to convert the file from cp437 to utf8.



I don't have iconv.


> But that second step can also be done with Perl.
>
> (Almost) platform-independent way to see all available
> encodings:
>
> perl -MEncode -e "print join $/, Encode->encodings(':all')" |more



OK, this, and Mr King's reply tell me that Encode is capable
of doing this. I need 'cp437', 'cp850' and 'cp852'
(depending on which machine I'm using). For the rest of this
post I'll assume that I'll be using 'cp437'.


> Now it is your turn to create some code and try to make it
> work.



Here's the script (stripped for the purposes of this post)
*before* tackling the encoding issues:

----------
#!/usr/bin/perl
use warnings;
use strict;

opendir(DIR, '.') or die "Can't open input directory: $!";

my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);

while (<DATA>) {
    chomp;

    if ( exists $files{$_} ) {
        print "$_ matches.\n";
    }
    else {
        print "$_ doesn't match.\n";
    }
}

__DATA__
Đorđe Balašević
----------


A file named "Đorđe Balašević" *does* exist in the CWD, yet
when I run this script I get:

Đor?e Bala-evi? doesn't match.


So I tried the following fix:

----------
while (<DATA>) {
    chomp;

    my $key = decode('cp437', $_);

    if ( exists $files{$key} ) {
        print "$_ matches.\n";
    }
    else {
        print "$_ doesn't match.\n";
    }
}
----------


But this gives the same exact result. What am I doing wrong?

--
Best regards,
Angela Druss

 
Dr.Ruud
03-31-2006
P wrote:

> #!/usr/bin/perl
> use warnings;
> use strict;


I think you need this:

use Encode qw(cp437 cp850 cp852);

or maybe

use Encode::Byte;

but see also the remarks about PerlIO in `perldoc Encode`.


> opendir(DIR, '.') or die "Can't open input directory: $!";


Alternative:

opendir my $dir, '.'
    or die "Can't open input directory: $!";

> my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);



Maybe:

my %files = map { $_ => 1 } grep { ! m/\A\.\.?\z/s } readdir $dir;

or:

my %files = map { $_ => 1 } grep -f, readdir $dir;

(untested)


--
Affijn, Ruud

"Gewoon is een tijger."
 
Dr.Ruud
03-31-2006
P wrote:


my $cp = 'cp437';

> my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);


map { decode( $cp, $_ ) => 1 } grep ...
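
A fuller sketch of that idea, assuming that both sides need decoding: the directory entries from the code page, and the lines of the list from UTF-8, so that the comparison happens on character strings rather than on octets. The helper name match_names and the sample names are made up for illustration. Whether readdir really returns names in the code page reported by chcp on a given Windows setup is an assumption that should be verified.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the UTF-8 lines that match a file name.
sub match_names {
    my ($cp, $raw_names, $utf8_lines) = @_;

    # Decode directory entries from the code page into character strings.
    my %files = map { decode($cp, $_) => 1 } @$raw_names;

    # Decode each list entry from UTF-8 octets, then compare characters.
    return grep { exists $files{ decode('UTF-8', $_) } } @$utf8_lines;
}

# cp437 octets, standing in for what readdir might return (assumption).
my @raw_names = ("M\x81ller.txt");

# UTF-8 octets, standing in for lines read from the list file.
my @utf8_lines = ("M\xC3\xBCller.txt", "other.txt");

print "$_ matches.\n" for match_names('cp437', \@raw_names, \@utf8_lines);
```

Decoding only one side, or decoding the UTF-8 lines as if they were cp437, leaves the two strings different, so the lookup keeps failing.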

--
Affijn, Ruud

"Gewoon is een tijger."
 
P
03-31-2006
Dr.Ruud wrote:
> P wrote:
>
> > #!/usr/bin/perl
> > use warnings;
> > use strict;

>
> I think you need this:
>
> use Encode qw(cp437 cp850 cp852);



But those are just arguments to the decode() subroutine. They aren't
exported by Encode.pm so that gives errors.


> or maybe
>
> use Encode::Byte;



According to the documentation for Encode::Byte decode()
loads Encode::Byte implicitly.


> but see also the remarks about PerlIO in `perldoc Encode`.



Those remarks only show a way to do the encoding on the fly.
The result is exactly the same, though.


> > opendir(DIR, '.') or die "Can't open input directory: $!";

>
> Alternative:
>
> opendir my $dir, '.'
> or die "Can't open input directory: $!";
>
> > my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);

>
>
> Maybe:
>
> my %files = map { $_ => 1 } grep { ! m/\A\.\.?\z/s } readdir $dir;
>
> or:
>
> my %files = map { $_ => 1 } grep -f, readdir $dir;



These tips don't address the issue, though.


Thanks anyway.


--
Best regards,
Angela Druss

 
P
03-31-2006

Dr.Ruud wrote:
> P wrote:
>
>
> my $cp = 'cp437';
>
> > my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);

>
> map { decode( $cp, $_ ) => 1 } grep ...



This does exactly what my code does, except at a different point in
time. The result is the same.


Thanks anyway.


--
Best regards,
Angela Druss

 