Velocity Reviews


MoshiachNow 08-28-2006 06:34 AM

Problem handling a Unicode file
 
Hi,

I have a file that, when opened in Notepad, looks like this (a sample line):

[HKEY_LOCAL_MACHINE\

I know it's some kind of Unicode (I can't figure out which one), since when
I print its lines in Perl I get the following:

[ H K E Y _ L O C A L _ M A C H I N E \

I basically need to replace some strings inside the file, so I need to
decode it from Unicode and eventually save it as Unicode again.
I have tried the following:

1. open(FILE, ":utf8", "Araxi.reg"); # no good, still prints spaces between characters

2. my $STRING = decode("EBCDIC", $_); # no good, still prints spaces between characters

None of this got me far.
How do I achieve the above goals (after establishing the exact Unicode
format)?
Thanks


Brian McCauley 08-28-2006 07:37 AM

Re: Problem handling a Unicode file
 

MoshiachNow wrote:
> Hi,
>
> I have a file that, when opened in Notepad, looks like this (a sample line):
>
> [HKEY_LOCAL_MACHINE\
>
> I know it's some kind of Unicode (I can't figure out which one), since when
> I print its lines in Perl I get the following:
>
> [ H K E Y _ L O C A L _ M A C H I N E \
>
> I basically need to replace some strings inside the file, so I need to
> decode it from Unicode and eventually save it as Unicode again.
> I have tried the following:
>
> 1. open(FILE, ":utf8", "Araxi.reg"); # no good, still prints spaces between characters


When Microsoft adopted Unicode it had not yet become clear that utf8
was the "usual" encoding and they went for utf16le as their default
encoding.

open(FILE, "<:encoding(utf16le)", "Araxi.reg") or die $!;

Actually, when reading you can leave out the 'le', as the BOM will tell
Perl the byte order.

IIRC Windows puts a BOM on utf8 files too, so it is in principle
possible to open a file that could be latin1, utf8, utf16be or utf16le
and infer the encoding.

AFAIK there's no simple encoding() in Perl to do this, as BOMed utf8
post-dates the initial implementation of Unicode in Perl.
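
A minimal sketch of the BOM sniffing described above, assuming the file is small enough to reopen cheaply and that a file without a BOM should be treated as latin1 (that fallback is a guess, not a rule). The helper names guess_encoding_from_bom and open_guessing_bom are hypothetical, and UTF-32 is deliberately ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: map the leading bytes of a file to an Encode name
# and the number of BOM bytes to skip. No BOM falls back to latin1,
# which is an assumption made for this sketch.
sub guess_encoding_from_bom {
    my ($bytes) = @_;
    return ('UTF-8',      3) if $bytes =~ /\A\xEF\xBB\xBF/;
    return ('UTF-16LE',   2) if $bytes =~ /\A\xFF\xFE/;
    return ('UTF-16BE',   2) if $bytes =~ /\A\xFE\xFF/;
    return ('iso-8859-1', 0);
}

# Hypothetical helper: open a file for reading with the guessed encoding,
# positioned just after the BOM (if any).
sub open_guessing_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open '$path': $!";
    my $head = '';
    read $fh, $head, 3;                      # peek at up to three bytes
    my ($enc, $skip) = guess_encoding_from_bom($head);
    seek $fh, $skip, 0 or die "seek '$path': $!";
    binmode $fh, ":encoding($enc)";          # decode everything after the BOM
    return ($fh, $enc);
}
```

Because the handle is opened with :raw, line endings are not translated, so lines read from a Windows file keep their "\r\n".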


MoshiachNow 08-28-2006 10:11 AM

Re: Problem handling a Unicode file
 
Thanks,

I did just that, and it reads the file nicely.
Then I want to replace strings in the file and write it back in utf16
to Araxi2.reg.
I use the code below, but the file does not look right in Notepad
anymore, meaning the format is not exactly utf16 ...

open (FILE,'<:encoding(utf16)',"Araxi.reg") || die "Could not open Araxi.reg: $!"; #Read UNICODE FILE TO ASCII
open (FILE1,">Araxi1.reg") || die "Could not open Araxi1.reg: $!";
while (<FILE>) {
    print FILE1;
}
close FILE;
close FILE1;

open (FILE1,"Araxi1.reg") || die "Could not open Araxi1.reg: $!";
#get old server name
while (<FILE1>) {
    chomp;
    if (/Host/) {
        ($OLDNAME) = m/"Host"="(\w*-\w*)"/;
        #print "OLDNAME=$OLDNAME\n";
        $OLDNAME_SMALL = lc $OLDNAME;
        #print "OLDNAME_SMALL=$OLDNAME_SMALL\n";
        last;
    }
}
close FILE1;

open (FILE,'>:encoding(utf16)',"Araxi2.reg") || die "Could not open Araxi2.reg: $!"; #CONVERT A UNICODE FILE TO ASCII
open (FILE1,"Araxi1.reg") || die "Could not open Araxi1.reg: $!";
while (<FILE1>) {
    s/$OLDNAME/$computer/; #replace capitals
    s/$OLDNAME_SMALL/$computer_small/; #replace small letters names
    print FILE "$_";
}


Dr.Ruud 08-28-2006 03:40 PM

Re: Problem handling a Unicode file
 
MoshiachNow wrote:

> I use the code below, but the file does not look right in Notepad
> anymore, meaning the format is not exactly utf16 ...
>
> open (FILE,'<:encoding(utf16)',"Araxi.reg") || die "Could not open Araxi.reg: $!"; #Read UNICODE FILE TO ASCII
> open (FILE1,">Araxi1.reg") || die "Could not open Araxi1.reg: $!";


You need to use the utf16le layer for the output too.

#!/usr/bin/perl
use warnings ;
use strict ;

my $fni = 'Araxi.reg' ;
my $fno = 'Araxi1.reg' ;

open my $fhi, '<:encoding(utf16)', $fni
    or die "open '$fni', stopped $!" ;

open my $fho, '>:encoding(utf16)', $fno
    or die "open '$fno', stopped $!" ;


--
Affijn, Ruud

"Gewoon is een tijger."



Brian McCauley 08-28-2006 05:08 PM

Re: Problem handling a Unicode file
 

MoshiachNow wrote:

> open (FILE,'<:encoding(utf16)',"Araxi.reg") || die "Could not open Araxi.reg: $!"; #Read UNICODE FILE TO ASCII


That comment is highly misleading. It should say "Read utf16 file into
Unicode".

The file is in utf16. The strings that are read from it are in Unicode.
Actually Perl will internally represent the strings in utf8, but
conceptually they are just Unicode. One thing they certainly are not is
ASCII. Of course, if the data happens to contain no characters beyond
0x7F then the internal representation of the Unicode string will be
identical to the equivalent ASCII string.
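
To illustrate the distinction between the file's encoding and the decoded string, here is a small sketch using the core Encode module. The sample bytes are the word "Win" as it would sit in a utf16le file after the BOM; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The six bytes of "Win" as stored in a UTF-16LE file (BOM not included).
my $bytes = "\x57\x00\x69\x00\x6E\x00";

# Decoding turns encoded bytes into a string of Unicode characters.
my $text = decode('UTF-16LE', $bytes);

# prints "bytes: 6, characters: 3"
printf "bytes: %d, characters: %d\n", length($bytes), length($text);

# Regexes and substitutions work on the decoded characters.
print "matches /Win/\n" if $text =~ /Win/;

# Encoding goes the other way; the output encoding is a separate choice.
my $utf8 = encode('UTF-8', $text);
printf "as UTF-8: %d bytes\n", length($utf8);   # prints "as UTF-8: 3 bytes"
```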


Peter J. Holzer 08-28-2006 06:29 PM

Re: Problem handling a Unicode file
 
On 2006-08-28 10:11, MoshiachNow <lev.weissman@creo.com> wrote:
> Thanks,
>
> I did just that, and it reads the file nicely.
> Then I want to replace strings in the file and write it back in utf16
> to Araxi2.reg.
> I use the code below, but the file does not look right in Notepad
> anymore, meaning the format is not exactly utf16 ...


Notepad needs the BOM at the beginning of the file to recognize it
is UTF16, so you have to write that:

> open (FILE,'>:encoding(utf16)',"Araxi2.reg") || die "Could not open

print FILE "\x{FEFF}";

or, if you prefer symbolic names:

use charnames ':short';
....
print FILE "\N{BOM}";


hp


--
_ | Peter J. Holzer | > Wieso sollte man etwas erfinden was nicht
|_|_) | Sysadmin WSR | > ist?
| | | hjp@hjp.at | Was sonst wäre der Sinn des Erfindens?
__/ | http://www.hjp.at/ | -- P. Einstein u. V. Gringmuth in desd

Dr.Ruud 08-28-2006 08:28 PM

Re: Problem handling a Unicode file
 
Peter J. Holzer wrote:

> Notepad needs the BOM at the beginning of the file to recognize it
> is UTF16, so you have to write that:


With "encoding(UTF-16)", the IO-layer takes care of that. But then you
leave it up to Perl (Encode::PerlIO?) to choose between UTF-16LE and
UTF-16BE. See also perldoc Encode::Unicode.


At opening, the file is 0 bytes, but after printing a single space, it
becomes 4 bytes, with the first two holding the BOM:

#!/usr/bin/perl
use warnings ;
use strict ;

my $fni = 'Araxi.reg' ;
my $fno = 'Araxi1.reg' ;

open my $fhi, '<:encoding(UTF-16)', $fni
    or die "open '$fni', stopped $!" ;

open my $fho, '>:encoding(UTF-16)', $fno
    or die "open '$fno', stopped $!" ;

print $fho ' ' ;
__END__
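
The byte order Perl picks can be checked without touching a file. The sketch below uses Encode directly; per perldoc Encode::Unicode, encoding to plain UTF-16 prepends a BOM and uses big-endian order, while UTF-16LE writes no BOM of its own. The hex_bytes helper is hypothetical and exists only to print the result.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $text = 'Win';

# Plain UTF-16: Encode prepends a BOM and uses big-endian order.
my $utf16 = hex_bytes(encode('UTF-16', $text));

# UTF-16LE: little-endian, and no BOM unless it is part of the string.
my $utf16le = hex_bytes(encode('UTF-16LE', "\x{FEFF}" . $text));

print "UTF-16:         $utf16\n";     # FE FF 00 57 00 69 00 6E
print "BOM + UTF-16LE: $utf16le\n";   # FF FE 57 00 69 00 6E 00
```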

--
Affijn, Ruud

"Gewoon is een tijger."



MoshiachNow 08-29-2006 06:29 AM

Re: Problem handling a Unicode file
 
Thanks,

I did exactly as advised above, but checking the output in a binary
display, I see that all bytes are interchanged within the words (see
below).
I have also tried "utf16-LE"; this did not help.

Good utf16 input file:
FF FE 57 00 69 00 6E 00

Bad output file:
FE FF 00 57 00 69 00 6E

(Indeed, the print FILE "\x{FEFF}"; statement does not appear to be
required, since Perl takes care of it internally.)

So what can still be wrong?


Dr.Ruud 08-29-2006 11:40 AM

Re: Problem handling a Unicode file
 
MoshiachNow wrote:

> all bytes are interchanged within the words


That is the UTF16-LE order, so it would have been wrong if you had
seen something else. Do you understand the role of the BOM (Byte
Order Mark) now?
http://en.wikipedia.org/wiki/Byte_Order_Mark

Create a fresh file in Notepad with just the word "test" in it, and do a
File/Save As..., with Encoding "Unicode", and you'll see that Windows
defaults to UTF16-LE.

You'll also find an Encoding "Unicode big-endian" there; that is
UTF16-BE. But why would you want the bytes in a different order than the
default for the platform?
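
A minimal sketch of writing a file in the same layout Notepad produces for "Unicode" (BOM followed by UTF-16LE), assuming the input carries a BOM. The helper name rewrite_utf16le and its callback argument are hypothetical; the layers and the "\x{FEFF}" BOM are the ones already discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: copy a BOM-marked UTF-16 file to a little-endian
# UTF-16 file with a BOM, passing every line through a callback.
sub rewrite_utf16le {
    my ($in, $out, $edit) = @_;

    # On input, the BOM tells the UTF-16 layer which byte order to use.
    open my $fhi, '<:raw:encoding(UTF-16)', $in
        or die "open '$in': $!";

    # On output, name the byte order explicitly and write the BOM by hand,
    # because the UTF-16LE layer does not add one.
    open my $fho, '>:raw:encoding(UTF-16LE)', $out
        or die "open '$out': $!";
    print {$fho} "\x{FEFF}";

    while (my $line = <$fhi>) {
        print {$fho} $edit->($line);
    }

    close $fhi;
    close $fho or die "close '$out': $!";
    return;
}
```

It could be called as rewrite_utf16le('Araxi.reg', 'Araxi2.reg', sub { my ($l) = @_; $l =~ s/old/new/; $l }), where the substitution is a placeholder. The :raw layer keeps the "\r\n" line endings untouched in both directions.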

--
Affijn, Ruud

"Gewoon is een tijger."



MoshiachNow 08-29-2006 12:25 PM

Re: Problem handling a Unicode file
 
Hi,

I run exactly this:
open my $fhi, '<:encoding(UTF-16)', $fni
    or die "open '$fni', stopped $!" ;


open my $fho, '>:encoding(UTF-16)', $fno
    or die "open '$fno', stopped $!" ;

and expect the input and output files to be in the same byte order, but
they are not.

I DID try adding the following line; it did not help:

print $fho "\x{FEFF}";



All times are GMT.