Problem handling a Unicode file

 
 
MoshiachNow
 
      08-28-2006
Hi,

I have a file that, when opened in Notepad, looks like this (a sample line):

[HKEY_LOCAL_MACHINE\

I know it is some kind of Unicode encoding (I cannot figure out which
one), because when I print the lines in Perl I get the following:

[ H K E Y _ L O C A L _ M A C H I N E \

I basically need to replace some strings inside the file, so I need to
decode it from Unicode and eventually save it as Unicode again.
I have tried the following:

1. open(FILE, ":utf8", "Araxi.reg"); # no good, still prints spaces between characters

2. my $STRING = decode("EBCDIC", $_); # no good, still prints spaces between characters

None of this got me far.
How do I achieve the above goals (after establishing the exact Unicode
format)?
Thanks

 
 
 
 
 
Brian McCauley
 
      08-28-2006

MoshiachNow wrote:
> Hi,
>
> I have a file that, when opened in Notepad, looks like this (a sample line):
>
> [HKEY_LOCAL_MACHINE\
>
> I know it is some kind of Unicode encoding (I cannot figure out which
> one), because when I print the lines in Perl I get the following:
>
> [ H K E Y _ L O C A L _ M A C H I N E \
>
> I basically need to replace some strings inside the file, so I need to
> decode it from Unicode and eventually save it as Unicode again.
> I have tried the following:
>
> 1. open(FILE, ":utf8", "Araxi.reg"); # no good, still prints spaces between characters


When Microsoft adopted Unicode, it had not yet become clear that UTF-8
would be the "usual" encoding, so they went for UTF-16LE as their
default encoding.
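
To see where the "spaces" in the original question come from, here is a minimal sketch that encodes a short ASCII string as UTF-16LE and dumps the resulting bytes. The sample string and variable names are illustrative and not taken from the real Araxi.reg. Every second byte is a NUL, which a console tends to show as a blank:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative sample text, not read from the real file.
my $text  = '[HKEY';
my $bytes = encode('UTF-16LE', $text);

# Render each byte as two hex digits.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

print "$hex\n";    # 5B 00 48 00 4B 00 45 00 59 00
```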

open(FILE, "<:encoding(utf16le)", "Araxi.reg") or die $!;

Actually, for reading you can leave out the 'le', as the BOM will tell
Perl the byte order.
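
A small in-memory sketch of that difference, using Encode's decode() on hand-written byte strings (the sample bytes spell "Win" and are assumptions for illustration). Plain UTF-16 consumes the BOM and picks the byte order from it, while UTF-16LE has a fixed order and leaves the BOM in the data as a U+FEFF character:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Win" stored with a little-endian BOM (FF FE) ...
my $le_bytes = "\xFF\xFE\x57\x00\x69\x00\x6E\x00";
# ... and the same text stored with a big-endian BOM (FE FF).
my $be_bytes = "\xFE\xFF\x00\x57\x00\x69\x00\x6E";

# Plain UTF-16 reads the BOM, picks the byte order and drops the BOM.
my $from_le = decode('UTF-16', $le_bytes);
my $from_be = decode('UTF-16', $be_bytes);

# UTF-16LE does not look for a BOM, so U+FEFF stays as the first character.
my $kept_bom = decode('UTF-16LE', $le_bytes);

printf "%s %s %d\n", $from_le, $from_be, length $kept_bom;
```

With the explicit 'le' variant, the leading U+FEFF would have to be stripped by hand, for example with s/\A\x{FEFF}// on the first line.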

IIRC Windows puts a BOM on utf8 files too so it is in principle
possible to open a file that could be latin1, utf8, utf16be or utf16le
and infer the encoding.

AFAIK there's no simple encoding() in Perl to do this, as BOMed utf8
post-dates the initial implementation of Unicode in Perl.
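
The inference can be sketched by hand. The helper below is hypothetical (its name and the Latin-1 fallback are assumptions, not an existing Perl API): it inspects the first bytes of a file and returns an encoding name. It ignores UTF-32, whose little-endian BOM also starts with FF FE.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding name from the first bytes of a file.
sub guess_encoding_from_bom {
    my ($head) = @_;
    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return 'ISO-8859-1';    # assumption: no BOM means Latin-1
}

print guess_encoding_from_bom("\xFF\xFE\x57\x00"), "\n";
```

A caller would read the first three bytes from a ':raw' handle, pass them to the helper, and then reopen the file with the matching :encoding() layer, skipping the BOM itself.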

 
 
 
 
 
MoshiachNow
 
      08-28-2006
Thanks,

I did just that, and it reads the file nicely.
Next I want to replace strings in the file and write it back in UTF-16
to Araxi2.reg.
I use the code below, but the file no longer looks right in Notepad,
which suggests the output is not exactly UTF-16 ...

open (FILE,'<:encoding(utf16)',"Araxi.reg") || die "Could not open Araxi.reg: $!"; #Read UNICODE FILE TO ASCII
open (FILE1,">Araxi1.reg") || die "Could not open Araxi1.reg: $!";
while (<FILE>) {
    print FILE1;
}
close FILE;
close FILE1;

open (FILE1,"Araxi1.reg") || die "Could not open Araxi1.reg: $!";
#get old server name
while (<FILE1>) {
    chomp;
    if (/Host/) {
        ($OLDNAME) = m/"Host"="(\w*-\w*)"/;
        #print "OLDNAME=$OLDNAME\n";
        $OLDNAME_SMALL = lc $OLDNAME;
        #print "OLDNAME_SMALL=$OLDNAME_SMALL\n";
        last;
    }
}
close FILE1;

open (FILE,'>:encoding(utf16)',"Araxi2.reg") || die "Could not open Araxi2.reg: $!"; #CONVERT A UNICODE FILE TO ASCII
open (FILE1,"Araxi1.reg") || die "Could not open Araxi1.reg: $!";
while (<FILE1>) {
    s/$OLDNAME/$computer/; #replace capitals
    s/$OLDNAME_SMALL/$computer_small/; #replace small letters names
    print FILE "$_";
}

 
 
Dr.Ruud
 
      08-28-2006
MoshiachNow wrote:

> I use the code below, but the file does not look good in Notepad
> anymore, meaning the format is not exactly utf16 ...
>
> open (FILE,'<:encoding(utf16)',"Araxi.reg") || die "Could not open Araxi.reg: $!"; #Read UNICODE FILE TO ASCII
> open (FILE1,">Araxi1.reg") || die "Could not open Araxi1.reg: $!";


You need to use the utf16le layer for the output too.

#!/usr/bin/perl
use warnings ;
use strict ;

my $fni = 'Araxi.reg' ;
my $fno = 'Araxi1.reg' ;

open my $fhi, '<:encoding(utf16)', $fni
    or die "open '$fni', stopped $!" ;

open my $fho, '>:encoding(utf16)', $fno
    or die "open '$fno', stopped $!" ;


--
Affijn, Ruud

"Gewoon is een tijger."


 
 
Brian McCauley
 
      08-28-2006

MoshiachNow wrote:

> open (FILE,'<:encoding(utf16)',"Araxi.reg") || die "Could not open Araxi.reg: $!"; #Read UNICODE FILE TO ASCII


That comment is highly misleading. It should say "Read utf16 file into
Unicode".

The file is in utf16. The strings that are read from it are in Unicode.
Actually Perl will internally represent the strings in utf8, but
conceptually they are just Unicode. One thing they certainly are not is
ASCII. Of course, if the data happens to contain no characters beyond
0x7F then the internal representation of the Unicode string will be
identical to the equivalent ASCII string.
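
The distinction between the bytes in the file and the characters in the string can be checked with a short sketch (sample data and variable names are illustrative assumptions):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $utf16 = "\x57\x00\x69\x00\x6E\x00";    # "Win" as UTF-16LE bytes, no BOM
my $str   = decode('UTF-16LE', $utf16);

# The decoded string holds characters, not UTF-16 code units.
my $chars = length $str;      # number of characters in the string
my $bytes = length $utf16;    # number of bytes as stored in the file

# With nothing beyond 0x7F it compares equal to the plain ASCII literal.
my $same = ($str eq 'Win') ? 1 : 0;

# A character beyond 0x7F takes more than one byte once encoded as UTF-8.
my $e_acute   = "\x{E9}";
my $utf8_size = length(encode('UTF-8', $e_acute));

print "$chars characters, $bytes bytes, same=$same, utf8_size=$utf8_size\n";
```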

 
 
Peter J. Holzer
 
      08-28-2006
On 2006-08-28 10:11, MoshiachNow <(E-Mail Removed)> wrote:
> Thanks,
>
> I did just that, and it reads the file nicely.
> Next I want to replace strings in the file and write it back in UTF-16
> to Araxi2.reg.
> I use the code below, but the file no longer looks right in Notepad,
> which suggests the output is not exactly UTF-16 ...


Notepad needs the BOM at the beginning of the file to recognize that it
is UTF-16, so you have to write that:

> open (FILE,'>:encoding(utf16)',"Araxi2.reg") || die "Could not open

print FILE "\x{FEFF}";

or, if you prefer symbolic names:

use charnames ':short';
....
print FILE "\N{BOM}";


hp


--
   _  | Peter J. Holzer    | > Why should one invent something that
|_|_) | Sysadmin WSR       | > does not exist?
| |   | (E-Mail Removed)   | What else would be the point of inventing?
__/   | http://www.hjp.at/ |  -- P. Einstein and V. Gringmuth in desd
 
 
Dr.Ruud
 
      08-28-2006
Peter J. Holzer wrote:

> Notepad needs the BOM at the beginning of the file to recognize it
> is UTF16, so you have to write that:


With "encoding(UTF-16)", the IO layer takes care of that. But then you
leave it up to Perl (Encode::PerlIO?) to choose between UTF-16LE and
UTF-16BE. See also perldoc Encode::Unicode.


At opening, the file is 0 bytes, but after printing a single space, it
becomes 4 bytes, with the first two holding the BOM:

#!/usr/bin/perl
use warnings ;
use strict ;

my $fni = 'Araxi.reg' ;
my $fno = 'Araxi1.reg' ;

open my $fhi, '<:encoding(UTF-16)', $fni
    or die "open '$fni', stopped $!" ;

open my $fho, '>:encoding(UTF-16)', $fno
    or die "open '$fno', stopped $!" ;

print $fho ' ' ;
__END__
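
The four-byte result can also be checked without touching a file. This is a sketch under the assumption that an in-memory encode() call produces the same bytes as the first write through the layer; the `hex_dump` helper is hypothetical. It also shows what happens when an explicit U+FEFF is sent through plain UTF-16: the data ends up with two BOMs.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: space-separated hex digits for a byte string.
sub hex_dump { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

# Plain UTF-16 on output: BOM first, then the data, both big-endian.
my $space = ' ';
my $auto  = hex_dump(encode('UTF-16', $space));

# Sending an explicit BOM as well yields a second U+FEFF in the data.
my $bom_and_space = "\x{FEFF} ";
my $twice = hex_dump(encode('UTF-16', $bom_and_space));

print "$auto\n";     # FE FF 00 20
print "$twice\n";    # FE FF FE FF 00 20
```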

--
Affijn, Ruud

"Gewoon is een tijger."


 
 
MoshiachNow
 
      08-29-2006
Thanks,

I did exactly as advised above, but when I check the output in a binary
display I see that the two bytes within each 16-bit word are swapped (see
below).
I have also tried "utf16-LE"; this did not help.

Good utf16 input file:
FF FE 57 00 69 00 6E 00

Bad output file:
FE FF 00 57 00 69 00 6E

(Indeed, the print FILE "\x{FEFF}"; statement does not seem to be
required, since Perl takes care of it internally.)

So what can still be wrong?

 
 
Dr.Ruud
 
      08-29-2006
MoshiachNow wrote:

> all bytes are interchanged within the words


That is the UTF16-LE order, so it would have been wrong if you had
seen anything else. Do you understand the role of the BOM (Byte
Order Mark) now?
http://en.wikipedia.org/wiki/Byte_Order_Mark

Create a fresh file in Notepad with just the word "test" in it, and do a
File/Save As..., with Encoding "Unicode", and you'll see that Windows
defaults to UTF16-LE.

You'll also find an Encoding "Unicode big-endian" there, that is
UTF16-BE. But why would you want the bytes in a different order than the
default for the platform?
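
To make the byte orders concrete, the following sketch dumps the word "Win" under the three UTF-16 encoding names that Encode offers, plus UTF-16LE with an explicit U+FEFF in front. The `hex_dump` helper and the hash keys are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: space-separated hex digits for a byte string.
sub hex_dump { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $text = 'Win';
my %dump;
for my $enc ('UTF-16', 'UTF-16BE', 'UTF-16LE') {
    my $copy = $text;
    $dump{$enc} = hex_dump(encode($enc, $copy));
}

# Little-endian data with a BOM: prepend U+FEFF to the text before encoding.
my $bom_text = "\x{FEFF}$text";
$dump{'BOM + UTF-16LE'} = hex_dump(encode('UTF-16LE', $bom_text));

printf "%-15s %s\n", $_, $dump{$_} for sort keys %dump;
```

Of the four results, only the last one (FF FE 57 00 69 00 6E 00) has the same layout as a file saved by Notepad as "Unicode"; plain UTF-16 gives the big-endian layout with a BOM.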

--
Affijn, Ruud

"Gewoon is een tijger."


 
 
MoshiachNow
 
      08-29-2006
Hi,

I am running exactly this:

open my $fhi, '<:encoding(UTF-16)', $fni
    or die "open '$fni', stopped $!" ;


open my $fho, '>:encoding(UTF-16)', $fno
    or die "open '$fno', stopped $!" ;

and I expect the input and output files to have the same byte order, but
they do not.

I did try adding the following line; it did not help:

print $fho "\x{FEFF}";
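
One way to get output with the same byte order as the input is sketched below. It rests on the behaviour described in perldoc Encode::Unicode: plain UTF-16 on output means big-endian with a BOM, whereas UTF-16LE fixes the byte order but writes no BOM, so the BOM is printed explicitly. The sub name `rewrite_utf16le` and its parameters are hypothetical, and the substitution is simplified to a literal search and replace.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out, replacing $old with $new on every line.
sub rewrite_utf16le {
    my ($in, $out, $old, $new) = @_;

    # Reading: plain UTF-16 consumes the BOM and detects the byte order.
    open my $fhi, '<:raw:encoding(UTF-16)', $in
        or die "open '$in': $!";

    # Writing: UTF-16LE fixes the byte order but writes no BOM of its own.
    open my $fho, '>:raw:encoding(UTF-16LE)', $out
        or die "open '$out': $!";

    print {$fho} "\x{FEFF}";           # explicit BOM, stored as FF FE
    while (my $line = <$fhi>) {
        $line =~ s/\Q$old\E/$new/g;    # \Q..\E treats $old as literal text
        print {$fho} $line;
    }
    close $fho or die "close '$out': $!";
    close $fhi;
    return;
}
```

A call might look like rewrite_utf16le('Araxi.reg', 'Araxi2.reg', $OLDNAME, $computer). The ':raw' in both modes is there because, on Windows, the default :crlf layer otherwise sits below the encoding layer and can damage UTF-16 line endings; with ':raw' the "\r\n" pairs pass through unchanged as ordinary characters.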

 