Problem handling a Unicode file

 
 
Dr.Ruud
 
      08-29-2006
MoshiachNow wrote:

> I do run exactly this :
> open my $fhi, '<:encoding(UTF-16)', $fni
> or die "open '$fni', stopped $!" ;
>
>
> open my $fho, '>:encoding(UTF-16)', $fno
> or die "open '$fno', stopped $!" ;
>
> and expect input and output files to be in the same order


Why do you expect that? At input, the BOM rules. At output, the platform
rules.
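
For readers who want to check the input/output asymmetry themselves, here is a minimal sketch using the Encode module directly rather than file handles. It assumes Encode's documented behaviour for the plain "UTF-16" name: on decoding the BOM selects the byte order, and on encoding Encode writes big-endian with a BOM when no byte order is given. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Input: the BOM at the start of the data decides the byte order.
my $le = decode('UTF-16', "\xFF\xFE\x41\x00");   # BOM FF FE: little-endian data
my $be = decode('UTF-16', "\xFE\xFF\x00\x41");   # BOM FE FF: big-endian data

# Output: no byte order requested, so Encode prepends a BOM and writes big-endian.
my $out = encode('UTF-16', 'A');
my $hex = join ' ', map { sprintf '%02X', ord } split //, $out;

print "$le $be $hex\n";   # A A FE FF 00 41
```

To force a particular order on output, name it explicitly (UTF-16LE or UTF-16BE) and write the BOM yourself.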

--
Affijn, Ruud

"Gewoon is een tijger."


 
 
 
 
 
MoshiachNow
 
      08-29-2006
Thanks a lot.
I read the article and got the idea of the BOM.

The only thing that got me a valid output file was:

open (FILE, ">:raw:encoding(UTF16-LE)", "Araxi.reg")
    || die "Could not open Araxi.reg: $!";
print FILE "\x{FEFF}";

Any other sequence will not work well.

Thanks !
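
A minimal sketch of the approach above, for anyone who wants to verify the bytes it produces. It writes to a temporary file (a hypothetical stand-in for Araxi.reg), uses a lexical file handle, and reads the result back in raw mode. The sample text "A\r\n" is an assumption for illustration only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my (undef, $path) = tempfile(UNLINK => 1);   # stand-in for Araxi.reg

open my $out, '>:raw:encoding(UTF-16LE)', $path or die "open '$path': $!";
print {$out} "\x{FEFF}";   # BOM written by hand; UTF-16LE does not add one itself
print {$out} "A\r\n";      # :raw turns off CRLF translation, so spell out \r\n
close $out or die "close '$path': $!";

# Read the file back byte by byte to see what actually landed on disk.
open my $in, '<:raw', $path or die "open '$path': $!";
my $bytes = do { local $/; <$in> };
close $in;

my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
print "$hex\n";   # FF FE 41 00 0D 00 0A 00
```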

 
 
 
 
 
Dr.Ruud
 
      08-29-2006
MoshiachNow wrote:

> The only thing that got me a valid output file was:
>
> open (FILE, ">:raw:encoding(UTF16-LE)", "Araxi.reg")
>     || die "Could not open Araxi.reg: $!";
> print FILE "\x{FEFF}";


That layout really hurts my eyes. Next time, quote part of the
article you are replying to, or provide a "> [short summary]".

#!/usr/bin/perl
use warnings ;
use strict ;
use charnames ':short' ;

my ($fni, $ei) = ('Araxi.reg' , ':encoding(utf16)') ;
my ($fno, $eo) = ('Araxi1.reg', ':raw:encoding(utf16le)') ;

open my $fhi, "<$ei", $fni or die "open '$fni': $!" ;
open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
print $fho "\N{BOM}" ;

print $fho "test\n" ;

# ... etc.


Your ":raw" is a good solution.
I tried "binmode $fho" instead, but got a "Wide character in print"
warning. So I put a "use utf8" near the top, but then the BOM was output
as UTF-8; it looks like $fho's IO layer was ignored. A "binmode $fho,
':encoding(utf16le)'" might work too, but I am converted to ":raw" now,
thanks.
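
The binmode variant mentioned above could look roughly like the sketch below. This is an assumption-laden illustration, not something tested in the thread: it passes the full layer list ':raw:encoding(UTF-16LE)' to binmode after a plain open, and it writes to a temporary file instead of Araxi1.reg.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my (undef, $path) = tempfile(UNLINK => 1);   # stand-in for Araxi1.reg

open my $out, '>', $path or die "open '$path': $!";
# Set the layers after opening; :raw first, then the explicit little-endian encoding.
binmode $out, ':raw:encoding(UTF-16LE)' or die "binmode '$path': $!";
print {$out} "\x{FEFF}test\n";   # BOM by hand, then the text
close $out or die "close '$path': $!";

open my $in, '<:raw', $path or die "open '$path': $!";
my $bytes = do { local $/; <$in> };
close $in;

my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
print "$hex\n";   # FF FE 74 00 65 00 73 00 74 00 0A 00
```

Note that a bare "binmode $fho" is the same as ':raw', which leaves no encoding layer on the handle; that fits the "Wide character" warning described above.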

--
Affijn, Ruud

"Gewoon is een tijger."


 
 
Peter J. Holzer
 
      08-29-2006
On 2006-08-29 11:40, Dr.Ruud <(E-Mail Removed)> wrote:
> MoshiachNow wrote:
>> all bytes are interchanged within the words
>
> That
  ^^^^
Could you quote what you mean by "that"? It makes your posting a
bit hard to understand.

> is the UTF16-LE order,


Nope. The sequence MoshiachNow called "bad" is UTF16-BE.

[...]
> You'll also find an Encoding "Unicode big-endian" there, that is
> UTF16-BE. But why would you want the bytes in a different order than
> the default for the platform?


He doesn't. He wants UTF16-LE (what he labeled "good input file") but
gets UTF16-BE instead.
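
To make the two byte orders concrete, here is a small sketch that encodes the same string both ways with the Encode module. The sample text "Hi" and the helper name hex_bytes are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

my $text = "\x{FEFF}Hi";   # BOM followed by two letters

my $le = hex_bytes(encode('UTF-16LE', $text));   # FF FE 48 00 69 00
my $be = hex_bytes(encode('UTF-16BE', $text));   # FE FF 00 48 00 69

print "LE: $le\nBE: $be\n";
```

Within every 16-bit word the two bytes are swapped, which is the "interchanged bytes" effect described earlier in the thread.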

hp


--
_ | Peter J. Holzer | > Why should one invent something that
|_|_) | Sysadmin WSR | > doesn't exist?
| | | (E-Mail Removed) | What else would be the point of inventing?
__/ | http://www.hjp.at/ | -- P. Einstein and V. Gringmuth in desd
 
 
Dr.Ruud
 
      08-30-2006
Peter J. Holzer wrote:
> Dr.Ruud:
>> MoshiachNow:
>>> all bytes are interchanged within the words
>>
>> That
> ^^^^
> Could you quote what you mean by "that"? It makes your posting a
> bit hard to understand.
>
>> is the UTF16-LE order,
>
> Nope. The sequence MoshiachNow called "bad" is UTF16-BE.


Sorry for the confusion. My "That" referred only to the quoted phrase
itself (not to the meaning it had in the original posting). I meant to
say that interchanging the bytes, so that C<print "\x{FEFF}"> shows up
as "FF FE" in a binary display, was the thing to go for.



> [...]
>> You'll also find an Encoding "Unicode big-endian" there, that is
>> UTF16-BE. But why would you want the bytes in a different order than
>> the default for the platform?
>
> He doesn't. He wants UTF16-LE (what he labeled "good input file") but
> gets UTF16-BE instead.


Yes, I got mixed up there, I think because I couldn't understand why he
didn't just go for ':encoding(UTF16)'.


Sidenote:

#!/usr/bin/perl
# Script-ID: utf16.pl
use warnings ;
use strict ;

my ($fno, $eo) = ('utf16.txt', ':encoding(UTF16)') ;
open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
print $fho "\n" ;
__END__

results in a 5 byte file (Windows, Perl 5.8.:
FE FF 00 0D 0A

Does anyone know a good reason why that doesn't result in:
FE FF 00 0D 00 0A
?
(I understand how it happens, but the "why" escapes me.)

With
':raw:encoding(UTF16)'
and
print $fho "\r\n"
one can produce the "right" output of course.

--
Affijn, Ruud

"Gewoon is een tijger."


 
 
Peter J. Holzer
 
      08-30-2006
On 2006-08-30 00:23, Dr.Ruud <(E-Mail Removed)> wrote:
> Sidenote:
>
> #!/usr/bin/perl
> # Script-ID: utf16.pl
> use warnings ;
> use strict ;
>
> my ($fno, $eo) = ('utf16.txt', ':encoding(UTF16)') ;
> open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
> print $fho "\n" ;
> __END__
>
> results in a 5 byte file (Windows, Perl 5.8.:
> FE FF 00 0D 0A
>
> Does anyone know a good reason why that doesn't result in:
> FE FF 00 0D 00 0A
> ?
> (I understand how it happens, but the "why" escapes me.)


I think the "why" is a simple bug.

> With
> ':raw:encoding(UTF16)'
> and
> print $fho "\r\n"
> one can produce the "right" output of course.


It looks like the :crlf layer is applied in the wrong place (after
:encoding(UTF16) instead of before).

my ($fno, $eo) = ('utf16.txt', ':encoding(UTF-16):crlf') ;
open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
print $fho "\n" ;

also produces the right result (for Windows) on Linux, so I guess

my ($fno, $eo) = ('utf16.txt', ':raw:encoding(UTF-16):crlf') ;
open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
print $fho "\n" ;

should work on Windows (don't have a Windows machine at hand to test
it).
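
The layer-order argument above can be tried on any platform with a sketch like the following. The helper bytes_via is hypothetical; it writes a single "\n" through a given layer list into a temporary file and returns the resulting bytes as hex. The second call pushes :crlf explicitly below the encoding as an assumed approximation of the Windows default stack; it is not the exact configuration from the sidenote.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write "\n" through the given layers, return the bytes as hex.
sub bytes_via {
    my ($layers) = @_;
    my (undef, $path) = tempfile(UNLINK => 1);
    open my $out, ">$layers", $path or die "open '$path': $!";
    print {$out} "\n";
    close $out or die "close '$path': $!";
    open my $in, '<:raw', $path or die "open '$path': $!";
    my $bytes = do { local $/; <$in> };
    close $in;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# :crlf on top of the encoding: "\n" becomes "\r\n" first, then both
# characters are encoded as 16-bit units.
my $crlf_on_top = bytes_via(':raw:encoding(UTF-16):crlf');

# :crlf below the encoding: the newline is translated only after encoding,
# at the byte level (the thread reports FE FF 00 0D 0A for this order on Windows).
my $crlf_below = bytes_via(':raw:crlf:encoding(UTF-16)');

print "crlf on top: $crlf_on_top\n";
print "crlf below:  $crlf_below\n";
```

Layers are pushed left to right, so the last one named sits closest to the program and sees the characters first.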

hp


--
_ | Peter J. Holzer | > Why should one invent something that
|_|_) | Sysadmin WSR | > doesn't exist?
| | | (E-Mail Removed) | What else would be the point of inventing?
__/ | http://www.hjp.at/ | -- P. Einstein and V. Gringmuth in desd
 
 
Dr.Ruud
 
      08-30-2006
Peter J. Holzer wrote:
> Dr.Ruud:


>> Sidenote:
>>
>> #!/usr/bin/perl
>> # Script-ID: utf16.pl
>> use warnings ;
>> use strict ;
>>
>> my ($fno, $eo) = ('utf16.txt', ':encoding(UTF16)') ;
>> open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
>> print $fho "\n" ;
>> __END__
>>
>> results in a 5 byte file (Windows, Perl 5.8.:
>> FE FF 00 0D 0A
>>
>> Does anyone know a good reason why that doesn't result in:
>> FE FF 00 0D 00 0A
>> ?
>> (I understand how it happens, but the "why" escapes me.)

>
> I think the "why" is a simple bug.


Yes, I'll report it. (ticket #40255)


> I guess
>
> my ($fno, $eo) = ('utf16.txt', ':raw:encoding(UTF-16):crlf') ;
> open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
> print $fho "\n" ;
>
> should work on Windows (don't have a Windows machine at hand to test
> it).


Yes, that writes the "platform-proper" 6 bytes.

--
Affijn, Ruud

"Gewoon is een tijger."


 
 
 
 