UTF16 and Control M's

 
 
Eileen
Guest
Posts: n/a
 
      07-02-2003
Hi,

I have a text file with CTRL-M's. It is encoded as UTF16. When I try
to search for a string in this file, nothing is found. If I remove the
control-m's in vi, my search works. However, I cannot get the
control-m's to be removed using Perl. I've tried:

my $file= "myfile.xml";
while (<IN>) {
s/\cM//g;
}

and

my $file= "myfile.xml";
while (<IN>) {
s/\x{0x0D00}//g;
}

and

my $file= "myfile.xml";
while (<IN>) {
s/\^M//g;
}

and

while (<IN>) {
s/\cM//g;
}

all to no avail. I've tried it on Unix perl as well as Windows perl.
Again, I can remove the characters with vi (using s/^V^M//g).

Does anyone have any ideas on what to do? If I convert the file to
UTF8, the substitution and subsequent searches work. However, I have
several hundred files to deal with, and they are all encoded as UTF16.

Thanks,

Eileen
 
 
 
 
 
Michael P. Broida
Guest
Posts: n/a
 
      07-02-2003
The Ctrl-M is a "carriage-return" which is \r in Perl.

Mike

Eileen wrote:
> [full quote of the original post snipped]

 
 
 
 
 
Alan J. Flavell
Guest
Posts: n/a
 
      07-03-2003
On Wed, Jul 2, Michael P. Broida staggered uncertainly out onto Usenet
atop a fullquote:

> The Ctrl-M is a "carriage-return" which is \r in Perl.


Beware of Usenauts bearing TOFU.

 
 
Alan J. Flavell
Guest
Posts: n/a
 
      07-03-2003
On Thu, Jul 2, Eileen inscribed on the eternal scroll:

> Sorry, I left out the first part of the script. Here's the full
> script:
>
> #!/usr/local/bin/perl -w


We also recommend "use strict;" around here. Take advantage of all of
Perl's opportunities for helping you identify mistakes.

> $file = "kono.xml";
  ^
  my

> open (IN, $file) or die "cannot open $file\n";


Don't omit "$!" from the error report: it helps to understand the
reason for the failure.
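
As a minimal sketch of what that looks like (the path below is hypothetical and is assumed not to exist, so the open fails on purpose):

```perl
use strict;
use warnings;

# Hypothetical path, chosen so that open() fails and $! gets set.
my $file = "no-such-dir/kono.xml";

# Three-argument open with a lexical filehandle; $! holds the OS error text.
my $ok = open(my $in, '<', $file);
my $message = $ok ? "opened $file" : "cannot open $file: $!";
print "$message\n";
```

In a real script you would normally write `open(...) or die "cannot open $file: $!\n";` instead; the message is kept in a variable here only so the sketch keeps running after the failure.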

> I didn't realize you could specify the encoding of a file in Perl.


Another good reason to [check that you're using at least version
5.8.0 and] take a few moments out to read the introduction to the
new support for Unicode. (In earlier Perls you'd need to explicitly
invoke the relevant module to do this stuff).
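
A rough sketch of specifying the encoding at open time, assuming Perl 5.8 or later and assuming the files start with a byte order mark (the plain "UTF-16" encoding name relies on one; without it you would have to name UTF-16LE or UTF-16BE yourself). The sample file and the sub name read_utf16_without_cr are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Build a small UTF-16 sample file (with BOM) so the sketch is self-contained.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;    # write raw bytes, no layer translation
print {$tmp_fh} encode('UTF-16', "<a>first</a>\r\n<b>second</b>\r\n");
close $tmp_fh or die "cannot close $path: $!\n";

# Return the decoded lines of a UTF-16 file with carriage returns removed.
sub read_utf16_without_cr {
    my ($file) = @_;
    # The encoding layer decodes the byte pairs into characters on read.
    open my $in, '<:encoding(UTF-16)', $file
        or die "cannot open $file: $!\n";
    my @lines;
    while (my $line = <$in>) {
        $line =~ s/\r//g;    # now an ordinary one-character match
        push @lines, $line;
    }
    close $in;
    return @lines;
}

my @lines = read_utf16_without_cr($path);
my @hits  = grep { /second/ } @lines;
print "matched: $_" for @hits;
```

Once the layer has decoded the input, \r (or \cM) matches the carriage return directly and ordinary string searches work on the text.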

> the \x{0x0D00} was identified by one of my Unicode editors, and was a
> stab in the dark on my part


But what have you learned from the experience?

- if you are reading text, and have properly defined the encoding,
then internally your characters can be referenced by their unicode
code point values, _not_ by their externally-encoded bit patterns.

- if, on the other hand, you are reading the data as a bunch of bytes
(i.e. effectively "as binary") then you'd need to handle the byte-pairs
as byte-pairs, not as unicode characters. This is not to be
recommended in current versions of Perl (unless your data is somehow
defective, and you've got to write a fixup routine of some kind).

- the new notation, e.g. \x{263a}, denotes a _wide unicode character_ in
Perl's native unicode representation. That value is the Unicode code
point (in this case the smiley, "U+263a" as the Unicode Consortium's
notation would write it). Don't confuse it with the external coding
representation, which (_if_ you had read utf-16LE coding in binary
format, which I don't recommend) would have been \x3a\x26.
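
To make the distinction between code point and external bytes concrete, here is a small sketch using the Encode module that ships with Perl 5.8 and later. The variable names are just for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, referenced by its Unicode code point U+263A.
my $smiley = "\x{263a}";

# The same character in its external UTF-16LE form: two bytes, low byte first.
my $bytes_le = encode('UTF-16LE', $smiley);

# Carriage return is code point U+000D; in UTF-16LE it is the byte pair 0D 00.
my $cr_bytes = encode('UTF-16LE', "\r");

# Render a byte string as space-separated hex for display.
sub hex_bytes { return join ' ', map { sprintf '%02x', ord } split //, $_[0] }

printf "characters: %d, code point: U+%04X\n", length($smiley), ord($smiley);
printf "smiley as UTF-16LE bytes: %s\n", hex_bytes($bytes_le);
printf "CR as UTF-16LE bytes:     %s\n", hex_bytes($cr_bytes);
```

This is presumably where the 0D00 seen in the editor came from: it is the byte pattern of a carriage return in the file, whereas inside a properly decoded string the character is simply \r, i.e. \x{0D}.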

hope this helps

(You'd also be advised to take a read of
http://web.presby.edu/~nnqadmin/nnq/nquote.html )


P.S. I have the impression that the regulars around here have nominated
me by default as the character encoding spokesman. I must admit that
I'm sometimes at the edge of my expertise, so I _do_ hope they're
watching closely, and will pounce as necessary if I say something
wrong or explain it badly...
 