regexp: \x0a => \x0d\x0a

 
 
Sébastien Cottalorda
 
      11-27-2003
Hi,

In a file, I have \x0a characters and I'd like to replace them with the pair
\x0d\x0a.

How can I do this?

Note: where \x0d\x0a is already present, I don't want to replace that \x0a, of course.

Thanks in advance.

Sébastien
--
[ remove NOSPAM to reply directly ]
 

Brian McCauley
 
      11-27-2003
Sébastien Cottalorda <> writes:

> In a file, I have \x0a characters and I'd like to replace them by the couple
> \x0d\x0a
>
> How can I do ?


What happened when you tried the obvious s/// ?

s/\x0a/\x0d\x0a/g;

(If you've not heard of s/// then you need to go back and do some
basic Perl tutorials).

> Note: If I have already \x0d\x0a, I don't want to replace \x0a of course.


You could use a negative look-behind.

s/(?<!\x0d)\x0a/\x0d\x0a/g;
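
As a minimal sketch of the look-behind substitution applied to a string that mixes both line endings (the helper names lf_to_crlf and hex_bytes are made up for illustration):

```perl
use strict;
use warnings;

# Add \x0d before every \x0a that is not already preceded by one.
sub lf_to_crlf {
    my ($text) = @_;
    $text =~ s/(?<!\x0d)\x0a/\x0d\x0a/g;
    return $text;
}

# Show each character as two hex digits so control characters are visible.
sub hex_bytes {
    my ($text) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $text;
}

# Input: "a" LF "b" CR LF "c" LF
print hex_bytes(lf_to_crlf("a\x0ab\x0d\x0ac\x0a")), "\n";
# prints: 61 0d 0a 62 0d 0a 63 0d 0a
```

The existing \x0d\x0a pair is left alone; only the two bare \x0a characters gain a \x0d.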

--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\
 

Ben Morrow
 
      11-27-2003

Sébastien Cottalorda <> wrote:
> In a file, I have \x0a characters and I'd like to replace them by the couple
> \x0d\x0a
>
> How can I do ?
>
> Note: If I have already \x0d\x0a, I don't want to replace \x0a of course.


If you have 5.8, you can use

perl -Mopen=IN,:raw,OUT,:crlf -pi -e1 <file>

You may need 5.8.1 to avoid double "\r\r\n"s... IIRC this was one of
the bugfixes.
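
For use inside a script rather than as a one-liner, the layer idea might be sketched as below. Note that this is a variation, not the one-liner above: it reads through :crlf instead of :raw, so that existing \r\n pairs are folded to \n before the output layer expands them again. The sub name copy_as_crlf and the file arguments are hypothetical, and the sketch assumes a Unix-like platform where :crlf is not already a default layer.

```perl
use strict;
use warnings;

# Copy $src to $dst, reading and writing through the :crlf layer.
# On input :crlf folds \r\n to \n; on output it expands \n to \r\n.
sub copy_as_crlf {
    my ($src, $dst) = @_;
    open my $in,  '<:crlf', $src or die "open $src: $!";
    open my $out, '>:crlf', $dst or die "open $dst: $!";
    print {$out} $_ while <$in>;
    close $out or die "close $dst: $!";
    close $in;
    return;
}

# Usage (hypothetical file names): perl crlf.pl unix.txt dos.txt
copy_as_crlf(@ARGV) if @ARGV == 2;
```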

Ben

--
. | .
\ / The clueometer is reading zero.
. .
__ <-----@ __
 

Alan J. Flavell
 
      11-27-2003
On Thu, 27 Nov 2003, Ben Morrow wrote:

> You may need 5.8.1 to avoid double "\r\r\n"s... IIRC this was one of
> the bugfixes.


Yes, it's mentioned in the perldelta.

Apropos of which, I suppose I ought at some point to repeat with
5.8.1 the tests that I had reported for 5.8.0 in
http://www.google.com/groups?selm=Pi...lus005.cern.ch
(message )
and related thread, about apparently broken newlines handling with
utf-16LE

Or could you perhaps throw any light, if you're interested, on what I
was seeing there and the subsequent followup?

I don't see anything clearly mentioned in the perldelta for 5.8.1
about *this* particular issue.

cheers
 

Malcolm Dew-Jones
 
      11-27-2003
Sébastien Cottalorda () wrote:
: Hi,

: In a file, I have \x0a characters and I'd like to replace them by the couple
: \x0d\x0a

: How can I do ?

: Note: If I have already \x0d\x0a, I don't want to replace \x0a of course.

What would you do with \x0d\x0a\x0a?

In addition to other techniques, you could

s/\x0d\x0a/\x0a/g; # reduce pairs to singles
s/\x0a/\x0d\x0a/g; # expand singles to pairs
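
A small sketch of the two-pass idea (fold every \x0d\x0a pair to \x0a, then expand every \x0a to \x0d\x0a), compared against a single-pass negative look-behind on a few sample inputs, including \x0d\x0a\x0a. The sub names two_pass and look_behind are invented for this example.

```perl
use strict;
use warnings;

# Pass 1 folds every \x0d\x0a pair to a bare \x0a;
# pass 2 then expands every \x0a to \x0d\x0a.
sub two_pass {
    my ($text) = @_;
    $text =~ s/\x0d\x0a/\x0a/g;
    $text =~ s/\x0a/\x0d\x0a/g;
    return $text;
}

# Single pass with a negative look-behind, for comparison.
sub look_behind {
    my ($text) = @_;
    $text =~ s/(?<!\x0d)\x0a/\x0d\x0a/g;
    return $text;
}

# Print each input as hex and say whether the two approaches agree on it.
for my $input ("\x0d\x0a\x0a", "a\x0ab\x0d\x0ac", "\x0a\x0a") {
    my $same = two_pass($input) eq look_behind($input) ? 'same' : 'different';
    printf "%-12s %s\n", unpack('H*', $input), $same;
}
```

On these inputs both approaches agree; in particular \x0d\x0a\x0a comes out as \x0d\x0a\x0d\x0a either way.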

 

Ben Morrow
 
      11-28-2003
"Alan J. Flavell" <> wrote:
> Apropos of which, I suppose I ought at some point to repeat with
> 5.8.1 the tests that I had reported for 5.8.0 in
> http://www.google.com/groups?selm=Pi....0308170139110.6451%40lxplus005.cern.ch
> (message )
> and related thread, about apparently broken newlines handling with
> utf-16LE
>
> Or could you perhaps throw any light, if you're interested, on what I
> was seeing there and the subsequent followup?


Right... I've done some testing on this, and I would say it's
definitely a bug... Also that it has nothing to do with utf16le,
specifically; rather that it is a problem with the :crlf layer.

Please excuse the rather long post.

All the tests below have exactly the same results with 5.8.0 and
5.8.2. All tests have been run on i686-linux-thread-multi, but as of
5.8 they ought to give the same results on all platforms, given that
all filehandles are explicitly binmode()d. (I could be wrong: if Win32
systems have :crlf pushed by default then it's *definitely* worth
pushing :raw before you do anything else if you're dealing with utf16)

First, input. This is a modified version of your script/test file from
the above post. The output has been line-wrapped for posting.

% od -x utf16
0000000 feff 004e 004f 0054 0045 0053 0020 0046
        ^^^^ BOM (le)
0000020 004f 0052 4120 0041 0044 0044 0049 0054
                  ^^^^ a char >FF
0000040 0049 004f 004e 0041 004c 0020 0041 00a0
                            a char >7F <FF ^^^^
0000060 0055 004e 0044 002e 000d 000a 000d 000a
                      DOSish newlines ^^^^-^^^^
0000100

% cat read
#!/usr/bin/perl

use strict;
use warnings;
use Encode qw/:fallbacks is_utf8 _utf8_on/;
use PerlIO::encoding;

my $bom = "\x{feff}";

# just so we know what's what
$PerlIO::encoding::fallback = FB_PERLQQ;
binmode STDOUT, ":encoding(ascii)";

# the first argument is the list of layers to use
open my $IN, "<$ARGV[0]", "utf16" or die $!;

$\ = "\n"; $, = " ";
$_ = <$IN>;

print "utf8 flag is", is_utf8($_) ? "on" : "off";

# force utf8 flag on if we were given two arguments
$ARGV[1] and _utf8_on($_), print "forcing utf8";

s/^$bom// and print "snipped BOM";
chomp;

# this is a slightly clearer display format
print map {sprintf "%04x", $_} unpack '(U)*', $_;
print;

__END__

% ./read ":encoding(utf16le)"
utf8 flag is on
snipped BOM
004e 004f 0054 0045 0053 0020 0046 004f 0052 4120 0041 0044 0044 0049
0054 0049 004f 004e 0041 004c 0020 0041 00a0 0055 004e 0044 002e 000d
                                     DOSish newline not stripped ^^^^
NOTES FOR\x{4120}ADDITIONAL A\x{00a0}UND.

% ./read ":encoding(utf16le):crlf"
utf8 flag is off
00ef 00bb 00bf 004e 004f 0054 0045 0053 0020 0046 004f 0052
^^^^-^^^^-^^^^ this is \x{feff} in utf8
00e4 0084 00a0 0041 0044 0044 0049 0054 0049 004f 004e 0041
^^^^-^^^^-^^^^ ditto \x{4120}
004c 0020 0041 00c2 00a0 0055 004e 0044 002e
DOSish newline is stripped, however ^^
\x{00ef}\x{00bb}\x{00bf}NOTES FOR\x{00e4}\x{0084}\x{00a0}ADDITIONAL
A\x{00c2}\x{00a0}UND.

% ./read ":encoding(utf16le):crlf" 1
utf8 flag is off
forcing utf8
snipped BOM
004e 004f 0054 0045 0053 0020 0046 004f 0052 4120 0041 0044 0044 0049
0054 0049 004f 004e 0041 004c 0020 0041 00a0 0055 004e 0044 002e
NOTES FOR\x{4120}ADDITIONAL A\x{00a0}UND.

So the problem here is that :crlf fails to set the utf8 flag on the
data when it should. Now, output.

% perl -e'binmode STDOUT, ":encoding(utf16le)";
print "\xa0hello\n\n"' > out
% od -x out
0000000 00a0 0068 0065 006c 006c 006f 000a 000a
0000020

% perl -e'binmode STDOUT, ":crlf:encoding(utf16le)";
print "\xa0hello\n\n"' > out
% od -x out
0000000 00a0 0068 0065 006c 006c 006f 0a0d 0d00
0000020 000a
0000022

This is not actually quite such nonsense as it seems: because 'od -x'
byteswaps everything, the file actually ends '6f 00 0d 0a 00 0d 0a 00',
which is the perfectly reasonable result of treating the binary
UTF16 data as text. So we do the :crlf before the UTF16:

% perl -e'binmode STDOUT, ":encoding(utf16le):crlf";
print "\xa0hello\n\n"' > out
Malformed UTF-8 character (unexpected continuation byte 0xa0, with no
preceding start byte) in null operation.
% od -x out
0000000 0000 0068 0065 006c 006c 006f 000d 000a
0000020 000d 000a
0000024

This last would give the desired result, but seems to have the
converse problem from above: that it is trying to treat as utf8 data
that should be treated as bytes.
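
One way to sidestep the layer-ordering question, offered here only as a sketch and not as a fix for the :crlf behaviour described above, is to skip the layers and do both steps by hand with Encode: decode to characters first, then translate newlines on the character string (and the reverse on output). The sub names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode UTF-16LE bytes, drop a leading BOM, fold \r\n to \n.
sub read_utf16le_text {
    my ($bytes) = @_;
    my $text = decode('UTF-16LE', $bytes);
    $text =~ s/^\x{feff}//;
    $text =~ s/\x0d\x0a/\x0a/g;
    return $text;
}

# Expand \n to \r\n on characters, then encode to UTF-16LE bytes.
sub write_utf16le_text {
    my ($text) = @_;
    $text =~ s/\x0a/\x0d\x0a/g;
    return encode('UTF-16LE', $text);
}

# Same test string as the output experiments above, shown byte by byte.
my $bytes = write_utf16le_text("\xa0hello\x0a\x0a");
print join(' ', unpack('(H2)*', $bytes)), "\n";
```

Because the newline translation happens on characters, each \r and \n is encoded as a full 16-bit unit instead of being spliced into the middle of one.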

Having a look at perlio.c suggests to me (though I can't entirely
follow it) that a :crlf layer always has PERLIO_F_UTF8 off, when in
fact it should check the state of the layer below and set itself
accordingly. Having a think about the issues involved suggests to me
that Microsoft should *really* have taken the opportunity of changing
to utf16 to ditch using \r\n... but there we go.

I would seriously consider not using :crlf at all, but instead writing
a :nl layer that maps any of \n, \r, \r\n to \n on input and any of
\n, \r, \r\n to \r\n on output... seems to me that'd be more use, in
general. I guess it would probably be slower.
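
The mapping such a layer would perform can be sketched with two plain substitutions on a whole string. This is only an illustration of the mapping, not a PerlIO layer; a real layer would also have to cope with a \r that falls at the end of one buffer and a \n at the start of the next. The sub names are invented.

```perl
use strict;
use warnings;

# Input direction: any of \r\n, \r or \n becomes a single \n.
# \r\n is listed first so that a pair counts as one line end, not two.
sub any_to_lf {
    my ($text) = @_;
    $text =~ s/\x0d\x0a|\x0d|\x0a/\x0a/g;
    return $text;
}

# Output direction: any of \r\n, \r or \n becomes \r\n.
sub any_to_crlf {
    my ($text) = @_;
    $text =~ s/\x0d\x0a|\x0d|\x0a/\x0d\x0a/g;
    return $text;
}

# "a" LF "b" CR "c" CR LF: one line end of each kind.
my $mixed = "a\x0ab\x0dc\x0d\x0a";
print unpack('H*', any_to_lf($mixed)),   "\n";   # 610a620a630a
print unpack('H*', any_to_crlf($mixed)), "\n";   # 610d0a620d0a630d0a
```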

Ben

--
Musica Dei donum optimi, trahit homines, trahit deos. |
Musica truces mollit animos, tristesque mentes erigit. |
Musica vel ipsas arbores et horridas movet feras. |
 

John W. Krahn
 
      11-28-2003
Sébastien Cottalorda wrote:
>
> In a file, I have \x0a characters and I'd like to replace them by the couple
> \x0d\x0a
>
> How can I do ?
>
> Note: If I have already \x0d\x0a, I don't want to replace \x0a of course.


perl -i~ -lpe'BEGIN{$/=$\="\x0d\x0a"}s/(?=\x0a)/\x0d/g' yourfile



John
--
use Perl;
program
fulfillment
 

Alan J. Flavell
 
      11-28-2003
On Fri, 28 Nov 2003, Ben Morrow wrote:

> Please excuse the rather long post.


Speaking for myself (and who else is going to do that if I don't?),
I'm extremely grateful to have your input on this, as I had been
beginning to think I was doing something seriously wrong with the
layers. Anyway, some technical detail makes a pleasant change from
the interminable arguments from crabby newbies who want to impose
their TOFU-posting and FAQ-ignorant demands around here.

Incidentally I've found that for utf-8 data the "od -t x1" format is
handy, rather than "od -x".
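
To illustrate the difference between the two views, here is a rough Perl equivalent of each, applied to the eight bytes that end the ":crlf:encoding(utf16le)" output file. The sub names are made up, and word_view assumes the little-endian host on which the od listings were produced.

```perl
use strict;
use warnings;

# One byte at a time, in file order (what "od -t x1" prints).
sub byte_view {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

# 16-bit little-endian words (what "od -x" prints on a little-endian host).
sub word_view {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%04x', $_ } unpack 'v*', $bytes;
}

my $tail = "\x6f\x00\x0d\x0a\x00\x0d\x0a\x00";
print byte_view($tail), "\n";   # 6f 00 0d 0a 00 0d 0a 00
print word_view($tail), "\n";   # 006f 0a0d 0d00 000a
```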

This is only a partial response. I'll be looking at this some more
yet. (Just for interest's sake, actually. I don't actually play
with the Microsoft train sim myself[1], which is what lay behind the
originally posted problem.)

> So the problem here is that :crlf fails to set the utf8 flag on the
> data when it should.


Aha, looks like a key observation...

> This is not actually quite such nonsense as it seems: because 'od -x'
> byteswaps everything,


(that's why I recommend od -t x1 instead...)

> the file actually ends '6f 00 0d 0a 00 0d 0a 00',
> which is the perfectly reasonable result of treating the binary
> UTF16 data as text.


Good point.

> Having a look at perlio.c suggests to me (though I can't entirely
> follow it) that a :crlf layer always has PERLIO_F_UTF8 off, when in
> fact it should check the state of the layer below and set itself
> accordingly.


Sounds right to me. Is one of us expected to call this in as a bug,
or do we have developers lurking who would be willing to take this on?

> Having a think about the issues involved suggests to me
> that Microsoft should *really* have taken the opportunity of changing
> to utf16 to ditch using \r\n... but there we go.


I like that idea, but as you say, it's a bit late for them to do that
now.

> I would seriously consider not using :crlf at all, but instead writing
> a :nl layer that maps any of \n, \r, \r\n to \n on input and any of
> \n, \r, \r\n to \r\n on output... seems to me that'd be more use, in
> general. I guess it would probably be slower.


If it was part of the infrastructure, I doubt that the difference in
speed would be noticeable.

Whenever this topic comes up, there's usually someone who offers
anomalous data and asks what we'd do with it (mixed unix/mac/dos
newlines...), but that's just as much a problem for :crlf as it would
be for your hypothetical :nl, so I don't see it as a show-stopper.

thanks for the observations, anyway. In fact you're clearly ahead of
me. all the best


[1] I will admit to playing with BVE, http://mackoy.cool.ne.jp/
but that's entirely off-topic here!
 

Ben Morrow
 
      11-29-2003
"Alan J. Flavell" <> wrote:
> Incidentally I've found that for utf-8 data the "od -t x1" format is
> handy, rather than "od -x".


Yes, I found that too. -x is good for little-endian stuff, though.

> > I would seriously consider not using :crlf at all, but instead writing
> > a :nl layer that maps any of \n, \r, \r\n to \n on input and any of
> > \n, \r, \r\n to \r\n on output... seems to me that'd be more use, in
> > general. I guess it would probably be slower.


http://morrow.me.uk/PerlIO-nline-0.01.tar.gz

> If it was part of the infrastructure, I doubt that the difference in
> speed would be noticeable.


#!/usr/bin/perl

use Benchmark qw/cmpthese/;
use Fcntl qw/:seek/;

my $teststr = "a\cJb\cMc\cM\cJ";
$/ = undef;

print "Writing mixed:\n";

{
open my $CRLF, ">:crlf", "one";
open my $NLINE, ">:nline", "two";

select((select($CRLF ),$|=1)[0]);
select((select($NLINE),$|=1)[0]);

cmpthese -5, { crlf => sub { print $CRLF $teststr },
nline => sub { print $NLINE $teststr }
};
}

print "Writing just \\n:\n";

{
open my $CRLF, ">:crlf", "one";
open my $NLINE, ">:nline", "two";

select((select($CRLF ),$|=1)[0]);
select((select($NLINE),$|=1)[0]);

cmpthese -5, { crlf => sub { print $CRLF "a\n" },
nline => sub { print $NLINE "a\n" }
};
}

{
open my $RAW, ">:raw", "three";
print $RAW $teststr;
}

print "Reading:\n";

{
open my $CRLF, "<:crlf", "three";
open my $NLINE, "<:nline", "three";

cmpthese -5, { crlf => sub { <$CRLF>; seek $CRLF, 0, SEEK_SET },
nline => sub { <$NLINE>; seek $NLINE, 0, SEEK_SET }
};
}

__END__

Writing mixed:
Rate nline crlf
nline 190612/s -- -23%
crlf 247892/s 30% --
Writing just \n:
Rate nline crlf
nline 229302/s -- -9%
crlf 252560/s 10% --
Reading:
Rate crlf nline
crlf 58405/s -- -0%
nline 58519/s 0% --


Hmmm... not that bad, I suppose, 'specially if you don't use the extra
flexibility.

> Whenever this topic comes up, there's usually someone who offers
> anomalous data and asks what we'd do with it (mixed unix/mac/dos
> newlines...), but that's just as much a problem for :crlf as it would
> be for your hypothetical :nl, so I don't see it as a show-stopper.


I'm /pretty/ sure this layer does the Right Thing in all situations.

Ben

--
For the last month, a large number of PSNs in the Arpa[Inter-]net have been
reporting symptoms of congestion ... These reports have been accompanied by an
increasing number of user complaints ... As of June,... the Arpanet contained
47 nodes and 63 links. [ftp://rtfm.mit.edu/pub/arpaprob.txt] *
 

Alan J. Flavell
 
      11-30-2003
On Sat, 29 Nov 2003, Ben Morrow wrote:

> "Alan J. Flavell" <> wrote:
> > Incidentally I've found that for utf-8 data the "od -t x1" format is
> > handy, rather than "od -x".

>
> Yes, I found that too. -x is good for little-endian stuff, though.


Agreed.

> > > I would seriously consider not using :crlf at all, but instead writing
> > > a :nl layer that maps any of \n, \r, \r\n to \n on input and any of
> > > \n, \r, \r\n to \r\n on output... seems to me that'd be more use, in
> > > general. I guess it would probably be slower.

>
> http://morrow.me.uk/PerlIO-nline-0.01.tar.gz
>
> > If it was part of the infrastructure, I doubt that the difference in
> > speed would be noticeable.


Thanks for the interesting posting! Just to make my meaning clear, I
meant "I doubt that the difference would be noticeable within the
scope of a realistic application". The benchmarking is interesting,
all the same.

Your approach is clearly more versatile. But the :crlf layer ought to
do what it says on the tin, shouldn't it? - and from the previous
discussion, it rather looks as if it isn't doing. Or else I was using
it wrong, but I tried several interpretations - and all the others
seemed to be even worse.

cheers
 