end-of-line conventions

 
 
kj
 
      08-13-2009



There are three major conventions for the end-of-line marker:
"\n", "\r\n", and "\r".

In a variety of situations, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.

These three issues are tested by the following simple script:

my $lines = my $matches = 0;
while (<>) {
    $lines++;
    if (/z$/) {
        $matches++;
        chomp;
        print ">$_<";
    }
}

print "$/$matches matches out of $lines lines$/";
__END__

I have three files, unix.txt, dos.txt, and mac.txt, each containing
four lines. Disregarding the end-of-line character(s) these lines
are "foo", "bar", "baz", "frobozz".

The file unix.txt uses "\n" to separate the lines. The output that
I get when I pass it as the argument to the script is this:

% demo.pl unix.txt
>baz<>frobozz<

2 matches out of 4 lines

The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
uses "\r". Here's the output I get when I pass these files to the
script:

% demo.pl dos.txt

0 matches out of 4 lines
% demo.pl mac.txt

0 matches out of 1 lines

How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

(Mucking with the value of $/ I was able to get <> to split the
input stream at the right places, but it had no impact on the result
of the regular expression match.)

TIA!

kynn
 
 
kj
 
      08-13-2009
Tad J McClellan writes:

>kj wrote:
>>
>>
>> Subject: end-of-line conventions



>Have you read the "Newlines" section in


> perldoc perlport


>??



>> There are three major conventions for the end-of-line marker:
>> "\n", "\r\n", and "\r".
>>
>> In a variety of situation, Perl must split strings into "lines",
>> and must therefore follow a particular convention to identify line
>> boundaries.



>perl detects its platform when it is *compiled*.


>That is, perl decides what line ending to use when it is built.



>> The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
>> uses "\r".


>> How can I change the script so that the output for unix.txt, dos.txt,
>> and mac.txt will be the same as the one shown above for unix.txt?



>You can't.



Mind-blowing, to say the least...

Oh, well. Live and lurn. Thanks. And to Ben too.

kynn
 
 
Heiko Eißfeldt
 
      08-13-2009
kj wrote:

> There are three major conventions for the end-of-line marker:
> "\n", "\r\n", and "\r".


These notations are not unambiguous! See the "Newlines" section of the
perlport documentation for details.

> In a variety of situation, Perl must split strings into "lines",
> and must therefore follow a particular convention to identify line
> boundaries. There are three situations that interest me in
> particular: 1. the splitting into lines that happens when one
> iterates over a file using the <> operator; 2. the meaning of the
> operation performed by chomp; and 3. the meaning of the $ anchor
> in regular expressions.


<> and chomp use the $/ variable for line endings. Since $/ does not
support regular expressions, you cannot use this mechanism for all
types of line endings.

The $ anchor normally matches at the end of the string, or just before
a "\n" at the end of the string. It knows nothing about "\r".
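To make the two points above concrete, here is a minimal sketch. The helper names ends_in_z and chomp_with are made up for illustration and are not part of the original script. It assumes a platform where "\n" is "\012", and only shows how the $ anchor and chomp treat each kind of line ending.

```perl
use strict;
use warnings;

# Hypothetical helper: does /z$/ match this string?
sub ends_in_z {
    my ($s) = @_;
    return $s =~ /z$/ ? 1 : 0;
}

# Hypothetical helper: chomp a copy of $s with $/ set to $sep.
sub chomp_with {
    my ($s, $sep) = @_;
    local $/ = $sep;    # chomp removes exactly this string, nothing else
    chomp $s;
    return $s;
}

# "$" tolerates a trailing "\012" but not a trailing "\015".
for my $s ("baz\012", "baz\015\012", "baz\015") {
    (my $shown = $s) =~ s/\015/\\r/g;
    $shown =~ s/\012/\\n/g;
    print "$shown: ", (ends_in_z($s) ? "matches" : "fails"), "\n";
}
```

So with the default $/ a DOS line keeps its "\015" after chomp, and /z$/ fails because the "z" is followed by "\015" rather than by the end of the string.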

> How can I change the script so that the output for unix.txt, dos.txt,
> and mac.txt will be the same as the one shown above for unix.txt?


use strict;
use warnings;

my $lines = my $matches = 0;
{
    local $/ = undef;
    for (<> =~ m{\G([^\012\015]*) \015?\012?}xmsg) {
        $lines++;
        if (/z$/) {
            $matches++;
            print ">$_<";
        }
    }
}
print "\n$matches matches out of $lines lines\n";
__END__

This uses <> with no line end definition, and iterates with a regular
expression suitable for three types of line endings. The line ending is
not included in $_, so chomp is omitted.
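One thing to watch with a pattern in which every part is optional: it can also match the empty string once more at the very end of the input, which adds an extra empty "line" to the count. A possible variant is sketched below. It is not the code above, just an illustration, and the names split_lines and count_z_lines are hypothetical. It assumes Perl 5.10 or later for the (?|...) branch-reset group, and it works on a string so it can be tried without any files.

```perl
use strict;
use warnings;

# Hypothetical helper: split text on CRLF, lone CR, or lone LF.
# Empty lines are kept, a final unterminated line is kept, and no
# empty element is produced after the last terminator.
sub split_lines {
    my ($text) = @_;
    my @lines = $text =~ m{
        \G
        (?|
            ( [^\012\015]* ) (?: \015\012? | \012 )
          | ( [^\012\015]+ ) \z
        )
    }xg;
    return @lines;
}

# Hypothetical helper: returns (number of lines, number of lines ending in z).
sub count_z_lines {
    my ($text) = @_;
    my @lines   = split_lines($text);
    my $matches = grep { /z$/ } @lines;
    return (scalar @lines, $matches);
}

for my $eol ("\012", "\015\012", "\015") {
    my $text = join '', map { $_ . $eol } qw(foo bar baz frobozz);
    my ($lines, $matches) = count_z_lines($text);
    print "$matches matches out of $lines lines\n";
}
```

The first alternative requires a real terminator and the second requires at least one character before the end of the string, so neither can match an empty string at the end of the input.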

If you need the line endings in $_, use the following lines instead:

    for (<> =~ m{\G([^\012\015]* \015?\012?)}xmsg) {
        $lines++;
        if (/z\s*$/) {
            $matches++;
            s{[\015\012][\015\012]?}{}xms; # chomp replacement

Hope that helps, heiko
 
 
Steve C
 
      08-13-2009
kj wrote:
> There are three major conventions for the end-of-line marker:
> "\n", "\r\n", and "\r".
>
> In a variety of situation, Perl must split strings into "lines",
> and must therefore follow a particular convention to identify line
> boundaries. There are three situations that interest me in
> particular: 1. the splitting into lines that happens when one
> iterates over a file using the <> operator; 2. the meaning of the
> operation performed by chomp; and 3. the meaning of the $ anchor
> in regular expressions.
>
> These three issues are tested by the following simple script:
>
> my $lines = my $matches = 0;
> while (<>) {
> $lines++;
> if (/z$/) {
> $matches++;
> chomp;
> print ">$_<";
> }
> }
>
> print "$/$matches matches out of $lines lines$/";
> __END__
>
> I have three files, unix.txt, dos.txt, and mac.txt, each containing
> four lines. Disregarding the end-of-line character(s) these lines
> are "foo", "bar", "baz", "frobozz".
>
> The file unix.txt uses "\n" to separate the lines. The output that
> I get when I pass it as the argument to the script is this:
>
> % demo.pl unix.txt
>> baz<>frobozz<

> 2 matches out of 4 lines
>
> The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
> uses "\r". Here's the output I get when I pass these files to the
> script:
>
> % demo.pl dos.txt
>
> 0 matches out of 4 lines
> % demo.pl mac.txt
>
> 0 matches out of 1 lines
>
> How can I change the script so that the output for unix.txt, dos.txt,
> and mac.txt will be the same as the one shown above for unix.txt?
>


Since "\n" eq "\012" on unix, you ought to be able to
do something like this to be the same on all platforms:

my $lines = my $matches = 0;

$/ = "\012";
binmode STDIN;
binmode STDOUT;

while (<>) {
    $lines++;
    if (/z\012/) {
        $matches++;
        s/\012//g;
        print ">$_<";
    }
}

print "$/$matches matches out of $lines lines$/";
__END__
 
 
Nathan Keel
 
      08-13-2009
kj wrote:

>
> Mind-blowing, to say the least...
>
> Oh, well. Live and lurn. Thanks. And to Ben too.
>
> kynn


Don't worry, use a real OS (not Windows) and you'll not have to think
about these things, though they are easily dealt with, and you'll have
a lot more benefits as well.
 
 
chris
 
      08-14-2009
kj wrote:
> There are three major conventions for the end-of-line marker:
> "\n", "\r\n", and "\r".
>
> In a variety of situation, Perl must split strings into "lines",
> and must therefore follow a particular convention to identify line
> boundaries. There are three situations that interest me in
> particular: 1. the splitting into lines that happens when one
> iterates over a file using the <> operator; 2. the meaning of the
> operation performed by chomp; and 3. the meaning of the $ anchor
> in regular expressions.
>
> These three issues are tested by the following simple script:
>
> my $lines = my $matches = 0;
> while (<>) {
> $lines++;
> if (/z$/) {
> $matches++;
> chomp;
> print ">$_<";
> }
> }
>
> print "$/$matches matches out of $lines lines$/";
> __END__
>
> I have three files, unix.txt, dos.txt, and mac.txt, each containing
> four lines. Disregarding the end-of-line character(s) these lines
> are "foo", "bar", "baz", "frobozz".


If you're on linux (it seems you are) I would pass any files of dubious
origin through 'mac2unix' and 'dos2unix' first to ensure that your perl
will parse them correctly.
 
 
Steve C
 
      08-14-2009
Ben Morrow wrote:
> Quoth Steve C:
>> Since "\n" eq "\012" on unix, you ought to be able to
>> do something like this to be the same on all platforms:
>>
>> my $lines = my $matches = 0;
>>
>> $/ = "\012";
>> binmode STDIN;
>> binmode STDOUT;
>>
>> while (<>) {
>> $lines++;
>> if (/z\012/) {
>> $matches++;
>> s/\012//g;
>> print ">$_<";
>> }
>> }
>>
>> print "$/$matches matches out of $lines lines$/";
>> __END__

>
> Did you try it? This completely fails with "\r"-separated files, and
> fails to match any lines with "\r\n"-separated files.
>
> Ben
>


I misread the question.
 
 
Jürgen Exner
 
      08-15-2009
kj wrote:
>There are three major conventions for the end-of-line marker:


Yes.

>"\n", "\r\n", and "\r".


No. The end-of-line markers are "\010", "\013\010", and "\013".

"\n" is Perl's short-hand notation for whatever end-of-line marker
combination is used on the current platform, thus it can be any of the
three.

>How can I change the script so that the output for unix.txt, dos.txt,
>and mac.txt will be the same as the one shown above for unix.txt?


If you have to deal with cross-platform files then your best bet is to
explicitly check for each combination individually and not to use the
short-hand "\n".
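As a rough sketch of what checking each combination explicitly could look like (the helper names strip_eol and is_z_line are made up for illustration, and the octal escapes are the corrected \012 and \015 values):

```perl
use strict;
use warnings;

# Hypothetical helper: remove one trailing end-of-line marker, whichever
# convention it follows. CRLF is listed first so it is removed as a unit.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/(?:\015\012|\015|\012)\z//;
    return $line;
}

# Hypothetical helper: true if the line ends in "z", ignoring an optional
# trailing end-of-line marker of any of the three kinds.
sub is_z_line {
    my ($line) = @_;
    return $line =~ /z(?:\015\012|\015|\012)?\z/ ? 1 : 0;
}

for my $line ("baz\012", "baz\015\012", "baz\015", "foo\012") {
    print '>', strip_eol($line), '< ', (is_z_line($line) ? "match" : "no match"), "\n";
}
```

This only covers the matching and chomp parts. Splitting the input into lines in the first place still needs $/ set to the right value for the file, or the whole file read in and split with a pattern.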

jue
 
 
sln@netherlands.com
 
      08-15-2009
On Sat, 15 Aug 2009 23:39:45 +0100, Ben Morrow wrote:

>
>Quoth Jürgen Exner:
>> kj wrote:
>> >There are three major conventions for the end-of-line marker:

>>
>> Yes.
>>
>> >"\n", "\r\n", and "\r".

>>
>> No. The end-of-line markers are "\010", "\013\010", and "\013".

>
>ITYM \012 and \015 there. \0-escapes are in octal.
>

<snip>
>Ben


He meant 10/13 respectively.
Let's get this table going just for grins:

      lf     crlf      cr
dec   10     13,10     13
hex   0a     0d,0a     0d
oct   012    015,012   015

But how should binary intended be interpreted if opened for translation?
Even if ascii and invalidness.

The recovery of a applies to all regexp valid regex cannot create a mixed
mode platform with append. Either all is converted OR invalid, or
none is converted.

No 0a0a0d0d0a0a. Naw, invalid. At best, recover what is possible,
rewrite file, right the ship, destroy old. Don't tell anybody about it.
Delete file, exit with success, or reformat hd, send it to deep magnetic
disk recovery for partial recovery, tracks wiped clean.

-sln
 
 
Jürgen Exner
 
      08-16-2009
Ben Morrow wrote:
>Quoth Jürgen Exner:
>> No. The end-of-line markers are "\010", "\013\010", and "\013".

>
>ITYM \012 and \015 there. \0-escapes are in octal.


Yes, sorry.

>> "\n" is Perl's short-hand notation for whatever end-of-line marker
>> combination is used on the current platform, thus it can be any of the
>> three.

>
>"\n" can *never* mean "\015\012": on Win32 it means "\012", just as on
>Unix.


But then how come the file created by this little program

open FOO, ">" , "foo";
print FOO "k\n" x 20;
close FOO;

is 60 bytes long instead of 40, as would be expected if the 'k' and
the "\n" were each only one byte long?

C:\tmp>dir foo
15-Aug-09 21:13 60 foo
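A likely explanation, going by what perlport says about DOSish perls, is that the extra byte comes from the I/O layer and not from "\n" itself: the string holds "\012", and the :crlf layer on a text-mode handle writes "\015\012" for it. The sketch below is only an illustration of that assumption. The helper name size_with_layer is made up, and the layer is pushed explicitly so the effect can be seen on any platform.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write "k\n" twenty times through the given
# PerlIO layer and return the resulting file size in bytes.
sub size_with_layer {
    my ($layer) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;
    open(my $fh, ">$layer", $path) or die "open: $!";
    print {$fh} "k\n" x 20;
    close($fh) or die "close: $!";
    return -s $path;
}

print "length in memory: ", length("k\n" x 20), "\n";
print "written with :raw:  ", size_with_layer(':raw'),  " bytes\n";
print "written with :crlf: ", size_with_layer(':crlf'), " bytes\n";
```

If that assumption holds, calling binmode on the handle before printing should give the 40-byte file on Win32 as well.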

jue
 