Parsing a text file line-by-line: skipping badly-formed lines?

 
 
denis.papathanasiou@gmail.com
 
      05-14-2007
I have a script which reads a plain text (dos) file line-by-line and
splits it into several smaller files, based on a single attribute.

The code (below) works, except when a line is malformed (i.e., the
line contains binary or control characters), and the script just exits
with an error:

open(IN, "$IN_FILE") or die "\n\terror: Could not read $IN_FILE $!\n";
binmode(IN);
while( $ln=<IN> ) {
    if( $ln =~ m/\r\n$/ ) {
        $ln =~ s/\r\n$/\n/; # dos2unix: convert CR LF to LF
        if( $. > 0 ) { # skip the header line
            $sym = substr($ln, 10, 16);
            $sym =~ s/ //g;
            if( $prior_sym ne $sym ) {
                if( $prior_sym ne '' ) { close(OUT); }
                $sym_file = $OUT_PATH . "/" . $sym . "." . $OUT_SUFFIX ;
                open(OUT, ">$sym_file") or die "\n\terror: Could not write to $sym_file $!\n";
                binmode(OUT);
            }
            print OUT $ln;
            $prior_sym = $sym ;
        }
    }
}
close(IN);

What I'd like it to do, instead, is if it hits a bad line, write a
warning and keep going to the end of the file.

I've tried wrapping the block above in "eval { }; warn $@ if $@;" but
that doesn't trap the error; even with eval/warn, a bad line will
cause the script to exit.

Is there a better way of doing this?

 
Greg Bacon
 
      05-14-2007
denis.papathanasiou@gmail.com wrote:

: I've tried wrapping the block above in "eval { }; warn $@ if $@;" but
: that doesn't trap the error; even with eval/warn, a bad line will
: cause the script to exit.

You say your program exits with an error, but you didn't say what
the error is.

What's the error? What version of perl are you using? What's your
operating system?

Your chances of receiving a helpful reply are even better if you can
provide input that causes the problem. Yes, transmitting non-printable
characters on Usenet is a pain, so uuencode the input or write a Perl
program that can recreate it!

Greg
--
When buying and selling are controlled by legislation, the first
things to be bought and sold are legislators.
-- P. J. O'Rourke
 
denis.papathanasiou@gmail.com
 
      05-14-2007

> You say your program exits with an error, but you didn't say what
> the error is.


My fault, I should have been more precise.

$? actually returns 0 but I know that is incorrect because the output
is not as expected.

The large text file contains data from "A" to "Z", so a successful run
would result in 26 smaller files.

But the output we get stops at "R", so either one of the "R" lines (or
possibly the start of the "S" data) is malformed.

> What's the error? What version of perl are you using? What's your
> operating system?


$ perl -v
This is perl, v5.8.4 built for i386-linux-thread-multi

$ uname -sro
Linux 2.4.27-2-386 GNU/Linux

> Your chances of receiving a helpful reply are even better if you can
> provide input that causes the problem. Yes, transmitting non-printable
> characters on Usenet is a pain, so uuencode the input or write a Perl
> program that can recreate it!


Getting to the exact line with the problem has been surprisingly
difficult: the input file is 14 GB in size, which is too big for the
hex editor we use (shed).

I've also tried split to break up the file into smaller chunks, so I
can load the "R" or "S" chunk into shed and look at the line, but
split suffers the same problem, i.e. it only gets so far through the
original file before it quits, leaving the "S" to "Z" range unsplit.

I'd also thought it might have to do with the $. variable (perhaps at
14 GB, it exceeds perl's ability to count that high?), but removing
that logic from my script didn't change the result.

 
John W. Krahn
 
      05-14-2007
denis.papathanasiou@gmail.com wrote:
> I have a script which reads a plain text (dos) file line-by-line and
> splits it into several smaller files, based on a single attribute.
>
> The code (below) works, except when a line is malformed (i.e., the
> line contains binary or control characters), and the script just exits
> with an error:
>
> open(IN, "$IN_FILE") or die "\n\terror: Could not read $IN_FILE $!


perldoc -q quoting

Also, you should get into the habit of using the three argument form of open:

open IN, '<', $IN_FILE or die "\n\terror: Could not read $IN_FILE $!\n";


> \n"; ;
> binmode(IN);


You can also incorporate that into the open statement:

open IN, '<:raw', $IN_FILE or die "\n\terror: Could not read $IN_FILE $!\n";


> while( $ln=<IN> ) {
> if( $ln =~ m/\r\n$/ ) {
> $ln =~ s/\r\n$/\n/; # dos2unix: convert CR LF to LF


You don't need to match the same pattern twice:

if ( $ln =~ s/\r\n$/\n/ ) {

Or more portable and correct:

if ( $ln =~ s/\015\012\z/\n/ ) {


> if( $. > 0 ) { # skip the header line


$. starts out at 1 so it is *always* greater than 0 (unless you explicitly
change it.)
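
To see the behaviour of $. described above, here is a minimal sketch. It assumes a hypothetical three-line in-memory input whose first line is a header; the data and variable names are illustrative and not taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical three-line input held in memory; the first line is a header.
my $data = "HEADER\r\nrow one\r\nrow two\r\n";
open(my $in, '<', \$data) or die "Could not open in-memory file: $!\n";

my @kept;
while ( my $ln = <$in> ) {
    # $. is already 1 while the first line is being processed, so a test
    # of "$. > 0" is true for every line; "$. > 1" is what skips the header.
    next unless $. > 1;
    push @kept, "$.: $ln";
}
close($in);

print @kept;
```

Changing the test to `$. > 0` keeps all three lines, header included.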


> $sym = substr($ln, 10, 16);
> $sym =~ s/ //g;


Use the three argument open() so you won't have to worry about whitespace in
the file name. However there are other characters that are not valid in a
file name that you should remove such as "\0" and '/'.

$sym =~ tr!\0/!!d;
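
As a rough illustration of that clean-up step, the sketch below builds a made-up record whose symbol field (offset 10, width 16, as in the original substr call) contains a slash and a NUL. The record contents are an assumption for demonstration only.

```perl
use strict;
use warnings;

# Hypothetical record: 10 bytes of prefix, then a 16-byte symbol field
# padded with spaces and deliberately polluted with '/' and NUL.
my $ln  = "134950355 " . "TR/IG\0" . ( ' ' x 10 ) . "rest of record\n";
my $sym = substr( $ln, 10, 16 );

$sym =~ tr/ //d;      # drop the padding spaces
$sym =~ tr!\0/!!d;    # drop NUL and '/', which cannot appear in a Unix file name

print "$sym\n";
```

Whether silently deleting such characters is acceptable, or whether the record should be rejected instead, depends on how the output file names are used downstream.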


> if( $prior_sym ne $sym ) {
> if( $prior_sym ne '' ) { close(OUT); }
> $sym_file = $OUT_PATH . "/" . $sym . "." . $OUT_SUFFIX ;
> open(OUT, ">$sym_file") or die "\n\terror: Could not write to


open OUT, '>:raw', $sym_file or die "\n\terror: Could not write to $sym_file $!\n";


> $sym_file $!\n";
> binmode(OUT);
> }
> print OUT $ln;
> $prior_sym = $sym ;
> }
> }
> }
> close(IN);
>
> What I'd like it to do, instead, is if it hits a bad line, write a
> warning and keep going to the end of the file.
>
> I've tried wrapping the block above in "eval { }; warn $@ if $@;" but
> that doesn't trap the error; even with eval/warn, a bad line will
> cause the script to exit.
>
> Is there a better way of doing this?



John
--
Perl isn't a toolbox, but a small machine shop where you can special-order
certain sorts of tools at low cost and in short order. -- Larry Wall
 
denis.papathanasiou@gmail.com
 
      05-14-2007

> perldoc -q quoting
>
> Also, you should get into the habit of using the three argument form of open:
>
> open IN, '<', $IN_FILE or die "\n\terror: Could not read $IN_FILE $!\n";
>
> > \n"; ;
> > binmode(IN);

>
> You can also incorporate that into the open statement:
>
> open IN, '<:raw', $IN_FILE or die "\n\terror: Could not read $IN_FILE $!\n";


Thanks for the suggestion; I've been working with an old template, and
since it was functional, I never bothered to make it more idiomatic.

> > while( $ln=<IN> ) {
> > if( $ln =~ m/\r\n$/ ) {
> > $ln =~ s/\r\n$/\n/; # dos2unix: convert CR LF to LF

>
> You don't need to match the same pattern twice:
>
> if ( $ln =~ s/\r\n$/\n/ ) {
>
> Or more portable and correct:
>
> if ( $ln =~ s/\015\012\z/\n/ ) {


I'm guilty of some spaghetti there: the dos2unix line was added later,
and I just stuck it in there w/o thinking about the statement before
it.

> > if( $. > 0 ) { # skip the header line

>
> $. starts out at 1 so it is *always* greater than 0 (unless you explicitly
> change it.)


Really? If I leave that statement out, it winds up processing the
first line, but when it's there, it skips the first line.

> > $sym = substr($ln, 10, 16);
> > $sym =~ s/ //g;

>
> Use the three argument open() so you won't have to worry about whitespace in
> the file name. However there are other characters that are not valid in a
> file name that you should remove such as "\0" and '/'.
>
> $sym =~ tr!\0/!!d
>
> > if( $prior_sym ne $sym ) {
> > if( $prior_sym ne '' ) { close(OUT); }
> > $sym_file = $OUT_PATH . "/" . $sym . "." . $OUT_SUFFIX ;
> > open(OUT, ">$sym_file") or die "\n\terror: Could not write to

>
> open OUT, '>:raw', $sym_file or die "\n\terror: Could not write to
> $sym_file $!\n";


These are all great comments, but they don't help with the original
problem: any thoughts on why the block terminates before processing
every line of the original input file?

 
Greg Bacon
 
      05-14-2007
denis.papathanasiou@gmail.com wrote:

: > You say your program exits with an error, but you didn't say what
: > the error is.
:
: My fault, I should have been more precise.

Yes, precision helps in diagnosing technical problems!

Is your program exiting silently, i.e., with no error message?

You wrote that you expected files named A-Z but R is the last
file created. Looking at your logic, your code skips input lines
that don't have CR NL. Is this your intent? Could the lines with
symbols in S-Z be "hidden" in the sense that they fail the test
in the following line?

if( $ln =~ m/\r\n$/ ) {

Debugging output will help you find the problem input. I'd add
at least two warnings:

while( $ln=<IN> ) {
    if( $ln =~ s/\r\n\z/\n/ ) {
        if( $. > 1 ) { # skip the header line
            # the rest of your code...
        }
    }
    else {
        warn "$0: $IN_FILE:$.: skipping...\n";
    }
}

warn "$0: $IN_FILE:$.: exiting...\n";

Hope this helps,
Greg
--
(As far as I can see, it is always a man who makes the [Faustian] agreement.
A woman is more likely to be the contract's benefit than its negotiator.
The assumption is that Old Slewfoot fully controls her. Obviously, the
story is literature.) -- Gary North
 
Martijn Lievaart
 
      05-14-2007
On Mon, 14 May 2007 12:42:00 -0700, denis.papathanasiou wrote:

> These are all great comments, but they don't help with the original
> problem: any thoughts on why the block terminates before processing
> every line of the original input file?


Maybe go back to the good old ways of debugging: add print statements
that tell what the program is doing. Tee this so you save it to a file as
well for later reference, or print to a logfile in the first place.

This will not tell you what is wrong, but may pinpoint the location in
the 14GB file where your program goes wrong.

HTH,
M4

 
denis.papathanasiou@gmail.com
 
      05-14-2007

> Is your program exiting silently, i.e., with no error message?


Yes, $? is 0

> You wrote that you expected files named A-Z but R is the last
> file created. Looking at your logic, your code skips input lines
> that don't have CR NL. Is this your intent? Could the lines with
> symbols in S-Z be "hidden" in the sense that they fail the test
> in the following line?
>
> if( $ln =~ m/\r\n$/ ) {


Yes, that's the intent, because if a line doesn't end in CR, it is
malformed and cannot be parsed further.

While it's likely that there is at least one line that fits that
description (and hence fails the $ln =~ m/\r\n$/ test), the bulk of
the S-Z data *does* end in CR (I verified this by doing a tail on the
input file).

So those lines, i.e. the S-Z lines which do end in CR should not be
skipped.

> Debugging output will help you find the problem input. I'd add
> at least two warnings:
>
> while( $ln=<IN> ) {
>     if( $ln =~ s/\r\n\z/\n/ ) {
>         if( $. > 1 ) { # skip the header line
>             # the rest of your code...
>         }
>     }
>     else {
>         warn "$0: $IN_FILE:$.: skipping...\n";
>     }
> }
>
> warn "$0: $IN_FILE:$.: exiting...\n";
>


Thanks, I'll try that.

In the meantime, I also tried doing a head of the first 120761073
lines (split exits after processing 120761072 lines in total, which is
not the full size of the file), and it gave me an interesting error:

$ head -120761073 qte20070430 > xy.1
head: error reading `qte20070430': Input/output error
$ echo $?
1
$ tail -2 xy.1
134950345PRIG 000008192000000028000008197000000003R
PP000000001715724200 C
134950355TRIG 000008192000000052000008197000000014$

So the last line there has the problem (well-formed lines are 90 bytes
long), but my hex editor doesn't show anything unusual after the "4"
character:

offs asc hex dec oct bin
0135: 0 30 048 060 00110000
0136: 0 30 048 060 00110000
0137: 0 30 048 060 00110000
0138: 0 30 048 060 00110000
0139: 0 30 048 060 00110000
0140: 8 38 056 070 00111000
0141: 1 31 049 061 00110001
0142: 9 39 057 071 00111001
0143: 7 37 055 067 00110111
0144: 0 30 048 060 00110000
0145: 0 30 048 060 00110000
0146: 0 30 048 060 00110000
0147: 0 30 048 060 00110000
0148: 0 30 048 060 00110000
0149: 0 30 048 060 00110000
0150: 0 30 048 060 00110000
0151: 1 31 049 061 00110001
0152: 4 34 052 064 00110100

(end)
152/153 (dec)



 
denis.papathanasiou@gmail.com
 
      05-14-2007
Using the extra warnings gave me this:

$ ./split-file.pl qte20070330
./split-file.pl: qte20070330:120761073: skipping...
134950355TRIG 000008192000000052000008197000000014
$ echo $?
0

Looking at the tail end of the problem line gave me this:

offs asc hex dec oct bin
0119: 0 30 048 060 00110000
0120: 0 30 048 060 00110000
0121: 1 31 049 061 00110001
0122: 4 34 052 064 00110100
0123: 0A 010 012 00001010

The malformed line differs from a well-formed one in that it ends with a
single linefeed character (hex 0a) at the 63rd byte, whereas a well-formed
line is 90 bytes long and ends in carriage return (hex 0d) plus
linefeed (hex 0a).

So it seems that the single linefeed (0a character) fools perl into
thinking that it's come to EOF, terminating the "while( $ln=<IN> )
{ }" loop.

So if that's true, how can I guard against this condition?

 
Greg Bacon
 
      05-14-2007
denis.papathanasiou@gmail.com wrote:

: > You wrote that you expected files named A-Z but R is the last
: > file created. Looking at your logic, your code skips input lines
: > that don't have CR NL. Is this your intent? Could the lines with
: > symbols in S-Z be "hidden" in the sense that they fail the test
: > in the following line?
: >
: > if( $ln =~ m/\r\n$/ ) {
:
: Yes, that's the intent, because if a line doesn't end in CR, it is
: malformed and cannot be parsed further.

Assuming you haven't changed the value of $/ (documented in the
perlvar manpage), $ln contains newline-terminated records, so
control wouldn't reach the above conditional without a newline
at the end.

Note that your regular expression tests for a carriage return
followed by a newline at the end of $ln. Looking at the output
in a followup farther downthread, there's at least one record
that's being ignored because it doesn't have a carriage return.
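
A minimal sketch of that behaviour, assuming a small hypothetical in-memory sample in which the second record ends in a bare LF: with the default $/, the short record is returned like any other line, fails the CR LF test, and the loop carries on to the records after it.

```perl
use strict;
use warnings;

# Hypothetical sample: the second record ends in a bare LF instead of CR LF.
my $data = "good one\015\012short\012good two\015\012";
open(my $in, '<:raw', \$data) or die "Could not open in-memory file: $!\n";

my ( @good, @bad );
while ( my $ln = <$in> ) {
    if ( $ln =~ s/\015\012\z/\n/ ) {
        push @good, $ln;    # well-formed: CR LF converted to LF
    }
    else {
        warn "line $.: no CR LF terminator, skipping\n";
        push @bad, $.;      # remember only the line number
    }
}
close($in);

printf "%d good, %d skipped\n", scalar @good, scalar @bad;
```

In this sketch a lone linefeed does not end the loop, which suggests the early exit on the real file has some other cause.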

You report that head(1) is failing with an I/O error. Can anyone
read the entire input? Does the following command succeed?

wc -l qte20070430
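
On the Perl side, a read error and a normal end of file both make <IN> return undef, so the while loop ends quietly in either case. The sketch below follows the idiom shown in perldoc -f readline to tell the two apart; the in-memory data is a hypothetical stand-in for the real input file.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real input; swap in the actual path.
my $data = "first\015\012second\015\012";
open(my $in, '<:raw', \$data) or die "Could not open input: $!\n";

my $count = 0;
# Test eof explicitly so that an undef from readline can only mean a
# read error, not a normal end of file.
while ( !eof($in) ) {
    defined( my $ln = readline $in )
        or die "read failed after line $count: $!\n";
    $count++;
}
close($in);

print "read $count lines without error\n";
```

Run against a file on a failing disk, a loop of this shape would be expected to die with the system error message instead of stopping silently.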

Greg
--
"Unsustainable," say economists.
"Bubble," say the sourpusses.
"Buy," say the lumpeninvestoriat.
-- Bill Bonner
 