Assistance parsing text file using Text::CSV_XS

 
 
Domenico Discepola
 
      09-01-2004
Hello. I'm trying to parse a text file into a 2-d array using Text::CSV_XS.
The input file is structured as follows. "Fields" are separated with a
"\x0d\x0a" (CRLF) and are enclosed in double-quotes. "Records" are
separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
the need for double-quoting. How can I use Text::CSV_XS to solve my
problem? My code below only outputs the first line in the input file.
Thanks in advance.


#!perl
use strict;
use warnings;
use diagnostics;
use Text::CSV_XS;

our $g_file_input = shift @ARGV;
die "Usage: $0 filename\n" unless $g_file_input;

######
my ( @arr01 );

#Record separator - I tried using this and commenting this out
# local $/ = "\x0c";

my $csv = Text::CSV_XS->new( {'sep_char' => "\x0d\x0a", 'binary' => 1,
'always_quote' => 1 } );

open(TFILE, "< ${g_file_input}") || die "$!";
while (<TFILE>) {

my $line = $_;
my $status = $csv->parse($line) || print "Cannot parse\n";
my @arr_temp = $csv->fields();
push ( @arr01, [@arr_temp]);
print join('|', $_), "\n" for @arr_temp;

#exiting here for debugging only
exit;
}
close (TFILE) || die "$!\n";


 
Scott W Gifford
 
      09-01-2004
"Domenico Discepola" <> writes:

> Hello. I'm trying to parse a text file into a 2-d array using Text::CSV_XS.
> The input file is structured as follows. "Fields" are separated with a
> "\x0d\x0a" (CRLF) and are enclosed in double-quotes. "Records" are
> separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
> the need for double-quoting. How can I use Text::CSV_XS to solve my
> problem? My code below only outputs the first line in the input file.
> Thanks in advance.


Text::CSV_XS assumes that it's handed a full record at a time, and
expects you to independently figure out where one record ends and the
next one begins.

So you have three choices.

The easiest is to use Text::xSV instead of Text::CSV_XS. It handles
embedded newlines as you'd expect, and in general works quite well.
Unfortunately, I've found it to be about 6 times slower than Text::CSV_XS.
If you can't afford that kind of slowdown, read on.

The next easiest thing to do is find record boundaries on your own.
In one application I wrote, I found this worked well; the file I had
always had lines ending in a quote followed by a newline, so I just
kept appending lines to a buffer until I found a quote at the end of a
line that wasn't preceded by an escape character, then passed it on to
Text::CSV_XS. This won't work with all data files, so it might not be
for you.
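
A minimal sketch of that buffering approach, using only core Perl. The sample data, the in-memory filehandle, and the choice of backslash as the escape character are assumptions made for illustration; adjust the test to whatever escape convention your file really uses.

```perl
use strict;
use warnings;

# Hypothetical sample: two records, the first with a newline inside a quoted field.
my $data = qq{"a","line one\nline two"\n"b","c"\n};
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
my $buffer = '';
while ( my $line = <$fh> ) {
    $buffer .= $line;

    # A record is treated as complete when the buffer ends in a quote
    # that is not preceded by a backslash, followed by a newline.
    if ( $buffer =~ /(?<!\\)"\r?\n\z/ ) {
        push @records, $buffer;
        $buffer = '';
    }
}
close $fh;

print scalar(@records), " records found\n";
```

Each element of @records can then be handed to the parser's parse() method. As noted above, this is a heuristic: a field whose embedded line happens to end in a quote will split the record in the wrong place.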

The third option is to take each line, ask Text::CSV_XS to parse it,
and if it fails, append the next line and try again. This should work
with properly formed CSV files, but will behave poorly in the face of
an error; if there's some corruption on the first line, you may not
read anything, since it will keep appending and finding the same
error.

Good luck!

----ScottG.
 
Tad McClellan
 
      09-01-2004
Domenico Discepola <> wrote:

> I'm trying to parse a text file



We need the data as well as the code if we are to be able
to test the code...


> "Records" are
> separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
> the need for double-quoting.



> our $g_file_input = shift @ARGV;



You should always prefer lexical (my) variables over package (our)
variables, except when you can't.


And you can, so make that:

my $g_file_input = shift @ARGV;


> #Record separator - I tried using this and commenting this out
> # local $/ = "\x0c";



If you leave it commented out, then you are reading 1 line at
a time rather than 1 record at a time.

I don't see how it would not be working if uncommented...

.... if I had data to run it against I could try it and see.

But I don't, so I can't. (hint)


> open(TFILE, "< ${g_file_input}") || die "$!";



Why the unnecessary curly braces?


> while (<TFILE>) {
> my $line = $_;



If you want it in $line then put it there rather than putting
it somewhere else only to copy it to where you really want
it to be.

Calling it a "line" when it is not a line is asking for trouble.

while ( my $record = <TFILE> ) { # $record instead of $line


> my @arr_temp = $csv->fields();
> push ( @arr01, [@arr_temp]);



No need to copy all that data, just take a reference directly:

push ( @arr01, \@arr_temp);
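
A small sketch of why taking the reference is safe here (the variable names and values are made up for illustration): a my array declared inside the loop body is a fresh array on every pass, so each stored reference points at its own data.

```perl
use strict;
use warnings;

my @rows;
for my $n ( 1 .. 3 ) {
    my @fields = ( "id$n", $n * 10 );    # new lexical array on each iteration
    push @rows, \@fields;                # store a reference, no copy made
}

# Every row kept its own values.
print join( '|', @$_ ), "\n" for @rows;
```

If the array were declared outside the loop instead, every reference would point at the same array, and the [@arr_temp] copy would be needed.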


--
Tad McClellan SGML consulting
Perl programming
Fort Worth, Texas
 
Anno Siegel
 
      09-01-2004
Scott W Gifford <> wrote in comp.lang.perl.misc:
> "Domenico Discepola" <> writes:
>
> > Hello. I'm trying to parse a text file into a 2-d array using Text::CSV_XS.
> > The input file is structured as follows. "Fields" are separated with a
> > "\x0d\x0a" (CRLF) and are enclosed in double-quotes. "Records" are
> > separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
> > the need for double-quoting. How can I use Text::CSV_XS to solve my
> > problem? My code below only outputs the first line in the input file.
> > Thanks in advance.

>
> Text::CSV_XS assumes that it's handed a full record at a time, and
> expects you to independently figure out where one record ends and the
> next one begins.


Well, *record* separation is easily done in this case. Just set

local $/ = "\x0c";

and use <>, chomp() and whatever as usual to get one record each time.
If CSV_XS isn't upset by embedded linefeeds as such it can do the hard
part.

OP only mentions embedded record separators, not field separators, so
this should work.
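
A minimal sketch of reading one form-feed-terminated record per iteration, using core Perl only. The sample string and the in-memory filehandle are assumptions standing in for the real file.

```perl
use strict;
use warnings;

# Hypothetical data: two records, each terminated by a form feed.
my $data = "first record\r\nstill first\x0csecond record\x0c";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "\x0c";    # the input record separator is now a form feed
    while ( my $record = <$fh> ) {
        chomp $record;    # chomp strips the current value of $/
        push @records, $record;
    }
}
close $fh;

print scalar(@records), " records read\n";
```

The embedded CRLF stays inside the first record, because only the form feed ends a read.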

Anno
 
Brad Baxter
 
      09-01-2004
On Wed, 1 Sep 2004, Anno Siegel wrote:

> Scott W Gifford <> wrote in comp.lang.perl.misc:
> > "Domenico Discepola" <> writes:
> >
> > > Hello. I'm trying to parse a text file into a 2-d array using Text::CSV_XS.
> > > The input file is structured as follows. "Fields" are separated with a
> > > "\x0d\x0a" (CRLF) and are enclosed in double-quotes. "Records" are
> > > separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
> > > the need for double-quoting. How can I use Text::CSV_XS to solve my
> > > problem? My code below only outputs the first line in the input file.
> > > Thanks in advance.

> >
> > Text::CSV_XS assumes that it's handed a full record at a time, and
> > expects you to independently figure out where one record ends and the
> > next one begins.

>
> Well, *record* separation is easily done in this case. Just set
>
> local $/ = "\x0c";
>
> and use <>, chomp() and whatever as usual to get one record each time.
> If CSV_XS isn't upset by embedded linefeeds as such it can do the hard
> part.


It isn't upset if you specify 'binary' => 1 in the new() call.


> OP only mentions embedded record separators, not field separators, so
> this should work.


I see a reference to an 'eol' character in CSV_XS, but it's apparently
only for output--not reading.

Regards,

Brad
 
Domenico Discepola
 
      09-02-2004

> > OP only mentions embedded record separators, not field separators, so
> > this should work.

>
> I see a reference to an 'eol' character in CSV_XS, but it's apparently
> only for output--not reading.
>

Yes, the 'eol' attribute is what confused me into thinking I could use this
module.


 
Domenico Discepola
 
      09-02-2004
"Tad McClellan" <> wrote in message
news:...
> Domenico Discepola <> wrote:
>
> We need the data as well as the code if we are to be able
> to test the code...
>
> > "Records" are
> > separated with a "\x0c" (FF). My fields can contain embedded CRLF's hence
> > the need for double-quoting.


> ... if I had data to run it against I could try it and see.
>
> But I don't, so I can't. (hint)


I will reproduce the data here, but because it contains embedded binary
characters, I can only "simulate" them:

begin sample data file

"field 1: value1"\n"field 2: value2a\nvalue2b"\n"field 3: value3"\n\x0c
"field 4: value 4"\n"field 5: value5"\n\x0c

end sample data file

This data was exported from a Lotus Notes database using the structured text
format. Note that each "record" can contain different "fields" (as is shown
in the sample data).
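
For data shaped like the sample above, one possible sketch builds the 2-d array with core Perl alone, without relying on Text::CSV_XS: read one form-feed-terminated record at a time, then pull out each double-quoted field with a regex. The in-memory copy of the sample and the assumption that a literal quote inside a field is written as a doubled quote ("") are mine, not something the export format is known to guarantee.

```perl
use strict;
use warnings;

# The sample data from above, with the simulated characters made real.
my $data = join '',
    qq{"field 1: value1"\n"field 2: value2a\nvalue2b"\n"field 3: value3"\n\x0c},
    qq{"field 4: value 4"\n"field 5: value5"\n\x0c};
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @table;    # one array reference per record
{
    local $/ = "\x0c";    # form feed terminates a record
    while ( my $record = <$fh> ) {
        chomp $record;

        # Capture the contents of every quoted field; newlines inside
        # the quotes are kept. Assumes embedded quotes are doubled.
        my @fields = $record =~ /"((?:[^"]|"")*)"/g;
        s/""/"/g for @fields;

        push @table, \@fields if @fields;
    }
}
close $fh;

printf "record %d has %d fields\n", $_ + 1, scalar @{ $table[$_] } for 0 .. $#table;
```

Because the separators between fields are simply skipped, the same loop works whether the fields are separated by "\n" as in the sample or by CRLF as in the real file, and records may have different numbers of fields.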


 