CSV Parsing algorithms in Java

 
 
Arne Vajhøj
 
      11-04-2006
Simon Brooke wrote:
> Heavens, writing a CSV parser is trivial. It's simply a case of a
> StringTokenizer in a for loop:


Hmmm.

In the real world programmers usually have to deal with
item separators (typically , or ;) inside strings (typically quoted with "),
and with a convention for string delimiters inside strings.

Arne
 
 
 
 
 
Eric Sosman
 
      11-04-2006
Simon Brooke wrote:
>
> Heavens, writing a CSV parser is trivial. It's simply a case of a
> StringTokenizer in a for loop:
> [...]


There is no one official "CSV format," but even the simple
version described at http://www.wotsit.org/ is not parseable by
a mere StringTokenizer (which the JavaDoc calls a "legacy class"
whose use in new code is "discouraged," by the way).

Brooke, 21 Elm Street
// space before '2' should vanish but embedded spaces
// should remain

"Brooke, Simon" , 21 Elm Street
// first comma does not end a field, quotes disappear,
// both spaces surrounding second comma disappear

"Brooke, Simon" , """The Beeches"", Herts"
// doubled quotes become singles, only one of the three
// commas is a field separator, more disappearing and
// retained spaces

"Brooke, Simon" , "21 Elm Street
Apartment 3B"
// embedded newline in second field

Parsing CSV -- even allowing for some variations beyond the
wotsit description -- is not difficult, but not trivial. My own
CSVReader class runs to 376 lines, including JavaDoc. (It could
probably be tightened a bit; I wrote it as an exercise when I was
new to Java and would likely do things differently nowadays.)
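
To make the four cases above concrete, here is a minimal character-by-character sketch. It is not the CSVReader class mentioned above (that code is not shown in this thread); CsvSketch and parse are hypothetical names, and the rules it applies (trim unquoted fields, drop padding around quoted fields, turn doubled quotes into single quotes, keep newlines inside quotes) are only the ones listed in this post.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; CsvSketch and parse are hypothetical names.
class CsvSketch {

    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;  // inside a quoted section right now
        boolean quoted = false;    // the current field had a quoted section

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    field.append(c);              // commas, spaces, newlines are kept
                } else if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                    field.append('"');            // doubled quote becomes one quote
                    i++;
                } else {
                    inQuotes = false;             // closing quote
                }
            } else if (c == '"' && !quoted && field.toString().trim().isEmpty()) {
                field.setLength(0);               // drop spaces before the opening quote
                inQuotes = true;
                quoted = true;
            } else if (c == ',' || c == '\n') {
                // end of field: quoted content is kept verbatim, bare content is trimmed
                record.add(quoted ? field.toString() : field.toString().trim());
                field.setLength(0);
                quoted = false;
                if (c == '\n') {                  // end of record as well
                    records.add(record);
                    record = new ArrayList<>();
                }
            } else if (c == '\r' || (quoted && Character.isWhitespace(c))) {
                // skip the CR of a CRLF pair and any padding after a closing quote
            } else {
                field.append(c);
            }
        }
        // flush a final record that has no trailing newline
        if (field.length() > 0 || quoted || !record.isEmpty()) {
            record.add(quoted ? field.toString() : field.toString().trim());
            records.add(record);
        }
        return records;
    }
}
```

Even this leaves open what to do with blank lines, stray quotes in the middle of a bare field, or an unterminated quote, which is roughly where the remaining few hundred lines of a real reader go.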

--
Eric Sosman
(E-Mail Removed)
 
 
 
 
 
Davide Consonni
 
      11-04-2006
Jeffrey Spoon wrote:


> Hello, has anybody seen well-known/good practice CSV parsing algorithms
> in Java? I've been googling about but can't see anything suitable so
> far. I'm not interested in using library functions, rather implementing
> the algorithm myself (or at least learning how to).
>
> Any pointers appreciated, thanks.


Use a regex; see this:
http://tinyurl.com/ska4z

--
Davide Consonni <(E-Mail Removed)> http://csvtosql.sourceforge.net
"Avremo un bambino. Sara' il mio regalo di Natale." "Ma io mi sarei
accontentato di una cravatta!" -- Woody Allen, da "Prendi i soldi e scappa"

 
 
Chris Uppal
 
      11-05-2006
Jeffrey Spoon wrote:

> Hello, has anybody seen well-known/good practice CSV parsing algorithms
> in Java? I've been googling about but can't see anything suitable so
> far. I'm not interested in using library functions, rather implementing
> the algorithm myself (or at least learning how to).


There is no real specification for CSV. Some places to look for information on
what people /think/ CSV files are like:

http://www.ietf.org/rfc/rfc4180.txt
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
http://www.pobox.com/~qed/bcsv.zip

Note: I'm pretty sure that the RFC's suggested handling of spaces around fields
is wrong -- everybody else seems to think that leading/trailing spaces are
ignored.

-- chris


 
 
Chris Uppal
 
      11-05-2006
Simon Brooke wrote:

> for ( String line = buffy.readLine(); line != null;
> line = buffy.readLine())


CSV fields (and hence CSV records) may span more than one line.
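
One way to see the consequence for a readLine() loop: a physical line is only a complete record if every quote on it has been closed. The sketch below keeps appending lines while the quote count is odd. RecordReaderSketch, readRecord and hasOpenQuote are hypothetical names, and the sketch assumes quotes appear only as field delimiters or as doubled "" escapes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch only; all names here are hypothetical.
class RecordReaderSketch {

    // Returns the next logical record (possibly spanning several physical
    // lines), or null at end of input.
    static String readRecord(BufferedReader in) {
        try {
            String line = in.readLine();
            if (line == null) {
                return null;
            }
            StringBuilder record = new StringBuilder(line);
            // An odd number of quote characters means a quoted field is still open.
            while (hasOpenQuote(record)) {
                String next = in.readLine();
                if (next == null) {
                    break;                        // unterminated quote: return what we have
                }
                // readLine() strips the terminator, so put a newline back
                record.append('\n').append(next);
            }
            return record.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if the text contains an odd number of '"' characters.
    // A doubled quote toggles twice, so it does not change the answer.
    static boolean hasOpenQuote(CharSequence s) {
        boolean open = false;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '"') {
                open = !open;
            }
        }
        return open;
    }
}
```

Note that the original line terminator (LF or CRLF) inside the field is lost this way, and the record still has to be split into fields afterwards.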


> StringTokenizer tok =
> new StringTokenizer( line,
> separatorChars);


Nothing based on naive use of pattern matching can possibly parse CSV since
fields may contain separator tokens. Indeed a field may contain an entire
CSV-format sub-file (and so on recursively).
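
A small demonstration of the point, using only java.util.StringTokenizer and String.split from the standard library (NaiveSplitDemo and its method names are made up for this example):

```java
import java.util.StringTokenizer;

// Illustrative sketch only; NaiveSplitDemo is a hypothetical name.
class NaiveSplitDemo {

    // Naive approach: treat every comma as a field separator.
    static String[] splitOnCommas(String line) {
        return line.split(",", -1);   // -1 keeps trailing empty fields
    }

    // Number of tokens java.util.StringTokenizer reports for a line.
    static int countTokens(String line) {
        return new StringTokenizer(line, ",").countTokens();
    }

    public static void main(String[] args) {
        // Two logical fields, but the comma inside the quotes is split on too.
        String quoted = "\"Brooke, Simon\",21 Elm Street";
        System.out.println(splitOnCommas(quoted).length);   // prints 3

        // Three logical fields (the middle one empty), but StringTokenizer
        // collapses adjacent delimiters and never reports the empty field.
        System.out.println(countTokens("a,,b"));            // prints 2
    }
}
```

So the tokenizer both over-splits quoted fields and silently drops empty ones, before multi-line fields even come into it.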

If /I/ had set this exercise then my (hidden) purpose would have been to filter
out candidates who don't realise that this is a reasonably complex parsing
task, and not solvable with simple minded tools like regexps[*].

The probability (I think) is that the OP's interviewer was someone who would
have failed my test.

Mind you, I wouldn't have set this task -- too challenging for the context.
Unless, perhaps, I were interviewing for very senior engineers and I was
expecting them to show that they could think realistically under pressure by
answering "that's too complicated to do here and now".

-- chris

([*] Using regexps is nearly always a sign that the program is broken -- there
are not many tasks for which they are (part of) the correct solution.)




 
 
Martin Gregorie
 
      11-05-2006
Stefan Ram wrote:
> Jeffrey Spoon <(E-Mail Removed)> writes:
>> So that's a no then?
>> They did specify that some of the values may contain double quotes.
>> I had two other questions to do as well, in 30 minutes.

>
> Assuming that there are only about 10 minutes to write such a
> parser on paper without any reference, it is difficult, indeed.
>
> Let me try to see, what I can write in 10 minutes without a
> reference
>
> // 2006-11-04T17:48:18+01:00
>
> public class CsvParser
> { private CsvScanner tokenSource;
> public CsvParser( final CsvScanner tokenSource )
> { this.tokenSource = tokenSource; }
>
> // 2006-11-04T17:50:09+01:00
>
> public void parseAll()
> { while( tokenSource.isMoreInSource() )parseLine(); }
>
> // 2006-11-04T17:51:26+01:00
>
> public void parseLine()
> { while( tokenSource.isMoreInLine() )parseValue(); }
>
> // 2006-11-04T17:54:43+01:00
>
> public void parseValue()
> { final Token token = tokenSource.getToken();
> token.to( new TokenProcessor()
> { public void processNumericStart(){ /* todo */ }
> public void processTextStart(){ /* todo */ }
> /* here my time limit was reached */
>
> // 2006-11-04T17:58:31+01:00
>
> Sometimes an interviewer might give you an "impossible"
> task just to see how you cope with that.
>

Clever clogs solution:

- write down the BNF notation for the CSV syntax (about 6 statements)
- say you're going to feed that through a parser generator, e.g. Coco/R
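
For illustration, a grammar of roughly that size might look like the comments in the sketch below, here paired with a hand-written recursive-descent parser rather than Coco/R output. The grammar is an assumption on my part (a strict one, loosely in the style of RFC 4180, with no trimming of spaces), and CsvGrammarSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: one method per grammar rule.
class CsvGrammarSketch {
    private final String src;
    private int pos = 0;

    CsvGrammarSketch(String src) {
        this.src = src;
    }

    // file = record { NEWLINE record } ;
    List<List<String>> file() {
        List<List<String>> records = new ArrayList<>();
        records.add(record());
        while (pos < src.length() && src.charAt(pos) == '\n') {
            pos++;
            if (pos < src.length()) {
                records.add(record());            // a trailing newline adds no record
            }
        }
        if (pos < src.length()) {
            throw new IllegalArgumentException("unexpected character at " + pos);
        }
        return records;
    }

    // record = field { "," field } ;
    private List<String> record() {
        List<String> fields = new ArrayList<>();
        fields.add(field());
        while (pos < src.length() && src.charAt(pos) == ',') {
            pos++;
            fields.add(field());
        }
        return fields;
    }

    // field = quoted | bare ;
    private String field() {
        return pos < src.length() && src.charAt(pos) == '"' ? quoted() : bare();
    }

    // quoted = '"' { CHAR | '""' } '"' ;
    private String quoted() {
        StringBuilder sb = new StringBuilder();
        pos++;                                    // consume the opening quote
        while (pos < src.length()) {
            char c = src.charAt(pos++);
            if (c != '"') {
                sb.append(c);
            } else if (pos < src.length() && src.charAt(pos) == '"') {
                sb.append('"');                   // doubled quote is a literal quote
                pos++;
            } else {
                return sb.toString();             // closing quote
            }
        }
        throw new IllegalArgumentException("unterminated quoted field");
    }

    // bare = { any character except ',' '"' NEWLINE } ;
    private String bare() {
        int start = pos;
        while (pos < src.length() && ",\"\n".indexOf(src.charAt(pos)) < 0) {
            pos++;
        }
        return src.substring(start, pos);
    }
}
```

Usage would be along the lines of new CsvGrammarSketch(text).file(). Because the grammar is strict, padding around a quoted field is rejected rather than ignored.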


--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
 
 
Jeffrey Spoon
 
      11-05-2006
In message <(E-Mail Removed)>, Simon
Brooke <(E-Mail Removed)> writes

>>
>> Thanks to the others who suggested as well, I'll get around to them.

>
>Heavens, writing a CSV parser is trivial. It's simply a case of a
>StringTokenizer in a for loop:
>


Except I wasn't allowed to use String Tokenizer, as I said in the
original post, "I'm not interested in using library functions".



--
Jeffrey Spoon

 
 
Jeffrey Spoon
 
      11-05-2006
In message <(E-Mail Removed)-berlin.de>, Stefan Ram
<(E-Mail Removed)-berlin.de> writes

>
>// 2006-11-04T17:48:18+01:00
>
>public class CsvParser
>{ private CsvScanner tokenSource;
> public CsvParser( final CsvScanner tokenSource )
> { this.tokenSource = tokenSource; }
>
>// 2006-11-04T17:50:09+01:00
>
> public void parseAll()
> { while( tokenSource.isMoreInSource() )parseLine(); }
>
>// 2006-11-04T17:51:26+01:00
>
> public void parseLine()
> { while( tokenSource.isMoreInLine() )parseValue(); }
>
>// 2006-11-04T17:54:43+01:00
>
> public void parseValue()
> { final Token token = tokenSource.getToken();
> token.to( new TokenProcessor()
> { public void processNumericStart(){ /* todo */ }
> public void processTextStart(){ /* todo */ }
> /* here my time limit was reached */
>
>// 2006-11-04T17:58:31+01:00
>
> Sometimes an interviewer might give you an "impossible"
> task just to see how you cope with that.
>


Interesting, thanks. I certainly have to do some reading on parsing in
general anyway.

Cheers all,



--
Jeffrey Spoon

 
 
Simon Brooke
 
      11-05-2006
in message <(E-Mail Removed)>, Jeffrey Spoon
('(E-Mail Removed)') wrote:

> In message <(E-Mail Removed)>, Simon
> Brooke <(E-Mail Removed)> writes
>
>>>
>>> Thanks to the others who suggested as well, I'll get around to them.

>>
>>Heavens, writing a CSV parser is trivial. It's simply a case of a
>>StringTokenizer in a for loop:

>
> Except I wasn't allowed to use String Tokenizer, as I said in the
> original post, "I'm not interested in using library functions".


Then write your own; it's a trivial thing to do. Here, in fact, is one I
wrote earlier:

/**
 * MIDP does not provide a StringTokenizer. Because this has to be
 * compatible with MIDP we'll provide our own. If you have access to a real
 * StringTokenizer don't use this one - it is minimal and possibly
 * inefficient.
 */
public class StringTokenizer
{
    //~ Instance fields -----------------------------------------------

    /** the source string, which I tokenize */
    private String source = null;

    /** the separator character which I split it on */
    private char sep = ' ';

    /** my current cursor into the string */
    private int cursor = 0;

    //~ Constructors --------------------------------------------------

    /**
     * @param source the source string to separate into tokens
     * @param sep the separator which separates tokens in this source
     */
    public StringTokenizer( String source, char sep )
    {
        super( );
        this.sep = sep;
        this.source = source;
    }

    //~ Methods -------------------------------------------------------

    /**
     * @return true if this tokenizer still has more tokens, else false
     */
    public boolean hasMoreTokens( )
    {
        return ( ( source != null ) && ( cursor < source.length( ) ) );
    }

    /**
     * Test harness only - do not use
     *
     * @param args the string to tokenize, then the separator
     */
    public static void main( String[] args )
    {
        if ( args.length == 2 )
        {
            StringTokenizer tock =
                new StringTokenizer( args[0], args[1].charAt( 0 ) );

            System.out.println( "String is: '" + args[0] + "'" );
            System.out.println( "Separator is: '" + args[1].charAt( 0 ) + "'" );

            for ( int i = 0; tock.hasMoreTokens( ); i++ )
            {
                System.out.println( Integer.toString( i ) + ": '" +
                    tock.nextToken( ) + "'" );
            }
        }
    }

    /**
     * @return the next token from this string tokenizer, or null if there are
     * no more.
     */
    public synchronized String nextToken( )
    {
        String result = null;

        // hasMoreTokens() also guards against a null source string
        if ( hasMoreTokens( ) )
        {
            int end = source.indexOf( sep, cursor );

            if ( end > -1 )
            {
                result = source.substring( cursor, end );
                cursor = end + 1;
            }
            else
            {
                result = source.substring( cursor );
                cursor = source.length( );
            }
        }

        return result;
    }
}


--
(E-Mail Removed) (Simon Brooke) http://www.jasmine.org.uk/~simon/

 