Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Ruby > Splitting a CSV file into 40,000 line chunks

Reply
Thread Tools

Splitting a CSV file into 40,000 line chunks

 
 
James Edward Gray II
Guest
Posts: n/a
 
      11-29-2006
On Nov 29, 2006, at 11:29 AM, Paul Lutus wrote:

> Will Jessop wrote:
>
>> Paul Lutus wrote:
>>>> CSV.open(infile, 'r') do |row|
>>>
>>> Why are you using CSV for this? You aren't parsing the lines into
>>> fields,
>>> so the fact that they contain CSV content has no bearing on the
>>> present
>>> task. Your goal is to split the input file into groups of lines
>>> delimited
>>> by linefeeds, not fields delimited by commas.
>>>
>>> Why not simply read lines from the input file and write them to a
>>> series
>>> of output files, until the input file is exhausted?

>>
>> Because CSV understands csv data with embedded newlines:

>
> A plain-text CSV file uses linefeeds as record delimiters. A
> program that
> uses "readline" or "gets" splits the records just as a sane CSV parser
> would. And IMHO a CSV file should never, ever have linefeeds
> embedded in
> fields.


Your opinion doesn't make you right on this one. The CSV RFC clearly
defines handling for carriage-returns and linefeeds. They certainly
are allowed in fields. Here is a link to the document, in case you
want to read up:

http://www.ietf.org/rfc/rfc4180.txt

Not to use a CSV parser on this task would be shooting yourself in
the foot. The result using a simple File object would be broken and,
much worse, it might look OK for a while. You just can't be sure you
are never going to split a CSV file that has an embedded linefeed in
it (especially since that's perfectly legal), and when you do you
will be responsible for destroying data. There's just no reason for
that.

I know you're a don't-use-a-library guy and you know I disagree.
This is the reason why. The edge cases will get you every time.

James Edward Gray II


 
Reply With Quote
 
 
 
 
Drew Olson
Guest
Posts: n/a
 
      11-29-2006
James -

I'll give FasterCSV a try when I get home from work and out from behind
this proxy. Here's another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I'm trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The code
and output are below. How else can I go about this?

Code:

require 'csv'

infile = ""

if ARGV[0] != nil
infile = ARGV[0]
else
puts "Please enter a file to split"
exit
end

puts "loading file"

rows = CSV.read(infile)

puts "file loaded"

rows.sort!{|x,y| x[3] <=> y[3]}

counter = 1
header = ""
writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')

rows.each_index do |i|
if(i != 0 && i%40000 == 0)
writer.close
counter+=1
writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
writer << header
end
if (i == 0)
header = row
end
writer << row
end

writer.close()

Output:

loading file
The exception unknown software exception (0xc00000fd) occured in the
application at location.....etc

--
Posted via http://www.ruby-forum.com/.

 
Reply With Quote
 
 
 
 
James Edward Gray II
Guest
Posts: n/a
 
      11-29-2006
On Nov 29, 2006, at 12:56 PM, Drew Olson wrote:

> Here's another question: in some cases I need to sort the
> file before splitting it (in this case sorting by the 4th cell in each
> row). However, the current file I'm trying to sort and split is around
> 76 MB and ruby fails when trying to store the CSV as an array. The
> code
> and output are below. How else can I go about this?


Hmm, that's a good question.

Perhaps you can collect just the key values in an Array and then use
those to reorder the lines bit by bit. That's not going to be fast
with any library helping you, but I don't have a better idea.

James Edward Gray II

 
Reply With Quote
 
Reid Thompson
Guest
Posts: n/a
 
      11-29-2006
On Thu, 2006-11-30 at 02:25 +0900, ChrisH wrote:
> Also since he's feeding Excel this is most likely on Windows rather
> than *NIX

not a show stopper..
http://www.research.att.com/sw/tools/uwin/

http://unxutils.sourceforge.net/

 
Reply With Quote
 
Drew Olson
Guest
Posts: n/a
 
      11-29-2006
Reid Thompson wrote:
> On Thu, 2006-11-30 at 02:25 +0900, ChrisH wrote:
>> Also since he's feeding Excel this is most likely on Windows rather
>> than *NIX

> not a show stopper..
> http://www.research.att.com/sw/tools/uwin/
>
> http://unxutils.sourceforge.net/


Reid -

I also have cygwin installed on my box, so if there is a simply way to
do this in unix I'd love to know. My unix is rusty (that's being
generous).

--
Posted via http://www.ruby-forum.com/.

 
Reply With Quote
 
Reid Thompson
Guest
Posts: n/a
 
      11-29-2006
On Thu, 2006-11-30 at 04:44 +0900, Drew Olson wrote:
> Reid Thompson wrote:
> > On Thu, 2006-11-30 at 02:25 +0900, ChrisH wrote:
> >> Also since he's feeding Excel this is most likely on Windows rather
> >> than *NIX

> > not a show stopper..
> > http://www.research.att.com/sw/tools/uwin/
> >
> > http://unxutils.sourceforge.net/

>
> Reid -
>
> I also have cygwin installed on my box, so if there is a simply way to
> do this in unix I'd love to know. My unix is rusty (that's being
> generous).
>

rthompso@jhereg:~$ cat splitandhead
#!/bin/bash

fname=$1
split -l 4000 -a 6 -d $fname $fname

for file in `ls ${fname}0*`
do
sed -e '1i\Col1 Col2 Col3 Col4\' $file > ${file}.tmp
mv ${file}.tmp $file
done



rthompso@jhereg:~$ head testsed
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
rthompso@jhereg:~$ wc testsed
40140 160560 602100 testsed
rthompso@jhereg:~$ time ./splitandhead testsed

real 0m0.499s
user 0m0.140s
sys 0m0.092s
rthompso@jhereg:~$ head -2 testsed0000*
==> testsed000000 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000001 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000002 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000003 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000004 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000005 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000006 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000007 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000008 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000009 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000010 <==
Col1 Col2 Col3 Col4
this is a test
rthompso@jhereg:~$


 
Reply With Quote
 
Reid Thompson
Guest
Posts: n/a
 
      11-29-2006
On Thu, 2006-11-30 at 04:04 +0900, James Edward Gray II wrote:
> On Nov 29, 2006, at 12:56 PM, Drew Olson wrote:
>
> > Here's another question: in some cases I need to sort the
> > file before splitting it (in this case sorting by the 4th cell in each
> > row). However, the current file I'm trying to sort and split is around
> > 76 MB and ruby fails when trying to store the CSV as an array. The
> > code
> > and output are below. How else can I go about this?

>
> Hmm, that's a good question.
>
> Perhaps you can collect just the key values in an Array and then use
> those to reorder the lines bit by bit. That's not going to be fast
> with any library helping you, but I don't have a better idea.
>
> James Edward Gray II

sort -n +4 out
or
sort +4 out


rthompso@jhereg:~$ head -50 out
this is a test 10202
this is a test 23990
this is a test 11056
this is a test 9606
this is a test 28590
this is a test 18264
this is a test 12902
this is a test 12856
this is a test 27571
this is a test 5495
this is a test 15965
this is a test 22229
this is a test 18865
this is a test 31339
this is a test 21913
this is a test 406
this is a test 8602
this is a test 5329
this is a test 10048
this is a test 6458
this is a test 20069
this is a test 19771
this is a test 21844
this is a test 24719
this is a test 30894
this is a test 25239
this is a test 9900
this is a test 1727
this is a test 12042
this is a test 20832
this is a test 23735
this is a test 28768
this is a test 10283
this is a test 390
this is a test 10480
this is a test 1337
this is a test 2745
this is a test 26398
this is a test 32288
this is a test 3797
this is a test 22251
this is a test 458
this is a test 14679
this is a test 29642
this is a test 19943
this is a test 26342
this is a test 24232
rthompso@jhereg:~$ sort +4 out
this is a test 0
this is a test 10
this is a test 10
this is a test 100
this is a test 100
this is a test 100
this is a test 1000
this is a test 1000
this is a test 10001
this is a test 10001
this is a test 10005
this is a test 10006
this is a test 10006
this is a test 10007
this is a test 10007
this is a test 10008
this is a test 10008
this is a test 10009
this is a test 10009
this is a test 1001
this is a test 10010
this is a test 10011
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10013
this is a test 10014
this is a test 10014
this is a test 10015
this is a test 10017
this is a test 10017
this is a test 10017
this is a test 10018
this is a test 10019
this is a test 1002
this is a test 1002
this is a test 10021
this is a test 10026
this is a test 10026
-------------------- OR
rthompso@jhereg:~$ sort -n +4 out
this is a test 0
this is a test 2
this is a test 3
this is a test 3
this is a test 4
this is a test 7
this is a test 10
this is a test 10
this is a test 12
this is a test 13
this is a test 14
this is a test 14
this is a test 15
this is a test 16
this is a test 16
this is a test 20
this is a test 21
this is a test 24
this is a test 25
this is a test 28
this is a test 29
this is a test 29
this is a test 29
this is a test 30
this is a test 30
this is a test 31
this is a test 32
this is a test 32
this is a test 34
this is a test 35
this is a test 35
this is a test 35
this is a test 36
this is a test 36
this is a test 37
this is a test 37
this is a test 38
this is a test 38
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 42
this is a test 42
this is a test 42
this is a test 43
this is a test 45
this is a test 46
this is a test 46
this is a test 47
this is a test 47


 
Reply With Quote
 
Mike Harris
Guest
Posts: n/a
 
      11-29-2006
Drew Olson wrote:

>All -
>
>I've written a script to split a .csv file into smaller .csv files of
>40,000 lines each. The intent here is to break the file down enough so
>that excel does not have issues reading each chunk. My code takes a
>filename from the command line and breaks it down as so:
>
>infile -> xyz.csv
>
>output -> xyz_part_1.csv
> xyz_part_2.csv
> etc...
>
>My code is working but I don't find it very "rubyish". In particular, I
>hate having my index and counter counters and I don't like that I had to
>declare my header variable outside of the loop. Bear in mind here that I
>can not do something like "rows = CSV.open(infile)" because ruby will
>yell and error as the input file is too big (250 mb). Any advice on
>making the code nicer is appreciated. The current code is as follows:
>
>require 'csv'
>
>infile = ARGV[0] if ARGV[0] != nil
>
>counter = 1
>index = 0
>header = ""
>writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
>
>CSV.open(infile, 'r') do |row|
> if(index != 0 && index%40000 == 0)
> writer.close
> counter+=1
> writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
> writer << header
> end
> if (index == 0)
> header = row
> end
> writer << row
> index += 1
>end
>
>writer.close()
>
>
>

I will ignore the CSV issue, not because it isn't important, but simply
because I'm not familar with the csv parser, and this example
sufficiently represents the concept.

For maximum elegance, I would write the code this way. It uses the
helper methods (i.e. not in the stdlib) File#write_fresh, File#to_a and
Enumerable#chunks, all of which I've written at one time or another.
Mentally sub in the appropriate code as desired.

File.to_a('xyz.csv').chunks(40000).each_with_index do |chunk,i|
File.write_fresh("xyz_part_#{i+1}",chunk.join("\n" ))
end

File.to_a returns an array of lines
Enumerable#chunks divides an Enumerable into groups of 40k. A 100k
array would yield 2 40k chunks and a 20k chunk.
File#write_fresh creates the file if it doesn't exist, truncates any
existing file, and writes the 2nd argument to the file.

This version is much prettier than the corresponding version without the
helper methods, but it is also clearer. It is obvious at a glance what
it does. The same can't be said for the version without helper methods.

 
Reply With Quote
 
Mike Harris
Guest
Posts: n/a
 
      11-29-2006
Drew Olson wrote:

>All -
>
>I've written a script to split a .csv file into smaller .csv files of
>40,000 lines each. The intent here is to break the file down enough so
>that excel does not have issues reading each chunk. My code takes a
>filename from the command line and breaks it down as so:
>
>infile -> xyz.csv
>
>output -> xyz_part_1.csv
> xyz_part_2.csv
> etc...
>
>My code is working but I don't find it very "rubyish". In particular, I
>hate having my index and counter counters and I don't like that I had to
>declare my header variable outside of the loop. Bear in mind here that I
>can not do something like "rows = CSV.open(infile)" because ruby will
>yell and error as the input file is too big (250 mb). Any advice on
>making the code nicer is appreciated. The current code is as follows:
>
>require 'csv'
>
>infile = ARGV[0] if ARGV[0] != nil
>
>counter = 1
>index = 0
>header = ""
>writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
>
>CSV.open(infile, 'r') do |row|
> if(index != 0 && index%40000 == 0)
> writer.close
> counter+=1
> writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
> writer << header
> end
> if (index == 0)
> header = row
> end
> writer << row
> index += 1
>end
>
>writer.close()
>
>
>

I will ignore the CSV issue, not because it isn't important, but simply
because I'm not familar with the csv parser, and this example
sufficiently represents the concept.

For maximum elegance, I would write the code this way. It uses the
helper methods (i.e. not in the stdlib) File#write_fresh, File#to_a and
Enumerable#chunks, all of which I've written at one time or another.
Mentally sub in the appropriate code as desired.

File.to_a('xyz.csv').chunks(40000).each_with_index do |chunk,i|
File.write_fresh("xyz_part_#{i+1}",chunk.join("\n" ))
end

File.to_a returns an array of lines
Enumerable#chunks divides an Enumerable into groups of 40k. A 100k
array would yield 2 40k chunks and a 20k chunk.
File#write_fresh creates the file if it doesn't exist, truncates any
existing file, and writes the 2nd argument to the file.

This version is much prettier than the corresponding version without the
helper methods, but it is also clearer. It is obvious at a glance what
it does. The same can't be said for the version without helper methods.

 
Reply With Quote
 
Edwin Fine
Guest
Posts: n/a
 
      11-29-2006
I am perplexed by CSV.open. In IO and File, open returns something that
quacks like an IO object. You can then call gets, puts, read, write and
so on. The CSV open seems to return an array (or gives you a row at a
time). This makes it very hard to write code that can take advantage of
duck typing, e.g.

def get_batch(f, count)
lines = nil
count.times do
break unless line = f.gets
(lines ||= []) << line
end
lines
end

get_batch(File.open("foo"), 40000) # No problem
get_batch(CSV.open("foo", "r"), 40000) # Error

I wonder why it was not patterned more closely on IO? Any thoughts?

--
Posted via http://www.ruby-forum.com/.

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Re: Splitting a list into even size chunks in python? Peter Otten Python 4 03-28-2013 12:15 AM
Re: Splitting a list into even size chunks in python? Arnaud Delobelle Python 0 03-27-2013 08:19 PM
Splitting a list into even size chunks in python? Norah Jones Python 0 03-27-2013 08:06 AM
Re: Splitting up the definitions of a class into different files (splitting public from private)? Mark C++ 0 07-19-2003 04:24 PM
Re: Splitting up the definitions of a class into different files (splitting public from private)? John Ericson C++ 0 07-19-2003 04:03 PM



Advertisments