Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Ruby > Splitting a CSV file into 40,000 line chunks

Reply
Thread Tools

Splitting a CSV file into 40,000 line chunks

 
 
Jamey Cribbs
Guest
Posts: n/a
 
      12-01-2006
Jamey Cribbs wrote:
> Thomas Mueller wrote:
>> 2006/11/30, Drew Olson <(E-Mail Removed)>:
>>> I'll give FasterCSV a try when I get home from work and out from behind
>>> this proxy. Here's another question: in some cases I need to sort the
>>> file before splitting it (in this case sorting by the 4th cell in each
>>> row). However, the current file I'm trying to sort and split is around
>>> 76 MB and ruby fails when trying to store the CSV as an array. The code
>>> and output are below. How else can I go about this?

>
> I'm coming to this party really late, so I hope I don't come across as
> shamelessly plugging KirbyBase, but, you might want to try it for this.
>
> If you are simply trying to take a large csv file, sort it by one of
> its fields, and split it up into smaller files that each contain
> 40,000 records, I think it might work.
>
> Here's some code (not tested, could be incorrect) off the top of my head:
>
>
> require 'kirbybase'
>
> db = KirbyBase.new
>
> tbl = db.create_table(:foo, :field1, :String, :field2, :Integer,
> :field3, :String............................
>
> tbl.import_csv(name_of_csv_file)
>
> rec_count = tbl.total_recs
> last_recno_written_out = 0
>
> while rec_count > 0
> recs = tbl.select { |r| r.recno > last_recno_written_out and r.recno
> < last_recno_written_out + 40000 }.sort(:field4)
>
> ........ here is where you put the code to write these 40,000 recs to
> a csv output file .............
>
> last_recno_written_out = recs.last.recno
>
> rec_count = rec_count - 40000
> end


I realized this morning that the solution I posted last night won't work
because you need the whole dataset sorted *before* you start splitting
it up into 40,000 record files. Oops!

Anyway, in an attempt to recover gracefully from my mistake and also to
give me the opportunity to shamelessly plug another one of my libraries,
I present the following proposed solution that is totally untested and
probably full of holes:

require 'mongoose'
db = Mongoose:atabase.new

db.create_table(:foo) do |tbl|
tbl.add_column(:field1, :string)
tbl.add_column(:field2, :string)
tbl.add_column(:field3, :integer)
tbl.add_indexed_column(:field4, :string)
 
Reply With Quote
 
 
 
 
M. Edward (Ed) Borasky
Guest
Posts: n/a
 
      12-01-2006
Paul Lutus wrote:
> Nice, informative post. There are a lot of issues here, primarily the fact
> that the database under discussion is too big to hold in memory, and it is
> also too big to fit into Excel in one chunk, which appears to be its
> destination.
>
> Most people have begin to drift toward suggesting a database approach,
> rather than anything that involves direct manipulation of the database in
> Ruby. Because of the size of the database and because sorting the records
> is one goal, I have to agree.
>

I haven't "begun to drift" -- I'll flat out say, "Use a %^$&%^$( database!"

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.


 
Reply With Quote
 
 
 
 
Mike Harris
Guest
Posts: n/a
 
      12-01-2006
Louis J Scoras wrote:

> On 11/29/06, James Edward Gray II <(E-Mail Removed)> wrote:
>
>> On Nov 29, 2006, at 12:56 PM, Drew Olson wrote:
>>
>> > Here's another question: in some cases I need to sort the
>> > file before splitting it (in this case sorting by the 4th cell in each
>> > row). However, the current file I'm trying to sort and split is around
>> > 76 MB and ruby fails when trying to store the CSV as an array. The
>> > code
>> > and output are below. How else can I go about this?

>>
>> Hmm, that's a good question.
>>
>> Perhaps you can collect just the key values in an Array and then use
>> those to reorder the lines bit by bit. That's not going to be fast
>> with any library helping you, but I don't have a better idea.
>>
>> James Edward Gray II

>
>
> Indeed. That problem is difficult in general because you need to have
> the whole set of elements in memory before you can begin sorting them.
> As James pointed out, you might be able to use some sort of
> memoization technique to track only the bits relevent to sorting. The
> problem is you'll also need some way to get back to the original
> record.
>
> Depending on how you ending up parsing the records, you might be able
> to store the file position of the start of the record and the record
> length.
>
> Records -> [sort_key, file.pos, record.length]
>
> Then sort those arrays if you can fit them all in memory. Finally,
> you can use the offsets for random access to grab the records and
> stick them into the new files as you've been doing.
>
> Basically, you're looking at a complicated swartzian transformation.
> If it will work depends on how big your records are. If they are
> fairly large, you might be able to pull if off; however, if they're
> small and the problem is only that there are too many records, you'll
> still have a problem.
>
>
> In that case, you might want to just shove them in an RDBMS and let it
> sort it for you.
>
>

Let's say you want to sort by the foo column

Read in all the foo values and sort them
Get every 40,000th value from the list.
Now, upon reading any row, you can determine what page it should go on.
Read the file, get the rows for the first N pages, ignoring the rest of
the rows, where N is a number that won't run you out of memory.
Create the files for those rows
Remove references to the rows you read in.
Repeat with the next N pages until finished.


 
Reply With Quote
 
Drew Olson
Guest
Posts: n/a
 
      12-02-2006
Thanks everyone for ALL the replys. Lots of interesting things to think
about. I'll take a look at using a database approach for this and I'm
looking at FasterCSV now. Also, some very good insight related to
building code from scratch and using libraries.

Another great example of the ruby community at work IMO.

--
Posted via http://www.ruby-forum.com/.

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Re: Splitting a list into even size chunks in python? Peter Otten Python 4 03-28-2013 12:15 AM
Re: Splitting a list into even size chunks in python? Arnaud Delobelle Python 0 03-27-2013 08:19 PM
Splitting a list into even size chunks in python? Norah Jones Python 0 03-27-2013 08:06 AM
Re: Splitting up the definitions of a class into different files (splitting public from private)? Mark C++ 0 07-19-2003 04:24 PM
Re: Splitting up the definitions of a class into different files (splitting public from private)? John Ericson C++ 0 07-19-2003 04:03 PM



Advertisments