Splitting a CSV file into 40,000 line chunks

 
 
James Edward Gray II
 
      11-29-2006
On Nov 29, 2006, at 5:25 PM, Edwin Fine wrote:

> I am perplexed by CSV.open.


Me too.

> I wonder why it was not patterned more closely on IO? Any thoughts?


The author does not believe CSV objects should be IO-like. If you
search the archives for my messages about including FasterCSV in the
standard library, you will run into our debates on this subject.

FasterCSV's open() behaves the way you expect.
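For instance, a minimal sketch of its IO-like open() doing the split
this thread is about (file names are hypothetical):

  require "faster_csv"

  chunk, count = 0, 0
  out = FasterCSV.open("chunk_0.csv", "w")
  FasterCSV.open("data.csv", "r") do |csv|
    csv.each do |row|              # each row comes back as an Array
      if count == 40_000           # start a new chunk every 40,000 rows
        out.close
        chunk += 1
        count = 0
        out = FasterCSV.open("chunk_#{chunk}.csv", "w")
      end
      out << row
      count += 1
    end
  end
  out.close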

James Edward Gray II


 
 
 
 
 
M. Edward (Ed) Borasky
 
      11-30-2006
Paul Lutus wrote:
> James Edward Gray II wrote:
>

[snip]

Hmmmm ... seems like a bunch of folks are going to a bunch of trouble
to baby a tool that can only handle a CSV file with at most 65,536
rows. Excel is broken ... Excel sucks at math ... Excel can't handle
lots of rows and lots of columns ... Excel costs money. Excel was a
great wonder in 1986 -- 20 years later it's lame beyond words.

I'm not sure about the column count limitations, but I know almost any
database will deal with a CSV file with more than 65,536 rows. There are
some awfully good free databases. Did I mention how bad Excel is at
math? Sometimes you *don't* get what you pay for.


--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.


 
 
 
 
 
James Edward Gray II
 
      11-30-2006
On Nov 30, 2006, at 12:36 PM, Paul Lutus wrote:

> Your own code ... er, excuse me, your own library ... will meet your
> requirements exactly, it won't cover cases that are not relevant to
> the problem at hand, it will be much faster overall than existing
> solutions, and you will learn things about Ruby that you would not
> if you used someone else's library.


Now you're guilty of a new sin: encouraging people to reinvent the
wheel. You just can't win, can you?

Different problems require different levels of paranoia. Sometimes a
little code will get you over the hump, but you may be making some
trade-offs when you don't use a robust library. Sometimes those are
even good trade-offs, like sacrificing edge case handling to gain
some speed. Sometimes it's even part of the goal to avoid the
library, like when I built FasterCSV to address some needs CSV wasn't
meeting. As soon as things start getting serious though, *I* usually
feel safer reaching for the library.

The people reading this list have now seen us debate the issue and
can make well-informed decisions about what they think is best.

> On the other hand, if your data does not exploit this CSV trait (few
> real-world CSV databases embed linefeeds)...


Really? How do they handle data with newlines in it?

Which "CSV databases" are you referring to here?

James Edward Gray II


 
 
Louis J Scoras
 
      11-30-2006
On 11/29/06, James Edward Gray II <(E-Mail Removed)> wrote:
> On Nov 29, 2006, at 12:56 PM, Drew Olson wrote:
>
> > Here's another question: in some cases I need to sort the
> > file before splitting it (in this case sorting by the 4th cell in each
> > row). However, the current file I'm trying to sort and split is around
> > 76 MB and ruby fails when trying to store the CSV as an array. The
> > code and output are below. How else can I go about this?

>
> Hmm, that's a good question.
>
> Perhaps you can collect just the key values in an Array and then use
> those to reorder the lines bit by bit. That's not going to be fast
> with any library helping you, but I don't have a better idea.
>
> James Edward Gray II


Indeed. That problem is difficult in general because you need to have
the whole set of elements in memory before you can begin sorting them.
As James pointed out, you might be able to use some sort of
memoization technique to track only the bits relevant to sorting. The
problem is you'll also need some way to get back to the original
record.

Depending on how you end up parsing the records, you might be able
to store the file position of the start of the record and the record
length.

Records -> [sort_key, file.pos, record.length]

Then sort those arrays if you can fit them all in memory. Finally,
you can use the offsets for random access to grab the records and
stick them into the new files as you've been doing.
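A rough sketch of that, assuming one-line records (no embedded
linefeeds), a hypothetical "data.csv", and a single-byte encoding so
lengths count bytes:

  index = []
  File.open("data.csv") do |f|
    until f.eof?
      pos  = f.pos
      line = f.gets
      # [sort_key, file.pos, record.length] per record
      index << [line.chomp.split(",")[3], pos, line.length]
    end
  end

  sorted = index.sort_by { |rec| rec[0] }    # sort on the key alone

  File.open("data.csv") do |f|
    File.open("sorted.csv", "w") do |out|
      sorted.each do |key, pos, len|         # random access by offset
        f.seek(pos)
        out.write(f.read(len))
      end
    end
  end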

Basically, you're looking at a complicated Schwartzian transform.
Whether it will work depends on how big your records are. If they are
fairly large, you might be able to pull it off; however, if they're
small and the problem is only that there are too many records, you'll
still have a problem.


In that case, you might want to just shove them in an RDBMS and let it
sort it for you.


--
Lou.

 
 
Thomas Mueller
 
      11-30-2006
2006/11/30, Drew Olson <(E-Mail Removed)>:
> I'll give FasterCSV a try when I get home from work and out from behind
> this proxy. Here's another question: in some cases I need to sort the
> file before splitting it (in this case sorting by the 4th cell in each
> row). However, the current file I'm trying to sort and split is around
> 76 MB and ruby fails when trying to store the CSV as an array. The code
> and output are below. How else can I go about this?


You could import it into MS Access, sort there and export again. Maybe
you could even do all the splitting up in Access directly, but I don't
know too much about Access to help with that.
And I guess using Access would make it harder to automate this whole
task, in case that's required.

That's not very "rubyish", of course.

Thomas

 
 
James Edward Gray II
 
      12-01-2006
On Nov 30, 2006, at 2:45 PM, Paul Lutus wrote:

> Also, there is a hidden assumption in your position -- that
> libraries, ipso facto, represent robust methods.


> For the newbies, however, it might matter. They might think library
> contents differ from ordinary code.


I sure hope they think that! I know I do.

There's no faster way to find bugs than to bundle up some code and
turn it loose on the world. That leads to more robust code. This is
the reason open source development works so well.

If one of us patches a library, everyone benefits. It's like having
a few hundred extra programmers on your staff.

Yes, I realize I'm overgeneralizing there. There will always be
poorly supported or weak libraries, but someone just forks or
replaces those eventually.

>>> On the other hand, if your data does not exploit this CSV trait (few
>>> real-world CSV databases embed linefeeds)...

>>
>> Really? How do they handle data with newlines in it?

>
> Linefeeds are escaped as though in a normal quoted string. This is
> how I have always dealt with embedded linefeeds, which is why I was
> ignorant of the specification's language on this (an explanation,
> not an excuse).


So a linefeed is \n and then we need to escape the \ so that is \\, I
assume. Interesting.

I would argue that is not CSV, but it's certainly debatable. My
reasoning is that you either need to post process the CSV parsed data
to restore it or use a custom parser that understands CSV plus your
escaping rules.
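For example (a toy line, not from a real dump), restoring the escaped
fields means a second pass over what the parser returns:

  require "faster_csv"

  row = FasterCSV.parse_line('a,"escaped\nlinefeed",b')
  # the post-processing step: turn the literal \n back into a linefeed
  row = row.map { |field| field && field.gsub('\n', "\n") }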

>> Which "CSV databases" are you referring to here?

>
> MySQL, the database I am most familiar with, uses this method for
> import or export of comma- or tab-separated plain-text data. Within
> MySQL's own database protocol, linefeeds really are linefeeds, but an
> imported or exported plain-text table has them escaped within fields.


Wild. I use MySQL every day. Guess I've never dumped a CSV of
linefeed-containing data with it though. (I generally walk the
database myself with a Ruby script and dump with FasterCSV.)

> It just takes longer if all the database handling (not just record
> parsing) must use the same state machine that field parsing must use.


I don't understand this comment. MySQL, like most databases, does
not use CSV internally.

> It's very simple, really. Once you allow the record separator inside a
> field, you give up any chance to parse records quickly.


Have you heard of the FasterCSV library? It's pretty zippy.

> But parsing will necessarily be slow, character by character, the
> entire database scan must use an intelligent parser (no splitting
> records on linefeeds as I have been doing), and the state machine
> needs a few extra states.


You don't really have to parse CSV character by character. FasterCSV
does most of its parsing with a single highly optimized (to avoid
backtracking) regular expression and a few tricks.

Basically you can read line by line and divide into fields. If you
have an unclosed field at the end of the line, you hit an embedded
linefeed. You then just pull and append the next line and continue
eating fields.

The standard CSV library does not do this and that is one of two big
reasons it is so slow.
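A toy version of that trick (not FasterCSV's actual code): in
well-formed CSV, quotes are escaped by doubling, so a line with an odd
number of quote characters must still be inside a field:

  def each_logical_row(io)
    io.each_line do |line|
      # an odd quote count means an unclosed quoted field, i.e. an
      # embedded linefeed -- keep appending lines until it closes
      line << io.gets while line.count('"') % 2 == 1 && !io.eof?
      yield line     # now safe to divide into fields
    end
  end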

James Edward Gray II


 
 
M. Edward (Ed) Borasky
 
      12-01-2006
Thomas Mueller wrote:
> 2006/11/30, Drew Olson <(E-Mail Removed)>:
>> I'll give FasterCSV a try when I get home from work and out from behind
>> this proxy. Here's another question: in some cases I need to sort the
>> file before splitting it (in this case sorting by the 4th cell in each
>> row). However, the current file I'm trying to sort and split is around
>> 76 MB and ruby fails when trying to store the CSV as an array. The code
>> and output are below. How else can I go about this?

>
> You could import it into MS Access, sort there and export again. Maybe
> you could even do all the splitting up in Access directly, but I don't
> know too much about Access to help with that.
> And I guess using Access would make it harder to automate this whole
> task, in case that's required.
>
> That's not very "rubyish", of course
>
> Thomas
>
>

Well ... *databases* are highly Ruby-ish! Every major database,
including, I'm guessing, MS Abscess*, has a simple Ruby interface.
Reading a huge file into memory is the wrong approach.

* Abscess -- An abscess is a localized collection of pus in any part of
the body, caused by an infection.

http://www.nlm.nih.gov/medlineplus/e...cle/001353.htm
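To make the database route concrete, a hedged sketch with the
sqlite3-ruby bindings (file, table, and column names are all invented
for the example):

  require "sqlite3"
  require "faster_csv"

  db = SQLite3::Database.new("scratch.db")
  db.execute("CREATE TABLE rows (f1 TEXT, f2 TEXT, f3 TEXT, f4 TEXT)")

  # stream the CSV in one record at a time -- the file never sits in RAM
  FasterCSV.foreach("data.csv") do |row|
    db.execute("INSERT INTO rows VALUES (?, ?, ?, ?)", *row[0, 4])
  end

  # let the database do the sort on the 4th column
  db.execute("SELECT * FROM rows ORDER BY f4") { |row| p row }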

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.


 
 
M. Edward (Ed) Borasky
 
      12-01-2006
Drew Olson wrote:
> James -
>
> I'll give FasterCSV a try when I get home from work and out from behind
> this proxy. Here's another question: in some cases I need to sort the
> file before splitting it (in this case sorting by the 4th cell in each
> row). However, the current file I'm trying to sort and split is around
> 76 MB and ruby fails when trying to store the CSV as an array. The code
> and output are below. How else can I go about this?
>

On Linux (or Cygwin, since you're on Windows):

$ sort -k 4 -t , <inputfile> > <outputfile>

That will probably put the header line at the bottom of the file or
perhaps the middle, but that's pretty easy to fix.

$ cp <a-file-with-just-the-header-line> <outputfile>
$ grep -v <some-field-in-the-header-and-not-in-the-data> <inputfile> |
sort -k 4 -t , >> <outputfile>

**However**:

If the file is truly CSV (numeric values unquoted, date and time stamps
in official format for your locale, and all text fields quoted) there is
actually a way you can treat it like a table in a database with ODBC.
Open your little ODBC data sources gizmo and create a "File DSN". Use
the "text" driver. It will let you configure it so that the "database"
is the directory where your CSV file lives and your CSV file is a table
in that database. Then anything that can connect to an ODBC Data Source
Name will be able to query your CSV file.
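A hypothetical sketch with the ruby-odbc bindings, assuming a File DSN
named "CsvDir" configured as above; the bracketed table name and the
column name are untested assumptions:

  require "odbc"

  dbh  = ODBC.connect("CsvDir")
  stmt = dbh.run("SELECT * FROM [data.csv] ORDER BY field4")
  stmt.fetch_all.each { |row| p row }
  stmt.drop
  dbh.disconnect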

If the sort is numeric, add a "-n" to the sort command above. If it's
descending, add "-r" and if it's both numeric and descending add "-nr".
"man sort" is your friend.

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.


 
 
Jamey Cribbs
 
      12-01-2006
Thomas Mueller wrote:
> 2006/11/30, Drew Olson <(E-Mail Removed)>:
>> I'll give FasterCSV a try when I get home from work and out from behind
>> this proxy. Here's another question: in some cases I need to sort the
>> file before splitting it (in this case sorting by the 4th cell in each
>> row). However, the current file I'm trying to sort and split is around
>> 76 MB and ruby fails when trying to store the CSV as an array. The code
>> and output are below. How else can I go about this?


I'm coming to this party really late, so I hope I don't come across as
shamelessly plugging KirbyBase, but you might want to try it for this.

If you are simply trying to take a large csv file, sort it by one of its
fields, and split it up into smaller files that each contain 40,000
records, I think it might work.

Here's some code (not tested, could be incorrect) off the top of my head:


require 'kirbybase'

db = KirbyBase.new

tbl = db.create_table(:foo, :field1, :String, :field2, :Integer,
                      :field3, :String)  # ... remaining field definitions elided

tbl.import_csv(name_of_csv_file)

rec_count = tbl.total_recs
last_recno_written_out = 0

while rec_count > 0
  recs = tbl.select { |r|
    r.recno > last_recno_written_out and
      r.recno <= last_recno_written_out + 40000
  }.sort(:field4)

  # ... here is where you put the code to write these 40,000 recs to
  # a csv output file ...

  last_recno_written_out = recs.last.recno
  rec_count = rec_count - 40000
end


KirbyBase will even use FasterCSV for its csv stuff if you have it
installed.


Anyway, hope this helps. If I have totally misunderstood the request,
feel free to ignore!

Jamey Cribbs

 
 
Edwin Fine
 
      12-01-2006
Paul Lutus wrote:
> Edwin Fine wrote:
>
>> I am perplexed by CSV.open. In IO and File, open returns something that
>> quacks like an IO object. You can then call gets, puts, read, write and
>> so on. The CSV open seems to return an array (or gives you a row at a
>> time).

>
> / ...
>
>> I wonder why it was not patterned more closely on IO? Any thoughts?

>
> This is an experience with which I am becoming familiar. Someone
> requests a solution to a problem. Someone else offers the option of
> a library to solve the problem. Then the original problem fades into
> the background, replaced by discussion of the library's problems.
>
> This same pattern has repeated itself about four times in the past
> fortnight, in just this one newsgroup.
> /snip/


Ummm, I am not sure exactly how to interpret the above post, but I see
my name quoted there, so I feel compelled to clarify what I was thinking
in making my original post. I had just written a small Ruby program that
would satisfy the OP's stated problem, but using IO/File. While I was
doing this, more posts appeared, which alerted me to the possibility
that I would have to cater for newlines in the input. "Oh well," I
thought, "I'll just replace every use of IO with CSV, and that will
be that." BZZZ! Wrong! Thank you for playing. I couldn't drop in CSV
instead of IO? WTF???

This is where my perplexity came in. Matz himself has regularly and
clearly stated that he designed Ruby along the Principle Of Least
Surprise (or LOLA, Law of Least Astonishment). Well, I was grievously
surprised and astonished when CSV#open behaved differently from every
open I have used in any language. All the other opens that I know of
return the concept of a handle/object, or some *thing* that can then be
beseeched to bring forth the contents of the actual I/O "device", one
element at a time, or all at once. CSV#open skips this step and goes
straight from open to bringing forth entities, and thereby breaks
possible compatibility with IO/File. IMHO, this is inelegant design.
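The contrast, sketched against the Ruby 1.8 standard library (file
name hypothetical):

  require "csv"

  File.open("data.csv") { |io| io.gets }      # block receives an IO handle
  CSV.open("data.csv", "r") { |row| p row }   # block receives each parsed row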

I have written many, many libraries (not in Ruby) and know how important
it is to present to your users a consistent, clean, orthogonal,
robust, reliable set of interfaces and implementations. That's why it is
inadvisable to release a very early version of a library to a large
audience of end users until it has proven itself in battle, as it were.
Otherwise, you face the prospect of having to change the interface to be
less surprising (*and* keep a backward-compatible, deprecated one) and
re-releasing it to possibly hundreds of users.

The bottom line is, although I am perfectly capable of doing so, I don't
WANT to reinvent the wheel. I really like reusing good, dependable,
predictable code. I haven't had time to look at FasterCSV yet, but I
predict that I will like what I see, because to my mind, from the works
of his I have seen, the author does his best to embody the "Tao" of Ruby
(LOLA). (Although you can never accurately describe the Tao of anything,
I know...)

Well, that's my 2c worth

--
Posted via http://www.ruby-forum.com/.

 