Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Ruby > Efficient parsing of large Excel documents in Ruby

Reply
Thread Tools

Efficient parsing of large Excel documents in Ruby

 
 
Wes Gamble
Guest
Posts: n/a
 
      10-28-2006
All,

I am currently using the parseexcel gem to parse an Excel file so that I
can save it's data into a database (in a Rails app.).

When I run it against an Excel file with ~42000 rows and 11 columns (the
spreadsheet is about 10MB in size), it takes the better part of 10
minutes to parse the file and uses about 200 MB of memory. I'm not
actually sure if the parsing completes successfully - the app. stops
right after the parsing appears to finish.

It would seem that I have two issues - memory usage and performance.

Is anyone aware of a much faster way to parse large Excel files?

Is anyone aware of a "windowing" scheme that uses a set amount of memory
to transfer portions of the data to a client requesting the data so as
to keep the memory required at a fixed level?

Should I look into using the Win32OLE libraries with the hope that it
would be more efficient?

I am transferring this data into a SQL Server database, so perhaps I
could look into some M$-native method of moving this data.

Any thoughts or advice is appreciated.

Thanks,
Wes Gamble

--
Posted via http://www.ruby-forum.com/.

 
Reply With Quote
 
 
 
 
matt neuburg
Guest
Posts: n/a
 
      10-28-2006
Christian Madsen <(E-Mail Removed)> wrote:

> See this link: http://wiki.rubygarden.org/Ruby/page...ScriptingExcel


Interesting. Just as a note, everything that happens on that page can
now be done on Mac using almost the same language, thanks to
rb-appscript (or RubyOSA, which I have not yet tried). It would be
interesting to write a Mac version of the page. m.

--
matt neuburg, phd = http://www.velocityreviews.com/forums/(E-Mail Removed), http://www.tidbits.com/matt/
Tiger - http://www.takecontrolbooks.com/tiger-customizing.html
AppleScript - http://www.amazon.com/gp/product/0596102119
Read TidBITS! It's free and smart. http://www.tidbits.com
 
Reply With Quote
 
 
 
 
Sam Smoot
Guest
Posts: n/a
 
      10-28-2006
I agree with Christian, but with a twist.

I like to use Excel Automation (w/ WIN32OLE) to convert the document to
a format that's easier to work with. That way, after the initial
conversion, you can handle CSV, XML, etc, and be working with a fast
native library instead of the slow interop.

Another reason I prefer this method is that Excel interop is sometimes
hard to get "just right". There's processes I've written with it
running for over a year that I've never had to touch, but there's
another that used to require I login to the server and kill orphaned
Excel processes. Someone changes a column, you don't code defensively
enough, and BAM!

So the simplicity of the Load/Convert process, and getting out of Excel
as quickly as possible, conserving resources and avoiding potential
bugs that might result in orphaned processes is a big advantage IMO.

Plus it's almost gauranteed to be the fastest, and simplest to develop.
(As long as you're OK with having to run on Windows and having Excel
installed anyways.)

 
Reply With Quote
 
Wes Gamble
Guest
Posts: n/a
 
      10-28-2006
I was definitely thinking about using win32ole/Excel to convert this
spreadsheet to CSV and then process it with a CSV parser.

I'm currently running on Linux though, so now I have to figure out any
issues around a Windows production platform.

Thanks,
Wes


--
Posted via http://www.ruby-forum.com/.

 
Reply With Quote
 
M. Edward (Ed) Borasky
Guest
Posts: n/a
 
      10-28-2006
Wes Gamble wrote:
> I was definitely thinking about using win32ole/Excel to convert this
> spreadsheet to CSV and then process it with a CSV parser.
>
> I'm currently running on Linux though, so now I have to figure out any
> issues around a Windows production platform.
>
> Thanks,
> Wes
>
>


Have you tried unixODBC? It seems to work fairly well, although there
are some issues with most Linux/Unix tools when the spreadsheet has tabs
or column headers with spaces in them.

It's fairly easy to leave the Excel spreadsheets on a Windows system and
hack together a simple "server" for them. I think all you have to do is
"publish them to the intranet", but I've never done it. Then you can use
something like Hpricot to parse them.

 
Reply With Quote
 
M. Edward (Ed) Borasky
Guest
Posts: n/a
 
      10-28-2006
Wes Gamble wrote:

> I am transferring this data into a SQL Server database, so perhaps I
> could look into some M$-native method of moving this data.


Yes, there is a native way of doing this in VBA from a macro, which you
can store either in your personal macro workbook on in the spreadsheets
themselves. I have a colleague where I work that built an application
that does it.

But unfortunately, I have no idea what the technique is called. Almost
anyone who's been through some advanced Microsoft training ought to be
able to help you find it, or you might search the Microsoft knowledge
base for "efficient transfer of data to SQL Server".

If you don't have an answer by Tuesday, let me know and I'll ask her
what it's called.

And no, I don't think it involves installing CygWin.

 
Reply With Quote
 
Wes Gamble
Guest
Posts: n/a
 
      10-28-2006
Ed,

Not a bad idea. I'm already using unixODBC to connect to SQL Server
from the Linux box. I totally forgot that you can get to Excel via
ODBC.

Don't I need an ODBC driver for Excel though?

What about a DBI driver for Excel - does one exist?

Thanks,
Wes

--
Posted via http://www.ruby-forum.com/.

 
Reply With Quote
 
Charles Oliver Nutter
Guest
Posts: n/a
 
      10-28-2006
Wes Gamble wrote:
> All,
>
> I am currently using the parseexcel gem to parse an Excel file so that I
> can save it's data into a database (in a Rails app.).


Another rough option would be using JRuby or one of the Java bridges to
call out to POI, Apache's Office-document library.

--
Charles Oliver Nutter, JRuby Core Developer
Blogging on Ruby and Java @ headius.blogspot.com
Help spec out Ruby today! @ www.headius.com/rubyspec
(E-Mail Removed) -- (E-Mail Removed)

 
Reply With Quote
 
Wes Gamble
Guest
Posts: n/a
 
      10-29-2006
Charles Oliver Nutter wrote:
> Wes Gamble wrote:
>> All,
>>
>> I am currently using the parseexcel gem to parse an Excel file so that I
>> can save it's data into a database (in a Rails app.).

>
> Another rough option would be using JRuby or one of the Java bridges to
> call out to POI, Apache's Office-document library.


Charles,

I would love to use JRuby for this app., but don't feel it's quite ready
yet (although you guys are moving it forward at an incredible rate and I
expect it will be a viable production option in less than a year - keep
up the good work).

Does anyone know if the POI Ruby bindings are a potential solution? I
see
"Implement support for reading Excel files (easy)" under the list of
TODOs so not sure if I could use POI-Ruby to read...

Anyone have any experience with POI-Ruby?

Wes


--
Posted via http://www.ruby-forum.com/.

 
Reply With Quote
 
Wes Gamble
Guest
Posts: n/a
 
      10-29-2006
M. Edward (Ed) Borasky wrote:
> Wes Gamble wrote:
>> I was definitely thinking about using win32ole/Excel to convert this
>> spreadsheet to CSV and then process it with a CSV parser.
>>
>> I'm currently running on Linux though, so now I have to figure out any
>> issues around a Windows production platform.
>>
>> Thanks,
>> Wes
>>
>>

>
> Have you tried unixODBC? It seems to work fairly well, although there
> are some issues with most Linux/Unix tools when the spreadsheet has tabs
> or column headers with spaces in them.
>
> It's fairly easy to leave the Excel spreadsheets on a Windows system and
> hack together a simple "server" for them. I think all you have to do is
> "publish them to the intranet", but I've never done it. Then you can use
> something like Hpricot to parse them.


Hpricot? I thought that only parsed HTML? Can you say a little more
about this? Would the "server" you mention above parse the Excel into
some intermediate format that I would then process with something else
(like Hpricot or ???)?

Thanks,
Wes

--
Posted via http://www.ruby-forum.com/.

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Problem with Excel reports ::::Excel 2003 Migration To Excel 2007 =?Utf-8?B?c2hhc2hhbmsga3Vsa2Fybmk=?= ASP .Net 15 10-24-2007 01:34 PM
Efficient serialization of binary data within XML documents? Noozer XML 2 09-21-2007 06:38 PM
No more stuff on C:\Documents and Settings\[User]\My Documents\Visual Studio 2005\ craigkenisston@hotmail.com ASP .Net 1 10-18-2006 03:31 PM
regd efficient methods to manipulate *large* files Madhusudhanan Chandrasekaran Python 2 05-01-2006 09:19 PM



Advertisments