Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Parsing Excel spreadsheets

Reply
Thread Tools

Parsing Excel spreadsheets

 
 
Steve Holden
Guest
Posts: n/a
 
      01-02-2009
brooklineTom wrote:
> On Dec 31 2008, 9:56 am, John Machin <(E-Mail Removed)> wrote:
>> On Dec 31 2008, 4:02 pm, brooklineTom <(E-Mail Removed)> wrote:
>>
>>> (E-Mail Removed) wrote:
>>>> Hi,
>>>> Can anybody recommend an approach for loading and parsing Excel
>>>> spreadsheets in Python. Any well known/recommended libraries for this?
>>>> The only thing I found in a brief search washttp://www.lexicon.net/sjmachin/xlrd.htm,
>>>> but I'd rather get some more input before going with something I don't
>>>> know.
>>>> Thanks,
>>>> Andy.
>>> I save the spreadsheets (in Excel) in xml format.

>> Which means that you need to be on a Windows box with a licensed copy
>> of Excel. I presume you talking about using Excel 2003 and saving as
>> "XML Spreadsheet (*.xml)". Do you save the files manually, or using a
>> COM script? What is the largest xls file that you've saved as xml, how
>> big was the xml file, and how long did it take to parse the xml file?
>> Do you extract formatting information or just cell contents?

>
> 1. The OP requested Excel files, by construction those must be
> generated with a licensed copy of Excel. I did the actual processing
> on both linux and windoze platforms.


Well, even if Andy meant "Excel files" rather than "Excel-formatted
files" there are many ways to come by these without having a licensed
copy of Excel. FTP and email attachment come to mind most readily.

How then to convert those to XML without Excel?

[...]> I looked, briefly, at xlrd. I found and scanned a few alternatives,
> though I don't remember what the others were. I needed something I
> could incorporate into my own application framework, and I knew I
> didn't need most of the formatting information. I'm not in any way
> criticizing xlrd, it's simply that, based on its API summary, it seems
> focused on problems I didn't have to solve. I knew that I needed only
> a small subset of the xlrd behavior, and I concluded (perhaps
> incorrectly) that it would be easier to roll my own parser than find,
> extract, and then port (to my own framework) the corresponding parts
> of xlrd.
>

My own case was similar, in that I only needed the value data. The
approach I took was to install xlrd and use it. Job done.

> I needed to extract the content of each row, cell by cell, and build
> data objects (in my framework) with the content of various cells. I
> also needed to build an "exception file" containing malformed entries
> that I could re-open with Excel after my code finished, so that the
> bogus entries could be manually corrected. What I mean by "malformed"
> entry is, for example, an address field that fails to correctly
> geocode or comment fields with confused utf8/unicode contents. My
> focus was on data content, as opposed to presentation. I needed to
> crack the cells into things like "string", "boolean", "float", and so
> on.
>
> Most importantly, I needed to do this one entry at a time -- I did
> *not* want to load the entire spreadsheet at once.
>

My data files weren't that large (IIRC the largest spreadsheet was about
6MB), so I was quite happy to load the whole thing in memory, iterate
over it and then write the results to the database as they were extracted.

> I'm not saying that this couldn't be done with xlrd; only that I chose
> to roll my own and had minimal difficulty doing so.
>

Given the constraints of your problem it seems like an intelligent approach.

> I hope this helps!


I'm sure it will.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

 
Reply With Quote
 
 
 
 
John Machin
Guest
Posts: n/a
 
      01-03-2009
On Jan 3, 2:01*am, brooklineTom <(E-Mail Removed)> wrote:
> On Dec 31 2008, 9:56 am, John Machin <(E-Mail Removed)> wrote:


> > On Dec 31 2008, 4:02 pm, brooklineTom <(E-Mail Removed)> wrote:

>
> > > (E-Mail Removed) wrote:
> > > > Hi,

>
> > > > Can anybody recommend an approach for loading and parsing Excel
> > > > spreadsheets in Python. Any well known/recommended libraries for this?

>
> > > > The only thing I found in a brief search was http://www.lexicon.net/sjmachin/xlrd.htm,
> > > > but I'd rather get some more input before going with something I don't
> > > > know.

>
> > > > Thanks,
> > > > Andy.

>
> > > I save the spreadsheets (in Excel) in xml format.

>
> > Which means that you need to be on a Windows box with a licensed copy
> > of Excel. I presume you talking about using Excel 2003 and saving as
> > "XML Spreadsheet (*.xml)". Do you save the files manually, or using a
> > COM script? What is the largest xls file that you've saved as xml, how
> > big was the xml file, and how long did it take to parse the xml file?
> > Do you extract formatting information or just cell contents?

>
> 1. The OP requested Excel files, by construction those must be
> generated with a licensed copy of Excel. I did the actual processing
> on both linux and windoze platforms.


My point was that however the original XLS files were created or
acquired, the first step in your solution involves converting the XLS
file to "XML Spreadsheet" format, which requires a copy of Excel on a
Windows box. Many people start with an XLS file, no Excel and no
Windows box, no COM, and users who can't be relied on to open a file
and save it in the right format with the right name and extension.

BTW, did you consider opening the XLS files with OpenOffice.org's Calc
and saving it in their default ods format (chunks of XML in a zip
file)?


> 3. The largest file I used was about 228M,


Is that the XLS file or the XML file?

> containing 36,393 hotel
> properties from Commission Junction. Each entry had 113 cells. The
> parsing overhead was minimal (on a per-entry basis) -- that's why I
> choose to use a pull-parser.
> 4. I extracted primarily cell contents, though I did some very limited
> format handling (looking for non-text fields and such).


I don't understand "looking for non-text fields" as "format handling".
To my way of thinking, knowing the data-type that Excel has assigned
to a cell is close to essential for effective use of the contents. And
you don't have to look very far: the ss:Type attribute tells you
whether a cell's content is String, Number, DateTime, Boolean, or
Error.

> > > xhtml. I know there are various python packages around that do it, but
> > > I found the learning curve of those packages to be steeper than just
> > > grokking the spreadsheet structure itself.

>
> > I'm curious to know which are the "various python packages" with the
> > so steep learning curves, and what the steep bits were.

>
> I looked, briefly, at xlrd. I found and scanned a few alternatives,
> though I don't remember what the others were. I needed something I
> could incorporate into my own application framework, and I knew I
> didn't need most of the formatting information. I'm not in any way
> criticizing xlrd, it's simply that, based on its API summary, it seems
> focused on problems I didn't have to solve.


Not so focused at all. The default behaviour of xlrd is to ignore
formatting info as much as possible. AFAICT this is the mode used by
most users.

> I knew that I needed only
> a small subset of the xlrd behavior, and I concluded (perhaps
> incorrectly) that it would be easier to roll my own parser than find,
> extract, and then port (to my own framework) the corresponding parts
> of xlrd.


Possibly incorrectly. If approached at the time, I would have said:
(a) if desperate to DIY:
(a1) ignore any code for old Excel versions (self.biff_version < 80)
(a2) ignore any code for extracting formatting info
(self.formatting_info)
(a3) find the Sheet.put_cell* methods in sheet.py e.g.

def put_cell(self, rowx, colx, ctype, value, xf_index):

ignore the xf_index arg and subvert them to your own needs instead of
filling up a big rectangular arena with data

(b) if not really so desperate, talk to me about implementing an
option in xlrd where callers can specify a callback to be used instead
of the Sheet.put_cell* methods

(c) What is all this "port to my own framework" caper anyway? If you
need to extract data from a database, do you rummage in their code
libraries and port the relevant bits to your own framework?

>
> I needed to extract the content of each row, cell by cell, and build
> data objects (in my framework) with the content of various cells. I
> also needed to build an "exception file" containing malformed entries
> that I could re-open with Excel after my code finished, so that the
> bogus entries could be manually corrected. What I mean by "malformed"
> entry is, for example, an address field that fails to correctly
> geocode or comment fields with confused utf8/unicode contents.


Building objects in your own framework and checking data integrity is
something that happens *after* you've got the basics for a cell (row
id, column id, data type, data value). You have dodgy postal
addresses? You're not alone, and it's not relevant to how you parse
the spreadsheet or even whether the data source was a spreadsheet or a
database query or a box of punched cards.

> My
> focus was on data content, as opposed to presentation. I needed to
> crack the cells into things like "string", "boolean", "float", and so
> on.


"Crack"? It's not the Enigma code. It's not even rot13. Spreadsheet
XML tells you the type (String, Boolean, ...). xlrd tells you the type
(XL_CELL_TEXT, XL_CELL_BOOLEAN, ...).

>
> Most importantly, I needed to do this one entry at a time -- I did
> *not* want to load the entire spreadsheet at once.
>
> I'm not saying that this couldn't be done with xlrd; only that I chose
> to roll my own and had minimal difficulty doing so.
>
> I hope this helps!


I think so. Thanks. You didn't directly address the "steep learning
curve" question, but you explained enough of where you came from.

It's a matter of "horses for courses". I guess some people might
regard (xml.dom, xml.dom.minidom, DIY "pullparser") as having a
slightly non-horizontal learning curve

Cheers,
John
 
Reply With Quote
 
 
 
 
brooklineTom
Guest
Posts: n/a
 
      01-07-2009
On Jan 2, 7:04 pm, John Machin <(E-Mail Removed)> wrote:
> On Jan 3, 2:01 am, brooklineTom <(E-Mail Removed)> wrote:
>
> <snip>
>
> My point was that however the original XLS files were created or
> acquired, the first step in your solution involves converting the XLS
> file to "XML Spreadsheet" format, which requires a copy of Excel on a
> Windows box. Many people start with an XLS file, no Excel and no
> Windows box, no COM, and users who can't be relied on to open a file
> and save it in the right format with the right name and extension.


True enough. I develop on a WinXP box and have Excel. I just used it.

> BTW, did you consider opening the XLS files with OpenOffice.org's Calc
> and saving it in their default ods format (chunks of XML in a zip
> file)?


No. As I said, I have Excel.

> > 3. The largest file I used was about 228M,

>
> Is that the XLS file or the XML file?


That's the xml file size. I just *love* file-bloat, don't you?

>
> <snip>
>


> > I knew that I needed only
> > a small subset of the xlrd behavior, and I concluded (perhaps
> > incorrectly) that it would be easier to roll my own parser than find,
> > extract, and then port (to my own framework) the corresponding parts
> > of xlrd.

>
> Possibly incorrectly. If approached at the time, I would have said:
> (a) if desperate to DIY:
> (a1) ignore any code for old Excel versions (self.biff_version < 80)
> (a2) ignore any code for extracting formatting info
> (self.formatting_info)
> (a3) find the Sheet.put_cell* methods in sheet.py e.g.
>
> def put_cell(self, rowx, colx, ctype, value, xf_index):
>
> ignore the xf_index arg and subvert them to your own needs instead of
> filling up a big rectangular arena with data
>
> (b) if not really so desperate, talk to me about implementing an
> option in xlrd where callers can specify a callback to be used instead
> of the Sheet.put_cell* methods
>
> (c) What is all this "port to my own framework" caper anyway? If you
> need to extract data from a database, do you rummage in their code
> libraries and port the relevant bits to your own framework?


I'm sure xlrd is fine, I wasn't any way suggesting that any
"improvements" are needed. I was just explaining what I did.

>
> <snip>
>


> "Crack"? It's not the Enigma code. It's not even rot13. Spreadsheet
> XML tells you the type (String, Boolean, ...). xlrd tells you the type
> (XL_CELL_TEXT, XL_CELL_BOOLEAN, ...).


Sure, once you've sorted through the excess MS-specific stuff, dealt
with the runs of empty cells, and so on.

I needed to write an exception file, containing the original
spreadsheet entries that failed, so that our curators could find and
fix (or delete) them. Thus, I already had to do row-by-row handling of
the input, I had to catch and handle exceptions, I had to know and
remember enough about the original SS format/layout to replicate it in
the exception file, and so on.

> I think so. Thanks. You didn't directly address the "steep learning
> curve" question, but you explained enough of where you came from.
>
> It's a matter of "horses for courses". I guess some people might
> regard (xml.dom, xml.dom.minidom, DIY "pullparser") as having a
> slightly non-horizontal learning curve


*LOL*

Yes. Sadly, I had already climbed the dom/minidom/pullparser curve for
all the other stuff I have to do. In the space I work in, xml handling
is pervasive enough that I already have all that stuff "in my hands"
anyway.

I wonder if the OP ever read any of this....

Thx,
Tom
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Excel Spreadsheets =?Utf-8?B?Zml0emJlc3Q=?= Microsoft Certification 2 06-14-2006 05:43 PM
ANNOUNCE: xlrd 0.5.2 -- extract data from Excel spreadsheets John Machin Python 15 03-30-2006 04:22 PM
Excel spreadsheets home fuel calculator Chris Computer Support 2 09-15-2005 10:17 PM
Programmatically creating Excel spreadsheets in ASP.NET Raoul Snyman ASP .Net 14 04-29-2004 06:45 PM



Advertisments