Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Fastest way to store ints and floats on disk


Fastest way to store ints and floats on disk

 
 
Laszlo Nagy
      08-07-2008

Hi,

I'm working on a pivot table. I would like to write it in Python. I
know I should be doing this in C, but I would like to create a
cross-platform version which can deal with smaller databases (not more
than a million facts).

The data is first imported from a CSV file: the user selects which
columns contain dimension and measure data (and which columns to
ignore). In the next step I would like to build up a database that is
efficient enough to be used for making pivot tables. Here is my idea for
the database:

Original CSV file with column header and values:

"Color","Year","Make","Price","VMax"
Yellow,2000,Ferrari,100000,254
Blue,2003,Volvo,50000,210

Using the GUI, it is converted to this:

dimensions = [
    { 'name':'Color', 'colindex':0,
      'values':[ 'Red', 'Blue', 'Green', 'Yellow' ] },
    { 'name':'Year', 'colindex':1,
      'values':[ 1995, 1999, 2000, 2001, 2002, 2003, 2007 ] },
    { 'name':'Make', 'colindex':2,
      'values':[ 'Ferrari', 'Volvo', 'Ford', 'Lamborghini' ] },
]
measures = [
    { 'name':'Price', 'colindex':3 },
    { 'name':'Vmax', 'colindex':4 },
]
facts = [
    ( (3,2,0), (100000.0,254.0) ),  # ( dimension_value_indexes, measure_values )
    ( (1,5,1), (50000.0,210.0) ),
    # ... some million rows or fewer
]
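The conversion step could be sketched like this (encode_row is my name for it, and I'm assuming the CSV cells have already been parsed to the right types):

```python
def encode_row(row, dimensions, measures):
    # Map each dimension cell to the index of its value in that
    # dimension's value list, and each measure cell to a float.
    dim_indexes = tuple(d['values'].index(row[d['colindex']])
                        for d in dimensions)
    measure_values = tuple(float(row[m['colindex']])
                           for m in measures)
    return dim_indexes, measure_values

dimensions = [
    {'name': 'Color', 'colindex': 0,
     'values': ['Red', 'Blue', 'Green', 'Yellow']},
    {'name': 'Year', 'colindex': 1,
     'values': [1995, 1999, 2000, 2001, 2002, 2003, 2007]},
    {'name': 'Make', 'colindex': 2,
     'values': ['Ferrari', 'Volvo', 'Ford', 'Lamborghini']},
]
measures = [{'name': 'Price', 'colindex': 3},
            {'name': 'Vmax', 'colindex': 4}]

fact = encode_row(['Yellow', 2000, 'Ferrari', 100000, 254],
                  dimensions, measures)
# fact == ((3, 2, 0), (100000.0, 254.0))
```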


The core of the idea is that, when using a relatively small number of
possible values for each dimension, the facts table becomes
significantly smaller and easier to process. (Processing the facts would
be: iterate over facts, filter out some of them, create statistical
values of the measures, grouped by dimensions.)
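The filter/group/sum step might look like this (with a third, made-up fact added so the grouping has something to group):

```python
from collections import defaultdict

facts = [((3, 2, 0), (100000.0, 254.0)),
         ((1, 5, 1), (50000.0, 210.0)),
         ((3, 5, 1), (60000.0, 220.0))]   # extra fact for illustration

# Sum the first measure (Price), grouped by the first dimension
# (Color), keeping only facts whose Make index is 1 ('Volvo').
sums = defaultdict(float)
counts = defaultdict(int)
for dims, meas in facts:
    if dims[2] != 1:        # filter on the Make dimension
        continue
    sums[dims[0]] += meas[0]
    counts[dims[0]] += 1
# sums == {1: 50000.0, 3: 60000.0}; counts == {1: 1, 3: 1}
```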

The facts table cannot be kept in memory because it is too big. I need
to store it on disk, read it incrementally, and compute statistics. In
most cases, the "statistic" will be a simple sum of the measures and a
count of the facts affected. To be efficient, reading the facts from
disk should not involve complex conversions, so storing them in CSV,
XML, or any other textual format would be bad. I'm thinking about a
binary format, but how can I interface that with Python?

I already looked at:

- xdrlib, which throws a DeprecationWarning when I store some integers
- struct, which uses a format string for each read operation; I'm
concerned about its speed
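For what it's worth, the per-read format-string parsing can be avoided by compiling the format once with struct.Struct (the '<3H2d' layout below is my assumption: three unsigned-short dimension indexes plus two doubles, matching the example above):

```python
import struct

# One fact record: 3 dimension indexes + 2 measures, little-endian.
FACT = struct.Struct('<3H2d')   # 3*2 + 2*8 = 22 bytes per fact

def pack_fact(dim_indexes, measure_values):
    return FACT.pack(*(tuple(dim_indexes) + tuple(measure_values)))

def unpack_fact(buf):
    fields = FACT.unpack(buf)
    return fields[:3], fields[3:]

buf = pack_fact((3, 2, 0), (100000.0, 254.0))
# unpack_fact(buf) == ((3, 2, 0), (100000.0, 254.0))
```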

What else can I use?

Thanks,

Laszlo



 
castironpi
      08-07-2008
On Aug 7, 1:41 pm, Laszlo Nagy <(E-Mail Removed)> wrote:
> I'm thinking about a binary format, but how can I interface that
> with Python?
>
> What else can I use?


Take a look at the mmap module. You get direct memory access, backed
by the file system. struct + mmap, if you keep your strings small?
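A minimal sketch of struct + mmap along those lines (the '<3H2d' record layout is my assumption: three dimension indexes and two float measures, as in the first post):

```python
import mmap
import os
import struct
import tempfile

FACT = struct.Struct('<3H2d')   # assumed: 3 dim indexes + 2 doubles

facts = [((3, 2, 0), (100000.0, 254.0)),
         ((1, 5, 1), (50000.0, 210.0))]

# Write the facts once as fixed-size binary records.
tmp = tempfile.NamedTemporaryFile(delete=False)
path = tmp.name
tmp.close()
with open(path, 'wb') as f:
    for dims, meas in facts:
        f.write(FACT.pack(*(dims + meas)))

# Scan the file through mmap; unpack_from reads at an offset
# without copying out a slice first.
total_price = 0.0
count = 0
with open(path, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for off in range(0, len(mm), FACT.size):
        rec = FACT.unpack_from(mm, off)
        total_price += rec[3]   # Price is the first measure
        count += 1
    mm.close()
os.remove(path)
# total_price == 150000.0, count == 2
```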
 
Matthew Woodcraft
      08-09-2008
Laszlo Nagy <(E-Mail Removed)> writes:

> I'm thinking about a binary format, but how can I interface that with
> Python?
>
> What else can I use?


pytables (<http://www.pytables.org/>) looks like the right kind of
thing.

-M-
 