Need design advice. What's my best approach for storing this data?

 
 
Mudcat
03-17-2006
Hi,

I am trying to build a tool that analyzes stock data, so I am going to
download and store quite a vast amount of it. For a rough number:
assuming about 7000 listed stocks on the two major markets plus some
extras, and 255 trading days a year for 20 years, that is about 36
million entries.

Obviously a database is a logical choice for that. However I've never
used one, nor do I know what benefits I would get from using one. I am
worried about speed, memory usage, and disk space.

My initial thought was to put the data in large dictionaries and shelve
them (and possibly zipping them to save storage space until the data is
needed). However, these are huge files. Based on ones that I have
already done, I estimate at least 5 GB for storage this way. My
structure for these files was a three-layer dictionary:
[Market][Stock][Date] -> (data list). That allows me to easily access
any data for any date or stock in a particular market. Therefore I
wasn't really concerned about the organizational aspects of a db since
this would serve me fine.
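
Roughly, the idea is something like this (a minimal sketch; the file
name and the per-day (high, low, close) tuple are made up for
illustration):

    import shelve

    # One shelf file; each top-level key is a market whose value is the
    # nested {stock: {date: data}} dictionary.
    db = shelve.open("quotes.shelf")

    nyse = db.get("NYSE", {})
    nyse.setdefault("IBM", {})["2006-03-17"] = (82.40, 81.95, 82.10)
    db["NYSE"] = nyse  # reassign so shelve re-pickles the updated dict

    print(db["NYSE"]["IBM"]["2006-03-17"])
    db.close()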

But before I put this all together I wanted to ask around to see if
this is a good approach. Will a database be faster than a structured
dictionary? And will I get a lot of overhead if I go with a database?
I'm hoping people who have dealt with data this large can give me a
little advice.

Thanks ahead of time,
Marc

 
J Correia
03-17-2006

"Mudcat" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) ups.com...
> Hi,
>
> I am trying to build a tool that analyzes stock data. Therefore I am
> going to download and store quite a vast amount of it. Just for a
> general number - assuming there are about 7000 listed stocks on the two
> major markets plus some extras, 255 tradying days a year for 20 years,
> that is about 36 million entries.
>


On a different tack, to avoid thinking about any db issues, consider
subscribing to TC2000 (tc2000.com)... they already have all that data
in a database which takes about 900 MB when fully installed.
They also have an API which allows you full access to the database
(including from Python via COM). The API is pretty robust and allows
you to do pre-filtering (e.g. give me the last 20 years of all stocks
over $50 with average daily volume > 100k) at the db level, meaning
you can focus on using Python for analysis. The database is also
updated daily.

If you don't need daily updates, then subscribe (first 30 days free)
and cancel, and you've got a snapshot db of all the data you need.

They also used to send out an evaluation CD which had all the history
data barring the last 3 months or so, which is certainly good enough
for analysis and testing. Not sure if they still do that.

HTH.


 
Dennis Lee Bieber
03-17-2006
On 17 Mar 2006 09:08:03 -0800, "Mudcat" declaimed the following in
comp.lang.python:

> structure for these files was a three-layer dictionary:
> [Market][Stock][Date] -> (data list). That allows me to easily access
> any data for any date or stock in a particular market. Therefore I
> wasn't


Of course, you'll have to know which market the stock is in first,
or do a test on each. And what happens if a stock changes markets (it
happens)...

>
> But before I put this all together I wanted to ask around to see if
> this is a good approach. Will a database be faster than a structured
> dictionary? And will I get a lot of overhead if I go with a database?
> I'm hoping people who have dealt with data this large can give me a
> little advice.
>

Well, an RDBM won't require you to load the whole dataset just to
add a record <G>. And you are looking at something that has small
updates, but possibly large retrievals. Do you really want the
application to load "all" that data just to add one record and then
save it all back out? (What happens if the power fails halfway through
the save? Are you writing to a different file, then delete/rename?) Do
you have enough memory to support the data set as one chunk? (Did you
mention 5GB?)
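
The write-to-a-different-file-then-rename dance looks roughly like this
(a sketch; the pickled payload is just a stand-in for your data):

    import os
    import pickle
    import tempfile

    def safe_save(data, path):
        # Write to a temporary file in the same directory, then rename
        # over the old copy, so a power failure mid-write never leaves
        # a half-written data file behind.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic on POSIX; also works on Windows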


I don't know what your data list is (high, low, close?) but there
are a number of choices available...

From a simple flat table with indices on Market, Stock, and Date
[Market, Stock, Date -> high, low, close]

(The notation is [unique/composite key -> dependent data].)

Or dual tables:
[Market, Stock] [Stock, Date -> high, low, close]

(Avoids duplicating the Market field for every record, though loses any
history of when a stock changes markets; note the first table doesn't
have any dependent data)

Or, if using an RDBM where each table is a separate file [Visual FoxPro
<yuck>, MySQL MyISAM], you could even do:
[Market, Stock] [Date -> high, low, close]stock1
[Date -> high, low, close]stock2
...
[Date -> high, low, close]stock/n/

where each [Date -> high, low, close] is a separate table named after
the stock (each such table is identical in layout). More complexity
when working with multiple stocks (the worst would be a report of stock
names and spreads sorted by size of spread for a single day -- you'd
have to create a temporary table of [Stock -> high, low, close] to
produce the report).
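
In sqlite3 terms (SQLite picked just for a concrete sketch; the column
names are my own assumptions), the flat-table option looks roughly
like:

    import sqlite3

    con = sqlite3.connect("quotes.sqlite")
    con.execute("""
        CREATE TABLE IF NOT EXISTS quotes (
            market TEXT NOT NULL,
            stock  TEXT NOT NULL,
            date   TEXT NOT NULL,   -- ISO yyyy-mm-dd sorts correctly
            high   REAL, low REAL, close REAL,
            PRIMARY KEY (market, stock, date)
        )""")
    # Secondary index for date-first queries (e.g. whole-market reports).
    con.execute("CREATE INDEX IF NOT EXISTS idx_date ON quotes (date, stock)")

    con.execute("INSERT OR REPLACE INTO quotes VALUES (?, ?, ?, ?, ?, ?)",
                ("NYSE", "IBM", "2006-03-17", 82.40, 81.95, 82.10))
    con.commit()

    # The single-day spread report from above, with no temp table needed.
    for stock, spread in con.execute(
            "SELECT stock, high - low AS spread FROM quotes "
            "WHERE date = ? ORDER BY spread DESC", ("2006-03-17",)):
        print(stock, spread)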
--
Wulfraed Dennis Lee Bieber KD6MOG
Bestiaria Support Staff
Home Page: <http://www.dm.net/~wulfraed/>
Overflow Page: <http://wlfraed.home.netcom.com/>

 
Rene Pijlman
03-17-2006
Mudcat:
>My initial thought was to put the data in large dictionaries and shelve
>them (and possibly zipping them to save storage space until the data is
>needed). However, these are huge files.


ZODB solves that problem for you.
http://www.zope.org/Wikis/ZODB/FrontPage

More in particular "5.3 BTrees Package":
http://www.zope.org/Wikis/ZODB/guide...00000000000000

But I've only used ZODB for small databases compared to yours. It's
supposed to scale very well, but I can't speak from experience.
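
For reference, a minimal BTrees sketch of the same
[Market][Stock][Date] idea (untested at your scale, per the caveat
above; the file name and data are illustrative):

    import transaction
    from BTrees.OOBTree import OOBTree
    from ZODB.DB import DB
    from ZODB.FileStorage import FileStorage

    db = DB(FileStorage("quotes.fs"))
    conn = db.open()
    root = conn.root()

    # One OOBTree per level mirrors the nested-dict layout, but ZODB
    # only loads the buckets a lookup actually touches.
    markets = root.setdefault("markets", OOBTree())
    nyse = markets.setdefault("NYSE", OOBTree())
    ibm = nyse.setdefault("IBM", OOBTree())
    ibm["2006-03-17"] = (82.40, 81.95, 82.10)

    transaction.commit()
    db.close()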

--
René Pijlman
 
Mudcat
03-17-2006
>On a different tack, to avoid thinking about any db issues, consider
>subscribing to TC2000 (tc2000.com)... they already have all that data
>in a database which takes about 900 MB when fully installed.


That is an interesting option also. I had actually looked for
ready-made databases and didn't come across this one. Although, I don't
understand how they can fit all that info into 900 MB.

I like this option, but I guess if I decide to keep using this database
then I need to keep up my subscription. The thing I liked about
downloading everything from Yahoo was that I didn't have to pay anyone
for the data.

Does anyone know the best way to compress this data? Or do any of these
databases handle compression automatically? 5 GB will be hard for any
computer to deal with, even in a database.
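
The kind of thing I had in mind with the zipping idea is roughly this
(just a sketch, with made-up data):

    import pickle
    import zlib

    def pack(obj):
        # Pickle, then deflate.  Compressing a whole stock's history at
        # once pays off far more than compressing single-day tuples.
        return zlib.compress(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))

    def unpack(blob):
        return pickle.loads(zlib.decompress(blob))

    history = {"2006-03-17": (82.40, 81.95, 82.10)}
    assert unpack(pack(history)) == history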

 
Mudcat
03-19-2006
In doing a little research I ran across PyTables, which according to
the documentation does this: "PyTables is a hierarchical database
package designed to efficiently manage very large amounts of data." It
also deals with compression and various other handy things. Zope also
seems to be designed to handle large amounts of data with compression
in mind.

Does anyone know which of these two apps would better fit my purpose? I
don't know if either of these has limitations that might not work out
well for what I'm trying to do. I really need to try and compress the
data as much as possible without making the access times really slow.

Thanks

 
Robert Kern
03-19-2006
Mudcat wrote:
> In doing a little research I ran across PyTables, which according to
> the documentation does this: "PyTables is a hierarchical database
> package designed to efficiently manage very large amounts of data." It
> also deals with compression and various other handy things. Zope also
> seems to be designed to handle large amounts of data with compression
> in mind.
>
> Does anyone know which of these two apps would better fit my purpose? I
> don't know if either of these has limitations that might not work out
> well for what I'm trying to do. I really need to try and compress the
> data as much as possible without making the access times really slow.


PyTables is exactly suited to storing large amounts of numerical data
arranged in tables and arrays. The ZODB is not.
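
For what it's worth, a minimal sketch of a compressed quotes table (the
column layout and file name are assumptions, not something from this
thread):

    import tables

    class Quote(tables.IsDescription):
        market = tables.StringCol(8)
        stock = tables.StringCol(8)
        date = tables.StringCol(10)  # ISO yyyy-mm-dd
        high = tables.Float64Col()
        low = tables.Float64Col()
        close = tables.Float64Col()

    # complevel/complib turn on transparent zlib compression.
    h5 = tables.open_file("quotes.h5", mode="w",
                          filters=tables.Filters(complevel=5,
                                                 complib="zlib"))
    table = h5.create_table("/", "quotes", Quote)

    row = table.row
    row["market"], row["stock"] = b"NYSE", b"IBM"
    row["date"] = b"2006-03-17"
    row["high"], row["low"], row["close"] = 82.40, 81.95, 82.10
    row.append()
    table.flush()

    # In-kernel query: the filtering runs inside PyTables, not in Python.
    print([r["close"] for r in table.where('stock == b"IBM"')])
    h5.close()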

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

 
Magnus Lycka
03-20-2006
Mudcat wrote:
> I am trying to build a tool that analyzes stock data, so I am going to
> download and store quite a vast amount of it. For a rough number:
> assuming about 7000 listed stocks on the two major markets plus some
> extras, and 255 trading days a year for 20 years, that is about 36
> million entries.
>
> Obviously a database is a logical choice for that. However I've never
> used one, nor do I know what benefits I would get from using one. I am
> worried about speed, memory usage, and disk space.


This is a typical use case for relational database systems.
With something like DB2 or Oracle here, you can take advantage
of more than 20 years of work by lots of developers trying to
solve the kind of problems you will run into.

You haven't really stated all the facts needed to decide what
product to choose, though. Will this be a multi-user application?
Do you foresee a client/server application? What operating
system(s) do you need to support?

With relational databases, it's plausible to move some of
the hard work in the data analysis into the server. Using
this well means that you need to learn a bit about how
relational databases work, but I think it's worth the trouble.
It could mean that much less data ever needs to reach your
Python program for processing, and that will mean a lot for
your performance. Relational databases are very good at
searching, sorting and simple aggregations of data. SQL is
a declarative language, and in principle your SQL code
will just declare the correct queries and manipulations that
you want to achieve; tuning will be a separate activity
which doesn't need to involve program changes. In reality,
there are certainly cases where changes in SQL code will
influence performance, but to a very large extent you can
achieve good performance by building indices and by
letting the database gather statistics and analyze the
queries your programs contain. As a bonus, you also get
advanced systems for security, transactional safety,
online backup, replication etc.

You don't get these advantages with any other data storage
systems.
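
To illustrate pushing work into the server: an aggregate query like
this (sketched against the flat quotes table discussed earlier in the
thread; any DB-API module looks much the same) returns one row per
stock instead of shipping 20 years of prices to Python:

    import sqlite3  # stand-in for your DB-API module of choice

    con = sqlite3.connect("quotes.sqlite")
    # Average close and trading-day count per stock for 2005, computed
    # entirely inside the database engine.
    query = """
        SELECT stock, COUNT(*) AS days, AVG(close) AS avg_close
        FROM quotes
        WHERE market = ? AND date BETWEEN '2005-01-01' AND '2005-12-31'
        GROUP BY stock
        ORDER BY avg_close DESC
    """
    for stock, days, avg_close in con.execute(query, ("NYSE",)):
        print(stock, days, avg_close)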

I'd get Chris Fehily's "SQL Visual Quickstart Guide", which
is as good as his Python book. Which database to use depends
a bit on the platform you work with. I'd avoid MySQL. Some
friends of mine have used it for needs similar to yours, and
they are now running into its severe shortcomings. (I did
warn them.)

For Windows, I think the single-user version of SQL Server
(MSDE?) is gratis. For both Windows and Linux/Unix, there are
(I think) gratis versions of Oracle 10g, IBM DB2 UDB and
Mimer SQL. Mimer SQL is easy to install, Oracle is a pain, and
I think DB2 is somewhere in between. PostgreSQL is also a good
option.

Either way, it certainly seems natural to learn relational
databases and SQL if you want to work with financial software.
 
Dennis Lee Bieber
03-20-2006
On Mon, 20 Mar 2006 11:00:21 +0100, Magnus Lycka declaimed the
following in comp.lang.python:

> For Windows, I think the single-user version of SQL Server
> (MSDE?) is gratis. For both Windows and Linux/Unix, there are


MSDE's availability seems to vary from month to month <G>.

The version I had (came with VB6 Pro) was restricted to a maximum
database of 2GB (about the same size limit as a JET MDB, but using the
SQL Server core), and throttled at something like five simultaneous
queries.

> (I think) gratis versions of Oracle 10g, IBM DB2 UDB and
> Mimer SQL. Mimer SQL is easy to install, Oracle is a pain, and
> I think DB2 is somewhere in between. PostgreSQL is also a good
> option.
>

Depending upon requirements, there is also Firebird (a spawn of
Interbase), MaxDB (MySQL's release of the SAP DB), and (while some abhor
it) even MySQL might be worthy... (at least, until Oracle decides MySQL
can no longer license either the Sleepycat BDB or the Inno Oy InnoDB
backends).

Ingres may also be viable.
http://www.ingres.com/products/Prod_...ad_Portal.html

 
Dennis Lee Bieber
03-21-2006
On Mon, 20 Mar 2006 17:52:19 GMT, Dennis Lee Bieber declaimed the
following in comp.lang.python:


> Ingres may also be viable.
> http://www.ingres.com/products/Prod_...ad_Portal.html


Looking deeper, looks like Linux only...

 