Velocity Reviews - Computer Hardware Reviews

Databases and python

 
 
Dan Stromberg
02-16-2006

I've been putting a little bit of time into a file indexing engine in
python, which you can find here:
http://dcs.nac.uci.edu/~strombrg/pyindex.html

It'll do 40,000 mail messages of varying lengths pretty well now, but I
want more.

So far, I've been taking the approach of using a single-table database
like gdbm or dbhash (actually a small number of them, to map filenames to
numbers, numbers to filenames, words to numbers, numbers to words, and
numbered words to numbered filenames), making each entry keyed by a word,
with the word's value in the database being a null-terminated list of
filenames (in http://dcs.nac.uci.edu/~strombrg/base255.html representation).
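
The layout Dan describes can be sketched like this (a hypothetical, modern-Python illustration, with a plain dict standing in for the gdbm/dbhash files; none of these names come from pyindex itself):

```python
# Sketch of the word -> packed-filename-numbers layout, with a dict
# standing in for the on-disk table. The read-modify-write in
# add_occurrence is the O(n)-per-append step the post complains about.

index = {}      # word -> delimiter-joined string of filename numbers
filenames = {}  # filename number -> filename
file_ids = {}   # filename -> filename number

def add_occurrence(word, fname):
    fid = file_ids.setdefault(fname, len(file_ids))
    filenames[fid] = fname
    # Unpack the whole list, append one id, repack the whole list.
    packed = index.get(word, '')
    ids = packed.split('\0') if packed else []
    ids.append(str(fid))  # no dedup, to keep the sketch short
    index[word] = '\0'.join(ids)

def files_containing(word):
    packed = index.get(word, '')
    return [filenames[int(i)] for i in packed.split('\0')] if packed else []
```

Each append re-serializes the word's entire posting list, so indexing a common word costs time proportional to how many files already contain it.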

However, despite using the http://dcs.nac.uci.edu/~strombrg/cachedb.html
module to speed up database access, bringing in psyco, and studying
various Python optimization pages, the program just isn't performing the
way I'd like it to.

And I think it's because, despite the caching and minimal representation
conversion, it's still just too slow converting between linear lists and
arrays, back and forth, over and over.

So this leads me to wonder - is there a python database interface that
would allow me to define a -lot- of tables? Like, each word becomes a
table, and then the fields in that table are just the filenames that
contained that word. That way adding filenames to a word shouldn't bog
down much at all.

-But-, are there any database interfaces for python that aren't going to
get a bit upset if you try to give them hundreds of thousands of tables?

Thanks!

 
Jonathan Gardner
02-16-2006
I'm no expert in BDBs, but I have spent a fair amount of time working
with PostgreSQL and Oracle. It sounds like you need to put some
optimization into your algorithm and data representation.

I would do pretty much like you are doing, except I would only have the
following relations:

- word to word ID
- filename to filename ID
- word ID to filename ID

You're going to want an index on pretty much every column in this
database. That's because you're going to lookup by any one of these
columns for the corresponding value.

I said I wasn't an expert in BDBs. But I do have some experience
building up large databases. In the first stage, you just accumulate
the data. Then you build the indexes only as you need them. Let's say
you are scanning your files. You won't need an index on the
filename-to-ID table. That's because you are just putting data in
there. The word-to-ID table needs an index on the word, but not ID
(you're not looking up by ID yet.) And the word ID-to-filename ID table
doesn't need any indexes yet either. So build up the data without the
indexes. Once your scan is complete, then build up the indexes you'll
need for regular operation. You can probably incrementally add data as
you go.

As far as filename ID and word IDs go, just use a counter to generate
the next number. If you use base255 as the number, you're really not
going to save much space.

And your idea of hundreds of thousands of tables? Very bad. Don't do
it.

 
Rene Pijlman
02-16-2006
Dan Stromberg:
>is there a python database interface that would allow me to define a
>-lot- of tables? Like, each word becomes a table, and then the fields
>in that table are just the filenames that contained that word.


Give ZODB a try.
http://www.zope.org/Wikis/ZODB/FrontPage
http://www.python.org/workshops/2000...ton/zodb3.html

My first attempt would be: a BTree with the word as key, and a 'list of
filenames' as value.
http://www.zope.org/Wikis/ZODB/Front...00000000000000

--
René Pijlman
 
bruno at modulix
02-16-2006
Jonathan Gardner wrote:
> I'm no expert in BDBs, but I have spent a fair amount of time working
> with PostgreSQL and Oracle. It sounds like you need to put some
> optimization into your algorithm and data representation.
>
> I would do pretty much like you are doing, except I would only have the
> following relations:
>
> - word to word ID
> - filename to filename ID
> - word ID to filename ID
>
> You're going to want an index on pretty much every column in this
> database.


Stop!

I'm not a DB expert either, but putting indexes everywhere is a
well-known DB antipattern. An index is only useful if the indexed field
is discriminant enough (i.e. as few records as possible should share the
same value for that field). Otherwise, an indexed lookup may end up
taking far more time than a simple linear scan. Also, indexes slow down
write operations.

> That's because you're going to lookup by any one of these
> columns for the corresponding value.
>
> I said I wasn't an expert in BDBs. But I do have some experience
> building up large databases. In the first stage, you just accumulate
> the data. Then you build the indexes only as you need them.


Yes. And only where it makes sense.

(snip)

> And your idea of hundreds of thousands of tables? Very bad. Don't do
> it.


+100 on this


--
bruno desthuilliers
python -c "print '@'.join(['.'.join([w[::-1] for w in p.split('.')]) for
p in '(E-Mail Removed)'.split('@')])"
 
Bryan Olson
02-16-2006
Dan Stromberg wrote:
> I've been putting a little bit of time into a file indexing engine

[...]

> So far, I've been taking the approach of using a single-table database
> like gdbm or dbhash [...] and making each entry keyed by
> a word, and under the word in the database is a null terminated list of
> filenames (in http://dcs.nac.uci.edu/~strombrg/base255.html representation).
>

[...]
> the program just isn't performing like I'd like it to.
>
> And I think it's because despite the caching and minimal representation
> conversion, it's still just too slow converting linear lists to arrays
> back and forth and back and forth.


I think I follow. The straightforward method of building the
list of files associated with a word is order n**2 time. On each
new entry, you look up the entire string, append one file-id to
it, and write the new version back.

> So this leads me to wonder - is there a python database interface that
> would allow me to define a -lot- of tables? Like, each word becomes a
> table, and then the fields in that table are just the filenames that
> contained that word. That way adding filenames to a word shouldn't bog
> down much at all.


Well, you could use simple files instead of fancy database tables.

Below is a demo of an alternate technique that uses bsddb B-Trees,
and puts both the word and the file-id in the key. I don't know
how efficient it is for real data, but at least the time won't grow
as Theta(n**2).

--Bryan



import bsddb
import urllib


def add_words_from_file(index, fname, word_iterator):
    """ Pass the open-for-write bsddb B-Tree, a filename, and a list
    (or any iterable) of the words in the file.
    """
    fname = urllib.quote_plus(fname)
    s = set()
    for word in word_iterator:
        if word not in s:
            s.add(word)  # note: s.update(word) would add each character
            key = '%s %s' % (urllib.quote_plus(word), fname)
            index[key] = ''
    index.sync()


def lookup(index, word):
    """ Pass the B-Tree (as built with add_words_from_file) and a
    word to look up. Returns list of files containing the word.
    """
    word = urllib.quote_plus(word)
    fname_list = []
    try:
        current = index.set_location(word)
        while True:
            (key, _) = current
            (w, fname) = key.split()
            if w != word:
                break
            fname_list.append(urllib.unquote_plus(fname))
            current = index.next()
    except KeyError:
        pass
    return fname_list


def test():
    index = bsddb.btopen('junktest.bdb', 'n')
    data = [
        ('bryfile.txt', 'nor heed the rumble of a distant drum'),
        ('junkfile.txt', 'this is the beast, the beast so sly'),
        ('word file.txt', 'is this the way it always is here in Baltimore')
    ]
    for (fname, text) in data:
        words = text.split()
        add_words_from_file(index, fname, words)

    for word in ['is', 'the', 'heed', 'this', 'way']:
        print '"%s" is in files: %s' % (word, lookup(index, word))

test()
 
Dan Stromberg
02-17-2006
On Thu, 16 Feb 2006 13:45:28 +0000, Bryan Olson wrote:

> Dan Stromberg wrote:
>> I've been putting a little bit of time into a file indexing engine

> [...]
>
>> So far, I've been taking the approach of using a single-table database
>> like gdbm or dbhash [...] and making each entry keyed by
>> a word, and under the word in the database is a null terminated list of
>> filenames (in http://dcs.nac.uci.edu/~strombrg/base255.html representation).
>>

> [...]
>> the program just isn't performing like I'd like it to.
>
>> And I think it's because despite the caching and minimal representation
>> conversion, it's still just too slow converting linear lists to arrays
>> back and forth and back and forth.

>
> I think I follow. The straightforward method of building the
> list of files associated with a word is order n**2 time. On each
> new entry, you look up the entire string, append one file-id to
> it, and write the new version back.


Yes, essentially, with the twist that the most recently used words are
kept cached in memory, so they don't have to be converted from string to
list and back to string every time a filename is added to a word.

>> So this leads me to wonder - is there a python database interface that
>> would allow me to define a -lot- of tables? Like, each word becomes a
>> table, and then the fields in that table are just the filenames that
>> contained that word. That way adding filenames to a word shouldn't bog
>> down much at all.

>
> Well, you could use simple files instead of fancy database tables.


That's an interesting thought. Perhaps especially if a word like
australopithecine were saved under a path like:

~/indices/au/st/ra/lo/pi/th/ec/in/e
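
That sharding idea can be sketched as follows (a hypothetical helper, assuming the word is already safe to use as path components):

```python
import os

def shard_path(root, word):
    # Split the word into two-character directory components so that
    # no single directory ends up holding an enormous number of entries.
    parts = [word[i:i + 2] for i in range(0, len(word), 2)]
    return os.path.join(root, *parts)
```

For example, `shard_path('~/indices', 'australopithecine')` yields the `au/st/ra/lo/pi/th/ec/in/e` layout shown above.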

> Below is a demo of an alternate technique that uses bsddb B-Trees,
> and puts both the word and the file-id in the key. I don't know
> how efficient it is for real data, but at least the time won't grow
> as Theta(n**2).


Perhaps I'm missing something, but is it not roughly O(1) for
individual insertions, but O(n*m) (n == number of files, m == number of
words) for lookups?

> --Bryan
>
> (code snipped)


 
Dan Stromberg
02-17-2006
On Thu, 16 Feb 2006 10:09:42 +0100, Rene Pijlman wrote:

> Dan Stromberg:
>>is there a python database interface that would allow me to define a
>>-lot- of tables? Like, each word becomes a table, and then the fields
>>in that table are just the filenames that contained that word.

>
> Give ZODB a try.
> http://www.zope.org/Wikis/ZODB/FrontPage
> http://www.python.org/workshops/2000...ton/zodb3.html


Aside from object persistence, what would this add for me?

Is each object saved in a separate file, or are they bunched together into
one or more files?

> My first attempt would be: a BTree with the word as key, and a 'list of
> filenames' as value.
> http://www.zope.org/Wikis/ZODB/Front...00000000000000


This is basically what I'm doing now, except the most recently used "lists
of filenames" are kept in RAM to avoid going from "on disk string" to "in
memory list" to "on disk string" quite as much.

 
Dan Stromberg
02-17-2006
On Wed, 15 Feb 2006 23:37:31 -0800, Jonathan Gardner wrote:

> I'm no expert in BDBs, but I have spent a fair amount of time working
> with PostgreSQL and Oracle. It sounds like you need to put some
> optimization into your algorithm and data representation.
>
> I would do pretty much like you are doing, except I would only have the
> following relations:
>
> - word to word ID
> - filename to filename ID
> - word ID to filename ID


I think I could ditch the "word ID to word" table, but I think I'll
probably eventually need the "filename ID to filename" table, so when I
pull a list of filename ID's by word I can get back the filenames.

> You're going to want an index on pretty much every column in this
> database. That's because you're going to lookup by any one of these
> columns for the corresponding value.


I do need some pretty heavy indexing.

In fact, so far everything seems to be clicking along pretty well, except
the part that I'm having trouble indexing - the per-word list of
filenames. And that's because I already have one relation (word
number -> filenames' numbers) that's complicating getting another relation
for the list of filenames (just filename ID to '', but it'd be soooo much
faster if I could nest the relations somehow).

> I said I wasn't an expert in BDBs. But I do have some experience
> building up large databases. In the first stage, you just accumulate the
> data. Then you build the indexes only as you need them. Let's say you
> are scanning your files. You won't need an index on the filename-to-ID
> table. That's because you are just putting data in there. The word-to-ID
> table needs an index on the word, but not ID (you're not looking up by
> ID yet.) And the word ID-to-filename ID table doesn't need any indexes
> yet either. So build up the data without the indexes. Once your scan is
> complete, then build up the indexes you'll need for regular operation.
> You can probably incrementally add data as you go.


Interesting thought! I think I need to ponder this a bit more.

I think the filename <-> filename number tables are pretty inexpensive,
and the word <-> word number tables seem pretty inexpensive too. I think
it's primarily the word number to filename numbers table that's messing
up the performance really bad, and that's probably because the list of
filename numbers is growing so large in some cases, and adding a filename
to a word tends to be O(n), n==the number of filenames already on the word.

May I ask: In what representation would you keep this word number to
filename numbers relation stashed away until later, to build your indexes
from more rapidly in a later pass? Are you thinking to keep them all in
memory, or store them in a flat file somehow?

> As far as filename ID and word IDs go, just use a counter to generate
> the next number. If you use base255 as the number, you're really not
> going to save much space.


That's basically what I'm doing - incrementing a counter for each word,
except I'm converting that counter to base255. The base255 representation
of that counter brings the following benefits:

1) Converting a set of numbers to a string, without losing number
boundaries, becomes just mapping them to base255, and
string.joinfields'ing them.

2) Converting a string to a set of numbers, again without losing number
boundaries, is just string.splitfields'ing them, and mapping them from
base255 back to a list or set of numbers.

3) The numbers can be arbitrarily large, unlike with, say, a 4- or
8-byte fixed-length record representation.
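
A minimal sketch of such a scheme (my reading of the base-255 idea, not necessarily Dan's actual base255 module): write each number in base 255 using digit bytes 1..255, leaving the zero byte free as the field separator:

```python
def to_base255(n):
    # Digit byte b encodes the value b - 1, so the 0 byte never
    # appears inside an encoded number and can delimit fields.
    if n == 0:
        return b'\x01'
    out = bytearray()
    while n:
        out.append(n % 255 + 1)
        n //= 255
    return bytes(reversed(out))

def from_base255(s):
    n = 0
    for b in s:
        n = n * 255 + (b - 1)
    return n

def pack(nums):
    # Point 1 above: join encoded numbers with the reserved separator.
    return b'\x00'.join(to_base255(n) for n in nums)

def unpack(s):
    # Point 2 above: split on the separator and decode each field.
    return [from_base255(chunk) for chunk in s.split(b'\x00')] if s else []
```

Since no digit byte is ever zero, number boundaries survive the round trip, and encoded values can grow without any fixed-width limit (point 3).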

> And your idea of hundreds of thousands of tables? Very bad. Don't do it.


Sigh. I was afraid of that. :-S

I'm thinking though... what if I were to use the current representation
for words that don't appear in that many files, and a table for each word
that has >= 1000 (or whatever threshold) filenames associated with it...?

That should greatly cut down on the number of tables, while still giving
the more efficient representation for the words that need it.

Or not?


 
Jonathan Gardner
02-17-2006
About the filename ID - word ID table: Any good database (good with
large amounts of data) will handle the memory management for you. If
you get enough data, it may make sense to bother with PostgreSQL.
That has a pretty good record on handling very large sets of data, and
intermediate sets as well. Again, I can't speak about BDBs, but
something in the back of my head is saying that the entire table is
loaded into memory. If so, that's not good for large sets of data.

About base255: You can think of strings as numbers. At least, that's
how your computer sees them. Converting a number to base255 is really
silly. If it's a series of bytes (like a Python string is) then base255
is a string. You want to use an incremented number for the ID because
there are some indexes that deal with this kind of data very, very
well. If you have your number spread out like a shotgun, with clusters
here and there, the index might not be very good.

About using many tables: The best answer you are going to get is a
bunch of files---one for each word---with the names or IDs of the
files. You can store these in a single directory, or (as you'll find
out to be more efficient) a tree of directories. But this kind of data
is useless for reverse lookups. If you really do have so much data,
start using a real database that is designed to handle this kind of
data efficiently. The overhead of SQL parsing and network latency soon
gets dwarfed by lookup times anyway.
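
The one-file-per-word layout described above makes each new posting a constant-time append rather than a read-modify-write of the whole list; a hypothetical sketch (helper names invented here; real code would sanitize and shard the word before using it as a filename):

```python
import os

def post(root, word, file_id):
    # O(1) amortized: append one line to the word's posting file.
    path = os.path.join(root, word)
    with open(path, 'a') as f:
        f.write('%d\n' % file_id)

def postings(root, word):
    # Read back every file ID recorded for this word, in append order.
    path = os.path.join(root, word)
    try:
        with open(path) as f:
            return [int(line) for line in f]
    except FileNotFoundError:
        return []
```

As Jonathan notes, this is fine for word-to-files lookups but useless for the reverse direction; that is the trade-off a real database avoids.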

 
Jonathan Gardner
02-17-2006
About indexes everywhere: Yes, you don't have to be a DB expert to know
that indexes everywhere is bad. But look at this example. There are
really two ways that the data is going to get accessed in regular use.
Either they are going to ask for all files that have a word (most
likely) or they are going to see which words are in a file.

I'm going to have to name the tables now, aren't I? Here's a simple
schema:

words
--------
word_id
word

files
------
file_id
filename

word_files
--------------
file_id
word_id

If you are going to lookup by word, you'll need an index on words.word.
You'll also need an index on word_files.word_id. And then you'll need
an index on files.file_id.

If you are going to lookup by file, you'll need an index on
files.filename, word_files.file_id, and words.word_id.

So in this situation you end up needing indexes everywhere.

Now, when you are doing the initial population, you should drop all the
indexes you don't need during population. That means everything but
words.word has to go. (You'll need to find the word_id for previously
seen words.) After the initial population, then is the time to build
and add the indexes. It's much faster to build an index when you have
the entire set of data in front of you than to do it piece-by-piece.
Some indexes actually come out better built that way than they would
piecemeal.
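
The schema and the load-then-index strategy can be sketched with SQLite (a stand-in here for PostgreSQL or a BDB; table and column names are taken from the post, the sample data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()

# Schema from the post; deliberately no secondary indexes yet.
c.execute("CREATE TABLE words (word_id INTEGER PRIMARY KEY, word TEXT)")
c.execute("CREATE TABLE files (file_id INTEGER PRIMARY KEY, filename TEXT)")
c.execute("CREATE TABLE word_files (file_id INTEGER, word_id INTEGER)")

# Bulk-load phase: scan documents, assigning IDs from simple counters.
docs = {"a.txt": "the beast so sly", "b.txt": "the distant drum"}
word_ids = {}
for fid, (fname, text) in enumerate(docs.items()):
    c.execute("INSERT INTO files VALUES (?, ?)", (fid, fname))
    for w in text.split():
        wid = word_ids.setdefault(w, len(word_ids))
        c.execute("INSERT OR IGNORE INTO words VALUES (?, ?)", (wid, w))
        c.execute("INSERT INTO word_files VALUES (?, ?)", (fid, wid))

# Scan complete: only now build the indexes needed for lookups.
c.execute("CREATE INDEX idx_words_word ON words (word)")
c.execute("CREATE INDEX idx_wf_word_id ON word_files (word_id)")

# Lookup: all files containing the word 'the'.
rows = c.execute(
    "SELECT f.filename FROM words w "
    "JOIN word_files wf ON wf.word_id = w.word_id "
    "JOIN files f ON f.file_id = wf.file_id "
    "WHERE w.word = ? ORDER BY f.filename", ("the",)).fetchall()
```

The only index maintained during the scan is the implicit primary keys; the lookup indexes are created once, over the complete data, after loading finishes.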

Unfortunately this is no longer strictly topical to Python. But if you
free your mind from thinking in terms of SQL databases and look at
indexes as dicts or whatnot, then you can see that this is really a
general programming problem.

 