"" <> wrote:
> I am running into a memory fault error when processing the
> contents of a database table into a lookup hash.
>
> The problem is that I store item equivalencies (IE) in a db table IE
> with primary key (a, b). a and b are integers and correspond
> to productId values.
>
> For IE products a and b, I insert as primary key (a, b) and, notably,
> leave out (b, a). This obviously reduces the IE table size by 50%.
Is that important? If your data is so large that you need to resort to
space saving tricks even when storing it in a disk-based database, then
trying to load the whole thing into memory in an uncompressed format
seems to be a losing proposition.
>
> To build a single lookup data structure prior to writing to a lookup
> file,
Why? Database systems are designed to efficiently handle such tasks.
If you need to take the data *out* of a database to make a lookup file,
then what is the point of having it in a database in the first place?
> I try to create hash %ie which for each row has:
>
> $ie{a}{b}=1;
> $ie{b}{a}=1;
>
> Then to get all the IE for a product a, one just looks at the slice
> %ie{a} and gets all the values b1, b2, b3, ... for it.
>
> The IE table has 1.4 million rows for 80,000 products, and sometimes
> (b, a) is written [which is not a problem really]. When I create %ie,
> it must
> be upwards of 2.7 million rows, and hence, I am getting a memory fault
> error.
2.7e6/80000 = almost 34 equivalences per item on average. Is that right?
>
> The problem is, can I use a better (meaning smaller) data structure
> to capture the IE data?
Yes, but perhaps you should try to implement that structure in the
database, rather than Perl's memory. Probably the best way would be to
define a canonical identifier to each equivalence class. For example, just
pick the alphabetically first member of each class to be the canonical id
of that class.
In Perl , you would have two top-level hashes. One would be a simple
single-level hash which would map each item ID to the canonical ID for the
class it belongs to, and would have 80,000 entries (the number of items).
The other would have one entry for each canonical item, and the value would
be a reference to either a hash or an array with all the other items in
that class. For example, if you have A1, A2, A3 and B1, B2, B3, with the
As being equivalent to each other and the Bs being equivalent to each
other, you would have something like:
my %equiv=(
'A3' => 'A1',
'B3' => 'B1',
'B1' => 'B1',
'A1' => 'A1',
'B2' => 'B1',
'A2' => 'A1',
);
my %class=(
'B1' => {
'B3' => undef,
'B1' => undef,
'B2' => undef,
},
'A1' => {
'A3' => undef,
'A1' => undef,
'A2' => undef,
},
);
> I could lookup each product's IE data as needed, but that would be
> 80,000+ single queries to the database. I don't even want to try
> that.
If you are going to read 1.4 million rows anyway, I don't see why issuing
80,000 queries would be out of the question.
>
> Also, my algorithm for matching products as IE pretty much puts
> all combinations of (a, b, c, d) in the table for products a b c d.
If you are motivated to save 50% by only storing one ordering of each pair,
it seems silly to use many times as much space to store each pairing
separately. You are straining at a gnat but swallowing a camel.
Xho
--
--------------------
http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB