
# getting a submatrix of all true

Scott David Daniels
Guest
Posts: n/a

 07-06-2003
In article <3f07bd72$(E-Mail Removed)>,
(E-Mail Removed) (Scott David Daniels) writes:
> Following Bengt and (E-Mail Removed) (Anton Vredegoor)'s leads,
> the following code can be fast (at times). It is quite sensitive to the
> probability of non-zeroness (.01 works well; .05 is nowhere near so
> nice).
>
> I get 2.25 min to do 25 x 25 at .05
> 2.75 min to do 30 x 30 at .05
> It gets slow _very_ fast, but gets good numbers if the probability is low
> .25 min to do 45 x 45 at .01
> 1.5 min to do 50 x 50 at .01

OK, a bit faster: (bitcount faster, allow choose to sometimes quit early).
I get 0.25 min to do 25 x 25 at .05
5.5 min to do 30 x 30 at .05 (?! maybe above was wrong)
and:
1 sec to do 45 x 45 at .01
9 sec to do 50 x 50 at .01

-Scott David Daniels
(E-Mail Removed)

###################################

# myopt.py

__version__ = '0.4'
try:
    assert list(enumerate('ab')) == [(0, 'a'), (1, 'b')]
except NameError:
    def enumerate(iterable):
        lst = list(iterable)
        return zip(range(len(lst)), lst)

def bitcount(row):
    '''Return the number of on bits in the integer'''
    assert row >= 0
    result = 0
    while row:
        result += 1
        lsbit = row & -row
        row ^= lsbit
    return result

bytebits = [bitcount(n) for n in range(256)] # precalculate byte counts

def bitcount(row):   # replace previous bitcount
    '''Return the number of on bits in the integer (byte at a time)'''
    assert row >= 0
    result = bytebits[row & 255]
    while row >= 256:
        row >>= 8
        result += bytebits[row & 255]
    return result

def rowencode(vector):
    '''convert from a buncha numbers to a bit-representation'''
    bit = 1L
    result = 0
    for element in vector:
        if element:
            result |= bit
        bit <<= 1
    return result

class Answer:
    '''An answer represents a result calculation.'''
    totalrequests = 0
    totalcalcs = 0

    def __init__(self, rows, colmask):
        '''The columns in colmask are the ones we keep.'''
        self.colmask = colmask
        self.rows = rows
        self._score = None

    def score(self):
        '''Calculate the score lazily'''
        self.__class__.totalrequests += 1
        if self._score is None:
            self.__class__.totalcalcs += 1
            self._score = len(self.rows) * bitcount(self.colmask)
        return self._score

    def __repr__(self):
        return '%s(%d:%s, %x):%s' % (self.__class__.__name__,
                                     len(self.rows), self.rows,
                                     self.colmask, self._score)

totalcalls = 0

def choose(rows, keepcols, N=0, keeprows=None, best=None):
    '''Choose rows and columns to keep.  Return an Answer for the choice'''
    global totalcalls
    totalcalls += 1
    if keeprows is None:
        keeprows = []
    try:
        while 0 == rows[N] & keepcols:
            keeprows.append(N)
            N += 1
    except IndexError:
        return Answer(keeprows, keepcols)

    # a difference: some kept columns in this row are non-0
    # must drop either those columns or this row

    # Calculate result if we keep this row (drop problem columns)
    if best is not None and best.colmask == keepcols & ~rows[N]:
        withrow = best   # Already have a calculated with this mask.  use it.
    else:
        withrow = choose(rows, keepcols & ~rows[N], N+1, keeprows + [N], best)

    # Calculate result if we drop this row
    skiprow = choose(rows, keepcols, N+1, keeprows, best or withrow)

    # Choose the better answer (keep the row if everything is equal).
    if (withrow is skiprow
            or withrow.score() >= skiprow.score()):
        return withrow
    else:
        return skiprow

# The data used from the example
X = [ [1, 0, 0, 0, 1],
      [0, 0, 0, 0, 0],
      [0, 0, 0, 1, 0],
      [0, 0, 0, 0, 0],
      [0, 0, 1, 0, 0],
      [0, 0, 1, 0, 0] ]

def printrep(row, symbols, mask):
    '''Translate a row to print symbols, stepping through the column mask'''
    result = []
    for element in row:
        result.append(symbols[(1 & mask) * 2 + (element != 0)])
        mask >>= 1
    return ''.join(result)

def printanswer(data, rows, keepcols):
    '''Print the represented rows'''
    toohuge = len(data)
    rowqueue = rows + [toohuge]
    rowqueue.reverse()
    nextrow = rowqueue.pop()
    for rownumber, row in enumerate(data):
        if nextrow > rownumber:
            # This row was zapped
            print '#', printrep(row, '01OI', keepcols)
        else:
            assert rownumber == nextrow   # This row was kept
            nextrow = rowqueue.pop()
            print '_', printrep(row, '01~@', keepcols)
    assert nextrow == toohuge and not rowqueue

def bestcut(data):
    '''Calculate the best-cut for a particular matrix'''
    columns = max([len(row) for row in data])
    rowdata = [rowencode(row) for row in data]
    return choose(rowdata, (1L << columns) - 1)

def main(data=X):
    global totalcalls
    totalcalls = 0
    answer = bestcut(data)
    printanswer(data, answer.rows, answer.colmask)
    print 'Requested: %s, Calculated: %s,' % (
        Answer.totalrequests, Answer.totalcalcs)
    print 'Total choose calls required: %s' % totalcalls

def rangen(rows, columns, probability=0.05):
    '''create a rows by columns data table with 1s at the given probability'''
    import random
    result = []
    for row in range(rows):
        result.append([int(random.random() < probability)
                       for column in range(columns)])
    return result

if __name__ == '__main__':
    import sys
    if len(sys.argv) < 2:
        main(X)
    else:
        args = sys.argv[1:]
        if '.' in args[-1]:
            assert len(args) > 1
            probability = float(args.pop())
        else:
            probability = .2

        rows = int(args[0])
        if len(args) == 1:
            cols = rows
        else:
            assert len(args) == 2
            cols = int(args[1])
        main(rangen(rows, cols, probability))
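(Editorial aside: the listing above is Python 2 -- print statements and the `1L` long literal. The two bit-twiddling helpers translate to modern Python 3 almost unchanged; this quick sketch, not part of the original post, exercises them:)

```python
bytebits = [bin(n).count('1') for n in range(256)]   # per-byte popcounts

def bitcount(row):
    """Count set bits a byte at a time via the lookup table."""
    result = bytebits[row & 255]
    while row >= 256:
        row >>= 8
        result += bytebits[row & 255]
    return result

def rowencode(vector):
    """Pack a row into a bitmask: bit i is set iff vector[i] is nonzero."""
    bit, result = 1, 0
    for element in vector:
        if element:
            result |= bit
        bit <<= 1
    return result

print(bitcount(0b10110))           # -> 3
print(rowencode([1, 0, 0, 0, 1]))  # -> 17, i.e. 0b10001
```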

John Hunter
Guest
Posts: n/a

 07-07-2003
>>>>> "Jon" == Jon <(E-Mail Removed)> writes:

Jon> Sadly I can only give you a method which is certainly not
Jon> optimal, but at least finds something which is probably not
Jon> too bad, and fairly quickly. Uses the Numeric module to speed
Jon> up, so that a few thousand by a few hundred is quickly
Jon> computable. I'm curious to know if it does much better or
Jon> worse than other algorithms.

Hi Jon, thanks for the suggestion. For my data set, your solution
performed a little better size-wise than my poor man's approach, which
I gave in the original post. And it is certainly fast enough. I
liked your idea of zero-ing out the elements when you made a decision
to delete a row or column. When I was thinking about solutions using
Numeric, I kept running up against the problem of deleting a row or
column (which entails a whole array copy so is expensive).

If anyone wants to try their algorithm on "real world data" which,
unlike the ones discussed here, have lots of correlation between
missing rows and columns that might be exploited, I have uploaded the
missing data matrix to a web server. It can be loaded with:

from urllib import urlopen
url = 'http://nitace.bsd.uchicago.edu:8080/summer/jdh/missing.dat'
L = [ [int(val) for val in line.split()]
      for line in urlopen(url).readlines() ]
# L is 1072x83

I think I may adapt your algorithm to terminate when the percent
missing falls below some threshold, removing the requirement that
*all* missing values be removed. This will drop the worst offenders
observation or variable wise. Then I can use a regression approach to
fill in the remaining ones.
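(The regression fill-in John mentions is not shown anywhere in the thread; as one hypothetical way it might look -- the function name, the ordinary-least-squares choice, and regressing only on fully observed columns are all my assumptions -- in modern NumPy:)

```python
import numpy as np

def regression_fill(data, missing):
    """Fill masked cells by regressing each incomplete column on the
    columns that have no missing entries (ordinary least squares)."""
    filled = data.copy()
    complete = ~missing.any(axis=0)          # columns with no missing cells
    X = data[:, complete]                    # predictors: complete columns
    for j in np.nonzero(missing.any(axis=0))[0]:
        ok = ~missing[:, j]                  # rows where column j is observed
        coef, *_ = np.linalg.lstsq(X[ok], data[ok, j], rcond=None)
        filled[missing[:, j], j] = (X @ coef)[missing[:, j]]
    return filled
```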

Jon> Interesting problem!

I agree. There are lots of extensions to the original formulation
that are relevant to the missing value problem. As we have seen in
the posts here, there are good exact solutions for smallish matrices,
but for larger ones a statistical approach is required. The right
statistical solution surely depends on the correlation structure of
the matrix.

Also, it would be nice to generalize to higher dimensions (eg 3D
matrices). For my problem, I have N observations of M variables over
D days. So this is the 3D version of the 2D problem above. You may
want to impose extra requirements, like, no days can be dropped, or at
most 10% of the days can be dropped, or, if you want to compute changes
in variables over days, that no days can be dropped.

The more general formulation is something like

Given an N-D (N0, M0, P0, ...) boolean matrix X with a covariance
matrix C, what is the largest submatrix of X of dimensions (N1, M1,
P1, ...) such that the fraction of remaining dimensions satisfies

(N1/N0, M1/M0, P1/P0, ...) >= (alpha, beta, gamma, ..)

and the total percent true <= p.

The original problem is a special case of this where alpha=0, beta=0,
p=0 and C unspecified.

Anyone looking for a dissertation project ?

John Hunter

John Hunter
Guest
Posts: n/a

 07-07-2003
>>>>> "John" == John Hunter <(E-Mail Removed)> writes:

John> I think I may adapt your algorithm to terminate when the
John> percent missing falls below some threshold, removing the
John> requirement that *all* missing values be removed. This will
John> drop the worst offenders observation or variable wise. Then
John> I can use a regression approach to fill in the remaining
John> ones.

I think I have found a major improvement on your algorithm Jon, by
changing the rule which determines whether to drop a row or a column
after you have found the worst offenders for each. The original
approach was to keep the one with the most valid (zero) elements,
which makes sense since that is the quantity you are trying to
maximize. But consider the case where there is correlation so that
true values on a row or column tend to be correlated with other true
values on that row or column, as in

X = 1 0
1 0
1 0
1 0
1 0
0 0
0 0

Your algorithm would keep the column, since keeping a row gives you
only one zero whereas keeping a column gives you 2. By looking at the
*fraction* of ones in the worst offending row and column, you can do
much better. By deleting the row or column with the greatest fraction
of ones, you reduce the likelihood that you'll encounter a one in
future iterations. Using this approach, I was able to find a 893x67
submatrix with a score of 59831 versus the 462x81 submatrix with a
score of 37422.
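(To see the two rules side by side, here is a small sketch of mine, not from the thread, written in modern Python 3, applying both greedy rules to the 7x2 example above: the count rule repeatedly deletes rows and ends at a 2x2 submatrix, while the fraction rule deletes the bad column and keeps a 7x1 submatrix.)

```python
X = [[1, 0],
     [1, 0],
     [1, 0],
     [1, 0],
     [1, 0],
     [0, 0],
     [0, 0]]

def greedy_score(M, use_fraction):
    """Greedily delete rows/columns until no ones remain; return kept size."""
    rows = list(range(len(M)))
    cols = list(range(len(M[0])))
    while rows and cols:
        rsum = {r: sum(M[r][c] for c in cols) for r in rows}
        csum = {c: sum(M[r][c] for r in rows) for c in cols}
        wr = max(rows, key=rsum.get)       # worst row (most ones)
        wc = max(cols, key=csum.get)       # worst column (most ones)
        if rsum[wr] == 0 and csum[wc] == 0:
            break                          # submatrix is all zero
        if use_fraction:
            # John's rule: drop whichever has the larger fraction of ones
            drop_row = rsum[wr] / len(cols) >= csum[wc] / len(rows)
        else:
            # Jon's rule: keep whichever worst offender has more zeros
            drop_row = (len(cols) - rsum[wr]) <= (len(rows) - csum[wc])
        if drop_row:
            rows.remove(wr)
        else:
            cols.remove(wc)
    return len(rows) * len(cols)

print(greedy_score(X, use_fraction=False))  # count rule -> 4 (2x2 kept)
print(greedy_score(X, use_fraction=True))   # fraction rule -> 7 (7x1 kept)
```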

I have also added thresholds to the code, so that you can terminate
with a fraction of missing values remaining. By setting the
thresholds to zero, you get the special case of no missing values.

The example below also includes a no-copy optimization that Jon
emailed me off list. The algorithm run time for a 1073x82 matrix is
around 30ms on my 700MHz machine.

John Hunter

from __future__ import division
import sys, time
from urllib import urlopen
from Numeric import *

def trim_missing(a, thresh=(0.05,0.05) ):
    """
    a is a numObservations by numVariables boolean matrix

    thresh = (othresh, vthresh) are the observation and variable
    thresholds that trigger deletion

    remove any rows where percent missing vars > vthresh
    remove any cols where percent missing obs > othresh

    This is done iteratively, rechecking the percent missing after each
    row or col removal, with the largest offender removed on each
    iteration

    return value is score, rowInd, colInd where the two ind arrays
    are the indices of the rows and cols that were kept
    """
    othresh, vthresh = thresh   # thresholds for deletion
    dr = zeros(a.shape[0])      # Arrays to note deleted rows...
    dc = zeros(a.shape[1])      # ... and columns
    sizeleft = list(a.shape)    # Remaining dimensions
    sr = sum(a,1)               # Row scores: count of ones in each row
    sc = sum(a,0)               # Column scores: count of ones in each column
    while 1:                    # Keep deleting till none left
        mr = argmax(sr)         # index of highest row score
        mc = argmax(sc)         # index of highest column score
        fracRow = sr[mr]/sizeleft[1]
        fracCol = sc[mc]/sizeleft[0]

        if fracRow<=vthresh and fracCol<=othresh:
            break               # stop when missing criteria are satisfied
        if fracRow-vthresh<fracCol-othresh:
            # deleting the column
            sr = sr-a[:,mc]     # rows reduced by contents of column
            sc[mc] = 0          # sum of column now zero
            dc[mc] = 1          # tags column as deleted
            sizeleft[1] -= 1
        else:
            # deleting the row
            sc = sc-a[mr,:]     # columns reduced by contents of row
            sr[mr] = 0          # sum of row is now zero
            dr[mr] = 1          # tags row as deleted
            sizeleft[0] -= 1

    score = (a.shape[0]-sum(dr))*(a.shape[1]-sum(dc))
    dr = compress(dr==0, range(a.shape[0]))   # gives return format of
    dc = compress(dc==0, range(a.shape[1]))   # optimrcs posted by Bengt

    return score, dr, dc

url = 'http://nitace.bsd.uchicago.edu:8080/summer/jdh/missing.dat'
L = [ [int(val) for val in line.split()]
      for line in urlopen(url).readlines() ]
a = array(L)
start = time.time()
score, goodRows, goodCols = trim_missing(a, thresh=(0.0, 0.0))
end = time.time()
print end-start
print 'Original size', a.shape
print 'Number of non-zeros in', sum(ravel(a))
print 'Reduced size', len(goodRows), len(goodCols)
print 'Score', score
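(Editorial aside: the Numeric package is long obsolete. For anyone rerunning this today, a near-literal port of trim_missing to modern NumPy might look like the following sketch; only the array-API spellings change, the algorithm is identical.)

```python
import numpy as np

def trim_missing(a, thresh=(0.05, 0.05)):
    """Port of the Numeric trim_missing above to modern NumPy."""
    othresh, vthresh = thresh
    dr = np.zeros(a.shape[0], dtype=int)        # deleted-row flags
    dc = np.zeros(a.shape[1], dtype=int)        # deleted-column flags
    sizeleft = list(a.shape)
    sr = a.sum(axis=1).astype(float)            # ones per row
    sc = a.sum(axis=0).astype(float)            # ones per column
    while True:
        mr, mc = sr.argmax(), sc.argmax()       # worst row, worst column
        fracRow = sr[mr] / sizeleft[1]
        fracCol = sc[mc] / sizeleft[0]
        if fracRow <= vthresh and fracCol <= othresh:
            break
        if fracRow - vthresh < fracCol - othresh:
            sr = sr - a[:, mc]                  # deleting the column
            sc[mc] = 0
            dc[mc] = 1
            sizeleft[1] -= 1
        else:
            sc = sc - a[mr, :]                  # deleting the row
            sr[mr] = 0
            dr[mr] = 1
            sizeleft[0] -= 1
    score = (a.shape[0] - dr.sum()) * (a.shape[1] - dc.sum())
    return score, np.nonzero(dr == 0)[0], np.nonzero(dc == 0)[0]
```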

Bengt Richter
Guest
Posts: n/a

 07-08-2003
On Mon, 07 Jul 2003 11:00:11 -0500, John Hunter <(E-Mail Removed)> wrote:

>>>>>> "Jon" == Jon <(E-Mail Removed)> writes:

>
> Jon> Sadly I can only give you a method which is certainly not
> Jon> optimal, but at least finds something which is probably not
> Jon> too bad, and fairly quickly. Uses the Numeric module to speed
> Jon> up, so that a few thousand by a few hundred is quickly
> Jon> computable. I'm curious to know if it does much better or
> Jon> worse than other algorithms.
>
>Hi Jon, thanks for the suggestion. For my data set, your solution
>performed a little better size-wise than my poor man's approach, which
>I gave in the original post. And it is certainly fast enough. I
>liked your idea of zero-ing out the elements when you made a decision
>to delete a row or column. When I was thinking about solutions using
>Numeric, I kept running up against the problem of deleting a row or
>column (which entails a whole array copy so is expensive).
>
>If anyone wants to try their algorithm on "real world data" which,
>unlike the ones discussed here, have lots of correlation between
>missing rows and columns that might be exploited, I have uploaded the
>missing data matrix to a web server. It can be loaded with:
>
> from urllib import urlopen
> url = 'http://nitace.bsd.uchicago.edu:8080/summer/jdh/missing.dat'
> L = [ [int(val) for val in line.split()]
>       for line in urlopen(url).readlines() ]
> # L is 1072x83
>

The statistics on this data are interesting, referring to counts of 1's:
(not very tested)

[15:31] C:\pywk\clp\submx>python showmx.py -f missing.dat -counts
nr=1072, nc=83
totcc=6239, totrc=6239 => *100./(83*1072) => 7.01
count:freq -- cols with count ones occurred freq times
0:16 1:4 8:3 14:2 16:4 20:4 21:3 23:1
25:6 27:1 28:7 32:4 41:1 59:1 60:3 62:3
65:1 95:2 96:1 108:3 125:1 134:1 158:1 177:1
199:1 206:2 237:1 241:1 252:1 254:1 1061:2
count:freq -- rows with count ones occurred freq times
0:3 1:2 2:460 3:67 4:140 5:66 6:42 7:49
8:61 9:17 10:46 11:23 12:8 13:8 14:12 15:10
16:3 17:5 18:1 19:7 20:4 21:2 22:5 23:1
27:2 28:2 30:2 31:1 33:1 34:1 36:2 38:1
40:1 41:1 44:4 45:7 48:1 55:2 56:2

And sorted per descending frequency of number of cols and rows with particular counts:

[15:31] C:\pywk\clp\submx>python showmx.py -f missing.dat -counts -freq
nr=1072, nc=83
totcc=6239, totrc=6239 => *100./(83*1072) => 7.01
freq:count -- cols with count ones occurred freq times
16:0 7:28 6:25 4:32 4:20 4:16 4:1 3:108
3:62 3:60 3:21 3:8 2:1061 2:206 2:95 2:14
1:254 1:252 1:241 1:237 1:199 1:177 1:158 1:134
1:125 1:96 1:65 1:59 1:41 1:27 1:23
freq:count -- rows with count ones occurred freq times
460:2 140:4 67:3 66:5 61:8 49:7 46:10 42:6
23:11 17:9 12:14 10:15 8:13 8:12 7:45 7:19
5:22 5:17 4:44 4:20 3:16 3:0 2:56 2:55
2:36 2:30 2:28 2:27 2:21 2:1 1:48 1:41
1:40 1:38 1:34 1:33 1:31 1:23 1:18

There are two columns with 1061 ones out of a possible 1072, so they account for 2
in most of the row counts. Hence the most popular row count (2) mostly goes to 0.
Interestingly, 4(2) is 'way more popular than 3(1). This doesn't look like what
I'd expect from uniform overall probability of some percentage. Discounting the
stuck columns, no ones shouldn't be more likely than one one, since there's only
one way to get all zeroes, but there's then 81 equally probable ways to get one one.
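(Bengt's expectation checks out numerically: under a uniform independent-cell model with n free columns and cell probability p, the ratio P(one 1)/P(no 1s) is n*p/(1-p). A quick sketch of mine, reusing the totals printed above for the density estimate:)

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k ones) among n independent cells, each one with prob p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 81                               # columns left after the 2 stuck ones
p = (6239 - 2*1061) / (1072 * 81)    # density with the stuck columns removed
p0 = binom_pmf(0, n, p)              # row with no ones
p1 = binom_pmf(1, n, p)              # row with exactly one one
print(p1 / p0)                       # about 4: one 1 is ~4x likelier than none
```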

Look at a uniform 5% on 1072 x 83 (twice to show variation).
There's a distribution with tails, centered on 5% of 1072 and 83 it seems:

[17:48] C:\pywk\clp\submx>showmx.py 1072 83 5 -counts
nr=1072, nc=83
totcc=4338, totrc=4338 => *100./(83*1072) => 4.88
count:freq -- cols with count ones occurred freq times
35:1 39:1 40:2 41:1 42:5 43:1 45:5 46:3
47:2 48:6 49:3 50:5 51:5 52:5 53:5 54:5
55:5 56:5 58:3 59:4 61:1 62:2 63:1 64:1
65:3 70:1 78:2
count:freq -- rows with count ones occurred freq times
0:21 1:74 2:134 3:219 4:218 5:176 6:113 7:64
8:29 9:15 10:6 11:2 12:1

[17:48] C:\pywk\clp\submx>showmx.py 1072 83 5 -counts
nr=1072, nc=83
totcc=4394, totrc=4394 => *100./(83*1072) => 4.94
count:freq -- cols with count ones occurred freq times
37:1 40:1 41:3 43:1 44:2 45:2 46:5 47:5
48:1 49:3 50:6 51:8 52:7 53:7 54:3 55:3
56:4 57:3 58:3 59:1 60:1 61:1 62:1 63:1
64:3 65:2 68:2 69:1 70:1 72:1
count:freq -- rows with count ones occurred freq times
0:19 1:58 2:150 3:218 4:218 5:173 6:111 7:63
8:38 9:13 10:5 11:5 12:1

>I think I may adapt your algorithm to terminate when the percent
>missing falls below some threshold, removing the requirement that
>*all* missing values be removed. This will drop the worst offenders
>observation or variable wise. Then I can use a regression approach to
>fill in the remaining ones.
>
> Jon> Interesting problem!

Yes. So many, so little time ... ;-/

>
>I agree. There are lots of extensions to the original formulation
>that are relevant to the missing value problem. As we have seen in
>the posts here, there are good exact solutions for smallish matrices,
>but for larger ones a statistical approach is required. The right
>statistical solution surely depends on the correlation structure of
>the matrix.

I'm pretty sure there are better exact algorithms than the brute force
one I first posted, but I haven't done a new one since, just some little
things to show things about the data, as above.
>
>Also, it would be nice to generalize to higher dimensions (eg 3D
>matrices). For my problem, I have N observations of M variables over
>D days. So this is the 3D version of the 2D problem above. Your
>want to impose extra requirements, like, no days can be dropped, or at
>most 10% of the days can be dropped, or if you want to compute changes
>in variables over days, that no days can be dropped.
>

I'm wondering if you might not have other considerations about which
data points to drop too. E.g., if an observation makes the difference
between being able to invert a matrix or not, you might opt to drop
several other things instead, even though in terms of count the set
would be less optimum. (Curiousness: what do your observations represent?)

>The more general formulation is something like
>
> Given an N-D (N0, M0, P0, ...) boolean matrix X with a covariance
> matrix C, what is the largest submatrix of X of dimensions (N1, M1,
> P1, ...) such that the fraction of remaining dimensions satisfies
>
> (N1/N0, M1/M0, P1/P0, ...) >= (alpha, beta, gamma, ..)
>
> and the total percent true <= p.
>
>The original problem is a special case of this where alpha=0, beta=0,
>p=0 and C unspecified.
>
>Anyone looking for a dissertation project ?
>
>John Hunter
>

Regards,
Bengt Richter

John Hunter
Guest
Posts: n/a

 07-08-2003
>>>>> "Bengt" == Bengt Richter <(E-Mail Removed)> writes:

Bengt> Curiousness: what do your observations represent?

By day, I do research in neurology and neurobiology. By night, stock
market analysis: academia doesn't pay! I have written several modules
to parse the Yahoo Finance pages, eg,
http://biz.yahoo.com/z/a/x/xoma.html and
http://biz.yahoo.com/p/x/xoma.html. The 2 fields which you identified
as missing on all but 2 observations are Revenue Estimates: "Year Ago
Sales" and "Sales Growth" for This Year, both of which are N/A for
almost all tickers.

Jon> Interesting problem!
Bengt> Yes. So many, so little time ... ;-/

Exactly.

JDH

John Hunter
Guest
Posts: n/a

 07-08-2003
>>>>> "John" == John Hunter <(E-Mail Removed)> writes:

John> Using this approach, I was able to find a 893x67 submatrix
John> with a score of 59831 versus the 462x81 submatrix with a
John> score of 37422.

Onwards and upwards.... Since this is beginning to look like a
stochastic optimization problem, I followed the lead of simulated
annealing algorithms and added a little noise to the step where the
decision to delete the row or column is made:

from MLab import randn

def trim_missing(a, thresh=(0.05,0.05), sigma=0.005 ):
    ...snip...
    if fracRow-vthresh+sigma*randn()<fracCol-othresh:
        # delete the column
    else:
        # delete the row

where sigma is like the temperature parameter of a simulated annealing
algorithm.

Repeatedly stepping over a variety of sigmas in search of a minimum
has yielded minor improvement for the dataset of interest. Best
results:

Deterministic: 59831 893 67
Random: 59928 908 66 for sigma=0.0017

For random arrays the results are more striking. For random matrices
500x500, comparing the scores of the deterministic (sigma=0) versus
random (best sigma in [0,0.01]) searches:

          score         score
% ones    mean determ   mean random   mean(ratio)   std(ratio)
  1%      41753.2       42864.6       1.03          0.0093
  5%       5525.8        6174.2       1.12          0.06
 10%       1968.3        2193.9       1.12          0.05

For a matrix of the same number of total elements, but unbalanced in
the dimensions, eg, 1000x250, the results are more pronounced, in
favor of the random search algorithm:

          score         score
% ones    mean determ   mean random   mean(ratio)   std(ratio)
  1%      48724.0       51451.8       1.06          0.018
  5%       6340.5        7542.3       1.19          0.038
 10%       2092.5        2458.6       1.18          0.091

Below is a script that performs a random search of a random matrix for
a variety of sigmas.

JDH

from __future__ import division
import sys, time
from urllib import urlopen
from Numeric import *
from MLab import randn, rand, mean, std

def trim_missing(a, thresh=(0.05,0.05), sigma=0.005 ):
    """
    a is a numObservations by numVariables boolean matrix

    thresh = (othresh, vthresh) are the observation and variable
    thresholds that trigger deletion

    remove any rows where percent missing vars > vthresh
    remove any cols where percent missing obs > othresh

    This is done iteratively, rechecking the percent missing after each
    row or col removal, with the largest offender removed on each
    iteration

    return value is score, rowInd, colInd where the two ind arrays
    are the indices of the rows and cols that were kept
    """
    othresh, vthresh = thresh   # thresholds for deletion
    dr = zeros(a.shape[0])      # Arrays to note deleted rows...
    dc = zeros(a.shape[1])      # ... and columns
    sizeleft = list(a.shape)    # Remaining dimensions
    sr = sum(a,1)               # Row scores: count of ones in each row
    sc = sum(a,0)               # Column scores: count of ones in each column
    while 1:                    # Keep deleting till none left
        mr = argmax(sr)         # index of highest row score
        mc = argmax(sc)         # index of highest column score
        fracRow = sr[mr]/sizeleft[1]
        fracCol = sc[mc]/sizeleft[0]

        if fracRow<=vthresh and fracCol<=othresh:
            break               # stop when missing criteria are satisfied
        if fracRow-vthresh+sigma*randn()<fracCol-othresh:
            # deleting the column (noisy comparison; sigma ~ temperature)
            sr = sr-a[:,mc]     # rows reduced by contents of column
            sc[mc] = 0          # sum of column now zero
            dc[mc] = 1          # tags column as deleted
            sizeleft[1] -= 1
        else:
            # deleting the row
            sc = sc-a[mr,:]     # columns reduced by contents of row
            sr[mr] = 0          # sum of row is now zero
            dr[mr] = 1          # tags row as deleted
            sizeleft[0] -= 1

    score = (a.shape[0]-sum(dr))*(a.shape[1]-sum(dc))
    dr = compress(dr==0, range(a.shape[0]))   # gives return format of
    dc = compress(dc==0, range(a.shape[1]))   # optimrcs posted by Bengt

    return score, dr, dc

results = []
for trial in range(10):
    a = rand(1000,250)
    a = where(a>0.90,1,0)

    score0, goodRows, goodCols = trim_missing(a, thresh=(0,0), sigma=0)
    maxScore = 0, score0, len(goodRows), len(goodCols)
    print 'Deterministic:', score0, len(goodRows), len(goodCols)
    for sigma in arange(0.0001, 0.01, 0.001):
        for i in range(10):
            score, goodRows, goodCols = trim_missing(
                a, thresh=(0,0), sigma=sigma)
            if score>maxScore[1]:
                #print '\t', sigma, score, len(goodRows), len(goodCols)
                maxScore = sigma, score, len(goodRows), len(goodCols)
    print '\t', maxScore

    results.append((score0, maxScore[1], maxScore[1]/score0))

results = array(results)
print mean(results)
print std(results)