Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Python (http://www.velocityreviews.com/forums/f43-python.html)
-   -   getting a submatrix of all true (http://www.velocityreviews.com/forums/t319169-getting-a-submatrix-of-all-true.html)

John Hunter 07-02-2003 07:16 PM

getting a submatrix of all true
 

I have a largish data set (1000 observations x 100 floating point
variables), and some of the of the data are missing. I want to try a
variety of clustering, neural network, etc, algorithms on the data,
and to keep life simple I want to reduce the dimensions of the matrix
so that I have no missing values, since not all the algorithms are
able to handle them and there is sufficient redundancy in the
variables that I can afford to lose some.

I am currently using a hack that works, but it makes me wonder if
there is an optimal solution. I define optimal as the removal of rows
and columns such that there are no missing values and
max(numRows*numCols).

My current approach is to drop rows (observations) that have more than
some prespecified number of missing variables, and then drop the
columns (variables) of the reduced data set that have any missing
values. I chose the threshold for dropping a row by eyeballing the
distribution of number of missing variables per observation, pick a
number on the low end of the distribution, and dropping the rows that
exceed the threshold.

Another way of formulating the question: for a sparse boolean matrix
(sparse on True), what is the optimal way to remove rows and columns
so that the total number of elements in the matrix is maximal and
there are no True values left.


Example:

0 0 0
0 0 0 candidate sub matrix has 12 elements
0 0 0
0 0 0

1 0 0 0 1
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 candidate submatrix has 15 elements
0 0 0 0 0 0 0 0 0 0
0 0 1 0 0

0 0
0 0 candidate submatrix has 8 elements
0 0
0 0

I want to programatically extract the 15 element matrix

Following the approach described above, I get the desired answer in
the example below, though this is a hack solution and I have the
feeling there is a better one.

from Numeric import nonzero, array, take, sum

X = array([[1, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0]])

goodObsInd = nonzero(sum(X,1)<2) # observations with < 2 missing variables
X = take(X, goodObsInd) # drop the bad

goodVarInd = nonzero(sum(X)==0) # variables with no missing data
X = take(X, goodVarInd, 1 ) # drop the bad variables

print X


John Hunter


Bengt Richter 07-03-2003 04:25 AM

Re: getting a submatrix of all true
 
On Wed, 02 Jul 2003 14:16:57 -0500, John Hunter <jdhunter@ace.bsd.uchicago.edu> wrote:

>
>I have a largish data set (1000 observations x 100 floating point
>variables), and some of the of the data are missing. I want to try a
>variety of clustering, neural network, etc, algorithms on the data,
>and to keep life simple I want to reduce the dimensions of the matrix
>so that I have no missing values, since not all the algorithms are
>able to handle them and there is sufficient redundancy in the
>variables that I can afford to lose some.
>
>I am currently using a hack that works, but it makes me wonder if
>there is an optimal solution. I define optimal as the removal of rows
>and columns such that there are no missing values and
>max(numRows*numCols).
>
>My current approach is to drop rows (observations) that have more than
>some prespecified number of missing variables, and then drop the
>columns (variables) of the reduced data set that have any missing
>values. I chose the threshold for dropping a row by eyeballing the
>distribution of number of missing variables per observation, pick a
>number on the low end of the distribution, and dropping the rows that
>exceed the threshold.
>
>Another way of formulating the question: for a sparse boolean matrix
>(sparse on True), what is the optimal way to remove rows and columns
>so that the total number of elements in the matrix is maximal and
>there are no True values left.
>
>
>Example:
>
> 0 0 0
> 0 0 0 candidate sub matrix has 12 elements
> 0 0 0
> 0 0 0
>
>1 0 0 0 1
>0 0 0 0 0 0 0 0 0 0
>0 0 0 0 0 0 0 0 0 0 candidate submatrix has 15 elements
>0 0 0 0 0 0 0 0 0 0
>0 0 1 0 0
>
> 0 0
> 0 0 candidate submatrix has 8 elements
> 0 0
> 0 0
>
>I want to programatically extract the 15 element matrix


If I understand your original optimality definition, that would be suboptimal.
I.e., how about the 16-element matrix? (x's mark the corresponding zeroes)

1 0 0 0 1
x x 0 x x
x x 0 x x
x x 0 x x
x x 1 x x

Or do they have to be adjacent?

>
>Following the approach described above, I get the desired answer in
>the example below, though this is a hack solution and I have the
>feeling there is a better one.
>
> from Numeric import nonzero, array, take, sum
>
> X = array([[1, 0, 0, 0, 1],
> [0, 0, 0, 0, 0],
> [0, 0, 0, 0, 0],
> [0, 0, 0, 0, 0],
> [0, 0, 1, 0, 0]])
>
> goodObsInd = nonzero(sum(X,1)<2) # observations with < 2 missing variables
> X = take(X, goodObsInd) # drop the bad
>
> goodVarInd = nonzero(sum(X)==0) # variables with no missing data
> X = take(X, goodVarInd, 1 ) # drop the bad variables
>
> print X
>


Brute force seems to work on this little example. Maybe it can
be memoized and optimized and/or whatever to handle your larger matrix fast enough?

====< submatrix.py >================================
X = [
[1, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0]
]

def optimrcs(mx, rows=[], cols=[]):
if not rows or not cols: return 0,[],[]
maxscore = 0
for r in rows:
if not mx[r].count(1): continue
for c in cols:
if not mx[r][c]: continue
# eval column submatrix vs row submatrix
colsxthis = cols[:]; colsxthis.remove(c)
colscore, rset, cset = optimrcs(mx, rows, colsxthis)
if colscore>maxscore:
maxscore, maxrows, maxcols = colscore, rset, cset
rowsxthis = rows[:]; rowsxthis.remove(r)
rowscore, rset, cset = optimrcs(mx, rowsxthis, cols)
if rowscore >= maxscore:
maxscore, maxrows, maxcols = rowscore, rset, cset
if not maxscore:
return len(rows)*len(cols), rows, cols
return maxscore, maxrows,maxcols

if __name__ == '__main__':
score, rowsel, colsel = optimrcs(X, range(5), range(5))
print 'Score = %s, rows = %s, cols = %s' % (score,rowsel,colsel)
print
for r in range(5):
row = X[r]
for c in range(5):
if r in rowsel and c in colsel:
print '<%s>'% row[c],
else:
print ' %s '% row[c],
print
================================================== ==
Running it results thus:

[21:24] C:\pywk\clp>submatrix.py
Score = 16, rows = [1, 2, 3, 4], cols = [0, 1, 3, 4]

1 0 0 0 1
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 1 <0> <0>

Regards,
Bengt Richter

Terry Reedy 07-03-2003 06:16 AM

Re: getting a submatrix of all true
 

"John Hunter" <jdhunter@ace.bsd.uchicago.edu> wrote in message
news:mailman.1057173484.29754.python-list@python.org...
>
> I have a largish data set (1000 observations x 100 floating point
> variables), and some of the of the data are missing.


All too typical -- missing data are the bane of statistics.

> I want to try a
> variety of clustering, neural network, etc, algorithms on the data,
> and to keep life simple I want to reduce the dimensions of the

matrix
> so that I have no missing values, since not all the algorithms are
> able to handle them and there is sufficient redundancy in the
> variables that I can afford to lose some.


Statisticians have tried a variety of approaches. Googling for '
statistics "missing data" 'will give you some leads if you want.

Terry J. Reedy



John Hunter 07-03-2003 04:41 PM

Re: getting a submatrix of all true
 
>>>>> "Bengt" == Bengt Richter <bokr@oz.net> writes:


Bengt> If I understand your original optimality definition, that
Bengt> would be suboptimal. I.e., how about the 16-element
Bengt> matrix? (x's mark the corresponding zeroes)

Bengt> Or do they have to be adjacent?

No, in general they won't be. I missed that one. Just goes to show
that my "solution" is not one, which I knew. But it does efficiently
deliver good solutions, where good means large but not optimal.

Bengt> Brute force seems to work on this little example. Maybe it
Bengt> can be memoized and optimized and/or whatever to handle
Bengt> your larger matrix fast enough?

Thanks for the example. Unfortunately, it is too slow even for
moderate size matrices (30,10). I've been running it for over two
hours for a 30x10 matrix and it hasn't finished. And my data are
1000x100!

Here is a little code to generate larger matrices for testing....

from Numeric import zeros, array, Int, put, reshape
from RandomArray import uniform

numRows, numCols = 30,10
numMissing = int(0.05*numRows*numCols) # 5% missing

X = zeros( (300,))
ind = uniform(0, numRows*numCols, (numMissing,)).astype(Int)

put(X,ind,1)
X = reshape(X, (numRows,numCols))

# turn it into a list for your code....
X = [ [val for val in row] for row in X]
for row in X: print row

Last night, I began to formulate the problem as a logic statement,
hoping this would give me an idea of how to proceed. But no progress
yet. But I have come to the conclusion that with P ones, brute force
requires 2^P combinations. With 1000x100 with 5% missing that gives
me 2^5000. Not good.

Thanks for your suggestion, though. If you have any more thoughts,
let me know.

John Hunter





John Hunter 07-03-2003 04:43 PM

Re: getting a submatrix of all true
 
>>>>> "Terry" == Terry Reedy <tjreedy@udel.edu> writes:


Terry> Statisticians have tried a variety of approaches. Googling
Terry> for ' statistics "missing data" 'will give you some leads
Terry> if you want.

I have done some searching. I'm familiar with the common methods
(delete every row that contains any missing, replace missing via mean
or regression or something clever) but haven't seen any discussion of
dropping variables and observations together to yield data sets with
no missing values. Have you seen something like this?

John Hunter


Terry Reedy 07-03-2003 09:14 PM

Re: getting a submatrix of all true
 

"John Hunter" <jdhunter@ace.bsd.uchicago.edu> wrote in message
news:mailman.1057250709.8853.python-list@python.org...
> >>>>> "Terry" == Terry Reedy <tjreedy@udel.edu> writes:

>
>
> Terry> Statisticians have tried a variety of approaches.

Googling
> Terry> for ' statistics "missing data" 'will give you some leads
> Terry> if you want.
>
> I have done some searching. I'm familiar with the common methods
> (delete every row that contains any missing, replace missing via

mean
> or regression or something clever) but haven't seen any discussion

of
> dropping variables and observations together to yield data sets with
> no missing values. Have you seen something like this?


There are also calculation methods for at least some analyses that
allow for missing data . One of the google hits is for the book
Statistical Analysis with Missing Data. I have not seen it, but it is
not unique.

As I hinted, there are no really nice solutions to missing data. I
have done both row and column deletion. Sometimes I have done
multiple analyses with different deletion strategies: once with enough
vars deleted so all or most cases are complete, and again with enough
cases deleted so that all or most var are complete.

I would start by counting (with a program) the number of missing
values for each row and then construction the frequency distribution
thereof. Then the same for the columns, with the addition of a
correlation table or tables.

One thing one can do with vars is to combine some to make a composite
measure. For instance, if three variables more-or-less measure the
same thing, one can combine (perhaps by the mean of those present) to
make one variable that is present if any of the three are, so it is
only missing for cases (rows) that are missing all three. This type
of work requires that you look at the variables and consider their
meaning, rather than just inputing them into a blind proceedure that
consisders all vars to be the same.

Terry J. Reedy



Jon 07-04-2003 06:16 PM

Re: getting a submatrix of all true
 
bokr@oz.net (Bengt Richter) wrote in message news:<be0b84$89a$0@216.39.172.122>...
> On Wed, 02 Jul 2003 14:16:57 -0500, John Hunter <jdhunter@ace.bsd.uchicago.edu> wrote:
> >I am currently using a hack that works, but it makes me wonder if
> >there is an optimal solution. I define optimal as the removal of rows
> >and columns such that there are no missing values and
> >max(numRows*numCols).


There must be! For each missing entry you can choose to drop either
the row or the column - giving a maximum of 2^N possible ways for N
missing elements. Given that the order in which the rows/columns are
deleted is unimportant the number of possibilities to test is less
than that, but the difficulty is that your decision about row or
column for one element affects the decision for other elements if they
share rows or columns. Graph theoretic approaches probably help - it's
similar to reordering matrix elements to avoid fill in when
decomposing sparse matrices. Depending on N, you might be able to test
all possibilities in a reasonable time, but I have a nasty feeling you
need to check all 2^M to be sure of an optimal solution, where M is
the subset of N which either a share a row or column with another
problematic element.

> >Another way of formulating the question: for a sparse boolean matrix
> >(sparse on True), what is the optimal way to remove rows and columns
> >so that the total number of elements in the matrix is maximal and
> >there are no True values left.


Sadly I can only give you a method which is certainly not optimal, but
at least finds something which is probably not too bad, and fairly
quickly. Uses the Numeric module to speed up, so that a few thousand
by a few hundred is quickly computable. I'm curious to know if it does
much better or worse than other algorithms.

Interesting problem!

Cheers,

Jon

================================================== ===================
from Numeric import *

def pickrcs(a,debug=0):
b=array(a,copy=1) # local copy of array is modified
dr=zeros(b.shape[0]) # Arrays to note deleted rows...
dc=zeros(b.shape[1]) # ... and columns
sizeleft=[b.shape[0],b.shape[1]] # Remaining dimensions
while 1: # Keep deleting till none left
sr=sum(b,1) # Column scores = ij's to remove
sc=sum(b,0) # Row scores
mr=argmax(sr) # index highest row score
mc=argmax(sc) # index highest column score
if sr[mr]==0 and sc[mc]==0 :
break # stop when nothing to delete
# pick the row/column to delete fewest useful elements
if sizeleft[1]-sr[mr] > sizeleft[0]-sc[mc]:
b[:,mc]=0 # Zero out column
dc[mc]=1 # tags column as deleted
sizeleft[1]-=1
else:
b[mr,:]=0 # Zero out row
dr[mr]=1
sizeleft[0]-=1
# end of deletions loop - should be no missing elements now
#
# paranoia!, check if anything was left after deletions
if sum(dr)<b.shape[0] and sum(dc)<b.shape[1]:
dr=compress(dr==0,range(b.shape[0])) # gives return format of
dc=compress(dc==0,range(b.shape[1])) # optimrcs posted by Bengt
score=dr.shape[0]*dc.shape[0] # Richter
return score,dr,dc
else:
print "Sorry, didn't manage to let anything survive"
return 0,0,0

# test code

X = [
[1, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0]
]

if __name__ == '__main__':
a=array(X)
score, rowsel, colsel = pickrcs(a)
print "score=",score
for r in range(5):
row = X[r]
for c in range(5):
if r in rowsel and c in colsel:
print '<%s>'% row[c],
else:
print ' %s '% row[c],
print
# now try for a larger problem - use a random pattern
import RandomArray,time
start=time.time()
x0=1000 # problem size
x1=600
a=RandomArray.random((x0,x1))
a=where(a>0.999,1,0) # ~ 0.1% of elements are 1
print "Using a large random matrix"
print "Number of non-zeros=",sum(ravel(a)),"in",x0,"x",x1,"matrix"
score, rowsel, colsel = pickrcs(a)
print 'Score = %s' % (score)
print 'Number of rows deleted=',a.shape[0]-rowsel.shape[0]
print 'Number of columns deleted=',a.shape[1]-colsel.shape[0]
print time.time()-start," seconds"



==sample output (of course second part is random!)===============

score= 16
1 0 0 0 1
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 1 <0> <0>
Using a large random matrix
Number of non-zeros= 607 in 1000 x 600 matrix
Score = 331776
Number of rows deleted= 424
Number of columns deleted= 24
3.38999998569 seconds

Anton Vredegoor 07-04-2003 10:24 PM

Re: getting a submatrix of all true
 
John Hunter <jdhunter@ace.bsd.uchicago.edu> wrote:

>Another way of formulating the question: for a sparse boolean matrix
>(sparse on True), what is the optimal way to remove rows and columns
>so that the total number of elements in the matrix is maximal and
>there are no True values left.


After having read Bengts code (and scraping the parts of my exploded
head from the floor) and after later getting the hint about the size
of the problem being 2**N, it finally dawned on me that the problem
would be related to getting all possible combinations of a range.

The following code has the advantage that it generates solutions by
index, for example "M[i]" returns the score and the row and column
information for index i. (note that is not the same as finding the
optimal solution, but it still could be handy to be able to generate a
specific one by index)

However for small matrices it is possible to do exhaustive searching
with "max(M)".

Maybe if this would be combined with some heuristic algorithm it would
be better (genetic algorithm?)

Code below (needs Python 2.3, very lightly tested), I hope this helps,

Anton

class Combinations:

def __init__(self,n,k):
self.n,self.k,self.count = n,k,self.noverk(n,k)

def __getitem__(self,index):
#combination number index
if index > self.count - 1: raise IndexError
res,x,rest = [],self.n,index
for i in range(self.k,0,-1):
while self.noverk(x,i) > rest : x -= 1
rest -= self.noverk(x,i)
res.append(x)
return res

def noverk(self,n,k):
return reduce(lambda a,b: a*(n-b)/(b+1),range(k),1)

class AllCombinations:

def __init__(self,n):
self.n, self.count = n, 2**n

def __getitem__(self,index):
#combination number index for all k in combinations(n,k)
if index > self.count - 1: raise IndexError
n,rest = self.n,index
for k in range(n+1):
c = Combinations(n,k)
if rest - c.count < 0:
return c[rest]
rest -= c.count

class ScoreMatrix:

def __init__(self, X):
self.AC = AllCombinations(len(X))
self.X = X

def __getitem__(self,i):
#score selected rows and columns by index
rows = self.AC[i][::-1]
if rows:
Y = [self.X[i] for i in rows]
Z = [(i,z) for i,z in enumerate(zip(*Y)) if 1 not in z]
cols = [i for i,z in Z]
score = sum(map(len,[z for i,z in Z]))
return score,rows,cols
else: return 0,[],[]

def test():
X = [ [1, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 1, 0, 0], ]

M = ScoreMatrix(X)
sc,rows,cols = max(M)
print sc
for i,r in enumerate(X):
for j,c in enumerate(r):
if i in rows and j in cols: print c,
else: print 'x',
print

if __name__=='__main__':
test()

output:

16
x x x x x
0 0 x 0 0
x x x x x
0 0 x 0 0
0 0 x 0 0
0 0 x 0 0

Anton Vredegoor 07-05-2003 10:49 AM

Re: getting a submatrix of all true
 
John Hunter <jdhunter@ace.bsd.uchicago.edu> wrote:

>>>>>> "Bengt" == Bengt Richter <bokr@oz.net> writes:

> Bengt> Brute force seems to work on this little example. Maybe it
> Bengt> can be memoized and optimized and/or whatever to handle
> Bengt> your larger matrix fast enough?
>
>Thanks for the example. Unfortunately, it is too slow even for
>moderate size matrices (30,10). I've been running it for over two
>hours for a 30x10 matrix and it hasn't finished. And my data are
>1000x100!


I posted a somewhat long version before (inspired by Bengts idea) but
I remembered an easier way to generate all combinations, and also I
noticed that there is freedom in choosing to analyze rows or columns,
which can be computationally advantageous. Below is a version that
depends mostly on 2**N where N is the smallest of those two values.
Unfortunately 2**100 is still way to big a number, but my code can now
routinely solve 100x15 matrices ...

>Last night, I began to formulate the problem as a logic statement,
>hoping this would give me an idea of how to proceed. But no progress
>yet. But I have come to the conclusion that with P ones, brute force
>requires 2^P combinations. With 1000x100 with 5% missing that gives
>me 2^5000. Not good.


IMO it depends (except for the amount time spent in copying data of
course) on the smallest of the number of rows or columns, so that's
2**100 in this case. Maybe for some matrices, your number is
preferable?

>Thanks for your suggestion, though. If you have any more thoughts,
>let me know.


I hope you don't mind me following this too :-) Nice problem, and I'm
not so sure about my math as I sound above, if anyone can prove me
right or wrong I'd be happy anyway ...

Anton

---

class ScoreMatrix:

def __init__(self, X):
n1,n2 = len(X),len(X[0])
if n2<n1:
self.X = zip(*X)
self.rotated = True
n = n2
else:
self.X = X
n = n1
self.rotated = False
self.R = range(n)
self.count = 2**n

def __getitem__(self,i):
#score selected rows and columns by index
if not (-1<i<self.count): raise IndexError
rows =[x for j,x in enumerate(self.R) if 1<<j &i]
if rows:
Y = [self.X[i] for i in rows]
Z = [(i,z) for i,z in enumerate(zip(*Y)) if 1 not in z]
cols = [i for i,z in Z]
score = sum(map(len,[z for i,z in Z]))
if self.rotated: return score,cols,rows
else: return score,rows,cols
else: return 0,[],[]

def test():
from random import random,randint
f = .05
r,c = 100,15
X=[]
for i in range(r):
X.append([])
for j in range(c):
X[i].append(int(random()<f))
M = ScoreMatrix(X)
highscore = 0,[],[]
for i,sc in enumerate(M):
if sc > highscore:
mi,highscore = i,sc
sc,rows,cols = highscore
print "maximum score: %s" %(sc)
print "index of this score: %s" %(mi)
print "tabledata:"
for i,r in enumerate(X):
for j,c in enumerate(r):
if i in rows and j in cols: print c,
else: print 'x',
print

if __name__=='__main__':
test()




Scott David Daniels 07-06-2003 06:10 AM

Re: getting a submatrix of all true
 
Folllowing Bengt and anton@vredegoor.doge.nl (Anton Vredegoor)'s leads,
the following code can be fast (at times). It is quite sensitive to the
probability of non-zeroness, (.01 works well, the .o5 is nowhere near so
nice).

I get 2.25 min to do 25 x 25 at .05
2.75 min to do 30 x 30 at .05
It gets slow _very_ fast, but gets good numbers if the probability is low
.25 min to do 45 x 45 at .01
1.5 min to do 50 x 50 at .01


-Scott David Daniels
Scott.Daniels@Acm.Org (not fully back up yet, though).


Here's the code:
################################
# myopt.py
__version__ = '0.3'

try:
assert list(enumerate('ab')) == [(0, 'a'), (1, 'b')]
except NameError:
def enumerate(iterable):
# Not exact, but close enough
lst = list(iterable)
return zip(range(len(lst)), lst)


def bitcount(row):
'''Return the number of on bits in the integer argsA'''
assert row >= 0
result = 0
while row:
result += 1
lsbit = row & -row
row ^= lsbit
return result


def rowencode(vector):
'''convert from a buncha numbers to a bit-representation'''
bit = 1L
result = 0
for element in vector:
if element:
result |= bit
bit <<= 1
return result


def rowdecode(value, mask=-1):
'''convert from a bit-rep to a buncha numbers (at least as long as mask)'''
if mask < 0: mask = value
assert value >= 0
result = []
while mask or value:
result.append(value & 1)
mask >>= 1
value >>= 1
return result


class Answer(object):
'''An answer represents a result calculation.'''
__slots__ = 'rows', 'colmask', '_score'
totalrequests = 0
totalcalcs = 0

def __init__(self, colmask, rows):
'''The columns in colmask are the ones we keep.'''
self.colmask = colmask
self.rows = rows
self._score = None

def score(self):
'''Calculate the score lazily'''
self.__class__.totalrequests += 1
if self._score is None:
self.__class__.totalcalcs += 1
self._score = bitcount(self.colmask) * len(self.rows)
return self._score

def __repr__(self):
return '%s(%d:%s, %x):%s' % (self.__class__.__name__,
len(self.rows), self.rows,
self.colmask, self._score)


totalcalls = 0

def choose(rows, keepcols, N=0, keeprows=None):
'''Choose rows and columns to keep. Return an Answer for the choice'''
global totalcalls
totalcalls += 1
if keeprows is None:
keeprows = []
try:
while 0 == rows[N] & keepcols:
keeprows.append(N)
N += 1
except IndexError:
return Answer(keepcols, keeprows)

# a difference: some kept columns in this row are non-0
# must drop either those columns or this row

# Calculate result if we keep this row (drop problem columns)
withrow = choose(rows, keepcols & ~rows[N], N+1, keeprows + [N])

# Calculate result if we drop this row
skiprow = choose(rows, keepcols, N+1, keeprows)

# Choose the better answer (keep the row if everything is equal).
if (withrow.colmask == skiprow.colmask
or withrow.score() >= skiprow.score()):
return withrow
else:
return skiprow


# The data used from the example
X = [ [1, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 1, 0, 0] ]


def printrep(row, symbols, mask=0):
'''A row representing a single row (column-masked by mask)'''
assert mask >= 0
result = []
for element in row:
result.append(symbols[(1 & mask) * 2 + (element != 0)])
mask >>= 1
assert mask == 0 # mask doesn't extend beyond data.
return ''.join(result)


def printanswer(data, rows, keepcols):
'''Print the represented row'''
toohuge = len(data)
rowqueue = rows + [toohuge]
rowqueue.reverse()
nextrow = rowqueue.pop()
for rownumber, row in enumerate(data):
if nextrow > rownumber:
# This row was zapped
print '#', printrep(row, '01OI', keepcols)
else:
assert rownumber == nextrow # This row was kept
nextrow = rowqueue.pop()
print '_', printrep(row, '01~@', keepcols)
assert nextrow == toohuge and not rowqueue


def getanswer(data):
'''Calculate the best-cut for a particular matrix'''
columns = max([len(row) for row in data])
rowdata = [rowencode(row) for row in data]
return choose(rowdata, (1L << columns) - 1)


def main(data=X):
global totalcalls

totalcalls = 0
answer = getanswer(data)
print 'Requested: %s, Calculated: %s,' % (
Answer.totalrequests, Answer.totalcalcs),

print 'answer: %r,' % answer,
print 'Score: %s' % answer.score()
print 'Total choose calls required: %s' % totalcalls

printanswer(data, answer.rows, answer.colmask)



def rangen(rows, columns, probability=0.05):
'''create a rows by columns data table with 1s at the given probability'''
import random
result = []
for row in range(rows):
result.append([int(random.random() < probability)
for column in range(columns)])
return result


if __name__ == '__main__':
import sys
assert getanswer([[0]]).score() == 1
assert getanswer([[0,1], [1,0]]).score() == 1
assert getanswer([[0,1,0], [1,0,0]]).score() == 2
if len(sys.argv) < 2:
main(X)
else:
args = sys.argv[1:]
if '.' in args[-1]:
assert len(args) > 1
probability = float(args.pop())
else:
probability = .2

rows = int(args[0])
if len(args) == 1:
cols = rows
else:
assert len(args) == 2
cols = int(args[1])
main(rangen(rows, cols, probability))




All times are GMT. The time now is 09:26 AM.

Powered by vBulletin®. Copyright ©2000 - 2014, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.