Thursday, 15 August 2013

Python - How to reduce the number of entries per row of a symmetric matrix by keeping the K largest values

I have a symmetric similarity matrix and I want to keep only the k largest
values in each row.
Here's some code that does exactly what I want, but I'm wondering if
there's a better way. Particularly the flatten/reshape is a bit clumsy.
Thanks in advance.
Note that nrows (below) will have to scale into the tens of thousands.
import numpy as np
from scipy.spatial.distance import pdist, squareform

np.random.seed(1)
nrows = 4
a = np.random.rand(nrows, nrows)
# Generate a symmetric similarity matrix
s = 1 - squareform(pdist(a, 'cosine'))
print("Start with:", s, sep="\n")
# Column indices that sort each row in descending order
ss = np.argsort(s, axis=1)[:, ::-1]
# Offset each row's column indices so they index into the flattened array
s2 = ss + (np.arange(ss.shape[0]) * ss.shape[1])[:, None]
# Zero out everything after the k largest entries in each row
k = 3  # Number of top values to keep, per row
s = s.flatten()
s[s2[:, k:].flatten()] = 0
print("Desired output:", s.reshape(nrows, nrows), sep="\n")
Gives:
Start with:
[[ 1. 0.61103296 0.82177072 0.92487807]
[ 0.61103296 1. 0.94246304 0.7212526 ]
[ 0.82177072 0.94246304 1. 0.87247418]
[ 0.92487807 0.7212526 0.87247418 1. ]]
Desired output:
[[ 1. 0. 0.82177072 0.92487807]
[ 0. 1. 0.94246304 0.7212526 ]
[ 0. 0.94246304 1. 0.87247418]
[ 0.92487807 0. 0.87247418 1. ]]
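
One possible way to avoid the flatten/reshape step is sketched below. It is not part of the code above. It assumes NumPy 1.15 or later (for np.put_along_axis), and the function name keep_k_largest_per_row is made up for illustration. It uses np.argpartition, which only partially orders each row and so may do less work than a full argsort when nrows is large.

```python
import numpy as np

def keep_k_largest_per_row(s, k):
    """Return a copy of s with all but the k largest entries of each row set to 0.

    Hypothetical helper for illustration; ties are broken arbitrarily.
    """
    s = np.asarray(s, dtype=float)
    ncols = s.shape[1]
    if k <= 0:
        return np.zeros_like(s)
    if k >= ncols:
        return s.copy()
    out = s.copy()
    # After partitioning at position ncols - k, the first ncols - k indices
    # of each row point at that row's smallest entries (in no particular order).
    drop = np.argpartition(s, ncols - k, axis=1)[:, :ncols - k]
    # Zero those entries row by row, without flattening the matrix.
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out
```

Calling keep_k_largest_per_row(s, 3) on the 4x4 matrix printed under "Start with:" should reproduce the matrix under "Desired output:". If memory becomes a concern at tens of thousands of rows, the result could then be passed to scipy.sparse.csr_matrix so that only the nonzero entries are stored.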
