What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow CC BY-SA and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/17627219/



Tags: python, numpy, pandas, similarity, cosine-similarity

Asked by zbinsd

Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.


Say the input matrix is:


A= 
[0 1 0 0 1
 0 0 1 1 1
 1 1 0 1 0]

The sparse representation is:


A = 
0, 1
0, 4
1, 2
1, 3
1, 4
2, 0
2, 1
2, 3
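
For reference, a minimal sketch of building this matrix with scipy.sparse from those (row, column) pairs (all values are ones, since A is binary):

import numpy as np
from scipy import sparse

rows = np.array([0, 0, 1, 1, 1, 2, 2, 2])
cols = np.array([1, 4, 2, 3, 4, 0, 1, 3])
A_from_pairs = sparse.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(3, 5)).tocsr()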

In Python, it's straightforward to work with the matrix-input format:


import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine

A = np.array(
    [[0, 1, 0, 0, 1],
     [0, 0, 1, 1, 1],
     [1, 1, 0, 1, 0]])

dist_out = 1 - pairwise_distances(A, metric="cosine")
dist_out

Gives:


array([[ 1.        ,  0.40824829,  0.40824829],
       [ 0.40824829,  1.        ,  0.33333333],
       [ 0.40824829,  0.33333333,  1.        ]])

That's fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished? Thanks in advance.


Accepted answer by Jeff

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:


import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

# can also output sparse matrices
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

Results:


pairwise dense output:
[[ 1.          0.40824829  0.40824829]
 [ 0.40824829  1.          0.33333333]
 [ 0.40824829  0.33333333  1.        ]]

pairwise sparse output:
(0, 1)  0.408248290464
(0, 2)  0.408248290464
(0, 0)  1.0
(1, 0)  0.408248290464
(1, 2)  0.333333333333
(1, 1)  1.0
(2, 1)  0.333333333333
(2, 0)  0.408248290464
(2, 2)  1.0

If you want column-wise cosine similarities, simply transpose your input matrix beforehand:


A_sparse.transpose()
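
For example, a minimal sketch using the objects defined above: pass the transposed matrix to get column-by-column similarities.

col_similarities = cosine_similarity(A_sparse.T)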

Answered by cheeyos

You should check out scipy.sparse. You can apply operations on those sparse matrices just as you would on a normal matrix.

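A minimal sketch of that idea (the matrix here is randomly generated for illustration; these are standard scipy.sparse calls, not code from the answer):

from scipy import sparse
import numpy as np

A_sparse = sparse.random(1000, 500, density=0.01, format='csr')
dot_products = A_sparse.dot(A_sparse.T)                        # stays sparse
row_norms = np.sqrt(A_sparse.multiply(A_sparse).sum(axis=1))   # dense (1000, 1) result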

Answered by Waylon Flinn

The following method is about 30 times faster than scipy.spatial.distance.pdist. It works pretty quickly on large matrices (assuming you have enough RAM).


See below for a discussion of how to optimize for sparsity.


import numpy

# A is assumed to be your dense (n_samples x n_features) array

# base similarity matrix (all dot products)
# replace this with A.dot(A.T).toarray() for sparse representation
similarity = numpy.dot(A, A.T)

# squared magnitude of preference vectors (number of occurrences)
square_mag = numpy.diag(similarity)

# inverse squared magnitude
inv_square_mag = 1 / square_mag

# if it doesn't occur, set its inverse magnitude to zero (instead of inf)
inv_square_mag[numpy.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag
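
A usage sketch on the question's example matrix (the expected values are taken from the output shown earlier on this page):

A = numpy.array([[0, 1, 0, 0, 1],
                 [0, 0, 1, 1, 1],
                 [1, 1, 0, 1, 0]])
# running the block above with this A yields:
# cosine ≈ [[1.0, 0.40824829, 0.40824829],
#           [0.40824829, 1.0, 0.33333333],
#           [0.40824829, 0.33333333, 1.0]]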

If your problem is typical of large-scale binary preference problems, you have a lot more entries in one dimension than in the other. Also, the short dimension is the one whose entries you want to calculate similarities between. Let's call this dimension the 'item' dimension.


If this is the case, list your 'items' in rows and create A using scipy.sparse. Then replace the first line as indicated.


If your problem is atypical you'll need more modifications. Those should be pretty straightforward replacements of basic numpy operations with their scipy.sparse equivalents.

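Concretely, the sparse swap mentioned above is a one-line change (a sketch, assuming A is now a scipy.sparse matrix with your 'items' in rows):

# only the first line of the block changes; the rest stays the same
similarity = A.dot(A.T).toarray()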

Answered by Vaali

Hi, you can do it this way:


import numpy as np
import scipy.sparse as sp

# data, row and col are assumed to be your COO-format value and index arrays
temp = sp.coo_matrix((data, (row, col)), shape=(3, 59))
temp1 = temp.tocsr()

# Cosine similarity: L2-normalize each row, then multiply by the transpose
row_sums = (temp1.multiply(temp1)).sum(axis=1)
row_sums_sqrt = np.array(np.sqrt(row_sums))[:, 0]
row_indices, col_indices = temp1.nonzero()
temp1.data /= row_sums_sqrt[row_indices]
temp2 = temp1.transpose()
temp3 = temp1 * temp2
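
temp3 now holds the sparse row-by-row cosine similarity matrix; for a small case you can inspect it densely (a sketch):

print(temp3.toarray())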

Answered by zbinsd

I took all these answers and wrote a script to 1) validate each of the results (see the assertions below) and 2) see which is the fastest. Code and results are below:


# Imports
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import squareform, pdist
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# Create an adjacency matrix
np.random.seed(42)
A = np.random.randint(0, 2, (10000, 100)).astype(float).T

# Make it sparse
rows, cols = np.where(A)
data = np.ones(len(rows))
Asp = sp.csr_matrix((data, (rows, cols)), shape = (rows.max()+1, cols.max()+1))

print "Input data shape:", Asp.shape

# Define a function to calculate the cosine similarities a few different ways
def calc_sim(A, method=1):
    if method == 1:
        return 1 - squareform(pdist(A, metric='cosine'))
    if method == 2:
        Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
        return np.dot(Anorm, Anorm.T)
    if method == 3:
        Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
        return linear_kernel(Anorm)
    if method == 4:
        similarity = np.dot(A, A.T)

        # squared magnitude of preference vectors (number of occurrences)
        square_mag = np.diag(similarity)

        # inverse squared magnitude
        inv_square_mag = 1 / square_mag

        # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0

        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag)

        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = similarity * inv_mag
        return cosine.T * inv_mag
    if method == 5:
        '''
        Just a version of method 4 that takes in sparse arrays
        '''
        similarity = A*A.T
        square_mag = np.array(A.sum(axis=1))  # row sums equal squared magnitudes only because the data is binary
        # inverse squared magnitude
        inv_square_mag = 1 / square_mag

        # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0

        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag).T

        # cosine similarity (elementwise multiply by inverse magnitudes)
        # densify before scaling; on current scipy, multiply() returns a sparse result
        cosine = np.asarray(similarity.multiply(inv_mag).todense())
        return cosine * inv_mag.T
    if method == 6:
        return cosine_similarity(A)

# Assert that all results are consistent with the first model ("truth")
for m in range(1, 7):
    if m in [5]: # The sparse case
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
    else:
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))

# Time them (run in IPython, where the %timeit magic is available):
print("Method 1")
%timeit calc_sim(A, method=1)
print("Method 2")
%timeit calc_sim(A, method=2)
print("Method 3")
%timeit calc_sim(A, method=3)
print("Method 4")
%timeit calc_sim(A, method=4)
print("Method 5")
%timeit calc_sim(Asp, method=5)
print("Method 6")
%timeit calc_sim(A, method=6)

Results:


Input data shape: (100, 10000)
Method 1
10 loops, best of 3: 71.3 ms per loop
Method 2
100 loops, best of 3: 8.2 ms per loop
Method 3
100 loops, best of 3: 8.6 ms per loop
Method 4
100 loops, best of 3: 2.54 ms per loop
Method 5
10 loops, best of 3: 73.7 ms per loop
Method 6
10 loops, best of 3: 77.3 ms per loop

Answered by tianshi miao

I have tried some of the methods above. However, the experiment by @zbinsd has its limitations: the sparsity of the matrix used in the experiment is extremely low, while real-world sparsity is usually over 90%. In my case the sparse matrix has shape (7000, 25000) and 97% sparsity. Method 4 was intolerably slow, and method 6 finished in 10 s, but amazingly the method below finished in only 0.247 s.


import sklearn.preprocessing as pp

def cosine_similarities(mat):
    # L2-normalize each column; the Gram matrix of the normalized columns
    # is then the column-by-column cosine similarity (and stays sparse)
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat
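
A usage sketch (my illustration; the input can be any scipy.sparse matrix, and the result is a sparse matrix of column-by-column similarities):

from scipy import sparse

mat = sparse.random(7000, 25000, density=0.03, format='csr')
sims = cosine_similarities(mat)   # sparse (25000, 25000) similarity matrix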


Answered by David

Building off of Vaali's solution:


import numpy as np
from scipy.sparse import csr_matrix

def sparse_cosine_similarity(sparse_matrix):
    # work on a CSR copy so the caller's matrix is left untouched
    out = (sparse_matrix.copy() if type(sparse_matrix) is csr_matrix else
           sparse_matrix.tocsr())
    squared = out.multiply(out)
    sqrt_sum_squared_rows = np.array(np.sqrt(squared.sum(axis=1)))[:, 0]
    row_indices, col_indices = out.nonzero()
    # L2-normalize each row in place, then multiply by the transpose
    out.data /= sqrt_sum_squared_rows[row_indices]
    return out.dot(out.T)

This takes a sparse matrix (preferably a csr_matrix) and returns a csr_matrix. It should do the more intensive parts using sparse calculations with pretty minimal memory overhead. I haven't tested it extensively though, so caveat emptor. (Update: I feel confident in this solution now that I've tested and benchmarked it.)


Also, here is the sparse version of Waylon's solution in case it helps anyone; I'm not sure which solution is actually better.


def sparse_cosine_similarity_b(sparse_matrix):
    input_csr_matrix = sparse_matrix.tocsr()
    similarity = input_csr_matrix * input_csr_matrix.T
    square_mag = similarity.diagonal()
    inv_square_mag = 1 / square_mag
    inv_square_mag[np.isinf(inv_square_mag)] = 0
    inv_mag = np.sqrt(inv_square_mag)
    return similarity.multiply(inv_mag).T.multiply(inv_mag)

Both solutions seem to have parity with sklearn.metrics.pairwise.cosine_similarity.

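A quick parity-check sketch (my illustration, on the question's example matrix):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

A_sparse = csr_matrix(np.array([[0, 1, 0, 0, 1],
                                [0, 0, 1, 1, 1],
                                [1, 1, 0, 1, 0]], dtype=float))
ref = cosine_similarity(A_sparse)
np.testing.assert_allclose(sparse_cosine_similarity(A_sparse).toarray(), ref)
np.testing.assert_allclose(np.asarray(sparse_cosine_similarity_b(A_sparse).todense()), ref)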

:-D


Update:


Now I have tested both solutions against my existing Cython implementation: https://github.com/davidmashburn/sparse_dot/blob/master/test/benchmarks_v3_output_table.txt and it looks like the first algorithm performs the best of the three most of the time.


Answered by Daniel

from math import sqrt

def norm(vector):
    return sqrt(sum(x * x for x in vector))

def cosine_similarity(vec_a, vec_b):
    norm_a = norm(vec_a)
    norm_b = norm(vec_b)
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    return dot / (norm_a * norm_b)
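
A usage sketch on two rows of the question's example matrix (the value matches the pairwise output shown earlier):

print(cosine_similarity([0, 1, 0, 0, 1], [0, 0, 1, 1, 1]))  # ~0.40824829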

This method seems to be somewhat faster than using sklearn's implementation if you pass in one pair of vectors at a time.


Answered by moshik

I suggest running this in two steps:


1) generate a mapping A that maps each column index to the set of objects (rows) with a non-zero entry in that column


2) for each object i (row) with non-zero occurrences (columns) {k1, ..., kn}, calculate cosine similarity only against the elements in the union A[k1] ∪ A[k2] ∪ ... ∪ A[kn]


Assuming a big sparse matrix with high sparsity, this will gain a significant boost over brute force; a sketch of the idea follows below.

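A minimal sketch of this two-step idea (my reading of the answer, not the author's code; assumes a scipy.sparse CSR input):

from collections import defaultdict
import numpy as np

def candidate_pair_similarities(A_csr):
    # step 1: inverted index, column index -> set of rows with a nonzero entry there
    col_to_rows = defaultdict(set)
    rows, cols = A_csr.nonzero()
    for r, c in zip(rows, cols):
        col_to_rows[c].add(r)

    # row L2 norms
    norms = np.sqrt(np.asarray(A_csr.multiply(A_csr).sum(axis=1)).ravel())

    # step 2: compare each row only against rows sharing at least one column
    sims = {}
    for i in range(A_csr.shape[0]):
        candidates = set()
        for k in A_csr[i].indices:
            candidates |= col_to_rows[k]
        for j in candidates:
            if j > i:  # visit each unordered pair once
                dot = A_csr[i].multiply(A_csr[j]).sum()
                sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims

Row pairs with no column in common are never compared, which is where the savings over brute force come from.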