Python: cosine similarity between two lists of numbers

Question

Asked by Rob Alsod

I need to calculate the cosine similarity between two lists, say list 1, dataSetI, and list 2, dataSetII. I cannot use numpy or a statistics module. I must use common modules (math, etc.), and as few of them as possible, to reduce the time spent.

Let's say dataSetI is [3, 45, 7, 2] and dataSetII is [2, 54, 13, 15]. The lists are always the same length.

Of course, for non-negative values like these the cosine similarity is between 0 and 1, and it will be rounded to the third or fourth decimal place with format(round(cosine, 3)).

Thank you very much in advance for helping.

Answer 1

Answered by pkacprzak

import math
try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy

def dot_product(v1, v2):
    return sum(map(lambda x: x[0] * x[1], izip(v1, v2)))

def cosine_measure(v1, v2):
    prod = dot_product(v1, v2)
    len1 = math.sqrt(dot_product(v1, v1))
    len2 = math.sqrt(dot_product(v2, v2))
    return prod / (len1 * len2)

You can round it after computing:

cosine = format(round(cosine_measure(v1, v2), 3))
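
As a usage sketch with the lists from the question, the snippet below runs the same calculation end to end. It assumes Python 3 (so the built-in zip stands in for izip), and the names data_set_i and data_set_ii are only illustrative.

```python
import math

def dot_product(v1, v2):
    # Sum of element-wise products of two equal-length sequences.
    return sum(x * y for x, y in zip(v1, v2))

def cosine_measure(v1, v2):
    # Dot product divided by the product of the two Euclidean lengths.
    len1 = math.sqrt(dot_product(v1, v1))
    len2 = math.sqrt(dot_product(v2, v2))
    return dot_product(v1, v2) / (len1 * len2)

data_set_i = [3, 45, 7, 2]
data_set_ii = [2, 54, 13, 15]

# round() keeps three decimals; format() turns the float into a string.
cosine = format(round(cosine_measure(data_set_i, data_set_ii), 3))
print(cosine)  # 0.972
```

Note that format() returns a string, so keep the unrounded float around if further arithmetic is needed.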

If you want it really short, you can use this one-liner:

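from math import sqrt
from functools import reduce  # a built-in in Python 2, imported from functools in Python 3

def cosine_measure(v1, v2):
    # One pass accumulates (v1 . v2, |v1|^2, |v2|^2); the outer lambda combines them.
    return (lambda t: t[0] / sqrt(t[1] * t[2]))(reduce(lambda acc, p: (acc[0] + p[0] * p[1], acc[1] + p[0]**2, acc[2] + p[1]**2), zip(v1, v2), (0, 0, 0)))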

Answer 2

Answered by charmoniumQ

You should try SciPy. It has a bunch of useful scientific routines, for example "routines for computing integrals numerically, solving differential equations, optimization, and sparse matrices." It uses the fast, optimized NumPy for its number crunching. See here for installation instructions.

Note that spatial.distance.cosine computes the distance, and not the similarity. So, you must subtract the value from 1 to get the similarity.

from scipy import spatial

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
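
To make the distance/similarity relationship concrete without installing SciPy, here is a minimal pure-Python sketch. The helpers cosine_similarity and cosine_distance are hypothetical names, not SciPy functions; the sketch only assumes the definition distance = 1 - similarity described above.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    # Distance is defined here as one minus the similarity.
    return 1.0 - cosine_similarity(u, v)

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]

distance = cosine_distance(dataSetI, dataSetII)
result = 1 - distance  # back to the similarity
print(round(distance, 3), round(result, 3))
```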

Answer 3

Answered by Mike Housky

I don't suppose performance matters much here, but I can't resist. The zip() function completely recopies both vectors (more of a matrix transpose, actually) just to get the data in "Pythonic" order. It would be interesting to time the nuts-and-bolts implementation:

import math
def cosine_similarity(v1,v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2)/(||v1||*||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)

v1,v2 = [3, 45, 7, 2], [2, 54, 13, 15]
print(v1, v2, cosine_similarity(v1,v2))

Output: [3, 45, 7, 2] [2, 54, 13, 15] 0.972284251712

That goes through the C-like noise of extracting elements one-at-a-time, but does no bulk array copying and gets everything important done in a single for loop, and uses a single square root.
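
The single square root works because the product of the two norms equals the square root of the product of the two sums of squares. As a hand-computed check for the example vectors above (a worked illustration, using the variable names from the function):

```latex
\[
\cos\theta
= \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}
= \frac{\mathrm{sumxy}}{\sqrt{\mathrm{sumxx} \cdot \mathrm{sumyy}}}
= \frac{2557}{\sqrt{2087 \cdot 3314}}
= \frac{2557}{\sqrt{6916318}}
\approx 0.97228
\]
```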

ETA: Updated the print call to be a function. (The original was Python 2.7, not 3.3. The current version runs under Python 2.7 with a from __future__ import print_function statement.) The output is the same either way.

CPython 2.7.3 on a 3.0 GHz Core 2 Duo:

>>> timeit.timeit("cosine_similarity(v1,v2)",setup="from __main__ import cosine_similarity, v1, v2")
2.4261788514654654
>>> timeit.timeit("cosine_measure(v1,v2)",setup="from __main__ import cosine_measure, v1, v2")
8.794677709375264

So, the unpythonic way is about 3.6 times faster in this case.
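
To repeat the comparison on another machine or interpreter, a self-contained timing script along these lines should work. It assumes Python 3, and cosine_zip is only a stand-in for the zip-based approach, not the exact cosine_measure code timed above, so the ratio may come out quite differently.

```python
import math
import timeit

def cosine_loop(v1, v2):
    # Index-based single pass with one square root.
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x, y = v1[i], v2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    return sumxy / math.sqrt(sumxx * sumyy)

def cosine_zip(v1, v2):
    # zip-based variant: three separate passes and two square roots.
    dot = sum(x * y for x, y in zip(v1, v2))
    len1 = math.sqrt(sum(x * x for x in v1))
    len2 = math.sqrt(sum(y * y for y in v2))
    return dot / (len1 * len2)

v1, v2 = [3, 45, 7, 2], [2, 54, 13, 15]

# Fewer repetitions than timeit's default of 1,000,000 to keep the run short.
for func in (cosine_loop, cosine_zip):
    seconds = timeit.timeit(lambda: func(v1, v2), number=100000)
    print(func.__name__, seconds)
```

With vectors of only four elements, call overhead dominates; timing longer lists gives a better picture of how each version scales.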

Answer 4

Answered by Akavall

You can use the cosine_similarity function from sklearn.metrics.pairwise (docs):

In [23]: from sklearn.metrics.pairwise import cosine_similarity

In [24]: cosine_similarity([[1, 0, -1]], [[-1,-1, 0]])
Out[24]: array([[-0.5]])

Answer 5

Answered by rajeshwerkushwaha

You can do this in Python using a simple function:

import math

def get_cosine(vec1, vec2):
    # Treat the inputs as equal-length sequences and pair elements by position.
    numerator = sum(x * y for x, y in zip(vec1, vec2))
    sum1 = sum(x ** 2 for x in vec1)
    sum2 = sum(y ** 2 for y in vec2)
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        # A zero-length vector has no direction; report 0.0 instead of dividing by zero.
        return 0.0
    return round(float(numerator) / denominator, 3)

dataSet1 = [3, 45, 7, 2]
dataSet2 = [2, 54, 13, 15]
print(get_cosine(dataSet1, dataSet2))

Answer 6

Answered by McKelvin

I ran a benchmark based on several answers to this question, and the following snippet appears to be the best choice:

import math
import operator

def dot_product2(v1, v2):
    # map with two iterables multiplies the vectors element by element.
    return sum(map(operator.mul, v1, v2))


def vector_cos5(v1, v2):
    prod = dot_product2(v1, v2)
    len1 = math.sqrt(dot_product2(v1, v1))
    len2 = math.sqrt(dot_product2(v2, v2))
    return prod / (len1 * len2)

To my surprise, the implementation based on scipy is not the fastest one. I profiled it and found that cosine in scipy spends a lot of time converting a vector from a Python list to a numpy array.

Answer 7

Answered by Isira

You can use this simple function to calculate the cosine similarity:

import math

def cosine_similarity(a, b):
    return sum([i*j for i,j in zip(a, b)])/(math.sqrt(sum([i*i for i in a]))* math.sqrt(sum([i*i for i in b])))

Answer 8

Answered by sten

Using numpy to compare one list of numbers to multiple lists (a matrix):

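import numpy as np

def cosine_similarity(vector,matrix):
   # vector: 1-D numpy array; matrix: 2-D numpy array with one candidate per row.
   # The trailing [::-1] returns the similarities in reverse row order.
   return ( np.sum(vector*matrix,axis=1) / ( np.sqrt(np.sum(matrix**2,axis=1)) * np.sqrt(np.sum(vector**2)) ) )[::-1]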
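
For readers who, like the asker, cannot use numpy, the same one-against-many comparison can be sketched in pure Python. The function name cosine_similarity_rows and the sample matrix are made up for illustration, and unlike the numpy version with its [::-1], this sketch returns the results in row order.

```python
import math

def cosine_similarity_rows(vector, matrix):
    # Cosine similarity between `vector` and each row of `matrix`, in row order.
    vector_norm = math.sqrt(sum(x * x for x in vector))
    similarities = []
    for row in matrix:
        dot = sum(x * y for x, y in zip(vector, row))
        row_norm = math.sqrt(sum(y * y for y in row))
        similarities.append(dot / (row_norm * vector_norm))
    return similarities

vector = [3, 45, 7, 2]
matrix = [
    [2, 54, 13, 15],   # the second list from the question
    [3, 45, 7, 2],     # identical to `vector`
    [6, 90, 14, 4],    # `vector` scaled by 2, so the direction is the same
]
rounded = [round(s, 3) for s in cosine_similarity_rows(vector, matrix)]
print(rounded)
```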

Answer 9

Answered by dontloo

Another version, based on numpy only:

from numpy import dot
from numpy.linalg import norm

cos_sim = dot(a, b)/(norm(a)*norm(b))

Answer 10

Answered by Mohammed

Without using any imports:

math.sqrt(x)

can be replaced with

x ** .5

Without numpy.dot(), you have to create your own dot function, for example with a generator expression:

def dot(A,B): 
    return (sum(a*b for a,b in zip(A,B)))

and then it's just a simple matter of applying the cosine similarity formula:

def cosine_similarity(a,b):
    return dot(a,b) / ( (dot(a,a) **.5) * (dot(b,b) ** .5) )
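
One thing to watch with this import-free version is that an all-zero vector makes the denominator zero. A possible guard is sketched below; cosine_similarity_safe is a hypothetical name, and returning 0.0 for a zero vector is a convention assumed here, not something the formula itself dictates.

```python
def dot(A, B):
    # Sum of element-wise products.
    return sum(a * b for a, b in zip(A, B))

def cosine_similarity_safe(a, b):
    # Hypothetical variant: return 0.0 when either vector has zero length
    # instead of raising ZeroDivisionError.
    denominator = (dot(a, a) ** .5) * (dot(b, b) ** .5)
    if denominator == 0:
        return 0.0
    return dot(a, b) / denominator

similar = cosine_similarity_safe([3, 45, 7, 2], [2, 54, 13, 15])
degenerate = cosine_similarity_safe([0, 0, 0, 0], [2, 54, 13, 15])
print(format(round(similar, 3)), degenerate)
```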
