Python: Calculate cosine similarity given 2 sentence strings

Note: this page is a translated mirror of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/15173225/

Date: 2020-08-18 13:31:34 | Source: igfitidea

Calculate cosine similarity given 2 sentence strings

Tags: python, string, nlp, similarity, cosine-similarity

Asked by alvas

From Python: tf-idf-cosine: to find document similarity, it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, are there any ways to calculate cosine similarity between 2 strings?


s1 = "This is a foo bar sentence ."
s2 = "This sentence is similar to a foo bar sentence ."
s3 = "What is this string ? Totally not related to the other two lines ."

cosine_sim(s1, s2) # Should give high cosine similarity
cosine_sim(s1, s3) # Shouldn't give high cosine similarity value
cosine_sim(s2, s3) # Shouldn't give high cosine similarity value

Accepted answer by vpekar

A simple pure-Python implementation would be:


import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)


text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

cosine = get_cosine(vector1, vector2)

print("Cosine:", cosine)

Prints:


Cosine: 0.861640436855

The cosine formula used here is described here.


This does not include weighting of the words by tf-idf, but in order to use tf-idf, you need to have a reasonably large corpus from which to estimate tfidf weights.

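For illustration only, here is a minimal sketch of how idf values could be estimated from a small corpus and folded into the vectors above before calling get_cosine. The helpers compute_idf and weight_vector are invented names for this sketch, not part of the original answer, and the toy corpus stands in for the "reasonably large" one.

import math
from collections import Counter

def compute_idf(count_vectors):
    # Document frequency: in how many documents does each word appear at least once?
    n_docs = len(count_vectors)
    df = Counter()
    for vec in count_vectors:
        df.update(set(vec))
    # Standard idf; words that occur in every document get a weight close to 0
    return {word: math.log(n_docs / df[word]) for word in df}

def weight_vector(vec, idf):
    # Multiply raw counts (tf) by idf; words with no idf estimate keep weight 0
    return {word: count * idf.get(word, 0.0) for word, count in vec.items()}

# Hypothetical usage with the functions defined above:
# corpus = [text_to_vector(t) for t in (text1, text2, "some third document ...")]
# idf = compute_idf(corpus)
# cosine = get_cosine(weight_vector(vector1, idf), weight_vector(vector2, idf))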

You can also develop it further, by using a more sophisticated way to extract words from a piece of text, stem or lemmatise it, etc.

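As a rough sketch of that last point (not the answer's code): the tokenizer could lowercase the text and strip a handful of common English suffixes. The naive_stem helper and its suffix list below are made up for illustration and are nowhere near a real stemmer or lemmatiser.

import re
from collections import Counter

TOKEN = re.compile(r"[a-z]+")
SUFFIXES = ("ing", "es", "ed", "ly", "s")  # crude, illustrative suffix list

def naive_stem(word):
    # Strip one suffix, but only if a reasonably long stem remains
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def text_to_vector_normalised(text):
    words = TOKEN.findall(text.lower())
    return Counter(naive_stem(w) for w in words)

# text_to_vector_normalised("The foxes were running") -> Counter({'the': 1, 'fox': 1, 'were': 1, 'runn': 1})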

Answer by mbatchkarov

The short answer is "no, it is not possible to do that in a principled way that works even remotely well". It is an unsolved problem in natural language processing research and also happens to be the subject of my doctoral work. I'll very briefly summarize where we are and point you to a few publications:


Meaning of words


The most important assumption here is that it is possible to obtain a vector that represents each word in the sentence in question. This vector is usually chosen to capture the contexts the word can appear in. For example, if we only consider the three contexts "eat", "red" and "fluffy", the word "cat" might be represented as [98, 1, 87], because if you were to read a very very long piece of text (a few billion words is not uncommon by today's standard), the word "cat" would appear very often in the context of "fluffy" and "eat", but not that often in the context of "red". In the same way, "dog" might be represented as [87, 2, 34] and "umbrella" might be [1, 13, 0]. Imagining these vectors as points in 3D space, "cat" is clearly closer to "dog" than it is to "umbrella", therefore "cat" also means something more similar to "dog" than to an "umbrella".

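To make those made-up numbers concrete, here is a tiny sketch that compares the three toy context vectors with plain cosine similarity; the counts are the illustrative ones from the paragraph above, not real corpus statistics.

import math

# Toy co-occurrence counts over the contexts ("eat", "red", "fluffy"), as in the text
context_vectors = {
    "cat":      [98, 1, 87],
    "dog":      [87, 2, 34],
    "umbrella": [1, 13, 0],
}

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm

print(cosine(context_vectors["cat"], context_vectors["dog"]))       # roughly 0.94
print(cosine(context_vectors["cat"], context_vectors["umbrella"]))  # roughly 0.065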

This line of work has been investigated since the early 90s (e.g. this work by Grefenstette) and has yielded some surprisingly good results. For example, here are a few random entries in a thesaurus I built recently by having my computer read wikipedia:


theory -> analysis, concept, approach, idea, method
voice -> vocal, tone, sound, melody, singing
james -> william, john, thomas, robert, george, charles

These lists of similar words were obtained entirely without human intervention: you feed text in and come back a few hours later.


The problem with phrases


You might ask why we are not doing the same thing for longer phrases, such as "ginger foxes love fruit". It's because we do not have enough text. In order for us to reliably establish what X is similar to, we need to see many examples of X being used in context. When X is a single word like "voice", this is not too hard. However, as X gets longer, the chances of finding natural occurrences of X shrink exponentially. For comparison, Google has about 1B pages containing the word "fox" and not a single page containing "ginger foxes love fruit", despite the fact that it is a perfectly valid English sentence and we all understand what it means.


Composition


To tackle the problem of data sparsity, we want to perform composition, i.e. to take vectors for words, which are easy to obtain from real text, and to put them together in a way that captures their meaning. The bad news is nobody has been able to do that well so far.


The simplest and most obvious way is to add or multiply the individual word vectors together. This leads to the undesirable side effect that "cats chase dogs" and "dogs chase cats" would mean the same to your system. Also, if you are multiplying, you have to be extra careful or every sentence will end up represented by [0,0,0,...,0], which defeats the point.

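A tiny sketch of that order problem, using invented integer word vectors (the numbers mean nothing; they only show that additive composition produces the same result for both word orders):

def add_compose(words, vectors):
    # Sum the word vectors elementwise (bag-of-words composition)
    dims = len(next(iter(vectors.values())))
    total = [0] * dims
    for w in words:
        total = [a + b for a, b in zip(total, vectors[w])]
    return total

toy = {"cats": [2, 9, 1], "chase": [5, 1, 7], "dogs": [8, 3, 2]}   # invented vectors

print(add_compose(["cats", "chase", "dogs"], toy))  # [15, 13, 10]
print(add_compose(["dogs", "chase", "cats"], toy))  # [15, 13, 10], identical: word order is lost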

Further reading


I will not discuss the more sophisticated methods for composition that have been proposed so far. I suggest you read Katrin Erk's "Vector space models of word meaning and phrase meaning: a survey". This is a very good high-level survey to get you started. Unfortunately, it is not freely available on the publisher's website; email the author directly to get a copy. In that paper you will find references to many more concrete methods. The more comprehensible ones are by Mitchell and Lapata (2008) and Baroni and Zamparelli (2010).




Edit after comment by @vpekar: The bottom line of this answer is to stress the fact that while naive methods do exist (e.g. addition, multiplication, surface similarity, etc.), these are fundamentally flawed and in general one should not expect great performance from them.


Answer by novice_dev

Thanks @vpekar for your implementation. It helped a lot. I just found that it misses the tf-idf weights while calculating the cosine similarity; a sketch of the weighted version is given after the list below. Counter(words) returns a dictionary of the words along with their occurrence counts.


cos(q, d) = sim(q, d) = (q · d) / (|q| |d|) = sum(q_i * d_i) / ( sqrt(sum(q_i^2)) * sqrt(sum(d_i^2)) ), where i = 1 to |V|


  • q_i is the tf-idf weight of term i in the query.
  • d_i is the tf-idf weight of term i in the document.
  • |q| and |d| are the lengths of q and d.
  • This is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
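A minimal sketch of that formula, assuming the tf-idf weights for q and d have already been computed somewhere; weighted_cosine is an invented name and the example weights below are made up purely to show the call:

import math

def weighted_cosine(q, d):
    # q and d map each term to its tf-idf weight
    shared = set(q) & set(d)
    numerator = sum(q[t] * d[t] for t in shared)               # sum_i q_i * d_i
    q_len = math.sqrt(sum(w * w for w in q.values()))          # |q| = sqrt(sum_i q_i^2)
    d_len = math.sqrt(sum(w * w for w in d.values()))          # |d| = sqrt(sum_i d_i^2)
    return numerator / (q_len * d_len) if q_len and d_len else 0.0

# Invented tf-idf weights, just to show the call:
q = {"foo": 1.2, "bar": 1.2, "sentence": 0.3}
d = {"foo": 1.2, "bar": 1.2, "sentence": 0.6, "similar": 2.0}
print(weighted_cosine(q, d))   # a value between 0 and 1; higher means more similar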

Please feel free to view my code here. But first you will have to download the Anaconda package. It will automatically set your Python path in Windows. Add this Python interpreter in Eclipse.


Answer by TheSN

Well, if you are aware of word embeddings like GloVe/Word2Vec/Numberbatch, your job is half done. If not, let me explain how this can be tackled. Convert each sentence into word tokens, and represent each of these tokens as a high-dimensional vector (using pre-trained word embeddings, or you could even train them yourself!). So now you don't just capture their surface similarity, but rather extract the meaning of each word that makes up the sentence as a whole. After this, calculate their cosine similarity and you are set.


Answer by Manideep Karthik

Try this. Download the file 'numberbatch-en-17.06.txt' from https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.06.txt.gz and extract it. The function 'get_sentence_vector' uses a simple sum of word vectors. It can, however, be improved by using a weighted sum, where the weights are proportional to the tf-idf of each word (a weighted variant is sketched at the end of this answer).


import math
import numpy as np

std_embeddings_index = {}
with open('path/to/numberbatch-en-17.06.txt') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        std_embeddings_index[word] = embedding

def cosineValue(v1,v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2)/{||v1||*||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)


def get_sentence_vector(sentence, std_embeddings_index = std_embeddings_index ):
    sent_vector = 0
    for word in sentence.lower().split():
        if word not in std_embeddings_index :
            word_vector = np.array(np.random.uniform(-1.0, 1.0, 300))
            std_embeddings_index[word] = word_vector
        else:
            word_vector = std_embeddings_index[word]
        sent_vector = sent_vector + word_vector

    return sent_vector

def cosine_sim(sent1, sent2):
    return cosineValue(get_sentence_vector(sent1), get_sentence_vector(sent2))

I ran it for the given sentences and got the following results:


s1 = "This is a foo bar sentence ."
s2 = "This sentence is similar to a foo bar sentence ."
s3 = "What is this string ? Totally not related to the other two lines ."

print(cosine_sim(s1, s2)) # Should give high cosine similarity
print(cosine_sim(s1, s3)) # Shouldn't give high cosine similarity value
print(cosine_sim(s2, s3)) # Shouldn't give high cosine similarity value

0.9851735249068168
0.6570885718962608
0.6589335425458225
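For completeness, here is a hedged sketch of the weighted-sum improvement mentioned at the top of this answer. It is not part of the original code, and idf_weights is a hypothetical dict of per-word weights obtained elsewhere (e.g. from tf-idf statistics over some corpus).

import numpy as np

def get_weighted_sentence_vector(sentence, embeddings, idf_weights, dim=300):
    # Sum word vectors scaled by a per-word weight; unknown words keep weight 1.0
    sent_vector = np.zeros(dim)
    for word in sentence.lower().split():
        word_vector = embeddings.get(word)
        if word_vector is None:
            word_vector = np.random.uniform(-1.0, 1.0, dim)   # same random fallback as above
            embeddings[word] = word_vector
        sent_vector = sent_vector + idf_weights.get(word, 1.0) * word_vector
    return sent_vector

# Hypothetical usage:
# idf_weights = {"foo": 2.1, "bar": 2.1, "sentence": 0.4}
# v1 = get_weighted_sentence_vector(s1, std_embeddings_index, idf_weights)
# v2 = get_weighted_sentence_vector(s2, std_embeddings_index, idf_weights)
# print(cosineValue(v1, v2))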

Answer by Shaina Raza

I have a similar solution, but it might be useful for pandas:


import math
import re
from collections import Counter
import pandas as pd

WORD = re.compile(r"\w+")


def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)

df=pd.read_csv('/content/drive/article.csv')
df['vector1']=df['headline'].apply(lambda x: text_to_vector(x)) 
df['vector2']=df['snippet'].apply(lambda x: text_to_vector(x)) 
df['simscore']=df.apply(lambda x: get_cosine(x['vector1'],x['vector2']),axis=1)