Python: Semantic similarity score for Strings
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/17022691/
Asked by user8472
Are there any libraries for computing semantic similarity scores for a pair of sentences?
I'm aware of WordNet's semantic database, and how I can generate the score for 2 words, but I'm looking for libraries that do all pre-processing tasks like Porter stemming, stop-word removal, etc., on whole sentences and output a score for how related the two sentences are.
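For reference, a minimal sketch of such a pre-processing stage using NLTK's Porter stemmer (the stop-word set below is a tiny hard-coded stand-in rather than NLTK's full list, so the snippet runs without downloading any corpora; `preprocess` is a name invented for this sketch):

```python
from nltk.stem import PorterStemmer

# Small stand-in stop-word list; NLTK ships a full one via
# nltk.corpus.stopwords after nltk.download('stopwords').
STOPWORDS = {"a", "an", "as", "at", "the", "is", "are", "of", "on", "for"}

def preprocess(sentence):
    """Lowercase, drop stop words, and Porter-stem the remaining tokens."""
    stemmer = PorterStemmer()
    tokens = [t for t in sentence.lower().split() if t not in STOPWORDS]
    return [stemmer.stem(t) for t in tokens]
```

A library of the kind asked about would run a pipeline like this on both sentences before scoring them.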
I found a work in progress written using the .NET framework that computes the score using an array of pre-processing steps. Is there any project that does this in Python?
I'm not looking for the sequence of operations that would help me find the score (as is asked for here).
I'd love to implement each stage on my own, or glue together functions from different libraries so that it works for sentence pairs, but I need this mostly as a tool to test inferences on data.
EDIT: I was considering using NLTK and computing the score for every pair of words iterated over the two sentences, and then drawing inferences from the standard deviation of the results, but I don't know if that's a legitimate estimate of similarity. Plus, that would take a LOT of time for long strings.
Again, I'm looking for projects/libraries that already implement this intelligently. Something that lets me do this:
>>> import amazing_semsim_package
>>> str1 = 'Birthday party ruined as cake explodes'
>>> str2 = 'Grandma mistakenly bakes cake using gunpowder'
>>> similarity(str1, str2)
0.889
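The brute-force idea from the EDIT — score every cross-sentence word pair with WordNet and aggregate — can be sketched with NLTK. This is a rough, unvetted heuristic rather than a recommended metric; `naive_similarity` and `aggregate` are names invented here, and the WordNet corpus must be downloaded first (`nltk.download('wordnet')`) before `naive_similarity` will run:

```python
from itertools import product

def aggregate(scores):
    """Collapse per-pair scores into one number (here: a plain average)."""
    return sum(scores) / len(scores) if scores else 0.0

def naive_similarity(s1, s2):
    """Average the best WordNet path_similarity over all cross-sentence
    word pairs. Brute force: cost grows with the product of the
    sentence lengths, as the EDIT anticipates."""
    # Lazy import so the aggregation helper works without NLTK data.
    from nltk.corpus import wordnet as wn
    scores = []
    for w1, w2 in product(s1.lower().split(), s2.lower().split()):
        syns1, syns2 = wn.synsets(w1), wn.synsets(w2)
        if syns1 and syns2:
            # path_similarity can return None; treat that as 0.
            scores.append(max((a.path_similarity(b) or 0.0)
                              for a, b in product(syns1, syns2)))
    return aggregate(scores)
```

Whether averaging (or a standard deviation, as the EDIT suggests) is a legitimate similarity estimate is exactly the open question; the dedicated libraries asked about weight and align word pairs far more carefully.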
Accepted answer by Justin Muller
The best package I've seen for this is Gensim, found at the Gensim Homepage. I've used it many times, and overall been very happy with its ease of use; it is written in Python, and has an easy-to-follow tutorial to get you started, which compares 9 strings. It can be installed via pip, so you won't have a lot of hassle getting it installed, I hope.
Which scoring algorithm you use depends heavily on the context of your problem, but I'd suggest starting off with the LSI functionality if you want something basic. (That's what the tutorial walks you through.)
If you go through the tutorial for gensim, it will walk you through comparing two strings using the Similarities function. This will allow you to see how your strings compare to each other, or to some other string, on the basis of the text they contain.
If you're interested in the science behind how it works, check out this paper.
Answered by pypat
AFAIK the most powerful NLP library for Python is http://nltk.org/
Answered by Damir Olejar
Unfortunately, I cannot help you with PY, but you may take a look at my old project that uses dictionaries to accomplish the semantic comparisons between the sentences (which can later be coded in PY implementing the vector-space analysis). It should be just a few hrs of coding to translate from JAVA to PY. https://sourceforge.net/projects/semantics/
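The vector-space comparison this answer alludes to can be sketched in pure Python as a bag-of-words cosine (this ignores the dictionary-based semantics of the linked project; `cosine_bow` is a name made up for this sketch):

```python
from collections import Counter
from math import sqrt

def cosine_bow(s1, s2):
    """Cosine similarity between bag-of-words count vectors:
    dot product of the word counts divided by the vector norms."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    n1 = sqrt(sum(c * c for c in v1.values()))
    n2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

This only captures word overlap, not meaning; a dictionary or thesaurus layer (as in the linked project) is what lifts it toward semantic comparison.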

