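Java: semantic similarity between sentences
Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/2037832/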
semantic similarity between sentences
Asked by salma
I am doing a project. I need an open-source tool or technique to find the semantic similarity between sentences, where I give two sentences as input and get a score (i.e., the semantic similarity) as output. Does anyone know of one? I hope to get a reply soon. Thank you all.
Answered by ferdystschenko
Salma, I'm afraid this is not the right forum for your question, as it's not directly related to programming. I recommend that you ask your question again on the Corpora list. You may also want to search their archives first.
Apart from that, your question is not precise enough, and I'll explain what I mean by that. I assume that your project is about computing the semantic similarity between sentences, and not about something else in which semantic similarity is just one component among many. If this is the case, then there are a few things to consider. First of all, it is not clear, from the perspective of either computational linguistics or theoretical linguistics, what the term 'semantic similarity' means exactly. There are numerous different views and definitions of it, depending on the type of problem to be solved, the tools and techniques at hand, the background of the person approaching the task, and so on. Consider these examples:
1. Pete and Rob have found a dog near the station.
2. Pete and Rob have never found a dog near the station.
3. Pete and Rob both like programming a lot.
4. Patricia found a dog near the station.
5. It was a dog who found Pete and Rob under the snow.
Which of the sentences 2-5 are similar to 1? Sentence 2 is the exact opposite of 1, yet it is still about Pete and Rob (not) finding a dog. 3 is about Pete and Rob, but in a completely different context. 4 is about finding a dog near the station, although the finder is someone else. 5 is about Pete, Rob, a dog, and a 'finding' event, but in a different way than in 1. As for me, I would not be able to rank these examples according to their similarity, even without having to write a computer program.
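To make the ranking problem concrete, here is a minimal sketch that scores the example sentences against sentence 1 using plain word overlap (the Jaccard index over lowercased word sets). This is not a method proposed in the answer; the class name TokenOverlap and its helper methods are made up for illustration. With this measure, sentence 2 shares 10 of the 11 distinct words and therefore gets the highest score, even though it says the opposite of sentence 1.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class: purely lexical overlap, no real semantics.
class TokenOverlap {

    // Lowercase the sentence and split it into a set of distinct word tokens.
    static Set<String> tokens(String sentence) {
        String[] parts = sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        Set<String> result = new HashSet<>(Arrays.asList(parts));
        result.remove(""); // split yields an empty first token if the text starts with punctuation
        return result;
    }

    // Jaccard index: size of the intersection divided by size of the union.
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        if (union.isEmpty()) {
            return 0.0; // two empty sentences: define the score as 0
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String first = "Pete and Rob have found a dog near the station.";
        String[] others = {
            "Pete and Rob have never found a dog near the station.",
            "Pete and Rob both like programming a lot.",
            "Patricia found a dog near the station.",
            "It was a dog who found Pete and Rob under the snow."
        };
        // Print the overlap of each sentence with the first one.
        for (String other : others) {
            System.out.printf(Locale.ROOT, "%.3f  %s%n", jaccard(first, other), other);
        }
    }
}
```

A score like this only measures shared vocabulary, which is exactly why it cannot tell a statement from its negation.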
In order to compute semantic similarity, you first need to decide what you want to be treated as 'semantically similar' and what not. To compute semantic similarity at the sentence level, you would ideally compare some kind of meaning representation of the sentences. Meaning representations normally come as logical formulas and are extremely complex to generate. However, there are tools which attempt to do this, e.g. Boxer.
As a simplistic but often practical approach, you would define semantic similarity as the sum of the similarities between the words in one sentence and the other. This makes the problem a lot easier, although there are still some difficult issues to be addressed, since the semantic similarity of words is just as badly defined as that of sentences. If you want to get an impression of this, take a look at the book 'Lexical Semantics' by D.A. Cruse (1986). However, there are quite a number of tools and techniques to compute the semantic similarity between words. Some of them define it basically as the negative distance between two words in a taxonomy like WordNet or the Wikipedia taxonomy (see this paper, which describes an API for this). Others compute semantic similarity by using statistical measures calculated over large text corpora. They are based on the insight that similar words occur in similar contexts. A third approach to calculating semantic similarity between sentences or words is concerned with vector space models, which you may know from information retrieval. To get an overview of these latter techniques, take a look at chapter 8.5 in the book 'Foundations of Statistical Natural Language Processing' by Manning and Schütze.
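As a rough illustration of the vector space idea, the sketch below turns each sentence into a term-frequency vector (word to count) and compares two vectors by the cosine of the angle between them. It assumes raw word counts with no stemming, stop-word removal, or corpus weighting such as tf-idf, so it captures shared vocabulary rather than meaning. The class name CosineSimilarity and its methods are invented for this example and are not taken from any of the tools or books mentioned above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class: bag-of-words vectors compared by cosine.
class CosineSimilarity {

    // Build a term-frequency vector: each lowercased word maps to its count.
    static Map<String, Integer> termFrequencies(String sentence) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine similarity: dot product divided by the product of the lengths.
    static double cosine(String a, String b) {
        Map<String, Integer> va = termFrequencies(a);
        Map<String, Integer> vb = termFrequencies(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : va.entrySet()) {
            dot += entry.getValue() * vb.getOrDefault(entry.getKey(), 0);
        }
        double denominator = norm(va) * norm(vb);
        return denominator == 0.0 ? 0.0 : dot / denominator; // empty sentence: score 0
    }

    public static void main(String[] args) {
        String a = "Pete and Rob have found a dog near the station.";
        String b = "Patricia found a dog near the station.";
        System.out.printf(Locale.ROOT, "Cosine similarity: %.3f%n", cosine(a, b));
    }
}
```

To move from word overlap toward semantics, the counts could be replaced by weights derived from a large corpus, or the exact word match could be replaced by a word-level similarity taken from a taxonomy, along the lines of the other two approaches described above.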
Hope this gets you on your feet for now.
Answered by Damir Olejar
I have developed a simple open-source tool that does the semantic comparison according to categories: https://sourceforge.net/projects/semantics/files/
It works with sentences of any length, is simple, stable, fast, small in size...
Here is a sample output:
Similarity between the sentences
-Pete and Rob have found a dog near the station.
-Pete and Rob have never found a dog near the station.
is: 1.0000000000
Similarity between the sentences
-Patricia found a dog near the station.
-It was a dog who found Pete and Rob under the snow.
is: 0.7363210405107239
Similarity between the sentences
-Patricia found a dog near the station.
-I am fine, thanks!
is: 0.0
Similarity between the sentences
-Hello there, how are you?
-I am fine, thanks!
is: 0.29160592175990213
USAGE:
import semantics.Compare;

public class USAGE {
    public static void main(String[] args) {
        // The two sentences to compare.
        String a = "This is a first sentence.";
        String b = "This is a second one.";
        // Compare is the class imported from the tool's semantics package.
        Compare c = new Compare(a, b);
        System.out.println("Similarity between the sentences\n-" + a + "\n-" + b + "\n is: " + c.getResult());
    }
}

