Java text similarity algorithms

Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/5794103/


Text similarity Algorithms

Tags: java, algorithm, text, similarity

Asked by N00programmer

I'm doing a Java project in which I have to build a text similarity program. It should take two text documents, compare them with each other, and report how similar they are.

Later I'll add an existing database that can find synonyms for words, so the program can go through the text and check whether one of the document authors simply swapped words for synonyms while the text is otherwise exactly the same. The same goes for moving paragraphs up or down. Yes, it's essentially a plagiarism detector...

I'd like to hear what kinds of algorithms you would recommend.

I've found Levenshtein and cosine similarity by looking here and elsewhere; both seem to be mentioned a lot. Hamming distance is another one my teacher told me about.

I have some questions about these, since the Wikipedia articles aren't really clicking for me. Could someone explain them to me?

Levenshtein: this algorithm counts the substitutions, insertions, and deletions needed to turn one word into another, measuring how close the two words are. But how can that be used on a whole text file? I can see how it works on a single word, but not on a sentence or an entire text document compared against another.

Cosine: it measures the similarity between two vectors as the cosine of the angle between them. What I don't understand here is how two texts become two vectors, and what happens to the words/sentences in them.

Hamming: this distance seems to work better than Levenshtein, but it only applies to strings of equal length. How can it matter when two documents, and even the sentences in them, aren't strings of equal length?

Wikipedia should make sense, but it doesn't to me. I'm sorry if the questions sound too stupid, but it's getting me down, and I think there are people here quite capable of explaining it so that even newcomers to this field can get it.

Thanks for your time.

Accepted answer by Jerry Coffin

Levenshtein: in theory you could use it for a whole text file, but it's really not very suitable for the task. It's really intended for single words or (at most) a short phrase.
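For reference, the distance itself comes from a classic dynamic-programming recurrence. The sketch below is purely illustrative (not part of the original answer): it builds a table where cell (i, j) holds the edit distance between the first i characters of one string and the first j of the other.

```java
// Minimal sketch of Levenshtein edit distance: the minimum number of
// single-character insertions, deletions, and substitutions needed to
// turn one string into the other.
public class Levenshtein {
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // deletion
                        d[i][j - 1] + 1),         // insertion
                        d[i - 1][j - 1] + cost);  // substitution (or match)
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```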

Cosine: You start by simply counting the unique words in each document. The answers to a previous question cover the computation once you've done that.
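To make "two texts become two vectors" concrete: each document becomes a map from word to frequency, and the vector dimensions are the distinct words appearing in either document. A minimal illustrative sketch (not from the original answer):

```java
import java.util.HashMap;
import java.util.Map;

// Cosine similarity over word-count vectors: dot product of the two
// frequency maps divided by the product of their Euclidean norms.
public class CosineSimilarity {
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) freq.merge(w, 1, Integer::sum);
        }
        return freq;
    }

    static double cosine(String a, String b) {
        Map<String, Integer> fa = termFreq(a), fb = termFreq(b);
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : fa.entrySet()) {
            dot += e.getValue() * fb.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : fb.values()) nb += v * v;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        System.out.println(cosine("the cat sat", "the cat sat")); // 1.0
        System.out.println(cosine("the cat sat", "a dog ran"));   // 0.0
    }
}
```

Identical documents score 1.0; documents with no words in common score 0.0.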

I've never used Hamming distance for this purpose, so I can't say much about it.

I would add TF-IDF (term frequency * inverse document frequency) to the list. It's fairly similar to cosine distance, but 1) it tends to do a better job on shorter documents, and 2) it does a better job of taking into account which words are extremely common in an entire corpus, rather than just the ones that happen to be common to two particular documents.
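A rough sketch of the weighting over a toy corpus. This uses one common variant of the formula, `tf * log(N / df)`; the class and method names are made up for illustration:

```java
import java.util.*;

// TF-IDF weights for the words of one document relative to a corpus:
// tf is the term's frequency within the document, and
// idf = log(N / number-of-documents-containing-the-term) down-weights
// words that are common across the whole corpus.
public class TfIdf {
    public static Map<String, Double> weights(String doc, List<String> corpus) {
        Map<String, Integer> tf = new HashMap<>();
        for (String w : doc.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) tf.merge(w, 1, Integer::sum);
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            long df = corpus.stream()
                    .filter(d -> Arrays.asList(d.toLowerCase().split("\\W+"))
                                       .contains(e.getKey()))
                    .count();
            double idf = Math.log((double) corpus.size() / df);
            result.put(e.getKey(), e.getValue() * idf);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "the cat sat on the mat",
                "the dog sat on the log",
                "cats and dogs");
        // "the" appears in two of three documents, so it gets a low weight;
        // "mat" appears in only one, so it gets a higher weight.
        System.out.println(weights(corpus.get(0), corpus));
    }
}
```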

One final note: for any of these to produce useful results, you nearly always need to screen out stop words before you try to compute the degree of similarity (though TF-IDF seems to do better than the others if you skip this). At least in my experience, it's extremely helpful to stem the words (remove suffixes) as well. When I've done it, I used Porter's stemming algorithm.
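The stop-word screening step might look like the sketch below. The stop list here is a tiny illustrative sample, not a real one; a production list (and a Porter stemmer) would come from a library or data file:

```java
import java.util.*;
import java.util.stream.Collectors;

// Preprocessing sketch: lowercase, tokenize on non-word characters,
// and drop stop words before any similarity computation.
public class Preprocess {
    static final Set<String> STOP_WORDS = Set.of(
            "a", "an", "and", "the", "is", "of", "to", "in", "it");

    public static List<String> tokens(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokens("The cat is in the hat")); // [cat, hat]
    }
}
```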

For your purposes, you probably want to use what I've dubbed an inverted thesaurus, which lets you look up a word and substitute a single canonical word for its meaning. I tried this on one project and didn't find it as useful as expected, but it sounds like it would probably be considerably more useful for your project.
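The "inverted thesaurus" idea can be sketched as a lookup table mapping every word to one canonical word for its meaning, so synonym swaps collapse to the same token before comparison. The entries below are made up for illustration; a real table would be built from a thesaurus database:

```java
import java.util.*;

// Normalize text by replacing each word with a canonical representative
// of its meaning, so "huge rapid" and "large quick" compare as equal.
public class CanonicalWords {
    static final Map<String, String> CANONICAL = Map.of(
            "large", "big",
            "huge", "big",
            "quick", "fast",
            "rapid", "fast");

    public static String normalize(String text) {
        StringBuilder sb = new StringBuilder();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) continue;
            if (sb.length() > 0) sb.append(' ');
            sb.append(CANONICAL.getOrDefault(w, w));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Both sentences normalize to the same string: "a big fast dog"
        System.out.println(normalize("a large quick dog"));
        System.out.println(normalize("a huge rapid dog"));
    }
}
```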

Answered by Summer_More_More_Tea

The basic idea of comparing the similarity of two documents, a topic in information retrieval, is to extract a fingerprint from each document and judge whether they share information based on those fingerprints.

Just a hint: the paper Winnowing: Local Algorithms for Document Fingerprinting may be a good choice and a good starting point for your problem.
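The building block of that paper is k-gram fingerprinting, which can be sketched as below. This is a simplification for illustration: real winnowing additionally keeps only a selected minimum hash per sliding window to shrink the fingerprint, which this sketch skips.

```java
import java.util.*;

// Fingerprint a document by hashing every k-character window of its
// normalized text, then compare documents by fingerprint overlap.
public class KGramFingerprint {
    public static Set<Integer> fingerprint(String text, int k) {
        Set<Integer> hashes = new HashSet<>();
        String t = text.toLowerCase().replaceAll("\\W+", "");
        for (int i = 0; i + k <= t.length(); i++) {
            hashes.add(t.substring(i, i + k).hashCode());
        }
        return hashes;
    }

    // Fraction of the smaller fingerprint that also appears in the other.
    public static double overlap(Set<Integer> a, Set<Integer> b) {
        Set<Integer> shared = new HashSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / Math.min(a.size(), b.size());
    }

    public static void main(String[] args) {
        Set<Integer> f1 = fingerprint("the quick brown fox jumps", 5);
        Set<Integer> f2 = fingerprint("the quick brown dog jumps", 5);
        System.out.println(overlap(f1, f2)); // high: most 5-grams are shared
    }
}
```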

Answered by corsiKa

Consider the example on Wikipedia for Levenshtein distance:

For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

   1. kitten → sitten (substitution of 's' for 'k')
   2. sitten → sittin (substitution of 'i' for 'e')
   3. sittin → sitting (insertion of 'g' at the end).

Now, replace "kitten" with "text from first paper", and "sitting" with "text from second paper".

// Compare every pair of papers. Paper, getPapers(), and
// compareSimilarities() are assumed to exist elsewhere in your project.
Paper[] papers = getPapers();
for (int i = 0; i < papers.length - 1; i++) {
    for (int j = i + 1; j < papers.length; j++) {
        Paper first = papers[i];
        Paper second = papers[j];
        int dist = compareSimilarities(first.text, second.text);
        System.out.println(first.name + "'s paper compares to " + second.name
                + "'s paper with a similarity score of " + dist);
    }
}

Compare those results and peg the kids with the lowest distance scores.

In your compareSimilarities method, you could use any or all of the comparison algorithms. Another one you could incorporate into the formula is "longest common substring" (which is a good method of finding plagiarism).
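Longest common substring also has a standard dynamic-programming solution; a long shared run of characters between two papers is a strong plagiarism signal. A minimal illustrative sketch:

```java
// Find the longest contiguous substring shared by two strings.
// len[i][j] is the length of the common suffix of a[0..i) and b[0..j).
public class LongestCommonSubstring {
    public static String find(String a, String b) {
        int[][] len = new int[a.length() + 1][b.length() + 1];
        int best = 0, endA = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > best) {
                        best = len[i][j];
                        endA = i;
                    }
                }
            }
        }
        return a.substring(endA - best, endA);
    }

    public static void main(String[] args) {
        System.out.println(find("the quick brown fox", "a quick brown dog"));
        // prints " quick brown " (with surrounding spaces)
    }
}
```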