Note: this content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/101569/

Algorithm to detect similar documents in python script

Tags: python, algorithm, diff

Asked by user17451

I need to write a module to detect similar documents. I have read many papers on document fingerprinting and related techniques, but I do not know how to write code or implement such a solution. The algorithm should work for Chinese, Japanese, English and German, or be language independent. How can I accomplish this?

Answer by e-satis

Bayesian filters have exactly this purpose. That's the technology you'll find in most tools that identify spam.

For example, to detect a language (from http://sebsauvage.net/python/snyppets/#bayesian):

from reverend.thomas import Bayes
guesser = Bayes()
guesser.train('french','La souris est rentrée dans son trou.')
guesser.train('english','my tailor is rich.')
guesser.train('french','Je ne sais pas si je viendrai demain.')
guesser.train('english','I do not plan to update my website soon.')

>>> print guesser.guess('Jumping out of cliffs it not a good idea.')
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]

>>> print guesser.guess('Demain il fera très probablement chaud.')
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]

But it works to detect any type of text you train it for: technical text, songs, jokes, etc., as long as you can provide enough material to let the tool learn what your documents look like.

Answer by Jeremiah Peschka

If these are pure text documents, or you have a method to extract the text from the documents, you can use a technique called shingling.

You first compute a unique hash for each document. If these are the same, you are done.

If not, you break each document down into smaller chunks. These are your 'shingles.'

Once you have the shingles, you can then compute identity hashes for each shingle and compare the hashes of the shingles to determine if the documents are actually the same.

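A minimal sketch of that shingling approach, assuming whitespace tokenization and a shingle size of four words (both choices are illustrative, not from the answer):

import hashlib

def shingles(text, size=4):
    """Break a document into overlapping shingles of `size` consecutive words."""
    words = text.split()
    return {' '.join(words[i:i + size]) for i in range(max(1, len(words) - size + 1))}

def shingle_hashes(text, size=4):
    """Hash each shingle so comparisons work on short, fixed-length digests."""
    return {hashlib.sha1(s.encode('utf-8')).hexdigest() for s in shingles(text, size)}

def resemblance(doc1, doc2, size=4):
    """Fraction of shingle hashes the two documents share."""
    a, b = shingle_hashes(doc1, size), shingle_hashes(doc2, size)
    return len(a & b) / len(a | b) if (a | b) else 1.0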

The other technique you can use is to generate n-grams of the entire documents, count the number of similar n-grams in each document, and produce a weighted score for each document. Basically, an n-gram splits a word into smaller chunks: 'apple' would become ' a', ' ap', 'app', 'ppl', 'ple', 'le ' (technically these are 3-grams). This approach can become quite computationally expensive over a large number of documents or over two very large documents. Of course, common n-grams like 'the', ' th', 'th ', etc. need to be weighted to score them lower.

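A rough sketch of the character n-gram comparison; the space padding and the inverse-frequency weighting below are just one simple way to illustrate the idea, not a scheme taken from the answer:

from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams, padding with a space on each side."""
    padded = ' ' + text + ' '
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def ngram_score(doc1, doc2, n=3):
    """Weighted overlap of n-gram counts; n-grams that occur often contribute less."""
    a, b = char_ngrams(doc1, n), char_ngrams(doc2, n)
    total = a + b
    # Each shared n-gram contributes min(count) scaled by 1/total, so very common
    # n-grams such as 'the' or ' th' are weighted lower.
    shared = sum(min(a[g], b[g]) / total[g] for g in set(a) & set(b))
    return 2.0 * shared / len(total) if total else 0.0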

I've posted about this on my blog, and there are some links in the post to a few other articles on the subject: Shingling - it's not just for roofers.

Best of luck!

Answer by nosklo

You can use, or at least study, difflib from Python's stdlib to write your code.

It is very flexible, and has algorithms to find differences between lists of strings and to point out those differences. Then you can use get_close_matches() to find similar words:

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

It is not the solution but maybe it is a start.

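For rough whole-document similarity rather than close word matches, difflib's SequenceMatcher gives a ratio between 0 and 1; a small sketch (the example strings are only illustrative):

from difflib import SequenceMatcher

def text_similarity(doc1, doc2):
    """Similarity ratio in [0, 1]; 1.0 means the character sequences are identical."""
    return SequenceMatcher(None, doc1, doc2).ratio()

print(text_similarity("the quick brown fox", "the quick brown dog"))  # roughly 0.89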

Answer by nosklo

Similarity can be found easily without classification. Try this; it is O(n²) over all document pairs, but it works fine.

def jaccard_similarity(doc1, doc2):
    # Jaccard index of the two word sets: |A ∩ B| / |A ∪ B|.
    a = set(doc1.split())
    b = set(doc2.split())
    # Result is in [0, 1]; 1.0 means the documents use exactly the same words.
    return len(a & b) / len(a | b)
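
Used pairwise over a collection of documents (which is where the O(n²) comes from), it might look like the sketch below; the file names and contents are made up for illustration:

docs = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the cat sat on a mat",
    "c.txt": "completely unrelated text",
}

for name1 in docs:
    for name2 in docs:
        if name1 < name2:  # visit each unordered pair exactly once
            print(name1, name2, jaccard_similarity(docs[name1], docs[name2]))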

Answer by SquareCog

You need to make your question more concrete. If you've already read the fingerprinting papers, you already know the principles at work, so describing common approaches here would not be beneficial. If you haven't, you should also check out papers on "duplicate detection" and various web spam detection related papers that have come out of Stanford, Google, Yahoo, and MS in recent years.

Are you having specific problems with coding the described algorithms?

Trouble getting started?

The first thing I'd probably do is separate the tokenization (the process of extracting "words" or other sensible sequences) from the duplicate detection logic, so that it is easy to plug in different parsers for different languages and keep the duplicate detection piece the same.

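A minimal sketch of that separation; the tokenizers below are deliberately naive placeholders (real Chinese/Japanese segmentation would need a proper library), and the 0.8 threshold is an arbitrary choice:

def tokenize_whitespace(text):
    """Good enough for space-delimited languages such as English or German."""
    return text.lower().split()

def tokenize_chars(text):
    """Crude stand-in for Chinese/Japanese: treat every character as a token."""
    return [ch for ch in text if not ch.isspace()]

def looks_like_duplicate(doc1, doc2, tokenize, threshold=0.8):
    """Duplicate-detection logic stays the same no matter which tokenizer is plugged in."""
    a, b = set(tokenize(doc1)), set(tokenize(doc2))
    if not (a and b):
        return False
    return len(a & b) / len(a | b) >= threshold

# e.g. looks_like_duplicate(text1, text2, tokenize_whitespace)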

Answer by Ants Aasma

There is a rather good talk on neural networks on Google TechTalks that discusses using layered Boltzmann machines to generate feature vectors for documents, which can then be used to measure document distance. The main issue is the requirement to have a large sample document set to train the network to discover relevant features.

Answer by Ants Aasma

If you're prepared to index the files that you want to search amongst, Xapian is an excellent engine, and provides Python bindings:

http://xapian.org/

http://xapian.org/docs/bindings/python/

Answer by Jon Mills

I think Jeremy has hit the nail on the head - if you just want to detect if files are different, a hash algorithm like MD5 or SHA1 is a good way to go.

Linus Torvalds' Git source control software uses SHA1 hashing in just this way - to check when files have been modified.

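A minimal sketch of that exact-duplicate check using Python's hashlib (the chunk size is arbitrary):

import hashlib

def file_digest(path):
    """SHA-1 of the file contents; identical digests mean identical bytes."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

def files_identical(path1, path2):
    return file_digest(path1) == file_digest(path2)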

Answer by Jiayao Yu

If you are trying to detect documents that talk about the same topic, you could try collecting the most frequently used words and throwing away the stop words. Documents with a similar distribution of the most frequently used words are probably talking about similar things. You may need to do some stemming and extend the concept to n-grams if you want higher accuracy. For more advanced techniques, look into machine learning.

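A rough sketch of that idea; the stop-word list here is a tiny, English-only placeholder, and cosine similarity over the top-k word counts is just one reasonable choice:

from collections import Counter
import math

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}  # placeholder list

def top_word_counts(text, k=50):
    """Counts of the k most frequent non-stop-words in the document."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return dict(Counter(words).most_common(k))

def topic_similarity(doc1, doc2, k=50):
    """Cosine similarity between the documents' most-frequent-word distributions."""
    a, b = top_word_counts(doc1, k), top_word_counts(doc2, k)
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0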

Answer by user225145

You might want to look into the DustBuster algorithm as outlined in this paper.

According to the paper, they're able to detect duplicate pages without even examining the page contents. Of course, examining the contents increases the efficacy, but raw server logs are enough for the method to detect duplicate pages.

Similar to the recommendation of using MD5 or SHA1 hashes, the DustBuster method largely relies on comparing file size as its primary signal. As simple as it sounds, it's rather effective for an initial first pass.

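That size-based first pass could look roughly like the sketch below; this is only the cheap pre-filter, not the DustBuster algorithm itself:

import os
from collections import defaultdict

def duplicate_candidates(paths):
    """Group files by size; only files that share a size can possibly be duplicates,
    so anything alone in its group can be skipped before hashing or diffing."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    return [group for group in by_size.values() if len(group) > 1]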