Java text similarity algorithm
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/2325588/
Text similarity algorithm
Asked by EugeneP
I have two subtitle files. I need a function that tells whether they represent the same text or similar text.
Sometimes there are comments like "The wind is blowing... the music is playing" in one file only, but 80% of the contents will be the same. In that case the function must return TRUE (the files represent the same text). Sometimes there are also misspellings, such as 1 instead of l (the digit one instead of the letter L), as in: She 1eft the baggage. Of course, the function must return TRUE in that case too.
My comments:
The function should return the percentage of similarity between the texts - AGREE
"all the people were happy" and "all the people were not happy" - here the difference would be treated as a misspelling, so the two would be considered the same text. To be exact, the percentage the function returns will be lower, but still high enough to say the phrases are similar.
Do consider whether you want to apply Levenshtein on a whole file or just a search string - not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It'll be a very long string, though.
Accepted answer by bcosca
Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance
Anything other than a result of zero means the texts are not "identical". "Similar" is a measure of how far apart or how close they are. The result is an integer.
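Since the question asks for a percentage rather than a raw integer, here is a minimal Java sketch of how the Levenshtein distance could be turned into a similarity score. The class name LevenshteinSketch and the normalization (1 minus distance divided by the longer length) are assumptions made for illustration, not something this answer specifies. Note that the running time is proportional to the product of the two lengths, so whole subtitle files may be slow to compare this way.

```java
// Hypothetical helper for illustration; names are not from the original answer.
final class LevenshteinSketch {

    // Classic dynamic-programming edit distance, keeping only two rows in memory.
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(CharSequence a, CharSequence b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        String x = "She left the baggage";
        String y = "She 1eft the baggage";
        System.out.println(distance(x, y));   // prints 1
        System.out.println(similarity(x, y)); // about 0.95
    }
}
```

A caller could then pick a threshold (for example 0.8, matching the "80%" in the question) and return TRUE when similarity is at or above it.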
Answer by Yonatan
For the problem you've described (i.e. comparing large strings), you can use Cosine Similarity, which returns a number between 0 (completely different) and 1 (identical), based on the term frequency vectors.
You might want to look at several implementations that are described here: Cosine Similarity
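To make the idea concrete, here is a rough Java sketch of cosine similarity over term frequency vectors. The class name CosineSketch and the tokenization rule (lowercase, split on anything that is not a letter or digit) are assumptions for illustration; a real implementation might tokenize or weight terms differently.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; names are not from the original answer.
final class CosineSketch {

    // Counts how often each word occurs (assumed tokenization: letters and digits only).
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine of the angle between the two term frequency vectors.
    static double cosine(String a, String b) {
        Map<String, Integer> ta = termFrequencies(a);
        Map<String, Integer> tb = termFrequencies(b);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : ta.entrySet()) {
            int count = e.getValue();
            normA += count * count;
            Integer other = tb.get(e.getKey());
            if (other != null) {
                dot += count * other;
            }
        }
        for (int count : tb.values()) {
            normB += count * count;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // at least one text has no words
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine("all the people were happy",
                                  "all the people were not happy")); // about 0.913
    }
}
```

Because this measure ignores word order and compares whole words, an extra comment line in one file lowers the score only slightly, but a misspelling such as "1eft" counts as a completely different word.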
Answer by Chinmay Kanchi
Have a look at approximate grep. It might give you pointers, though it's almost certain to perform abysmally on large chunks of text like the ones you're talking about.
EDIT: The original version of agrep isn't open source, but you can find links to OSS versions at http://en.wikipedia.org/wiki/Agrep
Answer by soulmerge
You're expecting too much here; it looks like you will have to write a function for your specific needs. I would recommend starting with an existing file comparison application (maybe diff already has everything you need) and improving it to give good results for your input.
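As one possible starting point in that direction, here is a small Java sketch that compares two files line by line using the longest common subsequence, which is the basic idea behind diff-style tools. It is not how any particular diff program is implemented; the class name LineOverlapSketch and the scoring formula (twice the number of shared lines divided by the total number of lines) are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; names are not from the original answer.
final class LineOverlapSketch {

    // Length of the longest common subsequence of lines, using two DP rows.
    static int lcsLength(List<String> a, List<String> b) {
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    curr[j] = prev[j - 1] + 1;
                } else {
                    curr[j] = Math.max(prev[j], curr[j - 1]);
                }
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.size()];
    }

    // Assumed score: share of lines that appear, in order, in both files.
    static double similarity(List<String> a, List<String> b) {
        int total = a.size() + b.size();
        return total == 0 ? 1.0 : 2.0 * lcsLength(a, b) / total;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("Hello.", "She left the baggage.", "Goodbye.");
        List<String> second = Arrays.asList("Hello.", "The wind is blowing...",
                                            "She left the baggage.", "Goodbye.");
        System.out.println(lcsLength(first, second));  // prints 3
        System.out.println(similarity(first, second)); // about 0.857
    }
}
```

Lines are compared with exact equality here, so a misspelled line counts as different; swapping equals for a per-line fuzzy comparison would be one way to adapt it to the input described in the question.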
Answer by FiveO
There are many alternatives to the Levenshtein distance, for example the Jaro-Winkler distance.
The choice of algorithm depends on the language, the type of words, whether the words were entered by a human, and more.
Here you find a helpful implementation of several algorithms within one library.
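For reference, the Jaro-Winkler measure mentioned in this answer can be sketched in Java roughly as follows. This is a plain implementation under common conventions (prefix scale 0.1, at most 4 prefix characters), not code taken from the library this answer refers to; the class name JaroWinklerSketch is made up for illustration. The value computed here is a similarity (1.0 means identical), and the measure was designed for short strings such as names, so it is probably better applied per line than to a whole file.

```java
// Hypothetical helper for illustration; names are not from the original answer.
final class JaroWinklerSketch {

    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;

        // Characters match only if they are equal and not too far apart.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Count matched characters that appear in a different order.
        int outOfOrder = 0, k = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0;
    }

    // Winkler's adjustment: boost the score for a shared prefix of up to 4 characters.
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("MARTHA", "MARHTA")); // about 0.961
    }
}
```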

