Java text similarity algorithm

Question

Asked by EugeneP

I have two subtitle files. I need a function that tells whether they represent the same text or similar text.

Sometimes comments like "The wind is blowing... the music is playing" appear in only one file, but 80% of the contents will be the same. In that case the function must return TRUE (the files represent the same text). Sometimes there are also misspellings such as 1 instead of l (the digit one for the letter L), as here: She 1eft the baggage. Of course, the function must return TRUE in that case too.

My comments:
The function should return the percentage of similarity between the texts - AGREE

"all the people were happy" and "all the people were not happy" - here the difference would be treated as a misspelling, so the two would be considered the same text. To be exact, the percentage the function returns will be lower, but still high enough to say the phrases are similar.

Do consider whether you want to apply Levenshtein to a whole file or just to a search string - I'm not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It'll be a very long string, though.

Answer 1

Accepted answer by bcosca

Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance


Anything other than a result of zero means the texts are not "identical". "Similar" is a measure of how far apart or how close they are. The result is an integer.
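To make that concrete, here is a minimal Java sketch of the distance, plus one possible way to turn the integer into the percentage the question asks for. The class name `LevenshteinSimilarity` and the normalization by the longer length are my own assumptions, not something the answer prescribes.

```java
/** Hypothetical helper: edit distance plus a 0..1 similarity derived from it. */
final class LevenshteinSimilarity {
    private LevenshteinSimilarity() {}

    /** Levenshtein distance with two rolling rows: O(n*m) time, O(m) memory. */
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i chars of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1.0 means identical, 0.0 means nothing in common. */
    static double similarity(CharSequence a, CharSequence b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty texts are treated as identical
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }
}
```

With this sketch, `LevenshteinSimilarity.similarity("She left the baggage", "She 1eft the baggage")` differs from 1.0 by a single substitution over 20 characters. Note that the running time grows with the product of the two lengths, so on whole subtitle files it may be slow; comparing line by line is one way around that.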

Answer 2

Answer by Yonatan

For the problem you've described (i.e. comparing large strings), you can use Cosine Similarity, which returns a number between 0 (completely different) and 1 (identical), based on the term frequency vectors.

You might want to look at several implementations that are described here: Cosine Similarity

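As a rough illustration of the idea, here is a small Java sketch that builds term frequency vectors and computes their cosine. The class name `CosineSimilarity` and the tokenization rule (lower-case, split on anything that is not a letter or digit) are assumptions of mine, not taken from any of the linked implementations.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: cosine similarity over word counts. */
final class CosineSimilarity {
    private CosineSimilarity() {}

    /** Counts how often each word occurs; the tokenization is a simplifying assumption. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Returns the cosine of the angle between the two term frequency vectors. */
    static double similarity(String a, String b) {
        Map<String, Integer> ta = termFrequencies(a);
        Map<String, Integer> tb = termFrequencies(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : ta.entrySet()) {
            Integer other = tb.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(ta);
        double normB = norm(tb);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // a text without any words has no direction to compare
        }
        return dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> tf) {
        double sum = 0.0;
        for (int count : tf.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }
}
```

Because this version works on whole words, it ignores word order, and a misspelling such as "1eft" simply counts as a different word from "left". On a long file that costs very little, and extra comment lines in one file only lower the score slightly.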

Answer 3

Answer by Chinmay Kanchi

Have a look at approximate grep (agrep). It might give you pointers, though it's almost certain to perform abysmally on large chunks of text like the ones you're talking about.

EDIT: The original version of agrep isn't open source, so you may want to look for links to OSS versions at http://en.wikipedia.org/wiki/Agrep

Answer 4

Answer by soulmerge

You're expecting too much here; it looks like you will have to write a function for your specific needs. I would recommend starting with an existing file comparison application (maybe diff already has everything you need) and improving it to give good results for your input.
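If you go the diff route, the core of a line-based comparison is a longest common subsequence (LCS) over the lines of both files. Here is a minimal Java sketch of that idea. The class name `LineDiffSimilarity` and the ratio 2 * LCS / (lines in A + lines in B) are my own assumptions for turning the LCS into a score; this is not how any particular diff tool reports its results.

```java
import java.util.List;

/** Hypothetical helper: diff-style similarity based on the LCS of two line lists. */
final class LineDiffSimilarity {
    private LineDiffSimilarity() {}

    /** Length of the longest common subsequence of lines, using two rolling rows. */
    static int lcsLength(List<String> a, List<String> b) {
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    curr[j] = prev[j - 1] + 1; // this line is shared: extend the match
                } else {
                    curr[j] = Math.max(prev[j], curr[j - 1]);
                }
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.size()];
    }

    /** Assumed score: share of lines that both files have in common, in order. */
    static double similarity(List<String> a, List<String> b) {
        int total = a.size() + b.size();
        if (total == 0) {
            return 1.0; // two empty files are treated as identical
        }
        return 2.0 * lcsLength(a, b) / total;
    }
}
```

Lines are compared with exact equality here, so a line containing a misspelling counts as different. One way to improve on that for subtitle input would be to treat two lines as equal when a character-level measure says they are close enough.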

Answer 5

Answer by FiveO

There are many alternatives to the Levenshtein distance, for example the Jaro-Winkler distance.

The choice of algorithm depends on the language, the type of words, whether the words were entered by a human, and much more...

Here you can find a helpful implementation of several algorithms within one library
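For reference, here is a minimal Java sketch of the Jaro-Winkler similarity, in case you want to try it without pulling in a library. The class name `JaroWinkler` is hypothetical, and the scaling factor 0.1 and the maximum prefix length 4 are the commonly used defaults, which I am assuming here.

```java
/** Hypothetical helper: Jaro and Jaro-Winkler similarity, both in the range 0..1. */
final class JaroWinkler {
    private JaroWinkler() {}

    static double jaro(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) return 1.0;
        if (s1.isEmpty() || s2.isEmpty()) return 0.0;
        // Characters only count as matching if they are at most this far apart.
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(s2.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Walk both match sequences in order and count positions that disagree.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2; // integer division: each swap is seen twice
        return (m / s1.length() + m / s2.length() + (m - transpositions) / m) / 3.0;
    }

    /** Boosts the Jaro score for strings that share a prefix of up to 4 characters. */
    static double jaroWinkler(String s1, String s2) {
        double jaro = jaro(s1, s2);
        int prefix = 0;
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }
}
```

Jaro-Winkler was designed for short strings such as names, so for subtitle files it probably makes more sense to apply it per word or per line than to the file as one long string.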
