
Disclaimer: this content comes from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/5390397/


BLEU score implementation for sentence similarity detection

Tags: java, algorithm, nlp, text-processing, machine-translation

Asked by KNsiva

I need to calculate a BLEU score to identify whether two sentences are similar or not. I have read some articles, but they are mostly about using the BLEU score to measure machine translation accuracy. I need a BLEU score to find the similarity between sentences in the same language (English), i.e., both sentences are in English. Thanks in anticipation.


Accepted answer by ealdent

Well, if you just want to calculate the BLEU score, it's straightforward. Treat one sentence as the reference translation and the other as the candidate translation.


Answered by dmcer

For sentence level comparisons, use smoothed BLEU


The standard BLEU score used for machine translation evaluation (BLEU:4) is only really meaningful at the corpus level, since any sentence that does not have at least one 4-gram match will be given a score of 0.


This happens because, at its core, BLEU is really just the geometric mean of n-gram precisions, scaled by a brevity penalty to prevent very short sentences with some matching material from being given inappropriately high scores. Since the geometric mean is calculated by multiplying together all the terms to be included in the mean, having a zero for any of the n-gram counts results in the entire score being zero.
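To make the mechanics concrete, here is a minimal, self-contained Java sketch of sentence-level BLEU:4 with uniform weights and a single reference (the class and method names are illustrative, not the Phrasal API):

```java
import java.util.*;

// Minimal sketch of sentence-level BLEU:4: geometric mean of clipped 1- to
// 4-gram precisions, scaled by the brevity penalty. Illustrative only.
public class Bleu {

    // Count the n-grams of a given order in a token sequence.
    static Map<String, Integer> ngrams(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            counts.merge(String.join(" ", tokens.subList(i, i + n)), 1, Integer::sum);
        }
        return counts;
    }

    // Clipped n-gram precision: candidate counts are clipped by reference counts.
    static double precision(List<String> cand, List<String> ref, int n) {
        Map<String, Integer> refCounts = ngrams(ref, n);
        int matched = 0, total = 0;
        for (Map.Entry<String, Integer> e : ngrams(cand, n).entrySet()) {
            total += e.getValue();
            matched += Math.min(e.getValue(), refCounts.getOrDefault(e.getKey(), 0));
        }
        return total == 0 ? 0.0 : (double) matched / total;
    }

    static double bleu4(List<String> cand, List<String> ref) {
        double logSum = 0.0;
        for (int n = 1; n <= 4; n++) {
            double p = precision(cand, ref, n);
            if (p == 0.0) return 0.0;        // one zero precision zeroes the whole score
            logSum += Math.log(p) / 4.0;     // uniform weights -> geometric mean
        }
        // Brevity penalty: penalize candidates shorter than the reference.
        double bp = cand.size() >= ref.size()
                ? 1.0 : Math.exp(1.0 - (double) ref.size() / cand.size());
        return bp * Math.exp(logSum);
    }

    public static void main(String[] args) {
        List<String> cand = Arrays.asList("the cat sat on the mat".split(" "));
        List<String> ref  = Arrays.asList("the cat is on the mat".split(" "));
        // These sentences share no 4-gram, so plain BLEU:4 is exactly 0
        // despite the substantial overlap.
        System.out.println(bleu4(cand, ref)); // 0.0
    }
}
```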


If you want to apply BLEU to individual sentences, you're better off using smoothed BLEU (Lin and Och 2004, see sec. 4), whereby you add 1 to each of the n-gram counts before you calculate the n-gram precisions. This will prevent any of the n-gram precisions from being zero, and thus will result in non-zero values even when there are no 4-gram matches.
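Following that description, a smoothed variant of the precision method from the sketch above might look like this (again illustrative, not the Phrasal implementation; it applies the add-1 idea to the match and total counts):

```java
// Add-1 smoothed clipped precision: adding 1 to the matched and total n-gram
// counts keeps every precision strictly positive (sketch of Lin and Och 2004).
static double smoothedPrecision(List<String> cand, List<String> ref, int n) {
    Map<String, Integer> refCounts = ngrams(ref, n);
    int matched = 0, total = 0;
    for (Map.Entry<String, Integer> e : ngrams(cand, n).entrySet()) {
        total += e.getValue();
        matched += Math.min(e.getValue(), refCounts.getOrDefault(e.getKey(), 0));
    }
    return (matched + 1.0) / (total + 1.0);
}
```

Swapping this in for precision in bleu4 (and dropping the early return on a zero precision) yields a non-zero score for the example pair above even though they share no 4-gram.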


Java Implementation


You'll find a Java implementation of both BLEU and smooth BLEU in the Stanford machine translation package Phrasal.


Alternatives


As Andreas already mentioned, you might want to use an alternative scoring metric such as the Levenshtein string edit distance. However, one problem with using the traditional Levenshtein string edit distance to compare sentences is that it isn't explicitly aware of word boundaries.


Other alternatives include:


  • Word Error Rate - This is essentially the Levenshtein distance applied to a sequence of words rather than a sequence of characters. It's widely used for scoring speech recognition systems; a minimal sketch follows this list.
  • Translation Edit Rate (TER) - This is similar to word error rate, but it allows an additional swap edit operation for adjacent words and phrases. This metric has become popular within the machine translation community since it correlates better with human judgments than other sentence similarity measures such as BLEU. The most recent variant of this metric, known as Translation Edit Rate Plus (TERp), allows for matching of synonyms using WordNet as well as paraphrases of multiword sequences ("died" ~= "kicked the bucket").
  • METEOR - This metric first calculates an alignment that allows for arbitrary reordering of the words in the two sentences being compared. If there are multiple possible ways to align the sentences, METEOR selects the one that minimizes crisscrossing alignment edges. Like TERp, METEOR allows for matching of WordNet synonyms and paraphrases of multiword sequences. After alignment, the metric uses the number of matching words to compute an F-α score, a balanced measure of precision and recall, which is then scaled by a penalty for the amount of word-order scrambling present in the alignment.
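For instance, word error rate is just a few lines of dynamic programming; here is a minimal sketch (the wer helper and its normalization by reference length are illustrative, not taken from any particular toolkit):

```java
// Sketch of word error rate: Levenshtein distance over word tokens,
// normalized by the reference length. Illustrative helper, not a library call.
public class WordErrorRate {

    static double wer(String[] ref, String[] hyp) {
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i;  // all deletions
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j;  // all insertions
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = ref[i - 1].equals(hyp[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,     // substitution or match
                           Math.min(d[i - 1][j] + 1,          // deletion
                                    d[i][j - 1] + 1));        // insertion
            }
        }
        return (double) d[ref.length][hyp.length] / ref.length;
    }

    public static void main(String[] args) {
        double score = wer("the cat sat on the mat".split(" "),
                           "the cat is on the mat".split(" "));
        System.out.println(score); // one substitution over six words ~ 0.167
    }
}
```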

Answered by Andreas

Maybe the (Levenshtein) edit distance is also an option, or the Hamming distance. Either way, the BLEU score is also appropriate for the job: it measures the similarity of one sentence against a reference, so it only makes sense when both sentences are in the same language, as in your problem.


Answered by brlaranjeira

You can use the Moses multi-bleu script, which also supports multiple references: https://github.com/moses-smt/mosesdecoder/blob/RELEASE-2.1.1/scripts/generic/multi-bleu.perl
