How is the Vader 'compound' polarity score calculated in Python NLTK?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute the original authors (not me). Original: http://stackoverflow.com/questions/40325980/


Tags: python, nlp, nltk, sentiment-analysis, vader

Asked by alicecongcong

I'm using the Vader SentimentAnalyzer to obtain the polarity scores. I previously used the probability scores for positive/negative/neutral, but I just realized the "compound" score, ranging from -1 (most negative) to 1 (most positive), would provide a single measure of polarity. I wonder how the "compound" score is computed. Is that calculated from the [pos, neu, neg] vector?

Answered by alvas

The VADER algorithm outputs sentiment scores for 4 classes of sentiment (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L441):

  • neg: Negative
  • neu: Neutral
  • pos: Positive
  • compound: Compound (i.e. aggregated score)
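
For a quick end-to-end illustration, here is a hedged usage sketch: the example sentence and the printed values are the usual VADER demo, and the exact numbers depend on your lexicon/NLTK version. All four classes come back in a single dict:

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# requires the lexicon resource: nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("VADER is smart, handsome, and funny."))
# e.g. {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}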

Let's walk through the code. The first occurrence of compound is at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L421, where it computes:

compound = normalize(sum_s)

The normalize() function is defined at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L107:

import math

def normalize(score, alpha=15):
    """
    Normalize the score to be between -1 and 1 using an alpha that
    approximates the max expected value
    """
    norm_score = score/math.sqrt((score*score) + alpha)
    return norm_score

So there's a hyper-parameter alpha.
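
For numeric intuition, here is a tiny standalone check (the formula re-implemented inline rather than imported from NLTK) of how the default alpha=15 squashes raw sums into (-1, 1):

import math

# same formula as normalize() above, with the default alpha=15
for s in [0, 1, 2, 4, 10, -4]:
    print(s, round(s / math.sqrt(s * s + 15), 4))
# 0 -> 0.0, 1 -> 0.25, 2 -> 0.4588, 4 -> 0.7184, 10 -> 0.9325, -4 -> -0.7184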

As for sum_s, it is the sum of the sentiments argument passed to the score_valence() function: https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L413

And if we trace back this sentiments argument, we see that it's computed when calling the polarity_scores() function at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L217:

def polarity_scores(self, text):
    """
    Return a float for sentiment strength based on the input text.
    Positive values are positive valence, negative value are negative
    valence.
    """
    sentitext = SentiText(text)
    #text, words_and_emoticons, is_cap_diff = self.preprocess(text)

    sentiments = []
    words_and_emoticons = sentitext.words_and_emoticons
    for item in words_and_emoticons:
        valence = 0
        i = words_and_emoticons.index(item)
        if (i < len(words_and_emoticons) - 1 and item.lower() == "kind" and \
            words_and_emoticons[i+1].lower() == "of") or \
            item.lower() in BOOSTER_DICT:
            sentiments.append(valence)
            continue

        sentiments = self.sentiment_valence(valence, sentitext, item, i, sentiments)

    sentiments = self._but_check(words_and_emoticons, sentiments)

    return self.score_valence(sentiments, text)

Looking at the polarity_scores() function, what it does is iterate over the words and emoticons in the SentiText and check each one with the rule-based sentiment_valence() function, which assigns a valence score to the sentiment (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L243); see Section 2.1.1 of http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
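
To see those heuristics at work, a small hedged example (booster words, capitalization, and punctuation each bump the per-token valences, so the compound score should increase down the list; exact values depend on the lexicon version):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
for text in ["The movie was good.",
             "The movie was very good.",
             "The movie was VERY GOOD!!!"]:
    print(text, sia.polarity_scores(text)["compound"])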

So going back to the compound score, we see that:

  • the compound score is a normalized score of sum_s, and
  • sum_s is the sum of valence scores computed based on some heuristics and a sentiment lexicon (aka. sentiment intensity), and
  • the normalized score is simply sum_s divided by the square root of its square plus an alpha parameter that increases the denominator of the normalization function.


Is that calculated from the [pos, neu, neg] vector?

Not really =)

If we take a look at the score_valence() function (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L411), we see that the compound score is computed from sum_s before the pos, neg, and neu scores are computed. Those are produced by _sift_sentiment_scores(), which computes the individual pos, neg, and neu scores from the raw scores returned by sentiment_valence(), without the sum.
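
To make that concrete, here is a paraphrased, simplified sketch (not a verbatim copy of vader.py, and the per-token valences below are made up) showing how the same valence list yields compound on one side and pos/neg/neu on the other:

import math

def sift(sentiments):
    # paraphrase of _sift_sentiment_scores(): split up the valence mass
    pos_sum, neg_sum, neu_count = 0.0, 0.0, 0
    for v in sentiments:
        if v > 0:
            pos_sum += v + 1   # +1 compensates for neutrals counted as 1
        elif v < 0:
            neg_sum += v - 1
        else:
            neu_count += 1
    return pos_sum, neg_sum, neu_count

sentiments = [1.9, 0, -1.3, 2.2]                  # hypothetical valences
sum_s = sum(sentiments)
compound = sum_s / math.sqrt(sum_s * sum_s + 15)  # from the raw sum only
pos_sum, neg_sum, neu_count = sift(sentiments)
total = pos_sum + math.fabs(neg_sum) + neu_count
print(round(compound, 4))                         # 0.5859
print(round(pos_sum / total, 3),                  # pos share: 0.649
      round(math.fabs(neg_sum) / total, 3),       # neg share: 0.245
      round(neu_count / total, 3))                # neu share: 0.106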



If we take a look at this alpha mathemagic, it seems the output of the normalization is rather unstable (if left unconstrained), depending on the value of alpha:

alpha=0: [plot omitted]

alpha=15: [plot omitted]

alpha=50000: [plot omitted]

alpha=0.001: [plot omitted]

It gets funky when alpha is negative:

alpha=-10: [plot omitted]

alpha=-1,000,000: [plot omitted]

alpha=-1,000,000,000: [plot omitted]
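
For reference, a sketch (assuming numpy and matplotlib, which are not part of NLTK) that reproduces curves like the ones above, masking the region where a negative alpha puts a negative number under the square root:

import numpy as np
import matplotlib.pyplot as plt

scores = np.linspace(-20, 20, 401)
for alpha in (0.001, 15, 50000, -10):
    denom_sq = scores * scores + alpha
    with np.errstate(divide="ignore", invalid="ignore"):
        # undefined where score**2 + alpha <= 0 (negative alpha): mask with NaN
        y = np.where(denom_sq > 0, scores / np.sqrt(np.abs(denom_sq)), np.nan)
    plt.plot(scores, y, label=f"alpha={alpha}")
plt.xlabel("raw score (sum_s)")
plt.ylabel("normalized score")
plt.legend()
plt.show()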

Answered by leonfrench

"About the Scoring" section at the github repohas a description.

github 存储库中的“关于评分”部分有说明。