How is the Vader 'compound' polarity score calculated in Python NLTK?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40325980/
Asked by alicecongcong
I'm using the Vader SentimentAnalyzer to obtain the polarity scores. I used the probability scores for positive/negative/neutral before, but I just realized that the "compound" score, ranging from -1 (most negative) to 1 (most positive), would provide a single measure of polarity. I wonder how the "compound" score is computed. Is it calculated from the [pos, neu, neg] vector?
Answered by alvas
The VADER algorithm outputs sentiment scores for 4 classes of sentiment (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L441):
- neg: Negative
- neu: Neutral
- pos: Positive
- compound: Compound (i.e. aggregated score)
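For context, here is a minimal usage sketch showing all four fields (my own example, not from the original answer; it assumes nltk is installed and the vader_lexicon data has been downloaded):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Requires: pip install nltk, then nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("VADER is smart, handsome, and funny!")
print(scores)
# Something like: {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
```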
Let's walk through the code. The first instance of compound is at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L421, where it computes:
compound = normalize(sum_s)
The normalize() function is defined as such at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L107:
def normalize(score, alpha=15):
    """
    Normalize the score to be between -1 and 1 using an alpha that
    approximates the max expected value
    """
    norm_score = score/math.sqrt((score*score) + alpha)
    return norm_score
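To get a feel for how this squashes an unbounded valence sum into (-1, 1), here is a small worked example re-using the same formula with the default alpha=15 (a sketch of my own, not NLTK's code; values rounded):

```python
import math

def normalize(score, alpha=15):
    # Same formula as above: score / sqrt(score^2 + alpha)
    return score / math.sqrt((score * score) + alpha)

print(round(normalize(1), 3))    # 0.25   -> a single mildly positive word
print(round(normalize(4), 3))    # 0.718  -> a stronger positive sum
print(round(normalize(10), 3))   # 0.933  -> approaching the +1 asymptote
print(round(normalize(-4), 3))   # -0.718 -> symmetric for negative sums
```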
So there's a hyper-parameter alpha.
As for sum_s, it is the sum of the sentiment arguments passed to the score_valence() function (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L413).
And if we trace back this sentiment argument, we see that it's computed when calling the polarity_scores() function at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L217:
def polarity_scores(self, text):
    """
    Return a float for sentiment strength based on the input text.
    Positive values are positive valence, negative value are negative
    valence.
    """
    sentitext = SentiText(text)
    #text, words_and_emoticons, is_cap_diff = self.preprocess(text)

    sentiments = []
    words_and_emoticons = sentitext.words_and_emoticons
    for item in words_and_emoticons:
        valence = 0
        i = words_and_emoticons.index(item)
        if (i < len(words_and_emoticons) - 1 and item.lower() == "kind" and \
            words_and_emoticons[i+1].lower() == "of") or \
            item.lower() in BOOSTER_DICT:
            sentiments.append(valence)
            continue
        sentiments = self.sentiment_valence(valence, sentitext, item, i, sentiments)

    sentiments = self._but_check(words_and_emoticons, sentiments)
Looking at the polarity_scores function, what it does is iterate through the whole SentiText token list and check each token against the rule-based sentiment_valence() function to assign a valence score to the sentiment (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L243); see Section 2.1.1 of http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
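The per-token base valences come from the VADER lexicon, which the analyzer exposes as a dict; a quick way to peek at it (my own illustration; the exact values depend on the lexicon version shipped with your nltk data):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
# Raw lexicon valences (roughly on a -4..4 scale) before any heuristics apply:
print(sia.lexicon.get("great"))     # a positive valence, around 3
print(sia.lexicon.get("horrible"))  # a negative valence, around -2.5
print(sia.lexicon.get("the"))       # None: neutral stop-words are not in the lexicon
```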
So going back to the compound score, we see that:
- the compound score is a normalized score of sum_s, where sum_s is the sum of the valences computed based on some heuristics and a sentiment lexicon (aka Sentiment Intensity), and
- the normalized score is simply sum_s divided by the square root of its square plus an alpha parameter that increases the denominator of the normalization function (a small worked sketch follows this list).
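A minimal sketch of just that aggregation step, assuming we already have per-token valences in hand (the helper name and the example valences are made up; real VADER also applies punctuation-emphasis adjustments to sum_s before normalizing):

```python
import math

def compound_from_valences(valences, alpha=15):
    # compound = normalize(sum_s), the core of score_valence()
    sum_s = sum(valences)
    return sum_s / math.sqrt(sum_s * sum_s + alpha)

# e.g. three tokens with valences 1.9, -0.4 and 2.3 -> sum_s = 3.8
print(round(compound_from_valences([1.9, -0.4, 2.3]), 3))  # ~0.7
```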
Is that calculated from the [pos, neu, neg] vector?
Not really =)
If we take a look at the score_valence function (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L411), we see that the compound score is computed with sum_s before the pos, neg and neu scores are computed using _sift_sentiment_scores(), which computes the individual pos, neg and neu scores from the raw scores of sentiment_valence() without the sum.
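In other words, pos, neu and neg are proportions of the sifted valence mass and add up to roughly 1, while compound is derived from sum_s independently. A quick way to see this (the sentence is arbitrary and my own; exact numbers will vary):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The plot was good, but the ending was terrible.")
print(scores)
# pos + neu + neg should sum to ~1.0; compound is not derived from that vector:
print(scores["pos"] + scores["neu"] + scores["neg"])
```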
If we take a look at this alpha mathemagic, it seems the output of the normalization is rather unstable (if left unconstrained), depending on the value of alpha:
[plots of the normalized score for alpha=0, alpha=15, alpha=50000 and alpha=0.001]
It gets funky when it's negative:
[plots of the normalized score for alpha=-10, alpha=-1,000,000 and alpha=-1,000,000,000]
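A rough numeric substitute for those plots, using the same formula (my own sketch, not the original figures):

```python
import math

def normalize(score, alpha=15):
    return score / math.sqrt(score * score + alpha)

# Larger alpha flattens the curve; tiny alpha saturates almost immediately:
for alpha in (0.001, 15, 50000):
    row = [round(normalize(s, alpha), 3) for s in (-10, -1, 0, 1, 10)]
    print(f"alpha={alpha}: {row}")

# A negative alpha makes the radicand negative whenever score*score < -alpha,
# so normalize() raises "ValueError: math domain error" for small scores, e.g.:
# normalize(1, alpha=-10)
```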
Answered by leonfrench
"About the Scoring" section at the github repohas a description.
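For interpreting the compound value, that section suggests simple thresholds (commonly ±0.05); a hypothetical helper based on that convention (the threshold comes from the repo's description, the function itself is made up):

```python
def label_from_compound(compound, threshold=0.05):
    # Typical convention from the vaderSentiment "About the Scoring" notes:
    # compound >= 0.05 -> positive, compound <= -0.05 -> negative, else neutral
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(label_from_compound(0.8316))  # positive
print(label_from_compound(-0.02))   # neutral
```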