How is the Vader 'compound' polarity score calculated in Python NLTK?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/40325980/
Asked by alicecongcong
I'm using the Vader SentimentAnalyzer to obtain the polarity scores. I used the probability scores for positive/negative/neutral before, but I just realized that the "compound" score, ranging from -1 (most negative) to 1 (most positive), would provide a single measure of polarity. I wonder how the "compound" score is computed. Is it calculated from the [pos, neu, neg] vector?
Answered by alvas
The VADER algorithm outputs sentiment scores for 4 classes of sentiment (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L441):
- neg: Negative
- neu: Neutral
- pos: Positive
- compound: Compound (i.e. aggregated score)
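For context, here is a minimal usage sketch showing all four fields (my own example, not from the original answer; it assumes nltk is installed and the vader_lexicon data has been downloaded):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Requires: pip install nltk, then nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("VADER is smart, handsome, and funny!")
print(scores)
# Something like: {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
```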
Let's walk through the code. The first instance of compound is at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L421, where it computes:
compound = normalize(sum_s)
The normalize() function is defined as such at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L107:
def normalize(score, alpha=15):
    """
    Normalize the score to be between -1 and 1 using an alpha that
    approximates the max expected value
    """
    norm_score = score/math.sqrt((score*score) + alpha)
    return norm_score
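To get a feel for how this squashes an unbounded valence sum into (-1, 1), here is a small worked example re-using the same formula with the default alpha=15 (a sketch of my own, not NLTK's code; values rounded):

```python
import math

def normalize(score, alpha=15):
    # Same formula as above: score / sqrt(score^2 + alpha)
    return score / math.sqrt((score * score) + alpha)

print(round(normalize(1), 3))    # 0.25   -> a single mildly positive word
print(round(normalize(4), 3))    # 0.718  -> a stronger positive sum
print(round(normalize(10), 3))   # 0.933  -> approaching the +1 asymptote
print(round(normalize(-4), 3))   # -0.718 -> symmetric for negative sums
```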
So there's a hyper-parameter alpha.
As for sum_s, it is the sum of the sentiment arguments passed to the score_valence() function (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L413).
And if we trace back this sentiment argument, we see that it's computed when calling the polarity_scores() function at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L217:
def polarity_scores(self, text):
    """
    Return a float for sentiment strength based on the input text.
    Positive values are positive valence, negative value are negative
    valence.
    """
    sentitext = SentiText(text)
    #text, words_and_emoticons, is_cap_diff = self.preprocess(text)

    sentiments = []
    words_and_emoticons = sentitext.words_and_emoticons
    for item in words_and_emoticons:
        valence = 0
        i = words_and_emoticons.index(item)
        if (i < len(words_and_emoticons) - 1 and item.lower() == "kind" and \
            words_and_emoticons[i+1].lower() == "of") or \
            item.lower() in BOOSTER_DICT:
            sentiments.append(valence)
            continue
        sentiments = self.sentiment_valence(valence, sentitext, item, i, sentiments)

    sentiments = self._but_check(words_and_emoticons, sentiments)
Looking at the polarity_scores function, what it does is iterate through the whole SentiText token list and check each token against the rule-based sentiment_valence() function to assign a valence score to the sentiment (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L243); see Section 2.1.1 of http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
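The per-token base valences come from the VADER lexicon, which the analyzer exposes as a dict; a quick way to peek at it (my own illustration; the exact values depend on the lexicon version shipped with your nltk data):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
# Raw lexicon valences (roughly on a -4..4 scale) before any heuristics apply:
print(sia.lexicon.get("great"))     # a positive valence, around 3
print(sia.lexicon.get("horrible"))  # a negative valence, around -2.5
print(sia.lexicon.get("the"))       # None: neutral stop-words are not in the lexicon
```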
So going back to the compound score, we see that:
- the compound score is a normalized score of sum_s, where sum_s is the sum of the valences computed based on some heuristics and a sentiment lexicon (aka Sentiment Intensity), and
- the normalized score is simply sum_s divided by the square root of its square plus an alpha parameter that increases the denominator of the normalization function (a small worked sketch follows this list).
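A minimal sketch of just that aggregation step, assuming we already have per-token valences in hand (the helper name and the example valences are made up; real VADER also applies punctuation-emphasis adjustments to sum_s before normalizing):

```python
import math

def compound_from_valences(valences, alpha=15):
    # compound = normalize(sum_s), the core of score_valence()
    sum_s = sum(valences)
    return sum_s / math.sqrt(sum_s * sum_s + alpha)

# e.g. three tokens with valences 1.9, -0.4 and 2.3 -> sum_s = 3.8
print(round(compound_from_valences([1.9, -0.4, 2.3]), 3))  # ~0.7
```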
Is that calculated from the [pos, neu, neg] vector?
Not really =)
If we take a look at the score_valence function (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L411), we see that the compound score is computed with sum_s before the pos, neg and neu scores are computed using _sift_sentiment_scores(), which computes the individual pos, neg and neu scores from the raw scores of sentiment_valence() without the sum.
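In other words, pos, neu and neg are proportions of the sifted valence mass and add up to roughly 1, while compound is derived from sum_s independently. A quick way to see this (the sentence is arbitrary and my own; exact numbers will vary):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The plot was good, but the ending was terrible.")
print(scores)
# pos + neu + neg should sum to ~1.0; compound is not derived from that vector:
print(scores["pos"] + scores["neu"] + scores["neg"])
```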
If we take a look at this alpha mathemagic, it seems the output of the normalization is rather unstable (if left unconstrained), depending on the value of alpha:
[plots of the normalized score for alpha=0, alpha=15, alpha=50000 and alpha=0.001]
It gets funky when it's negative:
[plots of the normalized score for alpha=-10, alpha=-1,000,000 and alpha=-1,000,000,000]
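A rough numeric substitute for those plots, using the same formula (my own sketch, not the original figures):

```python
import math

def normalize(score, alpha=15):
    return score / math.sqrt(score * score + alpha)

# Larger alpha flattens the curve; tiny alpha saturates almost immediately:
for alpha in (0.001, 15, 50000):
    row = [round(normalize(s, alpha), 3) for s in (-10, -1, 0, 1, 10)]
    print(f"alpha={alpha}: {row}")

# A negative alpha makes the radicand negative whenever score*score < -alpha,
# so normalize() raises "ValueError: math domain error" for small scores, e.g.:
# normalize(1, alpha=-10)
```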
Answered by leonfrench
"About the Scoring" section at the github repohas a description.
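For interpreting the compound value, that section suggests simple thresholds (commonly ±0.05); a hypothetical helper based on that convention (the threshold comes from the repo's description, the function itself is made up):

```python
def label_from_compound(compound, threshold=0.05):
    # Typical convention from the vaderSentiment "About the Scoring" notes:
    # compound >= 0.05 -> positive, compound <= -0.05 -> negative, else neutral
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(label_from_compound(0.8316))  # positive
print(label_from_compound(-0.02))   # neutral
```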