Python 用于情感分析的 nltk NaiveBayesClassifier 训练

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/20827741/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-18 21:18:22  来源:igfitidea点击:

nltk NaiveBayesClassifier training for sentiment analysis

pythonnlpnltksentiment-analysistextblob

提问by student001

I am training the NaiveBayesClassifierin Python using sentences, and it gives me the error below. I do not understand what the error might be, and any help would be good.

我正在NaiveBayesClassifier使用句子在 Python 中训练,它给了我下面的错误。我不明白错误可能是什么,任何帮助都会很好。

I have tried many other input formats, but the error remains. The code given below:

我尝试了许多其他输入格式,但错误仍然存​​在。下面给出的代码:

from text.classifiers import NaiveBayesClassifier
from text.blob import TextBlob
train = [('I love this sandwich.', 'pos'),
         ('This is an amazing place!', 'pos'),
         ('I feel very good about these beers.', 'pos'),
         ('This is my best work.', 'pos'),
         ("What an awesome view", 'pos'),
         ('I do not like this restaurant', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with this", 'neg'),
         ('He is my sworn enemy!', 'neg'),
         ('My boss is horrible.', 'neg') ]

test = [('The beer was good.', 'pos'),
        ('I do not enjoy my job', 'neg'),
        ("I ain't feeling dandy today.", 'neg'),
        ("I feel amazing!", 'pos'),
        ('Gary is a friend of mine.', 'pos'),
        ("I can't believe I'm doing this.", 'neg') ]
classifier = nltk.NaiveBayesClassifier.train(train)

I am including the traceback below.

我包括下面的回溯。

Traceback (most recent call last):
  File "C:\Users60\Desktop\train01.py", line 15, in <module>
    all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
  File "C:\Users60\Desktop\train01.py", line 15, in <genexpr>
    all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 87, in word_tokenize
    return _word_tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\treebank.py", line 67, in tokenize
    text = re.sub(r'^\"', r'``', text)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

采纳答案by π?δα? ?κ??

You need to change your data structure. Here is your trainlist as it currently stands:

您需要更改数据结构。这是您train目前的清单:

>>> train = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

The problem is, though, that the first element of each tuple should be a dictionary of features. So I will change your list into a data structure that the classifier can work with:

但问题是,每个元组的第一个元素应该是一个特征字典。因此,我会将您的列表更改为分类器可以使用的数据结构:

>>> from nltk.tokenize import word_tokenize # or use some other tokenizer
>>> all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
>>> t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]

Your data should now be structured like this:

您的数据现在应该是这样结构的:

>>> t
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), . . .]

Note that the first element of each tuple is now a dictionary. Now that your data is in place and the first element of each tuple is a dictionary, you can train the classifier like so:

请注意,每个元组的第一个元素现在是一个字典。现在您的数据已经就位,并且每个元组的第一个元素是一个字典,您可以像这样训练分类器:

>>> import nltk
>>> classifier = nltk.NaiveBayesClassifier.train(t)
>>> classifier.show_most_informative_features()
Most Informative Features
                    this = True              neg : pos    =      2.3 : 1.0
                    this = False             pos : neg    =      1.8 : 1.0
                      an = False             neg : pos    =      1.6 : 1.0
                       . = True              pos : neg    =      1.4 : 1.0
                       . = False             neg : pos    =      1.4 : 1.0
                 awesome = False             neg : pos    =      1.2 : 1.0
                      of = False             pos : neg    =      1.2 : 1.0
                    feel = False             neg : pos    =      1.2 : 1.0
                   place = False             neg : pos    =      1.2 : 1.0
                horrible = False             pos : neg    =      1.2 : 1.0

If you want to use the classifier, you can do it like this. First, you begin with a test sentence:

如果你想使用分类器,你可以这样做。首先,你从一个测试语句开始:

>>> test_sentence = "This is the best band I've ever heard!"

Then, you tokenize the sentence and figure out which words the sentence shares with all_words. These constitute the sentence's features.

然后,您对句子进行标记并找出该句子与 all_words 共享哪些单词。这些构成了句子的特征。

>>> test_sent_features = {word: (word in word_tokenize(test_sentence.lower())) for word in all_words}

Your features will now look like this:

您的功能现在将如下所示:

>>> test_sent_features
{'love': False, 'deal': False, 'tired': False, 'feel': False, 'is': True, 'am': False, 'an': False, 'sandwich': False, 'ca': False, 'best': True, '!': True, 'what': False, 'i': True, '.': False, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'these': False, 'stuff': False, 'place': False, 'my': False, 'view': False}

Then you simply classify those features:

然后您只需对这些特征进行分类:

>>> classifier.classify(test_sent_features)
'pos' # note 'best' == True in the sentence features above

This test sentence appears to be positive.

这个测试语句似乎是肯定的。

回答by alvas

@275365's tutorial on the data structure for NLTK's bayesian classifier is great. From a more high level, we can look at it as,

@275365 关于 NLTK 贝叶斯分类器数据结构的教程很棒。从更高的层面上,我们可以将其视为,

We have inputs sentences with sentiment tags:

我们有带有情感标签的输入句子:

training_data = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

Let's consider our feature sets to be individual words, so we extract a list of all possible words from the training data (let's call it vocabulary) as such:

让我们将我们的特征集视为单个单词,因此我们从训练数据中提取所有可能单词的列表(我们称之为词汇表),如下所示:

from nltk.tokenize import word_tokenize
from itertools import chain
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

Essentially, vocabularyhere is the same @275365's all_word

本质上,vocabulary这里是相同的@275365all_word

>>> all_words = set(word.lower() for passage in training_data for word in word_tokenize(passage[0]))
>>> vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
>>> print vocabulary == all_words
True

From each data point, (i.e. each sentence and the pos/neg tag), we want to say whether a feature (i.e. a word from the vocabulary) exist or not.

从每个数据点(即每个句子和 pos/neg 标签),我们想判断一个特征(即词汇表中的一个词)是否存在。

>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> print {i:True for i in vocabulary if i in sentence}
{'this': True, 'i': True, 'sandwich': True, 'love': True, '.': True}

But we also want to tell the classifier which word don't exist in the sentence but in the vocabulary, so for each data point, we list out all possible words in the vocabulary and say whether a word exist or not:

但是我们也想告诉分类器哪个词不存在于句子中但存在于词汇表中,因此对于每个数据点,我们列出词汇表中所有可能的词并判断一个词是否存在:

>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x =  {i:True for i in vocabulary if i in sentence}
>>> y =  {i:False for i in vocabulary if i not in sentence}
>>> x.update(y)
>>> print x
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}

But since this loops through the vocabulary twice, it's more efficient to do this:

但是由于这会遍历词汇表两次,所以这样做会更有效率:

>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x = {i:(i in sentence) for i in vocabulary}
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}

So for each sentence, we want to tell the classifier for each sentence which word exist and which word doesn't and also give it the pos/neg tag. We can call that a feature_set, it's a tuple made up of a x(as shown above) and its tag.

所以对于每个句子,我们想告诉每个句子的分类器哪个词存在,哪个词不存在,并给它 pos/neg 标签。我们可以称之为 a feature_set,它是一个由 a x(如上所示)及其标签组成的元组。

>>> feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), ...]

Then we feed these features and tags in the feature_set into the classifier to train it:

然后我们将 feature_set 中的这些特征和标签输入到分类器中进行训练:

from nltk import NaiveBayesClassifier as nbc
classifier = nbc.train(feature_set)

Now you have a trained classifier and when you want to tag a new sentence, you have to "featurize" the new sentence to see which of the word in the new sentence are in the vocabulary that the classifier was trained on:

现在你有一个经过训练的分类器,当你想标记一个新句子时,你必须“特征化”新句子,看看新句子中的哪些词在分类器训练的词汇表中:

>>> test_sentence = "This is the best band I've ever heard! foobar"
>>> featurized_test_sentence = {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}

NOTE:As you can see from the step above, the naive bayes classifier cannot handle out of vocabulary words since the foobartoken disappears after you featurize it.

注意:从上面的步骤中可以看出,朴素贝叶斯分类器无法处理词汇不足的单词,因为foobar标记在您对它进行特征化后消失了。

Then you feed the featurized test sentence into the classifier and ask it to classify:

然后将特征化的测试句子输入分类器并要求它进行分类:

>>> classifier.classify(featurized_test_sentence)
'pos'

Hopefully this gives a clearer picture of how to feed data in to NLTK's naive bayes classifier for sentimental analysis. Here's the full code without the comments and the walkthrough:

希望这能更清楚地了解如何将数据输入 NLTK 的朴素贝叶斯分类器进行情感分析。这是没有注释和演练的完整代码:

from nltk import NaiveBayesClassifier as nbc
from nltk.tokenize import word_tokenize
from itertools import chain

training_data = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]

classifier = nbc.train(feature_set)

test_sentence = "This is the best band I've ever heard!"
featurized_test_sentence =  {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}

print "test_sent:",test_sentence
print "tag:",classifier.classify(featurized_test_sentence)

回答by Steve L

It appears that you are trying to use TextBlob but are training the NLTK NaiveBayesClassifier, which, as pointed out in other answers, must be passed a dictionary of features.

您似乎正在尝试使用 TextBlob,但正在训练 NLTK NaiveBayesClassifier,正如其他答案中所指出的,必须通过特征字典。

TextBlob has a default feature extractor that indicates which words in the training set are included in the document (as demonstrated in the other answers). Therefore, TextBlob allows you to pass in your data as is.

TextBlob 有一个默认的特征提取器,用于指示文档中包含训练集中的哪些单词(如其他答案中所示)。因此,TextBlob 允许您按原样传递数据。

from textblob.classifiers import NaiveBayesClassifier

train = [('This is an amazing place!', 'pos'),
        ('I feel very good about these beers.', 'pos'),
        ('This is my best work.', 'pos'),
        ("What an awesome view", 'pos'),
        ('I do not like this restaurant', 'neg'),
        ('I am tired of this stuff.', 'neg'),
        ("I can't deal with this", 'neg'),
        ('He is my sworn enemy!', 'neg'),
        ('My boss is horrible.', 'neg') ] 
test = [
        ('The beer was good.', 'pos'),
        ('I do not enjoy my job', 'neg'),
        ("I ain't feeling dandy today.", 'neg'),
        ("I feel amazing!", 'pos'),
        ('Gary is a friend of mine.', 'pos'),
        ("I can't believe I'm doing this.", 'neg') ] 


classifier = NaiveBayesClassifier(train)  # Pass in data as is
# When classifying text, features are extracted automatically
classifier.classify("This is an amazing library!")  # => 'pos'

Of course, the simple default extractor is not appropriate for all problems. If you would like to how features are extracted, you just write a function that takes a string of text as input and outputs the dictionary of features and pass that to the classifier.

当然,简单的默认提取器并不适合所有问题。如果您想了解如何提取特征,您只需编写一个函数,该函数将一串文本作为输入并输出特征字典并将其传递给分类器。

classifier = NaiveBayesClassifier(train, feature_extractor=my_extractor_func)

I encourage you to check out the short TextBlob classifier tutorial here: http://textblob.readthedocs.org/en/latest/classifiers.html

我鼓励您在此处查看简短的 TextBlob 分类器教程:http://textblob.readthedocs.org/en/latest/classifiers.html