Java: How to train the Stanford NLP Sentiment Analysis tool

Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/22586658/

Time: 2020-08-13 16:40:09  Source: igfitidea

Tags: java, nlp, stanford-nlp, sentiment-analysis

Asked by Jordan H

Hello everyone! I'm using the Stanford CoreNLP package and my goal is to perform sentiment analysis on a live stream of tweets.

Using the sentiment analysis tool as is returns a very poor analysis of the text's 'attitude': many positives are labeled neutral, and many negatives are rated positive. I've gone ahead and acquired well over a million tweets in a text file, but I haven't a clue how to actually train the tool and create my own model.

Link to Stanford Sentiment Analysis page

"Models can be retrained using the following command using the PTB format dataset:"

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

Sample from dev.txt (the leading 4 is the sentiment label of the whole sentence on a 0 to 4 scale, so 4 is the most positive)

(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) (3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))

Sample from test.txt

(3 (3 (2 If) (3 (2 you) (3 (2 sometimes) (2 (2 like) (3 (2 to) (3 (3 (2 go) (2 (2 to) (2 (2 the) (2 movies)))) (3 (2 to) (3 (2 have) (4 fun))))))))) (2 (2 ,) (2 (2 Wasabi) (3 (3 (2 is) (2 (2 a) (2 (3 good) (2 (2 place) (2 (2 to) (2 start)))))) (2 .)))))

Sample from train.txt

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

I have two questions going forward.

What is the significance and difference between each file? Train.txt/Dev.txt/Test.txt ?

How would I train my own model with a raw, unparsed text file full of tweets?

I'm very new to NLP so if I am missing any required information or anything at all please critique! Thank you!

Answered by mbatchkarov

What is the significance and difference between each file? Train.txt/Dev.txt/Test.txt ?

This is standard machine learning terminology. The train set is used to (surprise surprise) train a model. The development set is used to tune any parameters the model might have. What you would normally do is pick a parameter value, train a model on the training set, and then check how well the trained model does on the development set. You then pick another parameter value and repeat. This procedure helps you find reasonable parameter values for your model.

Once this is done, you test how well the model does on the test set. This data is unseen: your model has never encountered any of it before. It is important that the test set is separate from the training and development sets; otherwise you are effectively evaluating the model on data it has already seen, which gives an overly optimistic picture of how well the model really does.
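
To make the train/dev/test procedure concrete, here is a minimal sketch in plain Java. It is not part of Stanford CoreNLP: the class name SplitSketch, the methods split and pickBest, the 80/10/10 proportions, and the toy scoring function are all assumptions made for illustration. In practice the dev score would come from training a model with each candidate parameter value and measuring it on the dev set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Hypothetical helper class, not part of Stanford CoreNLP.
public class SplitSketch {

    /** Shuffles a copy of the data and cuts it into train, dev and test parts (in that order). */
    static <T> List<List<T>> split(List<T> data, int trainPercent, int devPercent, long seed) {
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split reproducible
        int trainEnd = shuffled.size() * trainPercent / 100;
        int devEnd = trainEnd + shuffled.size() * devPercent / 100;
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainEnd)));
        parts.add(new ArrayList<>(shuffled.subList(trainEnd, devEnd)));
        parts.add(new ArrayList<>(shuffled.subList(devEnd, shuffled.size())));
        return parts;
    }

    /** Returns the candidate parameter value with the highest dev-set score. */
    static double pickBest(double[] candidates, DoubleUnaryOperator devScore) {
        double best = candidates[0];
        double bestScore = devScore.applyAsDouble(best);
        for (int i = 1; i < candidates.length; i++) {
            double score = devScore.applyAsDouble(candidates[i]);
            if (score > bestScore) {
                bestScore = score;
                best = candidates[i];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> examples = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            examples.add("example-" + i);
        }
        List<List<String>> parts = split(examples, 80, 10, 42L);
        System.out.println("train=" + parts.get(0).size()
                + " dev=" + parts.get(1).size()
                + " test=" + parts.get(2).size());

        // Toy stand-in for "train with this value, then score on the dev set".
        DoubleUnaryOperator toyDevScore = value -> -Math.abs(value - 0.3);
        double chosen = pickBest(new double[] {0.1, 0.3, 0.5}, toyDevScore);
        System.out.println("chosen parameter value: " + chosen);
        // Only after this choice is fixed would the test part be used, and only once.
    }
}
```

The point of the design is that the test part is never consulted while choosing the parameter value, so the final number measured on it stays an honest estimate.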

How would I train my own model with a raw, unparsed text file full of tweets?

You can't and you shouldn't train using an unparsed set of documents. The entire point of the recursive deep model (and the reason it performs so well) is that it can learn from the sentiment annotations at every level of the parse tree. The sentence you have given above can be formatted like this:

(4 
    (4 
        (2 A) 
        (4 
            (3 (3 warm) (2 ,)) (3 funny)
        )
    ) 
    (3 
        (2 ,) 
        (3 
            (4 (4 engaging) (2 film)) (2 .)
        )
    )
)

Usually, a sentiment analyser is trained with document-level annotations. You have only one score, and it applies to the document as a whole, ignoring the fact that phrases within the document may express different sentiments. The Stanford team put a lot of effort into annotating every phrase in the document for sentiment. For example, the word film on its own is neutral in sentiment: (2 film). However, the phrase engaging film is very positive: (4 (4 engaging) (2 film)).
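
To see the phrase-level labels for yourself, the following is a rough sketch of a reader for the bracketed tree format shown above, in plain Java with no CoreNLP dependency. The class SentimentTreeSketch and its methods are hypothetical names invented for this example (CoreNLP has its own tree classes, which are not used here), and the parser assumes well-formed input in which every node starts with an integer label.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for the labeled tree format; not the CoreNLP implementation.
public class SentimentTreeSketch {

    /** A tree node: a sentiment label plus either a word (leaf) or child nodes. */
    static final class Node {
        final int label;
        final String word; // null for internal nodes
        final List<Node> children = new ArrayList<>();

        Node(int label, String word) {
            this.label = label;
            this.word = word;
        }

        /** The words under this node, joined by single spaces. */
        String text() {
            if (word != null) {
                return word;
            }
            StringBuilder sb = new StringBuilder();
            for (Node child : children) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(child.text());
            }
            return sb.toString();
        }
    }

    private final String s;
    private int pos = 0;

    private SentimentTreeSketch(String s) {
        this.s = s;
    }

    /** Parses one tree such as "(4 (4 engaging) (2 film))". */
    static Node parse(String tree) {
        return new SentimentTreeSketch(tree.trim()).parseNode();
    }

    private char peek() {
        return pos < s.length() ? s.charAt(pos) : '\0';
    }

    private void skipSpaces() {
        while (Character.isWhitespace(peek())) {
            pos++;
        }
    }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    private Node parseNode() {
        skipSpaces();
        expect('(');
        skipSpaces();
        int start = pos;
        while (Character.isDigit(peek())) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected a numeric label at position " + pos);
        }
        int label = Integer.parseInt(s.substring(start, pos));
        skipSpaces();
        if (peek() == '(') {
            // Internal node: one or more child trees follow the label.
            Node node = new Node(label, null);
            while (peek() == '(') {
                node.children.add(parseNode());
                skipSpaces();
            }
            expect(')');
            return node;
        }
        // Leaf node: a single word follows the label.
        start = pos;
        while (peek() != '\0' && peek() != ')' && !Character.isWhitespace(peek())) {
            pos++;
        }
        Node leaf = new Node(label, s.substring(start, pos));
        skipSpaces();
        expect(')');
        return leaf;
    }

    /** Collects "label<TAB>phrase" for the node and everything below it, top down. */
    static void collect(Node node, List<String> out) {
        out.add(node.label + "\t" + node.text());
        for (Node child : node.children) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        Node root = parse("(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) "
                + "(3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))");
        List<String> phrases = new ArrayList<>();
        collect(root, phrases);
        for (String line : phrases) {
            System.out.println(line);
        }
    }
}
```

Running it on the sample sentence lists every sub-phrase with its own label, which is exactly the extra supervision that a file of raw, unlabeled tweets does not provide.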

If you have labelled tweets, you can use any other document-level sentiment classifier. The sentiment-analysis tag on Stack Overflow already has some very good answers, so I'm not going to repeat them here.

PS Did you label the tweets you have? All 1 million of them? If you did, I'd like to pay you a lot of money for that file :)

Answered by arachnode.net

Answered by user1466472

If it helps, I got the C# code from Arachnode working very easily - a tweak or two to get the right paths for models and so on, but it then works great. What was missing was something about the right format for the input files. It's in the Javadoc, but for reference, for BuildBinarizedDataset it's something like:

2 line of text here

0 another line of text 

1 yet another line of text

etc

Building that is pretty trivial, depending on what you're starting with (a database, an Excel file, whatever).
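
As a rough illustration of building that file, here is a small plain-Java sketch that turns labeled examples into the "label text" layout shown above. Everything in it is an assumption for illustration: the class BinarizerInputSketch, the Labeled holder, and the format method are made-up names, and the blank line written between examples simply mirrors the sample above, so check the BuildBinarizedDataset Javadoc before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical formatter for BuildBinarizedDataset-style input; not part of CoreNLP.
public class BinarizerInputSketch {

    /** One document-level example: an integer label and the raw text. */
    static final class Labeled {
        final int label;
        final String text;

        Labeled(int label, String text) {
            this.label = label;
            this.text = text;
        }
    }

    /** Formats each example as "label text" on one line, followed by a blank line. */
    static String format(List<Labeled> examples) {
        StringBuilder sb = new StringBuilder();
        for (Labeled example : examples) {
            // Collapse line breaks and repeated spaces so each tweet stays on a single line.
            String oneLine = example.text.replaceAll("\\s+", " ").trim();
            sb.append(example.label).append(' ').append(oneLine).append("\n\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Labeled> examples = new ArrayList<>();
        examples.add(new Labeled(2, "line of text here"));
        examples.add(new Labeled(0, "another line of text"));
        examples.add(new Labeled(1, "yet another line of text"));
        // Printed to standard output here; redirect it to a file to feed the tool.
        System.out.print(format(examples));
    }
}
```

Whatever the starting point (database rows, spreadsheet cells), the only real work is mapping each record to a label and a single line of text.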