Python: Prepare data for text classification using Scikit Learn SVM

Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow post: http://stackoverflow.com/questions/13942744/

Tags: python, svm, scikit-learn

Asked by user1906856

I'm trying to apply the SVM from Scikit Learn to classify the tweets I collected. There will be two categories; call them A and B. For now, I have all the tweets sorted into two text files, 'A.txt' and 'B.txt'. However, I'm not sure what kind of input data the Scikit Learn SVM expects. I have a dictionary with the labels (A and B) as its keys and, as values, a dictionary of features (unigrams) and their frequencies. Sorry, I'm really new to machine learning and not sure what I should do to get the SVM to work. I also found that the SVM uses numpy.ndarray as the type of its input data. Do I need to create one from my own data? Should it look something like this?

Labels    features    frequency
  A        'book'        54
  B       'movies'       32

Any help is appreciated.

Answered by ogrisel

Have a look at the documentation on text feature extraction.
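
As a minimal sketch of what that feature extraction might look like: scikit-learn estimators expect one row per sample (here, one per tweet) plus a parallel sequence of labels, rather than one row per (label, feature) pair. The tweet strings and variable names below are made-up stand-ins for the contents of 'A.txt' and 'B.txt'; CountVectorizer builds unigram counts from raw text, and DictVectorizer does the same when per-tweet frequency dictionaries already exist.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for a few lines read from A.txt and B.txt.
tweets_a = ["I love this book", "great book about history"]
tweets_b = ["new movies this weekend", "I love movies"]

# One entry per tweet, with a label list of the same length.
texts = tweets_a + tweets_b
labels = ["A"] * len(tweets_a) + ["B"] * len(tweets_b)

# Rows are tweets, columns are unigrams, values are counts.
# The result is a scipy sparse matrix, which the linear models accept directly.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(X.shape)

# If unigram frequencies are already stored as one dict per tweet,
# DictVectorizer turns them into the same kind of matrix.
counts = [{"book": 1, "love": 1}, {"movies": 1, "love": 1}]
dict_vectorizer = DictVectorizer()
X_from_dicts = dict_vectorizer.fit_transform(counts)
print(dict_vectorizer.feature_names_)
```

Either matrix can then be passed, together with the labels, to an estimator's fit method; there is no need to build a numpy.ndarray by hand.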

Also have a look at the text classification example.

There is also a tutorial here:

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In particular, don't focus too much on SVM models (especially not sklearn.svm.SVC, which is more interesting for kernel models and hence not for text classification): a simple Perceptron, LogisticRegression, or Bernoulli naive Bayes model might work just as well while being much faster to train.
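
A rough sketch of how such models could be tried side by side, assuming a handful of made-up tweets in place of the real data. Each classifier is wrapped in a pipeline with TfidfVectorizer so that raw strings can be passed straight in; LinearSVC is included as the linear SVM variant, and the choice of TF-IDF weighting and default hyperparameters is an assumption for illustration, not a recommendation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the tweets in A.txt and B.txt.
texts = [
    "I love this book",
    "great book about history",
    "new movies this weekend",
    "I love movies",
]
labels = ["A", "A", "B", "B"]

# Linear models that are usually quick to train on sparse text features.
models = {
    "LinearSVC": LinearSVC(),
    "Perceptron": Perceptron(random_state=0),
    "LogisticRegression": LogisticRegression(),
    "BernoulliNB": BernoulliNB(),
}

new_tweets = ["a book I love", "movies tonight"]
predictions = {}
for name, model in models.items():
    # The pipeline vectorizes the text, then fits the classifier.
    clf = make_pipeline(TfidfVectorizer(), model)
    clf.fit(texts, labels)
    predictions[name] = clf.predict(new_tweets)
    print(name, list(predictions[name]))
```

On real data, the models would be compared on held-out tweets (for example with cross-validation) rather than on a few hand-picked strings.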
