
Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/13788229/

Date: 2020-08-18 09:37:24  Source: igfitidea

Very simple text classification by machine learning?

python, algorithm, machine-learning, text-analysis

Asked by Dieter

Possible Duplicate:
Text Classification into Categories


I am currently working on a solution to determine the type of food served by each of the 10k restaurants in a database, based on their descriptions. I'm using lists of keywords to decide which kind of food is being served.


I have read a little about machine learning, but I have no practical experience with it at all. Can anyone explain whether/why it would be a better solution to a simple problem like this? I find accuracy more important than performance!


simplified example:


["China", "Chinese", "Rice", "Noodles", "Soybeans"]
["Belgium", "Belgian", "Fries", "Waffles", "Waterzooi"]

a possible description could be:


"Hong's Garden Restaurant offers savory, reasonably priced Chinese to our customers. If you find that you have a sudden craving for rice, noodles or soybeans at 8 o'clock on a Saturday evening, don't worry! We're open seven days a week and offer carryout service. You can get fries here as well!"

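The keyword-list approach described in the question can be sketched in a few lines of plain Python. This is a minimal, hypothetical baseline (the keyword lists come from the simplified example above; the function name is my own):

```python
import re

# Hypothetical keyword lists, taken from the simplified example above.
CUISINE_KEYWORDS = {
    "Chinese": ["china", "chinese", "rice", "noodles", "soybeans"],
    "Belgian": ["belgium", "belgian", "fries", "waffles", "waterzooi"],
}

def classify_by_keywords(description):
    """Return the cuisine whose keyword list matches the description most often."""
    words = re.findall(r"[a-z']+", description.lower())
    scores = {
        cuisine: sum(words.count(kw) for kw in keywords)
        for cuisine, keywords in CUISINE_KEYWORDS.items()
    }
    return max(scores, key=scores.get)

text = ("Hong's Garden Restaurant offers savory, reasonably priced Chinese "
        "to our customers. If you find that you have a sudden craving for rice, "
        "noodles or soybeans at 8 o'clock on a Saturday evening, don't worry! "
        "We're open seven days a week and offer carryout service. "
        "You can get fries here as well!")
print(classify_by_keywords(text))  # "Chinese" (4 keyword hits vs. 1 for Belgian)
```

Note that "fries" gives a spurious hit for Belgian here; ties and ambiguous descriptions are exactly where this naive approach breaks down, which motivates the machine-learning answer below.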

Accepted answer by amit

You are indeed describing a classification problem, which can be solved with machine learning.


In this problem, your features are the words in the description. You should use the Bag of Words model - which basically says that the words and the number of occurrences of each word are what matter to the classification process.


To solve your problem, here are the steps you should follow:


  1. Create a feature extractor - that, given a description of a restaurant, returns the "features" (under the Bag of Words model explained above) of this restaurant (denoted as an example in the literature).
  2. Manually label a set of examples; each will be labeled with the desired class (Chinese, Belgian, junk food, ...).
  3. Feed your labeled examples into a learning algorithm. It will generate a classifier. From personal experience, SVM usually gives the best results, but there are other choices such as Naive Bayes, Neural Networks and Decision Trees (usually C4.5 is used), each with its own advantages.
  4. When a new (unlabeled) example (restaurant) comes - extract the features and feed it to your classifier - it will tell you what it thinks it is (and usually - the probability that the classifier is correct).
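Steps 1-3 can be sketched end-to-end in pure Python. This is a minimal illustration, not production code: the training examples are made up, and Naive Bayes is used instead of SVM because it fits in a few lines with no external library:

```python
import math
import re
from collections import Counter, defaultdict

def extract_features(description):
    # Step 1: Bag of Words - lowercase word counts are the features.
    return Counter(re.findall(r"[a-z']+", description.lower()))

# Step 2: a tiny hand-labeled training set (hypothetical examples).
training = [
    ("We serve rice, noodles and soybeans daily.", "Chinese"),
    ("Authentic Chinese dishes: rice and noodles.", "Chinese"),
    ("Belgian fries and waffles, plus waterzooi.", "Belgian"),
    ("Try our waffles, a Belgian classic.", "Belgian"),
]

# Step 3: train a classifier. Multinomial Naive Bayes is one of the choices
# the answer lists (SVM would need an external library such as libsvm).
class NaiveBayes:
    def fit(self, examples):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()
        for text, label in examples:
            feats = extract_features(text)
            self.word_counts[label] += feats
            self.class_counts[label] += 1
            self.vocab |= set(feats)
        return self

    def predict(self, text):
        # Step 4: score each class by log prior + log likelihood of the words.
        feats = extract_features(text)
        best, best_lp = None, float("-inf")
        for label, n in self.class_counts.items():
            lp = math.log(n / sum(self.class_counts.values()))
            total = sum(self.word_counts[label].values())
            for word, count in feats.items():
                # Laplace smoothing so unseen words don't zero the probability.
                p = (self.word_counts[label][word] + 1) / (total + len(self.vocab))
                lp += count * math.log(p)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

clf = NaiveBayes().fit(training)
print(clf.predict("A sudden craving for rice and noodles?"))  # "Chinese"
```

With a real 10k-restaurant database you would label a few hundred descriptions by hand and let the classifier generalize to the rest.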


Evaluation:
Evaluation of your algorithm can be done with cross-validation, or by separating a test set out of your labeled examples that will be used only for evaluating how accurate the algorithm is.

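The held-out-test-set evaluation can be sketched as follows (a minimal version in pure Python; the helper names and the trivial stand-in classifier are my own, for illustration only):

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle and hold out a fraction of the labeled examples for evaluation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(classify, test_set):
    """Fraction of held-out examples the classifier labels correctly."""
    hits = sum(1 for text, label in test_set if classify(text) == label)
    return hits / len(test_set)

# A trivial stand-in classifier, for demonstration only.
labeled = [("rice noodles", "Chinese"), ("fries waffles", "Belgian")] * 10
train, test = train_test_split(labeled)
always_chinese = lambda text: "Chinese"
print(accuracy(always_chinese, test))
```

The key point is that the test examples are never shown to the learning algorithm; cross-validation repeats this split several times and averages the accuracy.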



Optimizations:


From personal experience - here are some optimizations I found helpful for the feature extraction:


  1. Stemming and eliminating stop words usually helps a lot.
  2. Using Bi-Grams tends to improve accuracy (though it increases the feature space significantly).
  3. Some classifiers are prone to problems with a large feature space (SVM not included); there are some ways to overcome this, such as decreasing the dimensionality of your features. PCA is one thing that can help you with it. Genetic Algorithms are also (empirically) pretty good for subset selection.
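Stop-word removal and bi-gram extraction (optimizations 1 and 2) can be sketched like this; the stop-word list here is a tiny made-up subset, and real stemming would need a library such as the one mentioned below:

```python
import re

# A tiny hypothetical stop-word list; real lists have a few hundred entries.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "for", "to", "we", "our"}

def tokenize(text):
    """Lowercase, split into words, and drop stop words (optimization 1)."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOP_WORDS]

def bigrams(tokens):
    """Adjacent word pairs as extra features (optimization 2)."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

tokens = tokenize("We offer fries and Belgian waffles to our customers")
print(tokens)           # ['offer', 'fries', 'belgian', 'waffles', 'customers']
print(bigrams(tokens))  # ['offer fries', 'fries belgian', ...]
```

Bi-grams such as "belgian waffles" carry information that the individual words "belgian" and "waffles" lose, at the cost of a much larger vocabulary.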


Libraries:


Unfortunately, I am not fluent enough with python, but here are some libraries that might be helpful:


  • Lucene might help you a lot with the text analysis; for example, stemming can be done with EnglishAnalyzer. There is a Python version of Lucene called PyLucene, which I believe might help you out.
  • Weka is an open source library that implements a lot of useful things for Machine Learning - many classifiers and feature selectors included.
  • Libsvm is a library that implements the SVM algorithm.