
Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/13788229/

Date: 2020-08-18 09:37:24  Source: igfitidea

Very simple text classification by machine learning?

python, algorithm, machine-learning, text-analysis

Asked by Dieter

Possible Duplicate:
Text Classification into Categories


I am currently working on a solution to determine the type of food served by each of the 10k restaurants in a database, based on their descriptions. I'm using lists of keywords to decide which kind of food is being served.


I have read a little about machine learning, but I have no practical experience with it at all. Can anyone explain whether/why it would be a better solution to a simple problem like this? I find accuracy more important than performance!


simplified example:


["China", "Chinese", "Rice", "Noodles", "Soybeans"]
["Belgium", "Belgian", "Fries", "Waffles", "Waterzooi"]

a possible description could be:


"Hong's Garden Restaurant offers savory, reasonably priced Chinese to our customers. If you find that you have a sudden craving for rice, noodles or soybeans at 8 o'clock on a Saturday evening, don't worry! We're open seven days a week and offer carryout service. You can get fries here as well!"

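The keyword-list approach described in the question can be sketched in a few lines of plain Python. This is a minimal, hypothetical baseline (the keyword lists come from the simplified example above; the function name is my own):

```python
import re

# Hypothetical keyword lists, taken from the simplified example above.
CUISINE_KEYWORDS = {
    "Chinese": ["china", "chinese", "rice", "noodles", "soybeans"],
    "Belgian": ["belgium", "belgian", "fries", "waffles", "waterzooi"],
}

def classify_by_keywords(description):
    """Return the cuisine whose keyword list matches the description most often."""
    words = re.findall(r"[a-z']+", description.lower())
    scores = {
        cuisine: sum(words.count(kw) for kw in keywords)
        for cuisine, keywords in CUISINE_KEYWORDS.items()
    }
    return max(scores, key=scores.get)

text = ("Hong's Garden Restaurant offers savory, reasonably priced Chinese "
        "to our customers. If you find that you have a sudden craving for rice, "
        "noodles or soybeans at 8 o'clock on a Saturday evening, don't worry! "
        "We're open seven days a week and offer carryout service. "
        "You can get fries here as well!")
print(classify_by_keywords(text))  # "Chinese" (4 keyword hits vs. 1 for Belgian)
```

Note that "fries" gives a spurious hit for Belgian here; ties and ambiguous descriptions are exactly where this naive approach breaks down, which motivates the machine-learning answer below.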

Accepted answer by amit

You are indeed describing a classification problem, which can be solved with machine learning.


In this problem, your features are the words in the description. You should use the Bag of Words model - which basically says that the words and the number of occurrences of each word are what matter to the classification process.


To solve your problem, here are the steps you should follow:


  1. Create a feature extractor - that, given a description of a restaurant, returns the "features" (under the Bag of Words model explained above) of this restaurant (denoted as an example in the literature).
  2. Manually label a set of examples; each will be labeled with the desired class (Chinese, Belgian, junk food, ...).
  3. Feed your labeled examples into a learning algorithm. It will generate a classifier. From personal experience, SVM usually gives the best results, but there are other choices such as Naive Bayes, Neural Networks and Decision Trees (usually C4.5 is used), each with its own advantages.
  4. When a new (unlabeled) example (restaurant) comes - extract the features and feed it to your classifier - it will tell you what it thinks it is (and usually - the probability that the classifier is correct).
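Steps 1-3 can be sketched end-to-end in pure Python. This is a minimal illustration, not production code: the training examples are made up, and Naive Bayes is used instead of SVM because it fits in a few lines with no external library:

```python
import math
import re
from collections import Counter, defaultdict

def extract_features(description):
    # Step 1: Bag of Words - lowercase word counts are the features.
    return Counter(re.findall(r"[a-z']+", description.lower()))

# Step 2: a tiny hand-labeled training set (hypothetical examples).
training = [
    ("We serve rice, noodles and soybeans daily.", "Chinese"),
    ("Authentic Chinese dishes: rice and noodles.", "Chinese"),
    ("Belgian fries and waffles, plus waterzooi.", "Belgian"),
    ("Try our waffles, a Belgian classic.", "Belgian"),
]

# Step 3: train a classifier. Multinomial Naive Bayes is one of the choices
# the answer lists (SVM would need an external library such as libsvm).
class NaiveBayes:
    def fit(self, examples):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()
        for text, label in examples:
            feats = extract_features(text)
            self.word_counts[label] += feats
            self.class_counts[label] += 1
            self.vocab |= set(feats)
        return self

    def predict(self, text):
        # Step 4: score each class by log prior + log likelihood of the words.
        feats = extract_features(text)
        best, best_lp = None, float("-inf")
        for label, n in self.class_counts.items():
            lp = math.log(n / sum(self.class_counts.values()))
            total = sum(self.word_counts[label].values())
            for word, count in feats.items():
                # Laplace smoothing so unseen words don't zero the probability.
                p = (self.word_counts[label][word] + 1) / (total + len(self.vocab))
                lp += count * math.log(p)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

clf = NaiveBayes().fit(training)
print(clf.predict("A sudden craving for rice and noodles?"))  # "Chinese"
```

With a real 10k-restaurant database you would label a few hundred descriptions by hand and let the classifier generalize to the rest.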


Evaluation:
Evaluation of your algorithm can be done with cross-validation, or by separating a test set out of your labeled examples that will be used only for evaluating how accurate the algorithm is.

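The held-out-test-set evaluation can be sketched as follows (a minimal version in pure Python; the helper names and the trivial stand-in classifier are my own, for illustration only):

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle and hold out a fraction of the labeled examples for evaluation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(classify, test_set):
    """Fraction of held-out examples the classifier labels correctly."""
    hits = sum(1 for text, label in test_set if classify(text) == label)
    return hits / len(test_set)

# A trivial stand-in classifier, for demonstration only.
labeled = [("rice noodles", "Chinese"), ("fries waffles", "Belgian")] * 10
train, test = train_test_split(labeled)
always_chinese = lambda text: "Chinese"
print(accuracy(always_chinese, test))
```

The key point is that the test examples are never shown to the learning algorithm; cross-validation repeats this split several times and averages the accuracy.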



Optimizations:


From personal experience - here are some optimizations I found helpful for the feature extraction:


  1. Stemming and eliminating stop words usually helps a lot.
  2. Using Bi-Grams tends to improve accuracy (though it increases the feature space significantly).
  3. Some classifiers are prone to problems with a large feature space (SVM not included); there are some ways to overcome this, such as decreasing the dimensionality of your features. PCA is one thing that can help you with it. Genetic Algorithms are also (empirically) pretty good for subset selection.
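Stop-word removal and bi-gram extraction (optimizations 1 and 2) can be sketched like this; the stop-word list here is a tiny made-up subset, and real stemming would need a library such as the one mentioned below:

```python
import re

# A tiny hypothetical stop-word list; real lists have a few hundred entries.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "for", "to", "we", "our"}

def tokenize(text):
    """Lowercase, split into words, and drop stop words (optimization 1)."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOP_WORDS]

def bigrams(tokens):
    """Adjacent word pairs as extra features (optimization 2)."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

tokens = tokenize("We offer fries and Belgian waffles to our customers")
print(tokens)           # ['offer', 'fries', 'belgian', 'waffles', 'customers']
print(bigrams(tokens))  # ['offer fries', 'fries belgian', ...]
```

Bi-grams such as "belgian waffles" carry information that the individual words "belgian" and "waffles" lose, at the cost of a much larger vocabulary.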


Libraries:


Unfortunately, I am not fluent enough with python, but here are some libraries that might be helpful:


  • Lucene might help you a lot with the text analysis; for example, stemming can be done with EnglishAnalyzer. There is a Python version of Lucene called PyLucene, which I believe might help you out.
  • Weka is an open source library that implements a lot of useful things for Machine Learning - many classifiers and feature selectors included.
  • Libsvm is a library that implements the SVM algorithm.