Python: Passing categorical data to a Sklearn decision tree

Question

Asked by 0xhfff

There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says this:

Some advantages of decision trees are:
(...)
Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. See algorithms for more information.

But running the following script

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']

tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])

outputs the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b

I know that in R it is possible to pass categorical data directly. Is that possible with Sklearn?

Answer 1

Accepted answer by Abhinav Arora

Contrary to the accepted answer, I would prefer to use tools provided by Scikit-Learn for this purpose. The main reason for doing so is that they can be easily integrated in a Pipeline.

Scikit-Learn itself provides very good classes for handling categorical data. Instead of writing your own custom function, you should use LabelEncoder, which is specially designed for this purpose.

Refer to the following code from the documentation:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])

This automatically encodes the strings into numbers for your machine learning algorithms. It also supports going back from integers to strings, which you can do by calling inverse_transform as follows:

list(le.inverse_transform([2, 2, 1]))

This would return ['tokyo', 'tokyo', 'paris'].

Also note that for many other classifiers apart from decision trees, such as logistic regression or SVM, you will want to encode your categorical variables using One-Hot encoding. Scikit-learn supports this as well, through the OneHotEncoder class.
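
As a rough sketch of the Pipeline integration mentioned above, the snippet below one-hot encodes the string columns from the question and passes the numeric column through unchanged. The use of ColumnTransformer, the step names, and the `handle_unknown="ignore"` and `random_state=0` settings are assumptions made for illustration, not something the original answer prescribes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})

# One-hot encode the string columns; leave the numeric column "C" as it is.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["A", "B"])],
    remainder="passthrough",
)

# Encoding and the tree are fitted together as a single estimator.
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

model.fit(data[["A", "B", "C"]], data["Class"])
predictions = list(model.predict(data[["A", "B", "C"]]))
print(predictions)
```

Because the encoder lives inside the pipeline, the same encoding fitted on the training data is reused whenever `model.predict` is called on new rows.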

Hope this helps!

Answer 2

Answer by James Owers

(This is just a reformat of my comment above from 2016... it still holds true.)

The accepted answer for this question is misleading.

As it stands, sklearn decision trees do not handle categorical data - see issue #5442.

The recommended approach of using Label Encoding converts the categories to integers, which DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good: you'll end up with splits that do not make sense.
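
A small sketch can make the ordering problem concrete. The colour feature and labels below are hypothetical, chosen so that only the alphabetically "middle" category is positive. With integer codes, the tree can only split on thresholds along the implied order, so it takes two splits to isolate that category; with one-hot columns a single split is enough.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical nominal feature: only "green" belongs to the positive class.
colors = ["blue", "green", "red"] * 4
labels = [0, 1, 0] * 4

# Label encoding assigns codes in sorted order: blue -> 0, green -> 1, red -> 2.
codes = LabelEncoder().fit_transform(colors).reshape(-1, 1)
tree_codes = DecisionTreeClassifier(random_state=0).fit(codes, labels)

# One-hot encoding produces one 0/1 column per colour.
one_hot = OneHotEncoder().fit_transform([[c] for c in colors]).toarray()
tree_one_hot = DecisionTreeClassifier(random_state=0).fit(one_hot, labels)

# Compare how deep each tree had to grow to separate the classes.
depth_codes = tree_codes.get_depth()
depth_one_hot = tree_one_hot.get_depth()
print(depth_codes, depth_one_hot)
```

Here the integer-coded tree needs depth 2 (a threshold on each side of the code for "green"), while the one-hot tree reaches the same result at depth 1 by splitting on the "green" column.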

Using a OneHotEncoder is currently the only valid way: it allows arbitrary splits that do not depend on the label ordering, but it is computationally expensive.

Answer 3

Answer by Guillaume

(..)
Able to handle both numerical and categorical data.

This only means that you can use:

the DecisionTreeClassifier class for classification problems
the DecisionTreeRegressor class for regression.

In any case you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']

tree = DecisionTreeClassifier()

one_hot_data = pd.get_dummies(data[['A','B','C']],drop_first=True)
tree.fit(one_hot_data, data['Class'])
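
To see what the tree actually receives in the script above, the following sketch repeats the same steps and inspects the encoded columns. The variable names `features` and `predictions` and the `random_state=0` setting are illustrative additions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})

# String columns become indicator columns; the numeric column "C" is kept as is.
# drop_first=True drops the first level of each category to avoid redundancy.
features = pd.get_dummies(data[["A", "B", "C"]], drop_first=True)
column_names = list(features.columns)
print(column_names)

# The tree now fits without the "could not convert string to float" error.
tree = DecisionTreeClassifier(random_state=0).fit(features, data["Class"])
predictions = list(tree.predict(features))
print(predictions)
```

Note that pd.get_dummies only encodes the categories present in the frame it is given, so new data must be encoded to the same set of columns before calling `predict`.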

Answer 4

Answer by Cédric Gaudissart

For nominal categorical variables, I would not use LabelEncoder but sklearn.preprocessing.OneHotEncoder or pandas.get_dummies instead, because there is usually no order in these types of variables.

Answer 5

Answer by mrwyatt

Sklearn Decision Trees do not handle conversion of categorical strings to numbers. I suggest you find a function in Sklearn (maybe this) that does so or manually write some code like:

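def cat2int(column):
    # Map each distinct value to an integer code.
    # Sorting makes the mapping reproducible (plain set order is arbitrary).
    codes = {value: i for i, value in enumerate(sorted(set(column)))}
    # Return a new list rather than overwriting the input in place.
    return [codes[value] for value in column]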
