Original question: http://stackoverflow.com/questions/38108832/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Passing categorical data to Sklearn Decision Tree
Asked by 0xhfff
There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says this:
Some advantages of decision trees are:
(...)
Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. See algorithms for more information.
But running the following script
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])
outputs the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b
I know that in R it is possible to pass categorical data directly. Is it possible with Sklearn?
Accepted answer by Abhinav Arora
Contrary to the accepted answer, I would prefer to use tools provided by Scikit-Learn for this purpose. The main reason for doing so is that they can be easily integrated in a Pipeline.
Scikit-Learn itself provides very good classes to handle categorical data. Instead of writing your own custom function, you should use LabelEncoder, which is specially designed for this purpose.
Refer to the following code from the documentation:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])
This automatically encodes the strings into numbers for your machine learning algorithms. It also supports going back from integers to strings, which you can do by simply calling inverse_transform as follows:
list(le.inverse_transform([2, 2, 1]))
This would return ['tokyo', 'tokyo', 'paris'].
Also note that for many classifiers other than decision trees, such as logistic regression or SVM, you would want to encode your categorical variables using one-hot encoding. Scikit-learn supports this as well through the OneHotEncoder class.
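As a rough illustration of the OneHotEncoder class mentioned above, here is a minimal sketch that reuses the city names from the LabelEncoder example. It assumes a scikit-learn version in which OneHotEncoder accepts string columns directly (0.20 or later); the variable names are made up for this example.

```python
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2D input: one row per sample, one column per feature.
cities = [["paris"], ["paris"], ["tokyo"], ["amsterdam"]]

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(cities)

# Categories are stored in sorted order, one array per input column.
categories = enc.categories_[0]

# transform returns a sparse matrix by default; densify it for inspection.
encoded = enc.transform([["tokyo"], ["paris"]]).toarray()
print(list(categories))
print(encoded)
```

Each category gets its own 0/1 column, so no artificial ordering is introduced between the cities.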
Hope this helps!
Answer by James Owers
(This is just a reformat of my comment above from 2016... it still holds true.)
The accepted answer for this question is misleading.
As it stands, sklearn decision trees do not handle categorical data - see issue #5442.
The recommended approach of using Label Encoding converts to integers which the DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good - you'll end up with splits that do not make sense.
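To make the point about nonsensical splits concrete, here is a small sketch on a toy dataset invented for this illustration. The positive class is the alphabetically "middle" city, so a single threshold on the label-encoded integers cannot isolate it, while a single split on a one-hot column can. It assumes a scikit-learn version in which OneHotEncoder accepts strings (0.20 or later).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up for illustration): only "paris" is the positive class.
cities = np.array(["amsterdam", "paris", "tokyo"] * 4)
y = (cities == "paris").astype(int)

# Label encoding assigns amsterdam=0, paris=1, tokyo=2 (sorted order),
# which the tree then treats as an ordered numeric feature.
X_label = LabelEncoder().fit_transform(cities).reshape(-1, 1)

# One-hot encoding gives each city its own 0/1 column.
X_onehot = OneHotEncoder().fit_transform(cities.reshape(-1, 1)).toarray()

# A depth-1 tree is allowed exactly one threshold split.
stump_label = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_label, y)
stump_onehot = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_onehot, y)

label_acc = stump_label.score(X_label, y)
onehot_acc = stump_onehot.score(X_onehot, y)
print("label-encoded stump accuracy:", label_acc)
print("one-hot stump accuracy:", onehot_acc)
```

A deeper tree can still carve out the middle value with two thresholds, so label encoding does not make the problem unsolvable; it just forces the tree to work around an ordering that carries no meaning.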
Using a OneHotEncoder is currently the only valid way, allowing arbitrary splits that do not depend on the label ordering, but it is computationally expensive.
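One possible way to wire this up on the data from the question is sketched below. It assumes scikit-learn 0.20 or later, where ColumnTransformer is available; the step names "onehot", "preprocess", and "tree" are arbitrary labels chosen for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question.
data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})

# One-hot encode the string columns; pass the numeric column through as-is.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["A", "B"])],
    remainder="passthrough",
)

# Chaining the encoder and the tree means the same encoding is applied
# at fit time and at predict time.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

features = data[["A", "B", "C"]]
model.fit(features, data["Class"])
predictions = model.predict(features)
print(list(predictions))
```

Keeping the encoder inside the pipeline avoids having to re-encode new data by hand before calling predict.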
Answer by Guillaume
(..)
Able to handle both numerical and categorical data.
This only means that you can use
- the DecisionTreeClassifier class for classification problems
- the DecisionTreeRegressor class for regression.
In any case you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
one_hot_data = pd.get_dummies(data[['A','B','C']],drop_first=True)
tree.fit(one_hot_data, data['Class'])
Answer by Cédric Gaudissart
For nominal categorical variables, I would not use LabelEncoder but sklearn.preprocessing.OneHotEncoder or pandas.get_dummies instead, because there is usually no order in these types of variables.
Answer by mrwyatt
Sklearn decision trees do not handle conversion of categorical strings to numbers. I suggest you find a function in Sklearn (maybe this) that does so, or manually write some code like:
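def cat2int(column):
    # Sort the distinct values so the string-to-integer mapping is stable across runs.
    vals = sorted(set(column))
    codes = {val: i for i, val in enumerate(vals)}
    # Return a new list of integer codes instead of mutating the input in place.
    return [codes[val] for val in column]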