Can sklearn random forest directly handle categorical features?

Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/24715230/
Asked by hahdawg
Say I have a categorical feature, color, which takes the values
['red', 'blue', 'green', 'orange'],
and I want to use it to predict something in a random forest. If I one-hot encode it (i.e. I change it to four dummy variables), how do I tell sklearn that the four dummy variables are really one variable? Specifically, when sklearn is randomly selecting features to use at different nodes, it should either include the red, blue, green and orange dummies together, or it shouldn't include any of them.
I've heard that there's no way to do this, but I'd imagine there must be a way to deal with categorical variables without arbitrarily coding them as numbers or something like that.
Accepted answer by Fred Foo
No, there isn't. Somebody's working on this and the patch might be merged into mainline some day, but right now there's no support for categorical variables in scikit-learn except dummy (one-hot) encoding.
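For concreteness, here is a minimal sketch of that dummy-encoding workaround using scikit-learn's OneHotEncoder (it accepts string categories in scikit-learn 0.20 and later); the toy colors and labels below are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Invented toy data: one categorical feature and a binary target.
colors = np.array([['red'], ['blue'], ['green'], ['orange'], ['red'], ['green']])
y = np.array([0, 1, 1, 0, 0, 1])

# One-hot encode: each color value becomes its own 0/1 dummy column.
encoder = OneHotEncoder()
X = encoder.fit_transform(colors).toarray()  # shape (6, 4)

# The forest sees four independent dummy features, not one categorical feature.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(encoder.transform([['blue']]).toarray()))
```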
Answered by Hemanth Kondapalli
You have to make the categorical variable into a series of dummy variables. Yes, I know it's annoying and seems unnecessary, but that is how sklearn works. If you are using pandas, use pd.get_dummies; it works really well.
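A short sketch of that pd.get_dummies approach, with a made-up DataFrame and target column:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up example frame with a single categorical column and a target.
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'orange', 'red', 'green'],
    'target': [0, 1, 1, 0, 0, 1],
})

# Expands 'color' into color_blue, color_green, color_orange, color_red.
X = pd.get_dummies(df[['color']])
y = df['target']

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(list(X.columns))
```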
Answered by denson
Most implementations of random forest (and many other machine learning algorithms) that accept categorical inputs are either just automating the encoding of categorical features for you or using a method that becomes computationally intractable for large numbers of categories.
A notable exception is H2O. H2O has a very efficient method for handling categorical data directly, which often gives it an edge over tree-based methods that require one-hot encoding.
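For comparison, a rough sketch of what that looks like with H2O's Python API; the frame, column names, and settings below are invented for illustration and assume the h2o package is installed:

```python
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()

# Invented toy frame; marking the column as a factor tells H2O it is
# categorical, and the forest consumes it directly with no dummy encoding.
frame = h2o.H2OFrame({
    'color': ['red', 'blue', 'green', 'orange', 'red', 'green'],
    'target': [0, 1, 1, 0, 0, 1],
})
frame['color'] = frame['color'].asfactor()
frame['target'] = frame['target'].asfactor()

model = H2ORandomForestEstimator(ntrees=50, seed=1)
model.train(x=['color'], y='target', training_frame=frame)
print(model)
```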
This article by Will McGinnis has a very good discussion of one-hot-encoding and alternatives.