Can sklearn random forest directly handle categorical features?

Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/24715230/
Asked by hahdawg
Say I have a categorical feature, color, which takes the values
['red', 'blue', 'green', 'orange'],
and I want to use it to predict something in a random forest. If I one-hot encode it (i.e. I change it to four dummy variables), how do I tell sklearn that the four dummy variables are really one variable? Specifically, when sklearn is randomly selecting features to use at different nodes, it should either include the red, blue, green and orange dummies together, or it shouldn't include any of them.
I've heard that there's no way to do this, but I'd imagine there must be a way to deal with categorical variables without arbitrarily coding them as numbers or something like that.
Accepted answer by Fred Foo
No, there isn't. Somebody's working on this and the patch might be merged into mainline some day, but right now there's no support for categorical variables in scikit-learn except dummy (one-hot) encoding.
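For concreteness, here is a minimal sketch of that dummy-encoding workaround using scikit-learn's OneHotEncoder (it accepts string categories in scikit-learn 0.20 and later); the toy colors and labels below are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Invented toy data: one categorical feature and a binary target.
colors = np.array([['red'], ['blue'], ['green'], ['orange'], ['red'], ['green']])
y = np.array([0, 1, 1, 0, 0, 1])

# One-hot encode: each color value becomes its own 0/1 dummy column.
encoder = OneHotEncoder()
X = encoder.fit_transform(colors).toarray()  # shape (6, 4)

# The forest sees four independent dummy features, not one categorical feature.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(encoder.transform([['blue']]).toarray()))
```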
Answered by Hemanth Kondapalli
You have to make the categorical variable into a series of dummy variables. Yes, I know it's annoying and seems unnecessary, but that is how sklearn works. If you are using pandas, use pd.get_dummies; it works really well.
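A short sketch of that pd.get_dummies approach, with a made-up DataFrame and target column:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up example frame with a single categorical column and a target.
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'orange', 'red', 'green'],
    'target': [0, 1, 1, 0, 0, 1],
})

# Expands 'color' into color_blue, color_green, color_orange, color_red.
X = pd.get_dummies(df[['color']])
y = df['target']

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(list(X.columns))
```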
Answered by denson
Most implementations of random forest (and many other machine learning algorithms) that accept categorical inputs are either just automating the encoding of categorical features for you or using a method that becomes computationally intractable for large numbers of categories.
A notable exception is H2O. H2O has a very efficient method for handling categorical data directly, which often gives it an edge over tree-based methods that require one-hot encoding.
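For comparison, a rough sketch of what that looks like with H2O's Python API; the frame, column names, and settings below are invented for illustration and assume the h2o package is installed:

```python
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()

# Invented toy frame; marking the column as a factor tells H2O it is
# categorical, and the forest consumes it directly with no dummy encoding.
frame = h2o.H2OFrame({
    'color': ['red', 'blue', 'green', 'orange', 'red', 'green'],
    'target': [0, 1, 1, 0, 0, 1],
})
frame['color'] = frame['color'].asfactor()
frame['target'] = frame['target'].asfactor()

model = H2ORandomForestEstimator(ntrees=50, seed=1)
model.train(x=['color'], y='target', training_frame=frame)
print(model)
```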
This article by Will McGinnis has a very good discussion of one-hot-encoding and alternatives.