What are the pros and cons between get_dummies (Pandas) and OneHotEncoder (Scikit-learn)?
Disclaimer: this page is a translation of a popular StackOverflow question, licensed under CC BY-SA 4.0. If you reuse it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36631163/
Asked by O.rka
I'm learning different methods to convert categorical variables to numeric for machine-learning classifiers. I came across the pd.get_dummies method and sklearn.preprocessing.OneHotEncoder(), and I wanted to see how they differed in terms of performance and usage.
I found a tutorial on how to use OneHotEncoder() at https://xgdgsc.wordpress.com/2015/03/20/note-on-using-onehotencoder-in-scikit-learn-to-work-on-categorical-features/ since the sklearn documentation wasn't too helpful on this feature. I have a feeling I'm not doing it correctly... but
Can someone explain the pros and cons of using pd.get_dummies over sklearn.preprocessing.OneHotEncoder(), and vice versa? I know that OneHotEncoder() gives you a sparse matrix, but other than that I'm not sure how it is used and what the benefits are over the pandas method. Am I using it inefficiently?
import pandas as pd
import numpy as np
import seaborn as sns  # needed for sns.set() below; missing from the original snippet
from sklearn.datasets import load_iris
sns.set()
%matplotlib inline
#Iris Plot
iris = load_iris()
n_samples, m_features = iris.data.shape
#Load Data
X, y = iris.data, iris.target
D_target_dummy = dict(zip(np.arange(iris.target_names.shape[0]), iris.target_names))
DF_data = pd.DataFrame(X,columns=iris.feature_names)
DF_data["target"] = pd.Series(y).map(D_target_dummy)
#sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \
#0 5.1 3.5 1.4 0.2
#1 4.9 3.0 1.4 0.2
#2 4.7 3.2 1.3 0.2
#3 4.6 3.1 1.5 0.2
#4 5.0 3.6 1.4 0.2
#5 5.4 3.9 1.7 0.4
DF_dummies = pd.get_dummies(DF_data["target"])
#setosa versicolor virginica
#0 1 0 0
#1 1 0 0
#2 1 0 0
#3 1 0 0
#4 1 0 0
#5 1 0 0
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
def f1(DF_data):
    Enc_ohe, Enc_label = OneHotEncoder(), LabelEncoder()
    DF_data["Dummies"] = Enc_label.fit_transform(DF_data["target"])
    DF_dummies2 = pd.DataFrame(Enc_ohe.fit_transform(DF_data[["Dummies"]]).todense(),
                               columns=Enc_label.classes_)
    return DF_dummies2
%timeit pd.get_dummies(DF_data["target"])
#1000 loops, best of 3: 777 μs per loop
%timeit f1(DF_data)
#100 loops, best of 3: 2.91 ms per loop
Accepted answer by nos
OneHotEncoder cannot process string values directly (this applies to scikit-learn versions before 0.20). If your nominal features are strings, then you need to first map them into integers.
pandas.get_dummies is kind of the opposite. By default, it only converts string columns into a one-hot representation, unless columns are specified.
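A quick sketch of that default behavior, using a tiny made-up frame (the column names here are illustrative, not from the question):

```python
import pandas as pd

# A toy frame with one numeric and one string column.
df = pd.DataFrame({"size": [1, 2, 3], "color": ["red", "green", "red"]})

# By default, get_dummies expands only the string/object column;
# the numeric "size" column passes through unchanged.
dummies = pd.get_dummies(df)
print(dummies.columns.tolist())
# ['size', 'color_green', 'color_red']
```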
Answered by Denziloe
For machine learning, you almost definitely want to use sklearn.OneHotEncoder. For other tasks like simple analyses, you might be able to use pd.get_dummies, which is a bit more convenient.
Note that sklearn.OneHotEncoder has been updated in the latest version so that it does accept strings for categorical variables, as well as integers.
The crux of it is that the sklearn encoder creates a transformation which persists and can then be applied to new data sets which use the same categorical variables, with consistent results.
from sklearn.preprocessing import OneHotEncoder
# Create the encoder.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train) # Assume for simplicity all features are categorical.
# Apply the encoder.
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)
Note how we apply the same encoder we created via X_train to the new data set X_test.
Consider what happens if X_test contains different levels than X_train for one of its variables. For example, let's say X_train["color"] contains only "red" and "green", but in addition to those, X_test["color"] sometimes contains "blue".
If we use pd.get_dummies, X_test will end up with an additional "color_blue" column which X_train doesn't have, and the inconsistency will probably break our code later on, especially if we are feeding X_test to an sklearn model which we trained on X_train.
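A minimal sketch of that column mismatch, with made-up data:

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green"]})
test = pd.DataFrame({"color": ["red", "blue"]})

# The two frames produce different dummy columns, so a model fit on the
# train encoding cannot consume the test encoding directly.
print(pd.get_dummies(train).columns.tolist())  # ['color_green', 'color_red']
print(pd.get_dummies(test).columns.tolist())   # ['color_blue', 'color_red']
```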
And if we want to process the data like this in production, where we're receiving a single example at a time, pd.get_dummies won't be of use.
With sklearn.OneHotEncoder, on the other hand, once we've created the encoder, we can reuse it to produce the same output every time, with columns only for "red" and "green". And we can explicitly control what happens when it encounters the new level "blue": if we think that's impossible, then we can tell it to throw an error with handle_unknown="error"; otherwise we can tell it to continue and simply set the red and green columns to 0, with handle_unknown="ignore".
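A small sketch of handle_unknown="ignore" in action (calling .toarray() on the result so the example does not depend on the sparse/sparse_output keyword, whose name has changed across scikit-learn versions):

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["red"], ["green"]])  # learned categories, sorted: ['green', 'red']

# "green" maps to its own column; the unseen level "blue" becomes all zeros.
out = enc.transform([["green"], ["blue"]]).toarray()
print(out)
# [[1. 0.]
#  [0. 0.]]
```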
Answered by Carl
Why wouldn't you just cache or save the columns as a variable col_list from the resulting get_dummies, then use pd.reindex to align the train and test datasets? For example:
df = pd.get_dummies(data)
col_list = df.columns.tolist()
new_df = pd.get_dummies(new_data)
new_df = new_df.reindex(columns=col_list).fillna(0.00)
Answered by Sarah
I really like Carl's answer and upvoted it. I will just expand Carl's example a bit so that more people will hopefully appreciate that pd.get_dummies can handle unknowns. The two examples below show that pd.get_dummies can accomplish the same thing in handling unknowns as OneHotEncoder.
# data is from @dzieciou's comment above
>>> data = pd.DataFrame(pd.Series(['good','bad','worst','good', 'good', 'bad']))
# new_data has two values that data does not have.
>>> new_data = pd.DataFrame(
...     pd.Series(['good','bad','worst','good', 'good', 'bad','excellent', 'perfect']))
Using pd.get_dummies
>>> df = pd.get_dummies(data)
>>> col_list = df.columns.tolist()
>>> print(df)
0_bad 0_good 0_worst
0 0 1 0
1 1 0 0
2 0 0 1
3 0 1 0
4 0 1 0
5 1 0 0
>>> new_df = pd.get_dummies(new_data)
# handle unknown values by using .reindex and .fillna()
>>> new_df = new_df.reindex(columns=col_list).fillna(0.00)
>>> print(new_df)
# 0_bad 0_good 0_worst
# 0 0 1 0
# 1 1 0 0
# 2 0 0 1
# 3 0 1 0
# 4 0 1 0
# 5 1 0 0
# 6 0 0 0
# 7 0 0 0
Using OneHotEncoder
>>> encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)  # in scikit-learn >= 1.2, use sparse_output=False
>>> encoder.fit(data)
>>> encoder.transform(new_data)
# array([[0., 1., 0.],
# [1., 0., 0.],
# [0., 0., 1.],
# [0., 1., 0.],
# [0., 1., 0.],
# [1., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.]])