What are the pros and cons between get_dummies (Pandas) and OneHotEncoder (Scikit-learn)?
Disclaimer: this page is a translation of a popular StackOverflow question, licensed under CC BY-SA 4.0. If you reuse it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36631163/
Asked by O.rka
I'm learning different methods to convert categorical variables to numeric for machine-learning classifiers. I came across the pd.get_dummies method and sklearn.preprocessing.OneHotEncoder(), and I wanted to see how they differed in terms of performance and usage.
I found a tutorial on how to use OneHotEncoder() at https://xgdgsc.wordpress.com/2015/03/20/note-on-using-onehotencoder-in-scikit-learn-to-work-on-categorical-features/ since the sklearn documentation wasn't too helpful on this feature. I have a feeling I'm not doing it correctly... but
Can someone explain the pros and cons of using pd.get_dummies over sklearn.preprocessing.OneHotEncoder(), and vice versa? I know that OneHotEncoder() gives you a sparse matrix, but other than that I'm not sure how it is used and what the benefits are over the pandas method. Am I using it inefficiently?
import pandas as pd
import numpy as np
import seaborn as sns  # needed for sns.set() below; missing from the original snippet
from sklearn.datasets import load_iris
sns.set()
%matplotlib inline
#Iris Plot
iris = load_iris()
n_samples, m_features = iris.data.shape
#Load Data
X, y = iris.data, iris.target
D_target_dummy = dict(zip(np.arange(iris.target_names.shape[0]), iris.target_names))
DF_data = pd.DataFrame(X,columns=iris.feature_names)
DF_data["target"] = pd.Series(y).map(D_target_dummy)
#sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \
#0 5.1 3.5 1.4 0.2
#1 4.9 3.0 1.4 0.2
#2 4.7 3.2 1.3 0.2
#3 4.6 3.1 1.5 0.2
#4 5.0 3.6 1.4 0.2
#5 5.4 3.9 1.7 0.4
DF_dummies = pd.get_dummies(DF_data["target"])
#setosa versicolor virginica
#0 1 0 0
#1 1 0 0
#2 1 0 0
#3 1 0 0
#4 1 0 0
#5 1 0 0
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
def f1(DF_data):
    Enc_ohe, Enc_label = OneHotEncoder(), LabelEncoder()
    DF_data["Dummies"] = Enc_label.fit_transform(DF_data["target"])
    DF_dummies2 = pd.DataFrame(Enc_ohe.fit_transform(DF_data[["Dummies"]]).todense(),
                               columns=Enc_label.classes_)
    return DF_dummies2
%timeit pd.get_dummies(DF_data["target"])
#1000 loops, best of 3: 777 μs per loop
%timeit f1(DF_data)
#100 loops, best of 3: 2.91 ms per loop
Accepted answer by nos
OneHotEncoder cannot process string values directly (this applies to scikit-learn versions before 0.20). If your nominal features are strings, then you need to first map them into integers.
pandas.get_dummies is kind of the opposite. By default, it only converts string columns into a one-hot representation, unless columns are specified.
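A quick sketch of that default behavior, using a tiny made-up frame (the column names here are illustrative, not from the question):

```python
import pandas as pd

# A toy frame with one numeric and one string column.
df = pd.DataFrame({"size": [1, 2, 3], "color": ["red", "green", "red"]})

# By default, get_dummies expands only the string/object column;
# the numeric "size" column passes through unchanged.
dummies = pd.get_dummies(df)
print(dummies.columns.tolist())
# ['size', 'color_green', 'color_red']
```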
Answered by Denziloe
For machine learning, you almost definitely want to use sklearn.OneHotEncoder. For other tasks like simple analyses, you might be able to use pd.get_dummies, which is a bit more convenient.
Note that sklearn.OneHotEncoder has been updated in the latest version so that it does accept strings for categorical variables, as well as integers.
The crux of it is that the sklearn encoder creates a transformation which persists and can then be applied to new data sets which use the same categorical variables, with consistent results.
from sklearn.preprocessing import OneHotEncoder
# Create the encoder.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train) # Assume for simplicity all features are categorical.
# Apply the encoder.
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)
Note how we apply the same encoder we created via X_train to the new data set X_test.
Consider what happens if X_test contains different levels than X_train for one of its variables. For example, let's say X_train["color"] contains only "red" and "green", but in addition to those, X_test["color"] sometimes contains "blue".
If we use pd.get_dummies, X_test will end up with an additional "color_blue" column which X_train doesn't have, and the inconsistency will probably break our code later on, especially if we are feeding X_test to an sklearn model which we trained on X_train.
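A minimal sketch of that column mismatch, with made-up data:

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green"]})
test = pd.DataFrame({"color": ["red", "blue"]})

# The two frames produce different dummy columns, so a model fit on the
# train encoding cannot consume the test encoding directly.
print(pd.get_dummies(train).columns.tolist())  # ['color_green', 'color_red']
print(pd.get_dummies(test).columns.tolist())   # ['color_blue', 'color_red']
```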
And if we want to process the data like this in production, where we're receiving a single example at a time, pd.get_dummies won't be of use.
With sklearn.OneHotEncoder, on the other hand, once we've created the encoder, we can reuse it to produce the same output every time, with columns only for "red" and "green". And we can explicitly control what happens when it encounters the new level "blue": if we think that's impossible, then we can tell it to throw an error with handle_unknown="error"; otherwise we can tell it to continue and simply set the red and green columns to 0, with handle_unknown="ignore".
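A small sketch of handle_unknown="ignore" in action (calling .toarray() on the result so the example does not depend on the sparse/sparse_output keyword, whose name has changed across scikit-learn versions):

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["red"], ["green"]])  # learned categories, sorted: ['green', 'red']

# "green" maps to its own column; the unseen level "blue" becomes all zeros.
out = enc.transform([["green"], ["blue"]]).toarray()
print(out)
# [[1. 0.]
#  [0. 0.]]
```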
Answered by Carl
Why wouldn't you just cache or save the columns as a variable col_list from the resulting get_dummies, then use pd.reindex to align the train and test datasets? For example:
df = pd.get_dummies(data)
col_list = df.columns.tolist()
new_df = pd.get_dummies(new_data)
new_df = new_df.reindex(columns=col_list).fillna(0.00)
Answered by Sarah
I really like Carl's answer and upvoted it. I will just expand Carl's example a bit so that more people will hopefully appreciate that pd.get_dummies can handle unknowns. The two examples below show that pd.get_dummies can accomplish the same thing in handling unknowns as OneHotEncoder.
# data is from @dzieciou's comment above
>>> data = pd.DataFrame(pd.Series(['good','bad','worst','good', 'good', 'bad']))
# new_data has two values that data does not have.
>>> new_data = pd.DataFrame(
...     pd.Series(['good','bad','worst','good', 'good', 'bad','excellent', 'perfect']))
Using pd.get_dummies
>>> df = pd.get_dummies(data)
>>> col_list = df.columns.tolist()
>>> print(df)
0_bad 0_good 0_worst
0 0 1 0
1 1 0 0
2 0 0 1
3 0 1 0
4 0 1 0
5 1 0 0
>>> new_df = pd.get_dummies(new_data)
# handle unknown values by using .reindex and .fillna()
>>> new_df = new_df.reindex(columns=col_list).fillna(0.00)
>>> print(new_df)
# 0_bad 0_good 0_worst
# 0 0 1 0
# 1 1 0 0
# 2 0 0 1
# 3 0 1 0
# 4 0 1 0
# 5 1 0 0
# 6 0 0 0
# 7 0 0 0
Using OneHotEncoder
>>> encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)  # in scikit-learn >= 1.2, use sparse_output=False
>>> encoder.fit(data)
>>> encoder.transform(new_data)
# array([[0., 1., 0.],
# [1., 0., 0.],
# [0., 0., 1.],
# [0., 1., 0.],
# [0., 1., 0.],
# [1., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.]])