
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/40336502/

Date: 2020-09-14 02:19:09  Source: igfitidea

Want to know the diff among pd.factorize, pd.get_dummies, sklearn.preprocessing.LabelEncoder and OneHotEncoder

Tags: python, pandas, encoding, machine-learning, scikit-learn

Asked by Richard Ji

All four functions seem very similar to me. In some situations some of them give the same result and some do not. Any help would be greatly appreciated!


For now I assume that, internally, factorize and LabelEncoder work the same way and produce no big differences in their results. I am not sure whether they take a similar amount of time on large volumes of data.


get_dummies and OneHotEncoder yield the same result, but OneHotEncoder can only handle numbers, while get_dummies takes all kinds of input. get_dummies automatically generates new column names for each input column, but OneHotEncoder does not (it assigns the new columns numeric names 1, 2, 3, ...), as sketched below. So get_dummies seems better in all respects.

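As a quick sketch of that naming behavior (the Feature prefix below is just an illustration, not from the original question):

import pandas as pd

df = pd.DataFrame({'Col': ['A', 'B', 'B', 'C']})
# Column names are built from the source column name and the category value
print(pd.get_dummies(df).columns.tolist())                           # ['Col_A', 'Col_B', 'Col_C']
# The prefix can also be chosen explicitly
print(pd.get_dummies(df['Col'], prefix='Feature').columns.tolist())  # ['Feature_A', 'Feature_B', 'Feature_C']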

Please correct me if I am wrong! Thank you!


Answered by Romain

These four encoders can be split in two categories:


  • Encode labels into categorical variables: Pandas factorize and scikit-learn LabelEncoder. The result has 1 dimension.
  • Encode a categorical variable into dummy/indicator (binary) variables: Pandas get_dummies and scikit-learn OneHotEncoder. The result has n dimensions, one per distinct value of the encoded categorical variable.

The main difference between the pandas and scikit-learn encoders is that the scikit-learn encoders are made to be used in scikit-learn pipelines, with fit and transform methods.

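As a minimal sketch of that pipeline usage (the column name, toy target and LogisticRegression step are illustrative and not part of the original answer; it also assumes a recent scikit-learn, where OneHotEncoder accepts string categories directly):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: one categorical feature and a binary target
X = pd.DataFrame({'Col': ['A', 'B', 'B', 'C']})
y = [0, 1, 1, 0]

# The encoder is fitted inside the pipeline, so the mapping learned on the
# training data is reused unchanged when the model makes predictions
pipe = Pipeline([
    ('encode', ColumnTransformer([('onehot', OneHotEncoder(handle_unknown='ignore'), ['Col'])])),
    ('model', LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.predict(X))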

Encode labels into categorical variables


Pandas factorize and scikit-learn LabelEncoder belong to the first category. They can be used to create categorical variables, for example to transform characters into numbers.


import pandas as pd
from pandas import DataFrame
from sklearn import preprocessing
# Test data
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])    
df['Fact'] = pd.factorize(df['Col'])[0]
le = preprocessing.LabelEncoder()
df['Lab'] = le.fit_transform(df['Col'])

print(df)
#   Col  Fact  Lab
# 0   A     0    0
# 1   B     1    1
# 2   B     1    1
# 3   C     2    2
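As a small follow-up sketch (not part of the original answer): both encoders retain enough information to map the integer codes back to the original labels. Note that factorize numbers categories in order of first appearance while LabelEncoder sorts them, which happens to give the same codes here.

import pandas as pd
from sklearn import preprocessing

s = pd.Series(['A', 'B', 'B', 'C'])

codes, uniques = pd.factorize(s)
print(codes)    # [0 1 1 2]
print(uniques)  # the distinct labels in order of first appearance: ['A', 'B', 'C']

le = preprocessing.LabelEncoder()
encoded = le.fit_transform(s)
print(le.classes_)                    # ['A' 'B' 'C'] (sorted)
print(le.inverse_transform(encoded))  # ['A' 'B' 'B' 'C']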

Encode categorical variable into dummy/indicator (binary) variables


Pandas get_dummies and scikit-learn OneHotEncoder belong to the second category. They can be used to create binary variables. OneHotEncoder can only be used with categorical integers, while get_dummies can be used with other types of variables.


df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
df = pd.get_dummies(df)

print(df)
#    Col_A  Col_B  Col_C
# 0    1.0    0.0    0.0
# 1    0.0    1.0    0.0
# 2    0.0    1.0    0.0
# 3    0.0    0.0    1.0

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# We first need to transform the characters into integers in order to use the OneHotEncoder
le = LabelEncoder()
df['Col'] = le.fit_transform(df['Col'])
enc = OneHotEncoder()
df = DataFrame(enc.fit_transform(df).toarray())

print(df)
#      0    1    2
# 0  1.0  0.0  0.0
# 1  0.0  1.0  0.0
# 2  0.0  1.0  0.0
# 3  0.0  0.0  1.0
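If readable column names are still wanted after OneHotEncoder, one possible workaround (a sketch, not from the original answer) is to reuse the label encoder's classes_ when rebuilding the DataFrame:

import pandas as pd
from pandas import DataFrame
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
le = LabelEncoder()
codes = le.fit_transform(df['Col'])

enc = OneHotEncoder()
dummies = enc.fit_transform(codes.reshape(-1, 1)).toarray()

# le.classes_ holds the original labels, so they can serve as column names,
# mimicking what get_dummies does automatically
df_out = DataFrame(dummies, columns=['Col_' + c for c in le.classes_])
print(df_out)
#    Col_A  Col_B  Col_C
# 0    1.0    0.0    0.0
# 1    0.0    1.0    0.0
# 2    0.0    1.0    0.0
# 3    0.0    0.0    1.0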

I've also written a more detailed post based on this answer.
