Original URL: http://stackoverflow.com/questions/40336502/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Want to know the diff among pd.factorize, pd.get_dummies, sklearn.preprocessing.LabelEncoder and OneHotEncoder
Asked by Richard Ji
All four functions seem really similar to me. In some situations some of them might give the same result, some not. Any help will be thankfully appreciated!
Now I know, and I assume, that internally factorize and LabelEncoder work the same way and have no big differences in terms of results. I am not sure whether they will take a similar amount of time with large magnitudes of data.
get_dummies and OneHotEncoder will yield the same result, but OneHotEncoder can only handle numbers while get_dummies will take all kinds of input. get_dummies will generate new column names automatically for each input column, but OneHotEncoder will not (it will instead assign new column names 1, 2, 3, ...). So get_dummies is better in all respects.
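To illustrate the "all kinds of input" point, a small sketch (toy data, not from the original post): pd.get_dummies only encodes object/categorical columns and passes numeric columns through unchanged.

```python
import pandas as pd

# Toy data: one string column, one numeric column
df = pd.DataFrame({'Col': ['A', 'B', 'B', 'C'], 'Num': [1, 2, 3, 4]})

# Only the object column 'Col' is one-hot encoded; 'Num' passes through untouched
print(pd.get_dummies(df).columns.tolist())
# ['Num', 'Col_A', 'Col_B', 'Col_C']
```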
Please correct me if I am wrong! Thank you!
Answered by Romain
These four encoders can be split in two categories:
- Encode labels into categorical variables: Pandas factorize and scikit-learn LabelEncoder. The result will have 1 dimension.
- Encode a categorical variable into dummy/indicator (binary) variables: Pandas get_dummies and scikit-learn OneHotEncoder. The result will have n dimensions, one per distinct value of the encoded categorical variable.
The main difference between the pandas and scikit-learn encoders is that the scikit-learn encoders are made to be used in scikit-learn pipelines, with fit and transform methods.
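As a sketch of why this matters (hypothetical train/new-data split, not from the original answer): fit learns the label-to-integer mapping once, and transform reuses that exact mapping on data seen later.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['A', 'B', 'C'])            # learn the mapping on training data
print(le.transform(['B', 'A']))    # apply the same mapping to new data
# [1 0]
```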
Encode labels into categorical variables
Pandas factorize and scikit-learn LabelEncoder belong to the first category. They can be used to create categorical variables, for example to transform characters into numbers.
import pandas as pd
from pandas import DataFrame
from sklearn import preprocessing

# Test data
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
df['Fact'] = pd.factorize(df['Col'])[0]
le = preprocessing.LabelEncoder()
df['Lab'] = le.fit_transform(df['Col'])
print(df)
#   Col  Fact  Lab
# 0   A     0    0
# 1   B     1    1
# 2   B     1    1
# 3   C     2    2
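Both encodings can also be reversed, which is handy for reading integer codes back as labels; a small sketch with the same toy data (not part of the original answer):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# pd.factorize returns the codes plus the array of unique labels,
# so indexing the uniques with the codes recovers the original values
codes, uniques = pd.factorize(['A', 'B', 'B', 'C'])
print(list(uniques[codes]))              # ['A', 'B', 'B', 'C']

# LabelEncoder exposes the same round trip via inverse_transform
le = LabelEncoder()
lab = le.fit_transform(['A', 'B', 'B', 'C'])
print(list(le.inverse_transform(lab)))   # ['A', 'B', 'B', 'C']
```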
Encode categorical variable into dummy/indicator (binary) variables
Pandas get_dummies and scikit-learn OneHotEncoder belong to the second category. They can be used to create binary variables. OneHotEncoder can only be used with categorical integers, while get_dummies can be used with other types of variables.
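Note that the integers-only limitation reflects the scikit-learn versions current at the time of the answer; since scikit-learn 0.20, OneHotEncoder accepts strings directly, so the intermediate LabelEncoder step shown below is only needed on older releases. A minimal sketch on a recent version:

```python
from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 0.20: string categories work without a LabelEncoder step
enc = OneHotEncoder()
out = enc.fit_transform([['A'], ['B'], ['B'], ['C']]).toarray()
print(out)
```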
import pandas as pd
from pandas import DataFrame

df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
df = pd.get_dummies(df)
print(df)
#    Col_A  Col_B  Col_C
# 0    1.0    0.0    0.0
# 1    0.0    1.0    0.0
# 2    0.0    1.0    0.0
# 3    0.0    0.0    1.0
from pandas import DataFrame
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# We need to transform the characters into integers first in order to use the OneHotEncoder
le = LabelEncoder()
df['Col'] = le.fit_transform(df['Col'])
enc = OneHotEncoder()
df = DataFrame(enc.fit_transform(df).toarray())
print(df)
#      0    1    2
# 0  1.0  0.0  0.0
# 1  0.0  1.0  0.0
# 2  0.0  1.0  0.0
# 3  0.0  0.0  1.0
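A related difference, and part of why OneHotEncoder suits scikit-learn pipelines better than get_dummies: once fitted, it can cope with categories that only show up at transform time. A sketch (handle_unknown='ignore' requires a reasonably recent scikit-learn):

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([[0], [1], [2]])                    # categories learned at fit time: 0, 1, 2
print(enc.transform([[1], [3]]).toarray())  # 3 was never seen during fit
# the unseen category 3 encodes as an all-zero row instead of raising an error
```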
I've also written a more detailed post based on this answer.