Is pd.get_dummies one-hot encoding?
Note: this page is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/48170405/
Asked by Mattia Paterna
Given the difference between one-hot encoding and dummy coding, is the pandas.get_dummies method one-hot encoding when using default parameters (i.e. drop_first=False)?
If so, does it make sense that I remove the intercept from the logistic regression model? Here is an example:
# Assume the dataset is already in a DataFrame X and the true labels are in y
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .80)
clf = LogisticRegression(fit_intercept=False)
clf.fit(X_train, y_train)
Answered by piRSquared
Dummies are any variables that are either one or zero for each observation. pd.get_dummies, when applied to a column of categories where we have one category per observation, will produce a new column (variable) for each unique categorical value. It will place a one in the column corresponding to the categorical value present for that observation. This is equivalent to one-hot encoding.
One-hot encoding is characterized by having exactly one 1 per set of categorical values per observation.
Consider the series s
s = pd.Series(list('AABBCCABCDDEE'))
s
0 A
1 A
2 B
3 B
4 C
5 C
6 A
7 B
8 C
9 D
10 D
11 E
12 E
dtype: object
pd.get_dummies will produce one-hot encoding. And yes! It is absolutely appropriate to not fit the intercept.
pd.get_dummies(s)
A B C D E
0 1 0 0 0 0
1 1 0 0 0 0
2 0 1 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 1 0 0
6 1 0 0 0 0
7 0 1 0 0 0
8 0 0 1 0 0
9 0 0 0 1 0
10 0 0 0 1 0
11 0 0 0 0 1
12 0 0 0 0 1
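The one-hot property described above can be checked directly. The following is a minimal sketch, assuming the same series s as in the answer; the names dummies, row_sums, and is_one_hot are illustrative and not part of the original answer. The cast to int is there because, depending on the pandas version, get_dummies may return 0/1 integers or booleans.

```python
import pandas as pd

s = pd.Series(list('AABBCCABCDDEE'))
dummies = pd.get_dummies(s)

# Cast to int so the check works whether pandas returns 0/1 integers or booleans.
row_sums = dummies.astype(int).sum(axis=1)

# One-hot: every observation has exactly one column set.
is_one_hot = bool((row_sums == 1).all())
print(is_one_hot)  # True
```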
However, if you had s include different data and used pd.Series.str.get_dummies
s = pd.Series('A|B,A,B,B,C|D,D|B,A,B,C,A|D'.split(','))
s
0 A|B
1 A
2 B
3 B
4 C|D
5 D|B
6 A
7 B
8 C
9 A|D
dtype: object
Then get_dummies produces dummy variables that are not one-hot encoded, and you could theoretically leave the intercept in.
s.str.get_dummies()
A B C D
0 1 1 0 0
1 1 0 0 0
2 0 1 0 0
3 0 1 0 0
4 0 0 1 1
5 0 1 0 1
6 1 0 0 0
7 0 1 0 0
8 0 0 1 0
9 1 0 0 1
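To see why this second result is not one-hot, it may help to count how many columns are set per row. This is a small sketch using the same multi-label series as above; multi and row_sums are illustrative names only.

```python
import pandas as pd

s = pd.Series('A|B,A,B,B,C|D,D|B,A,B,C,A|D'.split(','))
multi = s.str.get_dummies()  # splits on '|' by default

# Number of categories flagged per observation.
row_sums = multi.astype(int).sum(axis=1)
print(row_sums.tolist())  # [2, 1, 1, 1, 2, 2, 1, 1, 1, 2]

# Some rows have more than one column set, so this is not one-hot.
is_one_hot = bool((row_sums == 1).all())
print(is_one_hot)  # False
```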
Answered by muskrat
First question: yes, pd.get_dummies() is one-hot encoding in its default state; see the example below, from the pd.get_dummies docs:
s = pd.Series(list('abca'))
pd.get_dummies(s, drop_first=False)
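For comparison with the dummy coding mentioned in the question, here is a short sketch of how drop_first changes the result on the same series. The variable names one_hot and dummy_coded are illustrative.

```python
import pandas as pd

s = pd.Series(list('abca'))

# Default: one column per category (one-hot encoding).
one_hot = pd.get_dummies(s, drop_first=False)
print(list(one_hot.columns))      # ['a', 'b', 'c']

# drop_first=True: the first category is dropped and becomes the baseline
# (dummy coding), so rows with category 'a' are all zeros.
dummy_coded = pd.get_dummies(s, drop_first=True)
print(list(dummy_coded.columns))  # ['b', 'c']
```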
Second question: [edited now that OP includes code example] yes, if you are one-hot encoding the inputs to a logistic regression model, it is appropriate to skip the intercept.
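One way to see why the intercept is redundant with one-hot inputs: the one-hot columns always sum to a column of ones, which is exactly what an intercept column is. The sketch below, assuming the series from the first answer and using NumPy's matrix rank purely as an illustration, shows that adding an intercept column adds a column but no new information.

```python
import numpy as np
import pandas as pd

s = pd.Series(list('AABBCCABCDDEE'))
X = pd.get_dummies(s).to_numpy(dtype=float)

# Prepend an explicit intercept column of ones.
X_with_intercept = np.column_stack([np.ones(len(X)), X])

rank_without = np.linalg.matrix_rank(X)
rank_with = np.linalg.matrix_rank(X_with_intercept)

print(X.shape[1], rank_without)              # 5 5
print(X_with_intercept.shape[1], rank_with)  # 6 5
```

The matrix with the intercept has six columns but still rank five, so one of them is a linear combination of the others.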