Is pandas pd.get_dummies one-hot encoding?

Question

Asked by Mattia Paterna

Given the difference between one-hot encoding and dummy coding, is the pandas.get_dummies method one-hot encoding when using default parameters (i.e. drop_first=False)?

If so, does it make sense that I remove the intercept from the logistic regression model? Here is an example:

# I assume I have already my dataset in a DataFrame X and the true labels in y
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .80)

clf = LogisticRegression(fit_intercept=False)
clf.fit(X_train, y_train)

Answer 1

Answered by piRSquared

Dummies are any variables that are either one or zero for each observation. pd.get_dummies, when applied to a column of categories with one category per observation, produces a new column (variable) for each unique categorical value. It places a one in the column corresponding to the categorical value present for that observation. This is equivalent to one-hot encoding.

One-hot encoding is characterized by having exactly one 1 per set of categorical values per observation.

Consider the series s:

s = pd.Series(list('AABBCCABCDDEE'))

s

0     A
1     A
2     B
3     B
4     C
5     C
6     A
7     B
8     C
9     D
10    D
11    E
12    E
dtype: object

pd.get_dummies will produce a one-hot encoding. And yes, it is appropriate not to fit the intercept: the dummy columns sum to one in every row, so together they already account for the constant term.

pd.get_dummies(s)

    A  B  C  D  E
0   1  0  0  0  0
1   1  0  0  0  0
2   0  1  0  0  0
3   0  1  0  0  0
4   0  0  1  0  0
5   0  0  1  0  0
6   1  0  0  0  0
7   0  1  0  0  0
8   0  0  1  0  0
9   0  0  0  1  0
10  0  0  0  1  0
11  0  0  0  0  1
12  0  0  0  0  1
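
The one-hot property described above can be checked directly: every row should contain exactly one 1. The snippet below is a minimal sketch of that check. The names dummies, row_sums and is_one_hot are illustrative, and dtype=int is passed on the assumption that 0/1 output is wanted, since recent pandas versions return booleans by default.

```python
import pandas as pd

s = pd.Series(list('AABBCCABCDDEE'))

# dtype=int forces 0/1 columns (newer pandas versions default to True/False)
dummies = pd.get_dummies(s, dtype=int)

# One-hot means each observation has exactly one column set to 1
row_sums = dummies.sum(axis=1)
is_one_hot = bool((row_sums == 1).all())

print(is_one_hot)  # True
```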

However, if s contained different data and you used pd.Series.str.get_dummies:

s = pd.Series('A|B,A,B,B,C|D,D|B,A,B,C,A|D'.split(','))

s

0    A|B
1      A
2      B
3      B
4    C|D
5    D|B
6      A
7      B
8      C
9    A|D
dtype: object

Then str.get_dummies produces dummy variables that are not one-hot encoded, and you could theoretically keep the intercept.

s.str.get_dummies()

   A  B  C  D
0  1  1  0  0
1  1  0  0  0
2  0  1  0  0
3  0  1  0  0
4  0  0  1  1
5  0  1  0  1
6  1  0  0  0
7  0  1  0  0
8  0  0  1  0
9  1  0  0  1
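
The same row-sum check makes the difference visible for the multi-label case. This is a small sketch, with multi and row_sums as illustrative names; it relies on str.get_dummies splitting on '|' by default.

```python
import pandas as pd

s = pd.Series('A|B,A,B,B,C|D,D|B,A,B,C,A|D'.split(','))

# str.get_dummies splits each string on '|' and flags every label it finds
multi = s.str.get_dummies()

# Rows holding two labels sum to 2, so the result is not one-hot
row_sums = multi.sum(axis=1)

print(row_sums.tolist())  # [2, 1, 1, 1, 2, 2, 1, 1, 1, 2]
```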

Answer 2

Answered by muskrat

First question: yes, pd.get_dummies() is one-hot encoding in its default state; see the example below, from the pd.get_dummies docs:

s = pd.Series(list('abca'))
pd.get_dummies(s, drop_first=False)
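
For comparison, here is a short sketch of how the default differs from dummy coding, which drops the first level. The variable names one_hot and dummy_coded are illustrative, and dtype=int is only there to keep 0/1 output across pandas versions.

```python
import pandas as pd

s = pd.Series(list('abca'))

# Default behaviour: one column per category (one-hot encoding)
one_hot = pd.get_dummies(s, drop_first=False, dtype=int)

# drop_first=True removes the first level, leaving k-1 columns (dummy coding)
dummy_coded = pd.get_dummies(s, drop_first=True, dtype=int)

print(list(one_hot.columns))      # ['a', 'b', 'c']
print(list(dummy_coded.columns))  # ['b', 'c']
```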

Second question: [edited now that OP includes code example] yes, if you are one-hot encoding the inputs to a logistic regression model, it is appropriate to skip the intercept.
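
One way to see why the intercept becomes redundant is to look at the rank of the design matrix. The sketch below uses NumPy, which the original answers do not mention, and the variable names are illustrative. Adding a column of ones to the one-hot columns adds a column but no rank, because the one-hot columns already sum to that column.

```python
import numpy as np
import pandas as pd

s = pd.Series(list('abca'))
X = pd.get_dummies(s, dtype=float).to_numpy()

# Prepend an explicit intercept column of ones
X_with_intercept = np.column_stack([np.ones(len(X)), X])

rank_without = np.linalg.matrix_rank(X)
rank_with = np.linalg.matrix_rank(X_with_intercept)

# Four columns with the intercept, but the rank stays at three
print(X_with_intercept.shape[1], rank_without, rank_with)  # 4 3 3
```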
