pandas: possible ways to do one-hot encoding in scikit-learn?

Question

Asked by Nguyen Ngoc Tuan

I have a pandas data frame with some categorical columns. Some of these contain non-integer values.

I want to apply several machine learning models to this data. Some models need preprocessing to give better results, for example converting categorical variables into dummy/indicator variables. pandas has a function called get_dummies for that purpose. However, its output depends on the data it is given. If I call get_dummies on the training data and then again on the test data, the resulting columns can differ, because a categorical column in the test data may contain only a subset of, or a different set of, the values that appear in the training data.

Therefore, I am looking for other methods to do one-hot encoding.

What are possible ways to do one-hot encoding in Python (pandas/sklearn)?
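The column mismatch described in the question is easy to reproduce. Here is a minimal sketch, assuming two small hypothetical frames train_df and test_df with a single categorical column named color (the data and the column name are invented for illustration):

```python
import pandas as pd

# Hypothetical data: the test set lacks "green" and contains an unseen "blue".
train_df = pd.DataFrame({"color": ["red", "green", "red"]})
test_df = pd.DataFrame({"color": ["red", "blue"]})

# Each call derives its columns only from the values it sees.
train = pd.get_dummies(train_df)
test = pd.get_dummies(test_df)

print(list(train.columns))  # ['color_green', 'color_red']
print(list(test.columns))   # ['color_blue', 'color_red']
```

A model fitted on train would receive differently named columns at prediction time, which is the inconsistency the answers below try to avoid.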

Answer 1

Answered by David Maust

Scikit-learn provides an encoder sklearn.preprocessing.LabelBinarizer.

To encode the training data, you can use fit_transform, which discovers the category labels and creates the appropriate dummy variables.

from sklearn.preprocessing import LabelBinarizer

# learn the set of labels from the training data and encode it
label_binarizer = LabelBinarizer()
training_mat = label_binarizer.fit_transform(df.Label)

For the test data, you can reuse the same set of categories by calling transform.

test_mat = label_binarizer.transform(test_df.Label)
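The two snippets above can be combined into one runnable sketch. The frames df and test_df and their Label column are hypothetical stand-ins for the asker's data. In the scikit-learn versions I am aware of, a label that was not seen during fitting comes out as a row of zeros, but it is worth confirming this on your own version.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical data; "d" never appears in the training labels.
df = pd.DataFrame({"Label": ["a", "b", "c", "a"]})
test_df = pd.DataFrame({"Label": ["c", "a", "d"]})

label_binarizer = LabelBinarizer()
training_mat = label_binarizer.fit_transform(df["Label"])

# transform reuses the classes learned above, so the column count is fixed.
test_mat = label_binarizer.transform(test_df["Label"])

print(label_binarizer.classes_)
print(training_mat.shape, test_mat.shape)
```

Note that LabelBinarizer handles one column at a time, and with exactly two classes it returns a single 0/1 column rather than two.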

Answer 2

Answered by hume

In the past, I've found that the easiest way to deal with this problem is to use get_dummies and then enforce that the columns match between test and train. For example, you might do something like:

import numpy as np
import pandas as pd

train = pd.get_dummies(train_df)
test = pd.get_dummies(test_df)

# get the columns in train that are not in test
col_to_add = np.setdiff1d(train.columns, test.columns)

# add these columns to test, setting them equal to zero
for c in col_to_add:
    test[c] = 0

# select and reorder the test columns using the train columns
test = test[train.columns]

This discards information about labels that do not appear in the training set, but it enforces consistency. If you're doing cross-validation with these splits, I'd recommend two things. First, run get_dummies on the whole dataset to get all of the columns (instead of only on the training set, as in the code above). Second, use StratifiedKFold for cross-validation so that your splits contain the relevant labels.
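The same alignment can also be written with DataFrame.reindex, which adds the missing columns and drops the extra ones in one step. This is a sketch of an alternative spelling, not the answer's original code, and the sample frames are hypothetical:

```python
import pandas as pd

# Hypothetical data: "green" is missing from test, "blue" is new in test.
train_df = pd.DataFrame({"color": ["red", "green", "red"]})
test_df = pd.DataFrame({"color": ["red", "blue"]})

train = pd.get_dummies(train_df, dtype=int)
test = pd.get_dummies(test_df, dtype=int)

# Keep exactly the training columns, in training order:
# columns missing from test are created and filled with 0,
# columns seen only in test (here "color_blue") are dropped.
test = test.reindex(columns=train.columns, fill_value=0)

print(test)
```

As with the loop version, the row for the unseen value ends up all zeros, so the information about that label is lost.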

Answer 3

Answered by Arnab Biswas

Say I have a feature "A" with possible values "a", "b", "c", "d", but the training data set contains only three of them: "a", "b", "c". If get_dummies is used at this stage, it generates three features (A_a, A_b, A_c). Ideally there should also be a fourth feature, A_d, containing all zeros. That can be achieved by declaring the full set of categories up front:

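import pandas as pd
data = pd.DataFrame({"A" : ["a", "b", "c"]})
# declare every possible category, including "d", which is absent from the data
# (astype("category", categories=...) was removed from pandas; use CategoricalDtype)
data["A"] = data["A"].astype(pd.CategoricalDtype(categories=["a", "b", "c", "d"]))
# dtype=float reproduces the 1.0/0.0 output shown below
mod_data = pd.get_dummies(data[["A"]], dtype=float)
print(mod_data)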

The output is:

   A_a  A_b  A_c  A_d
0  1.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0
2  0.0  0.0  1.0  0.0
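Applying one shared category definition to both the training and the test frame gives matching columns without any manual alignment. Here is a minimal sketch, assuming the full list of values for "A" is known in advance. The name a_dtype and the sample frames are hypothetical.

```python
import pandas as pd

# Assumed full set of categories for feature "A", known ahead of time.
a_dtype = pd.CategoricalDtype(categories=["a", "b", "c", "d"])

train_df = pd.DataFrame({"A": ["a", "b", "c"]})
test_df = pd.DataFrame({"A": ["d", "a"]})

# Both frames share the same dtype, so get_dummies emits the same columns.
train = pd.get_dummies(train_df.astype({"A": a_dtype}), dtype=float)
test = pd.get_dummies(test_df.astype({"A": a_dtype}), dtype=float)

print(list(train.columns) == list(test.columns))
print(test)
```

Unlike the column-alignment approach, the value "d" keeps its own indicator column here, even though it never occurs in the training data.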
