Possible ways to do one-hot encoding in pandas / scikit-learn?
Original URL: http://stackoverflow.com/questions/34170413/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Possible ways to do one hot encoding in scikit-learn?
Asked by Nguyen Ngoc Tuan
I have a pandas DataFrame with some categorical columns. Some of these contain non-integer values.
I want to apply several machine learning models to this data. Some models need preprocessing to give better results, for example converting categorical variables into dummy/indicator variables. pandas has a function called get_dummies for that purpose. However, its output depends on the data it is given. So if I call get_dummies on the training data and then again on the test data, the resulting columns can differ, because a categorical column in the test data may contain only a subset of, or a different set of, the values that appear in the training data.
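The mismatch described above can be reproduced with a small sketch. The column name "color" and its values are made up for illustration and are not from the original question.

```python
import pandas as pd

# Hypothetical toy data: "color" takes different values in each split.
train_df = pd.DataFrame({"color": ["red", "green", "blue"]})
test_df = pd.DataFrame({"color": ["green", "yellow"]})

# get_dummies only sees the values present in the frame it is given.
train = pd.get_dummies(train_df)
test = pd.get_dummies(test_df)

print(list(train.columns))  # ['color_blue', 'color_green', 'color_red']
print(list(test.columns))   # ['color_green', 'color_yellow']
```

A model fitted on the three training columns cannot be applied directly to the two test columns, which is the problem the answers address.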
Therefore, I am looking for other methods to do one-hot encoding.
What are the possible ways to do one-hot encoding in Python (pandas/sklearn)?
Answered by David Maust
Scikit-learn provides an encoder, sklearn.preprocessing.LabelBinarizer.
To encode the training data, use fit_transform, which discovers the category labels and creates the appropriate dummy variables.
import sklearn.preprocessing
# df is the training DataFrame; Label is its categorical column
label_binarizer = sklearn.preprocessing.LabelBinarizer()
training_mat = label_binarizer.fit_transform(df.Label)
For the test data, call transform on the same fitted binarizer so that it reuses the categories learned from the training data.
test_mat = label_binarizer.transform(test_df.Label)
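A minimal end-to-end sketch of the two calls above, using made-up data. The frames and their values are assumptions for illustration; only the "Label" column name comes from the answer. In the scikit-learn versions I am aware of, a label that was not seen during fit is encoded as an all-zero row rather than raising an error, but treat that as an assumption and check it against your installed version.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy data; "Label" mirrors the column name used above.
df = pd.DataFrame({"Label": ["a", "b", "c", "a"]})
test_df = pd.DataFrame({"Label": ["b", "d"]})  # "d" never appears in training

label_binarizer = LabelBinarizer()
training_mat = label_binarizer.fit_transform(df["Label"])
test_mat = label_binarizer.transform(test_df["Label"])

print(label_binarizer.classes_)  # categories learned from the training data
print(training_mat.shape)        # one column per learned category
print(test_mat)                  # same columns, in the same order as training
```

Note that LabelBinarizer works on one column at a time, and with exactly two categories it returns a single 0/1 column instead of two.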
Answered by hume
In the past, I've found the easiest way to deal with this problem is to use get_dummies and then enforce that the columns match up between test and train. For example, you might do something like:
import numpy as np
import pandas as pd
train = pd.get_dummies(train_df)
test = pd.get_dummies(test_df)
# get the columns in train that are not in test
col_to_add = np.setdiff1d(train.columns, test.columns)
# add these columns to test, setting them equal to zero
for c in col_to_add:
    test[c] = 0
# select and reorder the test columns using the train columns
test = test[train.columns]
This will discard information about labels that you haven't seen in the training set, but will enforce consistency. If you're doing cross validation using these splits, I'd recommend two things. First, do get_dummies on the whole dataset to get all of the columns (instead of just on the training set as in the code above). Second, use StratifiedKFold for cross validation so that your splits contain the relevant labels.
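As a possible shorthand for the column alignment in this answer, DataFrame.reindex can add the missing columns, drop the test-only ones, and reorder in one call. This is a sketch on made-up frames, not the answer author's code; dtype=int is passed to get_dummies here only so that the zero-filled columns have the same type as the rest.

```python
import pandas as pd

# Hypothetical toy data: the test split contains "d", which training never saw.
train_df = pd.DataFrame({"A": ["a", "b", "c"]})
test_df = pd.DataFrame({"A": ["a", "d"]})

train = pd.get_dummies(train_df, dtype=int)
test = pd.get_dummies(test_df, dtype=int)

# Keep exactly the training columns, in training order. Columns missing from
# test are created and filled with 0; test-only columns (A_d) are dropped.
test = test.reindex(columns=train.columns, fill_value=0)
print(test)
```

As with the loop version, the row whose value was unseen in training ends up as all zeros.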
Answered by Arnab Biswas
Say I have a feature "A" with possible values "a", "b", "c", "d", but the training data set contains only three of them: "a", "b", "c". If get_dummies is used at this stage, three features will be generated (A_a, A_b, A_c). Ideally there should also be a fourth feature, A_d, containing all zeros. That can be achieved in the following way:
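import pandas as pd
data = pd.DataFrame({"A" : ["a", "b", "c"]})
# declare every possible category, including "d", which is absent from the data
data["A"] = pd.Categorical(data["A"], categories=["a", "b", "c", "d"])
# recent pandas returns booleans by default; dtype=float gives 1.0/0.0 values
mod_data = pd.get_dummies(data[["A"]], dtype=float)
print(mod_data)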
The output is:
   A_a  A_b  A_c  A_d
0  1.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0
2  0.0  0.0  1.0  0.0
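The same idea can be applied to both splits by declaring the categories once and reusing them. The sketch below assumes the full set of values for "A" is known in advance; the test frame and the name a_dtype are made up for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumed to be known up front: every value that feature "A" can take.
a_dtype = CategoricalDtype(categories=["a", "b", "c", "d"])

train_df = pd.DataFrame({"A": ["a", "b", "c"]})
test_df = pd.DataFrame({"A": ["d", "a"]})  # hypothetical test split

# Casting both frames to the same dtype fixes the dummy columns and their order.
train = pd.get_dummies(train_df.astype({"A": a_dtype}), dtype=int)
test = pd.get_dummies(test_df.astype({"A": a_dtype}), dtype=int)

print(list(train.columns) == list(test.columns))
print(test)
```

A value that is not in the declared categories becomes missing when cast, so it would be encoded as an all-zero row.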