Python: keeping the same dummy variables in training and testing data

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/41335718/

Date: 2020-08-20 00:49:48  Source: igfitidea

Keep same dummy variable in training and testing data

Tags: python, dataframe, scikit-learn, prediction, dummy-variable

Asked by nimning

I am building a prediction model in Python with separate training and testing sets. The training data contains numeric categorical variables, e.g., zip code [91521, 23151, 12355, ...], as well as string categorical variables, e.g., city ['Chicago', 'New York', 'Los Angeles', ...].

To prepare the training data, I first use 'pd.get_dummies' to create dummy variables for these columns, and then fit the model on the transformed training data.

I apply the same transformation to my test data and predict with the trained model. However, I get the error 'ValueError: Number of features of the model must match the input. Model n_features is 1487 and input n_features is 1345 '. The reason is that the test data produces fewer dummy variables, because it contains fewer distinct 'city' and 'zipcode' values.
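
The mismatch described above can be reproduced on a small scale. This is a minimal sketch with made-up toy data (the frames and the variable names `train_dummies`, `test_dummies`, and `missing` are illustrative, not taken from the question); it assumes `pd.get_dummies` is called separately on each set.

```python
import pandas as pd

# Hypothetical toy data: the test set has fewer distinct cities and zip codes.
train = pd.DataFrame({"city": ["Chicago", "New York", "Los Angeles"],
                      "zipcode": [91521, 23151, 12355]})
test = pd.DataFrame({"city": ["Chicago", "Chicago"],
                     "zipcode": [91521, 23151]})

# `columns=` forces the numeric zipcode column to be encoded as categorical too.
train_dummies = pd.get_dummies(train, columns=["city", "zipcode"])
test_dummies = pd.get_dummies(test, columns=["city", "zipcode"])

# Dummy columns the model was trained on but that the test set never produced.
missing = sorted(set(train_dummies.columns) - set(test_dummies.columns))
print(train_dummies.shape[1], test_dummies.shape[1])
print(missing)
```

Each call only sees the categories present in its own frame, so the two encoded frames end up with different widths.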

How can I solve this problem? For example, 'OneHotEncoder' will only encode numeric categorical variables, and 'DictVectorizer()' will only encode string categorical variables. I searched online and found a few similar questions, but none of them really addresses mine.

Handling categorical features using scikit-learn

https://www.quora.com/If-the-training-dataset-has-more-variables-than-the-test-dataset-what-does-one-do

https://www.quora.com/What-is-the-best-way-to-do-a-binary-one-hot-one-of-K-coding-in-Python

Answered by Thibault Clement

You can also just get the missing columns and add them to the test dataset:

# Columns present in the (already dummy-encoded) training set but missing from the test set
missing_cols = set(train.columns) - set(test.columns)
# Add each missing column to the test set, filled with 0
for c in missing_cols:
    test[c] = 0
# Keep only the training columns, in the same order as in the training set
test = test[train.columns]

This code also ensures that any column produced by a category that appears in the test dataset but not in the training dataset is removed.
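
The same add-drop-reorder logic can also be written as a single `DataFrame.reindex` call. This is a sketch of an alternative, not the answer's own code; the helper name `match_columns` and the toy frames are hypothetical, and both inputs are assumed to be already dummy-encoded.

```python
import pandas as pd

def match_columns(train_dummies, test_dummies):
    """Return test_dummies with exactly the columns of train_dummies, in the same order."""
    # Columns missing from the test set are created and filled with 0;
    # columns that exist only in the test set are dropped.
    return test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

# Hypothetical toy data: 'Boston' appears only in test, 'New York' only in train.
train_d = pd.get_dummies(pd.DataFrame({"city": ["Chicago", "New York"]}), dtype=int)
test_d = pd.get_dummies(pd.DataFrame({"city": ["Chicago", "Boston"]}), dtype=int)

aligned = match_columns(train_d, test_d)
print(aligned)
```

Unlike the loop, this returns a new frame and leaves the original test frame unmodified.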

Answered by Eduard Ilyasov

Assume the feature names are identical in the train and test datasets. You can concatenate train and test into one dataset, get dummies from the concatenated dataset, and then split it back into train and test.

You can do it this way:

import pandas as pd
train = pd.DataFrame(data=[['a', 123, 'ab'], ['b', 234, 'bc']],
                     columns=['col1', 'col2', 'col3'])
test = pd.DataFrame(data=[['c', 345, 'ab'], ['b', 456, 'ab']],
                    columns=['col1', 'col2', 'col3'])
# Remember how many rows belong to the training set
train_objs_num = len(train)
# Stack train on top of test, then encode both at once
dataset = pd.concat(objs=[train, test], axis=0)
dataset_preprocessed = pd.get_dummies(dataset)
# Split back into train and test by row position
train_preprocessed = dataset_preprocessed[:train_objs_num]
test_preprocessed = dataset_preprocessed[train_objs_num:]

As a result, the train and test datasets have the same number of features.

Answered by user1482030

train2, test2 = train.align(test, join='outer', axis=1, fill_value=0)

train2 and test2 have the same columns. fill_value is the value used to fill the columns that are missing from either frame.
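
The `join` argument decides which columns survive. The sketch below uses made-up toy frames to compare `join='outer'` (union of both column sets) with `join='left'` (training columns only). The `'left'` variant is an addition for illustration, not part of the answer; it may be the more useful one when the model has already been fitted on the training columns.

```python
import pandas as pd

# Hypothetical toy data: 'Boston' appears only in test, 'New York' only in train.
train_d = pd.get_dummies(pd.DataFrame({"city": ["Chicago", "New York"]}), dtype=int)
test_d = pd.get_dummies(pd.DataFrame({"city": ["Chicago", "Boston"]}), dtype=int)

# join='outer': both frames receive the union of the columns.
train_outer, test_outer = train_d.align(test_d, join="outer", axis=1, fill_value=0)

# join='left': both frames receive exactly the training columns.
train_left, test_left = train_d.align(test_d, join="left", axis=1, fill_value=0)

print(list(test_outer.columns))
print(list(test_left.columns))
```

With `'outer'`, the training frame gains an all-zero `city_Boston` column, so the model would have to be fitted after alignment rather than before.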

Answered by fsociety

This is a rather old question, but if you want to use the scikit-learn API, you can use the following DummyEncoder class: https://gist.github.com/psinger/ef4592492dc8edf101130f0bf32f5ff9

It uses the category dtype to specify which dummies to create, as also elaborated here: Dummy creation in pipeline with different levels in train and test set
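
The category-dtype idea can be illustrated with plain pandas. This is a minimal sketch under stated assumptions, not the code from the linked gist: the toy frames and the name `city_dtype` are hypothetical, and the category levels are taken from the training data only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical toy data: 'Boston' never appears in the training set.
train = pd.DataFrame({"city": ["Chicago", "New York", "Los Angeles"]})
test = pd.DataFrame({"city": ["Chicago", "Boston"]})

# Fix the category levels from the training data only.
city_dtype = CategoricalDtype(categories=sorted(train["city"].unique()))

# get_dummies creates one column per declared category, observed or not.
train_d = pd.get_dummies(train.astype({"city": city_dtype}), dtype=int)
test_d = pd.get_dummies(test.astype({"city": city_dtype}), dtype=int)

print(list(test_d.columns))
```

Values that are not among the declared categories (here 'Boston') become missing when cast, so their row is encoded as all zeros.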
