Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must likewise follow the CC BY-SA license, link to the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/41271725/
Getting 'ValueError: shapes not aligned' on SciKit Linear Regression
Asked by Koen
Quite new to SciKit and linear algebra/machine learning with Python in general, so I can't seem to solve the following:
I have a training set and a test set of data, containing both continuous and discrete/categorical values. The CSV files are loaded into Pandas DataFrames and match in shape, being (1460, 81) and (1459, 81). However, after using Pandas' get_dummies, the shapes of the DataFrames change to (1460, 306) and (1459, 294). So, when I do linear regression with the SciKit Linear Regression module, it builds a model for 306 variables and then tries to predict with test data that has only 294. This, naturally, leads to the following error:
ValueError: shapes (1459,294) and (306,1) not aligned: 294 (dim 1) != 306 (dim 0)
How could I tackle such a problem? Could I somehow reshape the (1459, 294) to match the other one?
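The mismatch is easy to reproduce with a toy example (hypothetical data, not the original CSVs): get_dummies only creates columns for the categories it actually sees, so a category missing from the test set yields fewer columns.

```python
import pandas as pd

# Toy data: 'C' appears in train but never in test,
# so the test frame ends up with one fewer dummy column.
train = pd.DataFrame({'letter': ['A', 'B', 'C']})
test = pd.DataFrame({'letter': ['A', 'B']})

print(pd.get_dummies(train).shape)  # (3, 3)
print(pd.get_dummies(test).shape)   # (2, 2)
```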
Thanks and I hope I've made myself clear :)
Accepted answer by Nick Becker
This is an extremely common problem when dealing with categorical data. There are differing opinions on how to best handle this.
One possible approach is to apply a function to categorical features that limits the set of possible options. For example, if your feature contained the letters of the alphabet, you could encode features for A, B, C, D, and 'Other/Unknown'. In this way, you could apply the same function at test time and abstract from the issue. A clear downside, of course, is that by reducing the feature space you may lose meaningful information.
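As a sketch of this first approach (the whitelist and names are hypothetical), pandas' Series.where can map any category outside a fixed allowed set to 'Other', so train and test always encode to the same columns:

```python
import pandas as pd

# Hypothetical whitelist of categories to keep; everything else
# collapses into a single 'Other' bucket.
ALLOWED = {'A', 'B', 'C', 'D'}

def limit_categories(series, allowed=ALLOWED, other='Other'):
    # Keep values that are in the whitelist, replace the rest with `other`.
    return series.where(series.isin(allowed), other)

train = pd.Series(['A', 'B', 'E', 'C'])
test = pd.Series(['A', 'Z', 'D'])

print(limit_categories(train).tolist())  # ['A', 'B', 'Other', 'C']
print(limit_categories(test).tolist())   # ['A', 'Other', 'D']
```

Applying the same function to both sets before get_dummies guarantees identical dummy columns at train and test time.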
Another approach is to build a model on your training data, with whichever dummies are naturally created, and treat that as the baseline for your model. When you predict with the model at test time, you transform your test data in the same way your training data is transformed. For example, if your training set had the letters of the alphabet in a feature, and the same feature in the test set contained a value of 'AA', you would ignore that in making a prediction. This is the reverse of your current situation, but the premise is the same. You need to create the missing features on the fly. This approach also has downsides, of course.
The second approach is what you mention in your question, so I'll go through it with pandas.
By using get_dummies you're encoding the categorical features into multiple one-hot encoded features. What you could do is force your test data to match your training data by using reindex, like this:
test_encoded = pd.get_dummies(test_data, columns=['your columns'])
test_encoded_for_model = test_encoded.reindex(columns=training_encoded.columns, fill_value=0)
This will encode the test data in the same way as your training data, filling in 0 for dummy features that weren't created by encoding the test data but were created during the training process.
You could just wrap this into a function and apply it to your test data on the fly. You don't need the encoded training data in memory (which I access with training_encoded.columns) if you create an array or list of the column names.
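A minimal sketch of that wrapper (the helper name and column names are made up; train_columns is the saved list of encoded training column names):

```python
import pandas as pd

def encode_for_model(df, categorical_cols, train_columns):
    """One-hot encode df, then force its columns to match the training set."""
    encoded = pd.get_dummies(df, columns=categorical_cols)
    # Add any dummy columns the test data is missing (filled with 0)
    # and drop any columns the training data never saw.
    return encoded.reindex(columns=train_columns, fill_value=0)

train = pd.DataFrame({'letter': ['A', 'B', 'C'], 'x': [1, 2, 3]})
test = pd.DataFrame({'letter': ['A', 'D'], 'x': [4, 5]})

train_encoded = pd.get_dummies(train, columns=['letter'])
train_columns = list(train_encoded.columns)

test_encoded = encode_for_model(test, ['letter'], train_columns)
print(list(test_encoded.columns) == train_columns)  # True
```

Missing dummies (letter_B, letter_C here) are filled with 0, and the test-only letter_D column is dropped, so the test matrix always matches the shape the model was fit on.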
Answered by Koen
For anyone interested: I ended up merging the train and test set, then generating the dummies, and then splitting the data again at exactly the same fraction. That way there wasn't any issue with different shapes anymore, as it generated exactly the same dummy data.
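That merge-then-split trick can be sketched like this (toy frames, hypothetical column names):

```python
import pandas as pd

train = pd.DataFrame({'letter': ['A', 'B'], 'x': [1, 2]})
test = pd.DataFrame({'letter': ['C'], 'x': [3]})

# Encode train and test together so both see every category...
combined = pd.concat([train, test], ignore_index=True)
combined_encoded = pd.get_dummies(combined, columns=['letter'])

# ...then split back at the original boundary.
train_encoded = combined_encoded.iloc[:len(train)]
test_encoded = combined_encoded.iloc[len(train):]

print(list(train_encoded.columns) == list(test_encoded.columns))  # True
```

One caveat: encoding train and test together lets the test set's category list influence the training features, which may not be acceptable in every workflow.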
Answered by Satyajit Dhawale
This worked for me. Initially, I was getting this error message:
shapes (15754,3) and (4, ) not aligned
I found out that I was creating a model using 3 variables in my train data. But when I add the constant with X_train = sm.add_constant(X_train), the constant variable gets created automatically, so in total there are now 4 variables.
When you test this model, by default the test data still has only 3 variables, so the error pops up because of the dimension mismatch.
So, I used the same trick and added the constant to X_test as well:
`X_test = sm.add_constant(X_test)`
Though this is just a constant column, it solves the whole issue.