ValueError: feature_names mismatch: in xgboost in the predict() function

Note: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/42338972/

Date: 2020-09-14 03:01:33  Source: igfitidea

Tags: python, pandas, machine-learning, regression, xgboost

Asked by Sujay S Kumar

I have trained an XGBoost regressor model. When I use this trained model to predict on a new input, the predict() function throws a feature_names mismatch error, although the input feature vector has the same structure as the training data.

Also, to build the feature vector in the same structure as the training data, I am doing a lot of inefficient processing, such as adding new empty columns (if the data does not exist) and then rearranging the columns so that they match the training structure. Is there a better and cleaner way of formatting the input so that it matches the training structure?

Answered by Abhishek Sharma

I also had this problem when I used a pandas DataFrame (non-sparse representation).

I converted the training and testing data into NumPy ndarrays:

# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
X_train = X_train.to_numpy()
X_test = X_test.to_numpy()

This is how I got rid of that error!

Answered by saurabh kumar

Try converting the data into an ndarray before passing it to fit/predict. For example, if your training data is train_df and your test data is test_df, use the code below:

train_x = train_df.values
test_x = test_df.values

Now fit the model:

xgb.fit(train_x,train_y)

Finally, predict:

pred = xgb.predict(test_x)

Hope this helps!

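One caveat with the ndarray approach: converting to an array discards the column names, so the model can no longer check them and the columns are matched purely by position. A minimal sketch of the difference, using made-up column names (age, income) that are not from the original question:

```python
import pandas as pd

# Hypothetical data: same features, but the test frame lists them in a different order.
train_df = pd.DataFrame({"age": [30, 40], "income": [100.0, 200.0]})
test_df = pd.DataFrame({"income": [150.0], "age": [35]})

# Converting directly keeps the test frame's own order (income, age),
# so position 0 would be read as "age" by a model trained on train_df.
test_x_unaligned = test_df.to_numpy()

# Selecting the training columns by name first restores the training order.
test_x = test_df[train_df.columns].to_numpy()
```

So if the two DataFrames may differ in column order, it seems safer to align them by name before converting.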

Answered by Athar

This happens when the order of the column names at model-building time differs from their order at model-scoring time.

I used the following steps to overcome this error.

First, load the pickle file:

import pickle

# Use a context manager so the file is closed after loading
with open("saved_model_file", "rb") as f:
    model = pickle.load(f)

Extract all the columns in the order in which they were used:

cols_when_model_builds = model.get_booster().feature_names

Reorder the pandas DataFrame:

pd_dataframe = pd_dataframe[cols_when_model_builds]
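The three steps above can be wrapped in a small helper. This is a sketch, not part of the original answer: the function name align_to_model and the stub classes are hypothetical and exist only so the example runs without a saved model. With a real fitted model you would pass the model itself.

```python
import pandas as pd


def align_to_model(model, df):
    """Return df restricted to, and ordered by, the model's training features."""
    names = model.get_booster().feature_names
    if names is None:
        # Depending on the xgboost version, this can happen when the model
        # was fitted on a plain array instead of a DataFrame.
        raise ValueError("the model does not store feature names")
    missing = [name for name in names if name not in df.columns]
    if missing:
        raise KeyError(f"input is missing columns: {missing}")
    return df[list(names)]


# Hypothetical stand-ins for a fitted model, used only for this demonstration.
class _StubBooster:
    feature_names = ["a", "b"]


class _StubModel:
    def get_booster(self):
        return _StubBooster()


scoring_df = pd.DataFrame({"b": [2], "a": [1], "unused": [3]})
aligned = align_to_model(_StubModel(), scoring_df)
```

Checking for missing columns first gives a clearer error message than the bare KeyError that pandas would raise on its own.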

Answered by Sujay S Kumar

From what I could find, the predict function does not take a DataFrame (or a sparse matrix) as input. This is one of the bugs tracked here: https://github.com/dmlc/xgboost/issues/1238

To get around this issue, use the as_matrix() function for a DataFrame (to_numpy() or .values in pandas 1.0 and later, where as_matrix() was removed), or toarray() for a sparse matrix.

This is the only workaround until the bug is fixed or the feature is implemented in a different manner.

Answered by CathyQian

I came across the same problem and solved it by passing the training DataFrame's column names to the test DataFrame with the following code:

test_df = test_df[train_df.columns]
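The one-liner above raises a KeyError if the test DataFrame lacks any training column. For the situation described in the question, where missing columns have to be added by hand, DataFrame.reindex can add and reorder columns in one call. A minimal sketch with made-up data; whether a missing feature should be left as NaN (which XGBoost treats as a missing value by default) or filled with a constant such as 0 is an assumption that depends on your features:

```python
import pandas as pd

train_df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
# The test frame is out of order, lacks "b", and has a column the model never saw.
test_df = pd.DataFrame({"c": [9], "a": [7], "extra": [0]})

# Reorder to the training layout, drop unknown columns, add missing ones as NaN.
aligned_nan = test_df.reindex(columns=train_df.columns)

# Same, but fill the added columns with a constant instead.
aligned_zero = test_df.reindex(columns=train_df.columns, fill_value=0)
```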

Answered by GDB

Check the exception. You should see two arrays: one holds the column names of the DataFrame you are passing in, and the other holds the XGBoost feature names. They should be the same length. If you put them side by side in an Excel spreadsheet, you will see that they are not in the same order. My guess is that the XGBoost names were written to a dictionary, so it would be a coincidence if the names in the two arrays were in the same order.

The fix is easy. Just reorder your dataframe columns to match the XGBoost names:

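# Here model is a native Booster; with the scikit-learn wrappers,
# use model.get_booster().feature_names instead
f_names = model.feature_names
df = df[f_names]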
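Rather than comparing the two lists by eye, a few lines of plain Python can report how they differ. A sketch; the helper name diff_feature_names and the sample names are hypothetical:

```python
def diff_feature_names(expected, actual):
    """Summarise how the model's feature names differ from the input's columns."""
    expected, actual = list(expected), list(actual)
    return {
        # Names the model expects but the input does not provide.
        "missing": [name for name in expected if name not in actual],
        # Names in the input that the model was never trained on.
        "unexpected": [name for name in actual if name not in expected],
        # True when both lists hold the same names in a different order.
        "order_differs": sorted(expected) == sorted(actual) and expected != actual,
    }


report = diff_feature_names(["age", "income", "city"], ["income", "age", "city"])
```

If only order_differs is set, reordering the columns as shown above is enough; otherwise columns have to be added or dropped first.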

Answered by David1592

I'm contributing an answer because I ran into this problem when putting a fitted XGBRegressor model into production. This is therefore a solution for cases where you cannot select column names from any training or testing DataFrame, though there may be some overlap with the other answers that could be helpful.

The model had been fit on a Pandas DataFrame, and I was attempting to pass a single row of values as a np.array to the predict function. Processing the values of the array had already been performed (reverse label encoded, etc.), and the array was all numeric values.

I got the familiar error:

ValueError: feature_names mismatch, followed by a list of the features, followed by a list of the same length: ['f0', 'f1' ....]

While there are no doubt more direct solutions, I had little time and this fixed the problem:

  1. Make the input vector a Pandas DataFrame:

# value stands in for each feature's actual value
series = {'feature1': [value],
          'feature2': [value],
          'feature3': [value],
          'feature4': [value],
          'feature5': [value],
          'feature6': [value],
          'feature7': [value],
          'feature8': [value],
          'feature9': [value],
          'feature10': [value]
           }

vector = pd.DataFrame(series)

  2. Get the feature names that the trained model knows:

names = model.get_booster().feature_names

  3. Select those features from the input vector DataFrame (defined above), and perform iloc indexing:

result = model.predict(vector[names].iloc[[-1]])
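If the single row starts out as a NumPy array, as described above, the DataFrame can also be built directly from the array and a list of names instead of a hand-written dictionary. A sketch with hypothetical names; in practice the list would come from model.get_booster().feature_names, and it assumes the array values are already in that order:

```python
import numpy as np
import pandas as pd

# Hypothetical feature names; in practice: names = model.get_booster().feature_names
names = ["feature1", "feature2", "feature3"]

# One sample as a flat array, with values already in the same order as `names`.
row = np.array([0.5, 1.0, 2.0])

# reshape(1, -1) turns the flat array into a single row with len(names) columns.
vector = pd.DataFrame(row.reshape(1, -1), columns=names)

# vector can then be passed to model.predict(vector)
```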


The iloc transformation I found here.

To select the feature names, I used get_booster().feature_names, which I found in @Athar's post above, because models in the Scikit-Learn implementation do not have a feature_names attribute.

Check out the documentation to learn more.

Hope this helps.

Answered by Saurabh

Do this while creating the DMatrix for XGB:

# np.asarray yields a plain ndarray; NumPy no longer recommends the matrix class
dtrain = xgb.DMatrix(np.asarray(X_train), label=y_train)
dtest = xgb.DMatrix(np.asarray(X_test), label=y_test)

Do not pass X_train and X_test directly.
