pandas feature_names must be unique - Xgboost

Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow original URL: http://stackoverflow.com/questions/43579180/

Time: 2020-09-14 03:27:38  Source: igfitidea

feature_names must be unique - Xgboost

python pandas xgboost sklearn-pandas

Asked by user2728024

I am running an xgboost model on a very sparse matrix.

I am getting this error: ValueError: feature_names must be unique

How can I deal with this?

This is my code.

  yprob = bst.predict(xgb.DMatrix(test_df))[:,1]

Answered by andrew_reece

According to the xgboost source code documentation, this error only occurs in one place: in a DMatrix internal function. Here's the source code excerpt:

if len(feature_names) != len(set(feature_names)):
    raise ValueError('feature_names must be unique')

So, the error text is pretty literal here: your test_df has at least one duplicate feature/column name.

You've tagged pandas on this post; that suggests test_df is a Pandas DataFrame. In this case, DMatrix literally runs df.columns to extract feature_names. Check your test_df for repeated column names, remove or rename them, and then try DMatrix() again.
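
As a minimal sketch of that check, the snippet below applies the same condition as the excerpt to a small, made-up DataFrame and lists the repeated names. The frame contents and the variable names has_duplicates, name_counts, and repeated are assumptions for illustration only; the real test_df is not shown in the question.

```python
import pandas as pd

# Hypothetical stand-in for test_df; the real frame is not shown in the question.
test_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

feature_names = list(test_df.columns)

# Same condition as the xgboost excerpt: True means at least one name repeats.
has_duplicates = len(feature_names) != len(set(feature_names))

# Count how often each column name occurs and keep only the repeated ones.
name_counts = test_df.columns.value_counts()
repeated = name_counts[name_counts > 1]

print(has_duplicates)
print(repeated.to_dict())
```

pandas also exposes test_df.columns.is_unique, which gives the same yes/no answer without building a set.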

Answered by Arjan Groen

Assuming the problem is indeed that columns are duplicated, the following line should solve your problem:

test_df = test_df.loc[:,~test_df.columns.duplicated()]

Source: python pandas remove duplicate columns

This line should identify which columns are duplicated:

duplicate_columns = test_df.columns[test_df.columns.duplicated()]
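
To see what these two lines do, here is a small sketch on a made-up frame (the data and the name deduped are assumptions, not taken from the question). Note that duplicated() keeps the first occurrence by default, so any later column with the same name is dropped even if its values differ.

```python
import pandas as pd

# Hypothetical frame with the name "a" used twice, holding different values.
test_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# Names that appear again after their first occurrence.
duplicate_columns = test_df.columns[test_df.columns.duplicated()]

# Keep only the first column for each name; the second "a" (values 3, 6) is dropped.
deduped = test_df.loc[:, ~test_df.columns.duplicated()]

print(list(duplicate_columns))
print(list(deduped.columns))
```

If the repeated columns carry different data, it may be worth inspecting them before dropping anything.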

Answered by Akshay

One way around this is to give every column a unique name while preparing the data; the error should then go away.
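
One possible way to do that, sketched below, is to rename repeated columns instead of dropping them. The helper make_unique and its _1, _2 suffix scheme are hypothetical, written for this illustration; they are not part of pandas or xgboost.

```python
import pandas as pd


def make_unique(names):
    """Hypothetical helper: append _1, _2, ... to names that are already taken."""
    used = set()
    result = []
    for name in names:
        candidate = name
        suffix = 0
        # Try increasing suffixes until the candidate has not been used yet.
        while candidate in used:
            suffix += 1
            candidate = f"{name}_{suffix}"
        used.add(candidate)
        result.append(candidate)
    return result


# Hypothetical frame with a repeated column name.
test_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])
test_df.columns = make_unique(test_df.columns)

print(list(test_df.columns))
```

Renaming keeps all the data, but the same names would presumably need to be applied to the training data as well, so that the model and the prediction frame agree on their feature names.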
