Original source: http://stackoverflow.com/questions/43579180/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
feature_names must be unique - Xgboost
Asked by user2728024
I am running an xgboost model on a very sparse matrix.
I am getting this error: ValueError: feature_names must be unique
How can I deal with this?
This is my code:
yprob = bst.predict(xgb.DMatrix(test_df))[:,1]
Answered by andrew_reece
According to the xgboost source code documentation, this error occurs in only one place: in a DMatrix internal function. Here's the source code excerpt:
if len(feature_names) != len(set(feature_names)):
    raise ValueError('feature_names must be unique')
So the error text is pretty literal here: your test_df has at least one duplicate feature/column name.
You've tagged pandas on this post; that suggests test_df is a Pandas DataFrame. In this case, DMatrix literally runs df.columns to extract feature_names. Check your test_df for repeated column names, remove or rename them, and then try DMatrix() again.
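A quick way to check for repeated names before building the DMatrix is to apply the same uniqueness test to df.columns yourself. This is only a sketch using pandas on its own; the small test_df below and its column names are made up for illustration and stand in for your real data.

```python
import pandas as pd

# Hypothetical frame with the column name "a" repeated (illustration only)
test_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# Same comparison as the excerpt above, applied to the column labels
feature_names = list(test_df.columns)
has_duplicates = len(feature_names) != len(set(feature_names))

print(has_duplicates)             # True means at least one name repeats
print(test_df.columns.is_unique)  # pandas' built-in equivalent check
```

If has_duplicates is True (or columns.is_unique is False), the frame would be expected to trigger the ValueError quoted in the question.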
Answered by Arjan Groen
Assuming the problem is indeed that columns are duplicated, the following line should solve your problem:
test_df = test_df.loc[:,~test_df.columns.duplicated()]
Source: python pandas remove duplicate columns
This line should identify which columns are duplicated:
duplicate_columns = test_df.columns[test_df.columns.duplicated()]
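To see what these two lines do, here is a minimal sketch on a toy frame. The data and column names are invented for illustration. Note that columns.duplicated() marks only the second and later occurrences of a name, so the first occurrence of each column is the one that survives.

```python
import pandas as pd

# Hypothetical frame: the name "a" appears twice with different values
test_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# Names that occur more than once (later occurrences only)
duplicate_columns = test_df.columns[test_df.columns.duplicated()]

# Keep the first occurrence of each column name, drop the rest
deduped = test_df.loc[:, ~test_df.columns.duplicated()]

print(list(duplicate_columns))
print(list(deduped.columns))
```

Because the later copy is dropped without comparing values, it is worth checking that the duplicated columns really hold the same data before discarding one of them.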
Answered by Akshay
One way around this is to make sure the column names are unique while preparing the data; after that, the call should work.
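If the duplicated columns hold different data and you want to keep all of them, renaming is an alternative to dropping. The helper below, make_unique, is a hypothetical function written for this sketch (it is not part of pandas or xgboost); it appends a numeric suffix to repeated names.

```python
import pandas as pd

def make_unique(names):
    """Return a list of names where repeats get a _1, _2, ... suffix."""
    seen = {}
    result = []
    for name in names:
        count = seen.get(name, 0)
        # First occurrence keeps its name; later ones get a counter suffix
        result.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return result

# Hypothetical frame with a repeated column name (illustration only)
test_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])
test_df.columns = make_unique(test_df.columns)

print(list(test_df.columns))
```

This simple version assumes no existing column is already named like a generated suffix (for example "a_1"); if that can happen in your data, check columns.is_unique again after renaming. The same renaming should be applied to the training data so that the feature names match at prediction time.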