pandas ValueError: array length does not match index length
Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/37063350/
Asked by Pavan Vasan
I am practicing for contests like Kaggle and have been trying to use XGBoost while getting familiar with third-party Python libraries such as pandas and NumPy.
I have been reviewing scripts from this particular competition called the Santander Customer Satisfaction Classification and I have been modifying different forked scripts in order to experiment on them.
Here is one modified script through which I am trying to implement XGBoost:
import pandas as pd
from sklearn import cross_validation as cv
import xgboost as xgb
df_train = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/train.csv")
df_test = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/test.csv")
df_train = df_train.replace(-999999,2)
id_test = df_test['ID']
y_train = df_train['TARGET'].values
X_train = df_train.drop(['ID','TARGET'], axis=1).values
X_test = df_test.drop(['ID'], axis=1).values
X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)
clf = xgb.XGBClassifier(objective='binary:logistic',
                        missing=9999999999,
                        max_depth = 7,
                        n_estimators=200,
                        learning_rate=0.1,
                        nthread=4,
                        subsample=1.0,
                        colsample_bytree=0.5,
                        min_child_weight = 3,
                        reg_alpha=0.01,
                        seed=7)
clf.fit(X_train, y_train, early_stopping_rounds=50, eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)])
y_pred = clf.predict_proba(X_test)
print("Cross validating and checking the score...")
scores = cv.cross_val_score(clf, X_train, y_train)
'''
test = []
result = []
for each in id_test:
    test.append(each)
for each in y_pred[:,1]:
    result.append(each)
print len(test)
print len(result)
'''
submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
#submission = pd.DataFrame({"ID":test, "TARGET":result})
submission.to_csv("submission_XGB_Pavan.csv", index=False)
Here is the stack trace:
Traceback (most recent call last):
  File "/Users/pavan7vasan/Documents/workspace/Machine_Learning_Project/Kaggle/XG_Boost.py", line 45, in <module>
    submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 214, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 341, in _init_dict
    dtype=dtype)
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4798, in _arrays_to_mgr
    index = extract_index(arrays)
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4856, in extract_index
    raise ValueError(msg)
ValueError: array length 30408 does not match index length 75818
I have tried the solutions I found by searching, but I cannot figure out what the mistake is. What am I doing wrong? Please let me know.
Answered by Anton Protopopov
The problem is that you are defining X_test twice, as @maxymoo mentioned. First you defined it as
X_test = df_test.drop(['ID'], axis=1).values
And then you redefine that with:
X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)
This means X_test now has a size equal to 0.4*len(X_train), i.e. 40% of the training rows, which is the 30408 reported in the error message. Then, after:
y_pred = clf.predict_proba(X_test)
you have predictions only for that held-out part of X_train, and you are trying to create a DataFrame from them together with the initial id_test, which has the length of the original X_test (the 75818 in the error message).
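The mismatch can be reproduced on a small scale. The sketch below uses hypothetical toy arrays (10 training rows, 7 test rows) instead of the Santander data, and assumes a recent scikit-learn where train_test_split is imported from sklearn.model_selection rather than sklearn.cross_validation. It only illustrates how overwriting X_test changes its length; it is not the original script.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 training rows and 7 test rows.
X_train = np.arange(20).reshape(10, 2)
y_train = np.array([0, 1] * 5)
id_test = pd.Series(range(100, 107), name="ID")
X_test = np.arange(14).reshape(7, 2)

# Same pattern as in the question: X_test is overwritten by the split,
# so it now holds 40% of the training rows instead of the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X_train, y_train, random_state=1301, test_size=0.4
)
print(len(X_test), len(id_test))  # 4 7

# Placeholder predictions: one value per row of the overwritten X_test.
fake_pred = np.zeros(len(X_test))

mismatch_error = None
try:
    # A Series of length 7 combined with an array of length 4.
    pd.DataFrame({"ID": id_test, "TARGET": fake_pred})
except ValueError as err:
    mismatch_error = err
    print(err)
```

Comparing len(id_test) with len(y_pred) right before building the submission is a quick way to spot this kind of problem.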
You could use names such as X_fit and X_eval for the outputs of train_test_split so that you do not overwrite the initial X_train and X_test. Your cross-validation also runs on the overwritten X_train, which means you will not get the right answer, or your CV score will be inaccurate compared with the public/private score.
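A minimal sketch of that renaming follows. The assumptions here are mine, not the answer's: the data is synthetic rather than read from train.csv and test.csv, LogisticRegression is a stand-in for xgb.XGBClassifier so that the example runs without XGBoost, and train_test_split comes from sklearn.model_selection as in recent scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for train.csv (100 rows) and test.csv (25 rows).
df_train = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
df_train.insert(0, "ID", np.arange(100))
noise = rng.normal(scale=0.5, size=100)
df_train["TARGET"] = (df_train["f1"] + noise > 0).astype(int)
df_test = pd.DataFrame(rng.normal(size=(25, 3)), columns=["f1", "f2", "f3"])
df_test.insert(0, "ID", np.arange(1000, 1025))

id_test = df_test["ID"]
y_train = df_train["TARGET"].values
X_train = df_train.drop(["ID", "TARGET"], axis=1).values
X_test = df_test.drop(["ID"], axis=1).values

# New names for the split, so X_train and X_test keep their meaning.
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_train, y_train, random_state=1301, test_size=0.4
)

# Stand-in model; with XGBoost, (X_eval, y_eval) would go into eval_set.
clf = LogisticRegression()
clf.fit(X_fit, y_fit)

# Predict on the real test features: one row per ID in id_test.
y_pred = clf.predict_proba(X_test)

submission = pd.DataFrame({"ID": id_test, "TARGET": y_pred[:, 1]})
print(submission.shape)
```

Because X_train is left untouched, a later cross_val_score(clf, X_train, y_train) call would also see the full training set rather than the 60% used for fitting.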