pandas ValueError: Found input variables with inconsistent numbers of samples: [100, 7]
Original URL: http://stackoverflow.com/questions/47661149/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
ValueError: Found input variables with inconsistent numbers of samples: [100, 7]
Asked by Quentin Clayton
I am currently trying to have the program guess the animal based on the features included in the zoo database. When I run this code, I get the error "ValueError: Found input variables with inconsistent numbers of samples: [100, 7]". The traceback shows that the error occurs on this line: "X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=testing_size, random_state=seed)"
def zoo_that():
    zoodatabase = pd.read_csv('C:/Users/Quentin Clayton/Documents/Class work/Quarter 9/Data Analytics Project I/Final Project for Project Course/zoo.csv', header=0)
    classtypes = pd.read_csv('C:/Users/Quentin Clayton/Documents/Class work/Quarter 9/Data Analytics Project I/Final Project for Project Course/class.csv', header=0)
    zoodatabase_v2 = zoodatabase.merge(classtypes, how='left', left_on='class_type', right_on='Class_Number')
    X = zoodatabase_v2.loc[:, 'hair':'catsize']
    Y = zoodatabase_v2.loc[:, 'class_type':'Class_Number']
    testing_size = 0.2
    seed = 2
    X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=testing_size, random_state=seed)
    # Test options and evaluation metric
    scoring = 'accuracy'
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))
    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
        kfold = model_selection.KFold(n_splits=4, random_state=seed)
        cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
    # Make predictions on validation dataset
    LR = LogisticRegression()
    LR.fit(X_train, Y_train)
    predictions = LR.predict(X_validation)
    print("Accuracy score\n", accuracy_score(Y_validation, predictions))
    print("Confusion matrix\n", confusion_matrix(Y_validation, predictions))
    print("Final Report\n", classification_report(Y_validation, predictions))
    print(scoring)

zoo_that()
Traceback (most recent call last):
  File "<ipython-input-20-396e334d1676>", line 1, in <module>
    zoo_that()
  File "C:/Users/Quentin Clayton/Documents/Class work/Quarter 9/Data Analytics Project I/Final Project for Project Course/Final Submission.py", line 35, in zoo_that
    X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=testing_size, random_state=seed)
  File "D:\Anaconda\lib\site-packages\sklearn\model_selection\_split.py", line 2031, in train_test_split
    arrays = indexable(*arrays)
  File "D:\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 229, in indexable
    check_consistent_length(*result)
  File "D:\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 204, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [100, 7]
Pictures of the files:
This is the Class Csv: https://i.stack.imgur.com/OaJmO.jpg
This is the Zoo Csv: https://i.stack.imgur.com/FL0by.jpg
Accepted answer by havanagrawal
The problem is with this part:
X = zoodatabase_v2.loc[1:101,'hair':'catsize']
Y = zoodatabase_v2.loc[0:6,'Class_Type':'Animal_Names']
X is a DataFrame with 100 rows (labels 1 through 100 from the slice 1:101, presumably because the index ends at 100), and Y is a DataFrame with 7 rows (labels 0 through 6, because .loc slices include both endpoints). That is where the [100, 7] in the error message comes from. To train a model (supervised learning), you need to provide a target label for every input record. You also need a single target column, whereas you are currently passing two ('Class_Type' and 'Animal_Names'). If you remove the row subsetting and select one target column, it should work, i.e.
X = zoodatabase_v2.loc[:, 'hair':'catsize']
Y = zoodatabase_v2.loc[:, 'Class_Type']
should work fine.
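To see the mismatch without the original CSV files, here is a minimal sketch that reproduces it on a made-up table. The DataFrame `df` and its values are invented stand-ins for the merged zoo data (only the column names come from the question), so treat it as an illustration of the row counts rather than the asker's actual data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the merged zoo table: 101 rows, default index 0..100.
df = pd.DataFrame({
    "hair": [i % 2 for i in range(101)],
    "catsize": [(i // 2) % 2 for i in range(101)],
    "Class_Type": ["Mammal" if i % 2 else "Bird" for i in range(101)],
    "Animal_Names": ["placeholder"] * 101,
})

# .loc slices by label and includes both endpoints.
X_bad = df.loc[1:101, "hair":"catsize"]           # labels 1..100 -> 100 rows
Y_bad = df.loc[0:6, "Class_Type":"Animal_Names"]  # labels 0..6 -> 7 rows, 2 columns
print(X_bad.shape, Y_bad.shape)

# Different row counts make train_test_split raise the ValueError from the question.
try:
    train_test_split(X_bad, Y_bad, test_size=0.2, random_state=2)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Keep every row and pick a single target column, as suggested above.
X = df.loc[:, "hair":"catsize"]
Y = df.loc[:, "Class_Type"]
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.2, random_state=2
)
print(X_train.shape, Y_train.shape)
```

Printing `X.shape` and `Y.shape` right before the split is a quick way to catch this kind of problem: the first numbers must match.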