Python ValueError: x and y must be the same size

Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/41659535/

Time: 2020-08-20 01:22:58  Source: igfitidea

ValueError: x and y must be the same size

Tags: python, csv, numpy, matplotlib, machine-learning

Asked by user3521180

import numpy as np
import pandas as pd
import matplotlib.pyplot as pt

data1 = pd.read_csv('stage1_labels.csv')

X = data1.iloc[:, :-1].values
y = data1.iloc[:, 1].values

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_X = LabelEncoder()
X[:,0] = label_X.fit_transform(X[:,0])
encoder = OneHotEncoder(categorical_features = [0])
X = encoder.fit_transform(X).toarray()

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size = 0.4, random_state = 0)

#fitting Simple Regression to training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

#predicting the test set results
y_pred = regressor.predict(X_test)

#Visualization of the training set results
pt.scatter(X_train, y_train, color = 'red')
pt.plot(X_train, regressor.predict(X_train), color = 'green')
pt.title('salary vs yearExp (Training set)')
pt.xlabel('years of experience')
pt.ylabel('salary')
pt.show()

I need help understanding the error I get when executing the above code. Below is the error:

"raise ValueError("x and y must be the same size")"

I have a .csv file with 1398 rows and 2 columns. I have taken 40% as the test set (y_test), as can be seen in the code above.

Answered by Lukasz Tracewski

Print the shape of X_train. What do you see? I'd bet X_train is 2D (a matrix with a single column), while y_train is 1D (a vector). As a result, you get different sizes.

I think using X_train[:,0] for plotting (which is where the error originates) should solve the problem.
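
A minimal NumPy-only sketch of the shape check suggested above, using made-up data rather than the asker's CSV. It assumes X_train ends up with more than one column (for example after one-hot encoding), because as far as I can tell scatter compares the total number of elements in x and y, so the two sizes only differ once X_train has several columns. The array contents and the column count of 3 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 6

# Hypothetical stand-ins for the arrays in the question
X_train = rng.random((n_rows, 3))   # 2-D: 6 rows, 3 feature columns
y_train = rng.random(n_rows)        # 1-D: 6 target values

print(X_train.shape, X_train.size)  # (6, 3) 18
print(y_train.shape, y_train.size)  # (6,) 6

# Selecting a single column gives a 1-D array with one value per row,
# which matches y_train and can be passed to a scatter plot.
x_col = X_train[:, 0]
print(x_col.shape, x_col.size)      # (6,) 6
```

If the printed sizes differ, plotting one column at a time (as with X_train[:,0]) is one way to make them agree.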

Answered by yogabonito

Slicing with [:, :-1] will give you a 2-dimensional array (including all rows and all columns except the last one).

Slicing with [:, 1] will give you a 1-dimensional array (including all rows from the second column). To make this array 2-dimensional as well, use [:, 1:2] or [:, 1].reshape(-1, 1) or [:, 1][:, None] instead of [:, 1]. This will make x and y comparable.
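
To see the three 2-dimensional variants side by side, here is a small sketch on a made-up 5x2 array (the name data and its contents are only for illustration, not the asker's file):

```python
import numpy as np

# Hypothetical 5x2 table standing in for the two-column CSV
data = np.arange(10).reshape(5, 2)

col_1d = data[:, 1]                 # plain integer index: 1-D
a = data[:, 1:2]                    # slice keeps the column axis: 2-D
b = data[:, 1].reshape(-1, 1)       # reshape the 1-D result into a column
c = data[:, 1][:, None]             # add a new axis to the 1-D result

print(col_1d.shape)                 # (5,)
print(a.shape, b.shape, c.shape)    # (5, 1) (5, 1) (5, 1)
print(np.array_equal(a, b) and np.array_equal(b, c))  # True
```

All three forms hold the same values; which one to use is mostly a matter of taste.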

An alternative to making both arrays 2-dimensional is making them both 1-dimensional. For this, one would use [:, 0] (instead of [:, :1]) to select the first column and [:, 1] to select the second column.
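
The 1-dimensional alternative can be sketched the same way. Again the array data is a made-up 5x2 example; with exactly two columns, [:, :-1] and [:, :1] happen to select the same thing.

```python
import numpy as np

# Hypothetical 5x2 table standing in for the two-column CSV
data = np.arange(10).reshape(5, 2)

features_2d = data[:, :-1]   # all columns except the last: still 2-D
first_2d = data[:, :1]       # slice up to column 1: also 2-D
x = data[:, 0]               # integer index: 1-D first column
y = data[:, 1]               # integer index: 1-D second column

print(features_2d.shape, first_2d.shape)  # (5, 1) (5, 1)
print(x.shape, y.shape)                   # (5,) (5,)
```

Keep in mind that scikit-learn estimators such as LinearRegression expect a 2-dimensional X in fit, so the 1-dimensional form is mainly convenient for plotting.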

Answered by PdF

In my case the problem was that the number of observations implied by test_size differed from the range used for the scatter plot. The range should match the test_size share (40% in your code) of the total observations. Here you should set the range of your scatter plot to 40% of the total observations that you are processing in your model.
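
A rough sketch of that idea, assuming the x values of the plot are just positions 0, 1, 2, ... and that the test share is rounded up (which I believe matches how scikit-learn computes the test count, but treat it as an assumption). The split below is done by hand with NumPy on made-up targets; it is not the asker's data or train_test_split itself.

```python
import math
import numpy as np

n_total = 1398        # row count stated in the question
test_fraction = 0.4   # test_size from the question

# Assumption: the test share is rounded up to a whole number of rows
n_test = math.ceil(test_fraction * n_total)

rng = np.random.default_rng(0)
y = rng.random(n_total)                 # made-up target values
shuffled = rng.permutation(n_total)     # hand-rolled shuffle for the split
y_test = y[shuffled[:n_test]]

# The x range for the plot must have as many points as y_test,
# not as many as the full data set.
x_positions = np.arange(y_test.size)
print(n_test, x_positions.size, y_test.size)  # 560 560 560
```

Using np.arange(n_total) here instead would give 1398 x values against 560 y values, which is the kind of mismatch described above.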
