Python sklearn issue: Found arrays with inconsistent numbers of samples when doing regression

Question

Asked by pyman

this question seems to have been asked before, but I can't seem to comment for further clarification on the accepted answer and I couldn't figure out the solution provided.

I am trying to learn how to use sklearn with my own data. I essentially just got the annual % change in GDP for 2 different countries over the past 100 years. I am just trying to learn using a single variable for now. What I am essentially trying to do is use sklearn to predict what the GDP % change for country A will be given the percentage change in country B's GDP.

The problem is that I receive an error saying:

ValueError: Found arrays with inconsistent numbers of samples: [ 1 107]

Here is my code:

import sklearn.linear_model as lm
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import matplotlib.dates as mdates


def bytespdate2num(fmt, encoding='utf-8'):#function to convert bytes to string for the dates.
    strconverter = mdates.strpdate2num(fmt)
    def bytesconverter(b):
        s = b.decode(encoding)
        return strconverter(s)
    return bytesconverter

dataCSV = open('combined_data.csv')

comb_data = []

for line in dataCSV:
    comb_data.append(line)

date, chngdpchange, ausgdpchange = np.loadtxt(comb_data, delimiter=',', unpack=True, converters={0: bytespdate2num('%d/%m/%Y')})


chntrain = chngdpchange[:-1]
chntest = chngdpchange[-1:]

austrain = ausgdpchange[:-1]
austest = ausgdpchange[-1:]

regr = lm.LinearRegression()
regr.fit(chntrain, austrain)

print('Coefficients: \n', regr.coef_)

print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(chntest) - austest) ** 2))

print('Variance score: %.2f' % regr.score(chntest, austest))

plt.scatter(chntest, austest,  color='black')
plt.plot(chntest, regr.predict(chntest), color='blue')

plt.xticks(())
plt.yticks(())

plt.show()

What am I doing wrong? I essentially tried to apply the sklearn tutorial (They used some diabetes data set) to my own simple data. My data just contains the date, country A's % change in GDP for that specific year, and country B's % change in GDP for that same year.

I tried the solutions here and here (basically trying to find out more about the solution in the first link), but I just receive the exact same error.

Here is the full traceback in case you want to see it:

Traceback (most recent call last):
  File "D:\My Stuff\Dropbox\Python\Python projects\test regression\tester.py", line 34, in <module>
    regr.fit(chntrain, austrain)
  File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\linear_model\base.py", line 376, in fit
    y_numeric=True, multi_output=True)
  File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\utils\validation.py", line 454, in check_X_y
    check_consistent_length(X, y)
  File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\utils\validation.py", line 174, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [  1 107]

Answer 1

Answered by IVlad

regr.fit(chntrain, austrain)

This doesn't look right. The first parameter to fit should be X, the feature matrix: a 2-D array with one row per sample and one column per feature. The second parameter should be y, the vector of correct answers (targets) associated with the rows of X.

For example, if you have GDP, you might have:

X[0] = [43, 23, 52] -> y[0] = 5
# meaning the first year had the features [43, 23, 52] (I just made them up)
# and the change that year was 5
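
To make the expected shapes concrete, here is a minimal sketch with made-up numbers (the values and the choice of four samples are assumptions for illustration only, not the asker's data). X has one row per sample and y has one entry per row of X:

```python
import numpy as np

# Hypothetical data: 4 years (samples), 3 made-up features per year.
X = np.array([
    [43.0, 23.0, 52.0],
    [41.0, 25.0, 50.0],
    [45.0, 22.0, 55.0],
    [40.0, 27.0, 49.0],
])

# One made-up target per year, aligned with the rows of X.
y = np.array([5.0, 4.2, 5.6, 3.9])

print(X.shape)                   # (4, 3): (n_samples, n_features)
print(y.shape)                   # (4,):   (n_samples,)
print(X.shape[0] == y.shape[0])  # True: the sample counts agree
```

The "inconsistent numbers of samples" error is raised when the first dimension of X and the length of y disagree.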

Judging by your names, both chntrain and austrain are feature vectors. Judging by how you load your data, maybe the last column is the target?

Maybe you need to do something like:

chntrain_X, chntrain_y = chntrain[:, :-1], chntrain[:, -1]
# you can do the same with austrain and concatenate them or test on them if this part works
regr.fit(chntrain_X, chntrain_y)

But we can't tell without knowing the exact storage format of your data.

Answer 2

Answered by qg_jinn

Try changing chntrain to a 2-D array instead of 1-D, i.e. reshape it to (len(chntrain), 1).

For prediction, also change chntest to a 2-D array.
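
A minimal sketch of this suggestion, assuming the same variable names and train/test split as in the question. The two series are synthetic stand-ins generated with NumPy, not the asker's CSV, and X_train and X_test are names introduced here for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the two GDP series (not the asker's data).
rng = np.random.default_rng(0)
chngdpchange = rng.normal(8.0, 2.0, size=108)
ausgdpchange = 0.3 * chngdpchange + rng.normal(0.0, 0.5, size=108)

# Same split as in the question: everything but the last value is training data.
chntrain, chntest = chngdpchange[:-1], chngdpchange[-1:]
austrain, austest = ausgdpchange[:-1], ausgdpchange[-1:]

# X must be 2-D with shape (n_samples, n_features); y may stay 1-D.
X_train = chntrain.reshape(len(chntrain), 1)
X_test = chntest.reshape(len(chntest), 1)

regr = LinearRegression()
regr.fit(X_train, austrain)

print(X_train.shape)               # (107, 1)
print(regr.coef_.shape)            # (1,)
print(regr.predict(X_test).shape)  # (1,)
```

Note that the split in the question leaves a single test sample, and R² is not well defined for one point, so regr.score(X_test, austest) is not informative there. Holding out more rows would give a usable score.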

Answer 3

Answered by Chang Men

In fit(X, y), the input parameter X is supposed to be a 2-D array. If X in your data is only one-dimensional, you can reshape it into a 2-D array like this:

regr.fit(chntrain_X.reshape(len(chntrain_X), 1), chntrain_Y)

Answer 4

Answered by bobo

I have been having similar problems to you and have found a solution.

Where you have the following error:

ValueError: Found arrays with inconsistent numbers of samples: [  1 107]

The [ 1 107] part is basically saying that your array is the wrong way around. Sklearn thinks you have 107 columns of data with 1 row.

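To fix this, try transposing the X data. Note that .T returns a new array rather than changing chntrain in place, so the result has to be assigned. It also only helps if chntrain is already 2-D with shape (1, 107); on a 1-D array .T is a no-op, and chntrain.reshape(-1, 1) is needed instead:

chntrain = chntrain.T

Then re-run your fit:

regr.fit(chntrain, austrain)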

Depending on what your "austrain" data looks like you may need to transpose this too.
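
Whether transposing helps depends on the dimensionality of the array. The sketch below uses a made-up array of 107 values (an assumption matching the sample count in the error message) to show that .T leaves a 1-D array unchanged but turns a (1, 107) row into a (107, 1) column:

```python
import numpy as np

x = np.arange(107, dtype=float)    # 1-D array, shape (107,)
print(x.T.shape)                   # (107,): transposing a 1-D array changes nothing

row = x.reshape(1, -1)             # 2-D row, shape (1, 107)
print(row.T.shape)                 # (107, 1): now one sample per row

col = x.reshape(-1, 1)             # explicit column, works directly on 1-D input
print(np.array_equal(row.T, col))  # True
```

For a 1-D array such as the one returned by np.loadtxt with unpack=True, reshape(-1, 1) is therefore the more direct route.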

Answer 5

Answered by Cloud Cho

You may use np.newaxis as well, for example X = X[:, np.newaxis]. I found the method at Logistic function.
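
As a small illustration with made-up values, indexing with np.newaxis adds a trailing axis of length 1 and gives the same column as reshape(-1, 1):

```python
import numpy as np

X = np.array([1.5, 2.0, 3.5])  # 1-D, shape (3,)

X_col = X[:, np.newaxis]       # add a trailing axis of length 1
print(X_col.shape)             # (3, 1)

# np.newaxis is an alias for None, so X[:, None] does the same thing.
print(np.array_equal(X_col, X[:, None]))         # True
print(np.array_equal(X_col, X.reshape(-1, 1)))   # True
```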
