pandas: NumPy array, Data must be 1-dimensional

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/45112856/

Time: 2020-09-14 04:00:41  Source: igfitidea

Numpy Array, Data must be 1-dimensional

python matlab pandas numpy

Asked by Robert Garrison

I am attempting to reproduce MATLAB code in Python and am stumbling over a MATLAB matrix. The block of MATLAB code is below:

for i = 1:Np
    y = returns(:,i);
    sgn = modified_sign(y); 
    X = [ones(Tp,1) sgn.*log(prices(:,i).*volumes(:,i))];

I am having a hard time creating 'X' without getting the "Data must be 1-dimensional" error. Below is one of my many attempts to reproduce this section of code:

lam = np.empty([Tp,Np]) * np.nan
for i in range(0,Np):
    y=returns.iloc[:,i]
    sgn = modified_sign(y)
    #X = np.array([[np.ones([Tp,1]),np.multiply(np.multiply(sgn,np.log(prices.iloc[:,i])),volumes.iloc[:,i])]])
    X = np.concatenate([np.ones([Tp,1]),np.column_stack(np.array([sgn*np.log(prices.iloc[:,i])*volumes[:,i]]))],axis=1)

Tp and Np are the length and width of the prices series

Tp, Np = crsp['PRC'].to_frame().shape  # shape is (9455, 1)

Tr and Nr are the length and width of the returns series

Tr, Nr = crsp['RET'].to_frame().shape  # shape is (9455, 1)

Tv and Nv are the length and width of the volume series

Tv, Nv = crsp['VOL'].to_frame().shape  # shape is (9455, 1)

The ones array:

np.ones([Tp,1])

would be (9455,1)

Sample Volume Data:

    DATE    VOLAVG
1979-12-04  8880.9912591051
1979-12-05  8867.545284586622
1979-12-06  8872.264687564875
1979-12-07  8876.922134551494
1979-12-10  8688.765365448506
1979-12-11  8695.279567657451
1979-12-12  8688.865033222592
1979-12-13  8684.095435684647
1979-12-14  8684.534550736667
1979-12-17  8879.694444444445

Sample Price Data

    DATE    AVGPRC
1979-12-04  25.723484200567693
1979-12-05  25.839463450495863
1979-12-06  26.001899852224145
1979-12-07  25.917628864251874
1979-12-10  26.501898917349788
1979-12-11  26.448652367425804
1979-12-12  26.475906537182407
1979-12-13  26.519610746585908
1979-12-14  26.788873713159944
1979-12-17  26.38583047822484

Sample Return Data

    DATE    RET
1979-12-04  0.008092780873338423
1979-12-05  0.004498557619416754
1979-12-06  0.006266692192175238
1979-12-07  -0.0032462182943131523
1979-12-10  0.022292999386413825
1979-12-11  -0.002011180868938034
1979-12-12  0.001029925340138238
1979-12-13  0.0016493553247958206
1979-12-14  0.010102153877941776
1979-12-17  -0.015159499602784175

What I am ultimately trying to achieve is a (9455,2) array where X[:,0]=1 and X[:,1]=log(price)*volume for each row.

I consulted the online "NumPy for MATLAB users" document (https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html) and checked various other StackOverflow posts, to no avail.

For context, modified_sign is an external function, and prices and returns are both DataFrame slices. Np is the width (think df.shape[1]) of the price DataFrame and Tp is df.shape[0]. This is essentially creating a column of 1s and a column of log(price)*volume to be used in a regression for each series of returns, where each df is (TxN), T is the number of dates and N is the number of securities. Any guidance you can provide would be greatly appreciated.

Accepted answer by TheBlackCat

The problem is that numpy can have 1D arrays (vectors) while MATLAB cannot. So when you create the np.ones([Tp,1]) array, you get a 2D array where one dimension has a size of 1. In MATLAB, that is considered a "vector", but in numpy it isn't.

So what you need to do is give np.ones a single value. This will result in a vector (unlike in MATLAB, where it would result in a 2D square matrix). The same rule applies to np.zeros and any other function that takes dimensions as inputs.

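As a rough illustration of the difference described above, the following sketch compares the two calls. The value of Tp here is a small stand-in chosen for the example, not the 9455 rows from the question.

```python
import numpy as np

Tp = 5  # assumed small stand-in for the 9455 rows in the question

col_2d = np.ones([Tp, 1])  # 2D array with a single column
vec_1d = np.ones(Tp)       # 1D array (a vector in the numpy sense)

print(col_2d.shape)  # (5, 1)
print(vec_1d.shape)  # (5,)

# A single-column 2D array can be flattened to 1D when a vector is required.
flat = col_2d.ravel()
print(flat.shape)    # (5,)
```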

So this should work:

X = np.column_stack([np.ones(Tp), sgn*np.log(prices.iloc[:,i])*volumes.iloc[:,i]])
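
To see the line above run end to end, here is a minimal sketch on toy data. The modified_sign function is not shown in the question, so np.sign is used purely as a hypothetical placeholder, and the three small DataFrames are made-up stand-ins that borrow the column names from the sample data.

```python
import numpy as np
import pandas as pd

def modified_sign(y):
    # Hypothetical placeholder: the real modified_sign is not shown in the question.
    return np.sign(y)

# Toy stand-ins for the question's DataFrames (one security, three dates).
dates = pd.to_datetime(["1979-12-04", "1979-12-05", "1979-12-06"])
returns = pd.DataFrame({"RET": [0.0081, 0.0045, -0.0032]}, index=dates)
prices = pd.DataFrame({"AVGPRC": [25.72, 25.84, 26.00]}, index=dates)
volumes = pd.DataFrame({"VOLAVG": [8880.99, 8867.55, 8872.26]}, index=dates)

Tp, Np = prices.shape
for i in range(Np):
    y = returns.iloc[:, i]
    sgn = modified_sign(y)
    # Both inputs are 1D with length Tp, so column_stack returns shape (Tp, 2).
    X = np.column_stack(
        [np.ones(Tp), sgn * np.log(prices.iloc[:, i]) * volumes.iloc[:, i]]
    )

print(X.shape)
```

Note that the MATLAB line in the question computes sgn.*log(prices.*volumes), that is, the log of the product, whereas the Python expressions in the question and in this answer compute log(price)*volume. It may be worth checking which of the two is intended.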

That being said, you are losing most of the advantage of using pandas by doing it this way. It would be much better to combine the DataFrames into one, using the dates as the index, and then create a new column with the calculation. Assuming the dates are the indices, something like this should work (if the dates are not the indices, use set_index to make them the indices):

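import numpy as np
import pandas as pd
# Align the three frames on their shared date index.
data = pd.concat([returns, prices, volumes], axis=1)
data['sign'] = modified_sign(data['RET'])
data['X0'] = 1  # constant column
data['X1'] = data['sign']*np.log(data['AVGPRC'])*data['VOLAVG']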
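
The same approach can be sketched as a self-contained script. The values below are the first four sample rows from the question, rounded; modified_sign is again replaced by np.sign as a hypothetical placeholder, and the final conversion to a NumPy array is an added assumption about how the result would feed a regression.

```python
import numpy as np
import pandas as pd

def modified_sign(y):
    # Hypothetical placeholder: the real modified_sign is not shown in the question.
    return np.sign(y)

dates = ["1979-12-04", "1979-12-05", "1979-12-06", "1979-12-07"]

# Each frame starts with DATE as an ordinary column, then set_index makes it the index.
returns = pd.DataFrame({"DATE": dates, "RET": [0.00809, 0.00450, 0.00627, -0.00325]}).set_index("DATE")
prices = pd.DataFrame({"DATE": dates, "AVGPRC": [25.7235, 25.8395, 26.0019, 25.9176]}).set_index("DATE")
volumes = pd.DataFrame({"DATE": dates, "VOLAVG": [8880.99, 8867.55, 8872.26, 8876.92]}).set_index("DATE")

# Combine on the shared date index, then build the regressors as new columns.
data = pd.concat([returns, prices, volumes], axis=1)
data["sign"] = modified_sign(data["RET"])
data["X0"] = 1
data["X1"] = data["sign"] * np.log(data["AVGPRC"]) * data["VOLAVG"]

# If a plain array is still needed, select the two columns and convert.
X = data[["X0", "X1"]].to_numpy()
print(data)
print(X.shape)
```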

Of course you would replace X0 and X1 with more informative names, and I am not sure you even need X0 with this approach, but this would give you a much easier-to-work-with data structure.

Also, if your dates are strings you should convert them to pandas dates. They are much nicer to work with than strings.

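A short sketch of that conversion, assuming the dates arrive as strings in a column named DATE as in the sample data (the values are rounded sample rows):

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": ["1979-12-04", "1979-12-05", "1979-12-06"],
    "VOLAVG": [8880.99, 8867.55, 8872.26],
})

# Convert the string column to pandas datetimes, then use it as the index.
df["DATE"] = pd.to_datetime(df["DATE"])
df = df.set_index("DATE")

# With a DatetimeIndex, rows can be selected by date range.
subset = df.loc["1979-12-05":]
print(subset)
```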