Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/39674713/

Time: 2020-08-19 22:33:50  Source: igfitidea

Neural Network LSTM input shape from dataframe

Tags: python, pandas, keras, lstm

Asked by dreamer

I am trying to implement an LSTM with Keras.

I know that LSTMs in Keras require a 3D tensor with shape (nb_samples, timesteps, input_dim) as an input. However, I am not entirely sure what the input should look like in my case, as I have just one sample of T observations for each input, not multiple samples, i.e. (nb_samples=1, timesteps=T, input_dim=N). Is it better to split each of my inputs into samples of length T/M? T is around a few million observations for me, so how long should each sample be in that case, i.e., how would I choose M?

Also, am I right in that this tensor should look something like:

[[[a_11, a_12, ..., a_1M], [a_21, a_22, ..., a_2M], ..., [a_N1, a_N2, ..., a_NM]], 
 [[b_11, b_12, ..., b_1M], [b_21, b_22, ..., b_2M], ..., [b_N1, b_N2, ..., b_NM]], 
 ..., 
 [[x_11, x_12, ..., x_1M], [x_21, x_22, ..., x_2M], ..., [x_N1, x_N2, ..., x_NM]]]

where M and N defined as before and x corresponds to the last sample that I would have obtained from splitting as discussed above?

Finally, given a pandas dataframe with T observations in each column, and N columns, one for each input, how can I create such an input to feed to Keras?

Answered by Andrew

Below is an example that sets up time series data to train an LSTM. The model output is nonsense as I only set it up to demonstrate how to build the model.

import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
df.head()

Time series dataframe:

         Date      A       B       C      D      E      F      G
0  2008-03-18  24.68  164.93  114.73  26.27  19.21  28.87  63.44
1  2008-03-19  24.18  164.89  114.75  26.22  19.07  27.76  59.98
2  2008-03-20  23.99  164.63  115.04  25.78  19.01  27.04  59.61
3  2008-03-25  24.14  163.92  114.85  27.41  19.61  27.84  59.41
4  2008-03-26  24.44  163.45  114.84  26.86  19.53  28.02  60.09

You can put the inputs into a vector and then use the pandas .cumsum() function to build the sequence for the time series:

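# Assumption: the original does not define input_cols; six input columns match the shapes shown later
input_cols = ['A', 'B', 'C', 'D', 'E', 'F']
# Put your inputs into a single list
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)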
# Double-encapsulate list so that you can sum it in the next step and keep time steps as separate elements
df['single_input_vector'] = df.single_input_vector.apply(lambda x: [list(x)])
# Use .cumsum() to include previous row vectors in the current row list of vectors
df['cumulative_input_vectors'] = df.single_input_vector.cumsum()
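
To see why a cumulative sum produces growing sequences, here is a minimal sketch in plain Python. It uses toy values rather than the dataset above, and itertools.accumulate stands in for the pandas call. The only thing it relies on is that adding two lists concatenates them.

```python
from itertools import accumulate

# Toy rows standing in for df[input_cols]: three time steps, two features each.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Wrap each row in an outer list, mirroring the "double-encapsulate" step.
wrapped = [[row] for row in rows]

# Adding lists concatenates them, so a running sum grows the history row by row.
cumulative = list(accumulate(wrapped, lambda a, b: a + b))

for seq in cumulative:
    print(len(seq), seq)
```

Each entry holds every row seen so far, which is why the sequences have different lengths and need padding before they can be fed to a model.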

The output can be set up in a similar way, but it will be a single vector instead of a sequence:

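# Assumption: the original does not define output_cols; a single output column matches output_dim = 1 below
output_cols = ['G']
# If your output is multi-dimensional, you need to capture those dimensions in one object
# If your output is a single dimension, this step may be unnecessary
df['output_vector'] = df[output_cols].apply(tuple, axis=1).apply(list)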

The input sequences have to be the same length to run them through the model, so you need to pad them to be the max length of your cumulative vectors:

# Pad your sequences so they are the same length
from keras.preprocessing.sequence import pad_sequences

max_sequence_length = df.cumulative_input_vectors.apply(len).max()
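# Save it as a list
# pad_sequences defaults to dtype='int32', which would truncate the float inputs, so request floats explicitly
padded_sequences = pad_sequences(df.cumulative_input_vectors.tolist(), maxlen=max_sequence_length, dtype='float32').tolist()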
df['padded_input_vectors'] = pd.Series(padded_sequences).apply(np.asarray)
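
For intuition about what the padding step does, here is a small NumPy sketch. The helper name pad_pre is hypothetical, and the sketch assumes the default behaviour of pad_sequences as I understand it: zero rows are added at the start of each shorter sequence.

```python
import numpy as np

def pad_pre(sequences, max_len, n_features):
    """Left-pad each (timesteps, n_features) sequence with zero rows up to max_len."""
    out = np.zeros((len(sequences), max_len, n_features), dtype="float32")
    for i, seq in enumerate(sequences):
        seq = np.asarray(seq, dtype="float32")
        out[i, max_len - len(seq):, :] = seq  # real rows go at the end
    return out

# Toy cumulative sequences of length 1, 2 and 3 with two features per step.
seqs = [
    [[1.0, 2.0]],
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
]
padded = pad_pre(seqs, max_len=3, n_features=2)
print(padded.shape)
print(padded[0])
```

The result is already a regular 3D array, since every sequence now has the same number of time steps.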

Training data can be pulled from the dataframe and put into numpy arrays. Note that the input data that comes out of the dataframe will not make a 3D array. It makes an array of arrays, which is not the same thing.

You can use np.stack to build a 3D input array. (Calling hstack and then reshape gives the right shape, but it interleaves rows from different samples, so it is avoided here.)

# Extract your training data
X_train_init = np.asarray(df.padded_input_vectors)
# Stack the per-row 2D arrays along a new first axis to get a 3D array (samples, timesteps, features)
X_train = np.stack(X_train_init)
# Each output is a flat vector, so hstack followed by reshape keeps the rows in order
y_train = np.hstack(np.asarray(df.output_vector)).reshape(len(df),len(output_cols))

To prove it:

>>> print(X_train_init.shape)
(11,)
>>> print(X_train.shape)
(11, 11, 6)
>>> print(X_train == X_train_init)
False
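
The shapes alone do not show whether each sample kept its own rows. The following sketch uses toy arrays (not the dataset above) to compare two ways of turning an array of arrays into a 3D array: np.stack, and np.hstack followed by reshape.

```python
import numpy as np

# Two toy samples, each with 3 time steps and 2 features.
a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([[10, 20], [30, 40], [50, 60]])

# Object array of arrays, like the column pulled out of the dataframe.
arr_of_arrays = np.empty(2, dtype=object)
arr_of_arrays[0], arr_of_arrays[1] = a, b

stacked = np.stack([a, b])                    # new leading axis: (2, 3, 2)
reshaped = np.hstack([a, b]).reshape(2, 3, 2)  # same shape, different ordering

print(arr_of_arrays.shape, stacked.shape, reshaped.shape)
print(np.array_equal(stacked[0], a))   # stack keeps sample 0 intact
print(np.array_equal(reshaped[0], a))  # hstack + reshape mixes rows of a and b
print(reshaped[0])
```

With np.stack the first sample is exactly a. With hstack and reshape, the first "sample" contains rows taken alternately from a and b, even though the overall shape is the same.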

Once you have training data you can define the dimensions of your input layer and output layers.

# Get your input dimensions
# Input length is the length for one input sequence (i.e. the number of rows for your sample)
# Input dim is the number of dimensions in one input vector (i.e. number of input columns)
input_length = X_train.shape[1]
input_dim = X_train.shape[2]
# Output dimensions is the shape of a single output vector
# In this case it's just 1, but it could be more
output_dim = len(y_train[0])

Build the model:

from keras.models import Model, Sequential
from keras.layers import LSTM, Dense

# Build the model
model = Sequential()

# I arbitrarily picked the output dimensions as 4
# input_shape is (timesteps, features); older Keras versions also accepted input_dim and input_length
model.add(LSTM(4, input_shape=(input_length, input_dim)))
# The max output value is > 1 so relu is used as final activation.
model.add(Dense(output_dim, activation='relu'))

model.compile(loss='mean_squared_error',
              optimizer='sgd',
              metrics=['accuracy'])

Finally you can train the model and save the training log as history:

# Set batch_size to 7 to show that it doesn't have to be a factor or multiple of your sample size
history = model.fit(X_train, y_train,
              batch_size=7, epochs=3,  # the argument was named nb_epoch in Keras 1.x
              verbose = 1)

Output:

Epoch 1/3
11/11 [==============================] - 0s - loss: 3498.5756 - acc: 0.0000e+00     
Epoch 2/3
11/11 [==============================] - 0s - loss: 3498.5755 - acc: 0.0000e+00     
Epoch 3/3
11/11 [==============================] - 0s - loss: 3498.5757 - acc: 0.0000e+00 

That's it. Use model.predict(X), where X is in the same format as X_train (other than the number of samples), to make predictions from the model.

Answered by Andrew

Tensor shape

You're right that Keras is expecting a 3D tensor for an LSTM neural network, but I think the piece you are missing is that Keras expects that each observation can have multiple dimensions.

For example, in Keras I have used word vectors to represent documents for natural language processing. Each word in the document is represented by an n-dimensional numerical vector (so if n = 2, the word 'cat' would be represented by something like [0.31, 0.65]). To represent a single document, the word vectors are lined up in sequence (e.g. 'The cat sat.' = [[0.12, 0.99], [0.31, 0.65], [0.94, 0.04]]). A document would be a single sample in a Keras LSTM.
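
As a quick illustration of the shapes involved, the example document can be written as a NumPy array. The numbers are the illustrative word vectors from the paragraph above, not real embeddings.

```python
import numpy as np

# One document = one sample: 3 words (time steps), each a 2-dimensional vector.
document = np.array([[0.12, 0.99], [0.31, 0.65], [0.94, 0.04]])

# Add a leading axis so the array reads (nb_samples, timesteps, input_dim).
batch = document[np.newaxis, ...]

print(document.shape, batch.shape)
```

Here nb_samples is 1, timesteps is 3 (one per word), and input_dim is 2 (the length of each word vector).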

This is analogous to your time series observations. A document is like a time series, and a word is like a single observation in your time series; in your case the representation of each observation simply has n = 1 dimension.

Because of that, I think your tensor should be something like [[[a1], [a2], ... , [aT]], [[b1], [b2], ..., [bT]], ..., [[x1], [x2], ..., [xT]]], where a through x index the nb_samples samples, timesteps = T, and input_dim = 1, because each of your observations is only one number.
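
One way to get several samples out of a single long series is to cut it into fixed-length windows. The sketch below is only an illustration: the helper name to_windows and the window length are hypothetical, and the answer does not settle how long each window should be.

```python
import numpy as np

def to_windows(series, window_len):
    """Cut a 1-D series into non-overlapping windows; drop the incomplete tail."""
    series = np.asarray(series, dtype="float32")
    n_windows = len(series) // window_len
    trimmed = series[: n_windows * window_len]
    # Shape becomes (nb_samples, timesteps, input_dim) with input_dim = 1.
    return trimmed.reshape(n_windows, window_len, 1)

series = np.arange(10)  # stand-in for T = 10 observations
X = to_windows(series, window_len=4)
print(X.shape)
print(X[1, :, 0])
```

With 10 observations and windows of 4, two full windows fit and the last two observations are dropped. Overlapping windows are another option, but they are not shown here.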

Batch size

Batch size should be set to maximize throughput without exceeding the memory capacity on your machine, per this Cross Validated post. As far as I know, the number of samples in your input does not need to be a multiple of your batch size, either when training the model or when making predictions from it.
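
To illustrate why the sample count need not be a multiple of the batch size, here is a sketch of how samples are typically cut into batches. It reflects my assumption about the usual behaviour (a smaller final batch), not Keras's actual code, and batch_slices is a hypothetical helper.

```python
import math

def batch_slices(n_samples, batch_size):
    """Return (start, stop) index pairs; the last batch may be smaller."""
    n_batches = math.ceil(n_samples / batch_size)
    return [
        (i * batch_size, min((i + 1) * batch_size, n_samples))
        for i in range(n_batches)
    ]

# 11 samples with a batch size of 7, as in the training example.
slices = batch_slices(11, 7)
print(slices)
```

The 11 samples are split into one batch of 7 and one batch of 4.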

Examples

If you're looking for sample code, on the Keras GitHub there are a number of examples using LSTM and other network types that have sequenced input.
