Python: Understanding Keras LSTMs
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/38714959/
Understanding Keras LSTMs
Asked by sachinruk
I am trying to reconcile my understanding of LSTMs, as described in this post by Christopher Olah, with the implementation in Keras. I am following the blog written by Jason Brownlee for the Keras tutorial. What I am mainly confused about is:
- The reshaping of the data series into [samples, time steps, features], and
- The stateful LSTMs
Let's concentrate on the above two questions with reference to the code pasted below:
# reshape into X=t and Y=t+1
look_back = 3
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], look_back, 1))
testX = numpy.reshape(testX, (testX.shape[0], look_back, 1))
########################
# The IMPORTANT BIT
##########################
# create and fit the LSTM network
batch_size = 1
model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, look_back, 1), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
for i in range(100):
    model.fit(trainX, trainY, nb_epoch=1, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()
Note: create_dataset takes a sequence of length N and returns an (N - look_back) array, each element of which is a sequence of length look_back.
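For reference, a minimal sketch of what such a create_dataset helper could look like (this is an assumption reconstructed from the description above; the exact function in the tutorial may differ slightly):

import numpy

def create_dataset(dataset, look_back=1):
    # dataset: 1-D array of values; returns X windows of length look_back and the next value as Y
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back):
        dataX.append(dataset[i:i + look_back])   # X = values at t, t+1, ..., t+look_back-1
        dataY.append(dataset[i + look_back])     # Y = value at t+look_back
    return numpy.array(dataX), numpy.array(dataY)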
What are Time Steps and Features?
As can be seen, trainX is a 3-D array, with Time_steps and Feature being the last two dimensions respectively (3 and 1 in this particular code). With respect to the image below, does this mean that we are considering the many to one case, where the number of pink boxes is 3? Or does it literally mean the chain length is 3 (i.e. only 3 green boxes are considered)?
Does the features argument become relevant when we consider multivariate series? e.g. modelling two financial stocks simultaneously?
Stateful LSTMs
Do stateful LSTMs mean that we save the cell memory values between runs of batches? If this is the case, batch_size is one, and the memory is reset between the training runs, so what was the point of saying that it was stateful? I'm guessing this is related to the fact that the training data is not shuffled, but I'm not sure how.
Any thoughts? Image reference: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Edit 1:
A bit confused about @van's comment about the red and green boxes being equal. So just to confirm, do the following API calls correspond to the unrolled diagrams? Especially noting the second diagram (batch_size was arbitrarily chosen):
Edit 2:
For people who have done Udacity's deep learning course and are still confused about the time_step argument, look at the following discussion: https://discussions.udacity.com/t/rnn-lstm-use-implementation/163169
Update:
It turns out model.add(TimeDistributed(Dense(vocab_len))) was what I was looking for. Here is an example: https://github.com/sachinruk/ShakespeareBot
Update 2:
I have summarised most of my understanding of LSTMs here: https://www.youtube.com/watch?v=ywinX5wgdEU
Accepted answer by Van
First of all, you chose great tutorials (1, 2) to start with.
What time-step means: Time-steps==3 in X.shape (describing the data shape) means there are three pink boxes. Since in Keras each step requires an input, the number of green boxes should usually equal the number of red boxes, unless you hack the structure.
Many to many vs. many to one: In Keras, there is a return_sequences parameter when you initialize LSTM, GRU, or SimpleRNN. When return_sequences is False (the default), it is many to one, as shown in the picture; its return shape is (batch_size, hidden_unit_length), which represents the last state. When return_sequences is True, it is many to many; its return shape is (batch_size, time_step, hidden_unit_length).
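A small sketch to make the two return shapes concrete (the layer size of 16 and the dummy data are arbitrary, just to inspect shapes):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM

X = np.random.rand(32, 10, 8)   # (batch_size, time_step, input_dim)

many_to_one = Sequential()
many_to_one.add(LSTM(16, input_shape=(10, 8)))   # return_sequences=False by default
print(many_to_one.predict(X).shape)              # -> (32, 16): only the last state

many_to_many = Sequential()
many_to_many.add(LSTM(16, input_shape=(10, 8), return_sequences=True))
print(many_to_many.predict(X).shape)             # -> (32, 10, 16): one state per step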
Does the features argument become relevant: the feature argument means "how big is your red box", i.e. the input dimension at each step. If you want to predict from, say, 8 kinds of market information, then you can generate your data with feature==8.
Stateful: You can look up the source code. When initializing the state, if stateful==True, then the state from the last training batch will be used as the initial state; otherwise a new state is generated. I haven't turned on stateful yet. However, I disagree that batch_size can only be 1 when stateful==True.
Currently, you generate your data from already-collected data. Imagine your stock information arrives as a stream: rather than waiting a day to collect everything sequentially, you would like to generate input data online while training/predicting with the network. If you have 400 stocks sharing the same network, then you can set batch_size==400.
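A hedged sketch of that streaming setup (the hidden size, the Dense head, and the one-step-at-a-time feeding are illustrative assumptions, not part of the original answer):

from keras.models import Sequential
from keras.layers import LSTM, Dense

n_stocks, n_features = 400, 8
model = Sequential()
# stateful=True keeps one cell state per stock between calls; batch_input_shape fixes the batch size
model.add(LSTM(32, batch_input_shape=(n_stocks, 1, n_features), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# whenever a new time step arrives for all 400 stocks, feed it as one batch:
#   model.train_on_batch(x_t, y_t)   # x_t: (400, 1, 8), y_t: (400, 1)
# call model.reset_states() only when you want the sequences to start over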
Answer by Daniel Möller
As a complement to the accepted answer, this answer shows keras behaviors and how to achieve each picture.
General Keras behavior
The standard Keras internal processing is always many to many, as in the following picture (where I used features=2, pressure and temperature, just as an example):
In this image, I increased the number of steps to 5, to avoid confusion with the other dimensions.
For this example:
- We have N oil tanks
- We took measurements hourly for 5 hours (time steps)
- We measured two features:
  - Pressure P
  - Temperature T
Our input array should then be shaped as (N, 5, 2):
[ Step1 Step2 Step3 Step4 Step5
Tank A: [[Pa1,Ta1], [Pa2,Ta2], [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B: [[Pb1,Tb1], [Pb2,Tb2], [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
....
Tank N: [[Pn1,Tn1], [Pn2,Tn2], [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
]
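In code, an input of that shape could be assembled like this (a sketch with random numbers standing in for the real pressure/temperature readings):

import numpy as np

N, steps, features = 3, 5, 2               # 3 tanks, 5 hourly measurements, (P, T)
data = np.random.rand(N, steps, features)
print(data.shape)                           # -> (3, 5, 2); data[i, t] is [P, T] for tank i at hour t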
Inputs for sliding windows
Often, LSTM layers are supposed to process entire sequences. Dividing them into windows may not be the best idea. The layer has internal states about how a sequence is evolving as it steps forward. Windows eliminate the possibility of learning long sequences, limiting all sequences to the window size.
In windows, each window is part of one long original sequence, but Keras will see each of them as an independent sequence:
[ Step1 Step2 Step3 Step4 Step5
Window A: [[P1,T1], [P2,T2], [P3,T3], [P4,T4], [P5,T5]],
Window B: [[P2,T2], [P3,T3], [P4,T4], [P5,T5], [P6,T6]],
Window C: [[P3,T3], [P4,T4], [P5,T5], [P6,T6], [P7,T7]],
....
]
Notice that in this case, you initially have only one sequence, but you're dividing it into many sequences to create windows.
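A minimal sketch of cutting one long sequence into such overlapping windows (the window length of 5 matches the example above; the data is dummy):

import numpy as np

long_sequence = np.random.rand(100, 2)   # 100 steps of [P, T] from a single original sequence
window_size = 5

windows = np.array([long_sequence[i:i + window_size]
                    for i in range(len(long_sequence) - window_size + 1)])
print(windows.shape)   # -> (96, 5, 2): Keras will treat each of the 96 windows as its own sequence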
The concept of "what is a sequence" is abstract. The important parts are:
- you can have batches with many individual sequences
- what makes the sequences sequences is that they evolve in steps (usually time steps)
Achieving each case with "single layers"
Achieving standard many to many:
You can achieve many to many with a simple LSTM layer, using return_sequences=True:
outputs = LSTM(units, return_sequences=True)(inputs)
#output_shape -> (batch_size, steps, units)
Achieving many to one:
Using the exact same layer, Keras will do the exact same internal processing, but when you use return_sequences=False (or simply omit this argument), Keras will automatically discard the steps previous to the last:
outputs = LSTM(units)(inputs)
#output_shape -> (batch_size, units) --> steps were discarded, only the last was returned
Achieving one to many
Now, this is not supported by Keras LSTM layers alone. You will have to create your own strategy to multiply the steps. There are two good approaches:
- Create a constant multi-step input by repeating a tensor
- Use stateful=True to recurrently take the output of one step and serve it as the input of the next step (needs output_features == input_features)
One to many with repeat vector
In order to fit the standard Keras behavior, we need inputs in steps, so we simply repeat the inputs for the length we want:
outputs = RepeatVector(steps)(inputs) #where inputs is (batch,features)
outputs = LSTM(units,return_sequences=True)(outputs)
#output_shape -> (batch_size, steps, units)
Understanding stateful = True
Now comes one of the possible usages of stateful=True (besides avoiding loading data that can't fit into your computer's memory at once).
Stateful allows us to input "parts" of the sequences in stages. The difference is:
- In stateful=False, the second batch contains whole new sequences, independent from the first batch
- In stateful=True, the second batch continues the first batch, extending the same sequences.
It's like dividing the sequences into windows too, with these two main differences:
- these windows do not superpose!!
- stateful=True will see these windows connected as a single long sequence
In stateful=True, every new batch will be interpreted as continuing the previous batch (until you call model.reset_states()).
- Sequence 1 in batch 2 will continue sequence 1 in batch 1.
- Sequence 2 in batch 2 will continue sequence 2 in batch 1.
- Sequence n in batch 2 will continue sequence n in batch 1.
Example of inputs, batch 1 contains steps 1 and 2, batch 2 contains steps 3 to 5:
BATCH 1 BATCH 2
[ Step1 Step2 | [ Step3 Step4 Step5
Tank A: [[Pa1,Ta1], [Pa2,Ta2], | [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B: [[Pb1,Tb1], [Pb2,Tb2], | [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
.... |
Tank N: [[Pn1,Tn1], [Pn2,Tn2], | [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
] ]
Notice the alignment of tanks in batch 1 and batch 2! That's why we need shuffle=False (unless we are using only one sequence, of course).
You can have any number of batches, indefinitely. (For having variable lengths in each batch, use input_shape=(None, features).)
One to many with stateful=True
For our case here, we are going to use only 1 step per batch, because we want to get one output step and make it an input.
Please notice that the behavior in the picture is not "caused by" stateful=True. We will force that behavior in a manual loop below. In this example, stateful=True is what "allows" us to stop the sequence, manipulate what we want, and continue from where we stopped.
Honestly, the repeat approach is probably a better choice for this case. But since we're looking into stateful=True, this is a good example. The best way to use it is the next "many to many" case.
Layer:
outputs = LSTM(units=features,
               stateful=True,
               return_sequences=True,  # just to keep a nice output shape even with length 1
               input_shape=(None, features))(inputs)
#units = features because we want to use the outputs as inputs
#None because we want variable length
#output_shape -> (batch_size, steps, units)
Now, we're going to need a manual loop for predictions:
input_data = someDataWithShape((batch, 1, features))
#important, we're starting new sequences, not continuing old ones:
model.reset_states()
output_sequence = []
last_step = input_data
for i in steps_to_predict:
    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step
#end of the sequences
model.reset_states()
Many to many with stateful=True
Now, here, we get a very nice application: given an input sequence, try to predict its future unknown steps.
We're using the same method as in the "one to many" above, with the difference that:
- we will use the sequence itself as the target data, one step ahead
- we know part of the sequence (so we discard this part of the results).
Layer (same as above):
outputs = LSTM(units=features,
               stateful=True,
               return_sequences=True,
               input_shape=(None, features))(inputs)
#units = features because we want to use the outputs as inputs
#None because we want variable length
#output_shape -> (batch_size, steps, units)
Training:
We are going to train our model to predict the next step of the sequences:
totalSequences = someSequencesShaped((batch, steps, features))
#batch size is usually 1 in these cases (often you have only one Tank in the example)
X = totalSequences[:,:-1] #the entire known sequence, except the last step
Y = totalSequences[:,1:] #one step ahead of X
#loop for resetting states at the start/end of the sequences:
for epoch in range(epochs):
    model.reset_states()
    model.train_on_batch(X,Y)
Predicting:
The first stage of our predicting involves "adjusting the states". That's why we're going to predict the entire sequence again, even if we already know this part of it:
model.reset_states() #starting a new sequence
predicted = model.predict(totalSequences)
firstNewStep = predicted[:,-1:] #the last step of the predictions is the first future step
Now we go to the loop as in the one to many case. But don't reset states here! We want the model to know in which step of the sequence it is (and it knows it's at the first new step because of the prediction we just made above).
output_sequence = [firstNewStep]
last_step = firstNewStep
for i in steps_to_predict:
    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step
#end of the sequences
model.reset_states()
This approach was used in these answers and file:
- Predicting a multiple forward time step of a time series using LSTM
- how to use the Keras model to forecast for future dates or events?
- https://github.com/danmoller/TestRepo/blob/master/TestBookLSTM.ipynb
Achieving complex configurations
In all examples above, I showed the behavior of "one layer".
You can, of course, stack many layers on top of each other, not necessarily all following the same pattern, and create your own models.
One interesting example that has been appearing is the "autoencoder" that has a "many to one encoder" followed by a "one to many" decoder:
Encoder:
inputs = Input((steps,features))
#a few many to many layers:
outputs = LSTM(hidden1,return_sequences=True)(inputs)
outputs = LSTM(hidden2,return_sequences=True)(outputs)
#many to one layer:
outputs = LSTM(hidden3)(outputs)
encoder = Model(inputs,outputs)
Decoder:
Using the "repeat" method;
使用“重复”方法;
inputs = Input((hidden3,))
#repeat to make one to many:
outputs = RepeatVector(steps)(inputs)
#a few many to many layers:
outputs = LSTM(hidden4,return_sequences=True)(outputs)
#last layer
outputs = LSTM(features,return_sequences=True)(outputs)
decoder = Model(inputs,outputs)
Autoencoder:
inputs = Input((steps,features))
outputs = encoder(inputs)
outputs = decoder(outputs)
autoencoder = Model(inputs,outputs)
Train with fit(X,X)
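A hedged sketch of that training call, reusing steps=5 and features=2 from the tank example and random data purely as placeholders (it assumes hidden1..hidden4 were given values when the encoder/decoder above were built):

import numpy as np

steps, features = 5, 2
X = np.random.rand(100, steps, features)          # 100 dummy sequences

autoencoder.compile(optimizer='adam', loss='mean_squared_error')
autoencoder.fit(X, X, epochs=20, batch_size=16)   # the target is the input itself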
Additional explanations
If you want details about how steps are calculated in LSTMs, or details about the stateful=True cases above, you can read more in this answer: Doubts regarding `Understanding Keras LSTMs`
Answer by Sanjay Krishna
When you have return_sequences in your last RNN layer, you cannot use a simple Dense layer; use TimeDistributed instead.
Here is an example piece of code; it might help others.
words = keras.layers.Input(batch_shape=(None, self.maxSequenceLength), name = "input")
# Build a matrix of size vocabularySize x EmbeddingDimension
# where each row corresponds to a "word embedding" vector.
# This layer will convert replace each word-id with a word-vector of size Embedding Dimension.
embeddings = keras.layers.embeddings.Embedding(self.vocabularySize, self.EmbeddingDimension,
name = "embeddings")(words)
# Pass the word-vectors to the LSTM layer.
# We are setting the hidden-state size to 512.
# The output will be batchSize x maxSequenceLength x hiddenStateSize
hiddenStates = keras.layers.GRU(512, return_sequences = True,
input_shape=(self.maxSequenceLength,
self.EmbeddingDimension),
name = "rnn")(embeddings)
hiddenStates2 = keras.layers.GRU(128, return_sequences = True,
input_shape=(self.maxSequenceLength, self.EmbeddingDimension),
name = "rnn2")(hiddenStates)
denseOutput = TimeDistributed(keras.layers.Dense(self.vocabularySize),
name = "linear")(hiddenStates2)
predictions = TimeDistributed(keras.layers.Activation("softmax"),
name = "softmax")(denseOutput)
# Build the computational graph by specifying the input, and output of the network.
model = keras.models.Model(input = words, output = predictions)
# model.compile(loss='kullback_leibler_divergence', \
model.compile(loss='sparse_categorical_crossentropy', \
optimizer = keras.optimizers.Adam(lr=0.009, \
beta_1=0.9,\
beta_2=0.999, \
epsilon=None, \
decay=0.01, \
amsgrad=False))