Suresh Pasumarthi
9 min read · Jul 27, 2019


Dissecting the Role of return_state and return_sequences in LSTM-Based Sequence Models

Introduction:

RNN units are the basic building blocks of recurrent neural networks. An RNN unit has two inputs and two outputs. Of the two inputs, one is the input from the dataset at the current time step, and the other is the hidden state produced by the previous RNN unit. Of the two outputs, one is the output used to predict the target, and the other is the hidden state that is passed to the next RNN unit as its hidden state input.

I know the previous paragraph is a mind-bender if you are seeing an RNN for the first time. To make life easier, please look at Fig 1.

Fig 1: Simple RNN-based sequence model
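To make the two-inputs, two-outputs picture concrete, here is a minimal sketch of a single vanilla RNN step in plain NumPy. The weight names and shapes are illustrative, not from any particular library:

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    # Two inputs: the current data point x_t and the previous hidden state h_prev
    h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)   # new hidden state, passed to the next step
    y_t = h_t @ W_hy + b_y                            # output used to predict the target
    # Two outputs: the prediction y_t and the hidden state h_t
    return y_t, h_t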

Different applications of sequence models use these inputs and outputs differently. Two arguments in the Keras deep learning framework that greatly help in manipulating the outputs of RNN units are return_state and return_sequences. In this article, we will see how these two arguments are used in different applications.

Information provided in the Keras docs:

return_state and return_sequences are two very important parameters of the recurrent layers in Keras. There is a lot of confusion about how they work, largely due to the lack of a good explanation in the Keras documentation, which provides no examples to illustrate these parameters.

Since LSTM is the most widely used sequence model, we will use it for our explanation.

keras.layers.LSTM(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, implementation=1, return_sequences=False, return_state=False, go_backwards=False, stateful=False, unroll=False)

The code above is the signature of the LSTM layer in Keras; it exposes a large number of arguments.

return_sequences: Boolean. Whether to return the last output in the output sequence, or the full sequence.

return_state: Boolean. Whether to return the last state in addition to the output.

In this article, we focus mainly on return_sequences and return_state. Both are Boolean parameters and are set to False by default: if we don't specify them in the LSTM layer, they stay False, and to make them True we have to set them explicitly.

Understanding return_state and return_sequences:

In an LSTM layer, return_state can be True or False. It controls whether we take the state output of the last time step or ignore it. If it is set to False, which is the default, we do not take the last time step's state output. If it is True, we take the state output of the last time step (for an LSTM, both the hidden state and the cell state) and use it somewhere else where we need it.
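A quick way to see what return_state adds is to build a tiny model and inspect the returned tensors. The input shape (10 time steps, 8 features) and the 32 units below are arbitrary numbers chosen only for illustration:

import numpy as np
from keras.layers import Input, LSTM
from keras.models import Model

inputs = Input(shape=(10, 8))                      # 10 time steps, 8 features each
lstm_out, state_h, state_c = LSTM(32, return_state=True)(inputs)
model = Model(inputs, [lstm_out, state_h, state_c])

x = np.random.random((1, 10, 8))
out, h, c = model.predict(x)
print(out.shape, h.shape, c.shape)                 # (1, 32) (1, 32) (1, 32)
# With return_sequences=False, `out` is the same tensor as the last hidden state `h`.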

What is the last time step in an LSTM layer?

This question is critical to understand, and the answer is application specific. An LSTM is a recurrent unit that repeats again and again over the defined window. In a language model (Fig 2), we pass the output of every time step as input to the next time step; in this case, every time step is treated as a last time step. In classification (Fig 3), which uses a many-to-one architecture, the time step of the final output unit is the last time step.

Fig 2: Language model
Fig 3: Classification

In the case of seq2seq models (Fig 4), the encoder is seen as a block spanning the length of the input sentence; the LSTM step corresponding to the last word in the sentence is the last time step. The decoder, on the other hand, works like a language model, where the output of every time step is given as input to the next time step, so in the decoder every LSTM unit is treated as a last time step.

Fig 4: Encoder and decoder parts of a seq2seq model

Having understood return_state, we will now try to understand return_sequences. return_sequences is about the output at every time step, whereas return_state is about the state output (cell state and hidden state in an LSTM) of the last time step. For example, if we are classifying a sentence 10 words long, we have a recurrent unit of 10 time steps working as a many-to-one setup. Among the 10 time-step outputs we only take the last one, ignoring the others. In such cases, return_sequences is set to False and return_state is set to False.
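For the 10-word classification example above, the effect of return_sequences on the output shape can be checked directly. The 8-dimensional word vectors and 32 units are again arbitrary illustrative numbers:

import numpy as np
from keras.layers import Input, LSTM
from keras.models import Model

inputs = Input(shape=(10, 8))                         # a 10-word sentence, 8-dim word vectors
last_only = LSTM(32)(inputs)                          # return_sequences=False (the default)
all_steps = LSTM(32, return_sequences=True)(inputs)   # keep every time step

model = Model(inputs, [last_only, all_steps])
x = np.random.random((1, 10, 8))
y_last, y_all = model.predict(x)
print(y_last.shape, y_all.shape)                      # (1, 32) (1, 10, 32)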

In applications like seq2seq, language models and text summarization, where we need a many-to-many setup, we set return_sequences to True. In fact, that blanket statement is itself somewhat misleading, because different architectures for the same application use different combinations of return_sequences and return_state.

A bigger confusion stems from the fact that standard applications like text classification, language models and seq2seq have different implementation variants, each of which uses return_state to take the last time step differently. We need a good understanding of these parameters so as not to fall into this trap. Below we explain return_sequences and return_state using different variations of neural machine translation and language models; the scope is not to get into the details of the different architectures.

Neural Machine Translation:

Example-1:

This is a variant of neural machine translation code that does not use the encoder-decoder duo. The model looks like a multi-layer LSTM model: the input is the source-language sentence and the output is the target-language sentence.

The first LSTM layer is of the many-to-one type: the input is a sentence and the output is taken only from the last time step of the LSTM layer. For this, we leave both return_sequences and return_state at False, since we also do not need the state output of the last LSTM time step to be passed anywhere.

We use RepeatVector to replicate the output of the LSTM, producing as many copies as there are time steps in the expected target sentence.
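To see what RepeatVector does to the shapes, here is a small standalone check; the source length of 12, target length of 6 and 32 units are made-up numbers for illustration:

from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model

inputs = Input(shape=(12, 8))         # source sentence: 12 time steps, 8 features
encoded = LSTM(32)(inputs)            # (batch, 32): only the last time step survives
repeated = RepeatVector(6)(encoded)   # (batch, 6, 32): one copy per target word

model = Model(inputs, repeated)
model.summary()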

The second LSTM layer runs for as many time steps as the maximum length of the target sentence, which is also the number of copies produced by RepeatVector. This LSTM layer is of the many-to-many type. Here we need to capture all the time steps in order to predict the output sentence word by word, so we set return_sequences to True. We don't need the state taken out anywhere, so return_state stays False.

Fig 5: Repeat Vector
# define NMT model
from keras.models import Sequential
from keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense

def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))                          # many-to-one: only the last time step
    model.add(RepeatVector(tar_timesteps))            # replicate it once per target time step
    model.add(LSTM(n_units, return_sequences=True))   # many-to-many: keep every time step
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    return model
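A possible way to use this function might look like the following. The vocabulary sizes, sentence lengths and unit count are made-up numbers, and the compile settings are an assumption rather than part of the original example:

# Hypothetical sizes, purely for illustration
model = define_model(src_vocab=5000, tar_vocab=4000,
                     src_timesteps=12, tar_timesteps=6, n_units=256)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()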

*NOTE: There is a misconception that all neural machine translation models are variants of the seq2seq encoder-decoder, which is not true. We can envision any neural network model that translates one language to another. That said, the seq2seq model is so popular for neural machine translation that people often don't consider other options. We will see one such seq2seq model below.

Example-2:

In the code below, the encoder and decoder (seq2seq) are defined separately.

The encoder LSTM has return_state=True, while return_sequences is left at its default of False. In this code the encoder has to pass its last states (cell state and hidden state) to the decoder; the outputs at the individual time steps are not used.

The decoder LSTM has return_state=True and return_sequences=True. We make use of every decoder time step to predict the next word in the output sentence, which is why return_sequences is set to True. The returned states are not needed in the training model itself, but at inference time the decoder is run one step at a time and each step's states are fed back as the initial state of the next step, treating each time step or LSTM unit individually; this is what return_state=True makes possible. This is represented in Fig 4 above.

from keras.layers import Input, LSTM, Dense

# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
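To train this pair end to end, the two branches are tied into a single model, exactly as done in Example 3 below; the compile settings here are an illustrative assumption:

from keras.models import Model

# Turn `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')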

Example-3:

Here we explain yet another variation of neural machine translation, this time with an attention mechanism.

In seq2seq neural machine translation without attention, as explained in Example 2, return_sequences in the encoder is left at False, because we only use the last state of the encoder LSTM to represent the entire input sentence as a vector, which is passed to the decoder as its initial hidden state. In the decoder, the LSTM layer behaves like a language model, so we need to set return_sequences to True.

In this case, where we use attention, we want to take all the outputs of the encoder time steps, so that each output word to be predicted can be mapped to the relevant input words. For this to happen we need to set return_sequences to True in the encoder as well.

The following code snippet is for a seq2seq model with an encoder and a decoder with attention. Here we can see that the encoder LSTM has both return_state and return_sequences set to True. This makes all the encoder time-step outputs available to the attention mechanism. The decoder part of the code is very similar to what we have seen before.

Fig 6: NMT with Attention Model
from keras.layers import Input, LSTM, Dense
from keras.models import Model

encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# `encoder_outputs` now holds every encoder time step and is what the
# attention mechanism consumes; the states initialise the decoder.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
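The snippet above stops short of wiring in the attention computation itself (the final Dense is still applied directly to the decoder LSTM outputs). As a rough sketch only, and not the author's code, a Luong-style dot-product attention could consume `encoder_outputs` roughly as follows, where `decoder_seq` stands for the raw decoder LSTM outputs before the Dense projection:

from keras.layers import Dot, Activation, Concatenate

# decoder_seq: (batch, tar_steps, latent_dim), encoder_outputs: (batch, src_steps, latent_dim)
scores = Dot(axes=[2, 2])([decoder_seq, encoder_outputs])     # (batch, tar_steps, src_steps)
attn_weights = Activation('softmax')(scores)                  # attention weights over encoder steps
context = Dot(axes=[2, 1])([attn_weights, encoder_outputs])   # (batch, tar_steps, latent_dim)
combined = Concatenate(axis=-1)([context, decoder_seq])
# The softmax Dense over the target vocabulary would then be applied to `combined`.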

Language Model:

Example-1:

In this language model example, we take a window of contiguous words in a sentence and predict the next word: the window of words is the input and the next word is the output. We use a many-to-one type of LSTM model. By leaving return_sequences at False, we only take the output of the last time step. The window moves one word at a time, always using the same number of words to predict the next one. After being trained on a corpus of text, such a model can generate a sentence.

Since we don't need the hidden state taken out anywhere, return_state is also left at False. This is represented in Fig 3.

# define model
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))                                   # many-to-one: last time step only
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
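A rough idea of how such a model might be trained and used is sketched below. The variables X, y and encoded_window are assumed to have been prepared elsewhere, and the compile/fit settings are illustrative rather than part of the original example:

model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X, y, epochs=100, verbose=2)        # X: (samples, max_length-1) word indices
                                              # y: (samples, vocab_size) one-hot next words

probs = model.predict(encoded_window)         # shape (1, vocab_size)
next_word_id = probs.argmax(axis=-1)          # index of the predicted next word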

Example-2: (Deep LSTM)

The code below is a classic example of a language model where a multi-layer LSTM is used. We have an LSTM layer stored in the variable "decoder_lstm" with return_sequences set to True. One thing to understand is that all hidden LSTM layers must have return_sequences set to True, because they have to pass the output of each time step on to the next LSTM layer.

The LSTM layer inside the for loop iterates to create extra LSTM layers. The last LSTM layer also has return_sequences set to True, so that the TimeDistributed output layer can make a prediction at every time step. This is represented in Fig 7.

Fig 7: Deep LSTM
## Define model
from keras.layers import Input, Embedding, LSTM, TimeDistributed, Dense

input = Input(shape=(None, ))
# assumed: map word indices to embeddings before the LSTM stack
embedded = Embedding(numwords, options.embedding_size)(input)

decoder_lstm = LSTM(options.lstm_capacity, return_sequences=True)
h = decoder_lstm(embedded)

# extra hidden LSTM layers; each keeps return_sequences=True
if options.extra is not None:
    for _ in range(options.extra):
        h = LSTM(options.lstm_capacity, return_sequences=True)(h)

fromhidden = Dense(numwords, activation='linear')   # defined before it is used
out = TimeDistributed(fromhidden)(h)
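To complete the picture, the layer graph above could be closed into a Keras Model as sketched below; this part is not in the original snippet, and the training configuration is deliberately left out:

from keras.models import Model

model = Model(input, out)   # note: the output layer above is linear, so a softmax
model.summary()             # (or a from-logits loss) is still needed for training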

Conclusion

As the examples of language models and neural machine translation above show, how return_state and return_sequences are used depends on the choice of architecture rather than on the application alone.
