Neural Machine Translation using Keras

Suresh Pasumarthi
Mar 1, 2019 · 11 min read

Introduction:

Machine Translation is an application of NLP in which text in one language is translated into another language, for example Spanish to English. Machine translation using neural networks, especially recurrent models, is called Neural Machine Translation, or NMT for short.

The most widely used deep learning model for NMT is the seq2seq model, which has an Encoder and a Decoder. At a high level, the Encoder takes the input sentence and the Decoder outputs the translated target sentence. Both the Encoder and the Decoder are built using recurrent layers such as LSTM, GRU, or plain RNN.

There are two options for feeding the input sentence into the encoder and the target sentence into the decoder during training: word level and character level. Feeding data at the word level is the most widely used method, but character-level feeding has its own merits. One advantage of character-level encoding is that we need not worry about words missing from the vocabulary. At the word level, standard pre-trained embeddings such as word2vec and GloVe are available, which makes word-level encoding more favorable to use.

In this article, we will go through character-level encoding to run the NMT model. Once this model is understood, it will not be difficult to implement a word-level model.

Let's dive deep into the pre-processing, model building, training, and inference steps needed to make this model work and produce translations with good accuracy.

In this example, we will be converting Spanish sentences into English. The sample data is taken from http://www.manythings.org/anki/

Structure of dataset:

This dataset contains pairs of Spanish and English sentences. Each line has an input sentence and a target sentence separated by a tab ("\t"). We will use this data for training, validation, and testing. A first look at the data suggests the approach: extract each line iteratively, split it on the tab, and save the resulting sentences in the list corresponding to their language (input or target).
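For illustration, here is a hypothetical line in the format of the manythings.org file and how it splits (the exact sentences will differ in the real file):

line = "Go on.\tContinúa."          # hypothetical example line
input_text, target_text = line.split("\t")
print(input_text)    # Go on.
print(target_text)   # Continúa.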

Let’s see how it is actually done.

We will divide the entire code into more manageable snippets. This makes the code easier to explain and easier for the reader to follow.

Part 1: Pre-processing

Pre-processing involves taking the dataset and converting it into a format that the seq2seq model accepts.

Pre-processing steps:

  1. Extract the sentences from the dataset and save them in a list.
  2. Create another list of the unique characters in the dataset.
  3. Create two dictionaries, one for English and one for Spanish, to readily convert back and forth between each character and its dedicated index value.
  4. Create one-hot representations of each character using those index values (integers); see the sketch after this list.
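As a minimal sketch of steps 3 and 4, using a hypothetical four-character alphabet:

import numpy as np

# Hypothetical tiny alphabet, to illustrate the index dictionary
# and the one-hot representation of a single character.
characters = ['a', 'b', 'c', 'd']
char_to_index = {c: i for i, c in enumerate(characters)}

one_hot = np.zeros(len(characters), dtype='float32')
one_hot[char_to_index['c']] = 1.0
print(one_hot)  # [0. 0. 1. 0.]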

In the following code snippet, we create two lists to save the Spanish and English sentences. These lists, input_texts and target_texts, are initialized as empty lists. The variable lines is a Python list in which each element holds an input sentence and a target sentence separated by a tab.

The separated input and target sentences are stored in the input_texts and target_texts lists, respectively.

Now at this stage, we have our input and target sentences saved into corresponding lists.

During training and inference in neural machine translation, the decoder target sentence should be padded with start and end tokens. The characters chosen for the start and end tokens are the tab ('\t') and newline ('\n') characters. The start token tells the decoder to start translating, and on seeing the end token the decoder stops, meaning we have our translated sentence. To achieve this, every sentence in target_texts gets '\t' prepended and '\n' appended.

Any dedicated character can be assigned as the start or end token; we chose tab ('\t') and newline ('\n') because these characters do not appear among the unique characters of our source or target language.

The unique characters of the input and target datasets are stored in separate lists so we can count how many unique characters each language has. Since training and inference happen at the character level, each character is represented as a one-hot vector whose length equals the number of unique characters in the language. To collect the unique characters, input_characters and target_characters are initialized as set() data structures. All unique characters are added to these sets, which are then converted into sorted lists so that characters can be accessed by index. The lengths of these two lists give the dimension of each character's one-hot representation.

Tasks achieved in the following code snippet:

  1. Counting the unique characters in the input data and the target data.
  2. Building separate lists of input sentences and target sentences.
  3. Finding the maximum lengths of the input and target sentences.
  4. The maximum decoder length found here is used later during inference: the decoder generates output characters until it produces the end token ('\n') or reaches this maximum length.

# Configuration (values follow the Keras lstm_seq2seq example;
# adjust data_path to wherever you saved the file).
num_samples = 10000       # number of sentence pairs to train on
data_path = 'spa.txt'     # tab-separated file from manythings.org

# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text = line.split('\t')
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    # Collect the unique characters of each language.
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

In the following part of the code, these steps are achieved:

  1. Two dictionaries are created, one for the input language and one for the target language, by enumerating input_characters and target_characters (the lists of unique characters of the input and target data). The resulting dictionaries have characters as keys and index values as values.
  2. These dictionaries are used to readily convert characters to integers at the input stage and integers back to characters at the output stage.
input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_characters)])

The three arrays encoder_input_data, decoder_input_data, and decoder_target_data are initialized with zeros, with shapes compatible with the encoder input, decoder input, and decoder output respectively. Each array has three dimensions: the first is the number of sentence pairs, the second is the maximum sentence length (max_encoder_seq_length for the encoder array, max_decoder_seq_length for the decoder arrays), and the third is the length of the one-hot vector, which equals the number of unique characters. For the decoder input and output this is the number of unique tokens of the target dataset; for the encoder input it is the number of unique tokens of the input dataset.

import numpy as np

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

Next, we create the one-hot vectors for the input data and target data by iterating over each sentence and using the respective token-index dictionaries.

The decoder target data is also created here. The only difference between the decoder input data and the decoder target data is that the target data is shifted by one time step: the decoder's prediction at the current time step equals the decoder's input at the next time step. This works like a language model, where the current character, passed through the RNN and a softmax, predicts the next character, as the small sketch below shows.
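A minimal sketch of this one-step shift, using a hypothetical padded target sentence; each decoder input character is paired with the next character as its prediction target:

target_text = '\tHola\n'   # hypothetical padded target sentence
for t in range(len(target_text) - 1):
    print(repr(target_text[t]), '->', repr(target_text[t + 1]))
# '\t' -> 'H', 'H' -> 'o', 'o' -> 'l', 'l' -> 'a', 'a' -> '\n'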

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.

Part 2: Model

Model building is a very simple process in Keras. It involves the following two parts:

Encoder:

The encoder is constructed with an Input layer and an LSTM layer. For the encoder LSTM, return_state is set to True and return_sequences is left False: we do not need the output at every time step, but we do need the last states to pass as the initial hidden state of the decoder. For detailed information about return_state and return_sequences, refer to this blog.
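A toy sketch of what the two flags return (the dimensions here are hypothetical):

from keras.layers import Input, LSTM

x = Input(shape=(None, 10))                      # (batch, timesteps, features)
last_out, h, c = LSTM(32, return_state=True)(x)  # last output and final h, c states: each (batch, 32)
all_out = LSTM(32, return_sequences=True)(x)     # outputs at every step: (batch, timesteps, 32)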

Decoder:

The decoder has an Input layer, an LSTM layer, and a Dense layer followed by a softmax. The input is the character-wise target sentence, and at each output time step we have to predict the next character, hence the softmax. In the decoder, both return_state and return_sequences are set to True: return_sequences=True gives us the decoder output at every time step so we can predict the next character, and return_state=True lets us take the hidden state and feed it to the next step explicitly during inference. The hidden state outputs are not needed during training; they are needed only during inference.

from keras.models import Model
from keras.layers import Input, LSTM, Dense

latent_dim = 256  # dimensionality of the LSTM state (value from the Keras example)

# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

Part 3: Training

We pass the input and output variables to the Model function to define the model object, then compile the model, setting the optimizer for gradient descent. categorical_crossentropy is the loss function used: since the decoder has to predict the next character out of all the available target characters, this is a multiclass classification problem.

*Note: categorical_crossentropy is used as the loss function for multiclass classification, whereas binary_crossentropy is used for binary classification. In the former case, softmax is used as the activation function in the last layer; in the latter case, sigmoid is used.

At each time step, the decoder computes this loss (categorical_crossentropy) between the predicted value (a probability distribution over all unique target characters) and the target value (the one-hot representation of the true character). The loss calculated at every time step is backpropagated through both the encoder and the decoder, and the gradients computed in the backward pass update the encoder and decoder weights in a direction that reduces the loss. Keras handles this backward pass behind the scenes, but it is good to know what is going on.

The training part of the model is easy, but the inference is a bit difficult to understand.

batch_size = 64   # training batch size (value from the Keras example)
epochs = 100      # number of training epochs (value from the Keras example)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size, epochs=epochs, validation_split=0.2)
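After training, the model can optionally be saved for later use (the filename here is an arbitrary choice):

model.save('s2s.h5')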

Part 4: Inference

The inference part of the code is a bit difficult to understand, not because it is big or involves deep logic, but because we expect inference to work along the same lines as training, which is not the case here.

During training, the decoder takes its input from the target data and predicts the next character of the target data. During inference, no target data is available as decoder input. Instead, at each time step the LSTM unit followed by the softmax predicts the next character, and that predicted character is fed as input to the next time step. Decoder sentence generation starts with the start token '\t' given as input to the first time step, which predicts the next character; the predicted character becomes the input of the next time step, and so on. To enable this, both return_state and return_sequences were set to True when building the decoder.

In the code snippet below, we set up the encoder and decoder inputs and outputs for inference.

The encoder_model is fed with the input sentence. The hidden state output of the encoder's last time step is passed as the initial hidden state of the decoder's first time step.

The decoder_model is the complicated part. Its input is a list formed by concatenating decoder_inputs and decoder_states_inputs, which are the character input and the hidden state input to each decoder time step. For the output, we concatenate the decoder outputs (the prediction of the next character) with the decoder states (the output hidden states at every time step).

Inference steps in brief:

  1. Encode the input and retrieve the initial decoder state.
  2. Run one step of the decoder with this initial state and a "start of sequence" token as target; the output will be the next target token.
  3. Repeat with the current target token and the current states.
# Define sampling models
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

The following code performs inference. It takes an input sentence that has to be translated. This input_seq passes through the encoder, which outputs the encoder states h and c, the hidden and cell states respectively. These states, together with target_seq, are fed to the decoder at each time step. The variable target_seq is the input to the decoder LSTM at each time step; since we have no target input data during inference, we start with the one-hot encoding of '\t', the start token chosen for this project. Inside the while loop, at each iteration the decoder predicts the next character. The predicted character is converted to its one-hot encoded value and assigned to target_seq, which becomes the input to the next time step. Similarly, states_value carries the state input to the decoder at each time step.

The loop runs until it encounters the stop token '\n' or it has run for a number of steps equal to the maximum target sentence length in the dataset, resulting in a translated sentence. The output at every time step predicts the next character: the softmax activation produces a probability score for every possible character, and we pick the index of the highest probability and look up the character for that index in the dictionary reverse_target_char_index.

We use the argmax function to pick the index of maximum probability.
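A minimal sketch with a hypothetical four-token softmax output:

import numpy as np

probs = np.array([0.1, 0.05, 0.7, 0.15])  # hypothetical softmax output
print(np.argmax(probs))                   # 2, the index of the largest value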

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
                len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence

for seq_index in range(5):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)

What we have seen in this article is a character-based input model. An even more popular model is one based on word embeddings, where we could use pre-trained word2vec or GloVe embeddings to train the model with a better prior.
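As a minimal sketch of that direction (the vocabulary size and embedding dimension below are hypothetical), the encoder would take integer word indices through an Embedding layer instead of one-hot characters; pre-trained vectors could be loaded into the Embedding layer's weights:

from keras.layers import Input, Embedding, LSTM

vocab_size, embedding_dim = 10000, 100      # hypothetical values
word_encoder_inputs = Input(shape=(None,))  # integer word indices
embedded = Embedding(vocab_size, embedding_dim)(word_encoder_inputs)
_, w_state_h, w_state_c = LSTM(256, return_state=True)(embedded)
word_encoder_states = [w_state_h, w_state_c]  # initial state for a word-level decoder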

Thanks to Balaji Chunduri for reviewing and giving valuable inputs.

References:

  1. https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/
  2. https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
  3. https://pdfs.semanticscholar.org/12dd/078034f72e4ebd9dfd9f80010d2ae7aaa337.pdf
  4. https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
  5. https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0
  6. Dataset: http://www.manythings.org/anki/
  7. Source code: https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py
