Language Model using Char RNN

Suresh Pasumarthi
7 min read · Mar 1, 2019

Language modelling is an important application in NLP, typically built with sequence models such as RNNs and LSTMs. Language models are used for text generation by learning from a large corpus of text data.

A famous example is Andrej Karpathy's char-RNN, where he trained a character-level language model on the Linux source code. The model learned to generate code on its own, and even though the generated code made little sense, it picked up many syntactic patterns and reproduced them convincingly.

In this article, we will explain how a language model is trained using an LSTM, along with the preprocessing steps required to clean the data and make it ready for training.

The dataset we use is the book “Alice in Wonderland”; the entire text of the book is taken as the corpus to train our model. Just as Karpathy's char-RNN learned to generate text in the style of the source code it was trained on, our model will, after training, generate text in the style it has seen in the book corpus. Like Karpathy's char-RNN, our model also works at the character level.

Theoretical understanding of LM:

Text generation with a language model uses a sequence model such as an LSTM layer. A sentence is a sequential, time-series-like structure whose basic unit, a character or a word, recurs over time. What we generate should have syntactic and semantic meaning, and a style similar to the corpus the model is trained on.

This article works at the character level: we divide the corpus into characters and pass one group of characters to the model at a time.

Each training pattern of the network comprises 100 time steps of one character each (X), followed by a single output character (y). When creating these sequences, we slide a window along the whole book one character at a time, so that every character (except, of course, the first 100) gets a chance to be learned from the 100 characters that precede it. The first group of 100 characters is fed to the LSTM to predict the next character (the 101st). In the next step, the 100-character input window moves one position, and we pass characters 2 through 101 to predict the 102nd character, and so on.

Training:

During training, we have the entire text corpus available. From it, we pick the first 100 characters as input and the 101st character as the output; this is our first data sample. Next, we move the window by one character and pass the next 100 characters, from the 2nd to the 101st. Note that 99 characters of the first and second windows overlap; in general, any window overlaps its immediately previous and subsequent windows by 99 characters. The target character of one sample is therefore included in the 100-character window of the next sample, as illustrated in the sketch below. In this way we cover the entire text of “Alice in Wonderland”, just as Karpathy's model covered its entire source-code corpus.
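To make the windowing concrete, here is a toy sketch with a window of 5 characters instead of 100 (the real preprocessing over the full book is in Part-2 below):

# Toy illustration of the sliding-window idea, using a window of 5 characters.
# The real preprocessing in Part-2 uses a window of 100 over the whole book.
text = "alice was beginning"
window = 5
for i in range(len(text) - window):
    seq_in = text[i:i + window]   # 5 input characters
    seq_out = text[i + window]    # the single character to predict
    print(repr(seq_in), "->", repr(seq_out))
# 'alice' -> ' '
# 'lice ' -> 'w'
# 'ice w' -> 'a'
# ... each window overlaps the previous one by 4 characters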

Inference:

Once training is done, our LSTM model has learned the language, style and many of the subtleties of the book, and it is ready to generate text in the style of what it has learned.

During inference, we follow steps very similar to training, with a minor difference at the start: the initial 100-character window of text is not available. To get started, we take a “seed” of 100 characters from the data and use it to generate the 101st character. We then move the window as we did during training, dropping the first character and including the newly generated character, and use this new window to generate the 102nd character. In this manner, we can continue to generate as many characters, words, lines, paragraphs, pages or books of text as we want.

One major difference between training and inference is this: during training, the target character of the current step is included in the 100-character window of the next step, but during inference we do not have the target character, so we include the character predicted at the current step in the window of the next step. The toy sketch below contrasts the two updates.
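A minimal toy sketch of the two window updates (the integers stand in for character codes and are made up for illustration; the real inference loop is in Part-5):

# Toy contrast between the training and inference window updates.
# seq_length is 5 here for readability; the article uses 100.
seq_length = 5
corpus = [3, 7, 2, 9, 4, 1, 8, 5]                # stand-in for integer-encoded text
window = corpus[:seq_length]                      # current window: [3, 7, 2, 9, 4]
# Training: the next window is simply the next slice of the corpus,
# so the true target character is automatically part of it.
next_train_window = corpus[1:1 + seq_length]      # [7, 2, 9, 4, 1]
# Inference: the true next character is unknown, so the model's own
# prediction is appended and the oldest character is dropped.
predicted_index = 6                               # whatever the model outputs
next_infer_window = window[1:] + [predicted_index]    # [7, 2, 9, 4, 6]
print(next_train_window, next_infer_window)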

Part-1: Pre-processing

Load the entire text corpus into a string variable, raw_text, and convert it to lower case. Converting raw_text into a set() removes all duplicate characters, leaving only the unique characters of the corpus. This set is then converted to a list and sorted for convenience.

As machines understand only numbers, we convert characters to numbers using the char_to_int mapping. This dictionary uses the unique characters as keys and assigns each one an integer value, obtained from its index while enumerating the sorted list. char_to_int is needed at the input stage, where we convert all characters into integers before feeding them to the model. We also build the reverse mapping, int_to_char, which converts an integer back to its character and is used during inference to turn predictions back into text.

import numpy
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.utils import np_utils

# load ASCII text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename).read()
raw_text = raw_text.lower()
# create a mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))
# corpus and vocabulary sizes, used in the later parts
n_chars = len(raw_text)
n_vocab = len(chars)

Part-2: Data Preparation

In this part, the data is prepared as pairs of an input sample and its target character. Each iteration of the for-loop creates a sample with 100 input characters (seq_in) and a single output character (seq_out), which are appended to the lists dataX and dataY respectively. At the end of the loop, dataX holds the 100-character input samples and dataY the single-character targets. We then reshape the input into the [samples, time steps, features] shape the LSTM expects and normalise it by the vocabulary size. The targets are converted to one-hot encoded categorical vectors: at each step the model predicts one of the unique characters, so the length of the one-hot vector equals the number of unique characters in the corpus.

# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

Part-3: Model Definition

The model is a multi-layer Recurrent Neural network, with the following layers:

LSTM → Dropout → LSTM → Dropout → Dense → SoftMax

The softmax output predicts the next character from the unique characters of the dataset by giving a probability score for each possible character; the character with the maximum probability is selected. There are two LSTM layers: we made the network deep so it can learn the complex data well. The first LSTM layer has return_sequences set to True because it passes the output of every time step to the next LSTM layer. The second LSTM layer has return_sequences set to False (the default), as we only take the last time step to predict the next character.

model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
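The original post does not show the training call itself, so here is a minimal sketch of how the model could be fitted, with checkpointing so the best weights can be reloaded for inference. The epoch count, batch size and checkpoint filename are illustrative assumptions, not values from the article.

from keras.callbacks import ModelCheckpoint

# Save the weights whenever the training loss improves.
# Epochs, batch size and the filename pattern are illustrative choices.
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1,
                             save_best_only=True, mode='min')
model.fit(X, y, epochs=20, batch_size=128, callbacks=[checkpoint])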

Part-4: Inference

The code below picks a random 100-character seed for inference. It is saved in the variable “pattern”, which is used in the Part-5 block below, where the prediction loop is implemented.

# pick a random seed
start = numpy.random.randint(0, len(dataX)-1)
pattern = dataX[start]
print(start, len(pattern))
print("Seed:")
print(''.join([int_to_char[value] for value in pattern]))

Part-5: Prediction

In the for-loop, we specify how many characters we want to generate and use the model.predict() function to predict the next character. The variable “x” holds the current window (initially the seed), reshaped and normalised in the same way as the training data. model.predict(x) returns a probability score for each character: the higher the probability, the more likely that character is the next one. We apply argmax to get the index of the maximum probability and use int_to_char to map that index back to the predicted character.

This process is repeated for as many characters as we want to generate. In the example code below we predict 500 characters, but we could go on to generate as many as we want.

# generate 500 characters from the seed pattern
for i in range(500):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    print(result, end="")
    # break the output into lines of roughly 100 characters
    if i % 100 == 0 and i != 0:
        print("\n")
    # slide the window: append the prediction and drop the oldest character
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\nDone.")

Applications of Language Models:

Language models have numerous applications.

  1. Predicting the next character in a word or the next word in a sentence while typing.
  2. Sentence completion and word completion.
  3. Generating stories in the style of classic books: for example, learning the style of Shakespeare or Agatha Christie and generating a new story.
  4. Pretraining a recurrent model. The pretrained language model can then be used for other applications such as text classification or neural machine translation. This idea of transfer learning in NLP is fairly new and gaining a lot of ground; refer to the fast.ai notebook on text classification to see an implementation.
