Word Embeddings in Keras
This blog explains the importance of word embeddings and how they are used in Keras.
Word embeddings are a way of representing words as input to a deep learning model, and they are widely considered the most effective representation of words in NLP. In this method, each word is represented as a word vector of a predefined dimension. The higher the dimension, the richer the vector's ability to capture the syntactic and semantic meaning of the word.
A word in any language has a meaning, but that meaning depends on the context in which the word appears. Each word can occur in many different contexts, sometimes dozens, so it needs a rich representation.
Traditionally, integer values and one-hot vectors have been used to represent words. Not just words in NLP, but any categorical variable in structured data can be seen in this light. One-hot encoded representations have their own drawbacks:
- The vector length of each word's representation equals the total number of unique words in the vocabulary. In NLP applications, this makes the vectors extremely long.
- Different values of a variable cannot be represented with any meaningful relationship between them using one-hot vectors; the variable's features, and hence the relationships between its values, are lost. For example, for day of the week, the weekdays are related to one another and so are the weekend days, but a one-hot representation cannot express that distinction, as sketched below.
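To make the contrast concrete, here is a minimal sketch with a made-up five-word vocabulary and made-up embedding values:

import numpy as np

vocab = ["monday", "tuesday", "saturday", "sunday", "movies"]   # toy vocabulary, for illustration only

# One-hot: vector length equals the vocabulary size, a single 1 marks the word
one_hot_saturday = np.zeros(len(vocab))
one_hot_saturday[vocab.index("saturday")] = 1.0                 # [0., 0., 1., 0., 0.]

# Dense embedding: a short vector of learned values (numbers invented here);
# after training, "saturday" and "sunday" can end up close to each other
embedding_saturday = np.array([0.71, -0.20, 0.05])
embedding_sunday = np.array([0.69, -0.18, 0.09])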
To overcome these issues, there are two standard families of pre-trained word embeddings: word2vec and GloVe, short for Global Vectors. These pre-trained models overcome the drawbacks above by letting us choose a much smaller vector dimension (compared to a one-hot vector) to represent each word while keeping context in mind.
Both techniques represent the words of a language as embedding vectors by training those vectors on a huge corpus of that language. Once training is complete, each word vector is enriched with the semantic and syntactic meaning of the word, incorporating contextual meaning from all the contexts seen during training.
word2vec and GloVe might be said to be to NLP what VGGNet is to vision, i.e. a common weight initialization that provides generally helpful features without the need for lengthy training. Word embeddings are useful for a wide variety of applications beyond NLP, such as information retrieval, recommendation, and link prediction in knowledge bases, which all have their own task-specific approaches. Word embeddings are typically learned based only on a window of surrounding context words.
Training these word vectors, using word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), is an interesting process that deserves a dedicated blog. In this blog, we cover a much simpler topic: using these pre-trained embeddings in NLP deep learning models. We will use Keras to show how the Embedding layer can be initialized with random/default word embeddings, and how it can be initialized with pre-trained word2vec or GloVe embeddings.
Pre-processing with Keras tokenizer:
We will use the Keras tokenizer to do the pre-processing needed to clean up the data.
First, create a Keras Tokenizer object. Then call its "fit_on_texts" method, passing the dataset as a list of text samples. This fits the tokenizer to the dataset, after which the other methods of the tokenizer object can be used to apply meaningful operations to the data.
Following are some of the useful methods of the Keras tokenizer object:
tokenizer.texts_to_sequences():
This method tokenizes the input text by splitting the corpus into word tokens and assigning each unique word a dedicated integer value. For example, the sentence "I don't like movies because movies are not real" becomes "5, 6, 20, 9, 12, 9, 22, 3, 23" (the integers here are made up for illustration). The word "movies" gets the integer "9" in both positions. The text is thus converted into a stream of integers that replace the word tokens.
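As a minimal, self-contained sketch (the sample sentences and a separate demo tokenizer are used purely for illustration; the actual integers depend on the fitted vocabulary):

from keras.preprocessing.text import Tokenizer

samples = ["I do not like movies because movies are not real",
           "movies are fun"]                        # made-up data samples

demo_tokenizer = Tokenizer()
demo_tokenizer.fit_on_texts(samples)                # learn the word -> integer mapping
sequences = demo_tokenizer.texts_to_sequences(samples)
print(sequences)                                    # each word replaced by its dedicated integer
print(demo_tokenizer.word_index["movies"])          # "movies" maps to the same integer everywhere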
pad_sequences(list_tokenized_train, maxlen=maxlen):
pad_sequences takes two main arguments. The first is the tokenized text in integer form, produced from the dataset with the "texts_to_sequences()" method. The second, "maxlen", is the maximum allowed length of a sentence in the corpus. We can choose "maxlen" by analysing the sentence lengths in the dataset; ideally, take the length of the longest sentence after removing extremely long outliers.
With these two arguments, pad_sequences makes every sentence in the input dataset the same length, padding sentences shorter than "maxlen" with empty values and truncating sentences longer than "maxlen".
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_features = 20000                                       # keep only the 20,000 most frequent words
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))         # build the word -> integer mapping
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
vocab_size = len(tokenizer.word_index) + 1                 # +1 because index 0 is reserved for padding
maxlen = 40                                                # pad/truncate every sentence to 40 tokens
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
Tokenizer.word_index:
This attribute of the Tokenizer holds all the unique words in the dataset as a dictionary, with the words as keys and their dedicated integer indices as values.
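For instance, with the tokenizer fitted in the code above (the word "movies" is just an illustration; use any word from your dataset):

word_index = tokenizer.word_index
print(len(word_index))          # number of unique words seen by fit_on_texts
print(word_index["movies"])     # the dedicated integer assigned to "movies"
# word_index is a plain dict, e.g. {"movies": 1, "are": 2, "not": 3, ...}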
How to load GloVe word vectors:
Download the "glove.6B.zip" file and unzip it. It contains four text files with word vectors trained using GloVe; each file holds a dictionary of 400,000 unique word vectors. The four files provide vectors of different dimensions: 50d, 100d, 200d, and 300d. The bigger the dimension, the richer the word vector's ability to capture the different contexts of a word in the training corpus. These GloVe embeddings were trained on a corpus of 6 billion tokens. For NLP applications, it is usually better to go with the highest vector dimension if you have sufficient hardware to train on. In our example, we use GloVe with 300d and 100d word vectors; in our experiments, accuracy with 300d was noticeably better than with 100d.
Each line in the "glove.6B.300d.txt" file contains a word followed by the values of its embedding vector. If we read the file line by line and split each line into a list, we get 301 values per word: one word string and 300 floating point values. These are collected into a dictionary with the word as key and the 300-value embedding vector as value. After iterating over all the lines, we end up with a dictionary of all 400,000 words as keys and their vectors as values.
# load the whole embedding into memory
import numpy as np

embeddings_index = dict()
f = open('glove.6B.100d.txt')
for line in f:
    values = line.split()
    word = values[0]                                   # the first entry on each line is the word itself
    coefs = np.asarray(values[1:], dtype='float32')    # the rest are the embedding values
    embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))
The Keras "tokenizer.word_index" holds a dictionary of the unique tokens/words from the input data; its keys are the words and its values are their dedicated integers.
Using the keys of this word_index dictionary, we look up the corresponding word vector in the dictionary built from the GloVe embeddings.
The retrieved embeddings are stored in a matrix variable "embedding_matrix", indexed by each word's dedicated integer from word_index.
This matrix is then used to initialize the weights of the Embedding layer of the model.
# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size, 100))         # 100 columns to match the 100d GloVe vectors
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:                   # words without a GloVe vector keep an all-zero row
        embedding_matrix[i] = embedding_vector
How to load word2vec word vectors:
Word2vec is an alternative method of training word vectors. We will load word vectors trained on part of the Google News dataset (about 100 billion words). This model contains 3 million unique word vectors, a far larger vocabulary than GloVe's 400,000, and each vector is 300 dimensions long.
We will use the gensim library to extract word vectors from the Google News vectors file.
Two helper functions are defined below: "getVector" returns the word vector for the word passed as an argument, and "isInModel" checks whether the passed word is available in the Google News dictionary.
from gensim.models.keyedvectors import KeyedVectors
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
print("loading word2vec model…")
word2vec_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def getVector(word):
    if word in word2vec_model:
        return word2vec_model[word]          # 300-dimensional vector for the word
    return None                              # word not in the Google News vocabulary

def isInModel(word):
    return word in word2vec_model
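As a quick usage check, assuming the model loaded above (the word "movies" is chosen for illustration):

print(isInModel("movies"))      # True if "movies" is in the Google News vocabulary
vector = getVector("movies")
if vector is not None:
    print(vector.shape)         # (300,)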
As with GloVe, we use the keys of the Keras "tokenizer.word_index" dictionary to look up word vectors, this time from the word2vec Google News embeddings, store them in "embedding_matrix" indexed by each word's dedicated integer, and use that matrix to initialize the weights of the Embedding layer.
This part of the code is similar to GloVe or any other model from which we load pre-trained vectors.
# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size, 300))         # 300 columns to match the word2vec vectors
for word, i in tokenizer.word_index.items():
    embedding_vector = getVector(word)
    if embedding_vector is not None:                   # words missing from word2vec keep an all-zero row
        embedding_matrix[i] = embedding_vector
How to load FastText word vectors:
FastText is another way to train word embeddings; pre-trained FastText vectors are made available by Facebook.
FastText embeddings are trained with a word2vec-style objective, but FastText has an advantage over regular word2vec: it uses character n-grams for each word in the dataset. This means each word is also seen as a set of sub-words. For example, with n=3 the word "apple" has the three sub-words "app", "ppl" and "ple".
In FastText, each sub-word (each character n-gram) has its own vector, which makes the representation richer than plain word2vec. The biggest advantage of this change is that we can represent unknown words by breaking them into their sub-word/n-gram representation. For example, "Gastroenteritis" does not exist in the word2vec vocabulary, but its n-gram representation lets FastText build a vector for it.
This does not mean every unknown word can be covered by its sub-words, but the chance of finding its sub-words in FastText is much higher than the chance of finding the unknown word itself in word2vec.
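To make the sub-word idea concrete, here is a minimal sketch in plain Python; it mirrors the simplified example above and ignores the word-boundary markers that FastText actually adds:

def char_ngrams(word, n=3):
    """Return the character n-grams (sub-words) of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("apple"))            # ['app', 'ppl', 'ple']
print(char_ngrams("gastroenteritis"))  # sub-words FastText can look up even if the full word is unknown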
We can load the FastText word vectors using the gensim wrapper created for FastText. The snippet below loads vectors trained on a Wikipedia text corpus.
from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('wiki.simple')
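Once loaded, a word missing from the vocabulary can still get a vector built from its sub-words. A quick check, assuming the model loaded above (indexing syntax may vary slightly across gensim versions):

vector = model["gastroenteritis"]   # FastText composes a vector from the word's n-grams
print(vector.shape)                 # the embedding dimension of the loaded model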
Embedding layer:
The Embedding layer has two mandatory arguments, "vocab_size" and "embed_size" (called input_dim and output_dim in Keras). vocab_size is the number of unique words in the input dataset, and embed_size is the size of the embedding word vectors; in our example embed_size is 300. If we go with the default, randomly initialized weights for the Embedding layer (word vectors), specifying these two arguments is sufficient. To use pre-trained word vectors instead, we pass embedding_matrix to the weights argument. The trainable argument can be set to True to fine-tune the embeddings during training, or to False to keep the pre-trained vectors frozen.
Embedding layer for default/random weight initialization:
x = Embedding(vocab_size, embed_size)(inp)
Embedding layer for pre-trained weight initialization (GloVe, word2vec, or FastText):
x = Embedding(vocab_size, embed_size, weights=[embedding_matrix], trainable=True)(inp)
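For context, here is a minimal sketch of how these lines fit into a full model, reusing the maxlen, vocab_size, and embedding_matrix built earlier; the layers after the Embedding are illustrative choices, not the only option:

from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model

embed_size = 300                             # must match the width of embedding_matrix (300d vectors here)

inp = Input(shape=(maxlen,))                 # padded integer sequences from pad_sequences
x = Embedding(vocab_size, embed_size,
              weights=[embedding_matrix],
              trainable=True)(inp)           # pre-trained vectors as the initial weights
x = LSTM(64)(x)                              # illustrative recurrent layer
out = Dense(1, activation="sigmoid")(x)      # e.g. a binary classification head

model = Model(inputs=inp, outputs=out)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])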
Conclusion:
We have seen how different pre-trained word vectors can be loaded and used to represent the words in an input text corpus. Compared to random initialization of the word vectors, pre-trained vectors give the model a good starting point to learn from, reducing the burden on the model of learning basic language syntax and semantics.
In this article, we gave a brief theoretical introduction to three types of pre-trained embeddings, GloVe, word2vec, and FastText, and saw how to use them in our NLP deep learning models.