Word Embeddings in Keras

One-hot encoding of words has two main limitations, which motivate learned embeddings:

  1. The length of each one-hot vector equals the total number of unique words in the dictionary. In NLP applications, where vocabularies easily reach tens of thousands of words, this makes the vectors impractically long and sparse.
  2. One-hot vectors treat every value of a variable as equally unrelated: shared features between values cannot be represented, and neither can the relationships between them. For example, for day of the week, the weekdays are related to one another and so are the weekend days, but a one-hot representation cannot capture that distinction. A dense embedding can, as the short sketch after this list illustrates.
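As an illustration (not part of the original snippet), a 20,000-word vocabulary forces 20,000-dimensional one-hot vectors, while a learned embedding represents the same word with a much shorter dense vector:

import numpy as np

vocab_size = 20000   # one-hot length grows with the vocabulary
embed_size = 100     # embedding length is a small, fixed design choice

# One-hot: a single 1 in a 20,000-dimensional sparse vector
one_hot = np.zeros(vocab_size)
one_hot[42] = 1.0    # the word with index 42

# Embedding: a dense 100-dimensional vector looked up from a (trainable) table
embedding_table = np.random.normal(size=(vocab_size, embed_size))
dense_vector = embedding_table[42]

print(one_hot.shape, dense_vector.shape)   # (20000,) (100,)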
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_features = 20000
# Build the vocabulary from the training sentences, keeping the 20,000 most frequent words
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
# Convert each sentence into a sequence of integer word indices
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
vocab_size = len(tokenizer.word_index) + 1

# Pad (or truncate) every sequence to a fixed length of 40 tokens
maxlen = 40
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
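As a quick sanity check (an illustrative addition, with a made-up sentence and indices), you can inspect what the tokenizer and padding produce:

sample = ["word embeddings capture meaning"]    # hypothetical sentence
seq = tokenizer.texts_to_sequences(sample)      # e.g. [[1403, 2291, 876, 305]]
padded = pad_sequences(seq, maxlen=maxlen)      # zero-padded on the left to length 40
print(padded.shape)                             # (1, 40)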
import numpy as np

# load the whole GloVe embedding into memory
embeddings_index = dict()
f = open('glove.6B.100d.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))

# create a weight matrix for the words in the training docs
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
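It is often worth checking how much of the vocabulary the pretrained vectors actually cover, since missing words keep all-zero rows in the matrix. A small check along these lines (an illustrative addition):

# Count the vocabulary words that received a pretrained GloVe vector
covered = sum(1 for word in tokenizer.word_index if word in embeddings_index)
print('GloVe covers %d of %d vocabulary words' % (covered, len(tokenizer.word_index)))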
from gensim.models.keyedvectors import KeyedVectors
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
print("loading word2vec model...")
word2vec_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def getVector(word):
    # Return the 300-dimensional word2vec vector, or None for out-of-vocabulary words
    if word in word2vec_model:
        return word2vec_model[word]
    return None

def isInModel(word):
    return word in word2vec_model

# create a weight matrix for the words in the training docs
embedding_matrix = np.zeros((vocab_size, 300))
for word, i in tokenizer.word_index.items():
    embedding_vector = getVector(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
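Because the pretrained vectors encode relatedness between words, you can probe them directly, which makes the weekday example from above visible. This is an illustrative check (it assumes 'Monday' is in the GoogleNews vocabulary, and the exact neighbours depend on the model):

print(getVector('Monday').shape)                        # (300,)
print(word2vec_model.most_similar('Monday', topn=3))    # typically other weekdays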
from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('wiki.simple')    # pretrained vectors from simple English Wikipedia
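The FastText weight matrix can be built the same way as the GloVe and word2vec ones above. This is a sketch assuming the 300-dimensional wiki.simple model and the gensim 3.x wrapper API, where model.wv can also compose vectors for out-of-vocabulary words from character n-grams:

embedding_matrix = np.zeros((vocab_size, 300))
for word, i in tokenizer.word_index.items():
    try:
        embedding_matrix[i] = model.wv[word]
    except KeyError:
        pass    # leave the row as zeros if no vector can be composed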
# Option 1: learn the embedding weights from scratch (randomly initialised)
x = Embedding(vocab_size, embed_size)(inp)
# Option 2: initialise with the pretrained weight matrix and keep fine-tuning it during training
x = Embedding(vocab_size, embed_size, weights=[embedding_matrix], trainable=True)(inp)
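Putting the pieces together, here is a minimal sketch of how the pretrained matrix plugs into a Keras model; the Input/LSTM/Dense layers, the label array y_train, and the training settings are illustrative choices, not part of the original snippet:

from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model

embed_size = 100    # must match the width of embedding_matrix (100 for GloVe, 300 for word2vec/FastText)

inp = Input(shape=(maxlen,))
x = Embedding(vocab_size, embed_size, weights=[embedding_matrix], trainable=True)(inp)    # set trainable=False to freeze the pretrained vectors
x = LSTM(64)(x)
out = Dense(1, activation='sigmoid')(x)

classifier = Model(inputs=inp, outputs=out)
classifier.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
classifier.fit(X_t, y_train, batch_size=32, epochs=2, validation_split=0.1)    # y_train: hypothetical binary labels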
