A native NumPy implementation of word2vec (translation)

Author: 豆约翰 | Published 2019-08-04 13:16

An implementation guide to Word2Vec using NumPy and Google Sheets

Original article:

https://medium.com/@derekchia/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281


Word2Vec is considered one of the biggest and latest breakthroughs in Natural Language Processing (NLP). The concept is simple, elegant and (relatively) easy to grasp. A quick Google search returns plenty of results on how to call Word2Vec through libraries such as Gensim and TensorFlow. In addition, for the curious, check out Tomas Mikolov's original implementation in C. The original paper can also be found here.

The main focus of this article is to walk through Word2Vec in detail. For this, I implemented Word2Vec in Python using NumPy (with the help of other tutorials) and also prepared a Google Sheet to showcase the calculations. The full code and the link to the Google Sheet can be found at the end of this article.


Intuition

The goal of Word2Vec is to generate vector representations of words that carry semantic meaning for further NLP tasks. Each word vector typically has several hundred dimensions, and each unique word in the corpus is assigned a vector in that space. For example, the word "happy" can be represented as a 4-dimensional vector [0.24, 0.45, 0.11, 0.49], and "sad" as [0.88, 0.78, 0.45, 0.91].

This mapping from words to vectors is also known as word embedding. The reason for the transformation is that machine learning algorithms can perform linear algebra operations on numbers (in vectors) rather than on words.
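For example, once words are vectors, ordinary linear algebra applies to them. Here is a quick sketch (not from the original article) that measures how similar the two toy 4-dimensional vectors above are:

import numpy as np

# The toy 4-dimensional example vectors quoted above
happy = np.array([0.24, 0.45, 0.11, 0.49])
sad = np.array([0.88, 0.78, 0.45, 0.91])

# With numbers instead of words we can, for instance, compute cosine similarity
cos_sim = np.dot(happy, sad) / (np.linalg.norm(happy) * np.linalg.norm(sad))
print(cos_sim)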

To implement Word2Vec there are two flavours to choose from: Continuous Bag-of-Words (CBOW) or Skip-gram (SG). In short, CBOW attempts to guess the output (target word) from its neighbouring words (context words), while Skip-gram guesses the context words from the target word. In effect, Word2Vec is based on the distributional hypothesis, which states that each word's meaning is captured by its nearby words. Hence, by looking at its neighbouring words, we can attempt to predict the target word.

According to Mikolov (quoted in this article), here is the difference between Skip-gram and CBOW:

Skip-gram: works well with a small amount of training data and represents even rare words or phrases well

CBOW: several times faster to train than Skip-gram, with slightly better accuracy for frequent words

In more detail, since Skip-gram learns to predict the context words from a given word, in case two words (one appearing infrequently and the other more frequently) are placed side by side, both will receive the same treatment when the loss is minimised, because each word is used both as a target word and as a context word. With CBOW, in contrast, the infrequent word is only one member of the collection of context words used to predict the target word, so the model will assign the infrequent word a low probability.
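To make the contrast concrete, here is a small sketch (my own addition, not part of the original implementation) that prints the training pairs each flavour would see for our corpus with a window size of 2:

sentence = "natural language processing and machine learning is fun and exciting".split()
window = 2

for i, target in enumerate(sentence):
    # Context words are the words within `window` positions of the target
    context = [sentence[j] for j in range(i - window, i + window + 1)
               if j != i and 0 <= j < len(sentence)]
    # Skip-gram: one (target -> context word) pair per context word
    print("Skip-gram:", [(target, c) for c in context])
    # CBOW: one (context words -> target) example per position
    print("CBOW:     ", (context, target))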


Implementation

In this article, we will implement the Skip-gram architecture. For ease of reading, the content is divided into the following parts:

1. Data Preparation: define the corpus, clean, normalise and tokenise it

2. Hyperparameters: learning rate, epochs, window size, embedding size

3. Generate Training Data: build the vocabulary, one-hot encode the words, and build the dictionaries that map ids to words and words to ids

4. Model Training: pass the encoded words through a forward pass, calculate the error, adjust the weights using backpropagation and compute the loss

5. Inferencing: retrieve word vectors and find similar words

6. Further improvements: speeding up training with Skip-gram Negative Sampling and Hierarchical Softmax

1. Data Preparation

First, we start with the following corpus:

natural language processing and machine learning is fun and exciting

For simplicity, we have chosen a sentence without punctuation or capitalisation. We also did not remove the stop words "and" and "is".

In reality, text data is unstructured and can be very "dirty". Cleaning it involves steps such as removing stop words and punctuation, converting the text to lowercase (it really depends on your use case), replacing digits, and so on. KDnuggets has an excellent article on this process. Alternatively, Gensim provides a function for simple text preprocessing, gensim.utils.simple_preprocess, which converts a document into a list of lowercase tokens and ignores tokens that are too short or too long.
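For example, assuming Gensim is installed, the helper mentioned above could be used like this (a sketch, not part of this article's implementation):

# gensim.utils.simple_preprocess lowercases and tokenises a document,
# dropping tokens shorter than 2 or longer than 15 characters by default
from gensim.utils import simple_preprocess

print(simple_preprocess("Natural language processing and machine learning is fun and exciting!"))
# e.g. ['natural', 'language', 'processing', 'and', 'machine', 'learning', 'is', 'fun', 'and', 'exciting']

In this article, however, we simply lowercase the text and split it on whitespace: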

text = "natural language processing and machine learning is fun and exciting"

# Note the .lower() as upper and lowercase does not matter in our implementation
# [['natural', 'language', 'processing', 'and', 'machine', 'learning', 'is', 'fun', 'and', 'exciting']]
corpus = [[word.lower() for word in text.split()]]

After preprocessing, we move on to tokenising the corpus. We tokenise our corpus by splitting it on whitespace, which gives us a list of words:

["natural", "language", "processing", "and", "machine", "learning", "is", "fun", "and", "exciting"]

2. Hyperparameters

Before we get to the word2vec implementation, let us first define some hyperparameters that we will need later.

settings = {
    'window_size': 2,   # context window +- center word
    'n': 10,        # dimensions of word embeddings, also refer to size of hidden layer
    'epochs': 50,       # number of training epochs
    'learning_rate': 0.01   # learning rate
}

[n]: this is the dimension of the word embedding (word vector), and it typically ranges from 100 to 300 depending on the size of your vocabulary. Dimensionality beyond 300 tends to yield diminishing returns (see Figure 2(a) on page 1538). Note that the dimension is also the size of the hidden layer.

[epochs]: this is the number of passes over the entire training set. In each epoch, we cycle through all of the training samples once.

[learning_rate]: the learning rate controls the amount of adjustment made to the weights with respect to the loss gradient.
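As a toy illustration (my own, not from the article's code), the learning rate simply scales each gradient step before it is subtracted from the weights:

import numpy as np

learning_rate = 0.01
w = np.array([0.5, -0.3])      # some weights
grad = np.array([2.0, -1.0])   # pretend gradient of the loss w.r.t. w
w = w - learning_rate * grad   # the same update rule used later in backprop()
print(w)                       # [ 0.48 -0.29]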

3. Generate Training Data

In this section, our main objective is to turn the corpus into a one-hot encoded representation for the Word2Vec model to train on. From our corpus, Figure 4 shows each of the 10 windows (#1 to #10). Each window consists of the target word and its context words, highlighted in orange and green respectively.

[Figure 4: the 10 training windows (#1 to #10), target words in orange and context words in green]

As an example, here are the one-hot encodings for the first (#1) and last (#10) training windows:

#1 [Target (natural)], [Context (language, processing)]

[list([1, 0, 0, 0, 0, 0, 0, 0, 0])
 list([[0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0]])]

#2 to #9 omitted

#10 [Target (exciting)], [Context (fun, and)]

[list([0, 0, 0, 0, 0, 0, 0, 0, 1])
 list([[0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0]])]

To generate the one-hot training data, we first initialise the word2vec() object and then call the function generate_training_data on the object w2v, passing in settings and corpus as arguments.

# Initialise object
w2v = word2vec()
# Numpy ndarray with one-hot representation for [target_word, context_words]
training_data = w2v.generate_training_data(settings, corpus)

Inside the function generate_training_data, we perform the following operations:

self.v_count: the length of the vocabulary (note that the vocabulary is the set of unique words in the corpus)

self.words_list: the list of words in the vocabulary

self.word_index: a dictionary keyed by each vocabulary word, with its index as the value

self.index_word: a dictionary keyed by each index, with the corresponding vocabulary word as the value

The for loop then appends the one-hot representation of each target word and its context words to training_data, using the word2onehot function for the encoding.

class word2vec():
  def __init__(self):
    self.n = settings['n']
    self.lr = settings['learning_rate']
    self.epochs = settings['epochs']
    self.window = settings['window_size']

  def generate_training_data(self, settings, corpus):
    # Find unique word counts using dictonary
    word_counts = defaultdict(int)
    for row in corpus:
      for word in row:
        word_counts[word] += 1
    ## How many unique words in vocab? 9
    self.v_count = len(word_counts.keys())
    # Generate Lookup Dictionaries (vocab)
    self.words_list = list(word_counts.keys())
    # Generate word:index
    self.word_index = dict((word, i) for i, word in enumerate(self.words_list))
    # Generate index:word
    self.index_word = dict((i, word) for i, word in enumerate(self.words_list))

    training_data = []
    # Cycle through each sentence in corpus
    for sentence in corpus:
      sent_len = len(sentence)
      # Cycle through each word in sentence
      for i, word in enumerate(sentence):
        # Convert target word to one-hot
        w_target = self.word2onehot(sentence[i])
        # Cycle through context window
        w_context = []
        # Note: window_size 2 will have range of 5 values
        for j in range(i - self.window, i + self.window+1):
          # Criteria for context word 
          # 1. Target word cannot be context word (j != i)
          # 2. Index must be greater or equal than 0 (j >= 0) - if not list index out of range
          # 3. Index must be less or equal than length of sentence (j <= sent_len-1) - if not list index out of range 
          if j != i and j <= sent_len-1 and j >= 0:
            # Append the one-hot representation of word to w_context
            w_context.append(self.word2onehot(sentence[j]))
            # print(sentence[i], sentence[j]) 
            # training_data contains a one-hot representation of the target word and context words
        training_data.append([w_target, w_context])
    # dtype=object because each sample pairs a flat list (w_target) with a list of lists (w_context)
    return np.array(training_data, dtype=object)

  def word2onehot(self, word):
    # word_vec - initialise a blank vector
    word_vec = [0 for i in range(0, self.v_count)] # Alternative - np.zeros(self.v_count)
    # Get ID of word from word_index
    word_index = self.word_index[word]
    # Change value from 0 to 1 according to ID of the word
    word_vec[word_index] = 1
    return word_vec

4. Model Training


With training_data, we are now ready to train the model. Training starts with w2v.train(training_data): we pass in the training data and execute the train function.

The Word2Vec model has two weight matrices (w1 and w2). For demonstration purposes, we initialise them to predefined values with shapes (9x10) and (10x9) respectively. This makes it easier to follow the backpropagation calculations, which are covered later in this article. In actual training, you should initialise the weights randomly (for example with np.random.uniform()). To do that, comment out lines 9 and 10 and uncomment lines 11 and 12.
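In matrix terms, with V = 9 (the vocabulary size) and N = 10 (the embedding dimension), the shapes are:

$$W_1 \in \mathbb{R}^{V \times N} = \mathbb{R}^{9 \times 10}, \qquad W_2 \in \mathbb{R}^{N \times V} = \mathbb{R}^{10 \times 9}$$

Each row of W1 ends up being the embedding of one vocabulary word, which is what word_vec looks up later.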

# Training
w2v.train(training_data)

class word2vec():
  def train(self, training_data):
  # Initialising weight matrices
  # Both w1 and w2 should be randomly initialised but for this demo, we pre-determine the arrays (getW1 and getW2)
  # getW1 - shape (9x10) and getW2 - shape (10x9)
  self.w1 = np.array(getW1)
  self.w2 = np.array(getW2)
  # self.w1 = np.random.uniform(-1, 1, (self.v_count, self.n))
  # self.w2 = np.random.uniform(-1, 1, (self.n, self.v_count))

Training: Forward Pass

Next, we start training our first epoch using the first training sample by passing w_t, the one-hot vector representing the target word, into the forward_pass function. In forward_pass, we take a dot product of w1 and w_t to get h (the original article refers to line 24, although in the figure it is actually line 22). Then we take another dot product of w2 and h to get the output layer u (the original refers to line 26; in the figure it is line 24). Finally, we run u through softmax to squash each element into the range between 0 and 1, giving us the probabilities used for prediction, before returning the prediction vector y_pred together with the hidden layer h and the output layer u (line 28).
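In equation form (matching the code below), the forward pass is:

$$h = W_1^{\top} x, \qquad u = W_2^{\top} h, \qquad \hat{y} = \mathrm{softmax}(u), \qquad \mathrm{softmax}(u)_j = \frac{e^{u_j}}{\sum_{k=1}^{V} e^{u_k}}$$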

class word2vec():
  def train(self, training_data):
  ##Removed##
  
    # Cycle through each epoch
    for i in range(self.epochs):
      # Intialise loss to 0
      self.loss = 0

      # Cycle through each training sample
      # w_t = vector for target word, w_c = vectors for context words
      for w_t, w_c in training_data:
        # Forward pass - Pass in vector for target word (w_t) to get:
        # 1. predicted y using softmax (y_pred) 2. matrix of hidden layer (h) 3. output layer before softmax (u)
        y_pred, h, u = self.forward_pass(w_t)
        
        ##Removed##
        
  def forward_pass(self, x):
    # x is one-hot vector for target word, shape - 9x1
    # Run through first matrix (w1) to get hidden layer - 10x9 dot 9x1 gives us 10x1
    h = np.dot(self.w1.T, x)
    # Dot product hidden layer with second matrix (w2) - 9x10 dot 10x1 gives us 9x1
    u = np.dot(self.w2.T, h)
    # Run 1x9 through softmax to force each element to range of [0, 1] - 1x9
    y_c = self.softmax(u)
    return y_c, h, u
  
  def softmax(self, x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

I have attached some screenshots showing the calculation for the first training sample in the first window (#1), where the target word is "natural" and the context words are "language" and "processing". You can look at the formulas in the Google Sheet here.

[Screenshot: forward-pass calculations for training window #1 in the Google Sheet]

Training: Error, Backpropagation and Loss

Error: with y_pred, h and u, we proceed to calculate the error for this particular set of target and context words. This is done by summing up the differences between y_pred and each of the context words in w_c.
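In symbols, with prediction y_pred and the one-hot vectors of the C context words, the summed error is

$$EI = \sum_{c=1}^{C} \left( \hat{y} - t_c \right)$$

which is exactly what the np.sum line in the code below computes.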


Backpropagation: next, we use the backpropagation function, backprop, to calculate the amount of adjustment needed for the weights, by passing in the error EI, the hidden layer h and the one-hot vector of the target word w_t.

To update the weights, we multiply the adjustments (dl_dw1 and dl_dw2) by the learning rate and subtract them from the current weights (w1 and w2).
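Written out (with learning rate eta and the outer product denoted by the circled times symbol), the gradients and updates computed in backprop are:

$$\frac{\partial E}{\partial W_2} = h \otimes EI, \qquad \frac{\partial E}{\partial W_1} = x \otimes \left( W_2\, EI \right)$$

$$W_1 \leftarrow W_1 - \eta\, \frac{\partial E}{\partial W_1}, \qquad W_2 \leftarrow W_2 - \eta\, \frac{\partial E}{\partial W_2}$$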

class word2vec():
  ##Removed##
  
  for i in range(self.epochs):
    self.loss = 0
    for w_t, w_c in training_data:
    ##Removed##
      
      # Calculate error
      # 1. For a target word, calculate difference between y_pred and each of the context words
      # 2. Sum up the differences using np.sum to give us the error for this particular target word
      EI = np.sum([np.subtract(y_pred, word) for word in w_c], axis=0)

      # Backpropagation
      # We use SGD to backpropagate errors - calculate loss on the output layer 
      self.backprop(EI, h, w_t)

      # Calculate loss
      # There are 2 parts to the loss function
      # Part 1: -ve sum of all the output +
      # Part 2: length of context words * log of sum for all elements (exponential-ed) in the output layer before softmax (u)
      # Note: word.index(1) returns the index in the context word vector with value 1
      # Note: u[word.index(1)] returns the value of the output layer before softmax
      self.loss += -np.sum([u[word.index(1)] for word in w_c]) + len(w_c) * np.log(np.sum(np.exp(u)))
    print('Epoch:', i, "Loss:", self.loss)

  def backprop(self, e, h, x):
    # https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.outer.html
    # Column vector EI represents row-wise sum of prediction errors across each context word for the current center word
    # Going backwards, we need to take derivative of E with respect of w2
    # h - shape 10x1, e - shape 9x1, dl_dw2 - shape 10x9
    dl_dw2 = np.outer(h, e)
    # x - shape 9x1, w2 - 10x9, e - 9x1
    # np.dot(self.w2, e.T) - 10x1, dl_dw1 - 9x10
    dl_dw1 = np.outer(x, np.dot(self.w2, e.T))
    # Update weights
    self.w1 = self.w1 - (self.lr * dl_dw1)
    self.w2 = self.w2 - (self.lr * dl_dw2)

An alternative implementation of backprop


    def backprop(self, e, h, x):
        # https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.outer.html
        # Column vector EI represents row-wise sum of prediction errors across each context word for the current center word
        # Going backwards, we need to take derivative of E with respect of w2
        # h - shape 10x1, e - shape 9x1, dl_dw2 - shape 10x9
        # x - shape 9x1, w2 - 10x9, e.T - 9x1
        leng = len(e)
        dl_dw2 = np.outer(h, e)
        dl_dw22 = np.dot(h.reshape(1,-1).T,e.reshape(-1,9)) #9==len(e)
        # print('Delta for w2', dl_dw2)         #
        # print('Delta for w22', dl_dw22)           #
        dl_dw1 = np.outer(x, np.dot(self.w2, e.T))
        # 9==len(e)
        dh = np.dot(e.reshape(-1,9),self.w2.T)#1x10
        dl_dw11 = np.dot(np.array(x).reshape(1,-1).T,dh)
        # print('Delta for w1', dl_dw1)  #
        # print('Delta for w11', dl_dw11)  #
        ########################################
        # print('Delta for w2', dl_dw2)         #
        # print('Hidden layer', h)              #
        # print('np.dot', np.dot(self.w2, e.T)) #
        # print('Delta for w1', dl_dw1)         #
        #########################################

        # Update weights
        self.w1 = self.w1 - (self.lr * dl_dw1)
        self.w2 = self.w2 - (self.lr * dl_dw2)

Loss: finally, after each training sample is processed, we add its contribution to the overall loss according to the loss function. Note that the loss function has two parts. The first part is the negative of the sum of the output layer values (before softmax) at the positions of the context words. The second part is the number of context words multiplied by the log of the sum of the exponentials of all elements in the output layer.
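In symbols (following the paper cited below), the loss added for one training window with C context words is

$$E = -\sum_{c=1}^{C} u_{j_c^*} + C \cdot \log \sum_{j'=1}^{V} \exp(u_{j'})$$

where j*_c is the vocabulary index of the c-th context word; these are exactly the two terms in the self.loss line above.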


Source: https://arxiv.org/pdf/1411.2738.pdf

5. Inferencing

Now that we have completed training for 50 epochs, both weight matrices (w1 and w2) are ready to perform inference.

Getting the vector for a word

With a trained set of weights, the first thing we can do is look up the word vector for a word in the vocabulary. We simply index into the trained weights (w1) using the word's index. In the example below, we look up the vector for the word "machine".

# Get vector for word
vec = w2v.word_vec("machine")

class word2vec():
  ## Removed ##
  
  # Get vector from word
  def word_vec(self, word):
    w_index = self.word_index[word]
    v_w = self.w1[w_index]
    return v_w
print(w2v.word_vec("machine"))

[ 0.76702922 -0.95673743  0.49207258  0.16240808 -0.4538815
 -0.74678226  0.42072706 -0.04147312  0.08947326 -0.24245257]

Finding similar words

Another thing we can do is find similar words. Even though our vocabulary is small, we can still implement the function vec_sim by computing the cosine similarity between words.
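For reference, the similarity score computed in vec_sim is the standard cosine similarity between two word vectors:

$$\cos(\theta) = \frac{v_{w_1} \cdot v_{w_2}}{\lVert v_{w_1} \rVert\, \lVert v_{w_2} \rVert}$$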

# Find similar words
w2v.vec_sim("machine", 3)

class word2vec():
  ## Removed##
  
  # Input vector, returns nearest word(s)
  def vec_sim(self, word, top_n):
    v_w1 = self.word_vec(word)
    word_sim = {}

    for i in range(self.v_count):
      # Find the similary score for each word in vocab
      v_w2 = self.w1[i]
      theta_sum = np.dot(v_w1, v_w2)
      theta_den = np.linalg.norm(v_w1) * np.linalg.norm(v_w2)
      theta = theta_sum / theta_den

      word = self.index_word[i]
      word_sim[word] = theta

    words_sorted = sorted(word_sim.items(), key=lambda kv: kv[1], reverse=True)

    for word, sim in words_sorted[:top_n]:
      print(word, sim)

w2v.vec_sim("machine", 3)

machine 1.0

fun 0.6223490454018772

and 0.5190154215400249

Reference on cosine similarity:
http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html

Full code:

"""
This is an implementation of Word2Vec using numpy. Uncomment the print functions to see Word2Vec in action! Also remember to change the number of epochs and set training_data to training_data[0] to avoid flooding your terminal. A Google Sheet implementation of Word2Vec is also available here - https://docs.google.com/spreadsheets/d/1mgf82Ue7MmQixMm2ZqnT1oWUucj6pEcd2wDs_JgHmco/edit?usp=sharing
Have fun learning!
Author: Derek Chia
Email: derek@derekchia.com
"""

import numpy as np
from collections import defaultdict

## Pre-determined weights (stand-ins for random initialisation in this demo)
getW1 = [[0.236, -0.962, 0.686, 0.785, -0.454, -0.833, -0.744, 0.677, -0.427, -0.066],
        [-0.907, 0.894, 0.225, 0.673, -0.579, -0.428, 0.685, 0.973, -0.070, -0.811],
        [-0.576, 0.658, -0.582, -0.112, 0.662, 0.051, -0.401, -0.921, -0.158, 0.529],
        [0.517, 0.436, 0.092, -0.835, -0.444, -0.905, 0.879, 0.303, 0.332, -0.275],
        [0.859, -0.890, 0.651, 0.185, -0.511, -0.456, 0.377, -0.274, 0.182, -0.237],
        [0.368, -0.867, -0.301, -0.222, 0.630, 0.808, 0.088, -0.902, -0.450, -0.408],
        [0.728, 0.277, 0.439, 0.138, -0.943, -0.409, 0.687, -0.215, -0.807, 0.612],
        [0.593, -0.699, 0.020, 0.142, -0.638, -0.633, 0.344, 0.868, 0.913, 0.429],
        [0.447, -0.810, -0.061, -0.495, 0.794, -0.064, -0.817, -0.408, -0.286, 0.149]]

getW2 = [[-0.868, -0.406, -0.288, -0.016, -0.560, 0.179, 0.099, 0.438, -0.551],
        [-0.395, 0.890, 0.685, -0.329, 0.218, -0.852, -0.919, 0.665, 0.968],
        [-0.128, 0.685, -0.828, 0.709, -0.420, 0.057, -0.212, 0.728, -0.690],
        [0.881, 0.238, 0.018, 0.622, 0.936, -0.442, 0.936, 0.586, -0.020],
        [-0.478, 0.240, 0.820, -0.731, 0.260, -0.989, -0.626, 0.796, -0.599],
        [0.679, 0.721, -0.111, 0.083, -0.738, 0.227, 0.560, 0.929, 0.017],
        [-0.690, 0.907, 0.464, -0.022, -0.005, -0.004, -0.425, 0.299, 0.757],
        [-0.054, 0.397, -0.017, -0.563, -0.551, 0.465, -0.596, -0.413, -0.395],
        [-0.838, 0.053, -0.160, -0.164, -0.671, 0.140, -0.149, 0.708, 0.425],
        [0.096, -0.995, -0.313, 0.881, -0.402, -0.631, -0.660, 0.184, 0.487]]

class word2vec():

    def __init__(self):
        self.n = settings['n']
        self.lr = settings['learning_rate']
        self.epochs = settings['epochs']
        self.window = settings['window_size']

    def generate_training_data(self, settings, corpus):
        # Find unique word counts using dictonary
        word_counts = defaultdict(int)
        for row in corpus:
            for word in row:
                word_counts[word] += 1
        #########################################################################################################################################################
        # print(word_counts)                                                                                                                                    #
        # # defaultdict(<class 'int'>, {'natural': 1, 'language': 1, 'processing': 1, 'and': 2, 'machine': 1, 'learning': 1, 'is': 1, 'fun': 1, 'exciting': 1}) #
        #########################################################################################################################################################

        ## How many unique words in vocab? 9
        self.v_count = len(word_counts.keys())
        #########################
        # print(self.v_count)   #
        # 9                     #
        #########################

        # Generate Lookup Dictionaries (vocab)
        self.words_list = list(word_counts.keys())
        #################################################################################################
        # print(self.words_list)                                                                        #
        # ['natural', 'language', 'processing', 'and', 'machine', 'learning', 'is', 'fun', 'exciting']  #
        #################################################################################################
        
        # Generate word:index
        self.word_index = dict((word, i) for i, word in enumerate(self.words_list))
        #############################################################################################################################
        # print(self.word_index)                                                                                                    #
        # # {'natural': 0, 'language': 1, 'processing': 2, 'and': 3, 'machine': 4, 'learning': 5, 'is': 6, 'fun': 7, 'exciting': 8} #
        #############################################################################################################################

        # Generate index:word
        self.index_word = dict((i, word) for i, word in enumerate(self.words_list))
        #############################################################################################################################
        # print(self.index_word)                                                                                                    #
        # {0: 'natural', 1: 'language', 2: 'processing', 3: 'and', 4: 'machine', 5: 'learning', 6: 'is', 7: 'fun', 8: 'exciting'}   #
        #############################################################################################################################

        training_data = []

        # Cycle through each sentence in corpus
        for sentence in corpus:
            sent_len = len(sentence)

            # Cycle through each word in sentence
            for i, word in enumerate(sentence):
                # Convert target word to one-hot
                w_target = self.word2onehot(sentence[i])

                # Cycle through context window
                w_context = []

                # Note: window_size 2 will have range of 5 values
                for j in range(i - self.window, i + self.window+1):
                    # Criteria for context word 
                    # 1. Target word cannot be context word (j != i)
                    # 2. Index must be greater or equal than 0 (j >= 0) - if not list index out of range
                    # 3. Index must be less or equal than length of sentence (j <= sent_len-1) - if not list index out of range 
                    if j != i and j <= sent_len-1 and j >= 0:
                        # Append the one-hot representation of word to w_context
                        w_context.append(self.word2onehot(sentence[j]))
                        # print(sentence[i], sentence[j]) 
                        #########################
                        # Example:              #
                        # natural language      #
                        # natural processing    #
                        # language natural      #
                        # language processing   #
                        # language and          #
                        #########################
                        
                # training_data contains a one-hot representation of the target word and context words
                #################################################################################################
                # Example:                                                                                      #
                # [Target] natural, [Context] language, [Context] processing                                    #
                # print(training_data)                                                                          #
                # [[[1, 0, 0, 0, 0, 0, 0, 0, 0], [[0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0]]]]   #
                #################################################################################################
                training_data.append([w_target, w_context])

        # dtype=object because each sample pairs a flat list (w_target) with a list of lists (w_context)
        return np.array(training_data, dtype=object)

    def word2onehot(self, word):
        # word_vec - initialise a blank vector
        word_vec = [0 for i in range(0, self.v_count)] # Alternative - np.zeros(self.v_count)
        ###############################
        # print(word_vec)             #
        # [0, 0, 0, 0, 0, 0, 0, 0, 0] #
        ###############################

        # Get ID of word from word_index
        word_index = self.word_index[word]

        # Change value from 0 to 1 according to ID of the word
        word_vec[word_index] = 1

        return word_vec

    def train(self, training_data):
        # Initialising weight matrices
        # np.random.uniform(low, high, size)
        # https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.uniform.html
        self.w1 = np.array(getW1)
        self.w2 = np.array(getW2)
        # self.w1 = np.random.uniform(-1, 1, (self.v_count, self.n))
        # self.w2 = np.random.uniform(-1, 1, (self.n, self.v_count))
        
        # Cycle through each epoch
        for i in range(self.epochs):
            # Intialise loss to 0
            self.loss = 0
            # Cycle through each training sample
            # w_t = vector for target word, w_c = vectors for context words
            for w_t, w_c in training_data:
                # Forward pass
                # 1. predicted y using softmax (y_pred) 2. matrix of hidden layer (h) 3. output layer before softmax (u)
                y_pred, h, u = self.forward_pass(w_t)
                #########################################
                # print("Vector for target word:", w_t) #
                # print("W1-before backprop", self.w1)  #
                # print("W2-before backprop", self.w2)  #
                #########################################

                # Calculate error
                # 1. For a target word, calculate difference between y_pred and each of the context words
                # 2. Sum up the differences using np.sum to give us the error for this particular target word
                EI = np.sum([np.subtract(y_pred, word) for word in w_c], axis=0)
                #########################
                # print("Error", EI)    #
                #########################

                # Backpropagation
                # We use SGD to backpropagate errors - calculate loss on the output layer 
                self.backprop(EI, h, w_t)
                #########################################
                #print("W1-after backprop", self.w1)    #
                #print("W2-after backprop", self.w2)    #
                #########################################

                # Calculate loss
                # There are 2 parts to the loss function
                # Part 1: -ve sum of all the output +
                # Part 2: length of context words * log of sum for all elements (exponential-ed) in the output layer before softmax (u)
                # Note: word.index(1) returns the index in the context word vector with value 1
                # Note: u[word.index(1)] returns the value of the output layer before softmax
                self.loss += -np.sum([u[word.index(1)] for word in w_c]) + len(w_c) * np.log(np.sum(np.exp(u)))
                
                #############################################################
                # Break if you want to see weights after first target word  #
                # break                                                     #
                #############################################################
            print('Epoch:', i, "Loss:", self.loss)

    def forward_pass(self, x):
        # x is one-hot vector for target word, shape - 9x1
        # Run through first matrix (w1) to get hidden layer - 1x9 dot 9x10 gives us 1x10
        h = np.dot(x, self.w1)
        # Dot product hidden layer with second matrix (w2) - 1x10 dot 10x9 gives us 1x9
        u = np.dot(h, self.w2)
        # Run u through softmax to force each element to the range of [0, 1] - 1x9
        y_c = self.softmax(u)
        return y_c, h, u

    def softmax(self, x):
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum(axis=0)

    def backprop(self, e, h, x):
        # https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.outer.html
        # Column vector EI represents row-wise sum of prediction errors across each context word for the current center word
        # Going backwards, we need to take derivative of E with respect of w2
        # h - shape 10x1, e - shape 9x1, dl_dw2 - shape 10x9
        # x - shape 9x1, w2 - 10x9, e.T - 9x1
        dl_dw2 = np.outer(h, e)
        dl_dw1 = np.outer(x, np.dot(self.w2, e.T))
        ########################################
        # print('Delta for w2', dl_dw2)         #
        # print('Hidden layer', h)              #
        # print('np.dot', np.dot(self.w2, e.T)) #
        # print('Delta for w1', dl_dw1)         #
        #########################################

        # Update weights
        self.w1 = self.w1 - (self.lr * dl_dw1)
        self.w2 = self.w2 - (self.lr * dl_dw2)

    # Get vector from word
    def word_vec(self, word):
        w_index = self.word_index[word]
        v_w = self.w1[w_index]
        return v_w

    # Input vector, returns nearest word(s)
    def vec_sim(self, word, top_n):
        v_w1 = self.word_vec(word)
        word_sim = {}

        for i in range(self.v_count):
            # Find the similary score for each word in vocab
            v_w2 = self.w1[i]
            theta_sum = np.dot(v_w1, v_w2)
            theta_den = np.linalg.norm(v_w1) * np.linalg.norm(v_w2)
            theta = theta_sum / theta_den

            word = self.index_word[i]
            word_sim[word] = theta

        words_sorted = sorted(word_sim.items(), key=lambda kv: kv[1], reverse=True)

        for word, sim in words_sorted[:top_n]:
            print(word, sim)

#####################################################################
settings = {
    'window_size': 2,           # context window +- center word
    'n': 10,                    # dimensions of word embeddings, also refer to size of hidden layer
    'epochs': 50,               # number of training epochs
    'learning_rate': 0.01       # learning rate
}

text = "natural language processing and machine learning is fun and exciting"

# Note the .lower() as upper and lowercase does not matter in our implementation
# [['natural', 'language', 'processing', 'and', 'machine', 'learning', 'is', 'fun', 'and', 'exciting']]
corpus = [[word.lower() for word in text.split()]]

# Initialise object
w2v = word2vec()

# Numpy ndarray with one-hot representation for [target_word, context_words]
training_data = w2v.generate_training_data(settings, corpus)

# Training
w2v.train(training_data)

# Get vector for word
word = "machine"
vec = w2v.word_vec(word)
print(word, vec)

# Find similar words
w2v.vec_sim("machine", 3)
