Language Models: Training with NLTK

Author: ArthurN | Published: 2019-06-30 15:36

Language Models: Training with NLTK and Computing Perplexity and Text Entropy

Author: Sixing Yan

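This part mainly records some of the questions I ran into, and how I came to understand them, while reading the source code of NLTK's two language models.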

1. How do the MLE and Lidstone language models in NLTK differ?

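NLTK offers two ways to build an n-gram language model: maximum likelihood estimation (MLE) and Lidstone smoothing. Both are implemented purely on top of count statistics; neither estimates its parameters through gradient updates.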

(1) Using a language model to compute perplexity or text entropy

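Once a language model has been built with NLTK, it is typically used to compute the perplexity or the entropy of a given input. Both scoring functions end up calling the score function of the parent class (LanguageModel) that MLE and Lidstone share.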

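When score is called from outside, the actual computation is delegated to the model's own unmasked_score function, and MLE and Lidstone each define a different unmasked_score: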

    def score(self, word, context=None):
        """Masks out of vocab (OOV) words and computes their model score.

        For model-specific logic of calculating scores, see the `unmasked_score`
        method.
        """
        return self.unmasked_score(
            self.vocab.lookup(word), self.vocab.lookup(context) if context else None
        )
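As a rough illustration of the workflow described above, the following sketch trains both models on a toy corpus and queries score, entropy and perplexity. The corpus, the names `ORDER`, `train` and `test_bigrams`, and the choice gamma = 0.1 are made up for this example; the calls follow the `nltk.lm` API as I understand it.

```python
from nltk.lm import MLE, Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy corpus (hypothetical data, chosen only for illustration).
corpus = [["a", "b", "c"], ["a", "c", "d", "c", "e", "f"]]
ORDER = 2  # bigram model


def train(model):
    # The pipeline returns lazy iterators, so build a fresh pair for each model.
    train_data, vocab = padded_everygram_pipeline(ORDER, corpus)
    model.fit(train_data, vocab)
    return model


mle = train(MLE(ORDER))
lid = train(Lidstone(0.1, ORDER))  # gamma = 0.1 (an arbitrary choice)

# Entropy and perplexity take n-gram tuples; the last item of each tuple is
# the word being scored and the items before it form the context.
test_bigrams = [("a", "b"), ("b", "c")]
for name, model in (("MLE", mle), ("Lidstone", lid)):
    print(name,
          model.score("b", ["a"]),
          model.entropy(test_bigrams),
          model.perplexity(test_bigrams))
```

Both models see exactly the same counts; only the way a count is turned into a score differs.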

(2) Differences and connections between `lm.MLE` and `lm.Lidstone`

The core method of maximum likelihood estimation:

class MLE(LanguageModel):
    """Class for providing MLE ngram model scores.

    Inherits initialization from BaseNgramModel.
    """

    def unmasked_score(self, word, context=None):
        """Returns the MLE score for a word given a context.

        Args:
        - word is expcected to be a string
        - context is expected to be something reasonably convertible to a tuple
        """
        return self.context_counts(context).freq(word)

The core method of add-k smoothing:

    def unmasked_score(self, word, context=None):
        """Add-one smoothing: Lidstone or Laplace.

        To see what kind, look at `gamma` attribute on the class.

        """
        counts = self.context_counts(context)
        word_count = counts[word]
        norm_count = counts.N()
        return (word_count + self.gamma) / (norm_count + len(self.vocab) * self.gamma)

Both models use the LanguageModel.context_counts() function.

(3) Now let us look at the `context_counts` function that both `unmasked_score` functions call.

    def context_counts(self, context):
        """Helper method for retrieving counts for a given context.

        Assumes context has been checked and oov words in it masked.
        :type context: tuple(str) or None

        """
        return (
            self.counts[len(context) + 1][context] if context else self.counts.unigrams
        )

When no context is given, this function returns the self.counts.unigrams object.

So what is self.counts? By default it is an NgramCounter() object:

self.counts = NgramCounter() if counter is None else counter

(4) So I want to know what `self.counts.unigrams.freq(word)` computes

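From self._counts[1] = self.unigrams = FreqDist() we can see that unigrams is a FreqDist, so FreqDist().freq(word) is the relative frequency of word, i.e. its count divided by the total count N().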

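word_count = counts[word], by contrast, retrieves the raw count of word. Lidstone then divides by norm_count = counts.N() itself, so with gamma = 0 it arrives at the same value as FreqDist().freq(word).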
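A small check of how the two accessors relate, using `FreqDist` directly. The sample list is hypothetical; note that indexing returns a raw count while `freq` returns a relative frequency.

```python
from nltk.probability import FreqDist

# Hypothetical sample data, for illustration only.
fd = FreqDist(["a", "b", "a", "c"])

raw_count = fd["a"]      # raw count of "a"
total = fd.N()           # total number of samples recorded
rel_freq = fd.freq("a")  # relative frequency: count / total

print(raw_count, total, rel_freq)
```

Dividing the raw count by `N()` gives exactly what `freq` returns, which is why the MLE score and the Lidstone score with gamma = 0 agree.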

(5) Conclusion

lm.MLE is lm.Lidstone without smoothing, i.e. with gamma = 0.
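The relationship can be checked with a few lines of plain Python that mirror the two formulas. This is a standalone sketch, not NLTK's code: `mle_score`, `lidstone_score`, the counts and `vocab_size = 9` are all assumptions made for the example.

```python
from collections import Counter


def mle_score(counts, word):
    # Relative frequency: count(word) / N, or 0 for an empty distribution.
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


def lidstone_score(counts, word, vocab_size, gamma):
    # (count(word) + gamma) / (N + |V| * gamma)
    total = sum(counts.values())
    return (counts[word] + gamma) / (total + vocab_size * gamma)


# Hypothetical counts of the words seen after some context.
counts = Counter({"b": 1, "c": 1})
vocab_size = 9  # assumed vocabulary size

print(mle_score(counts, "b"))                         # no smoothing
print(lidstone_score(counts, "b", vocab_size, 0.0))   # gamma = 0, same as MLE
print(lidstone_score(counts, "z", vocab_size, 0.1))   # unseen word, non-zero
print(lidstone_score(counts, "b", vocab_size, 1.0))   # gamma = 1 is Laplace
```

With gamma = 0 the two functions agree; any positive gamma moves some probability mass to unseen words such as "z".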

2. Another question: how do you use the `context` feature?

From context_counts() we know that when a context is given, it returns self.counts[len(context) + 1][context].

Next, look at how the [len(context) + 1][context] lookup works on self.counts = NgramCounter(); it goes through the self._counts member:

    def __getitem__(self, item):
        """User-friendly access to ngram counts."""
        if isinstance(item, int):
            return self._counts[item]
        elif isinstance(item, string_types):
            return self._counts.__getitem__(1)[item]
        elif isinstance(item, Sequence):
            return self._counts.__getitem__(len(item) + 1)[tuple(item)]

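self._counts is initialised as defaultdict(ConditionalFreqDist). The last branch of __getitem__ (the Sequence case) then uses [len(item) + 1][tuple(item)] to fetch, among the n-grams of that order, the frequency distribution of the words that follow the given context.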

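Judging from counts = self.context_counts(context); word_count = counts[word], context_counts returns a key-value type. For example, if context = (word1, word2), then len(context) = 2 and the lookup goes to the trigram counts (the index is len(item) + 1). In other words, the two preceding words are taken into account, and what gets counted is the combination (word1, word2, word).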
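The lookup can be tried out on an `NgramCounter` directly. The sketch below uses a made-up token sequence and feeds the counter unigrams and trigrams; it relies on the `__getitem__` behaviour quoted above.

```python
from nltk.lm import NgramCounter
from nltk.util import ngrams

# Hypothetical token sequence, for illustration only.
sent = ["a", "b", "c", "a", "b", "d"]

# One "sentence" that contains both its unigrams and its trigrams.
counter = NgramCounter([list(ngrams(sent, 1)) + list(ngrams(sent, 3))])

followers = counter[("a", "b")]  # a sequence of length 2 selects the trigram counts
print(followers["c"], followers["d"], followers.N())

print(counter[3][("a", "b")]["d"])  # the same lookup, spelled out by order
print(counter["a"])                 # a plain string returns a unigram count
```

Indexing with a two-word context returns a frequency distribution over the third word, which is exactly what `counts[word]` needs.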

(1) Conclusion

Using context does not require any change to how the class is initialised; it can be used directly.

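NgramCounter sets ngram_order from the length len(tuple) of each tuple in the input List[Tuple] (see the update() function for details). So if fit() was given a list produced by ngrams(2), the context passed in must have length 1.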
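A minimal sketch of that constraint, assuming a model fitted on bigrams only. The sentence and the name `bigram_lm` are hypothetical; the zero scores follow from `FreqDist.freq` returning 0 for an empty distribution, as far as I can tell from the source.

```python
from nltk.lm import MLE
from nltk.util import bigrams

# Hypothetical training sentence, for illustration only.
sent = ["a", "b", "c"]

bigram_lm = MLE(2)
# Only bigram tuples are counted; the vocabulary is supplied separately.
bigram_lm.fit([list(bigrams(sent))], vocabulary_text=sent)

print(bigram_lm.score("b", ["a"]))       # context of length 1 matches the bigram counts
print(bigram_lm.score("b"))              # no unigrams were counted
print(bigram_lm.score("c", ["a", "b"]))  # no trigrams were counted
```

Only the context of length 1 finds any counts here; the other two lookups land on empty distributions.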

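One thing puzzled me here: if self.context_counts(context) returned an int, how could word_count = counts[word] work? In fact it does not return an int. Both branches return a FreqDist: self.counts.unigrams when there is no context, and the FreqDist stored under that context otherwise, so indexing by word is valid. An int only comes back when an NgramCounter is indexed with a single string. The context branch is the one exercised when computing entropy, because entropy passes the preceding words of each n-gram as the context.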
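How the context reaches the score function during an entropy computation can be sketched in plain Python. This mirrors my reading of the NLTK logic (base-2 log scores averaged over the n-grams) rather than quoting it; `TOY_SCORES`, `toy_score` and the probabilities are invented for the example.

```python
import math

# Hypothetical conditional probabilities P(word | context) of a bigram model.
TOY_SCORES = {
    (("a",), "b"): 0.5,
    (("b",), "c"): 1.0,
}


def toy_score(word, context):
    # Stand-in for model.score(word, context).
    return TOY_SCORES[(tuple(context), word)]


def entropy(score, text_ngrams):
    # For every n-gram, the last item is the word and the rest is its context.
    log_scores = [math.log2(score(ng[-1], ng[:-1])) for ng in text_ngrams]
    return -sum(log_scores) / len(log_scores)


def perplexity(score, text_ngrams):
    return 2.0 ** entropy(score, text_ngrams)


test_bigrams = [("a", "b"), ("b", "c")]
h = entropy(toy_score, test_bigrams)
ppl = perplexity(toy_score, test_bigrams)
print(h, ppl)
```

Because each test item is a bigram, every call to the score function receives a context of length 1.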

For the implementation details, see:
http://www.nltk.org/_modules/nltk/lm/models.html#Lidstone
http://www.nltk.org/_modules/nltk/lm/api.html#LanguageModel
http://www.nltk.org/_modules/nltk/lm/counter.html#NgramCounter
http://www.nltk.org/_modules/nltk/probability.html#FreqDist.freq
