n-gram (n元语法)

Author: 奔向超级开发者xAI | Published: 2017-04-18 16:19

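If you are new to NLP, or have not started yet, n-gram is a concept you will run into constantly when searching for material, and bigram most often of all. Understanding it early will save you a lot of trouble.
Wikipedia defines it as follows: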

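An n-gram (Chinese: n元语法) is a sequence of n consecutive words in a text. An n-gram model is a probabilistic language model based on an (n-1)-order Markov chain; it infers the structure of a sentence from the probabilities with which sequences of n words occur.
For n = 1, 2, and 3, these are called unigram, bigram, and trigram, respectively.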
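
The Markov assumption in this definition can be written out explicitly. The notation below is a sketch chosen here, not part of the quoted definition: w_i stands for the i-th word and m for the sentence length. The first equality is the chain rule; the approximation keeps only the previous n-1 words as context.

```latex
P(w_1, \dots, w_m)
  = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
  \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```

For a bigram model (n = 2), each word is conditioned only on the word immediately before it.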

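So the concept itself is very simple: extract every run of n consecutive words from the text.
Example:
Text: 我是一个好人 ("I am a good person")
Segment it into words first: 我 是 一个 好人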
unigram:
我
是
一个
好人

bigram:
我 是
是 一个
一个 好人

trigram:
我 是 一个
是 一个 好人
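
The lists above can be reproduced with a few lines of Python. This is only a minimal sketch: `simple_ngrams` is a hypothetical helper name introduced here, it assumes the text has already been segmented into a list of tokens, and it does no padding.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (no padding)."""
    # Build n views of the list, each shifted one position further right,
    # and zip them together. zip stops at the shortest view, so windows
    # that would run past the end of the list are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["我", "是", "一个", "好人"]
for n in (1, 2, 3):
    print(n, simple_ngrams(tokens, n))
```

With four tokens this gives four unigrams, three bigrams, and two trigrams, matching the lists above. In general a sequence of k tokens has k - n + 1 unpadded n-grams (none if k < n).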

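You may ask: what happens at the edges of the text, where there are fewer than n words left to form a full group? In that case it is up to you to decide whether to pad on the left, on the right, or both.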
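
To make the effect of padding concrete, here is a small sketch applied to the example sentence. The helper name `pad_tokens` and the boundary symbols `<s>` and `</s>` are illustrative assumptions; any symbols that cannot occur as real tokens would do.

```python
from itertools import chain


def pad_tokens(tokens, n, left="<s>", right="</s>"):
    """Add n-1 boundary symbols on each side of the token list."""
    return list(chain([left] * (n - 1), tokens, [right] * (n - 1)))


tokens = ["我", "是", "一个", "好人"]
padded = pad_tokens(tokens, 2)
# Pair each token with its successor to form bigrams over the padded list.
padded_bigrams = list(zip(padded, padded[1:]))
print(padded_bigrams)
```

Padding both sides of a bigram adds two extra pairs, one that marks the first word as sentence-initial and one that marks the last word as sentence-final. Padding only one side adds only the corresponding pair.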
NLTK's implementation is quite good and worth reading; it is excerpted below for reference.

from itertools import chain


# This function does the padding
def pad_sequence(sequence, n, pad_left=False, pad_right=False,
                 left_pad_symbol=None, right_pad_symbol=None):
    """
    Returns a padded sequence of items before ngram extraction.
        >>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
        ['<s>', 1, 2, 3, 4, 5, '</s>']
        >>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
        ['<s>', 1, 2, 3, 4, 5]
        >>> list(pad_sequence([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
        [1, 2, 3, 4, 5, '</s>']

    :param sequence: the source data to be padded
    :type sequence: sequence or iter
    :param n: the degree of the ngrams
    :type n: int
    :param pad_left: whether the ngrams should be left-padded
    :type pad_left: bool
    :param pad_right: whether the ngrams should be right-padded
    :type pad_right: bool
    :param left_pad_symbol: the symbol to use for left padding (default is None)
    :type left_pad_symbol: any
    :param right_pad_symbol: the symbol to use for right padding (default is None)
    :type right_pad_symbol: any
    :rtype: sequence or iter
    """
    sequence = iter(sequence)
    # Each side gets n-1 padding symbols, so every real item can appear
    # in every position of an n-gram.
    if pad_left:
        sequence = chain((left_pad_symbol,) * (n - 1), sequence)
    if pad_right:
        sequence = chain(sequence, (right_pad_symbol,) * (n - 1))
    return sequence



def ngrams(sequence, n, pad_left=False, pad_right=False,
           left_pad_symbol=None, right_pad_symbol=None):
    """
    Return the ngrams generated from a sequence of items, as an iterator.
    For example:
        >>> from nltk.util import ngrams
        >>> list(ngrams([1,2,3,4,5], 3))
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
    Wrap with list for a list version of this function.  Set pad_left
    or pad_right to true in order to get additional ngrams:
        >>> list(ngrams([1,2,3,4,5], 2, pad_right=True))
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
        >>> list(ngrams([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
        [(1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
        >>> list(ngrams([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
        [('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5)]
        >>> list(ngrams([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
        [('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
    :param sequence: the source data to be converted into ngrams
    :type sequence: sequence or iter
    :param n: the degree of the ngrams
    :type n: int
    :param pad_left: whether the ngrams should be left-padded
    :type pad_left: bool
    :param pad_right: whether the ngrams should be right-padded
    :type pad_right: bool
    :param left_pad_symbol: the symbol to use for left padding (default is None)
    :type left_pad_symbol: any
    :param right_pad_symbol: the symbol to use for right padding (default is None)
    :type right_pad_symbol: any
    :rtype: sequence or iter
    """
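    sequence = pad_sequence(sequence, n, pad_left, pad_right,
                            left_pad_symbol, right_pad_symbol)

    # Pre-fill the window with the first n-1 items.
    history = []
    while n > 1:
        # Under PEP 479 (Python 3.7+), a StopIteration escaping a generator
        # becomes a RuntimeError, so stop cleanly if the input is too short.
        try:
            next_item = next(sequence)
        except StopIteration:
            return
        history.append(next_item)
        n -= 1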
    for item in sequence:
        history.append(item)
        yield tuple(history)
        del history[0]


def bigrams(sequence, **kwargs):
    """
    Return the bigrams generated from a sequence of items, as an iterator.
    For example:
        >>> from nltk.util import bigrams
        >>> list(bigrams([1,2,3,4,5]))
        [(1, 2), (2, 3), (3, 4), (4, 5)]
    Wrap with list for a list version of this function.
    :param sequence: the source data to be converted into bigrams
    :type sequence: sequence or iter
    :rtype: iter(tuple)
    """

    for item in ngrams(sequence, 2, **kwargs):
        yield item

def trigrams(sequence, **kwargs):
    """
    Return the trigrams generated from a sequence of items, as an iterator.
    For example:
        >>> from nltk.util import trigrams
        >>> list(trigrams([1,2,3,4,5]))
        [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
    Wrap with list for a list version of this function.
    :param sequence: the source data to be converted into trigrams
    :type sequence: sequence or iter
    :rtype: iter(tuple)
    """

    for item in ngrams(sequence, 3, **kwargs):
        yield item
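
To connect the extraction code back to the probabilistic definition at the top, the sketch below estimates bigram probabilities by simple counting (maximum-likelihood estimation, with no smoothing). It is an illustration under stated assumptions rather than how NLTK builds language models: the two-sentence corpus is made up, `bigram_prob` is a hypothetical helper, and `<s>` and `</s>` are assumed boundary symbols.

```python
from collections import Counter

# Toy corpus of pre-segmented sentences; the data is made up for illustration.
corpus = [
    ["我", "是", "一个", "好人"],
    ["我", "是", "学生"],
]

context_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    padded = ["<s>"] + sentence + ["</s>"]
    # Every token except the last one acts as the context of exactly one bigram.
    context_counts.update(padded[:-1])
    bigram_counts.update(zip(padded, padded[1:]))


def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


print(bigram_prob("我", "是"))    # count(我 是) / count(我) = 2 / 2
print(bigram_prob("是", "一个"))  # count(是 一个) / count(是) = 1 / 2
```

Any bigram that never occurs in the corpus gets probability zero under this estimate, which is why practical n-gram models add some form of smoothing.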
