
Tokenizer Parameters

Author: 纵春水东流 | Published 2020-03-30 01:01

I. Purpose

Vectorizes text or data.

II. Functions

1. Update the internal vocabulary from the data or words to be vectorized

Tokenizer.fit_on_sequences(sequences)
# sequences: a list of sequences, each a list of integer word indices, e.g. [[1, 2, 3, 4]]

Tokenizer.fit_on_texts(texts)
# texts: a list of strings, e.g. ['a', 'b', 'c']
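
A minimal sketch of the fitting step (the corpus and the printed vocabulary below are illustrative assumptions; the class is taken from the keras_preprocessing package shown in the help documentation in section III):

from keras_preprocessing.text import Tokenizer

tokenizer = Tokenizer()
# Build the word -> index vocabulary from raw strings
tokenizer.fit_on_texts(['the cat sat', 'the dog sat'])
print(tokenizer.word_index)  # {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4}

# Or fit directly on integer sequences; this updates the per-index
# document counts used by sequences_to_matrix, not word_index
tokenizer.fit_on_sequences([[1, 2, 3, 4]])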

2. Vectorize data or text

# Convert a list of integer sequences into a NumPy matrix
Tokenizer.sequences_to_matrix(sequences, mode='binary')

# Convert integer sequences back into texts
Tokenizer.sequences_to_texts(sequences)

# Convert texts into a NumPy matrix
Tokenizer.texts_to_matrix(texts, mode='binary')

# Convert texts into integer sequences
Tokenizer.texts_to_sequences(texts)
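
A short end-to-end sketch of the four conversion calls (the example corpus and the outputs shown in comments are illustrative assumptions):

from keras_preprocessing.text import Tokenizer

texts = ['the cat sat on the mat', 'the dog ate my homework']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)  # must be fit before any conversion

seqs = tokenizer.texts_to_sequences(texts)
print(seqs)  # [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]

# One row per text; column j is 1 if word index j occurs in that text
m = tokenizer.texts_to_matrix(texts, mode='binary')
print(m.shape)  # (2, 10): 9 distinct words plus the reserved index 0

# Round trip: integer indices back to space-joined words
print(tokenizer.sequences_to_texts(seqs))

# Same matrix built from the integer sequences instead of raw text;
# mode can also be 'count', 'tfidf' or 'freq'
print(tokenizer.sequences_to_matrix(seqs, mode='binary').shape)  # (2, 10)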

III. Help Documentation

Help on class Tokenizer in module keras_preprocessing.text:

class Tokenizer(builtins.object)
 |  Text tokenization utility class.
 |  
 |  This class allows vectorizing a text corpus, by turning each
 |  text into either a sequence of integers (each integer being the index
 |  of a token in a dictionary) or into a vector where the coefficient
 |  for each token could be binary, based on word count, based on tf-idf...
 |  
 |  # Arguments
 |      num_words: the maximum number of words to keep, based
 |          on word frequency. Only the most common `num_words` words will
 |          be kept.
 |      filters: a string where each element is a character that will be
 |          filtered from the texts. The default is all punctuation, plus
 |          tabs and line breaks, minus the `'` character.
 |      lower: boolean. Whether to convert the texts to lowercase.
 |      split: str. Separator for word splitting.
 |      char_level: if True, every character will be treated as a token.
 |      oov_token: if given, it will be added to word_index and used to
 |          replace out-of-vocabulary words during `texts_to_sequences` calls.
 |  
 |  By default, all punctuation is removed, turning the texts into
 |  space-separated sequences of words
 |  (words may include the `'` character). These sequences are then
 |  split into lists of tokens. They will then be indexed or vectorized.
 |  
 |  `0` is a reserved index that won't be assigned to any word.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  fit_on_sequences(self, sequences)
 |      Updates internal vocabulary based on a list of sequences.
 |      
 |      Required before using `sequences_to_matrix`
 |      (if `fit_on_texts` was never called).
 |      
 |      # Arguments
 |          sequences: A list of sequences.
 |              A "sequence" is a list of integer word indices.
 |  
 |  fit_on_texts(self, texts)
 |      Updates internal vocabulary based on a list of texts.
 |      
 |      In the case where texts contains lists,
 |      we assume each entry of the lists to be a token.
 |      
 |      Required before using `texts_to_sequences` or `texts_to_matrix`.
 |      
 |      # Arguments
 |          texts: can be a list of strings,
 |              a generator of strings (for memory-efficiency),
 |              or a list of list of strings.
 |  
 |  sequences_to_matrix(self, sequences, mode='binary')
 |      Converts a list of sequences into a Numpy matrix.
 |      
 |      # Arguments
 |          sequences: list of sequences
 |              (a sequence is a list of integer word indices).
 |          mode: one of "binary", "count", "tfidf", "freq"
 |      
 |      # Returns
 |          A Numpy matrix.
 |      
 |      # Raises
 |          ValueError: In case of invalid `mode` argument,
 |              or if the Tokenizer requires to be fit to sample data.
 |  
 |  sequences_to_texts(self, sequences)
 |      Transforms each sequence into a text (string).
 |      
 |      Only top "num_words" most frequent words will be taken into account.
 |      Only words known by the tokenizer will be taken into account.
 |      
 |      # Arguments
 |          sequences: A list of sequences (lists of integers).
 |      
 |      # Returns
 |          A list of texts (strings)
 |  
 |  sequences_to_texts_generator(self, sequences)
 |      Transforms each sequence in `sequences` to a text (string).
 |      
 |      Each sequence has to be a list of integers.
 |      In other words, `sequences` should be a list of sequences.
 |      
 |      Only top "num_words" most frequent words will be taken into account.
 |      Only words known by the tokenizer will be taken into account.
 |      
 |      # Arguments
 |          sequences: A list of sequences.
 |      
 |      # Yields
 |          Yields individual texts.
 |  
 |  texts_to_matrix(self, texts, mode='binary')
 |      Converts a list of texts to a Numpy matrix.
 |      
 |      # Arguments
 |          texts: list of strings.
 |          mode: one of "binary", "count", "tfidf", "freq".
 |      
 |      # Returns
 |          A Numpy matrix.
 |  
 |  texts_to_sequences(self, texts)
 |      Transforms each text in texts to a sequence of integers.
 |      
 |      Only top "num_words" most frequent words will be taken into account.
 |      Only words known by the tokenizer will be taken into account.
 |      
 |      # Arguments
 |          texts: A list of texts (strings).
 |      
 |      # Returns
 |          A list of sequences.
 |  
 |  texts_to_sequences_generator(self, texts)
 |      Transforms each text in `texts` to a sequence of integers.
 |      
 |      Each item in texts can also be a list,
 |      in which case we assume each item of that list to be a token.
 |      
 |      Only top "num_words" most frequent words will be taken into account.
 |      Only words known by the tokenizer will be taken into account.
 |      
 |      # Arguments
 |          texts: A list of texts (strings).
 |      
 |      # Yields
 |          Yields individual sequences.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
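
To make the num_words and oov_token arguments concrete, here is a hedged sketch based on the behavior documented above (corpus and outputs are illustrative assumptions):

from keras_preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3, oov_token='<OOV>')
tokenizer.fit_on_texts(['all work and no play', 'all work no fun'])

# word_index always keeps every word; the oov_token gets index 1
print(tokenizer.word_index)
# {'<OOV>': 1, 'all': 2, 'work': 3, 'no': 4, 'and': 5, 'play': 6, 'fun': 7}

# num_words is applied only at conversion time, and it counts the
# reserved index 0, so num_words=3 keeps indices 1 and 2 only;
# unknown or cut-off words map to the OOV index 1
print(tokenizer.texts_to_sequences(['all work is dull']))  # [[2, 1, 1, 1]]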
