I. Purpose
Vectorizes text or sequence data.
II. Methods
1. Update the internal vocabulary from the data to be vectorized
Tokenizer.fit_on_sequences(sequences)
# sequences: a list of integer lists, e.g. [[1, 2, 3, 4]]
Tokenizer.fit_on_texts(texts)
# texts: a list of strings, e.g. ['a', 'b', 'c']
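As a sketch of what `fit_on_texts` does with its default settings (a minimal pure-Python approximation, not the Keras implementation): it counts word frequencies across all texts and assigns indices by descending frequency, starting at 1 because index 0 is reserved.

```python
from collections import Counter

def build_word_index(texts):
    # Count word frequencies across all texts (lowercased, split on
    # whitespace), mirroring Tokenizer.fit_on_texts with defaults.
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Assign indices by descending frequency; index 0 is reserved,
    # so the most frequent word gets index 1.
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

word_index = build_word_index(["The cat sat", "the dog sat down"])
# "the" and "sat" each occur twice, so they receive the lowest indices
```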
2. Vectorize data or text
# Convert a list of integer sequences into a NumPy matrix
sequences_to_matrix(sequences, mode='binary')
# Convert integer sequences back into texts
sequences_to_texts(sequences)
# Convert texts into a NumPy matrix
texts_to_matrix(texts, mode='binary')
# Convert texts into integer sequences
texts_to_sequences(texts)
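To make the conversion methods concrete, here is a minimal pure-Python sketch (an approximation of the Keras behavior, using a hand-built `word_index` mapping for illustration) of `texts_to_sequences` and the binary mode of `texts_to_matrix`:

```python
def texts_to_sequences(texts, word_index):
    # Map each word to its index; words not in the vocabulary are dropped.
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

def texts_to_matrix(texts, word_index):
    # Binary mode: one row per text, one column per index.
    # Column 0 stays unused because index 0 is reserved.
    size = len(word_index) + 1
    matrix = []
    for seq in texts_to_sequences(texts, word_index):
        row = [0] * size
        for idx in seq:
            row[idx] = 1
        matrix.append(row)
    return matrix

word_index = {"the": 1, "cat": 2, "sat": 3}
texts_to_sequences(["the cat sat", "the bird flew"], word_index)
# the second text keeps only "the"; "bird" and "flew" are unknown
```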
III. Help documentation
Help on class Tokenizer in module keras_preprocessing.text:
class Tokenizer(builtins.object)
| Text tokenization utility class.
|
| This class allows to vectorize a text corpus, by turning each
| text into either a sequence of integers (each integer being the index
| of a token in a dictionary) or into a vector where the coefficient
| for each token could be binary, based on word count, based on tf-idf...
|
| # Arguments
| num_words: the maximum number of words to keep, based
| on word frequency. Only the most common `num_words` words will
| be kept.
| filters: a string where each element is a character that will be
| filtered from the texts. The default is all punctuation, plus
| tabs and line breaks, minus the `'` character.
| lower: boolean. Whether to convert the texts to lowercase.
| split: str. Separator for word splitting.
| char_level: if True, every character will be treated as a token.
| oov_token: if given, it will be added to word_index and used to
| replace out-of-vocabulary words during `texts_to_sequences` calls
|
| By default, all punctuation is removed, turning the texts into
| space-separated sequences of words
| (words may include the `'` character). These sequences are then
| split into lists of tokens. They will then be indexed or vectorized.
|
| `0` is a reserved index that won't be assigned to any word.
|
| Methods defined here:
|
| __init__(self, num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, **kwargs)
| Initialize self. See help(type(self)) for accurate signature.
|
| fit_on_sequences(self, sequences)
| Updates internal vocabulary based on a list of sequences.
|
| Required before using `sequences_to_matrix`
| (if `fit_on_texts` was never called).
|
| # Arguments
| sequences: A list of sequence.
| A "sequence" is a list of integer word indices.
|
| fit_on_texts(self, texts)
| Updates internal vocabulary based on a list of texts.
|
| In the case where texts contains lists,
| we assume each entry of the lists to be a token.
|
| Required before using `texts_to_sequences` or `texts_to_matrix`.
|
| # Arguments
| texts: can be a list of strings,
| a generator of strings (for memory-efficiency),
| or a list of list of strings.
|
| sequences_to_matrix(self, sequences, mode='binary')
| Converts a list of sequences into a Numpy matrix.
|
| # Arguments
| sequences: list of sequences
| (a sequence is a list of integer word indices).
| mode: one of "binary", "count", "tfidf", "freq"
|
| # Returns
| A Numpy matrix.
|
| # Raises
| ValueError: In case of invalid `mode` argument,
| or if the Tokenizer requires to be fit to sample data.
|
| sequences_to_texts(self, sequences)
| Transforms each sequence into a list of text.
|
| Only top "num_words" most frequent words will be taken into account.
| Only words known by the tokenizer will be taken into account.
|
| # Arguments
| sequences: A list of sequences (lists of integers).
|
| # Returns
| A list of texts (strings)
|
| sequences_to_texts_generator(self, sequences)
| Transforms each sequence in `sequences` to a list of texts(strings).
|
| Each sequence has to be a list of integers.
| In other words, `sequences` should be a list of sequences.
|
| Only top "num_words" most frequent words will be taken into account.
| Only words known by the tokenizer will be taken into account.
|
| # Arguments
| sequences: A list of sequences.
|
| # Yields
| Yields individual texts.
|
| texts_to_matrix(self, texts, mode='binary')
| Convert a list of texts to a Numpy matrix.
|
| # Arguments
| texts: list of strings.
| mode: one of "binary", "count", "tfidf", "freq".
|
| # Returns
| A Numpy matrix.
|
| texts_to_sequences(self, texts)
| Transforms each text in texts to a sequence of integers.
|
| Only top "num_words" most frequent words will be taken into account.
| Only words known by the tokenizer will be taken into account.
|
| # Arguments
| texts: A list of texts (strings).
|
| # Returns
| A list of sequences.
|
| texts_to_sequences_generator(self, texts)
| Transforms each text in `texts` to a sequence of integers.
|
| Each item in texts can also be a list,
| in which case we assume each item of that list to be a token.
|
| Only top "num_words" most frequent words will be taken into account.
| Only words known by the tokenizer will be taken into account.
|
| # Arguments
| texts: A list of texts (strings).
|
| # Yields
| Yields individual sequences.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)