Notes on "The Best Bridge into the World of NLP"

Author: 臻甄 | Published: 2022-05-20 10:29


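These notes condense an introductory guide to natural language processing written by a data scientist based in Tokyo. The original article is very long, so I shortened it into notes and converted the Traditional Chinese text to Simplified Chinese along the way.

To borrow a line from the author:

I hope this article can be your best bridge into the world of natural language processing.

Definition of NLP: a field focused on getting computers to process and analyze large amounts of human natural-language data. Common tasks include:
(1) Speech recognition
(2) Semantic understanding
(3) Machine translation
(4) Language generation
The quickest way to get a feel for an NLP application is a Kaggle competition: WSDM - Fake News Classification
(1) The goal is to detect fake news automatically and cut the cost of manual review.
(2) The dataset comes from a Chinese mobile app and was provided by ByteDance, the parent company of Toutiao (今日头条).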

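The fake news classification task

First, a look at the dataset: this is a simple classification problem with 3 labels.
(1) The Training Set has roughly 320,000 rows.
(2) The Test Set has roughly 80,000 rows.
Column 1, title1_zh, is the title of news item A, which is known to be fake.
Column 2, title2_zh, is the title of a new news item B, whose truthfulness is unknown.
Column 3 is a machine-generated English translation of A.
Column 4 is a machine-generated English translation of B.
Column 5 is the label:
(1) unrelated: B has nothing to do with A. B could be processed further, but that is outside the scope of this article.
(2) agreed: B agrees with A's claim, so B is also fake news.
(3) disagreed: B disagrees with A's claim, so B is real news.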

FakeNewsData
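Baseline:
(1) Start with a simple method as a reference baseline. Looking at the class proportions in the Training Set shows that this is a typical imbalanced dataset.
(2) Assuming the Test Set has a similar distribution, what is the quickest way to get unrelated right? Just guess unrelated for everything!
(3) In practice the organizers' scoring adjusts the weight given to correct unrelated predictions, but this baseline still scores 0.666 out of a maximum of 1.
(4) Interestingly, many submissions on the leaderboard score below this. Presumably that is because only 2 submissions can be scored per day, and people would rather spend them on evaluating their own models, however weak, than on a trivial guess.
(5) The point of a baseline is to judge how much potential value our trained model really has, and whether it deserves more of our research time and compute.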
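The majority-class baseline from point (2) can be sketched as follows. The label counts here are made-up toy values rather than the real class distribution, and the score is plain accuracy, not the organizers' weighted metric.

```python
from collections import Counter

# Toy labels (hypothetical proportions, not the real dataset)
labels = ['unrelated'] * 7 + ['agreed'] * 2 + ['disagreed'] * 1

# Find the most frequent class in the training labels
counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Predict the majority class for every sample and measure plain accuracy
predictions = [majority_label] * len(labels)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(majority_label, accuracy)
```

Any trained model should beat the number this prints on the same data before it is worth tuning further.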

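Classifying news with machine learning

Step 1: Data preprocessing

Dataset processing: for the machine to handle text, the text has to be converted to numbers.

用大蒜鉴别地沟油的方法,怎么鉴别地沟油
[217, 1268, 32, 1178, 25, 489, 116]

Getting there takes 4 steps:
(1) Text segmentation
(2) Building a dictionary and converting text to sequences of numeric indices
(3) Zero padding the sequences
(4) Converting numeric index values to one-hot encoding
The concrete steps are as follows.
(1) Download the dataset from Kaggle and read the training set with pandas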

import pandas as pd
TRAIN_CSV_PATH = './train.csv'
train = pd.read_csv(TRAIN_CSV_PATH, index_col=0)
train.head(3)

cols = ['title1_zh', 
        'title2_zh', 
        'label']
train = train.loc[:, cols] # keep only these 3 columns
train.head(3)

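(2) Text segmentation splits a run of text into meaningful units, which can be:
a Chinese character / an English letter (Character)
a Chinese word / an English word (Word) # the most common choice
a Chinese sentence / an English sentence (Sentence)
Segmenting English is very easy: usually you just split on spaces.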

text = 'I am rical, an engineer from china.'
words = text.split(' ')
words
#  ['I', 'am', 'rical,', 'an', 'engineer', 'from', 'china.']
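
Note that splitting on spaces leaves punctuation attached to words ('rical,' and 'china.' above). One possible refinement, sketched here with the standard re module, keeps only runs of word characters:

```python
import re

text = 'I am rical, an engineer from china.'

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped
words = re.findall(r'\w+', text)
print(words)
# ['I', 'am', 'rical', 'an', 'engineer', 'from', 'china']
```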

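Chinese has no spaces to rely on, so segmentation is usually done with Jieba, a Chinese word-segmentation tool. It can also report each word's part of speech.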

import jieba.posseg as pseg

text = '我是李，在中国工作的工程师'
words = pseg.cut(text)
[word for word in words]  
# [pair('我', 'r'), pair('是', 'v'), pair('李', 'nr'), pair('，', 'x'), pair('在', 'p'), pair('中国', 'ns'), pair('工作', 'vn'), pair('的', 'uj'), pair('工程师', 'n')]

# To drop punctuation:
words = pseg.cut(text)
[word for word, flag in words if flag != 'x']
# ['我', '是', '李', '在', '中国', '工作', '的', '工程师']

We can use Pandas' apply interface to run jieba_tokenizer over every news title A and B:

def jieba_tokenizer(text):
    words = pseg.cut(text)
    return ' '.join([
        word for word, flag in words if flag != 'x'])

train['title1_tokenized'] = train.loc[:, 'title1_zh'].apply(jieba_tokenizer)
train['title2_tokenized'] = train.loc[:, 'title2_zh'].apply(jieba_tokenizer)

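# Check the segmentation results for titles A and B
train.iloc[:, [0, 3]].head() # A
train.iloc[:, [1, 4]].head() # B

Whatever the segmentation scheme, each resulting piece of text is conventionally called a Token in NLP.

(3) Build a dictionary and convert the text into a sequence of numbers, i.e. numeric indices numbered from 0. For example: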

# First, segment the sentence
import jieba.posseg as pseg
text = '狐狸被陌生人拍照'
words = pseg.cut(text)
words = [w for w, f in words]
words
#['狐狸', '被', '陌生人', '拍照']

# Then build the dictionary
word_index = {
    word: idx  
    for idx, word in enumerate(words)
}
word_index
# {'狐狸': 0, '被': 1, '陌生人': 2, '拍照': 3}

# With the dictionary in hand, the sentence can be turned into numbers
print(words)
# ['狐狸', '被', '陌生人', '拍照']
print([word_index[w] for w in words])
# [0, 1, 2, 3]

# Now a new sentence
text = '陌生人被狐狸拍照'
words = pseg.cut(text)
words = [w for w, f in words]
print(words)
print([word_index[w] for w in words])
# [2, 1, 0, 3]
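
The lookup above only works because the new sentence reuses words already in the dictionary; an unseen word would raise a KeyError. A minimal sketch of one common workaround, assuming we reserve an extra index for unknown words (the UNKNOWN_INDEX name and the hand-written token lists are illustrative, not from the original article):

```python
# Dictionary built from the first sentence, as above
word_index = {'狐狸': 0, '被': 1, '陌生人': 2, '拍照': 3}

# Hypothetical convention: one extra index shared by all unseen words
UNKNOWN_INDEX = len(word_index)

def to_indices(tokens):
    """Map tokens to indices, falling back to UNKNOWN_INDEX for unseen words."""
    return [word_index.get(token, UNKNOWN_INDEX) for token in tokens]

# '熊猫' is not in the dictionary (tokens are written by hand here, not produced by jieba)
print(to_indices(['熊猫', '被', '陌生人', '拍照']))
# [4, 1, 2, 3]
```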

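In the same way we could build one large dictionary covering news A and news B, but doing it by hand is tedious. Keras has a dedicated API for this:

import keras
MAX_NUM_WORDS = 10000 # cap the vocabulary at roughly the 10,000 most frequent words so the dictionary does not grow too large; rarer words are skipped during conversion (no oov_token is set)
tokenizer = keras.preprocessing.text.Tokenizer(num_words=MAX_NUM_WORDS)

# Pool news A and news B together
corpus_x1 = train.title1_tokenized
corpus_x2 = train.title2_tokenized
corpus = pd.concat([corpus_x1, corpus_x2])
corpus.shape
# (641086,) # the training set has about 320,000 rows, so A + B together is twice that

# Look at a few entries of the corpus
pd.DataFrame(corpus.iloc[:5], columns=['title'])

# With the corpus ready, the tokenizer can build the dictionary for us; this takes about 10 seconds
tokenizer.fit_on_texts(corpus)
# Once the dictionary is built, the training set can be converted to numbers
x1_train = tokenizer.texts_to_sequences(corpus_x1)
x2_train = tokenizer.texts_to_sequences(corpus_x2)

# Inspect the result
len(x1_train) # 320543
x1_train[:1]  # [[217, 1268, 32, 1178, 5967, 25, 489, 2877, 116, 5559, 4, 1850, 2, 13]]

# To see which words the numbers stand for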
for seq in x1_train[:1]:
    print([tokenizer.index_word[idx] for idx in seq])
# ['2017', '养老保险', '又', '新增', '两项', '农村', '老人', '人人', '可', '申领', '你', '领到', '了', '吗']
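
To make the Tokenizer step less of a black box, here is a rough pure-Python sketch of the same idea: count word frequencies, give the most frequent words the smallest indices (starting at 1, leaving 0 free for padding), and skip words outside the vocabulary limit. It is an approximation for illustration, not Keras' actual implementation; the helper names and the toy corpus are made up.

```python
from collections import Counter

def build_word_index(corpus):
    """Assign indices 1, 2, 3, ... to words in order of decreasing frequency."""
    counts = Counter(word for line in corpus for word in line.split(' '))
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(corpus, word_index, num_words):
    """Convert each line to indices, keeping only words with an index below num_words."""
    return [
        [word_index[w] for w in line.split(' ')
         if w in word_index and num_words > word_index[w]]
        for line in corpus
    ]

# Toy corpus of already-segmented text (space-separated tokens)
toy_corpus = ['狐狸 被 陌生人 拍照', '狐狸 被 陌生人', '狐狸 被', '狐狸']
word_index = build_word_index(toy_corpus)
sequences = texts_to_sequences(toy_corpus, word_index, num_words=3)

print(word_index)
print(sequences)
```

With num_words=3 only indices 1 and 2 survive, so the two rarest words silently disappear from the sequences.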

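(4) One obvious problem: the rows contain different numbers of words, so the numeric lists have different lengths (the longest reaches 61 words). To make things easier for the model, we usually set a MAX_SEQUENCE_LENGTH so that every list has the same length, padding the short ones with 0. Keras provides the handy pad_sequences function for this.

MAX_SEQUENCE_LENGTH = 20
# Setting this to 61 would keep every word; 20 lets the model decide from 20 words and saves training time. The value can be tuned.
# Note: with the default truncating='pre', longer sequences keep their last 20 tokens; pass truncating='post' to keep the first 20 instead.
x1_train = keras.preprocessing.sequence.pad_sequences(x1_train, maxlen=MAX_SEQUENCE_LENGTH)
x2_train = keras.preprocessing.sequence.pad_sequences(x2_train, maxlen=MAX_SEQUENCE_LENGTH)

x1_train[0] # the first 6 positions were padded with 0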
# array([   0,    0,    0,    0,    0,    0,  217, 1268,   32, 1178, 5967, 25,  489, 2877,  116, 5559,    4, 1850,    2,   13], dtype=int32)
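
For intuition, the padding step can be reproduced in a few lines of plain Python. This sketch mirrors the pad_sequences defaults (padding='pre', truncating='pre'): short sequences get zeros in front, and long sequences keep only their last maxlen items. The function name is made up for illustration.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

print(pad_pre([217, 1268, 32], maxlen=5))
# [0, 0, 217, 1268, 32]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))
# [3, 4, 5, 6, 7]
```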

(5) Time to handle the labels: convert them to one-hot encoding.

# See what the labels look like
train.label[:5]
"""
id
0    unrelated
3    unrelated
1    unrelated
2    unrelated
9       agreed
Name: label, dtype: object
"""

# Labels are relatively simple to handle: map each class to a numeric index
import numpy as np 

# Assign a numeric index to each class
label_to_index = {
    'unrelated': 0, 
    'agreed': 1, 
    'disagreed': 2
}

# Map the class labels to the indices just defined
y_train = train.label.apply(lambda x: label_to_index[x])
y_train = np.asarray(y_train).astype('float32')
y_train[:5]
# array([0., 0., 0., 0., 1.], dtype=float32)

# One-hot encode with Keras
y_train = keras.utils.to_categorical(y_train)
y_train[:5] # each row is one label
"""
array([[1., 0., 0.],  (unrelated)
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.]], dtype=float32)
"""

[1, 0, 0] means the label is unrelated
[0, 1, 0] means the label is agreed
[0, 0, 1] means the label is disagreed
The advantage of this encoding is that the target can be read as a probability distribution: 1 means a probability of 100%, 0 means a probability of 0%.
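
The same one-hot encoding can be written directly in NumPy, which shows what to_categorical produces here: row i of an identity matrix is the one-hot vector for class i, and argmax recovers the class index. A small sketch with a handful of toy labels:

```python
import numpy as np

label_to_index = {'unrelated': 0, 'agreed': 1, 'disagreed': 2}
labels = ['unrelated', 'unrelated', 'agreed', 'disagreed']

# Map labels to integer indices
indices = np.array([label_to_index[label] for label in labels])

# Pick rows of the 3x3 identity matrix: row i is the one-hot vector for class i
one_hot = np.eye(len(label_to_index), dtype='float32')[indices]
print(one_hot)

# argmax along each row recovers the original indices
recovered = one_hot.argmax(axis=1)
print(recovered)
```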

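Step 2: Splitting into training and validation sets

As in any supervised learning setup, the data is divided into:
(1) Training Set
(2) Validation Set: used repeatedly to check how training is going, so there is a risk of overfitting to it
(3) Test Set: the final verdict on the trained model; it must be completely isolated from the training process to guard against overfitting
train_test_split from scikit-learn is a good way to do the split:

from sklearn.model_selection import train_test_split

VALIDATION_RATIO = 0.1  # hold out 10% as the validation set
RANDOM_STATE = 9527  # a little Easter egg

# Parentheses are used for the multi-line assignment because a comment may not follow a backslash continuation
(x1_train, x1_val,  # news title A
 x2_train, x2_val,  # news title B
 y_train, y_val) = train_test_split(  # class labels
    x1_train, x2_train, y_train, 
    test_size=VALIDATION_RATIO, 
    random_state=RANDOM_STATE
)

# Take a look at the data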
print("Training Set")
print("-" * 10)
print(f"x1_train: {x1_train.shape}") #  (288488, 20)
print(f"x2_train: {x2_train.shape}") #  (288488, 20)
print(f"y_train : {y_train.shape}") # (288488, 3)

print("-" * 10)
print(f"x1_val:   {x1_val.shape}") #  (32055, 20)
print(f"x2_val:   {x2_val.shape}") #  (32055, 20)
print(f"y_val :   {y_val.shape}") # (32055, 3)
print("-" * 10)
print("Test Set")
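
Passing all three arrays to one train_test_split call matters because every array must be cut at the same row positions, otherwise titles and labels would no longer line up. The sketch below imitates that behaviour with the standard library only; split_together is a made-up helper, not scikit-learn's implementation, and the data is a toy example.

```python
import math
import random

def split_together(arrays, test_ratio, seed):
    """Shuffle row positions once, then split every array at the same positions."""
    n = len(arrays[0])
    positions = list(range(n))
    random.Random(seed).shuffle(positions)
    n_val = math.ceil(n * test_ratio)
    val_pos, train_pos = positions[:n_val], positions[n_val:]
    result = []
    for array in arrays:
        result.append([array[i] for i in train_pos])
        result.append([array[i] for i in val_pos])
    return result

# Toy data: row i of each list belongs together
titles_a = [f'a{i}' for i in range(10)]
titles_b = [f'b{i}' for i in range(10)]
labels = list(range(10))

a_train, a_val, b_train, b_val, y_train, y_val = split_together(
    [titles_a, titles_b, labels], test_ratio=0.1, seed=9527)

print(len(a_train), len(a_val))
print(a_val, b_val, y_val)
```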

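Step 3: Building the neural network

An RNN (Recurrent Neural Network) is a model commonly used for sequence data; how it works is not covered here.

RNN

# A dead-simple code example: in Keras it takes just two lines
from keras import layers
rnn = layers.SimpleRNN(units=128)  # units (the output dimension) is required; 128 is an illustrative value

LSTM (Long Short-Term Memory) addresses the RNN's tendency to forget earlier inputs.

LSTM

from keras import layers
lstm = layers.LSTM(units=128)  # units is required; 128 is an illustrative value

At each time step the RNN is fed not a single number but an N-dimensional vector; in the figure, N=3.

Where does N come from? It is the embedding dimension: each word is represented as an N-dimensional vector, and these vectors can express the distances and relationships between words.

# In Keras this is a single layer
from keras import layers
MAX_NUM_WORDS = 10000    # dictionary size: 10,000 words
NUM_EMBEDDING_DIM = 256  # word-vector dimension, the N in the figure above; common values are 128, 256 and 1024
embedding_layer = layers.Embedding(MAX_NUM_WORDS, NUM_EMBEDDING_DIM)

Assuming N=3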
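
Conceptually, an embedding layer is a lookup table: a matrix with one row per word index, from which each index in the sequence selects an N-dimensional vector. A NumPy sketch with a tiny vocabulary and N=3 (the random values stand in for weights that the real layer learns during training):

```python
import numpy as np

VOCAB_SIZE = 5      # toy vocabulary size (the article uses 10,000)
EMBEDDING_DIM = 3   # N = 3

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(VOCAB_SIZE, EMBEDDING_DIM))

# A padded sequence of word indices
sequence = np.array([0, 0, 2, 1, 4])

# Indexing the matrix with the sequence looks up one vector per word
embedded = embedding_matrix[sequence]
print(embedded.shape)  # (5, 3)
```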

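We use a Siamese Network, which takes two sequences as input and outputs one classification.
The name comes from the "Siamese twins", a pair of conjoined brothers from Siam (now Thailand) who lived in the United States in the 19th century. Picture two identical twin networks inside the model that share a single set of parameters.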

Siamese Network
The three steps of deep learning

# Basic settings: number of classes
NUM_CLASSES = 3

# Number of words in the vocabulary
MAX_NUM_WORDS = 10000

# Maximum number of words per title
MAX_SEQUENCE_LENGTH = 20

# Dimension of a word vector
NUM_EMBEDDING_DIM = 256

# Dimension of the LSTM output vector
NUM_LSTM_UNITS = 128

# Build the Siamese LSTM architecture (Siamese Network)
from keras import Input
from keras.layers import Embedding, LSTM, concatenate, Dense
from keras.models import Model

# Define the two news titles A and B as the model inputs
# Each title is a numeric sequence of length 20
top_input = Input(shape=(MAX_SEQUENCE_LENGTH, ), dtype='int32')
bm_input  = Input(shape=(MAX_SEQUENCE_LENGTH, ), dtype='int32')

# Embedding layer
# After this layer each title becomes a sequence of word vectors, each of dimension 256
embedding_layer = Embedding(MAX_NUM_WORDS, NUM_EMBEDDING_DIM)
top_embedded = embedding_layer(top_input)
bm_embedded = embedding_layer(bm_input)

# LSTM layer
# After this layer each title becomes a single 128-dimensional vector
shared_lstm = LSTM(NUM_LSTM_UNITS)
top_output = shared_lstm(top_embedded)
bm_output = shared_lstm(bm_embedded)

# Concatenation layer
# Join the two titles' outputs into one vector so it can feed the fully connected layer
merged = concatenate([top_output, bm_output], axis=-1)

# A fully connected layer with softmax activation returns 3 probabilities, one per class
dense =  Dense(units=NUM_CLASSES, activation='softmax')
predictions = dense(merged)

# The model takes the numeric sequences as input and maps them to 3 classes
model = Model(inputs=[top_input, bm_input], outputs=predictions)

# Plot the architecture; this also shows the input/output Tensor shapes of each layer. None means any value is allowed, and here it stands for batch_size
from keras.utils import plot_model
plot_model(
    model, 
    to_file='model.png', 
    show_shapes=True, 
    show_layer_names=False, 
    rankdir='LR')

# A text summary is also available
model.summary()

keras.utils.plot_model

model.summary()
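
The parameter counts reported by model.summary() (they appear in the training log later in this article) can be checked by hand. The sketch below applies the usual formulas for embedding, LSTM and dense layers to the hyperparameters defined above:

```python
MAX_NUM_WORDS = 10000
NUM_EMBEDDING_DIM = 256
NUM_LSTM_UNITS = 128
NUM_CLASSES = 3

# Embedding: one vector of size NUM_EMBEDDING_DIM per word
embedding_params = MAX_NUM_WORDS * NUM_EMBEDDING_DIM

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((NUM_EMBEDDING_DIM + NUM_LSTM_UNITS) * NUM_LSTM_UNITS + NUM_LSTM_UNITS)

# Dense: the input is the concatenation of two LSTM outputs, plus one bias per class
dense_params = (2 * NUM_LSTM_UNITS) * NUM_CLASSES + NUM_CLASSES

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 2560000 197120 771 2757891
```

Because the embedding and LSTM layers are shared between the two inputs, their parameters are counted only once.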

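With the model built, what should the loss function be? Use cross entropy.

cross entropy

model.compile(
    optimizer='rmsprop',  # choice of optimizer
    loss='categorical_crossentropy',  # cross entropy
    metrics=['accuracy'])  # accuracy = correctly predicted samples / total samples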
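
For a one-hot target, categorical cross entropy reduces to the negative log of the probability the model assigns to the correct class. A small sketch with made-up predicted probabilities:

```python
import math

def categorical_cross_entropy(y_true, y_pred):
    """Cross entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = [1.0, 0.0, 0.0]        # the true label is unrelated

confident = [0.9, 0.05, 0.05]   # hypothetical prediction, mostly right
unsure = [0.4, 0.3, 0.3]        # hypothetical prediction, barely right

loss_confident = categorical_cross_entropy(y_true, confident)
loss_unsure = categorical_cross_entropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

The more probability the model puts on the correct class, the smaller the loss, which is what training pushes towards.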

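Start training

# How many title pairs to feed the model per batch
BATCH_SIZE = 512

# How many passes the model makes over the dataset
NUM_EPOCHS = 10

# Train the model
history = model.fit(
    # the inputs are two numeric sequences of length 20
    x=[x1_train, x2_train], 
    y=y_train,
    batch_size=BATCH_SIZE,
    epochs=NUM_EPOCHS,
    # compute loss and accuracy on the validation set after each epoch
    validation_data=(
        [x1_val, x2_val], 
        y_val
    ),
    # reshuffle the training data every epoch for more stable training
    shuffle=True
)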

Step 4: Predicting and submitting the results

The only difference between the test set and the training set is that the test set has no label.

import pandas as pd
TEST_CSV_PATH = './test.csv'
test = pd.read_csv(TEST_CSV_PATH, index_col=0)
test.head(3)

Then we preprocess the test set in the same way:

# For news titles A and B:
# word segmentation
test['title1_tokenized'] = test.loc[:, 'title1_zh'].apply(jieba_tokenizer)
test['title2_tokenized'] = test.loc[:, 'title2_zh'].apply(jieba_tokenizer)

# Convert the word lists to lists of numeric indices
x1_test = tokenizer.texts_to_sequences(test.title1_tokenized)
x2_test = tokenizer.texts_to_sequences(test.title2_tokenized)

# Pad the index lists
x1_test = keras.preprocessing.sequence.pad_sequences(x1_test, maxlen=MAX_SEQUENCE_LENGTH)
x2_test = keras.preprocessing.sequence.pad_sequences(x2_test, maxlen=MAX_SEQUENCE_LENGTH)

# Predict with the trained model
predictions = model.predict([x1_test, x2_test])

Take a look at the output

predictions[:5]

Take the class with the highest probability as the answer, then build the submission and upload it to Kaggle

index_to_label = {v: k for k, v in label_to_index.items()}

test['Category'] = [index_to_label[idx] for idx in np.argmax(predictions, axis=1)]

submission = test.loc[:, ['Category']].reset_index()

submission.columns = ['Id', 'Category']
submission.head()
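
As a quick check, here is the same argmax step applied to the first three prediction rows from the training log later in this article. The resulting labels match the first three rows of the submission shown there.

```python
import numpy as np

label_to_index = {'unrelated': 0, 'agreed': 1, 'disagreed': 2}
index_to_label = {v: k for k, v in label_to_index.items()}

# First three rows of `predictions`, copied from the log
predictions = np.array([
    [9.1157912e-04, 3.0455074e-05, 9.9905795e-01],
    [1.0000000e+00, 8.8517709e-09, 4.8298606e-15],
    [9.8941737e-01, 1.0582623e-02, 2.0958998e-22],
])

# The column with the highest probability is the predicted class
categories = [index_to_label[idx] for idx in np.argmax(predictions, axis=1)]
print(categories)
# ['disagreed', 'unrelated', 'unrelated']
```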

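Putting the code together

I consolidated the code into a single file. Before running it, remember to set the system locale to UTF-8:

export LANG=zh_CN.UTF-8

train.py is a little over two hundred lines; stripped of comments it would probably be just over a hundred.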

# pip install pandas==1.3.5
# pip install jieba==0.42.1
# pip install sklearn==0.0
######## tensorflow and keras must use fixed version
# pip install tensorflow==1.15.5
# pip install keras==2.3.1

########################################################################
####################### Training data processing #######################
########################################################################


TRAIN_CSV_PATH = './data/train.csv'  # change to the path of your training set
TEST_CSV_PATH = './data/test.csv'  # change to the path of your test set

################### Load the data ###################
import pandas as pd
train = pd.read_csv(TRAIN_CSV_PATH, index_col=0, encoding='utf-8', dtype=str)
train = train.astype(str)  # works around jieba's AttributeError: 'float' object has no attribute 'decode' (missing titles come through as float NaN)
# train = train.head(200) # TODO use for debug
print('origin train data \n', train.head(3))

cols = ['title1_zh', 
        'title2_zh', 
        'label']
train = train.loc[:, cols] # keep only these 3 columns
print('get 3 clos of train data \n', train.head(3))


################### Text segmentation ###################
import jieba.posseg as pseg
def jieba_tokenizer(text):
    words = pseg.cut(text)
    return ' '.join([
        word for word, flag in words if flag != 'x'])

train['title1_tokenized'] = train.loc[:, 'title1_zh'].apply(jieba_tokenizer)
train['title2_tokenized'] = train.loc[:, 'title2_zh'].apply(jieba_tokenizer)

# Check the segmentation results for titles A and B
print('word segment of news A \n', train.iloc[:3, [0, 3]].head())
print('word segment of news B \n', train.iloc[:3, [1, 4]].head())


################### Build the dictionary and convert tokens to numbers ###################
import keras
MAX_NUM_WORDS = 10000 # cap the vocabulary at roughly the 10,000 most frequent words; rarer words are skipped during conversion (no oov_token is set)
tokenizer = keras.preprocessing.text.Tokenizer(num_words=MAX_NUM_WORDS)

# Pool news A and news B together
corpus_x1 = train.title1_tokenized
corpus_x2 = train.title2_tokenized
corpus = pd.concat([corpus_x1, corpus_x2])
print('shape of corpus: ', corpus.shape) # (641104,) in the log below: titles A + B, twice the number of training rows
print('part of corpus: \n', pd.DataFrame(corpus.iloc[:5], columns=['title'])) # look at a few entries

# With the corpus ready, the tokenizer can build the dictionary for us; this takes about 10 seconds
tokenizer.fit_on_texts(corpus)
# Once the dictionary is built, the training set can be converted to numbers
x1_train = tokenizer.texts_to_sequences(corpus_x1)
x2_train = tokenizer.texts_to_sequences(corpus_x2)
print('length of news A', len(x1_train)) # 320552 in the log below
print('first line of news A \n', x1_train[:1])  # [[217, 1268, 32, 1178, 5967, 25, 489, 2877, 116, 5559, 4, 1850, 2, 13]]
for seq in x1_train[:1]: # to see which words the numbers stand for
    print([tokenizer.index_word[idx] for idx in seq])
# ['2017', '养老保险', '又', '新增', '两项', '农村', '老人', '人人', '可', '申领', '你', '领到', '了', '吗']


################### padding ###################
MAX_SEQUENCE_LENGTH = 20
# Setting this to 61 would keep every word; 20 saves training time, and the value can be tuned.
# Note: with the default truncating='pre', longer sequences keep their last 20 tokens.
x1_train = keras.preprocessing.sequence.pad_sequences(x1_train, maxlen=MAX_SEQUENCE_LENGTH)
x2_train = keras.preprocessing.sequence.pad_sequences(x2_train, maxlen=MAX_SEQUENCE_LENGTH)

print('after padding', x1_train[0]) # the first 6 positions were padded with 0
# array([   0,    0,    0,    0,    0,    0,  217, 1268,   32, 1178, 5967, 25,  489, 2877,  116, 5559,    4, 1850,    2,   13], dtype=int32)


################### Convert labels to one-hot ###################
# See what the labels look like
print('part of labels: \n', train.label[:5])

# Labels are relatively simple to handle: map each class to a numeric index
import numpy as np 

# Assign a numeric index to each class
label_to_index = {
    'unrelated': 0, 
    'agreed': 1, 
    'disagreed': 2
}

# Map the class labels to the indices just defined
y_train = train.label.apply(lambda x: label_to_index[x])
y_train = np.asarray(y_train).astype('float32')
print('part of labels index: \n', y_train[:5])
# array([0., 0., 0., 0., 1.], dtype=float32)

# One-hot encode with Keras
y_train = keras.utils.np_utils.to_categorical(y_train)
print('part of labels one-hot', y_train[:5]) # each row is one label


################### Split into training and validation sets ###################
from sklearn.model_selection import train_test_split

VALIDATION_RATIO = 0.1  # hold out 10% as the validation set
RANDOM_STATE = 9527  # a little Easter egg

x1_train, x1_val, \
x2_train, x2_val, \
y_train, y_val = \
    train_test_split(
        x1_train, x2_train, y_train, 
        test_size=VALIDATION_RATIO, 
        random_state=RANDOM_STATE
)

# Take a look at the data (the shapes in the comments are those from the log below)
print("="*10)
print("Training Set")
print("-" * 10)
print(f"x1_train: {x1_train.shape}") #  (288496, 20)
print(f"x2_train: {x2_train.shape}") #  (288496, 20)
print(f"y_train : {y_train.shape}") # (288496, 3)

print("-" * 10)
print(f"x1_val:   {x1_val.shape}") #  (32056, 20)
print(f"x2_val:   {x2_val.shape}") #  (32056, 20)
print(f"y_val :   {y_val.shape}") # (32056, 3)
print("-" * 10)
print("Test Set")




########################################################################
########################## Model construction ##########################
########################################################################

# Build the Siamese LSTM architecture (Siamese Network)
from keras import Input
from keras.layers import Embedding, LSTM, concatenate, Dense
from keras.models import Model

# Basic settings: number of classes
NUM_CLASSES = 3

# Number of words in the vocabulary
MAX_NUM_WORDS = 10000

# Maximum number of words per title
MAX_SEQUENCE_LENGTH = 20

# Dimension of a word vector
NUM_EMBEDDING_DIM = 256

# Dimension of the LSTM output vector
NUM_LSTM_UNITS = 128



# Define the two news titles A and B as the model inputs
# Each title is a numeric sequence of length 20
top_input = Input(shape=(MAX_SEQUENCE_LENGTH, ), dtype='int32')
bm_input  = Input(shape=(MAX_SEQUENCE_LENGTH, ), dtype='int32')

# Embedding layer
# After this layer each title becomes a sequence of word vectors, each of dimension 256
embedding_layer = Embedding(MAX_NUM_WORDS, NUM_EMBEDDING_DIM)
top_embedded = embedding_layer(top_input)
bm_embedded = embedding_layer(bm_input)

# LSTM layer
# After this layer each title becomes a single 128-dimensional vector
shared_lstm = LSTM(NUM_LSTM_UNITS)
top_output = shared_lstm(top_embedded)
bm_output = shared_lstm(bm_embedded)

# Concatenation layer
# Join the two titles' outputs into one vector so it can feed the fully connected layer
merged = concatenate([top_output, bm_output], axis=-1)

# A fully connected layer with softmax activation returns 3 probabilities, one per class
dense =  Dense(units=NUM_CLASSES, activation='softmax')
predictions = dense(merged)

# The model takes the numeric sequences as input and maps them to 3 classes
model = Model(inputs=[top_input, bm_input], outputs=predictions)
model.summary()

model.compile(
    optimizer='rmsprop',  # choice of optimizer
    loss='categorical_crossentropy',  # cross entropy
    metrics=['accuracy'])  # accuracy = correctly predicted samples / total samples



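# How many title pairs to feed the model per batch
BATCH_SIZE = 512
BATCH_SIZE = 5  # NOTE: this overrides the 512 above; it looks like a leftover debug value and makes each epoch much slower

# How many passes the model makes over the dataset
NUM_EPOCHS = 10

# Train the model
history = model.fit(
    # the inputs are two numeric sequences of length 20
    x=[x1_train, x2_train], 
    y=y_train,
    batch_size=BATCH_SIZE,
    epochs=NUM_EPOCHS,
    # compute loss and accuracy on the validation set after each epoch
    validation_data=(
        [x1_val, x2_val], 
        y_val
    ),
    # reshuffle the training data every epoch for more stable training
    shuffle=True
)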




########################################################################
######################### Evaluation and output ########################
########################################################################
import pandas as pd
test = pd.read_csv(TEST_CSV_PATH, index_col=0, encoding='utf-8', dtype=str)
test = test.astype(str)
# test = test.head(20) # TODO use for debug
test.head(3)

# For news titles A and B:
# word segmentation
test['title1_tokenized'] = test.loc[:, 'title1_zh'].apply(jieba_tokenizer)
test['title2_tokenized'] = test.loc[:, 'title2_zh'].apply(jieba_tokenizer)

# Convert the word lists to lists of numeric indices
x1_test = tokenizer.texts_to_sequences(test.title1_tokenized)
x2_test = tokenizer.texts_to_sequences(test.title2_tokenized)

# Pad the index lists
x1_test = keras.preprocessing.sequence.pad_sequences(x1_test, maxlen=MAX_SEQUENCE_LENGTH)
x2_test = keras.preprocessing.sequence.pad_sequences(x2_test, maxlen=MAX_SEQUENCE_LENGTH)


# Predict with the trained model
predictions = model.predict([x1_test, x2_test])
print('predictions: \n', predictions)

index_to_label = {v: k for k, v in label_to_index.items()}

test['Category'] = [index_to_label[idx] for idx in np.argmax(predictions, axis=1)]
submission = test.loc[:, ['Category']].reset_index()
submission.columns = ['Id', 'Category']
print('submission data: \n', submission.head())

# Save locally
submission.to_csv('./submission.csv', columns=['Id', 'Category'], index=False)

Running python train.py produces the following log

"""
origin train data
    tid1 tid2                          title1_zh                  title2_zh                                          title1_en                                          title2_en      label
id
0     0    1      2017养老保险又新增两项，农村老人人人可申领，你领到了吗   警方辟谣“鸟巢大会每人领5万” 仍有老人坚持进京  There are two new old-age insurance benefits f...  Police disprove "bird's nest congress each per...  unrelated
3     2    3  "你不来深圳，早晚你儿子也要来"，不出10年深圳人均GDP将超香港  深圳GDP首超香港？深圳统计局辟谣：只是差距在缩小  "If you do not come to Shenzhen, sooner or lat...  Shenzhen's GDP outstrips Hong Kong? Shenzhen S...  unrelated
1     2    4  "你不来深圳，早晚你儿子也要来"，不出10年深圳人均GDP将超香港       GDP首超香港？深圳澄清：还差一点点……  "If you do not come to Shenzhen, sooner or lat...  The GDP overtopped Hong Kong? Shenzhen clarifi...  unrelated
get 3 clos of train data
                             title1_zh                  title2_zh      label
id
0       2017养老保险又新增两项，农村老人人人可申领，你领到了吗   警方辟谣“鸟巢大会每人领5万” 仍有老人坚持进京  unrelated
3   "你不来深圳，早晚你儿子也要来"，不出10年深圳人均GDP将超香港  深圳GDP首超香港？深圳统计局辟谣：只是差距在缩小  unrelated
1   "你不来深圳，早晚你儿子也要来"，不出10年深圳人均GDP将超香港       GDP首超香港？深圳澄清：还差一点点……  unrelated
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.966 seconds.
Prefix dict has been built successfully.
word segment of news A
                             title1_zh                                 title1_tokenized
id
0       2017养老保险又新增两项，农村老人人人可申领，你领到了吗         2017 养老保险 又 新增 两项 农村 老人 人人 可 申领 你 领到 了 吗
3   "你不来深圳，早晚你儿子也要来"，不出10年深圳人均GDP将超香港  你 不 来 深圳 早晚 你 儿子 也 要 来 不出 10 年 深圳 人均 GDP 将 超 香港
1   "你不来深圳，早晚你儿子也要来"，不出10年深圳人均GDP将超香港  你 不 来 深圳 早晚 你 儿子 也 要 来 不出 10 年 深圳 人均 GDP 将 超 香港
word segment of news B
                     title2_zh                    title2_tokenized
id
0    警方辟谣“鸟巢大会每人领5万” 仍有老人坚持进京   警方 辟谣 鸟巢 大会 每人 领 5 万 仍 有 老人 坚持 进京
3   深圳GDP首超香港？深圳统计局辟谣：只是差距在缩小  深圳 GDP 首 超 香港 深圳 统计局 辟谣 只是 差距 在 缩小
1        GDP首超香港？深圳澄清：还差一点点……            GDP 首 超 香港 深圳 澄清 还 差 一点点
Using TensorFlow backend.
shape of corpus:  (641104,)
part of corpus:
                                               title
id
0          2017 养老保险 又 新增 两项 农村 老人 人人 可 申领 你 领到 了 吗
3   你 不 来 深圳 早晚 你 儿子 也 要 来 不出 10 年 深圳 人均 GDP 将 超 香港
1   你 不 来 深圳 早晚 你 儿子 也 要 来 不出 10 年 深圳 人均 GDP 将 超 香港
2   你 不 来 深圳 早晚 你 儿子 也 要 来 不出 10 年 深圳 人均 GDP 将 超 香港
9                        用 大蒜 鉴别 地沟油 的 方法 怎么 鉴别 地沟油
length of news A 320552
first line of news A
 [[217, 1268, 32, 1178, 5967, 25, 489, 2877, 116, 5559, 4, 1850, 2, 13]]
['2017', '养老保险', '又', '新增', '两项', '农村', '老人', '人人', '可', '申领', '你', '领到', '了', '吗']
after padding [   0    0    0    0    0    0  217 1268   32 1178 5967   25  489 2877
  116 5559    4 1850    2   13]
part of labels:
 id
0    unrelated
3    unrelated
1    unrelated
2    unrelated
9       agreed
Name: label, dtype: object
part of labels index:
 [0. 0. 0. 0. 1.]
part of labels one-hot [[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]
==========
Training Set
----------
x1_train: (288496, 20)
x2_train: (288496, 20)
y_train : (288496, 3)
----------
x1_val:   (32056, 20)
x2_val:   (32056, 20)
y_val :   (32056, 3)
----------
Test Set
WARNING:tensorflow:From /ssd2/likejiao/tools/miniconda3/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 20)           0
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 20)           0
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 20, 256)      2560000     input_1[0][0]
                                                                 input_2[0][0]
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 128)          197120      embedding_1[0][0]
                                                                 embedding_1[1][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 256)          0           lstm_1[0][0]
                                                                 lstm_1[1][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 3)            771         concatenate_1[0][0]
==================================================================================================
Total params: 2,757,891
Trainable params: 2,757,891
Non-trainable params: 0
__________________________________________________________________________________________________
2022-05-19 23:24:11.312828: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2022-05-19 23:24:11.386936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:45:00.0
2022-05-19 23:24:11.390223: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2022-05-19 23:24:11.422222: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2022-05-19 23:24:11.436477: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2022-05-19 23:24:11.445699: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2022-05-19 23:24:11.483745: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2022-05-19 23:24:11.516001: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2022-05-19 23:24:11.521368: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-05-19 23:24:11.525591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2022-05-19 23:24:11.526504: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2022-05-19 23:24:11.538162: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1999565000 Hz
2022-05-19 23:24:11.541688: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f483a63fac0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-05-19 23:24:11.541763: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2022-05-19 23:24:11.705051: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f4839f819d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-05-19 23:24:11.705101: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P40, Compute Capability 6.1
2022-05-19 23:24:11.707231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:45:00.0
2022-05-19 23:24:11.707312: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2022-05-19 23:24:11.707332: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2022-05-19 23:24:11.707349: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2022-05-19 23:24:11.707366: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2022-05-19 23:24:11.707382: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2022-05-19 23:24:11.707398: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2022-05-19 23:24:11.707415: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-05-19 23:24:11.710838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2022-05-19 23:24:11.710890: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2022-05-19 23:24:11.714143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-05-19 23:24:11.714182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0
2022-05-19 23:24:11.714203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N
2022-05-19 23:24:11.718236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21625 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:45:00.0, compute capability: 6.1)
WARNING:tensorflow:From /ssd2/likejiao/tools/miniconda3/envs/tf/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

Train on 288496 samples, validate on 32056 samples
Epoch 1/10
2022-05-19 23:24:14.226767: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
288496/288496 [==============================] - 1947s 7ms/step - loss: 0.5265 - accuracy: 0.7560 - val_loss: 0.5354 - val_accuracy: 0.7676
Epoch 2/10
288496/288496 [==============================] - 2108s 7ms/step - loss: 0.5510 - accuracy: 0.7758 - val_loss: 0.5808 - val_accuracy: 0.7813
Epoch 3/10
288496/288496 [==============================] - 2030s 7ms/step - loss: 0.5796 - accuracy: 0.7791 - val_loss: 0.5969 - val_accuracy: 0.7777
Epoch 4/10
288496/288496 [==============================] - 1991s 7ms/step - loss: 0.6181 - accuracy: 0.7839 - val_loss: 0.6919 - val_accuracy: 0.7831
Epoch 5/10
288496/288496 [==============================] - 1969s 7ms/step - loss: 0.6393 - accuracy: 0.7902 - val_loss: 0.6963 - val_accuracy: 0.7805
Epoch 6/10
288496/288496 [==============================] - 2017s 7ms/step - loss: 0.6653 - accuracy: 0.7935 - val_loss: 0.7428 - val_accuracy: 0.7535
Epoch 7/10
288496/288496 [==============================] - 2060s 7ms/step - loss: 0.6662 - accuracy: 0.8020 - val_loss: 0.7212 - val_accuracy: 0.7927
Epoch 8/10
288496/288496 [==============================] - 1904s 7ms/step - loss: 0.6745 - accuracy: 0.8041 - val_loss: 0.7524 - val_accuracy: 0.7926
Epoch 9/10
288496/288496 [==============================] - 1959s 7ms/step - loss: 0.6725 - accuracy: 0.8083 - val_loss: 0.7527 - val_accuracy: 0.7896
Epoch 10/10
288496/288496 [==============================] - 1969s 7ms/step - loss: 0.6857 - accuracy: 0.8087 - val_loss: 0.7451 - val_accuracy: 0.7871
predictions:
 [[9.1157912e-04 3.0455074e-05 9.9905795e-01]
 [1.0000000e+00 8.8517709e-09 4.8298606e-15]
 [9.8941737e-01 1.0582623e-02 2.0958998e-22]
 ...
 [8.6871430e-02 9.1312855e-01 1.2175124e-09]
 [4.9998474e-01 5.0001526e-01 1.7392097e-08]
 [4.5924637e-01 5.4075301e-01 5.6084860e-07]]
submission data:
        Id   Category
0  321187  disagreed
1  321190  unrelated
2  321189  unrelated
3  321193  unrelated
4  321191  unrelated
"""

I uploaded the results; the model still performs rather poorly... to be debugged.
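
One clue for debugging is already in the log: validation loss is lowest after the first epoch and rises afterwards, so the weights from the last epoch are not the best ones by that measure. A small sketch that picks out the best epochs from the validation metrics copied from the log above:

```python
# Validation metrics per epoch, copied from the log above
val_loss = [0.5354, 0.5808, 0.5969, 0.6919, 0.6963,
            0.7428, 0.7212, 0.7524, 0.7527, 0.7451]
val_accuracy = [0.7676, 0.7813, 0.7777, 0.7831, 0.7805,
                0.7535, 0.7927, 0.7926, 0.7896, 0.7871]

# Epochs are numbered from 1 in the log
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_acc_epoch = max(range(len(val_accuracy)), key=val_accuracy.__getitem__) + 1

print(best_loss_epoch, best_acc_epoch)
# 1 7
```

Keeping the weights from the best epoch rather than the last one, for example with Keras' EarlyStopping or ModelCheckpoint callbacks, would be one thing to try.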

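Finally, the author recommends 3 courses

The deep learning course by Professor Hung-yi Lee (李宏毅) of National Taiwan University's Department of Electrical Engineering
- Builds the theoretical foundation
The Deep Learning Specialization on Coursera
- 70% theory + 30% practice
Deep Learning with Python
- Focuses on hands-on programming
- All the Jupyter Notebooks for this course are in the GitHub repo deep-learning-with-python-notebooks, which has over 5,000 stars.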

Original guide: https://leemeng.tw/shortest-p...