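- BERT model compression methods fall roughly into the following categories (reference: http://mitchgordon.me/machine/learning/2019/11/18/all-the-ways-to-compress-BERT.html):
(1) Pruning;
(2) Weight factorization: the basic idea is to approximate an original large weight matrix by the product of two or more low-rank matrices. As a model compression technique, it is mainly applied to fully connected and convolutional layers;
(3) Knowledge distillation: the basic idea is to transfer knowledge from a large, pre-trained teacher model to a (usually smaller) student model. The student is typically trained on the teacher's outputs together with the classification labels. Examples: DistilBERT, TinyBERT, MobileBERT;
(4) Weight sharing, e.g. ALBERT;
(5) Quantization: compress the model by reducing the precision used to represent each weight value. For example, a model trained with float32 parameters can be quantized so that its parameters are stored as float16 or even int8. Example: Q-BERT.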
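As a rough illustration of items (2), (3) and (5) above, here is a minimal pure-Python sketch. All function names, and the defaults `temperature=2.0` and `alpha=0.5`, are assumptions made for this example and are not taken from the referenced papers. Real implementations work on tensors and differ in detail (for instance, a KL-divergence form of the soft-target loss, or per-channel quantization scales).

```python
import math


def factorized_param_count(m, n, rank):
    """Parameters needed to store an m x n matrix as (m x rank) @ (rank x n)."""
    return rank * (m + n)


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term (teacher outputs) and a hard-label term."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Cross-entropy between the teacher's softened outputs and the student's.
    soft_term = -sum(t * math.log(s) for t, s in zip(teacher_probs, student_soft))
    # Ordinary cross-entropy against the ground-truth class label.
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * soft_term + (1.0 - alpha) * hard_term


def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate float values."""
    return [q * scale for q in quantized]


# A 768 x 3072 weight matrix stored densely vs. factorized at rank 64.
print(768 * 3072, factorized_param_count(768, 3072, 64))  # 2359296 245760
```

The rank, temperature and mixing weight are the knobs that trade accuracy for size. The right values depend on the task and have to be found empirically.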
Reference papers
- [Survey of pre-trained models] Pre-trained Models for Natural Language Processing: A Survey
https://arxiv.org/pdf/2003.08271.pdf (by Qiu Xipeng; video walkthrough: https://www.bilibili.com/video/BV16K4y1475Z/)
- [Survey of BERT model compression] Compressing Large-Scale Transformer-Based Models: A Case Study on BERT
https://arxiv.org/abs/2002.11985
- [ALBERT] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
https://arxiv.org/abs/1909.11942
- [DistilBERT] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
https://arxiv.org/abs/1910.01108
- [TinyBERT] TinyBERT: Distilling BERT for Natural Language Understanding
https://arxiv.org/abs/1909.10351
- [MobileBERT] MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
https://arxiv.org/abs/2004.02984
- [Q-BERT] Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
https://arxiv.org/abs/1909.05840
Reference materials
- [Hung-yi Lee, Deep Learning for Human Language Processing (2020)] http://speech.ee.ntu.edu.tw/~tlkagk/courses_DLHLP20.html
(1) Bilibili video collection: https://www.bilibili.com/video/BV1H34y1S7y7?p=20
(2) GPT-3 model: https://www.bilibili.com/video/BV1H34y1S7y7?p=33
(3) "BERT and its family" collection: https://www.bilibili.com/video/BV1H34y1S7y7?p=20&spm_id_from=333.788.b_6d756c74695f70616765.20
- Pre-trained model code collection: https://github.com/huggingface/transformers
- Summary of BERT model compression methods: http://mitchgordon.me/machine/learning/2019/11/18/all-the-ways-to-compress-BERT.html
- A brief introduction to BERT model compression: https://zhuanlan.zhihu.com/p/110934513