In natural language processing, pretrained language models have become an essential foundational technology. This repository collects high-quality Chinese pretrained models that are publicly available online (many thanks to everyone who shares these resources) and will be updated continuously…
NLU Series
BERT
- 2018 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Jacob Devlin, et al. | arXiv | PDF
- 2019 | Pre-Training with Whole Word Masking for Chinese BERT | Yiming Cui, et al. | arXiv | PDF
Notes:
wwm stands for **Whole Word Masking**: if any WordPiece subword of a word is masked, the remaining subwords belonging to that word are masked as well (see the sketch below)
ext indicates training on additional data
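To make the masking rule concrete, here is a minimal sketch (not from the original repository) of whole-word masking for Chinese text, assuming the sentence has already been split into words by a segmenter and that BERT tokenizes Chinese into single characters:

```python
import random

def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]"):
    """Illustrative whole-word masking: `words` is a pre-segmented sentence.
    When a word is selected, every one of its characters is masked together,
    never just an isolated character (hypothetical helper, for illustration only)."""
    masked = []
    for word in words:
        chars = list(word)  # Chinese BERT tokenizes into characters
        if random.random() < mask_prob:
            masked.extend([mask_token] * len(chars))  # mask the whole word
        else:
            masked.extend(chars)
    return masked

# e.g. whole_word_mask(["使用", "语言", "模型"]) may return
# ["[MASK]", "[MASK]", "语", "言", "模", "型"]
```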
ChineseBERT
- 2021 | ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information | Zijun Sun, et al. | arXiv | PDF
RoBERTa
- 2019 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | Yinhan Liu, et al. | arXiv | PDF
ALBERT
- 2019 | ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations | Zhenzhong Lan, et al. | arXiv | PDF
NEZHA
- 2019 | NEZHA: Neural Contextualized Representation for Chinese Language Understanding | Junqiu Wei, et al. | arXiv | PDF
MacBERT
- 2020 | Revisiting Pre-Trained Models for Chinese Natural Language Processing | Yiming Cui, et al. | arXiv | PDF
WoBERT
- 2020 | 提速不掉点:基于词颗粒度的中文WoBERT | 苏剑林. | spaces | Blog post
XLNET
- 2019 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | Zhilin Yang, et al. | arXiv | PDF
ELECTRA
- 2020 | ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators | Kevin Clark, et al. | arXiv | PDF
ZEN
- 2019 | ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations | Shizhe Diao, et al. | arXiv | PDF
ERNIE
- 2019 | ERNIE: Enhanced Representation through Knowledge Integration | Yu Sun, et al. | arXiv | PDF
- 2020 | SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis | Hao Tian, et al. | arXiv | PDF
- 2020 | ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding | Dongling Xiao, et al. | arXiv | PDF
Notes:
For converting PaddlePaddle to TensorFlow, see: tensorflow_ernie
For converting PaddlePaddle to PyTorch, see: ERNIE-Pytorch
ERNIE3
- 2021 | ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation | Yu Sun, et al. | arXiv | PDF
- 2021 | ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation | Shuohuan Wang, et al. | arXiv | PDF
| Model | Version | PaddlePaddle | PyTorch | Author | Source | Domain |
| --- | --- | --- | --- | --- | --- | --- |
| ernie-3.0-base | 12-layer, 768-hidden, 12-heads | link | huggingface | PaddlePaddle | github | General |
| ernie-3.0-medium | 6-layer, 768-hidden, 12-heads | link | huggingface | PaddlePaddle | github | General |
| ernie-3.0-mini | 6-layer, 384-hidden, 12-heads | link | huggingface | PaddlePaddle | github | General |
| ernie-3.0-micro | 4-layer, 384-hidden, 12-heads | link | huggingface | PaddlePaddle | github | General |
| ernie-3.0-nano | 4-layer, 312-hidden, 12-heads | link | huggingface | PaddlePaddle | github | General |
For converting PaddlePaddle to PyTorch, see: ERNIE-Pytorch (a loading sketch follows below)
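As a usage hint (not part of the original notes), a checkpoint converted with ERNIE-Pytorch or published on the Hugging Face hub can typically be loaded with the `transformers` Auto classes; the model id below is an assumed example and should be replaced with the checkpoint you actually use:

```python
# Assumed example: loading a PyTorch-converted ERNIE 3.0 checkpoint.
# The model id "nghuyong/ernie-3.0-base-zh" is illustrative and may differ
# from the checkpoint produced by your own conversion.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-3.0-base-zh")
model = AutoModel.from_pretrained("nghuyong/ernie-3.0-base-zh")

inputs = tokenizer("预训练语言模型", return_tensors="pt")
outputs = model(**inputs)                 # last_hidden_state: [1, seq_len, hidden]
print(outputs.last_hidden_state.shape)
```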
RoFormer
- 2021 | RoFormer: Enhanced Transformer with Rotary Position Embedding | Jianlin Su, et al. | arXiv | PDF
- 2021 | Transformer升级之路:2、博采众长的旋转式位置编码 | 苏剑林. | spaces | Blog post
StructBERT
- 2019 | StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding | Wei Wang, et al. | arXiv | PDF
| Model | Version | TensorFlow | PyTorch | Author | Source | Domain |
| --- | --- | --- | --- | --- | --- | --- |
| StructBERT | large(L24) |  | Alibaba Cloud | Alibaba | github | General |
Lattice-BERT
- 2021 | Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models | Yuxuan Lai, et al. | arXiv | PDF
Mengzi-BERT
- 2021 | Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese | Zhuosheng Zhang, et al. | arXiv | PDF
Bloom
- 2022 | Bloom: BigScience Large Open-science Open-access Multilingual Language Model | huggingface bigscience | - | BLOG
| Model | Version | TensorFlow | PyTorch | Author | Source | Domain |
| --- | --- | --- | --- | --- | --- | --- |
| bloom-6b4-zh | 6B(L30) |  | huggingface | Langboat (the author also provides several other Chinese models, from bloom-389m-zh to bloom-2b5-zh) | github | General |
TaCL
- 2021 | TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning | Yixuan Su, et al. | arXiv | PDF
MC-BERT
- 2021 | MC-BERT: Conceptualized Representation Learning for Chinese Biomedical Text Mining | alibaba-research | arXiv | PDF
二郎神
PERT
- 2022 | PERT: Pre-Training BERT with Permuted Language Model | Yiming Cui, et al. | arXiv | PDF
MobileBERT
- 2020 | MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices | Zhiqing Sun, et al. | arXiv | PDF
GAU-α
- 2022 | GAU-α: (FLASH) Transformer Quality in Linear Time | Weizhe Hua, et al. | arXiv | PDF | blog
DeBERTa
- 2020 | DeBERTa: Decoding-enhanced BERT with Disentangled Attention | Pengcheng He, et al. | arXiv | PDF
GlyphBERT
- 2021 | GlyphCRM: Bidirectional Encoder Representation for Chinese Character with its Glyph | Yuxin Li, et al. | arXiv | PDF
CKBERT
- 2022 | Revisiting and Advancing Chinese Natural Language Understanding with Accelerated Heterogeneous Knowledge Pre-training | Taolin Zhang, et al. | arXiv | PDF
LERT
- 2022 | LERT: A Linguistically-motivated Pre-trained Language Model | Yiming Cui, et al. | arXiv | PDF
RoCBert
- 2022 | RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining | Hui Su, et al. | ACL | PDF
NLG Series
GPT
- 2018 | Improving Language Understanding by Generative Pre-Training | Alec Radford, et al. | arXiv | PDF
- 2019 | Language Models are Unsupervised Multitask Learners | Alec Radford, et al. | arXiv | PDF
GPT-3
- 2019 | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Zihang Dai, et al. | arXiv | PDF
- 2020 | Language Models are Few-Shot Learners | Tom B. Brown, et al. | arXiv | PDF
NEZHA-Gen
- 2019 | NEZHA: Neural Contextualized Representation for Chinese Language Understanding | Junqiu Wei, et al. | arXiv | PDF
- 2018 | Improving Language Understanding by Generative Pre-Training | Alec Radford, et al. | arXiv | PDF
CPM-Generate
- 2020 | CPM: A Large-scale Generative Chinese Pre-trained Language Model | Zhengyan Zhang, et al. | arXiv | PDF
Notes:
For converting PyTorch to TensorFlow, see: CPM-LM-TF2
For converting PyTorch to PaddlePaddle, see: CPM-Generate-Paddle
T5
- 2019 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | Colin Raffel, et al. | arXiv | PDF
T5-PEGASUS
- 2019 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | Colin Raffel, et al. | arXiv | PDF
- 2019 | PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization | Jingqing Zhang, et al. | arXiv | PDF
- 2021 | T5 PEGASUS:开源一个中文生成式预训练模型 | 苏剑林. | spaces | Blog post
For converting Keras to PyTorch, see: t5-pegasus-pytorch
Mengzi-T5
- 2021 | Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese | Zhuosheng Zhang, et al. | arXiv | PDF
PanGu-Alpha
- 2021 | PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation | Wei Zeng, et al. | arXiv | PDF
EVA
- 2021 | EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training | Hao Zhou, et al. | arXiv | PDF
BART
- 2019 | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | Mike Lewis, et al. | arXiv | PDF
闻仲
余元
RWKV
- 2021 | An Attention Free Transformer | Shuangfei Zhai, et al. | arXiv | PDF
- 2022 | The RWKV Language Model | github
PromptCLUE
ChatYuan
SkyText
ProphetNet
- 2020 | ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training | Weizhen Qi, et al. | arXiv | PDF
- 2021 | ProphetNet-X: Large-Scale Pre-training Models for English, Chinese, Multi-lingual, Dialog, and Code Generation | Weizhen Qi, et al. | arXiv | PDF
NLU-NLG Series
UniLM
- 2019 | Unified Language Model Pre-training for Natural Language Understanding and Generation | Li Dong, et al. | arXiv | PDF
Simbert
- 2020 | 鱼与熊掌兼得:融合检索和生成的SimBERT模型 | 苏剑林. | spaces | Blog post
- 2021 | SimBERTv2来了!融合检索和生成的RoFormer-Sim模型 | 苏剑林. | spaces | Blog post
周文王
CPM-2
- 2021 | CPM-2: Large-scale Cost-effective Pre-trained Language Models | Zhengyan Zhang, et al. | arXiv | PDF
CPT
- 2021 | CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation | Yunfan Shao, et al. | arXiv | PDF
GLM
- 2022 | GLM: General Language Model Pretraining with Autoregressive Blank Infilling | Zhengxiao Du, et al. | arXiv | PDF
- 2022 | GLM-130B: An Open Bilingual Pre-trained Model | Aohan Zeng, et al. | arXiv | PDF
PLUG
- 2019 | StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding | Wei Wang, et al. | arXiv | PDF
- 2020 | PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation | Bin Bi, et al. | ACL | PDF
OPD
- 2022 | TBD | , et al. | arXiv | PDF
Multi-Modal
WenLan
- 2021 | WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training | Yuqi Huo, et al. | arXiv | PDF
CogView
- 2021 | CogView: Mastering Text-to-Image Generation via Transformers | Ming Ding, et al. | arXiv | PDF
紫东太初
Mengzi-oscar
- 2021 | Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese | Zhuosheng Zhang, et al. | arXiv | PDF
R2D2
- 2022 | Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework | Chunyu Xie, et al. | arXiv | PDF
Chinese-CLIP
- 2021 | Learning Transferable Visual Models From Natural Language Supervision | Alec Radford, et al. | arXiv | PDF
- 2022 | Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese | An Yang, et al. | arXiv | PDF
TaiYi-CLIP
- 2021 | Learning Transferable Visual Models From Natural Language Supervision | Alec Radford, et al. | arXiv | PDF
- 2022 | Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence | Junjie Wang, et al. | arXiv | PDF
AltCLIP
- 2022 | AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | Zhongzhi Chen, et al. | arXiv | PDF
AltDiffusion
- 2022 | AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | Zhongzhi Chen, et al. | arXiv | PDF
- 2022 | High-Resolution Image Synthesis With Latent Diffusion Models | Rombach, et al. | arXiv | PDF
Taiyi-Stable-Diffusion
- 2022 | Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence | Junjie Wang, et al. | arXiv | PDF
- 2022 | High-Resolution Image Synthesis With Latent Diffusion Models | Rombach, et al. | arXiv | PDF
wukong
- 2022 | Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | Jiaxi Gu, et al. | arXiv | PDF
Table
SDCUP
- 2021 | Improving Text-to-SQL with Schema Dependency Learning | Binyuan Hui, et al. | arXiv | PDF
LLM
Large language models; the table only lists models with more than 10B parameters.
ChatLLM
Large language models with capabilities such as question answering and dialogue.
Open-source Model Hubs
- 🤗huggingface: The AI community building the future.
- ModelScope: a model-centric open-source model community
- flagopen: FlagOpen (飞智), an open-source technology system for large models
Open-source Dataset Hubs
- huggingface datasets hub: https://huggingface.co/datasets
  - Covers natural language processing, computer vision, speech, and multimodal datasets, with built-in downloads for more than 100 multilingual public datasets (see the loading sketch after this list)
- ModelScope dataset hub: https://modelscope.cn/datasets
  - Provides datasets covering natural language processing, computer vision, speech, and multimodal tasks, plus domain-specific datasets contributed by Alibaba Group
- flagopen dataset hub: https://data.baai.ac.cn/data
- cluebenchmarks dataset hub: https://www.cluebenchmarks.com/dataSet_search.html
- MNBVC: Massive Never-ending BT Vast Chinese corpus
- OpenDataLab dataset hub: https://opendatalab.com/
  - OpenDataLab is an influential open data platform that makes public datasets readily accessible.
- OSCAR: Open Super-large Crawled Aggregated coRpus, a multilingual dataset
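As a minimal illustration of pulling a public Chinese dataset from one of these hubs (not part of the original list), the sketch below uses the Hugging Face `datasets` library; the dataset id `clue` and config `tnews` are assumed examples:

```python
# Assumed example: load a public Chinese dataset from the Hugging Face hub.
# The dataset id "clue" and config "tnews" are illustrative and can be swapped
# for any dataset found at https://huggingface.co/datasets.
from datasets import load_dataset

dataset = load_dataset("clue", "tnews")   # CLUE short-text classification subset
print(dataset)                            # DatasetDict with train/validation/test splits
print(dataset["train"][0])                # one example, e.g. {"sentence": ..., "label": ...}
```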
Updates
- 2023.03.20 Added BELLE, an open-source Chinese dialogue large model with 7 billion parameters, based on Stanford Alpaca and optimized for Chinese; the model is fine-tuned only on data produced by ChatGPT.
- 2023.03.14 Added the ChatLLM list, which collects large language models with question answering and dialogue capabilities, and added the ChatGLM model.
- 2023.03.11 Added ProphetNet, which proposes a new self-supervised objective of predicting multiple future tokens at once and achieves strong performance on several sequence-to-sequence natural language generation tasks.
- 2023.03.10 Added RoCBert, which uses adversarial learning to generate additional noisy data for training Chinese BERT, yielding a more robust Chinese BERT model.
- 2023.03.03 Updated the LLM list with the new multilingual models Flan-ul2 and Flan-t5-xxl.
- 2023.02.21 Added the LLM list of large language models; only models with more than 10B parameters are listed, and models of other sizes can be found at the corresponding project pages.
- 2023.01.14 Added SkyText, a Chinese GPT-3 pretrained large model released by 奇点智源 that can handle tasks such as chat, question answering, and Chinese-English translation.
- 2023.01.14 Added ChatYuan, which can be used for question answering, context-aware dialogue, and a variety of generation tasks including creative writing, and can also answer questions in domains such as law and COVID-19.
- 2022.12.10 Added PromptCLUE, a zero-shot learning model for all-Chinese tasks, pretrained on a 100-billion-token Chinese corpus and further trained with prompt-style tasks on hundreds of tasks.
- 2022.12.01 Added wukong, a multimodal model pretrained on a large Chinese cross-modal dataset named 悟空 (Wukong) containing 100 million image-text pairs collected from the web.
- 2022.11.30 Added AltDiffusion, a bilingual Chinese-English diffusion model trained on top of Stable Diffusion with AltCLIP as the text encoder.
- 2022.11.30 Added AltCLIP, a simple and efficient method for training a stronger bilingual CLIP model; AltCLIP is trained on top of OpenAI CLIP.
- 2022.11.30 Added Taiyi-Stable-Diffusion, the first open-source bilingual Chinese-English Stable Diffusion model, trained on 20 million filtered Chinese image-text pairs.
- 2022.11.9 Added OPD, a large-scale, high-performance Chinese open-domain dialogue pretrained model with 6.3 billion parameters, trained on 70 GB of high-quality dialogue data.
- 2022.11.8 Updated Chinese-CLIP, a Chinese multimodal image-text representation model; the update expands Chinese-CLIP to 5 model sizes, adds a technical report and a retrieval demo, and integrates the models into the DAMO Academy ModelScope platform.
- 2022.10.31 Added LERT. To verify whether explicitly injecting linguistic knowledge into pretraining brings further gains, HFL proposed LERT, a linguistically enhanced pretrained model that fuses several kinds of linguistic knowledge; extensive experiments show that, at the same training data scale, LERT delivers significant performance improvements.
- 2022.10.14 Added CKBERT, a Chinese knowledge-base-enhanced BERT pretrained language model.
- 2022.10.01 Added GlyphBERT, a Chinese pretrained model that incorporates the glyph features of Chinese characters: input characters are rendered as images arranged into multi-channel positional feature maps, and a two-layer residual convolutional module extracts the characters' image features during training.
- 2022.09.30 Added DeBERTa, a Chinese DeBERTa-v2 pretrained on the WuDao corpus (180G version) using the Fengshen framework during pretraining.
- 2022.09.30 Added TaiYi-CLIP, the first open-source Chinese CLIP model, whose RoBERTa-large text encoder is pretrained on 123 million image-text pairs.
- 2022.09.27 Added PLUG, which combines language understanding and generation in a single model and supports downstream tasks such as text generation, question answering, and semantic understanding; open-sourcing PLUG will help developers go further in both language understanding and generation.
- 2022.09.11 Added bloom-6b4, a Chinese-vocabulary extraction of the 7b1-parameter multilingual BLOOM generative model (https://huggingface.co/bigscience/bloom-7b1); the BLOOM series also includes a 176B model at its largest (https://huggingface.co/bigscience/bloom).
- 2022.09.11 Added GLM-130B, an open-source bilingual pretrained generative model built on GLM (General Language Model).
- 2022.09.11 Added PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation, PyTorch versions of the 2.6B and 13B generative models.
- 2022.06.29 Added ERNIE 3.0, large-scale knowledge-enhanced pretraining for language understanding and generation.
- 2022.06.22 Added Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework; the vision-language pretraining framework R2D2 is trained on the large-scale Chinese cross-modal benchmark Zero for large-scale cross-modal learning.
- 2022.06.15 Added GLM: General Language Model Pretraining with Autoregressive Blank Infilling, which proposes a new general language model (GLM) pretrained with an autoregressive blank-infilling objective and fine-tunable for a wide range of natural language understanding and generation tasks.
- 2022.05.16 Added GAU-α, which proposes GAU (Gated Attention Unit), a new design that fuses the attention layer and the FFN layer; it is the key to making the new model faster, cheaper, and better, and it also means the whole model uses only one kind of layer, which is more elegant.
- 2022.03.27 Added RoFormer-V2, an upgraded RoFormer that gains speed mainly through structural simplification and gains accuracy by combining unsupervised and supervised pretraining, achieving a win-win between speed and quality.
- 2022.03.02 Added MobileBERT, a slimmer version of BERT-large that uses bottleneck structures and carefully balances self-attention and feed-forward networks.
- 2022.02.24 Added PERT: Pre-Training BERT with Permuted Language Model, a pretrained model based on a permuted language model that learns textual semantics in a self-supervised way without introducing the [MASK] token.
- 2021.12.06 Added SDCUP: Improving Text-to-SQL with Schema Dependency Learning; DAMO Academy's AliceMind deep language model family released SDCUP, the first table pretraining model in the Chinese community.
- 2021.11.27 Added the RWKV Chinese pretrained generative model, similar to GPT-2; model reference: RWKV-LM
- 2021.11.27 Added the Fengshenbang series of language models open-sourced by IDEA, including 二郎神, 周文王, 闻仲, and 余元.
- 2021.11.25 Added MC-BERT: Conceptualized Representation Learning for Chinese Biomedical Text Mining, a Chinese pretrained model for the biomedical domain.
- 2021.11.24 Added TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning, a pretrained model based on token-aware contrastive learning.
- 2021.10.18 Added Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese; the Mengzi model family was developed with techniques such as incorporating linguistic information and accelerating training.
- 2021.10.14 Added a Chinese BART, a reliably trained Chinese version of BART that provides a baseline for Chinese generation tasks such as summarization.
- 2021.10.14 Added CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation, a Chinese pretrained model covering both understanding and generation.
- 2021.10.13 Added the 紫东太初 multimodal large model: the world's first image-text-audio pretrained model, achieving a unified representation of the vision, text, and speech modalities in a tri-modal pretrained large model.
- 2021.09.19 Added CogView: Mastering Text-to-Image Generation via Transformers, the world's largest Chinese multimodal generative model, supporting text-to-image generation and downstream tasks in multiple domains.
- 2021.09.10 Added WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training, the first general-purpose Chinese image-text multimodal large-scale pretrained model.
- 2021.09.10 Added EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training, an open-domain Chinese dialogue pretrained model.
- 2021.08.19 Added Chinese-Transformer-XL: a GPT-3 model trained on the Chinese pretraining corpus WuDaoCorpus (290G).
- 2021.08.16 Added CPM-2: Large-scale Cost-effective Pre-trained Language Models
- 2021.08.16 Added Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models
- 2021.07.19 Added roformer-sim-v2: a version enhanced with labeled data
- 2021.07.15 Added BERT-CCPoem: a BERT trained on a corpus of classical Chinese poetry
- 2021.07.06 Added ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
- 2021.06.22 Added StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding
- 2021.06.14 Added RoFormer: Enhanced Transformer with Rotary Position Embedding
- 2021.05.25 Added ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding
- 2021.04.28 Added PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation
- 2021.03.16 Added T5-PEGASUS: an open-source Chinese generative pretrained model
- 2021.03.09 Added the UER series of models
- 2021.03.04 Added WoBERT: a word-granularity Chinese BERT
- 2020.11.11 Initialized the BERT series of models