MNBVC
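MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, and even text written in "Martian script" (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.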
How to download and set up MNBVC
Open a terminal and run:
git clone https://github.com/esbatmop/MNBVC.git
git clone creates a local copy of the MNBVC repository.
You pass git clone a repository URL. It supports several network protocols and their corresponding URL formats.
Alternatively, you can download MNBVC as a zip file: https://github.com/esbatmop/MNBVC/archive/master.zip
Or clone MNBVC over SSH:
git clone git@github.com:esbatmop/MNBVC.git
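The download options above differ only in the form of the URL. As a minimal sketch, assuming the usual GitHub URL patterns, the Python snippet below builds all three from an owner and repository name. The helper name github_urls and its branch parameter are hypothetical, introduced only for illustration. The snippet does not contact the network.

```python
def github_urls(owner: str, repo: str, branch: str = "master") -> dict:
    """Build the HTTPS, SSH, and zip-archive URLs for a GitHub repository.

    Hypothetical helper: it only formats strings and assumes the
    standard GitHub URL layout.
    """
    base = f"https://github.com/{owner}/{repo}"
    return {
        "https": f"{base}.git",                       # for: git clone <url>
        "ssh": f"git@github.com:{owner}/{repo}.git",  # needs an SSH key registered with GitHub
        "zip": f"{base}/archive/{branch}.zip",        # snapshot only, no git history
    }


urls = github_urls("esbatmop", "MNBVC")
for name, url in urls.items():
    print(f"{name}: {url}")
```

The "https" and "ssh" values can be passed to git clone directly. The "zip" value is a plain download link and yields a snapshot without git history.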
If you have problems with MNBVC
You can open an issue on the MNBVC issue tracker: https://github.com/esbatmop/MNBVC/issues
Repositories similar to MNBVC
Here are some MNBVC alternatives and analogs:
lectures spaCy HanLP gensim tensorflow_cookbook tensorflow-nlp Awesome-pytorch-list spacy-models TagUI Repo-2017 stanford-tensorflow-tutorials awesome-nlp franc nlp_tasks nltk TextBlob CoreNLP allennlp mycroft-core practical-pytorch prose ltp libpostal sling DeepNLP-models-Pytorch attention-is-all-you-need-pytorch kaggle-CrowdFlower hubot-natural chat KGQA-Based-On-medicine