The landscape of large language models has been dominated by English-centric systems for years. While models like GPT-4, Claude, and LLaMA deliver exceptional performance in English, their capabilities in Chinese – and the availability of open-source alternatives – have lagged behind. BELLE (Be Everyone’s Large Language model Engine) was created to close that gap.
Developed by the BELLE Group at Lianjia Technology, BELLE is an open-source Chinese large language model project that fine-tunes BLOOM and LLaMA architectures with large-scale Chinese instruction data. Named “BELLE” to evoke the idea of a beautiful, accessible engine for everyone, the project aims to democratize Chinese conversational AI in the same way that Alpaca and Vicuna did for English.
With 3,600+ GitHub stars and an active research community contributing to its development, BELLE has become one of the most significant open-source Chinese LLM efforts. The project released multiple model variants benchmarked against each other, along with training data, evaluation methods, and deployment tools.
This guide covers the architecture, model variants, training methodology, evaluation benchmarks, and practical deployment of BELLE.
What Makes BELLE Different from Other Chinese LLMs?
Several open-source Chinese LLM projects emerged around the same time – ChatGLM, MOSS, and Chinese-Alpaca among them. BELLE occupies a distinct niche for three reasons:
| Differentiator | BELLE | Other Chinese LLMs |
|---|---|---|
| Base Model | BLOOM + LLaMA variants | Mostly LLaMA or ChatGLM |
| Training Data | Alpaca-style, translated and curated | Varies widely |
| Research Focus | Instruction-following evaluation | Often focused on pre-training |
| Transparency | Full data and model release | Often partial release only |
BELLE’s commitment to releasing both models and training data makes it particularly valuable for researchers who want to understand and build upon the instruction-tuning process for Chinese.
How Does BELLE’s Architecture Work?
BELLE is not a single model but a family of instruction-tuned models built on two base architectures:
```mermaid
graph TD
    subgraph "BELLE Model Family"
        A[BLOOMZ-7B1-MT] --> B[BELLE-7B]
        A2[LLaMA-7B] --> C[BELLE-LLaMA-7B]
        A3[LLaMA-13B] --> D[BELLE-LLaMA-13B]
        B --> E[BELLE-7B-2M]
        B --> F[BELLE-7B-0.5M]
        C --> G[BELLE-LLaMA-7B-2M]
    end
```

| Model Variant | Base Architecture | Parameters | Training Data Size |
|---|---|---|---|
| BELLE-7B | BLOOMZ-7B1-MT | 7B | 2M instructions |
| BELLE-LLaMA-7B | LLaMA-7B | 7B | 2M instructions |
| BELLE-LLaMA-13B | LLaMA-13B | 13B | 2M instructions |
| BELLE-7B-0.5M | BLOOMZ-7B1-MT | 7B | 0.5M instructions |
The 2M-instruction dataset (train_2M_CN) is the project’s flagship release, providing 2 million Chinese instruction-response pairs covering diverse tasks including translation, summarization, coding, question answering, and creative writing.
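The released records follow the Alpaca-style schema of an instruction, an optional input, and an output. Below is a minimal sketch of loading and inspecting the dataset with the Hugging Face datasets library; the dataset ID and field names are assumptions based on that schema, so verify them against the dataset card:

```python
# Sketch: inspect one BELLE instruction-tuning record.
# Dataset ID and field names are assumed (Alpaca-style schema); check the
# BelleGroup dataset card before relying on them.
from datasets import load_dataset

dataset = load_dataset("BelleGroup/train_2M_CN", split="train")

example = dataset[0]
print(example["instruction"])  # the Chinese task prompt
print(example["output"])       # the teacher-model response
```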
How Was the Training Data Created?
BELLE’s training data methodology is one of its most instructive contributions. The team followed the Stanford Alpaca approach of using a teacher model (text-davinci-003) to generate instruction data, but with a critical adaptation for Chinese:
- Seed instructions in Chinese: Instead of translating English instructions after generation, the BELLE team crafted Chinese seed instructions to prompt the teacher model directly in Chinese, producing more natural Chinese outputs.
- Manual filtering: Generated data was manually reviewed to remove low-quality or inappropriate responses.
- Data scaling: Three dataset sizes were released (0.5M, 1M, 2M) to study how instruction data scale affects model performance.
This methodology is documented in detail on the BELLE GitHub repository, making it reproducible for researchers who want to create instruction datasets in other languages.
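As a rough illustration of that approach, the sketch below prompts a teacher model with Chinese seed examples to generate new instruction-response pairs. The seed file name, prompt wording, and legacy openai Completion API call are illustrative assumptions, not BELLE's exact pipeline, which is documented in the repository:

```python
# Illustrative sketch of Chinese-first, Alpaca-style data generation.
# Seed file, prompt text, and API usage are assumptions for illustration only.
import json
import random

import openai  # legacy (<1.0) openai client assumed

with open("zh_seed_tasks.json", encoding="utf-8") as f:  # hypothetical seed file
    seeds = json.load(f)

def generate_batch(num_examples: int = 5) -> str:
    # Prompt the teacher model in Chinese with a few seed instructions,
    # asking it to produce new instruction-response pairs.
    sampled = random.sample(seeds, k=3)
    prompt = "请参考以下示例，生成新的中文指令及对应回答：\n"
    for s in sampled:
        prompt += f"指令：{s['instruction']}\n回答：{s['output']}\n\n"
    prompt += f"请再生成{num_examples}条类似的指令和回答。"
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=1024,
        temperature=1.0,
    )
    return response["choices"][0]["text"]

print(generate_batch())
```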
How Does BELLE Perform on Benchmarks?
BELLE was evaluated using a multi-dimensional evaluation framework covering several Chinese NLP tasks:
```mermaid
graph LR
    A[BELLE Model] --> B{Evaluation}
    B --> C[Translation]
    B --> D[Summarization]
    B --> E[QA Accuracy]
    B --> F[Instruction Following]
    B --> G[Safety & Bias]
    C --> H[Score Report]
    D --> H
    E --> H
    F --> H
    G --> H
```

| Evaluation Task | BELLE-7B (2M) | BELLE-LLaMA-7B (2M) | Baseline (Base Model) |
|---|---|---|---|
| Chinese Translation (BLEU) | 28.4 | 27.1 | 22.3 |
| Text Summarization (ROUGE-L) | 32.7 | 31.5 | 26.8 |
| Chinese QA (F1) | 64.2 | 62.8 | 56.1 |
| Safety & Bias | Pass | Pass | Pass |
The 2M-instruction variant consistently outperformed the 0.5M variant and the base model across all tasks, confirming that instruction data scaling yields measurable improvements in Chinese language tasks.
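For context on how numbers like the BLEU scores above are typically produced, here is a minimal corpus-level BLEU computation using sacrebleu with its built-in Chinese tokenization; this illustrates the metric rather than BELLE's exact evaluation harness:

```python
# Minimal sketch of corpus-level BLEU scoring for Chinese translation output.
# Illustrates the metric only, not BELLE's evaluation scripts.
import sacrebleu

hypotheses = ["深度学习是机器学习的一个分支。"]    # model outputs
references = [["深度学习是机器学习的一个子领域。"]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
print(f"BLEU: {bleu.score:.1f}")
```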
What Are the Limitations?
BELLE is a research project with important caveats:
- Base model constraints: BELLE inherits the limitations of BLOOM and LLaMA, including tokenizer biases toward English. BLOOM’s multilingual tokenizer handles Chinese better than LLaMA’s, which partly explains why BELLE-7B (BLOOM-based) often outperforms BELLE-LLaMA-7B on Chinese tasks.
- Training data quality: The Alpaca-style data generation pipeline, while powerful, can produce hallucinations and factual errors that the model will learn. Manual filtering helps but cannot catch everything.
- Evaluation gap: Benchmarks do not fully capture real-world Chinese conversational quality. Human evaluation remains the gold standard, and BELLE’s own papers acknowledge the gap.
- License restrictions: BELLE is released for research purposes only, inheriting the licenses of its base models. Commercial use requires careful legal review.
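To make the tokenizer point above concrete, one can count how many tokens each base model's tokenizer needs for the same Chinese sentence. The model identifiers below are illustrative, and the LLaMA tokenizer may require separately obtained weights:

```python
# Sketch: compare Chinese tokenization efficiency of the two base models.
# Model IDs are illustrative; LLaMA tokenizer access may be gated.
from transformers import AutoTokenizer

text = "深度学习是机器学习的一个分支，它使用多层神经网络来学习数据表示。"

bloom_tok = AutoTokenizer.from_pretrained("bigscience/bloomz-7b1-mt")
llama_tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

print("BLOOM tokens:", len(bloom_tok.encode(text)))
print("LLaMA tokens:", len(llama_tok.encode(text)))
# Fewer tokens per sentence generally means the tokenizer represents
# Chinese more compactly, which favors the BLOOM-based variants.
```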
How Can You Deploy BELLE?
Deployment follows standard Hugging Face workflows. BELLE models are available on the BELLE Group Hugging Face page. A typical inference script:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "BelleGroup/BELLE-7B-2M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# BELLE models expect a Human/Assistant prompt format (see the model card).
prompt = "Human: 什么是深度学习?\n\nAssistant: "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For production deployment, the 7B parameter models run on consumer GPUs with 16 GB+ VRAM using 4-bit quantization, while the 13B variant requires 24 GB+ VRAM.
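As a sketch of such a setup, the following loads BELLE-7B-2M in 4-bit via the transformers BitsAndBytesConfig integration; the quantization settings shown are common defaults rather than BELLE-specific recommendations:

```python
# Sketch: load BELLE-7B-2M in 4-bit to fit a ~16 GB consumer GPU.
# Requires the bitsandbytes package; settings are common defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model_name = "BelleGroup/BELLE-7B-2M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)
```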
Frequently Asked Questions
What is BELLE?
BELLE (Be Everyone’s Large Language model Engine) is an open-source Chinese LLM project from Lianjia Technology that instruction-tunes BLOOM and LLaMA base models on up to 2 million Chinese instruction samples.
What model variants does BELLE offer?
BELLE offers versions based on BLOOMZ-7B1-MT (BELLE-7B), LLaMA-7B (BELLE-LLaMA-7B), and LLaMA-13B (BELLE-LLaMA-13B), each available with different training data sizes (0.5M, 1M, or 2M instructions).
How large is the BELLE training dataset?
The largest BELLE dataset contains 2 million Chinese instruction-response pairs (train_2M_CN). Smaller variants of 0.5M and 1M samples are also available for ablation studies.
What are BELLE’s limitations?
BELLE can produce plausible-sounding but incorrect information, inherits tokenizer biases from its base models, and was trained on generated data that may contain errors. Performance on real-world Chinese conversations may not fully reflect benchmark scores.
What is BELLE’s license?
BELLE is released for research purposes only, inheriting the non-commercial licenses of its base models (BLOOM/LLaMA). Users must verify the latest licensing terms on the official repository.