Update READMEs: describe XVERSE-13B-2-Chat and its alignment data sampling ratios

lieral · lieral · commit 81eac1889747 · 2023-11-29T20:34:35.000+08:00
diff --git a/README.md b/README.md
@@ -33,6 +33,11 @@
 - **分词**：基于 BPE（Byte-Pair Encoding）算法，使用上百 GB 语料训练了一个词表大小为 100,534 的分词器，能够同时支持多语言，而无需额外扩展词表。
 - **训练框架**：自主研发多项关键技术，包括高效算子、显存优化、并行调度策略、数据-计算-通信重叠、平台和框架协同等，让训练效率更高，模型稳定性强，在千卡集群上的峰值算力利用率可达到 58.5%，位居业界前列。
 
+**XVERSE-13B-2-Chat**为 **XVERSE-13B-2** 底座模型对齐后的版本。
+
+对齐阶段，不同能力类型数据的采样比例如下所示：
+<img src="resources/chat_train_data.png">
+
 ## 评测结果
 
 为了综合评估模型的性能，我们在一系列标准数据集上进行了全面测试，包括C-Eval、CMMLU、Gaokao-Bench、MMLU、GAOKAO-English、AGIEval、RACE-M、CommonSenseQA、PIQA、GSM8K和HumanEval。这些评估覆盖了模型在多个领域的能力，具体包括中文问答、英文问答、语言理解、常识问答、逻辑推理、数学问题解答以及编程能力。评估结果如下：
diff --git a/README_EN.md b/README_EN.md
@@ -33,6 +33,11 @@
 - **Tokenization**: Based on the BPE (Byte-Pair Encoding) algorithm, a tokenizer with a vocabulary size of 100,278 has been trained using hundreds of gigabytes of language data. This tokenizer is capable of supporting multilingual without the need for additional vocabulary expansion.
 - **Training Framework**: Several key technologies have also been independently developed, including efficient operators, memory optimization, parallel scheduling strategies, overlap of data-computation-communication, and synergy between platforms and frameworks. These advancements enhance training efficiency and model stability. With these technologies, the peak computational power utilization rate on a thousand-card cluster can reach 58.5%, ranking at the forefront of the industry.
 
+**XVERSE-13B-2-Chat** is the aligned version of the **XVERSE-13B-2** base model.
+
+During the alignment stage, data for each capability type was sampled in the following proportions:
+<img src="resources/chat_train_data.png">
+
 ## Model Evaluation
 
 To comprehensively assess the performance of the model, we conducted extensive testing across a range of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, GAOKAO-English, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K and HumanEval. These evaluations spanned multiple capabilities of the model, specifically including Chinese question answering, English question answering, language comprehension, common sense questioning, logical reasoning, mathematical problem-solving, and coding ability. The results of the evaluations are as follows:
diff --git a/README_JA.md b/README_JA.md
@@ -33,6 +33,10 @@
 - **トークン化**: BPE（Byte-Pair Encoding）アルゴリズムに基づき、100,278 の語彙サイズを持つトークナイザーが、数百ギガバイトの言語データを用いて学習されました。このトークナイザは、追加の語彙拡張を必要とせず、多言語をサポートすることができます。
 - **トレーニングフレームワーク**: 効率的な演算子、メモリの最適化、並列スケジューリング戦略、データ-計算-通信のオーバーラップ、プラットフォームとフレームワーク間の相乗効果など、いくつかの重要な技術も独自に開発されています。これらの進歩により、トレーニング効率とモデルの安定性が向上しました。これらの技術により、1,000 枚クラスタのピーク演算能力利用率は 58.5% に達し、業界の最先端を走っています。
 
+**XVERSE-13B-2-Chat** は、**XVERSE-13B-2** ベース モデルの調整バージョンです。
+調整段階では、さまざまな機能タイプのデータのサンプリング率は次のとおりです:
+<img src="resources/chat_train_data.png">
+
 ## モデル評価
 
 モデルの性能を総合的に評価するために、C-Eval、CMMLU、Gaokao-Bench、MMLU、GAOKAO-English、AGIEval、RACE-M、CommonSenseQA、PIQA、GSM8K、HumanEvalを含む一連の標準データセットで幅広いテストを行いました。これらの評価は、中国語の質問応答、英語の質問応答、言語理解、常識問題、論理的推論、数学問題解決、およびコーディング能力を含むモデルの複数の能力をカバーしています。評価結果は以下の通りです：
diff --git a/resources/chat_train_data.png b/resources/chat_train_data.png