update xverse-13b-256k

miange91 · miange91 · commit c4e86d294329 · 2024-01-15T17:21:17.000+08:00
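
The README text added in this commit says XVERSE-13B-256K was obtained from XVERSE-13B-2 through continued pre-training with ABF and supervised fine-tuning with NTK, but it does not define either term or give hyperparameters. As a rough illustration only, assuming ABF means enlarging the RoPE base frequency and NTK means NTK-aware RoPE scaling, the sketch below shows how each changes the rotary inverse frequencies. The head dimension, base values, and scale factor are hypothetical and are not taken from XVERSE.

```python
import math


def rope_inv_freq(head_dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """NTK-aware scaling: enlarge the base so that the lowest frequency is
    divided by `scale` while the highest frequency stays at 1."""
    return base * scale ** (head_dim / (head_dim - 2))


def longest_wavelength(inv_freq: list[float]) -> float:
    """Wavelength (in positions) of the slowest-rotating dimension pair."""
    return 2 * math.pi / inv_freq[-1]


HEAD_DIM = 128          # hypothetical head dimension
ORIGINAL_BASE = 10000.0  # common RoPE default, assumed here
ABF_BASE = 500000.0      # hypothetical enlarged base for ABF-style training
NTK_SCALE = 8.0          # hypothetical context extension factor

original = rope_inv_freq(HEAD_DIM, ORIGINAL_BASE)
abf = rope_inv_freq(HEAD_DIM, ABF_BASE)
ntk = rope_inv_freq(HEAD_DIM, ntk_scaled_base(ORIGINAL_BASE, NTK_SCALE, HEAD_DIM))

for name, freqs in [("original", original), ("abf", abf), ("ntk", ntk)]:
    print(f"{name:>8}: longest wavelength = {longest_wavelength(freqs):.0f} positions")
```

Both adjustments leave the fastest-rotating dimension untouched and stretch the slowest ones, which is why they are commonly used to extend the usable context length.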
diff --git a/README.md b/README.md
@@ -19,6 +19,7 @@
 </h4>
 
 ## 更新信息
+**[2024/01/16]** 发布长序列对话模型**XVERSE-13B-256K** ，该版本模型最大支持 256K 的上下文窗口长度，约 25w 字的输入内容，可以协助进行文献总结、报告分析等任务。  
 **[2023/11/06]** 发布新版本的 **XVERSE-13B-2** 底座模型和 **XVERSE-13B-2-Chat** 对话模型，相较于原始版本，新版本的模型训练更加充分（从 1.4T 增加到 3.2T），各方面的能力均得到大幅提升，同时新增工具调用能力。  
 **[2023/09/26]** 发布 7B 尺寸的 [XVERSE-7B](https://github.com/xverse-ai/XVERSE-7B) 底座模型和 [XVERSE-7B-Chat](https://github.com/xverse-ai/XVERSE-7B) 对话模型，支持在单张消费级显卡部署运行，并保持高性能、全开源、免费可商用。  
 **[2023/08/22]** 发布经过指令精调的 XVERSE-13B-Chat 对话模型。   
@@ -39,6 +40,8 @@
 
 <img src="resources/chat_train_data.png">
 
+**XVERSE-13B-256K**是[**XVERSE-13B-2**](https://huggingface.co/xverse/XVERSE-13B)模型经过ABF+继续预训练、NTK+SFT微调后的版本。
+
 ## 评测结果
 
 为了综合评估模型的性能，我们在一系列标准数据集上进行了全面测试，包括C-Eval、CMMLU、Gaokao-Bench、MMLU、GAOKAO-English、AGIEval、RACE-M、CommonSenseQA、PIQA、GSM8K和HumanEval。这些评估覆盖了模型在多个领域的能力，具体包括中文问答、英文问答、语言理解、常识问答、逻辑推理、数学问题解答以及编程能力。评估结果如下：
@@ -62,6 +65,26 @@
 对于上述所有比较模型，我们优先汇报其官方公布的结果。在缺少官方结果的情况下，我们采用了 [OpenCompass 榜单](https://opencompass.org.cn/leaderboard-llm)的报告结果。其他结果则来自于我们自行执行的评估流程所获得的数据。   
 对于 MMLU ，我们采用作者提供的[评测工具](https://github.com/hendrycks/test)，C-Eval、AGIEval、GAOKAO-Bench、GAOKAO-English 与 MMLU 的评测方式相同，其余评测数据集使用 [OpenCompass 评估框架](https://github.com/open-compass/OpenCompass/)进行评估。
 
+### XVERSE-13B-256K
+
+为了验证长序列的效果，这里我们使用了LongBench数据集。[LongBench](https://github.com/THUDM/LongBench)是第一个多任务、中英双语、针对大语言模型长文本理解能力的评测基准。LongBench由六大类、二十一个不同的任务组成，覆盖了单文档问答、多文档问答、摘要、Few shot任务、合成任务和代码补全等关键的长文本应用场景。LongBench包含14个英文任务、5个中文任务和2个代码任务，多数任务的平均长度在5k-15k之间，共包含4750条测试数据。评估结果如下：
+
+
+|  能力维度  |  数据集 |  XVERSE-13B-256K | GPT-3.5-Turbo-16K | Yi-6B-200K | LongChat-7B-16K | Llama2-7B-Chat-4K | 
+| :--------: | :-------------------: | :----: | :----------: | :--------: | :-----------: | :--------: |
+|  多文档问答  |      HotpotQA         |     58.3     |    51.6    |     48.3      |    22.4    |    24.3    |
+|             |      DuReader         |     28.9     |    28.7    |     14.2       |    19.1    |    1.9    |
+|  单文档问答  |      NarrativeQA      |    24.1      |    23.6    |     14.5      |    21.6    |    19.1    |
+|             |       Qasper          |     30.2     |    43.3    |     21.6      |    21.6    |    19.6    |
+|    摘要     |      VCSUM            |     11.3     |    16.0    |      8.2       |    14.0   |    0.2     |
+|  Few shot   |      TREC             |     72.0     |    68.0    |     71.0      |    61.5    |    60.5    |
+|             |      LSHT             |     35.0     |    29.2    |     38.0      |    20.8    |    19.8    |
+|  合成任务    |  PassageRetrieval-en |     63.0     |    71.0    |     6.0       |    24.0    |    9.2     |
+|             |  PassageRetrieval-zh |     44.0     |    77.5    |     7.9       |    4.8     |    0.5     |
+|      代码   |  RepoBench-P          |    55.6     |    53.6    |     61.5      |    54.7    |    42.4    |
+
+对于上述所有比较模型，我们优先汇报其官方公布的结果。在缺少官方结果的情况下，我们采用自行执行的评估流程所获得的数据。   
+
 ## 使用方法
 
 ### 环境安装
diff --git a/README_EN.md b/README_EN.md
@@ -19,6 +19,7 @@
 </h4>
 
 ## Update Information
+**[2024/01/16]** Released the long-sequence chat model **XVERSE-13B-256K**. This version supports a context window of up to 256K, roughly 250,000 characters of input, and can assist with tasks such as literature summarization and report analysis.  
 **[2023/11/06]** The new versions of the **XVERSE-13B-2** base model and the **XVERSE-13B-2-Chat** model have been released. Compared to the original versions, the new models have undergone more extensive training (increasing from 1.4T to 3.2T), resulting in significant improvements in all capabilities, along with the addition of Function Call abilities.  
 **[2023/09/26]** Released the [XVERSE-7B](https://github.com/xverse-ai/XVERSE-7B) base model and [XVERSE-7B-Chat](https://github.com/xverse-ai/XVERSE-7B) instruct-finetuned model with 7B size, which support deployment and operation on a single consumer-grade graphics card while maintaining high performance, full open source, and free for commercial use.   
 **[2023/08/22]** Released the aligned instruct-finetuned model XVERSE-13B-Chat.
@@ -40,6 +41,9 @@ In the alignment, the sampling ratio of data of different capability types is as
 |:-------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
 | Ratio(%) |   21.2   |   18.6   |   12.4   |   11.3   |    9.8   |    6.8   |    5.4   |    5.1   |     4.8  |   4.6    |
 
+**XVERSE-13B-256K** is the long-sequence version of the [**XVERSE-13B-2**](https://huggingface.co/xverse/XVERSE-13B) model,
+obtained through **continued pre-training** with **ABF** followed by **supervised fine-tuning (SFT)** with **NTK**.
+
 ## Model Evaluation
 
 To comprehensively assess the performance of the model, we conducted extensive testing across a range of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, GAOKAO-English, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K and HumanEval. These evaluations spanned multiple capabilities of the model, specifically including Chinese question answering, English question answering, language comprehension, common sense questioning, logical reasoning, mathematical problem-solving, and coding ability. The results of the evaluations are as follows:
@@ -60,9 +64,31 @@ To comprehensively assess the performance of the model, we conducted extensive t
 
 > <sup>1: Tests are conducted only on single-answer multiple-choice questions, thus excluding fill-in-the-blanks, open-ended questions, and multiple-answer multiple-choice questions.</sup>   
 
+### XVERSE-13B-256K
+
 For all the comparison models mentioned above, we prioritize the disclosure of their officially published results. In the absence of official data, we refer to the reported outcomes from [OpenCompass Leaderboard](https://opencompass.org.cn/leaderboard-llm). Results not covered by the aforementioned sources are derived from our own evaluation pipline.   
 For MMLU, we adopt the [evaluation tools](https://github.com/hendrycks/test) provided by the authors, C-Eval, AGIEval, GAOKAO-Bench, GAOKAO-English are the same as MMLU. For the remaining evaluation datasets, the [OpenCompass](https://github.com/open-compass/OpenCompass/) is employed for evaluation.
 
+
+To evaluate long-sequence performance, we use the LongBench dataset. [LongBench](https://github.com/THUDM/LongBench) is the first multi-task, bilingual (Chinese-English) benchmark for evaluating the long-text comprehension of large language models. It comprises 21 tasks in six major categories, covering key long-text application scenarios: single-document QA, multi-document QA, summarization, few-shot tasks, synthetic tasks, and code completion. The dataset contains 14 English tasks, 5 Chinese tasks, and 2 code tasks, with most tasks averaging 5,000 to 15,000 tokens in length, for a total of 4,750 test instances. The evaluation results are as follows:
+
+
+|  Capability Dimension  |  Dataset |  XVERSE-13B-256K | GPT-3.5-Turbo-16K | Yi-6B-200K | LongChat-7B-16K | Llama2-7B-Chat-4K | 
+| :--------: | :-------------------: | :----: | :----------: | :--------: | :-----------: | :--------: |
+|  multi-document QA  |      HotpotQA         |     58.3     |    51.6    |     48.3      |    22.4    |    24.3    |
+|                     |      DuReader         |     28.9     |    28.7    |     14.2      |    19.1    |    1.9     |
+|  single-document QA |      NarrativeQA      |     24.1     |    23.6    |     14.5      |    21.6    |    19.1    |
+|                     |       Qasper          |     30.2     |    43.3    |     21.6      |    21.6    |    19.6    |
+|    summarization    |      VCSUM            |     11.3     |    16.0    |      8.2      |    14.0    |    0.2     |
+|    Few shot         |      TREC             |     72.0     |    68.0    |     71.0      |    61.5    |    60.5    |
+|                     |      LSHT             |     35.0     |    29.2    |     38.0      |    20.8    |    19.8    |
+|  synthetic tasks    |  PassageRetrieval-en  |     63.0     |    71.0    |     6.0       |    24.0    |    9.2     |
+|                     |  PassageRetrieval-zh  |     44.0     |    77.5    |     7.9       |    4.8     |    0.5     |
+|   code completion   |  RepoBench-P          |     55.6     |    53.6    |     61.5      |    54.7    |    42.4    |
+
+ 
+For all the comparison models above, we report their officially published results where available. In the absence of official results, we use results from our own evaluation pipeline.
+
 ## Usage
 
 ### Environment Setup
diff --git a/README_JA.md b/README_JA.md
@@ -19,6 +19,7 @@
 </h4>
 
 ## アップデート情報
+**[2024/01/16]** 長いシーケンス対話モデルの**XVERSE-13B-256K** をリリースしました。このバージョンのモデルは、最大256Kウィンドウサイズをサポートしており、約25万文字の入力コンテンツを扱うことができます。文献の要約、報告の分析などのタスクを支援することができます。  
 **[2023/11/06]** 新しいバージョンの**XVERSE-13B-2**ベースモデルと**XVERSE-13B-2-Chat**対話モデルがリリースされました。元のバージョンと比べて、新しいモデルはより充実したトレーニングを受けています（1.4Tから3.2Tに増加）。その結果、さまざまな能力が大幅に向上しました。また、Function Callの機能が新たに追加されています。  
 **[2023/09/26]** サイズ7Bの [XVERSE-7B](https://github.com/xverse-ai/XVERSE-7B) ベースモデルおよび [XVERSE-7B-Chat](https://github.com/xverse-ai/XVERSE-7B) 会話モデルをリリースします。これらのモデルは、シングルのコンシューマーグレードグラフィックカードでのデプロイメントと運用をサポートし、高性能を維持します。完全にオープンソースで、商用利用無料です。    
 **[2023/08/22]** 微調整して公開する XVERSE-13B-Chat 対話モデル。
@@ -40,6 +41,8 @@
 |:-------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
 | Ratio(%) |   21.2   |   18.6   |   12.4   |   11.3   |    9.8   |    6.8   |    5.4   |    5.1   |     4.8  |   4.6    |
 
+**XVERSE-13B-256K**は、[**XVERSE-13B-2**](https://huggingface.co/xverse/XVERSE-13B)モデルにABF+を用いて継続的に予訓練し、NTK+SFTで微調整したバージョンです。
+
 ## モデル評価
 
 モデルの性能を総合的に評価するために、C-Eval、CMMLU、Gaokao-Bench、MMLU、GAOKAO-English、AGIEval、RACE-M、CommonSenseQA、PIQA、GSM8K、HumanEvalを含む一連の標準データセットで幅広いテストを行いました。これらの評価は、中国語の質問応答、英語の質問応答、言語理解、常識問題、論理的推論、数学問題解決、およびコーディング能力を含むモデルの複数の能力をカバーしています。評価結果は以下の通りです：
@@ -63,6 +66,25 @@
 上記すべての比較モデルについて、まずは公式に公開された結果を報告します。公式の結果が不足している場合には、[OpenCompass リーダーボード](https://opencompass.org.cn/leaderboard-llm)の報告結果を参照しています。それ以外の結果は、当社の評価プロセスによって得られたデータから派生しています。   
 MMLUについては、著者が提供する[評価ツール](https://github.com/hendrycks/test)を使用します。C-Eval、AGIEval、GAOKAO-Bench、GAOKAO-Englishの評価方法はMMLUと同様ですが、その他の評価データセットについては[OpenCompass](https://github.com/open-compass/OpenCompass/)評価フレームワークを用いて評価を行います。
 
+### XVERSE-13B-256K
+
+ 長いシーケンス効果の検証のために、ここではLongBenchデータセットを使用しました。[LongBench](https://github.com/THUDM/LongBench)は、大規模な言語モデルの長いテキスト理解能力を対象とする、初めての多タスク、中英バイリンガル、評価基準です。LongBenchは、6つのカテゴリ、21の異なるタスクから構成されており、単一ドキュメントQ&A、複数ドキュメントQ&A、要約、Few-shotタスク、合成タスク、コード補完など、重要な長いテキストアプリケーションシナリオをカバーしています。LongBenchには、14の英語タスク、5の中国語タスク、2のコードタスクが含まれており、多くのタスクの平均長さは5k-15kの間で、合計4750のテストデータが含まれています。評価結果は以下の通りです：
+
+| 能力の次元 | データセット | XVERSE-13B-256K | GPT-3.5-Turbo-16K | Yi-6B-200K | LongChat-7B-16K | Llama2-7B-Chat-4K |
+| :--------: | :-------------------: | :----: | :----------: | :--------: | :-----------: | :--------: |
+| 多文書Q&A | HotpotQA | 58.3 | 51.6 | 48.3 | 22.4 | 24.3 |
+|           | DuReader | 28.9 | 28.7 | 14.2 | 19.1 | 1.9 |
+| 単文書Q&A | NarrativeQA | 24.1 | 23.6 | 14.5 | 21.6 | 19.1 |
+|           | Qasper | 30.2 | 43.3 | 21.6 | 21.6 | 19.6 |
+| 要約 | VCSUM | 11.3 | 16.0 | 8.2 | 14.0 | 0.2 |
+| Few shot | TREC | 72.0 | 68.0 | 71.0 | 61.5 | 60.5 |
+|           | LSHT | 35.0 | 29.2 | 38.0 | 20.8 | 19.8 |
+| 合成タスク | PassageRetrieval-en | 63.0 | 71.0 | 6.0 | 24.0 | 9.2 |
+|           | PassageRetrieval-zh | 44.0 | 77.5 | 7.9 | 4.8 | 0.5 |
+| コード | RepoBench-P | 55.6 | 53.6 | 61.5 | 54.7 | 42.4 |
+
+ 上記のすべての比較モデルについて、公式に発表された結果を優先して報告します。公式の結果がない場合には、独自の評価プロセスによって得られたデータを採用します。
+
 ## 説明書
 
 ### 環境設定