# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess the QA dataset to parquet format
"""

import re
import os
import datasets
import json
from verl.utils.hdfs_io import copy, makedirs
import argparse


def make_prefix(dp, retriever):
    input_str = """You are a search copilot for the generation model. Based on a user's query and initial searched results, you will first determine if the searched results are enough to produce an answer.
If the searched results are enough, you will use <search_complete>True</search_complete> to indicate that you have gathered enough information for the generation model to produce an answer.
If the searched results are not enough, you will go through a loop of <query> -> <information> -> <important_info> -> <search_complete> -> <query> (if not complete) ..., to help the generation model to generate a better answer with more relevant information searched.
You should show the search query between <query> and </query> in JSON format.
Based on the search query, we will return the top searched results between <information> and </information>. You need to put the doc ids of the important documents (up to 3 documents, within the current information window) between <important_info> and </important_info> (e.g., <important_info>[1, 4]</important_info>).
A search query MUST be followed by a <search_complete> tag if the search is not complete.
After reviewing the information, you must decide whether to continue searching with a new query or indicate that the search is complete. If you need more information, use <search_complete>False</search_complete> to indicate you want to continue searching with a better query. Otherwise, use <search_complete>True</search_complete> to terminate the search.
During the process, you can add reasoning process within <think></think> tag whenever you want. Note: Only the important information would be used for the generation model to produce an answer.
"""

    if retriever == "bm25":
        input_str += """Note: The search query should use Boolean operators (AND, OR) and parentheses for grouping terms appropriately."""

    input_str += """
For a question and initial searched results:
<question>
[user's question]
</question>
<information>
[initial searched results]
</information>

If the initial searched results are enough to produce an answer, you should output:
<search_complete>
True
</search_complete>

If the initial searched results are not enough to produce an answer, you should output:
<query>
{
    "query": "[search query]"
}
</query>
<information>
[top searched results based on the above search query]
</information>
<important_info>
[doc ids]
</important_info>
<search_complete>
False
</search_complete>
<query>
{
    "query": "[search query]"
}
</query>
...... (can be several turns until <search_complete> is True)

<search_complete>
True
</search_complete>

Now, start the loop with the following question and initial searched results: