
Dolly V2 Updates #88

Merged
matthayes merged 2 commits into master from v2_updates
Apr 15, 2023

Conversation

@matthayes
Contributor

This updates training to use the databricks-dolly-15k dataset. It also includes improvements to text generation and example notebooks.

Key Changes:

  • The train_dolly.py notebook now uses Pythia models as the input models and fine-tunes them on the databricks-dolly-15k dataset.
  • Added InstructionTextGenerationPipeline for text generation. This is derived from instruct_pipeline.py in the model repo and has been improved so that it is compatible with the TextGenerationPipeline from the transformers library; some code, such as that in _forward, was copied from that pipeline to help with compatibility. The biggest change relative to the current instruct_pipeline.py is that it returns a list of dicts per instruction rather than a single dict, and it now has a return_full_text option. Both changes contribute towards being usable with langchain.
  • generate_response is now a wrapper around InstructionTextGenerationPipeline, as the code was all moved there.
  • trainer.py now uses the local databricks-dolly-15k.jsonl dataset. A text column has been constructed from the instruction, context, and response.
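The text-column construction described in the last bullet can be sketched roughly as follows. This is a hypothetical sketch, not the repo's actual code: the prompt constants stand in for the real ones in training/consts.py, and build_text is an invented helper name.

```python
# Hypothetical sketch of building a "text" training column from the
# databricks-dolly-15k fields (instruction, context, response). The real
# prompt constants live in training/consts.py and may differ.
INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)
INSTRUCTION_KEY = "### Instruction:"
INPUT_KEY = "Input:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"

def build_text(record):
    """Render one dataset record into a single training string."""
    if record.get("context"):
        # Open-book sample: include the context under INPUT_KEY.
        body = (
            f"{INSTRUCTION_KEY}\n{record['instruction']}\n\n"
            f"{INPUT_KEY}\n{record['context']}\n\n"
            f"{RESPONSE_KEY}\n{record['response']}"
        )
    else:
        # Closed-book sample: instruction and response only.
        body = (
            f"{INSTRUCTION_KEY}\n{record['instruction']}\n\n"
            f"{RESPONSE_KEY}\n{record['response']}"
        )
    return f"{INTRO}\n\n{body}\n\n{END_KEY}"
```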

Minor Changes:

  • Added an experiment_id widget to help keep track of different models that are fine tuned.
  • Added more options to CLI for configuring training.

Additional Changes:

  • Added a generation.py example notebook that uses generate_response on a couple of instructions.
  • Added a langchain.py example notebook that uses HuggingFacePipeline from langchain and InstructionTextGenerationPipeline to test instructions both with and without context.
  • Added a pipeline.py example notebook that uses InstructionTextGenerationPipeline to generate multiple samples per instruction.
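For illustration, the list-of-dicts output format and the return_full_text option described above can be sketched as a standalone postprocess step; postprocess_sketch is a hypothetical name, not the pipeline's actual method:

```python
def postprocess_sketch(prompt, full_texts, return_full_text=False):
    """Hypothetical sketch of the output shape: one dict per generated
    sequence, matching what transformers' TextGenerationPipeline produces.
    When return_full_text is False, only the text after the prompt is kept.
    """
    records = []
    for text in full_texts:
        out = text if return_full_text else text[len(prompt):].lstrip()
        records.append({"generated_text": out})
    return records
```

Returning a list per instruction lets callers iterate over multiple generated samples per instruction, which is the shape downstream consumers such as langchain's HuggingFacePipeline expect.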

        return preprocess_params, forward_params, postprocess_params

    def preprocess(self, instruction_text, **generate_kwargs):
        prompt_text = PROMPT_FOR_GENERATION_FORMAT.format(instruction=instruction_text)

@eggie5 commented on Jun 21, 2023


You all define a prompt for open-book QA here https://github.com/databrickslabs/dolly/blob/master/training/consts.py#L43 (but call it closed-QA as a misnomer 😉) and use it during training if the data sample has some additional context.

however, if I'm reading this correctly, there's no way to do open-book (w/ a context) QA, as this line seems to default to closed-book, i.e. no context (INPUT_KEY), during inference?

I did see in the langchain example that you sort of hack it into the instruction_text by concatenating INPUT_KEY and the context... maybe that is the expectation?
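That workaround can be sketched like this. INPUT_KEY is assumed to match the constant in training/consts.py, and with_context is a hypothetical helper, not code from the repo:

```python
# Hypothetical sketch of the workaround described above: splice the context
# into instruction_text so the prompt still carries an input section even
# though the inference path only formats the instruction.
INPUT_KEY = "Input:"

def with_context(instruction, context):
    """Concatenate the context onto the instruction under INPUT_KEY."""
    return f"{instruction}\n\n{INPUT_KEY}\n{context}"
```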

