
Dolly V2 Updates #88

Merged
matthayes merged 2 commits into master from v2_updates
Apr 15, 2023

Conversation

@matthayes
Contributor

This updates training to use the databricks-dolly-15k dataset. It also includes improvements to text generation and example notebooks.

Key Changes:

  • The train_dolly.py notebook now uses Pythia models as the input models and fine-tunes them on the databricks-dolly-15k dataset.
  • Added InstructionTextGenerationPipeline for text generation. This is derived from instruct_pipeline.py in the model repo and has been improved so that it is compatible with the TextGenerationPipeline from the transformers library; some code, such as that in _forward, was copied from that pipeline to help with compatibility. The biggest change relative to the current instruct_pipeline.py is that it returns a list of dicts per instruction rather than a single dict, and it now has a return_full_text option. Both changes contribute towards being usable with langchain.
  • generate_response is now a wrapper around InstructionTextGenerationPipeline, as the code was all moved there.
  • trainer.py now uses the local databricks-dolly-15k.jsonl dataset. A text column has been constructed from the instruction, context, and response.
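The text-column construction described in the last bullet can be sketched roughly as follows. This is a hypothetical sketch, not the repo's actual code: the prompt constants stand in for the real ones in training/consts.py, and build_text is an invented helper name.

```python
# Hypothetical sketch of building a "text" training column from the
# databricks-dolly-15k fields (instruction, context, response). The real
# prompt constants live in training/consts.py and may differ.
INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)
INSTRUCTION_KEY = "### Instruction:"
INPUT_KEY = "Input:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"

def build_text(record):
    """Render one dataset record into a single training string."""
    if record.get("context"):
        # Open-book sample: include the context under INPUT_KEY.
        body = (
            f"{INSTRUCTION_KEY}\n{record['instruction']}\n\n"
            f"{INPUT_KEY}\n{record['context']}\n\n"
            f"{RESPONSE_KEY}\n{record['response']}"
        )
    else:
        # Closed-book sample: instruction and response only.
        body = (
            f"{INSTRUCTION_KEY}\n{record['instruction']}\n\n"
            f"{RESPONSE_KEY}\n{record['response']}"
        )
    return f"{INTRO}\n\n{body}\n\n{END_KEY}"
```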

Minor Changes:

  • Added an experiment_id widget to help keep track of different models that are fine tuned.
  • Added more options to CLI for configuring training.

Additional Changes:

  • Added a generation.py example notebook that uses generate_response on a couple of instructions.
  • Added a langchain.py example notebook that uses HuggingFacePipeline from langchain and InstructionTextGenerationPipeline to test instructions both with and without context.
  • Added a pipeline.py example notebook that uses InstructionTextGenerationPipeline to generate multiple samples per instruction.
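For illustration, the list-of-dicts output format and the return_full_text option described above can be sketched as a standalone postprocess step; postprocess_sketch is a hypothetical name, not the pipeline's actual method:

```python
def postprocess_sketch(prompt, full_texts, return_full_text=False):
    """Hypothetical sketch of the output shape: one dict per generated
    sequence, matching what transformers' TextGenerationPipeline produces.
    When return_full_text is False, only the text after the prompt is kept.
    """
    records = []
    for text in full_texts:
        out = text if return_full_text else text[len(prompt):].lstrip()
        records.append({"generated_text": out})
    return records
```

Returning a list per instruction lets callers iterate over multiple generated samples per instruction, which is the shape downstream consumers such as langchain's HuggingFacePipeline expect.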

        return preprocess_params, forward_params, postprocess_params

    def preprocess(self, instruction_text, **generate_kwargs):
        prompt_text = PROMPT_FOR_GENERATION_FORMAT.format(instruction=instruction_text)

@eggie5 commented on Jun 21, 2023


You all define a prompt for open-book QA here https://github.com/databrickslabs/dolly/blob/master/training/consts.py#L43 (but call it closed-QA as a misnomer 😉) and use it during training if the data sample has some additional context.

however, if I'm reading this correctly, there's no way to do open-book (w/ a context) QA, as this line seems to default to closed-book, i.e. no context (INPUT_KEY), during inference?

I did see in the langchain example that you sort of hack it into the instruction_text by concatenating INPUT_KEY and the context... maybe that is the expectation?
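That workaround can be sketched like this. INPUT_KEY is assumed to match the constant in training/consts.py, and with_context is a hypothetical helper, not code from the repo:

```python
# Hypothetical sketch of the workaround described above: splice the context
# into instruction_text so the prompt still carries an input section even
# though the inference path only formats the instruction.
INPUT_KEY = "Input:"

def with_context(instruction, context):
    """Concatenate the context onto the instruction under INPUT_KEY."""
    return f"{instruction}\n\n{INPUT_KEY}\n{context}"
```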

