Merged

This was referenced Apr 15, 2023

eggie5 reviewed Jun 21, 2023
    return preprocess_params, forward_params, postprocess_params

    def preprocess(self, instruction_text, **generate_kwargs):
        prompt_text = PROMPT_FOR_GENERATION_FORMAT.format(instruction=instruction_text)
You all define a prompt for open-book QA here https://github.com/databrickslabs/dolly/blob/master/training/consts.py#L43 (but call it closed-QA, which is a misnomer 😉) and use it during training when the data sample has some additional context.
However, if I'm reading this correctly, there's no way to do open-book QA (i.e., with a context) at inference, since this line seems to default to the closed-book prompt, with no context (`INPUT_KEY`)?
I did see that the langchain example sort of hacks it into the `instruction_text` by concatenating `INPUT_KEY` and the context; maybe that is the expectation?
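To make the workaround concrete, here is a minimal sketch of the two prompt shapes being discussed. The template and key strings below are assumptions modeled on `training/consts.py`, not copied verbatim from the repo:

```python
# Sketch of the closed-book prompt vs. the concatenation workaround.
# NOTE: these constants are assumptions modeled on training/consts.py,
# not the repo's verbatim values.
INSTRUCTION_KEY = "### Instruction:"
INPUT_KEY = "Input:"
RESPONSE_KEY = "### Response:"

# Closed-book: roughly what preprocess() builds today -- no context slot.
PROMPT_FOR_GENERATION_FORMAT = (
    f"{INSTRUCTION_KEY}\n{{instruction}}\n\n{RESPONSE_KEY}\n"
)

def closed_book_prompt(instruction: str) -> str:
    return PROMPT_FOR_GENERATION_FORMAT.format(instruction=instruction)

def open_book_via_concat(instruction: str, context: str) -> str:
    # The langchain-example workaround: splice INPUT_KEY and the context
    # into instruction_text before it reaches the pipeline.
    return closed_book_prompt(f"{instruction}\n\n{INPUT_KEY}\n{context}")

prompt = open_book_via_concat("Who wrote the book?", "The book was written by Ada.")
```

Since the context rides along inside the instruction, the pipeline itself never distinguishes open- from closed-book; whether that matches the training-time prompt closely enough is exactly the question raised above.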
This updates training to use the `databricks-dolly-15k` dataset. It also includes improvements to text generation and example notebooks.

Key Changes:
- The `train_dolly.py` notebook now uses Pythia models as the input models and fine-tunes using the `databricks-dolly-15k` dataset.
- `InstructionTextGenerationPipeline` for text generation. This is derived from the code in the model repo, `instruct_pipeline.py`. It has been improved so that it is compatible with the `TextGenerationPipeline` from the `transformers` library. Some code, such as that in `_forward`, was copied from that pipeline to help with compatibility. The biggest change relative to the current `instruct_pipeline.py` version is that it returns a list of dicts per instruction, rather than just a dict. It also now has a `return_full_text` option. Both of these contribute towards being usable with `langchain`.
- `generate_response` is now a wrapper around `InstructionTextGenerationPipeline`, as the code was all moved there.
- `trainer.py` now uses the local `databricks-dolly-15k.jsonl` dataset. A `text` column has been constructed from the instruction, context, and response.

Minor Changes:
- `experiment_id` widget to help keep track of different models that are fine-tuned.

Additional Changes:
- `generation.py` example notebook that uses `generate_response` on a couple of instructions.
- `langchain.py` example notebook that uses `HuggingFacePipeline` from `langchain` and `InstructionTextGenerationPipeline` to test instructions both with and without context.
- `pipeline.py` example notebook that uses `InstructionTextGenerationPipeline` to generate multiple samples per instruction.
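The `text` column mentioned above can be sketched as simple template formatting over each dataset record. The template and key strings here are assumptions modeled on `training/consts.py` (the with-context and no-context variants), not the repo's verbatim constants:

```python
# Hedged sketch of building the training "text" column from one
# databricks-dolly-15k record. The template strings are assumptions
# modeled on training/consts.py, not copied from it.
INSTRUCTION_KEY = "### Instruction:"
INPUT_KEY = "Input:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"

def build_text(record: dict) -> str:
    """Format an {instruction, context, response} record into a single
    training string, using the with-context template only when the
    record actually carries context."""
    instruction = record["instruction"]
    response = record["response"]
    context = record.get("context", "")
    if context:
        return (f"{INSTRUCTION_KEY}\n{instruction}\n\n"
                f"{INPUT_KEY}\n{context}\n\n"
                f"{RESPONSE_KEY}\n{response}\n\n{END_KEY}")
    return (f"{INSTRUCTION_KEY}\n{instruction}\n\n"
            f"{RESPONSE_KEY}\n{response}\n\n{END_KEY}")

with_ctx = build_text({"instruction": "Summarize the passage.",
                       "context": "Dolly is an instruction-tuned model.",
                       "response": "An instruction-tuned model."})
no_ctx = build_text({"instruction": "Say hello.", "response": "Hello."})
```

Branching on the presence of context is what lets one dataset serve both closed-book and open-book samples during fine-tuning.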