DPO #223
+696 −39
Merged
56 commits
818a162
initial dpo updates
422a78b
Merge branch 'main' into toby/dpo
40c96c8
dataset changes for dpo
tobyzl2 f7796d4
adding dpo loss
tobyzl2 54b686a
Merge remote-tracking branch 'origin/main' into toby/dpo
tobyzl2 3c0199f
packing disabled filter sequennces longer than seq length
tobyzl2 0e1335b
disable no packing for legacy sampling
tobyzl2 0e09098
adding dpo tests
tobyzl2 edca385
Merge branch 'main' of https://github.com/ServiceNow/Fast-LLM into to…
tobyzl2 1075176
small fix
tobyzl2 4156349
span tokenization updates
tobyzl2 9669211
enable chosen/rejected text for preparator
tobyzl2 257d236
removing assert
tobyzl2 aa8a871
moving dpo loss call
tobyzl2 d08bf4d
renaming
tobyzl2 b410210
padding fix
tobyzl2 366a20b
dpo config changes
tobyzl2 dca842e
memmap version fixes
tobyzl2 ca86694
removing dpo flags and new sampling class
tobyzl2 aa94f9a
removing extra lines
tobyzl2 7f37038
small data configuration updates
tobyzl2 0d7ccbd
update test case
tobyzl2 5fd1c86
logp span using index instead
tobyzl2 63db041
small updates
tobyzl2 dab6dab
small fix
tobyzl2 41fb3e3
fixing fim
tobyzl2 3d77986
adding checks for chosen/rej spans in memmap dataset
tobyzl2 905bc00
refractor to preprocessor
tobyzl2 1db18f9
merge
tobyzl2 52c8f9f
moving puse_pref_loss_spans to sampling parameters and combining samp…
tobyzl2 6ea086b
merge
tobyzl2 f51eedc
merge
tobyzl2 9067b6a
dpo loss enabling flag
tobyzl2 062ce88
check for config compatibility
tobyzl2 f53ac56
full dpo changes
tobyzl2 92f28ee
adding distillation model check
tobyzl2 2b2515f
update dpo test cases
tobyzl2 bd9142f
FFixing sampled for dpo
tobyzl2 4f26100
test case fixes
tobyzl2 41cc7fe
adding preference logps test case
tobyzl2 a6950f1
small fix
tobyzl2 fb9803d
higher mbs fixes
tobyzl2 723f30e
test higher mbs
tobyzl2 8063a21
small change
tobyzl2 c3a8ebb
updates
tobyzl2 db5242f
small changes
tobyzl2 ab139ca
small changes
tobyzl2 63041aa
remove comments
tobyzl2 e1c92f4
Merge branch 'main' of https://github.com/ServiceNow/Fast-LLM into to…
tobyzl2 e60ad62
maxlen consistency
tobyzl2 85613f7
remove comments
tobyzl2 2742692
refractoring
tobyzl2 29c9a4b
Merge branch 'main' of https://github.com/ServiceNow/Fast-LLM into to…
tobyzl2 8b837c0
fix
tobyzl2 16136ac
fix
tobyzl2 b64626c
merge
tobyzl2
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -34,13 +34,16 @@ def _init(self, name: str, prefix: pathlib.Path | str, num_documents: int | None | |
| self._name = name | ||
| self._prefix = pathlib.Path(prefix) | ||
| self._has_spans = 0 | ||
| + self._has_preference_spans = False | ||
||
| with self._prefix.with_suffix(".idx").open("rb") as stream: | ||
| Assert.eq(stream.read(9), MEMMAP_INDEX_HEADER, msg=f"File: {stream.name}") | ||
| self._version = struct.unpack("<Q", stream.read(8))[0] | ||
| - assert self._version in [1, 2], f"Unsupported version for gpt_memmap dataset: {self._version}." | ||
| - if self._version == 2: | ||
| + assert self._version in [1, 2, 3], f"Unsupported version for gpt_memmap dataset: {self._version}." | ||
| + if self._version >= 2: | ||
| self._has_spans = struct.unpack("<B", stream.read(1))[0] | ||
| + if self._version >= 3: | ||
| + self._has_preference_spans = struct.unpack("<B", stream.read(1))[0] | ||
||
| self._dtype = MEMMAP_DTYPES[struct.unpack("<B", stream.read(1))[0]].numpy | ||
| self._num_documents = struct.unpack("<Q", stream.read(8))[0] | ||
@@ -52,18 +55,23 @@ def _init(self, name: str, prefix: pathlib.Path | str, num_documents: int | None | |
||
| self._index_bin_buffer_mmap = np.memmap(self._prefix.with_suffix(".idx"), mode="r", order="C") | ||
| self._index_bin_buffer = memoryview(self._index_bin_buffer_mmap) | ||
||
| + # read document sizes | ||
| self._document_sizes = np.frombuffer( | ||
| self._index_bin_buffer, dtype=np.int32, count=self._num_documents, offset=offset | ||
| ) | ||
||
| + # read pointers | ||
| self._pointers = np.frombuffer( | ||
| self._index_bin_buffer, | ||
| dtype=np.int64, | ||
| count=self._num_documents, | ||
| offset=offset + self._document_sizes.nbytes, | ||
| ) | ||
||
| + # read spans | ||
| self._spans = None | ||
| - if self._has_spans and self._version == 2: | ||
| + if self._has_spans and self._version >= 2: | ||
| self._spans = [] | ||
| self._num_spans = np.frombuffer( | ||
| self._index_bin_buffer, | ||
@@ -83,6 +91,36 @@ def _init(self, name: str, prefix: pathlib.Path | str, num_documents: int | None | |
| ).reshape(-1, 2) | ||
| ) | ||
||
| + # read preference spans | ||
| + self._chosen_spans = None | ||
| + self._rejected_spans = None | ||
| + if self._has_preference_spans and self._version >= 3: | ||
Collaborator
Let's just set `self._has_preference_spans=False` for other versions.
| + self._chosen_spans = [] | ||
| + self._rejected_spans = [] | ||
| + chosen_span_offset = offset + self._document_sizes.nbytes + self._pointers.nbytes | ||
| + for idx in range(self._num_documents): | ||
| + self._chosen_spans.append( | ||
| + np.frombuffer( | ||
| + self._index_bin_buffer, | ||
| + dtype=np.int32, | ||
| + count=2, | ||
| + offset=chosen_span_offset + idx * 2 * np.dtype(np.int32).itemsize, | ||
| + ) | ||
| + ) | ||
||
| + rejected_span_offset = ( | ||
| + offset + self._document_sizes.nbytes + self._pointers.nbytes + np.array(self._chosen_spans).nbytes | ||
| + ) | ||
| + for idx in range(self._num_documents): | ||
| + self._rejected_spans.append( | ||
| + np.frombuffer( | ||
| + self._index_bin_buffer, | ||
| + dtype=np.int32, | ||
| + count=2, | ||
| + offset=rejected_span_offset + idx * 2 * np.dtype(np.int32).itemsize, | ||
| + ) | ||
| + ) | ||
tobyzl2 marked this conversation as resolved.
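The two loops above read one `(start, end)` pair per document. Since the chosen spans and then the rejected spans are written as contiguous blocks of `int32` pairs, the same data could also be viewed with one `np.frombuffer` call per block. The following is a minimal sketch under that layout assumption, not the code merged in this PR; `read_preference_spans`, `base_offset`, and the toy buffer are hypothetical names introduced for illustration.

```python
import numpy as np


def read_preference_spans(buffer, num_documents: int, base_offset: int):
    """Return (chosen, rejected) arrays of shape (num_documents, 2).

    Assumes the chosen block starts at `base_offset` and the rejected block
    follows it immediately, each holding one int32 (start, end) pair per document.
    """
    block_bytes = num_documents * 2 * np.dtype(np.int32).itemsize
    chosen = np.frombuffer(
        buffer, dtype=np.int32, count=2 * num_documents, offset=base_offset
    ).reshape(-1, 2)
    rejected = np.frombuffer(
        buffer, dtype=np.int32, count=2 * num_documents, offset=base_offset + block_bytes
    ).reshape(-1, 2)
    return chosen, rejected


# Toy buffer: 4 bytes standing in for earlier index data, then two documents' spans.
prefix_bytes = b"\x00" * 4
chosen_in = np.array([[3, 7], [2, 5]], dtype=np.int32)
rejected_in = np.array([[8, 12], [6, 9]], dtype=np.int32)
buffer = memoryview(prefix_bytes + chosen_in.tobytes() + rejected_in.tobytes())

chosen, rejected = read_preference_spans(buffer, num_documents=2, base_offset=len(prefix_bytes))
```

Indexing `chosen[idx]` then gives the same two-element view that the per-document loop stores, without building a Python list first.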
||
| self._bin_buffer_mmap = np.memmap(self._prefix.with_suffix(".bin"), mode="r", order="C") | ||
| self._bin_buffer = memoryview(self._bin_buffer_mmap) | ||
||
@@ -105,7 +143,12 @@ def __del__(self): | |
| del self._index_bin_buffer_mmap | ||
||
| def get( | ||
| - self, idx: int, offset: int = 0, length: int | None = None, use_loss_masking_spans: bool = False | ||
| + self, | ||
| + idx: int, | ||
| + offset: int = 0, | ||
| + length: int | None = None, | ||
| + use_loss_masking_spans: bool = False, | ||
| + use_preference_loss_spans: bool = False, | ||
| ) -> GPTSample: | ||
| token_ids = np.frombuffer( | ||
| self._bin_buffer, | ||
@@ -116,13 +159,53 @@ def get( | |
| sample_spans = None | ||
| if use_loss_masking_spans and self._spans is not None: | ||
| sample_spans = self._spans[idx] | ||
| - # adjust the spans for the offset and length | ||
||
| + # filter spans that are outside the range of the selected tokens in the document | ||
| sample_spans = sample_spans[ | ||
| (sample_spans[:, 0] < offset + len(token_ids)) & (sample_spans[:, 1] >= offset) | ||
| ] | ||
| - sample_spans[:, 0] = np.maximum(sample_spans[:, 0], offset) - offset | ||
||
| + # subtract by offset to normalize span boundaries | ||
| + sample_spans[:, 0] = np.maximum(sample_spans[:, 0], offset) - offset # offset | ||
| sample_spans[:, 1] = np.minimum(sample_spans[:, 1], offset + len(token_ids) - 1) - offset | ||
| - return GPTSample(token_ids=token_ids, loss_masking_spans=sample_spans) | ||
||
| + chosen_span = None | ||
| + rejected_span = None | ||
||
| + if use_preference_loss_spans: | ||
| + if not self._has_preference_spans: | ||
| + raise ValueError("No preference spans found in memmap dataset.") | ||
| + elif self._has_preference_spans and self._chosen_spans is None: | ||
| + raise ValueError("Failed to read chosen spans from memmap dataset.") | ||
| + elif self._has_preference_spans and self._rejected_spans is None: | ||
| + raise ValueError("Failed to read rejected spans from memmap dataset.") | ||
| + else: | ||
| + chosen_span = self._chosen_spans[idx] | ||
||
| + # filter spans that are outside the range of the selected tokens in the document | ||
| + chosen_span = chosen_span[(chosen_span[0] < offset + len(token_ids)) & (chosen_span[1] >= offset)][0] | ||
||
| + # subtract by offset to normalize span boundaries | ||
| + chosen_span[0] = np.maximum(chosen_span[0], offset) - offset # offset | ||
| + chosen_span[1] = np.minimum(chosen_span[1], offset + len(token_ids) - 1) - offset | ||
||
| + rejected_span = self._rejected_spans[idx] | ||
||
| + # filter spans that are outside the range of the selected tokens in the document | ||
| + rejected_span = rejected_span[ | ||
| + (rejected_span[0] < offset + len(token_ids)) & (rejected_span[1] >= offset) | ||
| + ][0] | ||
||
| + # subtract by offset to normalize span boundaries | ||
| + rejected_span[0] = np.maximum(rejected_span[0], offset) - offset # offset | ||
| + rejected_span[1] = np.minimum(rejected_span[1], offset + len(token_ids) - 1) - offset | ||
||
| + return GPTSample( | ||
| + token_ids=token_ids, | ||
| + loss_masking_spans=sample_spans, | ||
| + chosen_span=chosen_span, | ||
| + rejected_span=rejected_span, | ||
| + ) | ||
||
| @property | ||
| def name(self) -> str: | ||
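The filter-then-clip steps for `chosen_span` and `rejected_span` in `get` are compact but easy to misread. A single span is a 1-D array of two values, so indexing it with a scalar boolean yields an array with a leading axis of length 1 (the span overlaps the window) or length 0 (it does not), and the trailing `[0]` then either extracts the span or raises `IndexError`. The sketch below spells out the same clipping with an explicit overlap check, assuming inclusive `(start, end)` spans as the `- 1` in the diff suggests. `clip_span` is a hypothetical helper, not part of the PR.

```python
import numpy as np


def clip_span(span, offset: int, length: int):
    """Clip an inclusive (start, end) span to a token window, relative to the window start.

    Returns None when the span does not overlap [offset, offset + length).
    """
    start, end = int(span[0]), int(span[1])
    if not (start < offset + length and end >= offset):
        return None
    return np.array(
        [max(start, offset) - offset, min(end, offset + length - 1) - offset],
        dtype=np.int32,
    )


span = np.array([5, 12], dtype=np.int32)
inside = clip_span(span, offset=8, length=10)   # window covers tokens 8..17
outside = clip_span(span, offset=20, length=4)  # window covers tokens 20..23

# Scalar-boolean indexing, as used in the diff, adds a leading axis of length 1 or 0.
kept_shape = span[np.bool_(True)].shape
dropped_shape = span[np.bool_(False)].shape
```

Returning `None` for a non-overlapping span is a design choice of this sketch. The code in the diff would instead fail on the `[0]` lookup in that case.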
@@ -157,6 +240,8 @@ def write_dataset(cls, prefix: pathlib.Path | str, documents: typing.Iterable[GP | |
| # number of spans for each document | ||
| num_spans = [] | ||
| spans = [] | ||
| + chosen_spans = [] | ||
| + rejected_spans = [] | ||
||
| prefix = pathlib.Path(prefix) | ||
| prefix.parent.mkdir(parents=True, exist_ok=True) | ||
@@ -182,6 +267,10 @@ def write_dataset(cls, prefix: pathlib.Path | str, documents: typing.Iterable[GP | |
| if document.loss_masking_spans is not None: | ||
| num_spans.append(len(document.loss_masking_spans)) | ||
| spans.append(document.loss_masking_spans) | ||
| + if document.chosen_span is not None: | ||
| + chosen_spans.append(document.chosen_span) | ||
| + if document.rejected_span is not None: | ||
| + rejected_spans.append(document.rejected_span) | ||
| offset += doc_length * np.dtype(dtype).itemsize | ||
| num_documents += 1 | ||
||
@@ -193,15 +282,20 @@ def write_dataset(cls, prefix: pathlib.Path | str, documents: typing.Iterable[GP | |
| spans = np.vstack(spans, dtype=np.int32) | ||
| else: | ||
| spans = np.array(spans, dtype=np.int32) | ||
| + chosen_spans = np.array(chosen_spans, dtype=np.int32).reshape(-1, 2) | ||
| + rejected_spans = np.array(rejected_spans, dtype=np.int32).reshape(-1, 2) | ||
||
| # Write the index file (.idx) | ||
| with prefix.with_suffix(".idx").open("wb") as idx_stream: | ||
| idx_stream.write(MEMMAP_INDEX_HEADER) | ||
| # Indicates the version | ||
| # Version 2 optionally adds loss-masking spans | ||
| - idx_stream.write(struct.pack("<Q", 2)) | ||
| + # Version 3 optionally adds chosen/rejected spans | ||
| + idx_stream.write(struct.pack("<Q", 3)) | ||
| # Flag to indicate whether loss-masking spans are present | ||
| idx_stream.write(struct.pack("<B", 1 if spans.size > 0 else 0)) | ||
| + # Flag to indicate whether preference loss-masking spans are present | ||
| + idx_stream.write(struct.pack("<B", 1 if chosen_spans.size > 0 and rejected_spans.size > 0 else 0)) | ||
| # Data type | ||
| idx_stream.write(struct.pack("<B", MEMMAP_DTYPES_INV[DataType.from_numpy(dtype.type)])) | ||
| # "Number of sequences", same as documents in our case | ||
@@ -216,5 +310,9 @@ def write_dataset(cls, prefix: pathlib.Path | str, documents: typing.Iterable[GP | |
| idx_stream.write(num_spans.tobytes(order="C")) | ||
| # Span indices for each document | ||
| idx_stream.write(spans.tobytes(order="C")) | ||
| + # Chosen indices for each document | ||
| + idx_stream.write(chosen_spans.tobytes(order="C")) | ||
| + # Rejected indices for each document | ||
| + idx_stream.write(rejected_spans.tobytes(order="C")) | ||
| # Document indices, unused but needed for compatibility with Megatron-LM | ||
| idx_stream.write(np.arange(num_documents + 1, dtype=np.int64).tobytes(order="C")) | ||
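To make the version-gated header concrete, the sketch below writes and re-reads only the fields this diff touches: the format version, the loss-masking-span flag, and the preference-span flag. Flags that a version does not store default to `False`, in line with the reviewer's suggestion. This is an illustration under assumptions rather than the project's implementation. The magic bytes are a placeholder (the real `MEMMAP_INDEX_HEADER` value is not shown in this diff), the function names are hypothetical, and every field after the flags is omitted.

```python
import io
import struct

# Placeholder 9-byte magic; the project's real MEMMAP_INDEX_HEADER value is not shown in this diff.
FAKE_HEADER = b"MMIDXFAKE"


def write_header(stream, version: int, has_spans: bool, has_preference_spans: bool) -> None:
    """Write the magic bytes, the version, and whichever flags that version stores."""
    stream.write(FAKE_HEADER)
    stream.write(struct.pack("<Q", version))
    if version >= 2:
        stream.write(struct.pack("<B", int(has_spans)))
    if version >= 3:
        stream.write(struct.pack("<B", int(has_preference_spans)))


def read_header(stream) -> tuple[int, bool, bool]:
    """Read back (version, has_spans, has_preference_spans)."""
    if stream.read(len(FAKE_HEADER)) != FAKE_HEADER:
        raise ValueError("Unexpected index header")
    version = struct.unpack("<Q", stream.read(8))[0]
    if version not in (1, 2, 3):
        raise ValueError(f"Unsupported version: {version}")
    # Flags that a given version does not store default to False.
    has_spans = bool(struct.unpack("<B", stream.read(1))[0]) if version >= 2 else False
    has_preference_spans = bool(struct.unpack("<B", stream.read(1))[0]) if version >= 3 else False
    return version, has_spans, has_preference_spans


def round_trip(version: int, has_spans: bool, has_preference_spans: bool):
    stream = io.BytesIO()
    write_header(stream, version, has_spans, has_preference_spans)
    stream.seek(0)
    return read_header(stream)


v3 = round_trip(3, False, True)
v2 = round_trip(2, True, True)  # the preference flag is not stored in version 2
v1 = round_trip(1, True, True)  # neither flag is stored in version 1
```

The writer in the diff always emits version 3. The `version` parameter here exists only so the older layouts can be exercised in the same sketch.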