[Fix] move calibrate load dataset location#4289
Conversation
There was a problem hiding this comment.
Pull request overview
Moves Hugging Face datasets loading/imports into get_calib_loaders to prevent datasets being imported at module import time when lite calibration isn’t used.
Changes:
- Removed module-level
datasetsimports and relocatedload_datasetcalls intoget_calib_loaders. - Updated dataset-specific helper functions (
get_wikitext2,get_c4, etc.) to accept a pre-loadeddatasetargument instead of callingload_datasetinternally.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Returns: | ||
| List of sampled and tokenized training examples. | ||
| """ | ||
| from datasets import VerificationMode, load_dataset |
There was a problem hiding this comment.
get_calib_loaders unconditionally imports datasets at runtime. If datasets is an optional dependency (as implied by the motivation), consider wrapping this import in a try/except ImportError and raising a clearer error instructing users how to install the lite/extra requirements, rather than propagating a bare ModuleNotFoundError.
| from datasets import VerificationMode, load_dataset | |
| try: | |
| from datasets import VerificationMode, load_dataset | |
| except ImportError as e: | |
| raise ImportError( | |
| "The 'datasets' package is required to use calibration data loaders. " | |
| "Please install lmdeploy with the 'lite' extra (e.g. `pip install lmdeploy[lite]`) " | |
| "or install the 'datasets' package directly." | |
| ) from e |
| raise InterruptedError('There have been some issues when generating ' | ||
| 'the dataset, you could try to download it ' | ||
| 'locally first, and replace the `data_files`' | ||
| 'with local addresses or use other datasets ' |
There was a problem hiding this comment.
The InterruptedError message concatenates two string literals without a separating space: "replace the data_files" + "with local addresses...", which renders as "data_fileswith". Add a trailing/leading space so the error message is readable.
| 'with local addresses or use other datasets ' | |
| ' with local addresses or use other datasets ' |
| from datasets import VerificationMode, load_dataset | ||
| if 'wikitext2' in name: | ||
| return get_wikitext2(tokenizer, nsamples, seed, seqlen) | ||
| dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train') | ||
| return get_wikitext2(dataset, tokenizer, nsamples, seed, seqlen) |
There was a problem hiding this comment.
This change is intended to avoid importing datasets at module import time; there are no unit tests covering this behavior (e.g., that import lmdeploy.lite.utils.calib_dataloader succeeds when datasets is not installed, and that get_calib_loaders still calls load_dataset for each supported name). Please add tests that mock/monkeypatch datasets.load_dataset (and simulate ImportError for datasets) to prevent regressions without requiring network downloads.
Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily receiving feedbacks. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
Motivation
Avoiding finding dataset module when not using lite module.
Modification
Move load_dataset location from outside to the function of get_calib_loaders in lmdeploy/lmdeploy/lite/utils/calib_dataloader.py.
Checklist