
feat: Implement lazy data loading for Dataset#246

Merged
zhaochenyang20 merged 14 commits into radixark:main from Ratish1:data-loading
Dec 31, 2025

Conversation

@Ratish1
Contributor

@Ratish1 Ratish1 commented Nov 24, 2025

This PR improves the `Dataset` class in `miles/utils/data.py` to support lazy data loading, filtering, and indexed access for large datasets. It addresses memory-consumption issues in SFT workloads by avoiding in-memory materialization of entire datasets. @fzyzcjy #226

  • Modified `read_file` to use row-by-row iteration for JSONL and batched reading for Parquet.
  • Refactored `Dataset` to build a lightweight index of valid sample locations on initialization, applying `max_length` filtering during this phase.
  • Implemented `__getitem__` to read and process only the requested sample on demand.

@fzyzcjy
Collaborator

fzyzcjy commented Nov 24, 2025

Too busy now, I will try to squeeze some time and have a review later :)

@Ratish1
Contributor Author

Ratish1 commented Nov 24, 2025

No problem

@Ratish1
Contributor Author

Ratish1 commented Nov 25, 2025

Hello @fzyzcjy, I wanted to follow up on this PR, but if you are busy it's fine, you can look at it later. Thanks.

@fzyzcjy
Collaborator

fzyzcjy commented Nov 25, 2025

After an internal discussion, we think the most ideal code may be somewhat different. Therefore, we may need to wait a little bit.

@Ratish1
Contributor Author

Ratish1 commented Dec 7, 2025

Hello @fzyzcjy , I wanted to follow up on this. Thanks.

@lhoestq

lhoestq commented Dec 8, 2025

Hi ! Quentin from HF Datasets here :)

I was checking whether I could help with anything, in particular if you are interested in loading larger-than-RAM datasets. For example, the `datasets` library loads many formats like Parquet and JSON, makes tokenization easy, and works with larger-than-RAM datasets since the data is loaded from memory-mapped Arrow files on disk. It also has streaming features to load datasets bigger than disk.

@fzyzcjy
Collaborator

fzyzcjy commented Dec 9, 2025

@Ratish1 I am really too busy recently to check more details about this :/ @zhaochenyang20 do you want to find someone to review these?

@lhoestq looks great! Yes, it would be great to handle huge datasets here while keeping the code clean, because the framework supports SFT and can have big datasets.

@Ratish1
Contributor Author

Ratish1 commented Dec 9, 2025

No problem @fzyzcjy, maybe I can work with @lhoestq and use the HF `datasets` library here.

@zhaochenyang20
Collaborator

@Ratish1 There are merge conflicts right now.

Contributor

@PopSoda2002 PopSoda2002 left a comment


Hi, thanks for your effort and contribution! Nice work! Can I ask why you did not choose memmap in Python, and can you share some comparison of before and after?

@PopSoda2002
Contributor

Hi @Ratish1, are you working with `datasets` from HF? If so, that would be nice.

@Ratish1
Contributor Author

Ratish1 commented Dec 11, 2025

> Hi @Ratish1, are you working with `datasets` from HF? If so, that would be nice.

No I'm not. I will refactor this code with the datasets library instead now. Thanks.

@Ratish1 Ratish1 requested a review from PopSoda2002 December 11, 2025 10:40
Contributor

@PopSoda2002 PopSoda2002 left a comment


Hi, thanks for the effort, can you please show some results or comparison?

@Ratish1
Contributor Author

Ratish1 commented Dec 12, 2025

> Hi, thanks for the effort, can you please show some results or comparison?

Hi, I haven't run a comparison, since I think `datasets` would be much faster for large datasets, but I ran a memory benchmark using a 500 MB dummy JSONL file (5 million rows) to verify the lazy loading of the current implementation. This is what I got:

BENCHMARK RESULTS (Lazy Loading)
Initial Memory:   371.73 MiB
Peak Memory:      469.66 MiB
Memory Increment: 97.93 MiB
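The benchmark script itself is not included in the thread; a rough stdlib sketch of how the peak memory of a lazy row-by-row pass could be measured (sizes here are scaled way down from the 500 MB / 5-million-row file):

```python
import json
import os
import tempfile
import tracemalloc

# Build a small dummy JSONL file.
fd, path = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
with open(path, "w") as f:
    for i in range(10_000):
        f.write(json.dumps({"text": f"sample {i}"}) + "\n")


def iterate_lazily(p):
    # One row at a time: only the current line is ever held in memory.
    with open(p, "r") as f:
        for line in f:
            _ = json.loads(line)


tracemalloc.start()
iterate_lazily(path)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
os.remove(path)
print(f"Peak traced memory during lazy pass: {peak / 1024:.1f} KiB")
```

With lazy iteration the peak stays roughly constant regardless of file size, which is the behavior the numbers above are meant to demonstrate.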

@Ratish1 Ratish1 requested a review from PopSoda2002 December 12, 2025 07:45
Contributor

@PopSoda2002 PopSoda2002 left a comment


It looks good to me, but I think it needs a more careful look!


@lhoestq lhoestq left a comment


I had a quick look and it looks good to me, I just had one small nit:

Comment on lines +190 to +193
if self.tool_key is not None and self.tool_key in data:
    tools = data[self.tool_key]
    if isinstance(tools, str):
        tools = json.loads(tools)
Collaborator


This is just a comment; it may not need to be done this time:

You are parsing JSON and building messages dynamically in every `__getitem__` call. While this saves RAM, it adds significant CPU overhead during the training loop.

If the JSON parsing is heavy, we might need to use hf_dataset.map() during init to pre-process these fields into a more efficient format (Arrow-native), rather than parsing raw strings on the fly.

Contributor Author

@Ratish1 Ratish1 Dec 30, 2025


Yes, this is correct. But I have not made this change yet: since our primary goal for this PR was resolving the RAM spike during initialization, I believe this lazy approach is the safest first step. If we find that JSON parsing becomes a bottleneck for GPU throughput in future benchmarks, we can definitely change it to a `.map()`-based pre-processing step to offload that work to the initialization phase.

@zhaochenyang20
Collaborator

Reproduce/Verification commands:

docker pull radixark/miles:latest

docker run --rm --gpus all --ipc=host --shm-size=16g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -it radixark/miles:latest /bin/bash

rm -rf /root/miles

git clone -b data-loading https://github.com/Ratish1/miles.git

cd miles

pip install -e .

hf download zai-org/GLM-Z1-9B-0414 --local-dir /data/ratish/GLM-Z1-9B-0414

hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /data/ratish/dapo-math-17k

source scripts/models/glm4-9B.sh

PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py ${MODEL_ARGS[@]} --hf-checkpoint /data/ratish/GLM-Z1-9B-0414 --save /data/ratish/GLM-Z1-9B-0414_torch_dist

export CUDA_DEVICE_MAX_CONNECTIONS=1
export CUDA_VISIBLE_DEVICES=5,7

PYTHONPATH=/root/Megatron-LM python train.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /data/ratish/GLM-Z1-9B-0414 \
    --load /data/ratish/GLM-Z1-9B-0414_torch_dist \
    --prompt-data /data/ratish/dapo-math-17k/dapo-math-17k.jsonl \
    --input-key prompt \
    --label-key label \
    --apply-chat-template \
    --use-miles-router \
    --num-proc 4 \
    --rollout-batch-size 16 \
    --global-batch-size 16 \
    --n-samples-per-prompt 1 \
    --num-rollout 1 \
    --colocate \
    --sglang-mem-fraction-static 0.8

Contributor

@PopSoda2002 PopSoda2002 left a comment


LGTM! And I think adding a short doc explaining this a little would be better!

@zhaochenyang20 zhaochenyang20 merged commit 9a3b297 into radixark:main Dec 31, 2025
3 checks passed
@nanjiangwill

nanjiangwill commented Dec 31, 2025

@Ratish1 hi, I am wondering if lazy loading is still working during filtering: `ds = ds.filter(partial(_filter_func, **filter_kwargs), num_proc=num_proc, desc="Filtering invalid samples")` (i.e. no peak memory consumption)?

@zhaochenyang20
Collaborator

> @Ratish1 hi, I am wondering if lazy loading is still working during filtering: `ds = ds.filter(partial(_filter_func, **filter_kwargs), num_proc=num_proc, desc="Filtering invalid samples")` (i.e. no peak memory consumption)?

Hi, I've double-checked the underlying mechanism. ds.filter does indeed maintain the lazy loading behavior.

Under the hood, datasets processes the data in batches by reading them from the memory-mapped (mmap) storage into RAM. Once the filter function is executed, the Python objects are immediately released. It does not materialize the entire dataset in memory at once. Therefore, even during the filtering phase, memory consumption stays at a small constant level (determined by the batch_size and num_proc), avoiding any memory peaks that scale with the dataset size. This is exactly why integrating the datasets library is so beneficial for our use case.

zhaochenyang20 added a commit that referenced this pull request Dec 31, 2025
@zhaochenyang20
Collaborator

#372 (comment)

Sorry, my bad. This has been reverted. We need more unit tests for dataset consistency. 😭
