@fsiino-nvidia fsiino-nvidia commented Sep 15, 2025

This change updates train_data_utils, via ng_prepare_data, to apply data aggregations to the other keys within an example.jsonl file.

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
copy-pr-bot bot commented Sep 15, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@fsiino-nvidia fsiino-nvidia force-pushed the fsiino/prepare-data-aggregations branch from 3c7774c to f696eb7 Compare September 15, 2025 20:35
@fsiino-nvidia fsiino-nvidia marked this pull request as ready for review September 15, 2025 21:11
@fsiino-nvidia fsiino-nvidia requested a review from a team as a code owner September 18, 2025 00:00
…ggregations

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…odel

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
        self, dataset_config: DatasetConfig
    ) -> DatasetValidatorState:
        state = DatasetValidatorState()
        data = []

having data be a list here is going to increase memory consumption like crazy. let's fold the aggregate_other_metrics call into _validate_samples_and_aggregate_metrics_single_sample
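
One way to read this suggestion, as a minimal sketch: keep a small running accumulator and update it once per sample inside the per-sample step, so no list of responses is ever held in memory. `RunningStats` and `process_sample` are hypothetical names used only for illustration; they are not the classes or functions in this repository.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical accumulator; memory use is constant in dataset size."""
    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, value: float) -> None:
        # Fold one observation into the running summary.
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def process_sample(sample: dict, stats: dict) -> None:
    """Hypothetical per-sample step: fold numeric keys into running stats."""
    for key, value in sample.items():
        # bool is a subclass of int; skip it so flags are not averaged.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            stats.setdefault(key, RunningStats()).update(float(value))


stats: dict = {}
for sample in ({"reward": 1.0, "turns": 3}, {"reward": 0.0, "turns": 5}):
    process_sample(sample, stats)  # each sample is dropped after this call
```

The trade-off is that only statistics computable in one pass (count, sum, min, max, mean) fit this shape; anything needing the full distribution, such as an exact median, would need a different approach.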

def get_aggregate_metrics(data: List[DatasetViewerVerifyResponse], raw_lines: List[str]) -> Dict[str, Any]:
def get_aggregate_metrics(raw_lines: List[str]) -> Dict[str, Any]:
    dataset_metrics = DatasetMetrics()
    line_dicts = []

same comment as below. ideally we would try to save on the memory here
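
The same idea can be applied to the raw lines: parse each line, fold it into the aggregate, and discard it, rather than collecting everything into `line_dicts`. A minimal sketch, assuming string-valued keys are summarized by counting their values; `aggregate_string_keys` is a hypothetical helper, not the function changed in this PR.

```python
import json
from collections import Counter, defaultdict
from typing import Dict, Iterable


def aggregate_string_keys(raw_lines: Iterable[str]) -> Dict[str, Counter]:
    """Count string values per key, holding one parsed record at a time."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL input
        record = json.loads(line)
        for key, value in record.items():
            if isinstance(value, str):
                counts[key][value] += 1
    return dict(counts)


lines = iter([
    '{"task": "math", "id": 1}',
    '{"task": "code", "id": 2}',
    '{"task": "math", "id": 3}',
])
result = aggregate_string_keys(lines)
```

Because the helper accepts any iterable of strings, an open file handle could be passed directly so the file is never fully loaded.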

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
@bxyu-nvidia bxyu-nvidia merged commit 0c1b1f1 into main Sep 19, 2025
5 checks passed
@bxyu-nvidia bxyu-nvidia deleted the fsiino/prepare-data-aggregations branch September 19, 2025 00:03
abhibha-nvidia pushed a commit that referenced this pull request Sep 28, 2025
This change updates train_data_utils, via `ng_prepare_data`, to apply
data aggregations to the other keys within an `example.jsonl` file.

---------

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Co-authored-by: bxyu-nvidia <bxyu@nvidia.com>
abhibha-nvidia pushed a commit that referenced this pull request Sep 29, 2025
This change updates train_data_utils, via `ng_prepare_data`, to apply
data aggregations to the other keys within an `example.jsonl` file.

---------

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Co-authored-by: bxyu-nvidia <bxyu@nvidia.com>