Classify Hugging Face tokenizer and runtime artifacts #529

Merged
chlins merged 1 commit into main from codex/hf-file-type-classification
May 6, 2026

@aftersnow
Contributor

Summary

  • classify common Hugging Face tokenizer assets as model weight config instead of doc/code
  • classify ONNX external data, Core ML MIL files, and checkpoint tensor shards as model weights
  • update modelfile workspace classification tests for the new Hugging Face file patterns
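
The ordering idea behind these bullets can be sketched as below. This is a minimal illustration, not the code in pkg/modelfile/constants.go: the pattern lists, the `fileType` names, and the `inferFileType` helper are all hypothetical, and the shard pattern assumes a `*.data-N-of-M` naming scheme that the PR does not spell out. The point it shows is that exact tokenizer filenames are checked before a generic `*.txt` documentation pattern, so `merges.txt` ends up as config rather than doc.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// fileType is a hypothetical label set; the real project defines its own.
type fileType string

const (
	typeConfig  fileType = "config"
	typeModel   fileType = "model"
	typeDoc     fileType = "doc"
	typeUnknown fileType = "unknown"
)

// rules are evaluated top to bottom, so specific tokenizer names win over
// the generic "*.txt" documentation pattern. All patterns are assumptions
// chosen for illustration.
var rules = []struct {
	kind     fileType
	patterns []string
}{
	{typeConfig, []string{"merges.txt", "vocab.txt", "tokenizer.model", "*.tiktoken", "*.jinja"}},
	{typeModel, []string{"*.safetensors", "*.onnx_data", "*.mil", "*.data-*-of-*"}},
	{typeDoc, []string{"*.txt", "*.md"}},
}

// inferFileType matches the lowercased base name of path against each rule
// in order and returns the first kind whose pattern matches.
func inferFileType(path string) fileType {
	name := strings.ToLower(filepath.Base(path))
	for _, r := range rules {
		for _, p := range r.patterns {
			if ok, err := filepath.Match(p, name); err == nil && ok {
				return r.kind
			}
		}
	}
	return typeUnknown
}

func main() {
	for _, f := range []string{
		"tokenizer/merges.txt",
		"notes.txt",
		"model.onnx_data",
		"model.ckpt.data-00000-of-00001",
	} {
		fmt.Printf("%s -> %s\n", f, inferFileType(f))
	}
}
```

Keeping the rules in an ordered slice rather than a map makes the precedence explicit, which matters whenever a specific filename also matches a broader glob.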

Tests

  • env GOCACHE=/tmp/modctl-gocache go test ./pkg/modelfile
  • env GOCACHE=/tmp/modctl-gocache go test ./pkg/backend/...

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request expands the file type classification system by adding support for various tokenizer files (TikToken, SentencePiece, Jinja templates), sharded checkpoint tensors, ONNX external data, and Core ML intermediate language files. It also includes comprehensive unit tests to verify these new patterns and ensures that files like merges.txt are correctly classified as configuration rather than documentation. I have no feedback to provide.

Copilot AI left a comment

Pull request overview

This PR updates modelfile workspace file-type classification to treat common Hugging Face tokenizer/runtime artifacts as config/model-weight inputs (rather than docs/code), and extends tests accordingly.

Changes:

  • Reclassify tokenizer assets like merges.txt, vocab.txt, SentencePiece and tiktoken artifacts as config files.
  • Classify additional runtime/weight artifacts as model files (e.g., ONNX external data, Core ML .mil, sharded checkpoint tensors).
  • Expand/adjust modelfile classification tests to cover the new patterns.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pkg/modelfile/modelfile_test.go | Updates workspace classification expectations and adds a new test case for HF tokenizer/runtime artifacts. |
| pkg/modelfile/constants_test.go | Adds/updates tests validating new config/model pattern matches and InferFileType behavior. |
| pkg/modelfile/constants.go | Extends config/model glob patterns to cover additional HF tokenizer and runtime/model-weight artifact filenames. |

Comment thread pkg/modelfile/constants.go
Member

@chlins chlins left a comment

lgtm

@chlins chlins enabled auto-merge (squash) May 6, 2026 02:41
@chlins chlins merged commit c7a7bf5 into main May 6, 2026
9 checks passed
@chlins chlins deleted the codex/hf-file-type-classification branch May 6, 2026 02:41
3 participants