fix: remove tokenizer.json files from git (757K LOC → GitHub Release)
Removed from git tracking:
data/jina-v3-hdr/tokenizer.json (8.7 MB, XLM-RoBERTa 250K)
data/bge-m3-hdr/tokenizer.json (8.7 MB, XLM-RoBERTa 250K)
data/jina-v5-tokenizer.json (11.4 MB, Qwen3 151K — 757K lines!)
data/xlm-roberta-de/tokenizer.json (8.7 MB, German NER)
Files stay on disk (gitignored) for local development.
tokenizer_registry.rs already has a from_pretrained() fallback
that downloads from HuggingFace if the local file is missing.
Upload to GitHub Release for offline environments.
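
A minimal sketch of the local-first lookup described above, assuming the
registry checks for the file on disk before falling back to a download.
The names TokenizerSource and resolve_tokenizer_source and the hub id
"example-org/example-model" are hypothetical and are not taken from
tokenizer_registry.rs; the sketch only decides where to load from and
performs no download itself.

```rust
use std::path::{Path, PathBuf};

/// Where a tokenizer definition should be loaded from.
/// Hypothetical type for illustration; not the project's actual API.
#[derive(Debug, PartialEq)]
enum TokenizerSource {
    /// A tokenizer.json present on disk (gitignored, or unpacked from a release asset).
    Local(PathBuf),
    /// A hub repository id to download from when no local file exists.
    Remote(String),
}

/// Prefer the local file; fall back to the hub id when the file is missing.
fn resolve_tokenizer_source(local: &Path, hub_id: &str) -> TokenizerSource {
    if local.is_file() {
        TokenizerSource::Local(local.to_path_buf())
    } else {
        TokenizerSource::Remote(hub_id.to_string())
    }
}
```

With this shape, an offline environment only needs the release asset
unpacked to the expected path for the Local branch to be taken; a fresh
clone with network access takes the Remote branch.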
https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
#120