This repo generates a model that detects the language of a code snippet.
It's meant as a reference rather than something you can clone and run to produce a model without a bit of effort.
The goal is to do language detection (much) better than Highlight.js, using ML/AI, but at a very small size (< 1 MB shipped to the browser).
The web/ directory has a simple demo where you can type code in a browser and see the language detected.
Live demo: https://langid.dgapps.io
Local preview:

```shell
npm i
npm run dev
```
main.py calls all the major components:
- Load a dataset
- Extract features from the dataset
- Train a model on the features
- Analyze the model
Running main.py produces assets in models/ and features/ (both gitignored, so they aren't committed to the repo). It also writes a copy of the model to web/public/model.json, which the demo site uses.
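A minimal sketch of the pipeline main.py drives. The function names and the toy implementations are illustrative only, not the repo's actual code:

```python
# Hypothetical sketch of main.py's flow: load -> featurize -> train -> export.
import json

def load_dataset():
    # Stand-in for the real loaders described below (The Stack v2, CSN, etc.).
    return [("print('hi')", "Python"), ("console.log('hi')", "JavaScript")]

def extract_features(samples):
    # Toy features: the set of non-alphanumeric, non-space characters.
    return [({ch for ch in code if not ch.isalnum() and not ch.isspace()}, lang)
            for code, lang in samples]

def train_model(features):
    # Toy model: remember which symbols were seen for each language.
    model = {}
    for symbols, lang in features:
        model.setdefault(lang, set()).update(symbols)
    return model

samples = load_dataset()
features = extract_features(samples)
model = train_model(features)

# main.py writes the trained model to web/public/model.json for the demo site;
# here we just serialize the toy model the same way.
serialized = json.dumps({lang: sorted(syms) for lang, syms in model.items()})
```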
https://huggingface.co/datasets/bigcode/the-stack-v2
Scripts in data/the_stack pull the data and split it into n-line snippets, with some minimal cleaning.
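The splitting step might look like this sketch; the chunk size and the cleaning rules here are assumptions, not the repo's exact logic:

```python
def split_into_snippets(source: str, n: int = 10) -> list[str]:
    # Minimal cleaning: drop blank lines and strip trailing whitespace.
    lines = [ln.rstrip() for ln in source.splitlines() if ln.strip()]
    # Chunk into n-line snippets, discarding a short trailing remainder.
    return ["\n".join(lines[i:i + n])
            for i in range(0, len(lines) - n + 1, n)]
```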
https://github.com/smola/language-dataset
That repo doesn't work on Windows (some directory names are invalid there), so I've converted it into a sqlite database and put it in data/local/.
~200 languages, 10-30 samples of each
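Reading samples back out of that sqlite database could look like this sketch. The table and column names are assumptions; the in-memory database stands in for the real file under data/local/:

```python
import sqlite3

# In-memory stand-in for the real database under data/local/;
# the schema (table "samples", columns "language"/"content") is assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (language TEXT, content TEXT)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [("Python", "print('hi')"), ("Ruby", "puts 'hi'")])

def samples_for(conn, language):
    # Fetch all stored snippets for one language.
    rows = conn.execute(
        "SELECT content FROM samples WHERE language = ?", (language,))
    return [content for (content,) in rows]
```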
https://huggingface.co/datasets/code-search-net/code_search_net. Store the unzipped language folders under data/local/csn/<language>/final/jsonl/<split>/*.jsonl.gz.
6 languages, millions of samples
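Those jsonl.gz files are one JSON object per line; a loader sketch (the "language" and "code" field names match the published CodeSearchNet schema, but verify against your download):

```python
import gzip
import json
import os
import tempfile

def read_jsonl_gz(path):
    # Each line is one JSON object; CSN rows include "language" and "code".
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Tiny round-trip demo with a synthetic file; the real files live under
# data/local/csn/<language>/final/jsonl/<split>/.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"language": "python", "code": "def f(): pass"}) + "\n")

rows = list(read_jsonl_gz(path))
```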
https://huggingface.co/datasets/pharaouk/rosetta-code
I've filtered the Rosetta Code dataset to only the languages detected by Medium.com.
Rosetta doesn't cover all of them (e.g. GraphQL, markdown, yaml), but that still leaves 29 languages. The drawback is that Rosetta is all program code, so it's lacking in scripts, css, html, etc.
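The filtering step itself is simple; a sketch, where the allowed-language set shown is an illustrative subset, not the actual Medium.com list:

```python
# Illustrative subset; the repo's real filter uses Medium.com's detected
# languages, which I don't reproduce exactly here.
MEDIUM_LANGUAGES = {"Python", "Ruby", "Go", "Rust", "JavaScript", "C++"}

def filter_rosetta(rows, allowed=MEDIUM_LANGUAGES):
    # Keep only rows whose language is in the allowed set.
    return [row for row in rows if row["language"] in allowed]

rows = [
    {"language": "Python", "code": "print(1)"},
    {"language": "Brainfuck", "code": "+++."},
]
kept = filter_rosetta(rows)
```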
features.py processes the data to create features.
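One common approach for this kind of task, and a plausible sketch of what features.py might do, is hashed character n-grams (an assumption on my part, not a description of the repo's actual features):

```python
def char_ngrams(code: str, n: int = 3) -> list[str]:
    # Overlapping character trigrams, e.g. "abcd" -> ["abc", "bcd"].
    return [code[i:i + n] for i in range(len(code) - n + 1)]

def hashed_features(code: str, dims: int = 1024) -> list[int]:
    # Bucket n-gram counts into a fixed-size vector ("hashing trick").
    # Note: Python's hash() is salted per interpreter run; a real pipeline
    # would use a stable hash so features are reproducible across runs.
    vec = [0] * dims
    for gram in char_ngrams(code):
        vec[hash(gram) % dims] += 1
    return vec
```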
train_model.py trains a model to predict the target language from the features.
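As a stand-in for what train_model.py does, here is a deliberately tiny classifier (nearest-centroid over bag-of-character counts); the repo's actual model is almost certainly different, but the train/predict shape is the same:

```python
from collections import Counter

def featurize(code: str) -> Counter:
    # Toy features: counts of non-whitespace characters.
    return Counter(ch for ch in code if not ch.isspace())

def train(samples):
    # Sum each language's feature counts into a centroid.
    centroids = {}
    for code, lang in samples:
        centroids.setdefault(lang, Counter()).update(featurize(code))
    return centroids

def predict(centroids, code: str) -> str:
    # Pick the language whose centroid overlaps the snippet's features most.
    feats = featurize(code)
    return max(centroids,
               key=lambda lang: sum(feats[k] * centroids[lang][k]
                                    for k in feats))

samples = [("print('hello')", "Python"),
           ("console.log('hello');", "JavaScript")]
model = train(samples)
```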
experiments.md tracks various iterations of both feature extraction and training.