This repo generates a model that detects the language of a code snippet.
It's meant as a reference rather than something you can clone and run to produce a model without a bit of effort.
The goal is to do language detection (much) better than Highlight.js, using ML/AI, but at a very small size (< 1 MB shipped to the browser).
The web/ directory has a simple demo where you can type code in a browser and see the language detected.
Live demo: https://langid.dgapps.io
Local preview:

```shell
npm i
npm run dev
```
main.py calls all the major components:
- Load a dataset
- Extract features from the dataset
- Train a model on the features
- Analyze the model
Running main.py produces assets in models/ and features/ (both gitignored, so they aren't committed to the repo). It also writes a copy of the model to web/public/model.json, which the demo site uses.
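A minimal sketch of the pipeline main.py drives. The function names and the toy implementations are illustrative only, not the repo's actual code:

```python
# Hypothetical sketch of main.py's flow: load -> featurize -> train -> export.
import json

def load_dataset():
    # Stand-in for the real loaders described below (The Stack v2, CSN, etc.).
    return [("print('hi')", "Python"), ("console.log('hi')", "JavaScript")]

def extract_features(samples):
    # Toy features: the set of non-alphanumeric, non-space characters.
    return [({ch for ch in code if not ch.isalnum() and not ch.isspace()}, lang)
            for code, lang in samples]

def train_model(features):
    # Toy model: remember which symbols were seen for each language.
    model = {}
    for symbols, lang in features:
        model.setdefault(lang, set()).update(symbols)
    return model

samples = load_dataset()
features = extract_features(samples)
model = train_model(features)

# main.py writes the trained model to web/public/model.json for the demo site;
# here we just serialize the toy model the same way.
serialized = json.dumps({lang: sorted(syms) for lang, syms in model.items()})
```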
https://huggingface.co/datasets/bigcode/the-stack-v2
Scripts in data/the_stack pull the data and split it into n-line snippets, with some minimal cleaning.
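The splitting step might look like this sketch; the chunk size and the cleaning rules here are assumptions, not the repo's exact logic:

```python
def split_into_snippets(source: str, n: int = 10) -> list[str]:
    # Minimal cleaning: drop blank lines and strip trailing whitespace.
    lines = [ln.rstrip() for ln in source.splitlines() if ln.strip()]
    # Chunk into n-line snippets, discarding a short trailing remainder.
    return ["\n".join(lines[i:i + n])
            for i in range(0, len(lines) - n + 1, n)]
```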
https://github.com/smola/language-dataset
That repo doesn't work on Windows (some directory names are invalid there), so I've converted it into a sqlite database and put it in data/local/.
~200 languages, 10-30 samples of each
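Reading samples back out of that sqlite database could look like this sketch. The table and column names are assumptions; the in-memory database stands in for the real file under data/local/:

```python
import sqlite3

# In-memory stand-in for the real database under data/local/;
# the schema (table "samples", columns "language"/"content") is assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (language TEXT, content TEXT)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [("Python", "print('hi')"), ("Ruby", "puts 'hi'")])

def samples_for(conn, language):
    # Fetch all stored snippets for one language.
    rows = conn.execute(
        "SELECT content FROM samples WHERE language = ?", (language,))
    return [content for (content,) in rows]
```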
https://huggingface.co/datasets/code-search-net/code_search_net. Store the unzipped language folders under data/local/csn/<language>/final/jsonl/<split>/*.jsonl.gz.
6 languages, millions of samples
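Those jsonl.gz files are one JSON object per line; a loader sketch (the "language" and "code" field names match the published CodeSearchNet schema, but verify against your download):

```python
import gzip
import json
import os
import tempfile

def read_jsonl_gz(path):
    # Each line is one JSON object; CSN rows include "language" and "code".
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Tiny round-trip demo with a synthetic file; the real files live under
# data/local/csn/<language>/final/jsonl/<split>/.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"language": "python", "code": "def f(): pass"}) + "\n")

rows = list(read_jsonl_gz(path))
```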
https://huggingface.co/datasets/pharaouk/rosetta-code
I've filtered the Rosetta Code dataset to only the languages detected by Medium.com.
Rosetta doesn't cover all of them (e.g. GraphQL, markdown, yaml), but that still leaves 29 languages. The drawback is that Rosetta is all program code, so it's lacking in scripts, css, html, etc.
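The filtering step itself is simple; a sketch, where the allowed-language set shown is an illustrative subset, not the actual Medium.com list:

```python
# Illustrative subset; the repo's real filter uses Medium.com's detected
# languages, which I don't reproduce exactly here.
MEDIUM_LANGUAGES = {"Python", "Ruby", "Go", "Rust", "JavaScript", "C++"}

def filter_rosetta(rows, allowed=MEDIUM_LANGUAGES):
    # Keep only rows whose language is in the allowed set.
    return [row for row in rows if row["language"] in allowed]

rows = [
    {"language": "Python", "code": "print(1)"},
    {"language": "Brainfuck", "code": "+++."},
]
kept = filter_rosetta(rows)
```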
features.py processes the data to create features.
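One common approach for this kind of task, and a plausible sketch of what features.py might do, is hashed character n-grams (an assumption on my part, not a description of the repo's actual features):

```python
def char_ngrams(code: str, n: int = 3) -> list[str]:
    # Overlapping character trigrams, e.g. "abcd" -> ["abc", "bcd"].
    return [code[i:i + n] for i in range(len(code) - n + 1)]

def hashed_features(code: str, dims: int = 1024) -> list[int]:
    # Bucket n-gram counts into a fixed-size vector ("hashing trick").
    # Note: Python's hash() is salted per interpreter run; a real pipeline
    # would use a stable hash so features are reproducible across runs.
    vec = [0] * dims
    for gram in char_ngrams(code):
        vec[hash(gram) % dims] += 1
    return vec
```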
train_model.py trains a model to predict the target language from the features.
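As a stand-in for what train_model.py does, here is a deliberately tiny classifier (nearest-centroid over bag-of-character counts); the repo's actual model is almost certainly different, but the train/predict shape is the same:

```python
from collections import Counter

def featurize(code: str) -> Counter:
    # Toy features: counts of non-whitespace characters.
    return Counter(ch for ch in code if not ch.isspace())

def train(samples):
    # Sum each language's feature counts into a centroid.
    centroids = {}
    for code, lang in samples:
        centroids.setdefault(lang, Counter()).update(featurize(code))
    return centroids

def predict(centroids, code: str) -> str:
    # Pick the language whose centroid overlaps the snippet's features most.
    feats = featurize(code)
    return max(centroids,
               key=lambda lang: sum(feats[k] * centroids[lang][k]
                                    for k in feats))

samples = [("print('hello')", "Python"),
           ("console.log('hello');", "JavaScript")]
model = train(samples)
```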
experiments.md tracks various iterations of both feature extraction and training.