davidgilbertson/langid

This repo generates a model that detects the language of a code snippet.

It's meant as a reference: you can't just clone and run it to produce a model without a bit of effort.

The goal is to do language detection (much) better than Highlight.js, using ML/AI, while keeping the model very small (< 1 MB shipped to the browser).

Demo (easy path if you just want to use this)

The web/ directory has a simple demo where you can type code in a browser and see the language detected.

Live demo: https://langid.dgapps.io

Local preview:

npm i
npm run dev

Generating a model

main.py calls all the major components:

  1. Load a dataset
  2. Extract features from the dataset
  3. Train a model on the features
  4. Analyze the model

Running main.py produces assets in models/ and features/ (both gitignored). It also writes a copy of the model to web/public/model.json, which is used by the demo site.
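
The four steps above can be sketched roughly as follows. This is an illustration only, not the actual contents of main.py: run_pipeline and its callable parameters are hypothetical names, and the model is assumed to be JSON-serializable. Only the web/public/model.json destination comes from this README.

```python
import json
import shutil
from pathlib import Path


def run_pipeline(load_dataset, extract_features, train, analyze, model_dir, web_copy):
    """Run the four stages in order and write the model to disk.

    All four callables are hypothetical stand-ins for the repo's real
    components; the model is assumed to be JSON-serializable.
    """
    samples = load_dataset()                # 1. Load a dataset
    features = extract_features(samples)    # 2. Extract features
    model = train(features)                 # 3. Train a model
    report = analyze(model, features)       # 4. Analyze the model

    # Save the model, then copy it to where the demo site reads it.
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    model_path = model_dir / "model.json"
    model_path.write_text(json.dumps(model), encoding="utf-8")

    web_copy = Path(web_copy)
    web_copy.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(model_path, web_copy)
    return report
```

Passing the stages in as callables keeps the sketch independent of how each one is actually implemented in the repo.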

Datasets

These aren't embedded in the repo.

The Stack V2

https://huggingface.co/datasets/bigcode/the-stack-v2

Scripts in data/the_stack pull the data and split it into n-line snippets, with some minimal cleaning.
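
Splitting a file into n-line snippets might look something like the sketch below. The function name, the default of 10 lines, and the specific cleaning steps (stripping trailing whitespace, trimming blank lines at snippet edges, dropping empty snippets) are assumptions for illustration; the scripts in data/the_stack may do it differently.

```python
def split_into_snippets(source: str, n_lines: int = 10) -> list[str]:
    """Split source text into consecutive chunks of at most n_lines lines."""
    # Minimal cleaning (assumed): strip trailing whitespace on each line.
    lines = [line.rstrip() for line in source.splitlines()]
    snippets = []
    for start in range(0, len(lines), n_lines):
        chunk = lines[start:start + n_lines]
        # Trim blank lines from both ends of the chunk.
        while chunk and not chunk[0]:
            chunk.pop(0)
        while chunk and not chunk[-1]:
            chunk.pop()
        # Skip chunks that were entirely blank.
        if chunk:
            snippets.append("\n".join(chunk))
    return snippets
```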

Smola dataset

https://github.com/smola/language-dataset. That repo doesn't work on Windows (invalid directory names), so I've converted it into a SQLite database and put it in data/local/.

~200 languages, 10-30 samples of each
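
A conversion along these lines could be done with the standard sqlite3 module. This is a sketch under assumptions: it presumes a directory layout of one folder per language containing sample files, and the table name, column names, and function name are all hypothetical rather than taken from the database in data/local/.

```python
import sqlite3
from pathlib import Path


def build_sample_db(root, db_path):
    """Load <root>/<language>/<file> samples into a SQLite table.

    Storing the language as a column value avoids relying on
    directory names that some filesystems reject.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "language TEXT NOT NULL, content TEXT NOT NULL)"
    )
    count = 0
    for path in sorted(Path(root).glob("*/*")):
        if not path.is_file():
            continue
        # Replace undecodable bytes rather than failing on odd encodings.
        content = path.read_text(encoding="utf-8", errors="replace")
        conn.execute(
            "INSERT INTO samples (language, content) VALUES (?, ?)",
            (path.parent.name, content),
        )
        count += 1
    conn.commit()
    conn.close()
    return count
```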

CSN dataset

https://huggingface.co/datasets/code-search-net/code_search_net. Store the unzipped language folders under data/local/csn/<language>/final/jsonl/<split>/*.jsonl.gz.

6 languages, millions of samples
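
Given the folder layout above, the samples can be streamed without unpacking the .jsonl.gz files. The function below is a hypothetical helper, not code from this repo; it yields each JSON line as a dict and makes no assumption about which fields the records contain.

```python
import gzip
import json
from pathlib import Path


def iter_csn_records(root, language, split):
    """Yield one dict per JSON line from <root>/<language>/final/jsonl/<split>/*.jsonl.gz."""
    folder = Path(root) / language / "final" / "jsonl" / split
    for path in sorted(folder.glob("*.jsonl.gz")):
        # Open in text mode so each line is a str ready for json.loads.
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)
```

For example, `iter_csn_records("data/local/csn", "python", "train")` would walk every training shard for one language, assuming the split folders are named that way.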

Rosetta dataset

https://huggingface.co/datasets/pharaouk/rosetta-code

I've filtered the Rosetta code dataset to only the languages detected by Medium.com.

Rosetta doesn't have all of them (GraphQL, Markdown, and YAML are missing), but there are still 29 languages. The problem is that it consists entirely of programming-task solutions, so it's lacking in scripts, CSS, HTML, etc.

Feature extraction

features.py processes the data to create features.
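
This README doesn't describe which features features.py computes. Purely as an illustration of what a compact feature extractor for code snippets can look like, here is a sketch of hashed character n-gram frequencies. The technique, the function name, and the parameters are assumptions, not the repo's method.

```python
import zlib


def char_ngram_features(snippet: str, n: int = 3, buckets: int = 1024) -> list[float]:
    """Map a snippet to a fixed-length vector of hashed character n-gram frequencies."""
    vec = [0.0] * buckets
    total = max(len(snippet) - n + 1, 0)
    for i in range(total):
        gram = snippet[i:i + n]
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        vec[zlib.crc32(gram.encode("utf-8")) % buckets] += 1.0
    if total:
        # Normalize so snippets of different lengths are comparable.
        vec = [v / total for v in vec]
    return vec
```

A fixed bucket count keeps the input size, and therefore the model size, bounded regardless of how many distinct n-grams appear in the data.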

Training

train_model.py trains a model to predict the target language from the features.

Experiments

experiments.md tracks various iterations of both feature extraction and training.

About

A project to explore the creation of a model to detect programming language.
