
Assignment-Metadata-Extractor

A lightweight pipeline for extracting student assignment metadata from unstructured text into a strict JSON format by fine-tuning a SmolLM2-family base model and exporting an Ollama-ready GGUF model.

Published model

The published GGUF model is available on Hugging Face as nimendraai/SmolLM2-360M-Assignment-Metadata-Extractor.

Note

The published model is based on SmolLM2-360M, but the default training script in this repository currently uses HuggingFaceTB/SmolLM2-135M-Instruct as the base model.

What this repository contains

  • data/generate_dataset.py
    Generates synthetic instruction-tuning examples.
  • training/train.py
    Fine-tunes HuggingFaceTB/SmolLM2-135M-Instruct using Unsloth + LoRA, then exports HF and GGUF artifacts.
  • training/train.ipynb
    Notebook version of the same training workflow.

Tech stack

  • Python 3.10–3.11
  • uv for environment + dependency management
  • PyTorch, Hugging Face Datasets, TRL (SFTTrainer)
  • Unsloth for efficient LoRA fine-tuning and GGUF export

Prerequisites

  • Python 3.10 or 3.11
  • uv installed
  • Recommended for training: NVIDIA GPU with CUDA (CPU training is possible but slow)

Quickstart

1) Set up the environment

uv venv
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows PowerShell
uv sync

2) Generate synthetic dataset

uv run python data/generate_dataset.py --size 400 --output data/dataset.json

3) Run fine-tuning

uv run python training/train.py

4) Produced artifacts

  • ./smollm-student-extractor/ (Hugging Face model/tokenizer)
  • ./smollm-student-gguf/ (GGUF export for Ollama)
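To sanity-check the Hugging Face export before moving to Ollama, you can load it with transformers and run a single extraction. This is a minimal sketch, assuming the Alpaca-style template below matches the one used in training/train.py, and that the sample input text is a hypothetical example:

```python
# Sketch: smoke-test the exported Hugging Face artifacts.
import json


def build_prompt(text: str) -> str:
    """Format input text with the same template the Modelfile applies."""
    return (
        "### Instruction:\n"
        "Extract student info as JSON from the following text.\n\n"
        f"### Input:\n{text}\n\n"
        "### Response:\n"
    )


def extract(model_dir: str, text: str) -> dict:
    """Load the exported model and parse one completion as JSON."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    inputs = tok(build_prompt(text), return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    completion = tok.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return json.loads(completion)


# Example (requires the trained model on disk):
# print(extract("./smollm-student-extractor", "IT21001234 John Doe Assignment 02"))
```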

Ollama Modelfile

Create a Modelfile with the following content:

FROM hf.co/nimendraai/SmolLM2-360M-Assignment-Metadata-Extractor:Q4_K_M

# Apply the strict instruction template used during training
TEMPLATE """### Instruction:
Extract student info as JSON from the following text.

### Input:
{{ .Prompt }}

### Response:
"""

# Set the System constraints
SYSTEM """
You are a precise student assignment data extractor.
Output ONLY a valid JSON object. No explanation. No extra text. No markdown.
Return a JSON object with exactly these keys: "student_number", "student_name", and "assignment_number". All values must be strings extracted from the input text.
"""

# Turn off creativity
PARAMETER temperature 0

# Stop generating once the JSON is closed
PARAMETER stop "}"

Build and run with Ollama:

ollama create assignment-metadata-extractor -f Modelfile
ollama run assignment-metadata-extractor
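Beyond the interactive CLI, the model can be queried programmatically over Ollama's local HTTP API. A minimal sketch, assuming the model was created as assignment-metadata-extractor (as above) and Ollama is listening on its default port 11434; because the Modelfile sets PARAMETER stop "}", the closing brace may be absent from the completion, so the parser re-appends it when missing:

```python
# Sketch: call the local Ollama server and parse the model's JSON output.
import json
import urllib.request


def repair_json(raw: str) -> dict:
    """Parse model output, restoring the closing brace if the stop
    sequence ("}") was stripped from the completion."""
    raw = raw.strip()
    if not raw.endswith("}"):
        raw += "}"
    return json.loads(raw)


def extract_metadata(text: str, host: str = "http://localhost:11434") -> dict:
    payload = json.dumps({
        "model": "assignment-metadata-extractor",
        "prompt": text,  # Ollama wraps this in the Modelfile's TEMPLATE
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return repair_json(body["response"])


# Example (requires a running Ollama server):
# print(extract_metadata("Submission by John Doe (IT21001234), Assignment 02"))
```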

Dataset format

data/generate_dataset.py creates a JSON list where each item contains:

  • instruction
  • input
  • output (JSON string with keys: student_number, student_name, assignment_number)
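For illustration, one dataset item in this format might look like the sketch below. The field values are hypothetical examples, not taken from data/generate_dataset.py:

```python
# Sketch: one illustrative item in the dataset format described above.
import json

example_item = {
    "instruction": "Extract student info as JSON from the following text.",
    "input": "Submission by John Doe (IT21001234) for Assignment 02",
    # "output" is a JSON *string*, not a nested object.
    "output": json.dumps({
        "student_number": "IT21001234",
        "student_name": "John Doe",
        "assignment_number": "02",
    }),
}

# data/dataset.json is a JSON list of such items.
dataset = [example_item]
print(json.dumps(dataset, indent=2))
```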

Notes

  • training/train.py expects data/dataset.json to exist.
  • If the dataset file is missing or empty, training exits with a clear error.

Citation

@misc{nimendra_2026,
  author    = { Nimendra },
  title     = { SmolLM2-360M-Assignment-Metadata-Extractor (Revision 0da34e3) },
  year      = 2026,
  url       = { https://huggingface.co/nimendraai/SmolLM2-360M-Assignment-Metadata-Extractor },
  doi       = { 10.57967/hf/8468 },
  publisher = { Hugging Face }
}
