A simple document clustering utility that cleans text, builds SentenceTransformers embeddings, and clusters with KMeans or HDBSCAN.
Requirements: Python 3.10+.
- Create and activate a virtual environment (recommended):

```bash
python -m venv .venv
source .venv/bin/activate
```

On Windows (PowerShell):

```powershell
py -m venv .venv
.\.venv\Scripts\Activate.ps1
```

- Install dependencies:

```bash
python -m pip install -r requirements.txt
```

- Download the NLTK data used by the cleaner (one-time):
```bash
python - <<'PY'
import nltk
for resource in [
    "punkt",
    "punkt_tab",
    "averaged_perceptron_tagger",
    "averaged_perceptron_tagger_eng",
    "stopwords",
    "wordnet",
    "omw-1.4",
]:
    nltk.download(resource)
PY
```

- (Optional) Extras for the notebook and visualizations:

```bash
pip install pandas matplotlib seaborn plotly
```

Install dependencies:

```bash
pip install -r requirements.txt
```

You can also install the package in editable mode:
```bash
pip install -e .
```

```bash
document-clusterer clean \
  --stories-dir data/cnn-stories \
  --word-list data/one-grams.txt \
  --output all_stories.json \
  --pipeline nltk \
  --min-token-length 3
```

Environment variable defaults are available:
- `STORIES_DIR` – directory containing `story.txt` files (default: `data/cnn-stories`)
- `WORD_LIST_PATH` – newline-delimited allowed words (default: `data/one-grams.txt`)
- `STOP_WORDS_PATH` – optional newline-delimited stopword list to merge with or override the defaults
- `OUTPUT_JSON` – JSON output path (default: `all_stories.json`)
- `CLEANING_PIPELINE` – tokenizer pipeline (`nltk` or `spacy`, default: `nltk`)
- `SPACY_MODEL` – spaCy model name (default: `en_core_web_sm`)
- `MIN_TOKEN_LENGTH` – minimum token length to retain (default: `3`)
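The usual precedence for such tools is: explicit CLI flag first, then the environment variable, then the built-in default. A minimal sketch of how that resolution can work with the standard library (the variable names come from the list above, but `resolve_setting` itself is a hypothetical helper, not the package's actual code):

```python
import os

def resolve_setting(flag_value, env_name, default):
    """Return the CLI flag if given, else the environment variable, else the default."""
    if flag_value is not None:
        return flag_value
    return os.environ.get(env_name, default)

# No flag and no environment variable set: fall back to the documented default.
stories_dir = resolve_setting(None, "STORIES_DIR", "data/cnn-stories")

# An environment variable overrides the default, but an explicit flag wins over both.
os.environ["MIN_TOKEN_LENGTH"] = "4"
min_len = int(resolve_setting(None, "MIN_TOKEN_LENGTH", "3"))
```

This layering lets a shell profile set project-wide defaults while individual invocations still override them per run.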
Switch pipelines with `--pipeline spacy` (requires spaCy and a downloaded model). Toggle lowercasing, URL stripping, and number stripping via `--no-lowercase`, `--keep-urls`, and `--keep-numbers`.
For a quick demo, a tiny sample corpus is available in `data/sample`:

```bash
document-clusterer clean --stories-dir data/sample --output data/sample.json
```

```bash
document-clusterer cluster \
  --input-file all_stories.json \
  --stories-dir data/cnn-stories \
  --output-dir clusteredDocuments \
  --model-name all-MiniLM-L6-v2 \
  --cluster-method kmeans \
  --clusters 10 \
  --reduction umap \
  --reduction-dim 2
```

On Windows, you can run the same commands in PowerShell. If `document-clusterer` is not on your PATH, use the module form instead:
```powershell
python -m document_clusterer.cli clean --stories-dir data\sample --output data\sample.json
python -m document_clusterer.cli cluster --input-file data\sample_cleaned.json --stories-dir data\sample --output-dir clusteredDocuments\sample
```

Environment variable defaults:
- `INPUT_JSON` – cleaned JSON input (default: `all_stories.json`)
- `STORIES_DIR` – directory containing the original stories (default: `data/cnn-stories`)
- `CLUSTER_OUTPUT_DIR` – output directory for clustered documents (default: `clusteredDocuments`)
- `CLUSTER_COUNT` – number of clusters for KMeans (default: `10`)
- `CLUSTER_METHOD` – `kmeans` or `hdbscan` (default: `kmeans`)
- `EMBEDDING_MODEL` – SentenceTransformers model name (default: `all-MiniLM-L6-v2`)
- `KMEANS_RANDOM_STATE` – seed for KMeans initialization (default: `42`)
- `HDBSCAN_MIN_CLUSTER_SIZE` / `HDBSCAN_MIN_SAMPLES` – HDBSCAN hyperparameters
- `REDUCTION_METHOD` – `umap`, `pca`, or `none` (default: `umap`)
- `REDUCTION_DIM`, `UMAP_NEIGHBORS`, `UMAP_MIN_DIST` – dimensionality reduction controls
- `SUMMARY_TERMS` – number of top terms per cluster in summaries (default: `10`)
- `ASSIGNMENTS_BASENAME` – base filename for assignment JSON/CSV outputs (default: `cluster_assignments`)
Outputs now include:

- `cluster_assignments.json` and `.csv` – cluster labels (and visualization coordinates if enabled)
- `cluster_summaries.json` / `.txt` – human-friendly top terms per cluster
- Subdirectories under `clusteredDocuments/` for each cluster label (with `noise` for HDBSCAN outliers)
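The per-cluster folder layout amounts to a group-by on the labels, with HDBSCAN's outlier label (`-1`, the library's convention) mapped to the `noise` bucket. A hypothetical sketch of that grouping step, not the package's actual code:

```python
from collections import defaultdict

def group_by_cluster(filenames, labels):
    """Map each cluster label to its documents; -1 (HDBSCAN outliers) goes to 'noise'."""
    buckets = defaultdict(list)
    for name, label in zip(filenames, labels):
        folder = "noise" if label == -1 else str(label)
        buckets[folder].append(name)
    return dict(buckets)

files = ["a.txt", "b.txt", "c.txt", "d.txt"]
labels = [0, 1, -1, 0]
buckets = group_by_cluster(files, labels)
# buckets == {"0": ["a.txt", "d.txt"], "1": ["b.txt"], "noise": ["c.txt"]}
```

With KMeans every document gets a non-negative label, so the `noise` folder only appears for HDBSCAN runs.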
```bash
# Clean a lightweight demo corpus
document-clusterer clean \
  --stories-dir data/sample \
  --word-list data/one-grams.txt \
  --output data/sample_cleaned.json

# Cluster with KMeans + UMAP for 2D visualization
document-clusterer cluster \
  --input-file data/sample_cleaned.json \
  --stories-dir data/sample \
  --output-dir clusteredDocuments/sample \
  --model-name all-MiniLM-L6-v2 \
  --cluster-method kmeans \
  --clusters 3 \
  --reduction umap \
  --reduction-dim 2
```

After clustering, check `clusteredDocuments/sample/cluster_assignments.json` for labels and UMAP coordinates, and `cluster_summaries.json` for the top keywords per cluster.
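Keyword summaries of the kind described above boil down to counting token frequencies within each cluster. A rough, stdlib-only sketch of the idea (not the package's `summarize_clusters` implementation):

```python
from collections import Counter

def top_terms(docs_tokens, n=3):
    """Return the n most frequent tokens across a cluster's cleaned documents."""
    counts = Counter()
    for tokens in docs_tokens:
        counts.update(tokens)
    return [term for term, _ in counts.most_common(n)]

# One cluster's documents as token lists (toy data).
cluster = [["match", "goal", "team"], ["team", "coach", "goal"], ["team", "season"]]
print(top_terms(cluster, n=2))  # -> ['team', 'goal']
```

Real summaries often weight terms (e.g. TF-IDF) rather than using raw counts, so frequent-but-generic words don't dominate every cluster.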
Both legacy scripts remain as compatibility wrappers:

- `cleaning.py` now delegates to the packaged cleaner
- `model.py` delegates to the packaged clustering CLI

Feel free to extend `document_clusterer/` to add additional cleaning or clustering options.
- CLI entrypoints – `document_clusterer/cli.py` exposes `clean` and `cluster` subcommands plus convenience wrappers (`document-clusterer-clean`, `document-clusterer-cluster`).
- Cleaning pipeline – `document_clusterer/cleaning.py` loads `.txt` files, normalizes text (URL/number stripping, lowercasing), tokenizes with NLTK or spaCy, removes stop words/short tokens, lemmatizes, and writes structured JSON via `save_documents`.
- Embedding + clustering – `document_clusterer/model.py` encodes documents with SentenceTransformers (`embed_documents`), clusters with KMeans or HDBSCAN (`run_clustering`), optionally reduces dimensions with UMAP/PCA for visualization (`reduce_embeddings`), and summarizes top terms per cluster (`summarize_clusters`).
- Outputs – assignments and summaries are persisted as JSON/CSV/text, and the original `.txt` files are copied into per-cluster folders for inspection.
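The normalization steps in the cleaning pipeline can be approximated in a few lines. This sketch uses plain regexes and a toy stopword set instead of the package's NLTK/spaCy pipelines, and it skips lemmatization:

```python
import re

STOP_WORDS = {"the", "a", "and", "of", "to"}  # toy list; the real pipeline uses NLTK's

def clean_text(text, min_token_length=3, lowercase=True, keep_urls=False, keep_numbers=False):
    """Roughly mirror the cleaner: strip URLs/numbers, lowercase, tokenize, filter."""
    if not keep_urls:
        text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    if lowercase:
        text = text.lower()
    if not keep_numbers:
        text = re.sub(r"\d+", " ", text)  # drop standalone numbers
    tokens = re.findall(r"[a-zA-Z]+", text)  # naive word tokenizer
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= min_token_length]

print(clean_text("The 2 cats sat at https://example.com and purred"))
# -> ['cats', 'sat', 'purred']
```

Each of the `keep_*` / `lowercase` switches mirrors one of the CLI toggles (`--keep-urls`, `--keep-numbers`, `--no-lowercase`); `min_token_length` plays the role of `--min-token-length`.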
A portfolio-ready Jupyter notebook lives in `notebooks/Document_Clusterer_Demo.ipynb`. It demonstrates:

- Building a small themed corpus on the fly (tech, sports, health).
- Cleaning the text with `CleaningOptions` and viewing tokenized results.
- Running KMeans + UMAP via `cluster_documents` to produce assignments and summaries.
- Visualizing clusters in 2D and charting top keywords per cluster with inline matplotlib outputs.
Launch with:

```bash
jupyter notebook notebooks/Document_Clusterer_Demo.ipynb
```