Authors: Abdullah Almalki, Anamaria Berea
This repository provides a reproducible computational pipeline for large-scale topic modeling of astrobiology-related ArXiv preprints (1996-2025).
The workflow covers data collection, preprocessing, BERTopic modeling, validation, temporal analysis, and generation of paper-ready outputs.
- scripts/: cleaned and publication-ready analysis scripts
- scripts/legacy_methods/: Top2Vec legacy baseline scripts for method comparison
- utils/: helper modules for path handling and data loading
- data/: sample data and schema documentation
- results/: intermediate and final analysis outputs (generated)
- figures/: manuscript figures (generated)
- paper_outputs/: consolidated paper tables/statistics (generated)
Install dependencies:
pip install -r requirements.txt

From the project root, run the full workflow (manuscript + legacy comparison + final bundle):
make all

Useful partial targets:
make manuscript
make legacy
make bundle

To run the analysis scripts individually, in this order, from the project root:
python scripts/topic_modeling_bertopic.py
python scripts/topic_validation.py
python scripts/hierarchical_clustering_topics.py
python scripts/temporal_trend_analysis.py
python scripts/corpus_structure_analysis.py
python scripts/semantic_distance_figure.py
python scripts/unclustered_temporal_figure.py
python scripts/pipeline_diagram.py
python scripts/generate_paper_outputs.py

To reproduce the Top2Vec vs BERTopic comparison table:
python scripts/legacy_methods/top2vec_modeling.py
python scripts/legacy_methods/top2vec_validation.py
python scripts/legacy_methods/generate_table4_method_comparison.py

Generated files:
- paper_outputs/tables/table4_method_comparison.csv
- paper_outputs/tables/table4_method_comparison.md
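
Once the comparison table has been generated, it can be inspected without extra dependencies. The following is a minimal sketch, not part of the repository: the helper name `load_comparison_table` is hypothetical, and no column names are assumed because the table schema is not documented here.

```python
import csv
from pathlib import Path

# Path of the generated comparison table, relative to the project root.
TABLE_CSV = Path("paper_outputs/tables/table4_method_comparison.csv")


def load_comparison_table(path=TABLE_CSV):
    """Return the table as a list of dicts keyed by the CSV header row.

    Hypothetical helper: it only assumes the file is a UTF-8 CSV with a
    header row, and does not assume any particular column names.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Calling `load_comparison_table()` from the project root returns one dictionary per table row, which can then be printed or compared across runs.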
- Put the full processed dataset in data/processed/preprocessed_papers.csv.
- Run the same script sequence shown above.
- See REPRODUCIBILITY.md for figure/table-to-script mapping and run order.
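
The full-data steps above can be sketched as a small Python driver. This is an illustrative sketch under stated assumptions, not a script shipped with the repository: the function name `run_pipeline` is hypothetical, the script order is copied from the run instructions above, and the `make` targets remain the documented way to run the workflow.

```python
import subprocess
import sys
from pathlib import Path

# Script order copied from the run instructions in this README.
SCRIPTS = [
    "scripts/topic_modeling_bertopic.py",
    "scripts/topic_validation.py",
    "scripts/hierarchical_clustering_topics.py",
    "scripts/temporal_trend_analysis.py",
    "scripts/corpus_structure_analysis.py",
    "scripts/semantic_distance_figure.py",
    "scripts/unclustered_temporal_figure.py",
    "scripts/pipeline_diagram.py",
    "scripts/generate_paper_outputs.py",
]

# Expected location of the full processed dataset, relative to the project root.
DATASET = Path("data/processed/preprocessed_papers.csv")


def run_pipeline(root="."):
    """Run every script in order from `root`, stopping at the first failure.

    Hypothetical helper: raises FileNotFoundError if the dataset is missing
    and subprocess.CalledProcessError if any script exits with a non-zero code.
    Returns the list of scripts that completed.
    """
    root = Path(root)
    if not (root / DATASET).is_file():
        raise FileNotFoundError(f"Expected dataset at {root / DATASET}")
    completed = []
    for script in SCRIPTS:
        print(f"Running {script}")
        # Use the current interpreter so the active environment's packages apply.
        subprocess.run([sys.executable, script], cwd=root, check=True)
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the project root checks for the dataset first, so a misplaced file is reported before any modeling starts.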
The full dataset and archived outputs will be released on Zenodo (DOI: 10.5281/zenodo.18750923).
If you use this repository, please cite:
@article{almalki2026astrobiologyarxiv,
title={Nearly Three Decades of Astrobiology on ArXiv: A Large-Scale Topic Modeling},
author={Almalki, Abdullah and Berea, Anamaria},
journal={Astrobiology},
year={2026},
note={In press / accepted manuscript}
}