Timeline Explorer

Authors:

Rahul, HPC with Data Science MSc, University of Edinburgh
Yiyi Wang, HPC with Data Science MSc, University of Edinburgh

Semantic Search & Trend Analysis Engine for Academic Research

Timeline Explorer is a high-performance backend CLI application that automates the ingestion, processing, and vector-based retrieval of scientific literature. It uses the SPECTER2 embedding model to group academic papers by concept and tracks their distribution over time to identify emerging research trends.

📚 Documentation

The documentation is split into modules that mirror the system architecture. For setup and execution steps, see:

👉 SETUP.md

For detailed information on how to contribute and structure PRs, please see:

👉 CONTRIBUTING.md

🏗️ Architecture

┌──────────────────────────────────────────────────────────┐
│                       run.py                             │
│            (GPU detection → Docker launch)               │
└────────────────────────┬─────────────────────────────────┘
                         │ docker-compose run --rm app
                         ▼
┌──────────────────────────────────────────────────────────┐
│                   src/main.py (Typer CLI)                │
│  search │ upload │ entity │ import-zip │ stats │ build   │
└─────┬──────┬────────┬──────────┬────────────┬────────────┘
      │      │        │          │            │
      ▼      ▼        ▼          ▼            ▼
┌──────────────────────────────────────────────────────────┐
│                    src/ingestion/                        │
│  arxiv.py │ semantic_scholar.py │ openalex.py │ loader.py│
│                 dropzone_monitor.py                      │
└─────────────────────────┬────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│                    src/storage/                          │
│              db.py (SQLite — research.db)                │
│      papers │ authors │ entities │ embeddings_status     │
└─────────────────────────┬────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│                   src/processing/                        │
│  embeddings.py  │ index_loader.py │ aggregation.py       │
│  analysis.py    │ metrics.py                             │
└─────────────────────────┬────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│                    scripts/                              │
│  build_embeddings.py  │  build_index.py                  │
│  (SPECTER2 → .npz)    │  (sklearn NN → .pkl)             │
└──────────────────────────────────────────────────────────┘
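
The launcher step at the top of the diagram (GPU detection, then Docker launch) could look roughly like the sketch below. It is illustrative only: the `nvidia-smi` probe, the `docker-compose.gpu.yml` override file, and the function names are assumptions, not taken from `run.py`. Only the `docker-compose run --rm app` invocation comes from the diagram.

```python
import shutil
import subprocess
import sys


def gpu_available() -> bool:
    """Heuristic: treat the host as GPU-capable if `nvidia-smi -L` runs cleanly."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe, "-L"], capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def build_command(cli_args: list[str], use_gpu: bool) -> list[str]:
    """Assemble the docker-compose invocation shown in the diagram."""
    cmd = ["docker-compose", "-f", "docker-compose.yml"]
    if use_gpu:
        # Hypothetical override file that would request the NVIDIA runtime.
        cmd += ["-f", "docker-compose.gpu.yml"]
    return cmd + ["run", "--rm", "app", *cli_args]


if __name__ == "__main__":
    command = build_command(sys.argv[1:], gpu_available())
    print("Launching:", " ".join(command))
    # subprocess.run(command, check=False)  # uncomment to actually launch
```

Keeping command construction separate from the GPU probe makes the launcher easy to test without Docker or a GPU present.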

Tech Stack

| Layer | Technology |
|---|---|
| Runtime | Docker (Python 3.10-slim) |
| CLI | Typer + Rich |
| Database | SQLite 3 (relational: papers/authors/entities) |
| AI Model | `allenai/specter2_base` (768-dim vectors) |
| Search Index | scikit-learn `NearestNeighbors` (cosine) |
| Data Sources | ArXiv API, Semantic Scholar, OpenAlex |
| NLP | spaCy `en_core_web_sm` (NER extraction) |
| CI/CD | GitLab CI (pytest) |
| GPU | NVIDIA Docker runtime (auto-detected) |
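
To make the Search Index row concrete, here is a minimal sketch of cosine nearest-neighbour lookup with scikit-learn's `NearestNeighbors`. Random 768-dimensional vectors stand in for SPECTER2 embeddings, and the variable names and the `top_k` helper are illustrative assumptions rather than the project's code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(seed=0)

# Stand-in for SPECTER2 output: 100 papers, 768 dimensions each.
paper_vectors = rng.normal(size=(100, 768)).astype(np.float32)

# Brute-force index over cosine distance (1 - cosine similarity).
index = NearestNeighbors(metric="cosine", algorithm="brute")
index.fit(paper_vectors)


def top_k(query: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) for the k closest papers."""
    distances, rows = index.kneighbors(query.reshape(1, -1), n_neighbors=k)
    return [(int(r), 1.0 - float(d)) for r, d in zip(rows[0], distances[0])]


# Querying with a stored vector should return that paper first.
results = top_k(paper_vectors[42], k=3)
print(results)
```

In a real run the query vector would come from encoding the search text with the same model used for the stored papers, so that both live in the same embedding space.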

🚀 Future Roadmap & Improvements

This project serves as a foundation for scalable semantic search architectures. Upcoming iterations will focus on HPC, Distributed Systems, MLOps, and robust Data Engineering practices:

HPC & Distributed Systems: Scale the ingestion pipeline with Apache Spark or Ray for distributed parsing of large S2ORC datasets. Move the SPECTER2 embedding generator to PyTorch DistributedDataParallel (DDP) and MPI for multi-node, multi-GPU vector computation on high-performance compute clusters.
MLOps Integration: Migrate from direct torch inference to a Model Registry (e.g., MLflow) to version the SPECTER2 embeddings. Add Airflow or Prefect DAG workflows to schedule paper fetching from ArXiv and OpenAlex as cron-style jobs, without manual CLI triggers.
Backend & Data Engineering: Move from SQLite to PostgreSQL + pgvector for approximate nearest neighbor (ANN) search at the scale of billions of vectors, backed by a Redis cache layer to reduce latency on repeated queries.
DevOps: Write complete Helm chart definitions to migrate the Docker Compose setup to an orchestrator such as Kubernetes (K8s). Add Prometheus and Grafana to monitor embedding latency anomalies and GPU utilization.

📁 Project Structure

Timeline-Explorer/
├── run.py                     # Smart launcher (GPU detect → Docker)
├── docker-compose.yml         # Container configuration
├── docker/Dockerfile          # Python 3.10 + dependencies
├── requirements.txt           # Python dependencies
├── .env                       # API keys (gitignored)
├── .gitlab-ci.yml             # CI pipeline (pytest)
│
├── src/
│   ├── main.py                # Typer CLI (entry point)
│   ├── config.py              # Environment-based configuration
│   │
│   ├── ingestion/             # Data sources
│   │   ├── arxiv.py           # ArXiv API client
│   │   ├── semantic_scholar.py# Semantic Scholar API client
│   │   ├── openalex.py        # OpenAlex API client
│   │   ├── loader.py          # PDF text extraction + NER
│   │   └── dropzone_monitor.py# Auto-import teammate zips
│   │
│   ├── processing/            # Analysis & AI
│   │   ├── embeddings.py      # SPECTER2 model wrapper
│   │   ├── index_loader.py    # Load/search sklearn NN index
│   │   ├── aggregation.py     # Exact phrase semantic matching
│   │   ├── analysis.py        # Year timeline analysis
│   │   └── metrics.py         # Research trend metrics
│   │
│   └── storage/
│       └── db.py              # SQLite database (CRUD + search)
│
├── scripts/
│   ├── build_embeddings.py    # Batch encode papers → .npz
│   └── build_index.py         # Build sklearn NN index from .npz
│
├── tests/                     # pytest test suite
│
└── data/                      # (gitignored — local only)
    ├── research.db            # Main SQLite database
    ├── embeddings/            # SPECTER2 .npz vector files
    ├── index/                 # sklearn NN index (.pkl)
    ├── dropzone/              # Drop teammate zips here
    ├── processed/             # Archived processed zips
    └── papers/Papers/         # Local PDF files
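
As an illustration of how `storage/db.py` and `processing/analysis.py` might fit together, the sketch below builds a per-year paper count from an in-memory SQLite database. The `papers` table name comes from the architecture diagram; its columns (`id`, `title`, `year`) and the `papers_per_year` helper are assumptions made for the example.

```python
import sqlite3


def papers_per_year(conn: sqlite3.Connection) -> dict[int, int]:
    """Count papers by publication year, oldest year first."""
    rows = conn.execute(
        "SELECT year, COUNT(*) FROM papers "
        "WHERE year IS NOT NULL GROUP BY year ORDER BY year"
    )
    return {year: count for year, count in rows}


# In-memory database with a hypothetical minimal schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT NOT NULL, year INTEGER)"
)
conn.executemany(
    "INSERT INTO papers (title, year) VALUES (?, ?)",
    [
        ("Paper A", 2021),
        ("Paper B", 2022),
        ("Paper C", 2022),
        ("Paper D", None),  # a missing year is skipped by the query
    ],
)

timeline = papers_per_year(conn)
print(timeline)  # {2021: 1, 2022: 2}
```

Pointing the same query at `data/research.db` would require matching it to the actual schema first.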

🧪 Testing

# Run all tests (via GitLab CI or locally)
pytest tests/

# Run with coverage
pytest tests/ --cov=src

🔧 Troubleshooting

| Issue | Solution |
|---|---|
| Docker not found | Install Docker Desktop |
| GPU not detected | Install NVIDIA Container Toolkit |
| Slow first run | Normal: downloading SPECTER2 model (~440 MB) |
| "No papers found" | Run `python run.py build` to build the index |
| Import fails | Ensure zip contains `papers.sqlite` |
| Port conflict | Stop other Docker containers first |
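
For the import failure case, a quick pre-flight check can confirm that an archive actually contains `papers.sqlite` before it goes into `data/dropzone/`. This is a standalone sketch using only the standard library. The helper names are hypothetical, and whether the importer expects the file at the archive root or accepts nested paths is an assumption (the sketch accepts both).

```python
import io
import zipfile
from pathlib import PurePosixPath


def contains_papers_sqlite(archive) -> bool:
    """True if any member of the zip is a file named papers.sqlite."""
    with zipfile.ZipFile(archive) as zf:
        return any(
            PurePosixPath(name).name == "papers.sqlite"
            for name in zf.namelist()
            if not name.endswith("/")  # ignore directory entries
        )


def make_zip(member_names: list[str]) -> io.BytesIO:
    """Build a small in-memory zip for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in member_names:
            zf.writestr(name, b"")
    buffer.seek(0)
    return buffer


good = make_zip(["export/papers.sqlite", "export/Papers/a.pdf"])
bad = make_zip(["export/notes.txt"])
print(contains_papers_sqlite(good), contains_papers_sqlite(bad))  # True False
```

`contains_papers_sqlite` also accepts a filesystem path, so it can be pointed directly at a zip on disk.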
