Authors:
- Rahul, HPC with Data Science MSc, University of Edinburgh
- Yiyi Wang, HPC with Data Science MSc, University of Edinburgh
Semantic Search & Trend Analysis Engine for Academic Research
Timeline Explorer is a backend, high-performance CLI application designed to automate the ingestion, processing, and vector-based retrieval of scientific literature. It leverages the SPECTER2 embedding model to group academic papers conceptually and tracks their chronological distribution to identify emerging research trends.
The documentation is heavily modularized to reflect the system architecture. For detailed steps on execution, refer to:
👉 SETUP.md
For detailed information on how to contribute and structure PRs, please see:
┌──────────────────────────────────────────────────────────┐
│                          run.py                          │
│             (GPU detection → Docker launch)              │
└────────────────────────┬─────────────────────────────────┘
                         │ docker-compose run --rm app
                         ▼
┌──────────────────────────────────────────────────────────┐
│                 src/main.py (Typer CLI)                  │
│  search │ upload │ entity │ import-zip │ stats │ build   │
└─────┬──────┬────────┬──────────┬────────────┬────────────┘
      │      │        │          │            │
      ▼      ▼        ▼          ▼            ▼
┌──────────────────────────────────────────────────────────┐
│                      src/ingestion/                      │
│ arxiv.py │ semantic_scholar.py │ openalex.py │ loader.py │
│                   dropzone_monitor.py                    │
└─────────────────────────┬────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│                       src/storage/                       │
│               db.py (SQLite - research.db)               │
│     papers │ authors │ entities │ embeddings_status      │
└─────────────────────────┬────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│                     src/processing/                      │
│     embeddings.py │ index_loader.py │ aggregation.py     │
│                 analysis.py │ metrics.py                 │
└──────────────────────────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│                         scripts/                         │
│        build_embeddings.py │   build_index.py            │
│         (SPECTER2 → .npz)  │ (sklearn NN → .pkl)         │
└──────────────────────────────────────────────────────────┘
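The first box of the diagram (GPU detection followed by a `docker-compose run --rm app` launch) can be pictured with a minimal sketch like the one below. This is not the actual `run.py`: the `nvidia-smi` probe is just one common heuristic, and the `USE_GPU` environment variable and both function names are hypothetical, chosen only for illustration.

```python
import shutil
from typing import List, Optional


def gpu_available(probe: str = "nvidia-smi") -> bool:
    """Heuristic: treat the host as GPU-capable if the probe binary is on PATH."""
    return shutil.which(probe) is not None


def build_launch_command(cli_args: List[str], use_gpu: Optional[bool] = None) -> List[str]:
    """Assemble the docker-compose invocation for the `app` service."""
    if use_gpu is None:
        use_gpu = gpu_available()
    command = ["docker-compose", "run", "--rm"]
    # Hypothetical: forward the detection result into the container.
    command += ["-e", f"USE_GPU={int(use_gpu)}"]
    # Everything after the service name is passed to the CLI inside the container.
    command += ["app", *cli_args]
    return command


if __name__ == "__main__":
    print(" ".join(build_launch_command(["stats"])))
```

Returning the command as a list rather than a shell string keeps it safe to hand to `subprocess.run` without quoting issues.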
| Layer | Technology |
|---|---|
| Runtime | Docker (Python 3.10-slim) |
| CLI | Typer + Rich |
| Database | SQLite 3 (relational: papers/authors/entities) |
| AI Model | allenai/specter2_base (768-dim vectors) |
| Search Index | scikit-learn NearestNeighbors (cosine) |
| Data Sources | ArXiv API, Semantic Scholar, OpenAlex |
| NLP | spaCy en_core_web_sm (NER extraction) |
| CI/CD | GitLab CI (pytest) |
| GPU | NVIDIA Docker runtime (auto-detected) |
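To make the "cosine" entry in the search index row concrete, here is a small brute-force ranking in pure Python. It stands in for scikit-learn's `NearestNeighbors` purely for illustration and is not the project's index code; the function names are hypothetical, and the toy 2-dimensional vectors replace the 768-dimensional SPECTER2 embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance = 1 - cosine similarity; 0.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query: Sequence[float],
            vectors: Sequence[Sequence[float]],
            k: int = 3) -> List[Tuple[int, float]]:
    """Return (row index, distance) for the k vectors closest to the query."""
    scored = [(i, cosine_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


if __name__ == "__main__":
    toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for index, distance in nearest([1.0, 0.1], toy_embeddings, k=2):
        print(index, round(distance, 3))
```

Because cosine distance ignores vector length, two papers whose embeddings point in the same direction rank as identical regardless of magnitude.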
This project serves as a foundation for scalable semantic search architectures. Upcoming iterations will focus on HPC, Distributed Systems, MLOps, and robust Data Engineering practices:
- HPC & Distributed Systems: Scaling the ingestion pipeline using Apache Spark or Ray for distributed parsing of massive S2ORC datasets. Transitioning the SPECTER2 embedding generator to PyTorch `DistributedDataParallel` (DDP) and MPI for multi-node/multi-GPU vector computation across high-performance compute clusters.
- MLOps Integration: Migration from direct `torch` inference to a Model Registry (e.g., MLflow) for versioning the `SPECTER2` embeddings. Implementation of Airflow/Prefect DAG workflows to automate scheduled (cron-style) paper fetching from ArXiv/OpenAlex without manual CLI triggers.
- Backend & Data Engineering: Graduating from SQLite to PostgreSQL + pgvector for approximate nearest neighbor (ANN) search at enterprise scale (billions of vectors), backed by a Redis cache layer to reduce latency for repeated queries.
- DevOps: Complete Helm Chart definitions to migrate the Docker Compose setup into an orchestrator like Kubernetes (K8s). Incorporation of Prometheus & Grafana to monitor embedding latency anomalies and GPU utilization metrics.
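As a rough illustration of the multi-node embedding idea in the first roadmap item, the sketch below splits a list of paper IDs across workers with a simple strided partition. It is only an assumption about how work might be divided; it uses no PyTorch or MPI API, and `shard_for_rank` is a hypothetical helper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_rank(items: Sequence[T], rank: int, world_size: int) -> List[T]:
    """Return the slice of `items` that worker `rank` out of `world_size` should embed."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Strided split: worker r takes items r, r + world_size, r + 2 * world_size, ...
    return list(items[rank::world_size])


if __name__ == "__main__":
    paper_ids = list(range(10))
    for rank in range(4):
        print(rank, shard_for_rank(paper_ids, rank, world_size=4))
```

A strided split keeps shard sizes within one item of each other, which matters when every worker must finish before the vectors are merged.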
Timeline-Explorer/
├── run.py # Smart launcher (GPU detect → Docker)
├── docker-compose.yml # Container configuration
├── docker/Dockerfile # Python 3.10 + dependencies
├── requirements.txt # Python dependencies
├── .env # API keys (gitignored)
├── .gitlab-ci.yml # CI pipeline (pytest)
│
├── src/
│ ├── main.py # Typer CLI (entry point)
│ ├── config.py # Environment-based configuration
│ │
│ ├── ingestion/ # Data sources
│ │ ├── arxiv.py # ArXiv API client
│ │ ├── semantic_scholar.py# Semantic Scholar API client
│ │ ├── openalex.py # OpenAlex API client
│ │ ├── loader.py # PDF text extraction + NER
│ │ └── dropzone_monitor.py# Auto-import teammate zips
│ │
│ ├── processing/ # Analysis & AI
│ │ ├── embeddings.py # SPECTER2 model wrapper
│ │ ├── index_loader.py # Load/search sklearn NN index
│ │ ├── aggregation.py # Exact phrase semantic matching
│ │ ├── analysis.py # Year timeline analysis
│ │ └── metrics.py # Research trend metrics
│ │
│ └── storage/
│ └── db.py # SQLite database (CRUD + search)
│
├── scripts/
│ ├── build_embeddings.py # Batch encode papers → .npz
│ └── build_index.py # Build sklearn NN index from .npz
│
├── tests/ # pytest test suite
│
└── data/ # (gitignored — local only)
├── research.db # Main SQLite database
├── embeddings/ # SPECTER2 .npz vector files
├── index/ # sklearn NN index (.pkl)
├── dropzone/ # Drop teammate zips here
├── processed/ # Archived processed zips
└── papers/Papers/ # Local PDF files
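To show how the storage and analysis layers in the tree above could fit together, here is a minimal sketch of a per-year paper count on top of SQLite. The table name `papers` comes from the architecture diagram, but the columns (`title`, `year`), the query, and the function names are assumptions made for this example, not the real `db.py` or `analysis.py`.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical, heavily reduced schema: the real research.db has more tables and columns.
SCHEMA = """
CREATE TABLE papers (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    year  INTEGER
);
"""


def papers_per_year(conn: sqlite3.Connection) -> List[Tuple[int, int]]:
    """Return (year, paper count) pairs in chronological order."""
    rows = conn.execute(
        "SELECT year, COUNT(*) FROM papers "
        "WHERE year IS NOT NULL GROUP BY year ORDER BY year"
    ).fetchall()
    return [(year, count) for year, count in rows]


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO papers (title, year) VALUES (?, ?)",
        [("Paper A", 2020), ("Paper B", 2021), ("Paper C", 2021), ("Paper D", None)],
    )
    return conn


if __name__ == "__main__":
    for year, count in papers_per_year(demo_connection()):
        print(year, count)
```

Doing the grouping in SQL keeps the timeline computation inside the database, so only one small row per year crosses into Python.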
# Run all tests (via GitLab CI or locally)
pytest tests/
# Run with coverage
pytest tests/ --cov=src

| Issue | Solution |
|---|---|
| Docker not found | Install Docker Desktop |
| GPU not detected | Install NVIDIA Container Toolkit |
| Slow first run | Normal on first launch: the SPECTER2 model (~440 MB) is being downloaded |
| "No papers found" | Run `python run.py build` to build the index |
| Import fails | Ensure the zip contains `papers.sqlite` |
| Port conflict | Stop other Docker containers first |
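The "Import fails" row can be checked before dropping an archive into the dropzone. The sketch below uses only the standard library; whether the importer accepts `papers.sqlite` inside a subfolder is an assumption here, and `zip_has_database` is a hypothetical helper, not part of the project.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import BinaryIO, Union

REQUIRED_MEMBER = "papers.sqlite"


def zip_has_database(source: Union[str, BinaryIO], required: str = REQUIRED_MEMBER) -> bool:
    """Return True if the archive contains a file named `required` at any depth."""
    try:
        with zipfile.ZipFile(source) as archive:
            # Assumption: a match inside a subfolder also counts.
            return any(PurePosixPath(name).name == required for name in archive.namelist())
    except zipfile.BadZipFile:
        # Not a zip at all (for example, a truncated upload).
        return False


if __name__ == "__main__":
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("papers.sqlite", b"")
    buffer.seek(0)
    print(zip_has_database(buffer))
```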