
CosmicAlgo/Timeline-Explorer


Timeline Explorer

Authors:

  • Rahul, HPC with Data Science MSc, University of Edinburgh
  • Yiyi Wang, HPC with Data Science MSc, University of Edinburgh

Semantic Search & Trend Analysis Engine for Academic Research

Timeline Explorer is a high-performance backend CLI application that automates the ingestion, processing, and vector-based retrieval of scientific literature. It uses the SPECTER2 embedding model to group academic papers by concept and tracks their distribution across publication years to identify emerging research trends.
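
As a rough illustration of the trend-tracking idea, the sketch below counts how many retrieved papers fall in each publication year. The `hits` records and their `year` field are hypothetical names chosen for this example; the project's own logic lives in src/processing/analysis.py and may differ.

```python
from collections import Counter


def year_distribution(hits):
    """Count papers per publication year, skipping records with no year."""
    counts = Counter(h["year"] for h in hits if h.get("year") is not None)
    return dict(sorted(counts.items()))


# Hypothetical search hits; the field names are assumptions for illustration.
hits = [
    {"title": "Paper A", "year": 2019},
    {"title": "Paper B", "year": 2021},
    {"title": "Paper C", "year": 2021},
    {"title": "Paper D", "year": None},
]

timeline = year_distribution(hits)
for year, n in timeline.items():
    # Print a small text histogram, one row per year.
    print(year, "#" * n)
```

A year whose count rises relative to earlier years is the kind of signal a trend metric could build on.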


📚 Documentation

The documentation is split into modules that mirror the system architecture. For setup and execution steps, see:

👉 SETUP.md

For detailed information on how to contribute and structure PRs, please see:

👉 CONTRIBUTING.md


🏗️ Architecture

┌──────────────────────────────────────────────────────────┐
│                       run.py                             │
│            (GPU detection → Docker launch)               │
└────────────────────────┬─────────────────────────────────┘
                         │ docker-compose run --rm app
                         ▼
┌──────────────────────────────────────────────────────────┐
│                   src/main.py (Typer CLI)                │
│  search │ upload │ entity │ import-zip │ stats │ build   │
└─────┬──────┬────────┬──────────┬────────────┬────────────┘
      │      │        │          │            │
      ▼      ▼        ▼          ▼            ▼
┌──────────────────────────────────────────────────────────┐
│                    src/ingestion/                        │
│  arxiv.py │ semantic_scholar.py │ openalex.py │ loader.py│
│                 dropzone_monitor.py                      │
└─────────────────────────┬────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│                    src/storage/                          │
│              db.py (SQLite — research.db)                │
│      papers │ authors │ entities │ embeddings_status     │
└─────────────────────────┬────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│                   src/processing/                        │
│  embeddings.py  │ index_loader.py │ aggregation.py       │
│  analysis.py    │ metrics.py                             │
└──────────────────────────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│                    scripts/                              │
│  build_embeddings.py  │  build_index.py                  │
│  (SPECTER2 → .npz)    │  (sklearn NN → .pkl)             │
└──────────────────────────────────────────────────────────┘
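
The top box of the diagram (GPU detection, then Docker launch) could be sketched as follows. This is not the actual run.py: detecting a GPU by looking for the `nvidia-smi` binary is an assumption, and `docker-compose.gpu.yml` is a hypothetical override file used only to show where a GPU-specific configuration might plug in. Only the `docker-compose run --rm app` command comes from the diagram.

```python
import shutil


def gpu_available():
    """Assumption: the presence of the `nvidia-smi` binary means a GPU is usable."""
    return shutil.which("nvidia-smi") is not None


def build_command(cli_args, use_gpu):
    """Return the docker-compose argv that forwards `cli_args` to the app container."""
    cmd = ["docker-compose"]
    if use_gpu:
        # Hypothetical override file that would enable the NVIDIA runtime.
        cmd += ["-f", "docker-compose.yml", "-f", "docker-compose.gpu.yml"]
    return cmd + ["run", "--rm", "app", *cli_args]


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch has no side effects.
    print(" ".join(build_command(["search", "graph neural networks"], gpu_available())))
```

Keeping command construction separate from execution makes the launcher easy to test without Docker installed.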

Tech Stack

| Layer        | Technology                                          |
|--------------|-----------------------------------------------------|
| Runtime      | Docker (Python 3.10-slim)                           |
| CLI          | Typer + Rich                                        |
| Database     | SQLite 3 (relational: papers/authors/entities)      |
| AI Model     | allenai/specter2_base (768-dim vectors)             |
| Search Index | scikit-learn NearestNeighbors (cosine)              |
| Data Sources | ArXiv API, Semantic Scholar, OpenAlex               |
| NLP          | spaCy en_core_web_sm (NER extraction)               |
| CI/CD        | GitLab CI (pytest)                                  |
| GPU          | NVIDIA Docker runtime (auto-detected)               |
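
To make the search-index row concrete, here is a minimal sketch of cosine nearest-neighbour search over 768-dimensional vectors with scikit-learn. Random vectors stand in for SPECTER2 embeddings, and the collection size and `n_neighbors` value are arbitrary choices for the example, not the project's settings.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Stand-in for SPECTER2 output: 100 random 768-dimensional vectors.
embeddings = rng.normal(size=(100, 768)).astype(np.float32)

# scikit-learn's tree-based algorithms do not support the cosine metric,
# so brute-force search is requested explicitly.
index = NearestNeighbors(n_neighbors=5, metric="cosine", algorithm="brute")
index.fit(embeddings)

# Query with a slightly perturbed copy of vector 42; it should rank first.
noise = 0.01 * rng.normal(size=768).astype(np.float32)
query = embeddings[42] + noise
distances, indices = index.kneighbors(query.reshape(1, -1))

print("nearest ids:", indices[0])
print("cosine distances:", distances[0])
```

Cosine distance here is `1 - cosine similarity`, so smaller values mean closer matches.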

🚀 Future Roadmap & Improvements

This project serves as a foundation for scalable semantic search architectures. Upcoming iterations will focus on HPC, Distributed Systems, MLOps, and robust Data Engineering practices:

  • HPC & Distributed Systems: Scale the ingestion pipeline with Apache Spark or Ray for distributed parsing of massive S2ORC datasets. Move the SPECTER2 embedding generator to PyTorch DistributedDataParallel (DDP) and MPI for multi-node, multi-GPU vector computation on high-performance compute clusters.
  • MLOps Integration: Migrate from direct torch inference to a Model Registry (e.g., MLflow) for versioning the SPECTER2 embeddings. Add Airflow/Prefect DAG workflows that fetch papers from ArXiv/OpenAlex on a schedule (cron-style) without manual CLI triggers.
  • Backend & Data Engineering: Move from SQLite to PostgreSQL + pgvector for approximate nearest neighbor (ANN) search at enterprise scale (billions of vectors), backed by a Redis cache layer to reduce latency on repeated queries.
  • DevOps: Write complete Helm Chart definitions to migrate the Docker Compose setup to an orchestrator such as Kubernetes (K8s). Add Prometheus & Grafana to monitor embedding latency anomalies and GPU utilization.

📁 Project Structure

Timeline-Explorer/
├── run.py                     # Smart launcher (GPU detect → Docker)
├── docker-compose.yml         # Container configuration
├── docker/Dockerfile          # Python 3.10 + dependencies
├── requirements.txt           # Python dependencies
├── .env                       # API keys (gitignored)
├── .gitlab-ci.yml             # CI pipeline (pytest)
│
├── src/
│   ├── main.py                # Typer CLI (entry point)
│   ├── config.py              # Environment-based configuration
│   │
│   ├── ingestion/             # Data sources
│   │   ├── arxiv.py           # ArXiv API client
│   │   ├── semantic_scholar.py# Semantic Scholar API client
│   │   ├── openalex.py        # OpenAlex API client
│   │   ├── loader.py          # PDF text extraction + NER
│   │   └── dropzone_monitor.py# Auto-import teammate zips
│   │
│   ├── processing/            # Analysis & AI
│   │   ├── embeddings.py      # SPECTER2 model wrapper
│   │   ├── index_loader.py    # Load/search sklearn NN index
│   │   ├── aggregation.py     # Exact phrase semantic matching
│   │   ├── analysis.py        # Year timeline analysis
│   │   └── metrics.py         # Research trend metrics
│   │
│   └── storage/
│       └── db.py              # SQLite database (CRUD + search)
│
├── scripts/
│   ├── build_embeddings.py    # Batch encode papers → .npz
│   └── build_index.py         # Build sklearn NN index from .npz
│
├── tests/                     # pytest test suite
│
└── data/                      # (gitignored — local only)
    ├── research.db            # Main SQLite database
    ├── embeddings/            # SPECTER2 .npz vector files
    ├── index/                 # sklearn NN index (.pkl)
    ├── dropzone/              # Drop teammate zips here
    ├── processed/             # Archived processed zips
    └── papers/Papers/         # Local PDF files
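
For orientation, the relational layout behind research.db might look roughly like the sketch below. The table names papers, authors, and entities come from the architecture diagram; every column name is an assumption made for this example, and the embeddings_status table is omitted. The real schema is defined in src/storage/db.py.

```python
import sqlite3

# Hypothetical columns; only the table names are taken from the README.
SCHEMA = """
CREATE TABLE papers (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    year  INTEGER
);
CREATE TABLE authors (
    id       INTEGER PRIMARY KEY,
    paper_id INTEGER NOT NULL REFERENCES papers(id),
    name     TEXT NOT NULL
);
CREATE TABLE entities (
    id       INTEGER PRIMARY KEY,
    paper_id INTEGER NOT NULL REFERENCES papers(id),
    label    TEXT NOT NULL,
    value    TEXT NOT NULL
);
"""

# An in-memory database keeps the sketch free of side effects.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO papers (id, title, year) VALUES (1, 'Example paper', 2021)")
conn.execute(
    "INSERT INTO entities (paper_id, label, value) VALUES (?, ?, ?)",
    (1, "ORG", "University of Edinburgh"),
)

# Find papers that mention an organisation entity.
rows = conn.execute(
    "SELECT p.title, e.value FROM papers p "
    "JOIN entities e ON e.paper_id = p.id WHERE e.label = ?",
    ("ORG",),
).fetchall()
print(rows)
```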

🧪 Testing

# Run all tests (via GitLab CI or locally)
pytest tests/

# Run with coverage
pytest tests/ --cov=src

🔧 Troubleshooting

| Issue             | Solution                                           |
|-------------------|----------------------------------------------------|
| Docker not found  | Install Docker Desktop                             |
| GPU not detected  | Install NVIDIA Container Toolkit                   |
| Slow first run    | Normal: downloading the SPECTER2 model (~440 MB)   |
| "No papers found" | Run python run.py build to build the index         |
| Import fails      | Ensure zip contains papers.sqlite                  |
| Port conflict     | Stop other Docker containers first                 |
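
For the "Import fails" case, a quick way to check an archive before dropping it into data/dropzone/ is sketched below. It only looks for a member named papers.sqlite; accepting the file at any directory depth inside the zip is an assumption, and the importer itself may be stricter.

```python
import io
import zipfile


def zip_has_database(zip_source, member="papers.sqlite"):
    """Return True if the archive contains a file named `member` at any depth."""
    with zipfile.ZipFile(zip_source) as zf:
        return any(name.split("/")[-1] == member for name in zf.namelist())


# Build two small in-memory archives to demonstrate the check.
good = io.BytesIO()
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("export/papers.sqlite", b"placeholder")

bad = io.BytesIO()
with zipfile.ZipFile(bad, "w") as zf:
    zf.writestr("export/notes.txt", b"placeholder")

print(zip_has_database(good))
print(zip_has_database(bad))
```

`zip_source` can also be a filesystem path, since `zipfile.ZipFile` accepts both paths and file-like objects.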

About

An AI-powered knowledge base that transforms unstructured academic PDFs into an interactive timeline. It features hybrid semantic search, entity extraction (NER), and automated metadata heuristics, built with Python, Docker, and vector embeddings.
