Skip to content

Latest commit

 

History

History
616 lines (522 loc) · 24.1 KB

File metadata and controls

616 lines (522 loc) · 24.1 KB

FDA CRL Explorer - Current State & Future Plans

Project Overview

A full-stack Python application for exploring FDA Complete Response Letters with AI-powered summarization.

The webapp is in a really great place! This document focuses on realistic, achievable goals.


Technology Stack

Backend

  • Framework: FastAPI (async, modern, auto-generated API docs)
  • Database: DuckDB (embedded, excellent JSON support, analytics-optimized, no separate server needed)
  • AI Services: OpenAI API (GPT-4o-mini for summarization)
  • Data Processing: pandas (data manipulation), httpx (async HTTP client)

Frontend

  • Framework: React + Vite (fast, modern, component-based)
  • UI Library: Tailwind CSS + shadcn/ui (beautiful, accessible components)
  • Data Fetching: TanStack Query (React Query v5 - caching, background updates)
  • Table Component: TanStack Table (powerful filtering, sorting, pagination)
  • State Management: Zustand (lightweight, simple)

Why This Stack?

  • DuckDB: Perfect for this use case - handles JSON natively, fast analytics queries, embedded (no server)
  • FastAPI: Modern async Python framework, auto-generated OpenAPI docs, excellent validation with Pydantic
  • Small scale: ~400 records, simple and efficient
  • Cost-effective: No separate database server, minimal infrastructure

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                     Frontend (React)                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                 │
│  │  Table   │  │  Filters │  │  Detail  │                 │
│  │  View    │  │    UI    │  │  Modal   │                 │
│  └──────────┘  └──────────┘  └──────────┘                 │
└───────────────────────────┬─────────────────────────────────┘
                            │ HTTP/JSON
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                  Backend API (FastAPI)                       │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  REST Endpoints: /crls, /stats                       │  │
│  └──────────────────────────────────────────────────────┘  │
│                            │                                 │
│  ┌────────────────┬────────┴────────────────────────────┐  │
│  │  Data          │  AI Services                        │  │
│  │  Ingestion     │  - Summarization                    │  │
│  │  Service       │                                     │  │
│  └────────────────┴─────────────────────────────────────┘  │
└───────────────────────────┬─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                    DuckDB Database                           │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │ crls        │  │ crl_         │  │ metadata         │  │
│  │ (raw data)  │  │ summaries    │  │ (job status)     │  │
│  └─────────────┘  └──────────────┘  └──────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│                    External Services                         │
│  ┌──────────────────┐  ┌────────────────────────────────┐  │
│  │ FDA API          │  │ OpenAI API                     │  │
│  │ (bulk download)  │  │ (summarization)                │  │
│  └──────────────────┘  └────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Database Schema

Table: crls

Stores raw CRL data from FDA API.

CREATE TABLE crls (
    id VARCHAR PRIMARY KEY,                    -- Generated from application_number + letter_date
    application_number VARCHAR[],              -- e.g., ["NDA 215818"]
    letter_date DATE,                          -- Parsed from MM/DD/YYYY
    letter_year VARCHAR,
    letter_type VARCHAR,                       -- "COMPLETE RESPONSE"
    approval_status VARCHAR,                   -- "Approved" or "Unapproved"
    company_name VARCHAR,
    company_address VARCHAR,
    company_rep VARCHAR,
    approver_name VARCHAR,
    approver_center VARCHAR[],                 -- e.g., ["Center for Drug Evaluation and Research"]
    approver_title VARCHAR,
    file_name VARCHAR,
    text TEXT,                                 -- Full letter content
    raw_json JSON,                             -- Original JSON for reference
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Table: crl_summaries

Stores AI-generated summaries.

CREATE TABLE crl_summaries (
    id VARCHAR PRIMARY KEY,
    crl_id VARCHAR REFERENCES crls(id),
    summary TEXT,                              -- AI-generated paragraph summary
    model VARCHAR,                             -- e.g., "gpt-4o-mini"
    generated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    tokens_used INTEGER
);

Table: processing_metadata

Tracks processing status.

CREATE TABLE processing_metadata (
    key VARCHAR PRIMARY KEY,
    value VARCHAR,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Examples: last_download_date, last_processing_date, total_crls_processed

Project Structure

crl-app/
├── backend/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py                    # FastAPI application entry point
│   │   ├── config.py                  # Configuration (env vars, settings)
│   │   ├── database.py                # DuckDB connection & initialization
│   │   ├── models.py                  # Pydantic models (request/response)
│   │   ├── schemas.py                 # Database schemas (SQL)
│   │   │
│   │   ├── api/                       # API route handlers
│   │   │   ├── __init__.py
│   │   │   ├── crls.py                # CRL CRUD endpoints
│   │   │   ├── stats.py               # Statistics endpoints
│   │   │   ├── qa.py                  # Q&A endpoints
│   │   │   └── export.py              # Export endpoints
│   │   │
│   │   ├── services/                  # Business logic
│   │   │   ├── __init__.py
│   │   │   ├── data_ingestion.py      # Download & parse FDA data
│   │   │   ├── data_processor.py      # Process & store CRLs
│   │   │   ├── summarization.py       # OpenAI summarization
│   │   │   ├── embeddings.py          # OpenAI embeddings generation
│   │   │   ├── rag.py                 # RAG: retrieval + generation
│   │   │   └── export_service.py      # Export to CSV/Excel
│   │   │
│   │   ├── tasks/                     # Scheduled tasks
│   │   │   ├── __init__.py
│   │   │   └── scheduler.py           # APScheduler setup & jobs
│   │   │
│   │   └── utils/                     # Utilities
│   │       ├── __init__.py
│   │       ├── openai_client.py       # OpenAI API wrapper
│   │       ├── vector_utils.py        # Vector similarity functions
│   │       └── logging_config.py      # Logging setup
│   │
│   ├── data/                          # Local data directory
│   │   ├── raw/                       # Downloaded bulk data
│   │   ├── processed/                 # Processed data
│   │   └── crl_explorer.duckdb        # DuckDB database file
│   │
│   ├── tests/                         # Backend tests
│   │   ├── __init__.py
│   │   ├── test_api.py
│   │   ├── test_services.py
│   │   └── test_rag.py
│   │
│   ├── requirements.txt               # Python dependencies
│   ├── .env.example                   # Example environment variables
│   └── README.md                      # Backend documentation
│
├── frontend/
│   ├── src/
│   │   ├── components/
│   │   │   ├── CRLTable.jsx           # Main table component
│   │   │   ├── FilterPanel.jsx        # Search & filter UI
│   │   │   ├── CRLDetailModal.jsx     # Detail view modal
│   │   │   ├── QAPanel.jsx            # Q&A interface
│   │   │   ├── StatsCards.jsx         # Summary statistics
│   │   │   └── ExportButton.jsx       # Export functionality
│   │   │
│   │   ├── services/
│   │   │   ├── api.js                 # API client
│   │   │   └── queries.js             # React Query hooks
│   │   │
│   │   ├── store/
│   │   │   └── filterStore.js         # Zustand store for filters
│   │   │
│   │   ├── pages/
│   │   │   ├── HomePage.jsx           # Main application page
│   │   │   └── AboutPage.jsx          # About/documentation
│   │   │
│   │   ├── App.jsx                    # Root component
│   │   ├── main.jsx                   # Entry point
│   │   └── index.css                  # Global styles
│   │
│   ├── public/
│   ├── package.json
│   ├── vite.config.js
│   ├── tailwind.config.js
│   └── README.md
│
├── .gitignore
├── FDA_CRL_API.md                     # API documentation (created)
├── TODO.md                            # This file
└── README.md                          # Project overview

Implementation Plan

Phase 1: Foundation & Setup

1.1 Project Initialization ✅ COMPLETED

  • Create project directory structure
  • Initialize Git repository
  • Set up .gitignore (exclude .env, data/, *.duckdb, node_modules/)
  • Create backend virtual environment
  • Create requirements.txt with initial dependencies
  • Create .env.example template
  • Initialize frontend with Vite + React

Dependencies (requirements.txt):

fastapi==0.104.1
uvicorn[standard]==0.24.0
duckdb==0.9.2
pydantic==2.5.0
pydantic-settings==2.1.0
python-dotenv==1.0.0
httpx==0.25.1
pandas==2.1.3
openai==1.3.5
apscheduler==3.10.4
python-multipart==0.0.6
openpyxl==3.1.2
numpy==1.26.2

1.2 Configuration Management ✅ COMPLETED

  • Create app/config.py with Pydantic Settings
  • Define environment variables:
    • OPENAI_API_KEY (required, never exposed to frontend)
    • DATABASE_PATH (default: ./data/crl_explorer.duckdb)
    • FDA_BULK_APPROVED_URL
    • FDA_BULK_UNAPPROVED_URL
    • SCHEDULE_HOUR (default: 2 AM)
    • LOG_LEVEL (default: INFO)
    • CORS_ORIGINS (allowed frontend origins)
  • Add validation for required environment variables
  • Create .env.example with placeholder values

1.3 Logging Setup ✅ COMPLETED

  • Create app/utils/logging_config.py
  • Configure structured logging (JSON format)
  • Set up file rotation
  • Add request ID tracking for API calls

Phase 2: Database Layer ✅ COMPLETED

2.1 Database Setup ✅ COMPLETED

  • Create app/database.py with DuckDB connection manager
  • Implement connection pooling/singleton pattern
  • Create app/schemas.py with all table definitions (SQL)
  • Write database initialization function (init_db())
  • Create indexes for common queries:
    • approval_status
    • letter_year
    • company_name
    • letter_date

2.2 Data Access Layer ✅ COMPLETED

  • Create repository pattern classes in app/database.py:
    • CRLRepository (CRUD operations for CRLs)
    • SummaryRepository
    • EmbeddingRepository
    • QARepository
  • Implement query builders for filtering/sorting
  • Add pagination support
  • Write helper functions for vector similarity search

2.3 Database Testing

  • Create test database setup
  • Write unit tests for repositories
  • Test data insertion and retrieval
  • Verify index performance

Phase 2.5: Testing Infrastructure Setup

2.5.1 Test Framework Setup

  • Install pytest and pytest-cov
  • Create pytest.ini configuration file
  • Set up test directory structure
  • Configure coverage reporting (HTML, terminal, and XML)
  • Add pytest-asyncio for async test support
  • Add pytest-mock for mocking support

2.5.2 Code Coverage Configuration

  • Set up .coveragerc configuration
  • Configure minimum coverage thresholds
  • Set up coverage reports in CI/CD (future)
  • Add coverage badges to README (future)

2.5.3 Initial Unit Tests

  • Write tests for app/config.py (settings validation)
  • Write tests for app/utils/logging_config.py
  • Write tests for app/database.py (connection management)
  • Write tests for app/schemas.py (table creation)
  • Achieve >80% coverage for core modules

2.5.4 Test Utilities

  • Create test fixtures for database connections
  • Create mock factories for test data
  • Set up test database (in-memory or temporary)
  • Create helper functions for common test scenarios

Testing Dependencies to Add:

pytest==7.4.3
pytest-cov==4.1.0
pytest-asyncio==0.21.1
pytest-mock==3.12.0
coverage[toml]==7.3.2

Phase 3: Data Ingestion Pipeline ✅ COMPLETED

3.1 Download Service ✅ COMPLETED

  • Create app/services/data_ingestion.py
  • Implement download_bulk_data() function:
    • Download CRL JSON ZIP file from FDA
    • Extract JSON files
    • Validate downloads
  • Add retry logic with exponential backoff (using tenacity)
  • Implement progress logging
  • Store raw files in data/raw/

3.2 Data Processing ✅ COMPLETED

  • Create app/services/data_processor.py
  • Implement parse_crl_data() function:
    • Parse JSON structure
    • Validate required fields
    • Transform dates (MM/DD/YYYY and YYYYMMDD → YYYY-MM-DD)
    • Generate unique CRL IDs
  • Implement detect_new_crls() function:
    • Compare with existing database
    • Identify new records
    • Identify updated records
  • Implement store_crls() function:
    • Bulk insert new CRLs
    • Update existing CRLs if changed
    • Track metadata (last_download_date)

3.3 Testing ✅ COMPLETED

  • Create sample CRL data for testing (47 comprehensive tests)
  • Test download error handling (network errors, timeouts, retry logic)
  • Test duplicate detection (ID collision resolution with hashing)
  • Verify data integrity after processing (date parsing, edge cases)
  • Test all Phase 3 services with 100% test pass rate

Phase 4: AI Services ✅ COMPLETED

4.1 OpenAI Client ✅ COMPLETED

  • Create app/utils/openai_client.py
  • Implement wrapper class with:
    • API key management (from config)
    • Rate limiting (exponential backoff via OpenAI client)
    • Error handling and retries
    • Token usage tracking
    • Timeout configuration
  • Add logging for all API calls
  • Implement cost estimation

4.2 Summarization Service ✅ COMPLETED

  • Create app/services/summarization.py
  • Design prompt template focused on deficiencies and key points
  • Implement summarize_crl(text) function:
    • Call GPT-4o-mini API
    • Extract summary from response
    • Track tokens used
  • Implement batch processing with progress tracking
  • Add retry logic for failures (via OpenAI client)
  • Created CLI script generate_summaries.py for batch processing

4.3 Embedding Service ✅ COMPLETED

  • Create app/services/embeddings.py
  • Implement generate_embedding(text) function:
    • Use text-embedding-3-small (1536 dimensions)
    • Handle text truncation (max 8191 tokens)
    • Return numpy array
  • Implement batch embedding generation
  • Create embeddings for summaries
  • Normalize vectors for cosine similarity
  • Created CLI script generate_embeddings.py for batch processing

4.4 Testing ✅ COMPLETED

  • Test summarization with sample CRLs (dry-run mode)
  • Verify embedding dimensions
  • Test batch processing
  • Validate error handling and retries
  • Monitor token usage and costs
  • 23 comprehensive tests covering all AI services

Phase 7: Backend API (FastAPI) ✅ COMPLETED

7.1 Main Application ✅ COMPLETED

  • Create app/main.py with FastAPI app
  • Configure CORS (restrict to frontend origin)
  • Add health check endpoint: GET /health with database stats
  • Include API routers
  • Add startup event to initialize database
  • Configure exception handlers (404, 500)
  • Add lifespan context manager for startup/shutdown
  • Auto-generated OpenAPI documentation at /docs

7.2 Pydantic Models ✅ COMPLETED

  • Create app/models.py with request/response models:
    • CRLListItem (list view)
    • CRLWithSummary (detail view with AI summary)
    • CRLWithText (full text view)
    • CRLListResponse (paginated list)
    • StatsOverview (statistics)
    • CompanyStats (company statistics)
    • QARequest (question with validation)
    • QAResponse (answer with citations)
    • QAHistoryItem and QAHistoryResponse
    • HealthResponse (health check)

7.3 CRL Endpoints ✅ COMPLETED

  • Create app/api/crls.py
  • Implement endpoints:
    • GET /api/crls - List CRLs with filtering, sorting, pagination
      • Query params: approval_status, letter_year, company_name, search_text, limit (1-100), offset, sort_by, sort_order
      • Return: paginated list with total count and has_more flag
    • GET /api/crls/{crl_id} - Get single CRL with summary
      • Return: full CRL metadata + AI summary
    • GET /api/crls/{crl_id}/text - Get full letter text
      • Return: CRL with complete letter text

7.4 Statistics Endpoints ✅ COMPLETED

  • Create app/api/stats.py
  • Implement endpoints:
    • GET /api/stats/overview - Overall statistics
      • Total CRLs, by_status breakdown, by_year breakdown
    • GET /api/stats/companies - Top companies by CRL count
      • Sorted by CRL count with approved/unapproved breakdown

7.6 Export Endpoints

  • Create app/api/export.py (deferred - not MVP critical)
  • Create app/services/export_service.py
  • Implement CSV/Excel export functionality

7.7 Testing ✅ COMPLETED

  • Write API tests for all endpoints (comprehensive tests)
  • Test filtering and pagination
  • Test error responses (404, 400, 422, 500)
  • Test CORS configuration
  • Test health check with database stats
  • Test API documentation accessibility
  • Tests use in-memory database with mock data

Current State - What's Working Great! ✅

Backend ✅ COMPLETED

  • FastAPI application with health checks
  • DuckDB database with optimized schema
  • Data ingestion pipeline from FDA API
  • AI-powered summarization service
  • REST API endpoints for CRLs and statistics
  • Comprehensive test coverage
  • Docker deployment setup

Frontend ✅ COMPLETED

  • React + Vite application
  • Beautiful UI with Tailwind CSS + shadcn/ui
  • Interactive CRL table with filtering and sorting
  • Detail modal with AI summaries
  • Statistics dashboard
  • Responsive design
  • Production deployment ready

Automated Data Updates ✅ IMPLEMENTED

How It Works

A GitHub Actions workflow runs daily to check if FDA data has changed. If changes are detected, it:

  1. Downloads fresh FDA data
  2. Runs the complete AI processing pipeline
  3. Uploads the new database to GitHub Releases

Components

backend/check_for_updates.py - Change Detection Script

  • Downloads FDA bulk data and computes SHA256 hash
  • Compares with previously stored hash in database
  • Exit code 0 = data changed, exit code 1 = no change

backend/ingest_data_ci.py - Non-Interactive Pipeline

  • CI/CD-friendly version of ingest_data.py
  • No prompts, no confirmations
  • Runs all 7 pipeline steps automatically

.github/workflows/update-data.yml - GitHub Actions Workflow

  • Runs daily at 2 AM UTC
  • Checks for FDA data changes
  • If changed: runs pipeline and creates new GitHub Release
  • Database uploaded to: https://github.com/{repo}/releases/download/data-{date}/crl_explorer.duckdb

Setup Required

  1. Add GitHub Secret: OPENAI_API_KEY

    • Go to repo Settings → Secrets and variables → Actions
    • Add new secret named OPENAI_API_KEY
  2. Manual Trigger (optional):

    • Go to Actions → "Update CRL Data" → Run workflow
    • Check "Force update" to bypass change detection

Usage

Download latest database:

# Find latest release
gh release list --repo armish/crl.help

# Download database
gh release download data-2025-01-15 --pattern "*.duckdb" --repo armish/crl.help

Use with Docker:

docker run -d -p 80:80 \
  -e DATABASE_URL=https://github.com/armish/crl.help/releases/download/data-2025-01-15/crl_explorer.duckdb \
  ghcr.io/armish/crl.help:latest

Security Checklist ✅

  • OpenAI API key stored in environment variable
  • API key never exposed to frontend
  • API key never logged
  • .env file in .gitignore
  • Pre-commit hook to prevent accidental secret exposure
  • CORS restricted to specific origins
  • All user inputs validated
  • HTTPS in production (via hosting platform)
  • No hardcoded secrets in code

Key Design Decisions

Why DuckDB?

  • Perfect for analytics: Optimized for read-heavy workloads
  • Embedded: No separate database server needed
  • JSON support: Native JSON columns for raw data storage
  • Small footprint: Perfect for ~400 records
  • Serverless-friendly: Single file database, easy backups

Summarization Approach

  • Single paragraph (3-5 sentences)
  • Focus on deficiencies, not boilerplate
  • GPT-4o-mini for cost efficiency
  • Cache summaries to avoid re-processing

Data Update Strategy (Current)

  • Manual bulk download when needed
  • Full re-ingestion is fast (~5 minutes)
  • Simple and reliable
  • Automate only if data changes frequently

Cost Estimation

OpenAI API Costs (~400 CRLs)

Summarization (GPT-4o-mini):

  • Total cost for initial setup: < $0.10
  • Ongoing costs: Minimal (few new CRLs per week/month)

Note: Very cost-effective due to small dataset size.