A full-stack Python application for exploring FDA Complete Response Letters with AI-powered summarization.
The webapp is in good shape. This document focuses on realistic, achievable goals.
- Framework: FastAPI (async, modern, auto-generated API docs)
- Database: DuckDB (embedded, excellent JSON support, analytics-optimized, no separate server needed)
- AI Services: OpenAI API (GPT-4o-mini for summarization)
- Data Processing: pandas (data manipulation), httpx (async HTTP client)
- Framework: React + Vite (fast, modern, component-based)
- UI Library: Tailwind CSS + shadcn/ui (beautiful, accessible components)
- Data Fetching: TanStack Query (React Query v5 - caching, background updates)
- Table Component: TanStack Table (powerful filtering, sorting, pagination)
- State Management: Zustand (lightweight, simple)
- DuckDB: Perfect for this use case - handles JSON natively, fast analytics queries, embedded (no server)
- FastAPI: Modern async Python framework, auto-generated OpenAPI docs, excellent validation with Pydantic
- Small scale: ~400 records, simple and efficient
- Cost-effective: No separate database server, minimal infrastructure
┌─────────────────────────────────────────────────────────────┐
│ Frontend (React) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Table │ │ Filters │ │ Detail │ │
│ │ View │ │ UI │ │ Modal │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└───────────────────────────┬─────────────────────────────────┘
│ HTTP/JSON
│
┌───────────────────────────▼─────────────────────────────────┐
│ Backend API (FastAPI) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ REST Endpoints: /crls, /stats │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────┬────────┴────────────────────────────┐ │
│ │ Data │ AI Services │ │
│ │ Ingestion │ - Summarization │ │
│ │ Service │ │ │
│ └────────────────┴─────────────────────────────────────┘ │
└───────────────────────────┬─────────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────────┐
│ DuckDB Database │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ crls │ │ crl_ │ │ metadata │ │
│ │ (raw data) │ │ summaries │ │ (job status) │ │
│ └─────────────┘ └──────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────────┐
│ External Services │
│ ┌──────────────────┐ ┌────────────────────────────────┐ │
│ │ FDA API │ │ OpenAI API │ │
│ │ (bulk download) │ │ (summarization) │ │
│ └──────────────────┘ └────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Stores raw CRL data from FDA API.
CREATE TABLE crls (
id VARCHAR PRIMARY KEY, -- Generated from application_number + letter_date
application_number VARCHAR[], -- e.g., ["NDA 215818"]
letter_date DATE, -- Parsed from MM/DD/YYYY
letter_year VARCHAR,
letter_type VARCHAR, -- "COMPLETE RESPONSE"
approval_status VARCHAR, -- "Approved" or "Unapproved"
company_name VARCHAR,
company_address VARCHAR,
company_rep VARCHAR,
approver_name VARCHAR,
approver_center VARCHAR[], -- e.g., ["Center for Drug Evaluation and Research"]
approver_title VARCHAR,
file_name VARCHAR,
text TEXT, -- Full letter content
raw_json JSON, -- Original JSON for reference
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Stores AI-generated summaries.
CREATE TABLE crl_summaries (
id VARCHAR PRIMARY KEY,
crl_id VARCHAR REFERENCES crls(id),
summary TEXT, -- AI-generated paragraph summary
model VARCHAR, -- e.g., "gpt-4o-mini"
generated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
tokens_used INTEGER
);

Tracks processing status.
CREATE TABLE processing_metadata (
key VARCHAR PRIMARY KEY,
value VARCHAR,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Examples: last_download_date, last_processing_date, total_crls_processed

crl-app/
├── backend/
│ ├── app/
│ │ ├── __init__.py
│ │ ├── main.py # FastAPI application entry point
│ │ ├── config.py # Configuration (env vars, settings)
│ │ ├── database.py # DuckDB connection & initialization
│ │ ├── models.py # Pydantic models (request/response)
│ │ ├── schemas.py # Database schemas (SQL)
│ │ │
│ │ ├── api/ # API route handlers
│ │ │ ├── __init__.py
│ │ │ ├── crls.py # CRL CRUD endpoints
│ │ │ ├── stats.py # Statistics endpoints
│ │ │ ├── qa.py # Q&A endpoints
│ │ │ └── export.py # Export endpoints
│ │ │
│ │ ├── services/ # Business logic
│ │ │ ├── __init__.py
│ │ │ ├── data_ingestion.py # Download & parse FDA data
│ │ │ ├── data_processor.py # Process & store CRLs
│ │ │ ├── summarization.py # OpenAI summarization
│ │ │ ├── embeddings.py # OpenAI embeddings generation
│ │ │ ├── rag.py # RAG: retrieval + generation
│ │ │ └── export_service.py # Export to CSV/Excel
│ │ │
│ │ ├── tasks/ # Scheduled tasks
│ │ │ ├── __init__.py
│ │ │ └── scheduler.py # APScheduler setup & jobs
│ │ │
│ │ └── utils/ # Utilities
│ │ ├── __init__.py
│ │ ├── openai_client.py # OpenAI API wrapper
│ │ ├── vector_utils.py # Vector similarity functions
│ │ └── logging_config.py # Logging setup
│ │
│ ├── data/ # Local data directory
│ │ ├── raw/ # Downloaded bulk data
│ │ ├── processed/ # Processed data
│ │ └── crl_explorer.duckdb # DuckDB database file
│ │
│ ├── tests/ # Backend tests
│ │ ├── __init__.py
│ │ ├── test_api.py
│ │ ├── test_services.py
│ │ └── test_rag.py
│ │
│ ├── requirements.txt # Python dependencies
│ ├── .env.example # Example environment variables
│ └── README.md # Backend documentation
│
├── frontend/
│ ├── src/
│ │ ├── components/
│ │ │ ├── CRLTable.jsx # Main table component
│ │ │ ├── FilterPanel.jsx # Search & filter UI
│ │ │ ├── CRLDetailModal.jsx # Detail view modal
│ │ │ ├── QAPanel.jsx # Q&A interface
│ │ │ ├── StatsCards.jsx # Summary statistics
│ │ │ └── ExportButton.jsx # Export functionality
│ │ │
│ │ ├── services/
│ │ │ ├── api.js # API client
│ │ │ └── queries.js # React Query hooks
│ │ │
│ │ ├── store/
│ │ │ └── filterStore.js # Zustand store for filters
│ │ │
│ │ ├── pages/
│ │ │ ├── HomePage.jsx # Main application page
│ │ │ └── AboutPage.jsx # About/documentation
│ │ │
│ │ ├── App.jsx # Root component
│ │ ├── main.jsx # Entry point
│ │ └── index.css # Global styles
│ │
│ ├── public/
│ ├── package.json
│ ├── vite.config.js
│ ├── tailwind.config.js
│ └── README.md
│
├── .gitignore
├── FDA_CRL_API.md # API documentation (created)
├── TODO.md # This file
└── README.md # Project overview
- Create project directory structure
- Initialize Git repository
- Set up `.gitignore` (exclude `.env`, `data/`, `*.duckdb`, `node_modules/`)
- Create backend virtual environment
- Create `requirements.txt` with initial dependencies
- Create `.env.example` template
- Initialize frontend with Vite + React
Dependencies (requirements.txt):
fastapi==0.104.1
uvicorn[standard]==0.24.0
duckdb==0.9.2
pydantic==2.5.0
pydantic-settings==2.1.0
python-dotenv==1.0.0
httpx==0.25.1
pandas==2.1.3
openai==1.3.5
apscheduler==3.10.4
python-multipart==0.0.6
openpyxl==3.1.2
numpy==1.26.2
- Create `app/config.py` with Pydantic Settings
- Define environment variables:
  - `OPENAI_API_KEY` (required, never exposed to frontend)
  - `DATABASE_PATH` (default: `./data/crl_explorer.duckdb`)
  - `FDA_BULK_APPROVED_URL`
  - `FDA_BULK_UNAPPROVED_URL`
  - `SCHEDULE_HOUR` (default: 2 AM)
  - `LOG_LEVEL` (default: INFO)
  - `CORS_ORIGINS` (allowed frontend origins)
- Add validation for required environment variables
- Create `.env.example` with placeholder values
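The config items above can be sketched in plain Python. This is a stdlib-only stand-in for illustration; the plan specifies Pydantic Settings, which replaces this boilerplate with typed, validated fields. The Vite default origin used as the CORS fallback is an assumption.

```python
import os

# Hypothetical stdlib sketch of app/config.py; the real module would use
# pydantic-settings. Variable names match the list above.
class Settings:
    def __init__(self, env=os.environ):
        # Required: fail fast if the OpenAI key is missing.
        self.openai_api_key = env.get("OPENAI_API_KEY")
        if not self.openai_api_key:
            raise ValueError("OPENAI_API_KEY is required")
        self.database_path = env.get("DATABASE_PATH", "./data/crl_explorer.duckdb")
        self.schedule_hour = int(env.get("SCHEDULE_HOUR", "2"))
        self.log_level = env.get("LOG_LEVEL", "INFO")
        # Assumed default: Vite dev server origin.
        self.cors_origins = env.get("CORS_ORIGINS", "http://localhost:5173").split(",")
```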
- Create `app/utils/logging_config.py`
- Configure structured logging (JSON format)
- Set up file rotation
- Add request ID tracking for API calls
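A minimal sketch of the structured-logging piece, assuming a custom JSON formatter (field names here are illustrative; rotation would layer `logging.handlers.RotatingFileHandler` on top):

```python
import json
import logging

# Sketch of app/utils/logging_config.py: emit one JSON object per log line.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # request_id would be attached per-request by API middleware.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)

def setup_logging(level="INFO"):
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=level, handlers=[handler], force=True)
```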
- Create `app/database.py` with DuckDB connection manager
- Implement connection pooling/singleton pattern
- Create `app/schemas.py` with all table definitions (SQL)
- Write database initialization function (`init_db()`)
- Create indexes for common queries:
  - `approval_status`
  - `letter_year`
  - `company_name`
  - `letter_date`
- Create repository pattern classes in `app/database.py`:
  - `CRLRepository` (CRUD operations for CRLs)
  - `SummaryRepository`
  - `EmbeddingRepository`
  - `QARepository`
- Implement query builders for filtering/sorting
- Add pagination support
- Write helper functions for vector similarity search
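The vector-similarity helpers reduce to a dot product once embeddings are L2-normalized at write time. A minimal sketch (function names are illustrative):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize a vector so cosine similarity becomes a dot product."""
    return v / np.linalg.norm(v)

def top_k_similar(query: np.ndarray, matrix: np.ndarray, k: int = 5) -> np.ndarray:
    # matrix: (n, d) array of normalized embeddings; returns indices of the
    # k most similar rows, best first.
    scores = matrix @ normalize(query)
    return np.argsort(scores)[::-1][:k]
```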
- Create test database setup
- Write unit tests for repositories
- Test data insertion and retrieval
- Verify index performance
- Install pytest and pytest-cov
- Create pytest.ini configuration file
- Set up test directory structure
- Configure coverage reporting (HTML, terminal, and XML)
- Add pytest-asyncio for async test support
- Add pytest-mock for mocking support
- Set up .coveragerc configuration
- Configure minimum coverage thresholds
- Set up coverage reports in CI/CD (future)
- Add coverage badges to README (future)
- Write tests for `app/config.py` (settings validation)
- Write tests for `app/utils/logging_config.py`
- Write tests for `app/database.py` (connection management)
- Write tests for `app/schemas.py` (table creation)
- Achieve >80% coverage for core modules
- Create test fixtures for database connections
- Create mock factories for test data
- Set up test database (in-memory or temporary)
- Create helper functions for common test scenarios
Testing Dependencies to Add:
pytest==7.4.3
pytest-cov==4.1.0
pytest-asyncio==0.21.1
pytest-mock==3.12.0
coverage[toml]==7.3.2
- Create `app/services/data_ingestion.py`
- Implement `download_bulk_data()` function:
  - Download CRL JSON ZIP file from FDA
  - Extract JSON files
  - Validate downloads
- Add retry logic with exponential backoff (using tenacity)
- Implement progress logging
- Store raw files in `data/raw/`
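The retry-with-backoff step above can be sketched as follows. The fetch callable is injected so the loop can be exercised with a stub; the real implementation would use httpx for the download (or tenacity's `@retry` decorator instead of the hand-rolled loop), and a base delay around one second rather than the tiny demo value here.

```python
import time

# Sketch of download_bulk_data()'s retry loop with exponential backoff.
def download_with_retry(fetch, url, max_attempts=4, base_delay=0.01):
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted all attempts
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s, ...
```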
- Create `app/services/data_processor.py`
- Implement `parse_crl_data()` function:
  - Parse JSON structure
  - Validate required fields
  - Transform dates (MM/DD/YYYY and YYYYMMDD → YYYY-MM-DD)
  - Generate unique CRL IDs
- Implement `detect_new_crls()` function:
  - Compare with existing database
  - Identify new records
  - Identify updated records
- Implement `store_crls()` function:
  - Bulk insert new CRLs
  - Update existing CRLs if changed
  - Track metadata (`last_download_date`)
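The date normalization and ID generation described above might look like this sketch. The exact ID format is an assumption; the short hash suffix mirrors the collision-resolution-by-hashing approach mentioned in the testing notes below.

```python
import hashlib
from datetime import datetime

def parse_letter_date(raw: str) -> str:
    """Normalize the two observed formats (MM/DD/YYYY, YYYYMMDD) to ISO."""
    for fmt in ("%m/%d/%Y", "%Y%m%d"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")

def make_crl_id(application_numbers: list, letter_date: str) -> str:
    """Derive a stable id from application number(s) plus letter date."""
    base = f"{'_'.join(application_numbers)}_{letter_date}"
    # Short hash suffix keeps ids unique if two letters share number + date.
    suffix = hashlib.sha256(base.encode()).hexdigest()[:8]
    return f"{base.replace(' ', '-')}_{suffix}"
```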
- Create sample CRL data for testing (47 comprehensive tests)
- Test download error handling (network errors, timeouts, retry logic)
- Test duplicate detection (ID collision resolution with hashing)
- Verify data integrity after processing (date parsing, edge cases)
- Test all Phase 3 services with 100% test pass rate
- Create `app/utils/openai_client.py`
- Implement wrapper class with:
  - API key management (from config)
  - Rate limiting (exponential backoff via OpenAI client)
  - Error handling and retries
  - Token usage tracking
  - Timeout configuration
- Add logging for all API calls
- Implement cost estimation
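The token-tracking and cost-estimation pieces are plain arithmetic and can be sketched independently of the API client. The per-token prices below are illustrative assumptions, not authoritative: pin them in config and update as OpenAI's pricing changes.

```python
# Assumed USD prices per 1M tokens; verify against current OpenAI pricing.
PRICE_PER_1M = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

class UsageTracker:
    """Accumulates token usage and estimates spend for the wrapper class."""

    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

    def estimated_cost(self, model: str = "gpt-4o-mini") -> float:
        p = PRICE_PER_1M[model]
        return (self.prompt_tokens * p["input"]
                + self.completion_tokens * p["output"]) / 1_000_000
```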
- Create `app/services/summarization.py`
- Design prompt template focused on deficiencies and key points
- Implement `summarize_crl(text)` function:
  - Call GPT-4o-mini API
  - Extract summary from response
  - Track tokens used
- Implement batch processing with progress tracking
- Add retry logic for failures (via OpenAI client)
- Create CLI script `generate_summaries.py` for batch processing
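A sketch of `summarize_crl()` against the OpenAI v1 chat-completions API. The client is injected so the function can be exercised with a stub, and the prompt wording is illustrative, not the project's actual template.

```python
# Illustrative system prompt; the real template is designed separately.
PROMPT = ("Summarize this FDA Complete Response Letter in one paragraph, "
          "focusing on the stated deficiencies, not boilerplate.")

def summarize_crl(client, text: str, model: str = "gpt-4o-mini") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return {
        "summary": response.choices[0].message.content,
        "model": model,
        "tokens_used": response.usage.total_tokens,
    }
```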
- Create `app/services/embeddings.py`
- Implement `generate_embedding(text)` function:
  - Use `text-embedding-3-small` (1536 dimensions)
  - Handle text truncation (max 8191 tokens)
  - Return numpy array
- Implement batch embedding generation
- Create embeddings for summaries
- Normalize vectors for cosine similarity
- Create CLI script `generate_embeddings.py` for batch processing
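A sketch of `generate_embedding()`. Again the client is injected for testability; the character-count truncation is a crude stand-in for the real token-based 8191 limit (a tokenizer such as tiktoken would count properly), and normalization is done here so similarity search later is a plain dot product.

```python
import math

MAX_CHARS = 8191 * 4  # rough heuristic: ~4 characters per token (assumption)

def generate_embedding(client, text: str,
                       model: str = "text-embedding-3-small") -> list:
    response = client.embeddings.create(model=model, input=text[:MAX_CHARS])
    vector = response.data[0].embedding
    # L2-normalize so cosine similarity reduces to a dot product.
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]
```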
- Test summarization with sample CRLs (dry-run mode)
- Verify embedding dimensions
- Test batch processing
- Validate error handling and retries
- Monitor token usage and costs
- 23 comprehensive tests covering all AI services
- Create `app/main.py` with FastAPI app
- Configure CORS (restrict to frontend origin)
- Add health check endpoint: `GET /health` with database stats
- Include API routers
- Add startup event to initialize database
- Configure exception handlers (404, 500)
- Add lifespan context manager for startup/shutdown
- Auto-generated OpenAPI documentation at `/docs`
- Create `app/models.py` with request/response models:
  - `CRLListItem` (list view)
  - `CRLWithSummary` (detail view with AI summary)
  - `CRLWithText` (full text view)
  - `CRLListResponse` (paginated list)
  - `StatsOverview` (statistics)
  - `CompanyStats` (company statistics)
  - `QARequest` (question with validation)
  - `QAResponse` (answer with citations)
  - `QAHistoryItem` and `QAHistoryResponse`
  - `HealthResponse` (health check)
- Create `app/api/crls.py`
- Implement endpoints:
  - `GET /api/crls` - List CRLs with filtering, sorting, pagination
    - Query params: approval_status, letter_year, company_name, search_text, limit (1-100), offset, sort_by, sort_order
    - Return: paginated list with total count and has_more flag
  - `GET /api/crls/{crl_id}` - Get single CRL with summary
    - Return: full CRL metadata + AI summary
  - `GET /api/crls/{crl_id}/text` - Get full letter text
    - Return: CRL with complete letter text
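The filtering and pagination behind `GET /api/crls` can be sketched as a query builder (illustrative, not the actual handler; only a subset of the documented params is shown). Placeholders keep user input out of the SQL string, and sort columns are allow-listed since column names cannot be parameterized.

```python
def build_crl_query(approval_status=None, letter_year=None, search_text=None,
                    sort_by="letter_date", sort_order="desc",
                    limit=25, offset=0):
    allowed_sort = {"letter_date", "company_name", "letter_year"}
    if sort_by not in allowed_sort:
        raise ValueError(f"cannot sort by {sort_by!r}")
    clauses, params = [], []
    if approval_status:
        clauses.append("approval_status = ?")
        params.append(approval_status)
    if letter_year:
        clauses.append("letter_year = ?")
        params.append(letter_year)
    if search_text:
        clauses.append("text ILIKE ?")  # case-insensitive LIKE (DuckDB)
        params.append(f"%{search_text}%")
    where = f"WHERE {' AND '.join(clauses)}" if clauses else ""
    order = "DESC" if sort_order.lower() == "desc" else "ASC"
    limit = max(1, min(limit, 100))  # clamp to the documented 1-100 range
    sql = (f"SELECT * FROM crls {where} ORDER BY {sort_by} {order} "
           f"LIMIT ? OFFSET ?")
    return sql, params + [limit, offset]
```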
- Create `app/api/stats.py`
- Implement endpoints:
  - `GET /api/stats/overview` - Overall statistics
    - Total CRLs, by_status breakdown, by_year breakdown
  - `GET /api/stats/companies` - Top companies by CRL count
    - Sorted by CRL count with approved/unapproved breakdown
- Create `app/api/export.py` (deferred - not MVP critical)
- Create `app/services/export_service.py`
- Implement CSV/Excel export functionality
- Write API tests for all endpoints (comprehensive tests)
- Test filtering and pagination
- Test error responses (404, 400, 422, 500)
- Test CORS configuration
- Test health check with database stats
- Test API documentation accessibility
- Tests use in-memory database with mock data
- FastAPI application with health checks
- DuckDB database with optimized schema
- Data ingestion pipeline from FDA API
- AI-powered summarization service
- REST API endpoints for CRLs and statistics
- Comprehensive test coverage
- Docker deployment setup
- React + Vite application
- Beautiful UI with Tailwind CSS + shadcn/ui
- Interactive CRL table with filtering and sorting
- Detail modal with AI summaries
- Statistics dashboard
- Responsive design
- Production deployment ready
A GitHub Actions workflow runs daily to check if FDA data has changed. If changes are detected, it:
- Downloads fresh FDA data
- Runs the complete AI processing pipeline
- Uploads the new database to GitHub Releases
backend/check_for_updates.py - Change Detection Script
- Downloads FDA bulk data and computes SHA256 hash
- Compares with previously stored hash in database
- Exit code 0 = data changed, exit code 1 = no change
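The core of that change detection is a hash comparison; a minimal sketch (the function name is illustrative, and the previous hash would be read from the `processing_metadata` table):

```python
import hashlib

def data_changed(payload: bytes, previous_hash=None):
    """Return (changed?, current_hash) for the downloaded FDA bulk data."""
    current = hashlib.sha256(payload).hexdigest()
    return current != previous_hash, current
```

The script would then map the boolean to the exit codes described above (0 when the data changed, 1 when it did not).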
backend/ingest_data_ci.py - Non-Interactive Pipeline
- CI/CD-friendly version of `ingest_data.py`
- No prompts, no confirmations
- Runs all 7 pipeline steps automatically
.github/workflows/update-data.yml - GitHub Actions Workflow
- Runs daily at 2 AM UTC
- Checks for FDA data changes
- If changed: runs pipeline and creates new GitHub Release
- Database uploaded to:
https://github.com/{repo}/releases/download/data-{date}/crl_explorer.duckdb
- Add GitHub Secret: `OPENAI_API_KEY`
  - Go to repo Settings → Secrets and variables → Actions
  - Add new secret named `OPENAI_API_KEY`
- Manual Trigger (optional):
  - Go to Actions → "Update CRL Data" → Run workflow
  - Check "Force update" to bypass change detection
Download latest database:
# Find latest release
gh release list --repo armish/crl.help
# Download database
gh release download data-2025-01-15 --pattern "*.duckdb" --repo armish/crl.help

Use with Docker:

docker run -d -p 80:80 \
  -e DATABASE_URL=https://github.com/armish/crl.help/releases/download/data-2025-01-15/crl_explorer.duckdb \
  ghcr.io/armish/crl.help:latest

- OpenAI API key stored in environment variable
- API key never exposed to frontend
- API key never logged
- .env file in .gitignore
- Pre-commit hook to prevent accidental secret exposure
- CORS restricted to specific origins
- All user inputs validated
- HTTPS in production (via hosting platform)
- No hardcoded secrets in code
- Perfect for analytics: Optimized for read-heavy workloads
- Embedded: No separate database server needed
- JSON support: Native JSON columns for raw data storage
- Small footprint: Perfect for ~400 records
- Serverless-friendly: Single file database, easy backups
- Single paragraph (3-5 sentences)
- Focus on deficiencies, not boilerplate
- GPT-4o-mini for cost efficiency
- Cache summaries to avoid re-processing
- Manual bulk download when needed
- Full re-ingestion is fast (~5 minutes)
- Simple and reliable
- Automate only if data changes frequently
Summarization (GPT-4o-mini):
- Total cost for initial setup: < $0.10
- Ongoing costs: Minimal (few new CRLs per week/month)
Note: Very cost-effective due to small dataset size.