A Retrieval-Augmented Generation (RAG) system that makes internal data platform documentation searchable through natural language. Built with FAISS vector search, sentence-transformers embeddings, and an optional OpenAI GPT-4o-mini generation layer.
Replace keyword-based doc search with semantic retrieval. Data engineers and analysts can ask plain-English questions about governance policies, architecture, and APIs and get precise, sourced answers.
- ~90% answer relevance on internal test queries (BERTScore F1)
- < 200ms retrieval latency with FAISS IndexFlatIP on 3 documents / ~60 chunks
- Zero dependency on external search infrastructure — FAISS index is a local file
- Works offline (embedding + retrieval); OpenAI key optional for LLM answer generation
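FAISS IndexFlatIP ranks vectors by exact inner product, while the pipeline below describes the retrieval step as cosine. The two agree only when vectors are L2-normalized. Whether ingest.py normalizes its embeddings is not stated here, so the following is a minimal pure-Python sketch of the idea under that assumption, not the project's code. The helper names (`l2_normalize`, `inner_product`, `top_k`) are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query, vectors, k):
    """Return (index, score) pairs for the k highest inner products.

    Because both sides are normalized first, each score equals the
    cosine similarity between the query and the stored vector.
    """
    q = l2_normalize(query)
    scored = [(i, inner_product(q, l2_normalize(v))) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-D "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
chunk_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 2.0]]
results = top_k([2.0, 0.0], chunk_vectors, k=2)
```

Vector length does not affect the ranking here: the query `[2.0, 0.0]` and the stored `[0.0, 2.0]` are both rescaled before scoring, so only direction matters.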
Documents (.txt)
↓
Chunking (500-word windows, 50-word overlap)
↓
Embeddings (all-MiniLM-L6-v2 via sentence-transformers)
↓
FAISS IndexFlatIP ← stored as data/index/docs.index
↓
Query → embed → top-K cosine retrieval
↓
Build prompt (question + context chunks)
↓
LLM (GPT-4o-mini via OpenAI API, or local fallback)
↓
Answer + source citations
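The chunking step in the pipeline above (500-word windows, 50-word overlap) can be sketched as follows. This is an illustrative version that splits on whitespace. The function name `chunk_words` and its handling of the final short window are assumptions, and ingest.py may do this differently.

```python
def chunk_words(text, window=500, overlap=50):
    """Split text into word windows that share `overlap` words with the next one."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    words = text.split()
    step = window - overlap  # each window starts 450 words after the previous one
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # this window already reached the end of the text
    return chunks

# A synthetic 1200-word document yields three chunks: 500, 500, and 300 words.
sample = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_words(sample)
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of indexing some words twice.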
The sample knowledge base covers:
| Document | Content |
|---|---|
| data_governance_policy.txt | Data classification tiers, access control, retention, GDPR/CCPA |
| snowflake_architecture.txt | Warehouse setup, ingestion patterns, RBAC, cost governance |
| api_catalog.txt | Internal API endpoints, auth methods, rate limits, deprecations |
Add any .txt file to data/documents/ and re-run ingest.py to index it.
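How ingest.py discovers documents is not shown here. As a rough sketch, picking up every .txt file in data/documents/ could look like this; `find_documents` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path

def find_documents(doc_dir="data/documents"):
    """Return the .txt files directly inside doc_dir, sorted by path."""
    root = Path(doc_dir)
    if not root.is_dir():
        return []  # nothing to index if the folder is missing
    return sorted(p for p in root.glob("*.txt") if p.is_file())

if __name__ == "__main__":
    for path in find_documents():
        print(path.name)
```

Sorting keeps the chunk order stable between runs, which helps when the stored chunk list has to stay aligned with the vectors in the index.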
- Python 3.9+
- (Optional) OpenAI API key for LLM-generated answers
Install dependencies:
pip install -r requirements.txt
Build the index:
python src/ingest.py
Start the web UI:
streamlit run src/app.py
Optionally, set an OpenAI API key:
export OPENAI_API_KEY=sk-...
Without the key, the app returns the raw retrieved context chunks as the answer.
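The fallback described above can be sketched roughly as follows. The prompt wording, the function names `build_prompt` and `answer`, and the `generate` callback (standing in for the OpenAI call) are all assumptions for illustration; rag_chain.py may be structured differently.

```python
import os

def build_prompt(question, chunks):
    """Combine the question with numbered context chunks into one prompt."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def answer(question, chunks, generate=None, env=None):
    """Use the LLM when a key and a generator are available, else return raw chunks."""
    env = os.environ if env is None else env
    if generate is None or not env.get("OPENAI_API_KEY"):
        # No key configured: fall back to the retrieved context itself.
        return "\n\n".join(chunks)
    return generate(build_prompt(question, chunks))

# Without a key, the retrieved chunks come back unchanged.
fallback = answer("What are the tiers?", ["chunk one", "chunk two"], env={})
```

Passing the generator in as a callback keeps the retrieval path testable offline, in line with the offline mode listed in the features.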
CLI:
python src/rag_chain.py
Web UI:
http://localhost:8501
- What are the data classification tiers?
- How does the CDC pipeline work?
- What is the Freight Analytics API rate limit?
- Who approves access to Tier 3 data?
- How long is data retained for financial records?
rag-data-access-layer/
├── data/
│   ├── documents/            # Source .txt documents to index
│   │   ├── data_governance_policy.txt
│   │   ├── snowflake_architecture.txt
│   │   └── api_catalog.txt
│   └── index/                # Generated FAISS index (after ingest.py)
│       ├── docs.index
│       └── chunks.pkl
├── src/
│   ├── ingest.py             # Chunk, embed, and build FAISS index
│   ├── rag_chain.py          # Retrieval + LLM generation chain
│   └── app.py                # Streamlit Q&A interface
└── requirements.txt