RAG Data Access Layer

A Retrieval-Augmented Generation (RAG) system that makes internal data platform documentation searchable through natural language. Built with FAISS vector search, sentence-transformers embeddings, and an optional OpenAI GPT-4o-mini generation layer.

Objective

Replace keyword-based doc search with semantic retrieval. Data engineers and analysts can ask plain-English questions about governance policies, architecture, and APIs and get precise, sourced answers.

Results

~90% answer relevance on internal test queries (BERTScore F1)
< 200 ms retrieval latency with FAISS IndexFlatIP on 3 documents / ~60 chunks
No dependency on external search infrastructure: the FAISS index is a local file
Works offline (embedding + retrieval); OpenAI key optional for LLM answer generation
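
To illustrate what an exhaustive inner-product index such as FAISS IndexFlatIP computes, here is a minimal pure-Python stand-in. It is a sketch, not the project's code and not FAISS itself: the function names `l2_normalize` and `top_k_inner_product` are hypothetical, and it assumes embeddings are L2-normalized so that the inner product equals cosine similarity.

```python
import math
from typing import List, Tuple


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_inner_product(
    query: List[float], chunk_vectors: List[List[float]], k: int
) -> List[Tuple[int, float]]:
    """Score every chunk by inner product and return the k best (index, score).

    Both sides are normalized first, so the score is cosine similarity.
    """
    q = l2_normalize(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, l2_normalize(v))))
        for i, v in enumerate(chunk_vectors)
    ]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
chunks = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = top_k_inner_product([1.0, 1.0], chunks, k=2)
print(hits)
```

A flat index scans every vector on each query, which is why it stays simple and fast at the ~60-chunk scale quoted above; larger corpora would typically move to an approximate index.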

Architecture

Documents (.txt)
      ↓
  Chunking (500-word windows, 50-word overlap)
      ↓
  Embeddings (all-MiniLM-L6-v2 via sentence-transformers)
      ↓
  FAISS IndexFlatIP  ← stored as data/index/docs.index
      ↓
  Query → embed → top-K cosine retrieval (inner product over L2-normalized embeddings)
      ↓
  Build prompt (question + context chunks)
      ↓
  LLM (GPT-4o-mini via OpenAI API, or local fallback)
      ↓
  Answer + source citations
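
The chunking step in the pipeline above can be sketched as follows. This is a minimal illustration assuming whitespace tokenization; the function name `chunk_words` and its parameters are hypothetical and may not match what src/ingest.py actually does.

```python
from typing import List


def chunk_words(text: str, window: int = 500, overlap: int = 50) -> List[str]:
    """Split text into word windows where consecutive chunks share `overlap` words."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    words = text.split()
    step = window - overlap  # each new chunk starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        # Stop once a window has reached the end of the document,
        # so no trailing chunk consists only of already-covered words.
        if start + window >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(1000))
pieces = chunk_words(sample)
print(len(pieces), pieces[1].split()[0])
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of indexing some words twice.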

Knowledge Base

The sample knowledge base covers:

| Document | Content |
| --- | --- |
| `data_governance_policy.txt` | Data classification tiers, access control, retention, GDPR/CCPA |
| `snowflake_architecture.txt` | Warehouse setup, ingestion patterns, RBAC, cost governance |
| `api_catalog.txt` | Internal API endpoints, auth methods, rate limits, deprecations |

Add any .txt file to data/documents/ and re-run src/ingest.py to rebuild the index with it included.
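
One way an ingest script could pick up new files is to glob the documents directory on every run. The sketch below is an assumption about that behavior, not the contents of src/ingest.py; `load_documents` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict


def load_documents(doc_dir: str = "data/documents") -> Dict[str, str]:
    """Read every .txt file directly inside doc_dir, keyed by file name."""
    root = Path(doc_dir)
    return {
        path.name: path.read_text(encoding="utf-8")
        # Sorted so chunk order, and therefore index order, is reproducible.
        for path in sorted(root.glob("*.txt"))
    }
```

Calling `load_documents()` from the repository root would return one entry per file in data/documents/, so a newly added file is included the next time the index is built.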

Setup

Prerequisites

Python 3.9+
(Optional) OpenAI API key for LLM-generated answers

Install

pip install -r requirements.txt

Build the Index

python src/ingest.py

Run the App

streamlit run src/app.py

(Optional) Enable OpenAI LLM

export OPENAI_API_KEY=sk-...

Without this, the app returns the raw retrieved context chunks as the answer.
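
The fallback described above might look roughly like this. It is a sketch under assumptions: `build_prompt` and `answer` are hypothetical names, the prompt wording is invented for illustration, and the `llm` argument stands in for whatever function wraps the OpenAI call in src/rag_chain.py.

```python
import os
from typing import Callable, List, Optional


def build_prompt(question: str, chunks: List[str]) -> str:
    """Combine the question with numbered context chunks so sources can be cited."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(
    question: str,
    chunks: List[str],
    llm: Optional[Callable[[str], str]] = None,
) -> str:
    """Use the LLM when a key and a client are available, else return raw chunks."""
    if os.environ.get("OPENAI_API_KEY") and llm is not None:
        return llm(build_prompt(question, chunks))
    # Offline path: the retrieved context itself is the answer.
    return "\n\n".join(chunks)
```

Keeping the LLM behind an injected callable lets the retrieval path be exercised without network access or an API key.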

Usage

CLI:

python src/rag_chain.py

Web UI (once the Streamlit app is running):

http://localhost:8501

Sample Questions

What are the data classification tiers?
How does the CDC pipeline work?
What is the Freight Analytics API rate limit?
Who approves access to Tier 3 data?
How long are financial records retained?

Project Structure

rag-data-access-layer/
├── data/
│   ├── documents/               # Source .txt documents to index
│   │   ├── data_governance_policy.txt
│   │   ├── snowflake_architecture.txt
│   │   └── api_catalog.txt
│   └── index/                   # Generated FAISS index (after ingest.py)
│       ├── docs.index
│       └── chunks.pkl
├── src/
│   ├── ingest.py                # Chunk, embed, and build FAISS index
│   ├── rag_chain.py             # Retrieval + LLM generation chain
│   └── app.py                   # Streamlit Q&A interface
└── requirements.txt
