Rachana-Raveendran/rag-data-access-layer

RAG Data Access Layer

A Retrieval-Augmented Generation (RAG) system that makes internal data platform documentation searchable through natural language. Built with FAISS vector search, sentence-transformers embeddings, and an optional OpenAI GPT-4o-mini generation layer.

Objective

Replace keyword-based doc search with semantic retrieval. Data engineers and analysts can ask plain-English questions about governance policies, architecture, and APIs and get precise, sourced answers.

Results

  • ~90% answer relevance on internal test queries, measured as BERTScore F1
  • < 200 ms retrieval latency with FAISS IndexFlatIP on the sample corpus (3 documents, ~60 chunks)
  • Zero dependency on external search infrastructure — FAISS index is a local file
  • Works offline (embedding + retrieval); OpenAI key optional for LLM answer generation
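
IndexFlatIP is an exact (exhaustive) index that ranks by raw inner product, so it yields cosine similarity only when both the stored vectors and the query are L2-normalized. The sketch below illustrates that relationship in plain Python, without FAISS. The function names are hypothetical, and it is an assumption here that ingest.py normalizes its embeddings before indexing.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_inner_product(query, vectors, k=3):
    """Exhaustive inner-product search over normalized vectors.

    After normalization the inner product equals cosine similarity,
    so the ranking no longer depends on vector magnitude.
    Returns a list of (index, score) pairs, best match first.
    """
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(v))), i)
        for i, v in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]
```

For example, `top_k_inner_product([3, 0], [[1, 0], [0, 1], [2, 2]], k=2)` ranks vector 0 first and vector 2 second, regardless of how long the query vector is.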

Architecture

Documents (.txt)
      ↓
  Chunking (500-word windows, 50-word overlap)
      ↓
  Embeddings (all-MiniLM-L6-v2 via sentence-transformers)
      ↓
  FAISS IndexFlatIP  ← stored as data/index/docs.index
      ↓
  Query → embed → top-K cosine retrieval
      ↓
  Build prompt (question + context chunks)
      ↓
  LLM (GPT-4o-mini via OpenAI API, or local fallback)
      ↓
  Answer + source citations
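
The chunking step in the diagram (500-word windows, 50-word overlap) can be sketched as follows. This is a minimal illustration, not the code in ingest.py: the function name `chunk_words` and its parameters are hypothetical, and it assumes "words" means whitespace-separated tokens.

```python
def chunk_words(text, size=500, overlap=50):
    """Split text into windows of `size` words.

    Consecutive windows share `overlap` words, so a sentence that
    straddles a boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap  # with the defaults, each window starts 450 words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, a 1,000-word document produces three chunks of 500, 500, and 100 words, and the last 50 words of each chunk repeat as the first 50 words of the next.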

Knowledge Base

The sample knowledge base covers:

Document                     Content
data_governance_policy.txt   Data classification tiers, access control, retention, GDPR/CCPA
snowflake_architecture.txt   Warehouse setup, ingestion patterns, RBAC, cost governance
api_catalog.txt              Internal API endpoints, auth methods, rate limits, deprecations

Add any .txt file to data/documents/ and re-run ingest.py to index it.
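
As a rough sketch of how new files could be picked up at ingest time, the snippet below collects every .txt file in the documents folder. The helper name `load_documents` is hypothetical and UTF-8 encoding is an assumption. The actual discovery logic in ingest.py may differ.

```python
from pathlib import Path


def load_documents(doc_dir="data/documents"):
    """Return {filename: text} for every .txt file directly inside doc_dir."""
    docs = {}
    # sorted() keeps the order stable across runs, so chunk ids stay reproducible
    for path in sorted(Path(doc_dir).glob("*.txt")):
        docs[path.name] = path.read_text(encoding="utf-8")
    return docs
```

Files with other extensions are ignored, so dropping a new .txt file into the folder and rebuilding the index is the only step needed.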

Setup

Prerequisites

  • Python 3.9+
  • (Optional) OpenAI API key for LLM-generated answers

Install

pip install -r requirements.txt

Build the Index

python src/ingest.py

Run the App

streamlit run src/app.py

(Optional) Enable OpenAI LLM

export OPENAI_API_KEY=sk-...

Without this, the app returns the raw retrieved context chunks as the answer.
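
The fallback behaviour can be sketched like this. It is an illustration under assumptions, not the code in rag_chain.py: `build_prompt`, `answer`, the prompt wording, and the `generate` callable (standing in for the OpenAI call) are all hypothetical.

```python
import os


def build_prompt(question, chunks):
    """Combine the question with numbered context chunks into one prompt."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(question, chunks, generate=None, env=None):
    """Return an LLM answer when a key is configured, else the raw chunks.

    `generate` is a hypothetical callable that sends a prompt to the LLM
    and returns its text; `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    if not env.get("OPENAI_API_KEY") or generate is None:
        # No key (or no LLM client): fall back to the retrieved context itself
        return "\n\n".join(chunks)
    return generate(build_prompt(question, chunks))
```

Passing the LLM call in as a function keeps the retrieval path usable offline and makes the fallback easy to test without network access.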

Usage

CLI:

python src/rag_chain.py

Web UI:

http://localhost:8501

Sample Questions

  • What are the data classification tiers?
  • How does the CDC pipeline work?
  • What is the Freight Analytics API rate limit?
  • Who approves access to Tier 3 data?
  • How long are financial records retained?

Project Structure

rag-data-access-layer/
├── data/
│   ├── documents/               # Source .txt documents to index
│   │   ├── data_governance_policy.txt
│   │   ├── snowflake_architecture.txt
│   │   └── api_catalog.txt
│   └── index/                   # Generated FAISS index (after ingest.py)
│       ├── docs.index
│       └── chunks.pkl
├── src/
│   ├── ingest.py                # Chunk, embed, and build FAISS index
│   ├── rag_chain.py             # Retrieval + LLM generation chain
│   └── app.py                   # Streamlit Q&A interface
└── requirements.txt
