A local RAG (Retrieval-Augmented Generation) chatbot that answers housing-related questions using a vector database and a local or containerized LLM. It is designed to help users understand rental assistance programs, affordable housing listings, and tenant rights using real local documents instead of generic internet data.
## Table of Contents

- Overview
- How It Works
- Prerequisites
- Local Setup (CPU)
- Run Ingestion Pipeline in Google Colab (Optional)
- GPU Environment (Optional)
- Troubleshooting
## Overview

This project is a housing-focused chatbot built on a RAG pipeline:
- A parsing and ingestion pipeline processes local housing PDFs / web pages into structured chunks.
- A vector store (backed by precomputed embeddings in `all_parsed_docs.joblib`) is used for semantic retrieval.
- A backend service builds prompts with citations from the retrieved chunks and calls a local or containerized LLM.
- A simple web UI exposes a chat interface where users can ask housing questions and see cited sources.
The goal is to provide grounded, citation-rich answers using official local documents, improving reliability over generic LLM responses.
## How It Works

The ingestion step runs in a notebook: `ingesting/Ingest_pipeline.ipynb`.
At a high level, it:
- Loads source materials (PDFs, web archives, etc.).
- Cleans and normalizes the text (removing boilerplate where possible).
- Splits the content into overlapping chunks suitable for semantic search.
- Computes embeddings for each chunk.
- Stores the resulting list of chunks, metadata, and embeddings in a single file: `all_parsed_docs.joblib`.
This file is later loaded by the backend to initialize or populate the vector index.
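The notebook is the source of truth for this step. As a rough sketch of the flow above, a minimal version might look like the following (the chunk size, overlap, embedding model, and field names here are illustrative assumptions, not the notebook's actual choices):

```python
import joblib
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split cleaned text into overlapping chunks for semantic search."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# Illustrative embedder; the notebook may use a different model.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder input: in the notebook this comes from parsed PDFs/web archives.
docs = {"tenant_rights.pdf": "…extracted and cleaned text…"}

all_parsed_docs = []
for path, text in docs.items():
    for i, chunk in enumerate(chunk_text(text)):
        all_parsed_docs.append({
            "title": path,
            "path": path,
            "chunk_id": i,
            "text": chunk,
            "embedding": model.encode(chunk),
        })

joblib.dump(all_parsed_docs, "all_parsed_docs.joblib")
```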
When a user asks a question in the chat UI:
- The backend receives the user message and recent conversation history.
- A semantic search step retrieves the top-k most relevant chunks from the vector index (backed by `all_parsed_docs.joblib`).
- The system constructs a prompt that includes:
  - A system message describing the assistant’s role (housing-focused, grounded in documents).
  - The current conversation history.
  - The retrieved chunks, numbered so the model can cite them as [1], [2], etc.
- The LLM generates an answer using only the provided context for policy details, and includes inline citations like [1].
- The backend returns both:
  - The answer text.
  - The list of citations (with title, path, and snippet) so the UI can show “View details” for each source.
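A minimal sketch of this retrieval and prompt-building step, reusing the chunk records from the ingestion sketch above (the embedding model, field names, and brute-force cosine scoring are assumptions; the backend may use a proper vector index instead):

```python
import joblib
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder
docs = joblib.load("data/all_parsed_docs.joblib")

def retrieve(query: str, k: int = 5) -> list[dict]:
    """Return the top-k chunks ranked by cosine similarity to the query."""
    q = model.encode(query)
    q = q / np.linalg.norm(q)
    scored = []
    for d in docs:
        e = np.asarray(d["embedding"])
        scored.append((float(q @ (e / np.linalg.norm(e))), d))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number the retrieved chunks so the model can cite them as [1], [2], etc."""
    context = "\n\n".join(
        f"[{i + 1}] {c['title']}\n{c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "You are a housing assistant. Answer using ONLY the numbered context "
        "below, and cite sources inline as [1], [2], etc.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

chunks = retrieve("How do I apply for rental assistance?")
prompt = build_prompt("How do I apply for rental assistance?", chunks)
```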
The frontend (HTML + CSS + JS) provides:
- A chat window for messages from the user and the bot.
- A “View sources” area listing the retrieved documents.
- Clickable citation markers like [1], [2] within the bot message that open a modal with:
  - Title
  - Archive/source type
  - Source path / URL
  - A short snippet of the original text
This makes it easy for users to verify where each answer came from.
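For illustration, the backend response the UI renders from might look like this (a hypothetical shape based on the fields described above; the actual API contract may differ):

```python
# Hypothetical response shape; real field names may differ.
response = {
    "answer": "You may qualify for emergency rental assistance [1].",
    "citations": [
        {
            "id": 1,
            "title": "…document title…",
            "source_type": "…archive/source type…",
            "path": "…source path or URL…",
            "snippet": "…short excerpt of the original text…",
        }
    ],
}
```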
## Prerequisites

- Docker and Docker Compose installed
- (For GPU mode) NVIDIA GPU + drivers + NVIDIA Container Toolkit
- (Optional) Google Colab account if you want to run the ingestion pipeline there
## Local Setup (CPU)

```bash
# Clone the repository
git clone <repo-url>
cd NLP_Housing_Vector_DB   # adjust if your repo folder name is different

# Build and start the app (CPU)
docker compose -f docker-compose.cpu.yml up --build
```

Once the containers are up, open your browser at the port defined in `docker-compose.cpu.yml` (for example: `http://localhost:<port>`).
## Run Ingestion Pipeline in Google Colab (Optional)

The ingestion pipeline parses the housing PDFs/web data and creates a single joblib file with all chunks.
- Open `ingesting/Ingest_pipeline.ipynb` in Google Colab.
- Run all cells.
It will generate `all_parsed_docs.joblib`.
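Optionally, sanity-check the file before moving it. This assumes the notebook stores a list of chunk records, as sketched earlier:

```python
import joblib

docs = joblib.load("all_parsed_docs.joblib")
print(len(docs), "chunks")
print(docs[0])  # inspect one record's fields (e.g. title, path, text, embedding)
```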
Now move this file into the project `data/` directory so the app can load it.
If your repo is cloned in Colab as `NLP_Housing_Vector_DB/`:
```
%cd /content/NLP_Housing_Vector_DB
!mkdir -p data
!mv all_parsed_docs.joblib data/
```

If you instead downloaded the file locally, move it on your machine to `NLP_Housing_Vector_DB/data/all_parsed_docs.joblib`.
Then restart the app so it picks up the new data:

```bash
docker compose -f docker-compose.cpu.yml restart app
```

## GPU Environment (Optional)

Use this if you have a GPU-capable machine and want faster LLM inference.
```bash
# Clone the repository
git clone <repo-url>
cd NLP_Housing_Vector_DB

# Build and start the app (GPU)
docker compose -f docker-compose.gpu.yml up --build
```

Make sure `data/all_parsed_docs.joblib` exists in the same project directory on the host:
```bash
mkdir -p data
mv all_parsed_docs.joblib data/   # if you haven’t moved it already
```

Then restart the GPU app:

```bash
docker compose -f docker-compose.gpu.yml restart app
```

Open your browser at the port defined in `docker-compose.gpu.yml`.
## Troubleshooting

- **Containers won’t start / crash immediately**

  Run:

  ```bash
  docker compose -f docker-compose.cpu.yml logs app
  ```

  or for GPU:

  ```bash
  docker compose -f docker-compose.gpu.yml logs app
  ```

  and check for missing files or configuration errors.
- **App says it cannot find `all_parsed_docs.joblib`**
  - Confirm the file exists at `NLP_Housing_Vector_DB/data/all_parsed_docs.joblib`.
  - Ensure the `data/` directory is mounted correctly in the relevant `docker-compose.*.yml` file.
  - Restart the app container after adding the file.
- **GPU not being used**
  - Verify `nvidia-smi` works on the host.
  - Check that the GPU service in `docker-compose.gpu.yml` is configured for the NVIDIA runtime or equivalent.
  - Make sure you started the project with the GPU compose file:

    ```bash
    docker compose -f docker-compose.gpu.yml up --build
    ```
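If the backend image includes PyTorch (an assumption; the project may use a different inference runtime), a quick in-container check of GPU visibility is:

```python
# Assumes PyTorch is installed in the backend image; adapt to the actual runtime.
import torch

print(torch.cuda.is_available())  # True if the container can see a GPU
print(torch.cuda.device_count())  # number of visible GPUs
```

Run it inside the app container (for example via `docker compose -f docker-compose.gpu.yml exec app python`) so the result reflects the container’s view of the GPU rather than the host’s.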