NLP Housing Vector DB

A local RAG (Retrieval-Augmented Generation) chatbot that answers housing-related questions using a vector database and a local or containerized LLM. It is designed to help users understand topics such as rental assistance programs, affordable housing listings, and tenant rights, drawing on real local documents instead of internet data.


Table of Contents

  1. Overview
  2. How It Works
  3. Prerequisites
  4. Local Setup (CPU)
  5. Run Ingestion Pipeline in Google Colab (Optional)
  6. GPU Environment (Optional)
  7. Troubleshooting

Overview

This project is a housing-focused chatbot built on a RAG pipeline:

  • A parsing and ingestion pipeline processes local housing PDFs / web pages into structured chunks.
  • A vector store (backed by precomputed embeddings in all_parsed_docs.joblib) is used for semantic retrieval.
  • A backend service builds prompts with citations from the retrieved chunks and calls a local or containerized LLM.
  • A simple web UI exposes a chat interface where users can ask housing questions and see cited sources.

The goal is to provide grounded, citation-rich answers using official local documents, improving reliability over generic LLM responses.


How It Works

2.1 Data Ingestion

The ingestion step runs in a notebook: ingesting/Ingest_pipeline.ipynb.

At a high level, it:

  1. Loads source materials (PDFs, web archives, etc.).
  2. Cleans and normalizes the text (removing boilerplate where possible).
  3. Splits the content into overlapping chunks suitable for semantic search.
  4. Computes embeddings for each chunk.
  5. Stores the resulting list of chunks, metadata, and embeddings into a single file:
all_parsed_docs.joblib

This file is later loaded by the backend to initialize or populate the vector index.
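The chunking step (3) can be sketched in Python. This is a minimal illustration assuming character-based windows; the notebook's actual chunk size, overlap, and embedding model are not specified here:

```python
# A minimal sketch of steps 3-5, assuming character-based windows; the
# notebook's real chunk size, overlap, and embedding model may differ.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows suitable for semantic search.

    Each window repeats the tail of the previous one so that sentences
    are not cut off at a chunk boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# Step 5 then pairs each chunk with metadata and an embedding, and
# persists everything in one file, e.g. (illustrative field names):
#   records = [{"text": c, "source": path, "embedding": embed(c)} for c in chunks]
#   joblib.dump(records, "all_parsed_docs.joblib")
```

The overlap trades a little storage for retrieval quality: a fact that straddles a boundary still appears whole in at least one chunk.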

2.2 Retrieval & Generation Flow

When a user asks a question in the chat UI:

  1. The backend receives the user message and recent conversation history.
  2. A semantic search step retrieves the top-k most relevant chunks from the vector store (backed by all_parsed_docs.joblib).
  3. The system constructs a prompt that includes:
    • A system message describing the assistant’s role (housing-focused, grounded in documents).
    • The current conversation history.
    • The retrieved chunks, numbered so the model can cite them as [1], [2], etc.
  4. The LLM generates an answer using only the provided context for policy details, and includes inline citations like [1].
  5. The backend returns both:
    • The answer text.
    • The list of citations (with title, path, and snippet) so the UI can show “View details” for each source.
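The retrieval and prompt-assembly steps above can be sketched as follows. This is a hedged illustration, not the backend's actual code: the cosine-similarity search, prompt wording, and record field names (`text`, `embedding`) are assumptions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], docs: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["embedding"]), reverse=True)
    return ranked[:k]

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number the retrieved chunks so the model can cite them as [1], [2], ..."""
    context = "\n".join(f"[{i}] {c['text']}" for i, c in enumerate(chunks, start=1))
    return (
        "You are a housing assistant. Answer using ONLY the context below "
        "and cite sources inline as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Because the chunks are numbered in the prompt, a citation marker like [2] in the model's answer can be mapped directly back to the second retrieved chunk's title, path, and snippet for the UI.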

2.3 Frontend

The frontend (HTML + CSS + JS) provides:

  • A chat window for messages from the user and the bot.
  • A “View sources” area listing the retrieved documents.
  • Clickable citation markers like [1], [2] within the bot message that open a modal with:
    • Title
    • Archive/source type
    • Source path / URL
    • A short snippet of the original text

This makes it easy for users to verify where each answer came from.


Prerequisites

  • Docker and Docker Compose installed
  • (For GPU mode) NVIDIA GPU + drivers + NVIDIA Container Toolkit
  • (Optional) Google Colab account if you want to run the ingestion pipeline there

Local Setup (CPU)

# Clone the repository
git clone <repo-url>
cd NLP_Housing_Vector_DB   # adjust if your repo folder name is different

# Build and start the app (CPU)
docker compose -f docker-compose.cpu.yml up --build

Once the containers are up, open your browser at the port defined in docker-compose.cpu.yml (for example: http://localhost:<port>).


Run Ingestion Pipeline in Google Colab (Optional)

The ingestion pipeline parses the housing PDFs/web data and creates a single joblib file with all chunks.

  1. Open ingesting/Ingest_pipeline.ipynb in Google Colab.
  2. Run all cells.

It will generate:

all_parsed_docs.joblib

Now move this file into the project data/ directory so the app can load it.

If your repo is cloned in Colab as NLP_Housing_Vector_DB/:

%cd /content/NLP_Housing_Vector_DB
!mkdir -p data
!mv all_parsed_docs.joblib data/

If you instead downloaded the file locally, move it on your machine into:

NLP_Housing_Vector_DB/data/all_parsed_docs.joblib

Then restart the app so it picks up the new data:

docker compose -f docker-compose.cpu.yml restart app

GPU Environment (Optional)

Use this if you have a GPU-capable machine and want faster LLM inference.

# Clone the repository
git clone <repo-url>
cd NLP_Housing_Vector_DB

# Build and start the app (GPU)
docker compose -f docker-compose.gpu.yml up --build

Make sure data/all_parsed_docs.joblib exists in the same project directory on the host:

mkdir -p data
mv all_parsed_docs.joblib data/   # if you haven’t moved it already

Then restart the GPU app:

docker compose -f docker-compose.gpu.yml restart app

Open your browser at the port defined in docker-compose.gpu.yml.


Troubleshooting

  • Containers won’t start / crash immediately

    Run:

    docker compose -f docker-compose.cpu.yml logs app

    or for GPU:

    docker compose -f docker-compose.gpu.yml logs app

    and check for missing files or configuration errors.

  • App says it cannot find all_parsed_docs.joblib

    • Confirm the file exists at:
      NLP_Housing_Vector_DB/data/all_parsed_docs.joblib
      
    • Ensure the data/ directory is mounted correctly in the relevant docker-compose.*.yml file.
    • Restart the app container after adding the file.
  • GPU not being used

    • Verify nvidia-smi works on the host.
    • Check that the GPU service in docker-compose.gpu.yml is configured for the NVIDIA runtime or equivalent.
    • Make sure you started the project with the GPU compose file:
      docker compose -f docker-compose.gpu.yml up --build

About

A Retrieval-Augmented Generation (RAG) system that provides contextually relevant housing assistance information (e.g., rental assistance programs, affordable housing listings, tenant rights), tailored for students and low-income populations in some California cities.
