An intelligent engine for unifying and querying your internal knowledge corpus—support tickets, knowledge bases, and other documents—using the power of Retrieval-Augmented Generation.
Key Features • Architecture • Technology Stack • Getting Started • Contributing
CorpusAI transforms your scattered internal documents into a centralized, intelligent knowledge base. It moves beyond simple keyword search by leveraging a Retrieval-Augmented Generation (RAG) pipeline to understand the context and intent of natural language queries.
Whether you're resolving support tickets, onboarding new team members, or searching for specific information in a vast sea of documents, CorpusAI provides precise, context-aware answers by consulting the original source material.
## Key Features

- Unified Knowledge Access: Ingest and search across diverse document types (PDFs, text files, etc.) in a single, unified space.
- Natural Language Querying: Ask questions in plain English, just as you would ask a human expert.
- Context-Aware Answers: Get direct answers synthesized by an LLM, grounded in the specific information found within your documents.
- Source Verification: Responses are grounded directly in your provided documents, reducing hallucinations and letting you verify every answer against its source.
- Local & Private: Powered by Ollama, the entire pipeline can run locally on your machine, ensuring your sensitive data remains secure.
- Performance Monitoring: Built-in logging and performance utilities to track and optimize query times and resource usage.
- Intuitive Web Interface: A clean and interactive UI built with Streamlit for easy document management and querying.
## Architecture

CorpusAI operates on a RAG pipeline that breaks down into two core phases:

### Phase 1: Ingestion & Indexing
- Document Loading: Your documents (support tickets, knowledge bases, etc.) are loaded into the system.
- Intelligent Chunking: The documents are segmented into smaller, semantically meaningful chunks.
- Embedding & Storage: Each chunk is converted into a vector embedding and stored in a ChromaDB vector database alongside its source metadata. This creates a searchable index of your knowledge corpus.
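The chunking step above can be sketched in a few lines. This is a minimal, character-based chunker with overlap, written in plain Python for illustration; a production splitter (such as LangChain's `RecursiveCharacterTextSplitter`) would additionally respect paragraph and sentence boundaries:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping, fixed-size chunks.

    Overlap preserves context that would otherwise be cut at a chunk
    boundary, which helps retrieval quality at a small storage cost.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the final chunk already covers the end of the text
    return chunks
```

Each resulting chunk would then be embedded and written to the ChromaDB collection along with its source metadata (filename, position, etc.).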
### Phase 2: Retrieval & Generation

- User Query: A user asks a question in the Streamlit UI.
- Semantic Retrieval: The query is embedded, and a similarity search is performed against the ChromaDB index to retrieve the most relevant document chunks.
- Context Augmentation: The retrieved chunks are injected as context into a prompt template.
- LLM Generation: The augmented prompt is sent to a local LLM (via Ollama), which synthesizes a coherent, human-readable answer based only on the provided context.
- Response Display: The final answer is displayed in the UI.
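The retrieval and context-augmentation steps above can be sketched as follows. This is a simplified stand-in, not the project's actual implementation: a brute-force cosine-similarity search plays the role of the ChromaDB index, and the prompt template is an illustrative example of how retrieved chunks are injected before the call to the LLM:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, top_k=2):
    """Return the top_k chunks most similar to the query embedding.

    `index` is a list of (embedding, chunk_text) pairs, standing in for
    the ChromaDB collection's similarity search.
    """
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[0]),
                    reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

# Illustrative template; the key idea is instructing the model to answer
# ONLY from the retrieved context, which is what grounds the response.
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the context does not contain the answer, say so.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, chunks):
    """Inject the retrieved chunks into the prompt sent to the local LLM."""
    return PROMPT_TEMPLATE.format(context="\n\n---\n\n".join(chunks),
                                  question=question)
```

The resulting prompt string is what gets sent to the Ollama-hosted model in the generation step.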
## Technology Stack

- Backend & UI: Streamlit
- RAG Orchestration: LangChain
- Vector Database: ChromaDB
- Local LLM Server: Ollama (for running models like Llama 3, Mistral)
- Embeddings Model: Hugging Face Sentence Transformers (`all-MiniLM-L6-v2`)
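Given this stack, the project's `requirements.txt` presumably pulls in something like the following (package names are the standard PyPI distributions; the exact set and version pins are assumptions, not taken from the repository):

```text
streamlit
langchain
langchain-community
chromadb
sentence-transformers
ollama
```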
## Getting Started

Follow these steps to set up and run CorpusAI on your local machine.

### Prerequisites
- Python 3.9+
- Ollama installed and running.
### Installation

1. Clone the Repository:

   ```bash
   git clone https://github.com/your-username/CorpusAI.git
   cd CorpusAI
   ```
2. Create a Virtual Environment (Recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```
3. Install Dependencies:

   ```bash
   pip install -r requirements.txt
   ```
4. Set Up Your Local LLM with Ollama:
   - First, ensure the Ollama application is running in the background.
   - Pull a model from the command line. We recommend starting with Llama 3:

     ```bash
     ollama pull llama3
     ```
## Usage

1. Launch the Streamlit app:

   ```bash
   streamlit run app.py
   ```

2. Open your web browser and navigate to the local URL provided (usually `http://localhost:8501`).
3. Ingest Documents: Use the interface to upload your PDF files or connect to a data source. CorpusAI will automatically process and index them.
4. Ask Questions: Once your documents are indexed, use the chat input to ask questions about their content.
5. Receive Answers: Get concise, AI-generated answers directly in the chat interface.
## Contributing

Contributions are welcome! If you have ideas for new features, bug fixes, or improvements, please feel free to open an issue or submit a pull request.
1. Fork the repository.
2. Create a new branch (`git checkout -b feature/your-feature-name`).
3. Commit your changes (`git commit -m 'Add some feature'`).
4. Push to the branch (`git push origin feature/your-feature-name`).
5. Open a Pull Request.
## License

This project is licensed under the MIT License. See the LICENSE file for more details.