An intelligent engine for unifying and querying your internal knowledge corpus—support tickets, knowledge bases, and other documents—using the power of Retrieval-Augmented Generation.
Key Features • Architecture • Technology Stack • Getting Started • Contributing
CorpusAI transforms your scattered internal documents into a centralized, intelligent knowledge base. It moves beyond simple keyword search by leveraging a Retrieval-Augmented Generation (RAG) pipeline to understand the context and intent of natural language queries.
Whether you're resolving support tickets, onboarding new team members, or searching for specific information in a vast sea of documents, CorpusAI provides precise, context-aware answers by consulting the original source material.
## Key Features

- Unified Knowledge Access: Ingest and search across diverse document types (PDFs, text files, etc.) in a single, unified space.
- Natural Language Querying: Ask questions in plain English, just as you would ask a human expert.
- Context-Aware Answers: Get direct answers synthesized by an LLM, grounded in the specific information found within your documents.
- Source Verification: Responses are grounded directly in your provided documents, reducing hallucinations and letting you verify every answer against its source.
- Local & Private: Powered by Ollama, the entire pipeline can run locally on your machine, ensuring your sensitive data remains secure.
- Performance Monitoring: Built-in logging and performance utilities to track and optimize query times and resource usage.
- Intuitive Web Interface: A clean and interactive UI built with Streamlit for easy document management and querying.
## Architecture

CorpusAI operates on a RAG pipeline that breaks down into two core phases:

### Phase 1: Ingestion & Indexing
- Document Loading: Your documents (support tickets, knowledge bases, etc.) are loaded into the system.
- Intelligent Chunking: The documents are segmented into smaller, semantically meaningful chunks.
- Embedding & Storage: Each chunk is converted into a vector embedding and stored in a ChromaDB vector database alongside its source metadata. This creates a searchable index of your knowledge corpus.
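The chunking step above can be sketched in a few lines. This is a minimal, character-based chunker with overlap, written in plain Python for illustration; a production splitter (such as LangChain's `RecursiveCharacterTextSplitter`) would additionally respect paragraph and sentence boundaries:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping, fixed-size chunks.

    Overlap preserves context that would otherwise be cut at a chunk
    boundary, which helps retrieval quality at a small storage cost.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the final chunk already covers the end of the text
    return chunks
```

Each resulting chunk would then be embedded and written to the ChromaDB collection along with its source metadata (filename, position, etc.).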
### Phase 2: Retrieval & Generation

- User Query: A user asks a question in the Streamlit UI.
- Semantic Retrieval: The query is embedded, and a similarity search is performed against the ChromaDB index to retrieve the most relevant document chunks.
- Context Augmentation: The retrieved chunks are injected as context into a prompt template.
- LLM Generation: The augmented prompt is sent to a local LLM (via Ollama), which synthesizes a coherent, human-readable answer based only on the provided context.
- Response Display: The final answer is displayed in the UI.
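The retrieval and context-augmentation steps above can be sketched as follows. This is a simplified stand-in, not the project's actual implementation: a brute-force cosine-similarity search plays the role of the ChromaDB index, and the prompt template is an illustrative example of how retrieved chunks are injected before the call to the LLM:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, top_k=2):
    """Return the top_k chunks most similar to the query embedding.

    `index` is a list of (embedding, chunk_text) pairs, standing in for
    the ChromaDB collection's similarity search.
    """
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[0]),
                    reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

# Illustrative template; the key idea is instructing the model to answer
# ONLY from the retrieved context, which is what grounds the response.
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the context does not contain the answer, say so.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, chunks):
    """Inject the retrieved chunks into the prompt sent to the local LLM."""
    return PROMPT_TEMPLATE.format(context="\n\n---\n\n".join(chunks),
                                  question=question)
```

The resulting prompt string is what gets sent to the Ollama-hosted model in the generation step.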
## Technology Stack

- Backend & UI: Streamlit
- RAG Orchestration: LangChain
- Vector Database: ChromaDB
- Local LLM Server: Ollama (for running models like Llama 3, Mistral)
- Embeddings Model: Hugging Face Sentence Transformers (`all-MiniLM-L6-v2`)
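Given this stack, the project's `requirements.txt` presumably pulls in something like the following (package names are the standard PyPI distributions; the exact set and version pins are assumptions, not taken from the repository):

```text
streamlit
langchain
langchain-community
chromadb
sentence-transformers
ollama
```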
## Getting Started

Follow these steps to set up and run CorpusAI on your local machine.

### Prerequisites
- Python 3.9+
- Ollama installed and running.
### Installation

1. Clone the Repository:

   ```bash
   git clone https://github.com/your-username/CorpusAI.git
   cd CorpusAI
   ```
2. Create a Virtual Environment (Recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```
3. Install Dependencies:

   ```bash
   pip install -r requirements.txt
   ```
4. Set Up Your Local LLM with Ollama:
   - First, ensure the Ollama application is running in the background.
   - Pull a model from the command line. We recommend starting with Llama 3:

     ```bash
     ollama pull llama3
     ```
## Usage

1. Launch the Streamlit app:

   ```bash
   streamlit run app.py
   ```

2. Open your web browser and navigate to the local URL provided (usually `http://localhost:8501`).
3. Ingest Documents: Use the interface to upload your PDF files or connect to a data source. CorpusAI will automatically process and index them.
4. Ask Questions: Once your documents are indexed, use the chat input to ask questions about their content.
5. Receive Answers: Get concise, AI-generated answers directly in the chat interface.
## Contributing

Contributions are welcome! If you have ideas for new features, bug fixes, or improvements, please feel free to open an issue or submit a pull request.
1. Fork the repository.
2. Create a new branch (`git checkout -b feature/your-feature-name`).
3. Commit your changes (`git commit -m 'Add some feature'`).
4. Push to the branch (`git push origin feature/your-feature-name`).
5. Open a Pull Request.
## License

This project is licensed under the MIT License. See the LICENSE file for more details.