Turn any AI coding agent into a self-learning knowledge engine.
A modular, instruction-driven framework that teaches AI agents (Claude, GPT, Cursor, Codex, Gemini) to build, maintain, and evolve a structured personal knowledge base, powered by local-first NLP and automatic context indexing.
Quick Start · Features · Architecture · Examples · Русский
AI Knowledge Engine is a set of modular Markdown instructions that any AI coding agent can read and execute to:
- **Index your codebase** – pack your project into a single AI-readable context file (5-minute setup)
- **Build a knowledge base** – create a full Raw-First Knowledge Pipeline with NLP enrichment, provenance tracking, self-learning, and automated maintenance
No SaaS. No API keys. No cloud. Everything runs locally on your machine.
Indexer-agnostic: Ships with Repomix support out of the box, but the architecture is designed to work with any codebase-to-context tool.
| Mode | What you get | Setup time |
|---|---|---|
| **Lite** – `quick-start/` | AI-optimized codebase index with auto-update on git commit | ~5 minutes |
| **Full** – `knowledge-base/` | Personal knowledge engine with NLP, self-learning loop, AI review queue, health checks, and smart scheduling | ~30 minutes |
```bash
# 1. Install the indexer
npm install -g repomix

# 2. Copy quick-start/ into your project
cp -r quick-start/ /path/to/your-project/docs/ai-init/

# 3. Tell your AI agent:
"Read docs/ai-init/INIT_GUIDE.md and set up context indexing for this project"
```

Your AI agent will analyze the project structure, configure the indexer, set up git hooks, and generate the first context snapshot.
```bash
# 1. Install dependencies
npm install -g repomix
pip install pyyaml python-slugify python-docx python-pptx pypdf pandas openpyxl
pip install spacy rake-nltk keybert
python3 -m spacy download ru_core_news_md  # Russian NLP (swap for your language)

# 2. Copy knowledge-base/ into your project
cp -r knowledge-base/ /path/to/your-project/setup/

# 3. Tell your AI agent:
"Read setup/README.md and deploy a knowledge base for [your role]"
```

The agent will ask clarifying questions about your role, create the folder structure, configure NLP pipelines, and run the first indexing pass.
- **One-command setup** – the AI agent handles the entire configuration
- **Auto-update** – git hooks regenerate the context on every commit
- **Stack-aware** – pre-configured patterns for React, Rust, Python, Go, Node.js, and more
- **Security scanning** – detects leaked secrets before indexing
- **Token budget control** – Tree-sitter compression reduces token count by 50-70%
- **Profile support** – separate context files per subsystem (backend, frontend, infra)
- **Raw-First Pipeline** – drop PDFs, DOCX, PPTX into `raw/`; they are auto-converted to Markdown, NLP-enriched, and promoted to clean knowledge
- **Self-Learning Loop** – `!save` captures sessions, `!reflect` distills higher-level insights, `!audit` runs a comprehensive review
- **Cross-Linked Knowledge** – `[[wikilinks]]` plus routing tables for scalable navigation across hundreds of pages
- **NLP Enrichment** – Named Entity Recognition, keyword extraction, and entity resolution (spaCy + KeyBERT); zero tokens, pure CPU
- **Provenance Tracking** – every fact traced to its original source with hash verification and span-level citations
- **Surprise Filter** – anti-duplication: only genuinely new information enters the knowledge base
- **Health Checks** – Python-based lint for stale pages, orphans, broken links, and contradictions
- **Smart Scheduling** – auto-reflection when an importance threshold is reached; skips when idle to save tokens
- **Privacy-by-Default** – raw data, reviews, and interaction logs are never indexed
- **Fully Portable** – pure Markdown files, no databases, no servers; works on any machine with `rsync`
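To make the surprise filter concrete, here is a minimal stdlib sketch of the idea behind it, assuming a simple keyword-overlap heuristic; the shipped pipeline uses spaCy/KeyBERT, and both function names below are illustrative:

```python
def keyword_set(text: str) -> set[str]:
    """Crude keyword extraction: lowercase words longer than 3 characters."""
    return {w.strip(".,;:!?").lower() for w in text.split() if len(w) > 3}

def is_surprising(new_text: str, known_texts: list[str], threshold: float = 0.6) -> bool:
    """Accept a note only if its keywords don't heavily overlap any existing note."""
    new_kw = keyword_set(new_text)
    if not new_kw:
        return False
    for known in known_texts:
        overlap = len(new_kw & keyword_set(known)) / len(new_kw)
        if overlap >= threshold:
            return False  # mostly duplicate: skip ingestion
    return True
```

In default mode this kind of cheap overlap check costs zero tokens; super mode replaces it with an AI judgment call.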
```
User ──────────────┐
                   ▼
             ┌───────────┐
             │   raw/    │  ← PDFs, DOCX, notes, chats, screenshots
             └─────┬─────┘
                   │  kb_ingest.py (Python + NLP)
                   ▼
             ┌───────────┐
             │ processed/│  ← Markdown + NLP metadata (0 tokens)
             └─────┬─────┘
                   │
      ┌────────────┼────────────┐
      ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│knowledge/│ │ review/  │ │ assets/  │
│ (clean)  │ │(complex) │ │(binary)  │
└────┬─────┘ └──────────┘ └──────────┘
     │
     ▼  context indexer
┌────────────┐
│ output.xml │  ← AI-ready context snapshot
└────────────┘
```
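The raw → processed hop above can be sketched in a few lines. This is a simplified stand-in for `kb_ingest.py` (the function name and frontmatter fields are assumptions), showing only conversion plus provenance hashing, without the NLP step:

```python
import hashlib
from pathlib import Path

def ingest(raw_file: Path, processed_dir: Path) -> Path:
    """Convert a raw text file to Markdown with provenance metadata."""
    text = raw_file.read_text(encoding="utf-8")
    # Hash the source so every fact can later be traced and verified
    source_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    out = processed_dir / (raw_file.stem + ".md")
    out.write_text(
        "---\n"
        f"source: {raw_file.name}\n"
        f"source_hash: sha256:{source_hash}\n"
        "---\n\n" + text,
        encoding="utf-8",
    )
    return out
```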
Token consumption depends on the operating mode (`mode` in `kb.config.yml`):
| Tier | Operations | `default` | `super` |
|---|---|---|---|
| Python (free) | NLP, lint L1, conversion | 0 tok | 0 tok |
| Light AI | Importance scoring | ~500 | ~1-2K |
| Mode-switched | Surprise filter, annotations, entity resolution | 0 tok (Python) | ~3-9K tok (AI) |
| Heavy AI | Reflection, deep review, writeback | ~15-100K | ~15-100K |
| Mode | Daily (active) | Weekly |
|---|---|---|
| `default` | ~3-4K tokens | ~20-30K |
| `super` | ~50-200K+ tokens | ~500K-1.5M |
The full mode creates a rich, opinionated folder hierarchy:
```
your-project/
├── AGENTS.md              # AI agent instructions (auto-generated)
├── kb.config.yml          # KB configuration (role, entities, rules)
├── watcher.sh             # Watch mode (auto-process new files)
├── reindex.sh             # Manual reindex trigger
├── scripts/               # Python automation
│   ├── kb_ingest.py       # Raw → processed → knowledge pipeline
│   ├── kb_lint.py         # Health check (Level 1: Python)
│   ├── kb_reflect.py      # Reflection (synthesize insights)
│   ├── kb_watch.py        # File watcher daemon
│   └── kb_nlp_batch.py    # Batch NLP re-enrichment
│
├── raw/                   # Raw materials (NOT indexed)
│   ├── documents/unsorted/
│   ├── reference/unsorted/
│   └── media/unsorted/
│
├── knowledge/             # ✅ Clean knowledge (INDEXED)
│   ├── domain/            # Facts, market, technology
│   ├── playbooks/         # Repeatable workflows
│   ├── decisions/         # Immutable decision records
│   ├── principles/        # Rules, beliefs, standards
│   ├── insights/          # Higher-level synthesis
│   ├── opinions/          # Subjective assessments with confidence
│   ├── routing/           # Navigation tables for large bases
│   └── open-questions/    # Unresolved questions
│
├── interactions/          # Session logs (NOT indexed directly)
│   └── sessions/
│
└── review/                # AI review queue (NOT indexed)
    ├── needs-classification/
    ├── needs-ai-decision/
    └── needs-redaction/
```
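The tree above can be scaffolded idempotently in a few lines; a minimal sketch assuming the directory names shown (the real init module handles this during setup):

```python
from pathlib import Path

KB_DIRS = [
    "raw/documents/unsorted", "raw/reference/unsorted", "raw/media/unsorted",
    "knowledge/domain", "knowledge/playbooks", "knowledge/decisions",
    "knowledge/principles", "knowledge/insights", "knowledge/opinions",
    "knowledge/routing", "knowledge/open-questions",
    "interactions/sessions",
    "review/needs-classification", "review/needs-ai-decision",
    "review/needs-redaction",
    "scripts",
]

def scaffold(root: Path) -> None:
    """Create the knowledge-base folder tree (safe to re-run)."""
    for d in KB_DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
```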
Commands you tell your AI agent in the IDE chat:
| Command | What it does | Cost | When to use |
|---|---|---|---|
| `!save` | Save a session summary with key decisions and insights | ~2K tokens | After productive sessions (45+ min) |
| `!reflect` | Synthesize higher-level insights from accumulated facts | ~15K tokens | Auto-triggered or on demand |
| `!audit` | Full AI review: contradictions, gaps, merge candidates | ~50-100K tokens | Every 2-4 weeks |
| `!super` | Toggle operating mode: default ↔ super | 0 tokens | When you need maximum learning speed |
| `!super on/off` | Explicitly enable/disable super mode | 0 tokens | See Operating Modes |
| `!super status` | Show the current operating mode | 0 tokens | Quick check |
The system supports two operating modes, controlled by the `!super` command:
| Mode | Paradigm | Token Cost | Best For |
|---|---|---|---|
| `default` | Python-first, throttled | ~3-4K/day | Limited token budgets, daily use |
| `super` | AI-first, on-demand | ~50-200K+/day | Unlimited plans, intensive knowledge building |
Default mode uses Python-first processing (NLP, heuristic filters) and throttled AI schedules. AI is called only for importance scoring and large document surprise checks.
Super mode replaces Python heuristics with full AI reasoning for every operation:
- **Semantic surprise detection** – AI evaluates every ingest for genuinely new information (+40% accuracy vs. Python NLP overlap)
- **Intelligent annotations** – AI generates meaningful cross-references with suggested edits (+60% usefulness vs. template annotations)
- **Cross-language entity resolution** – AI understands synonyms and multilingual variants (+30% coverage)
- **On-demand reflection** – triggers after every significant ingest (importance ≥ 5) instead of the weekly schedule
- **Daily AI audit** – Lint Level 2 runs automatically during consolidation
- **Auto review processing** – `review/needs-ai-decision/` is processed without waiting for `!audit`
⚠️ Warning: Super mode can consume your entire daily token budget in a single active session. Use it only with unlimited or high-limit AI plans.
The system tracks an importance score for each ingested item. When the cumulative score exceeds a threshold (default: 25, super: 5), reflection runs automatically. If nothing has changed, no tokens are spent.
```
Days without reflection:  1  2  3  4  5  6  7  8  9
Changes?                  -  -  -  -  -  -  -  -  ✓
                                                  │
                                                  ▼
                                 Trigger! (>7 days + changes exist)
```
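That trigger logic can be sketched as follows, assuming importance scores are summed per ingest (the function name and signature are illustrative, not the shipped API):

```python
def should_reflect(scores: list[int], days_since_last: int,
                   threshold: int = 25, max_idle_days: int = 7) -> bool:
    """Reflect when accumulated importance passes the threshold,
    or when changes exist and reflection is overdue."""
    total = sum(scores)
    if total >= threshold:
        return True  # enough new material: reflect now
    # Overdue, but only if there is actually something new to digest
    return days_since_last > max_idle_days and total > 0
```

The second branch is what keeps idle periods free: with no accumulated scores, no reflection runs and no tokens are spent.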
The knowledge base system is built from modular instruction files that the AI agent reads sequentially:
| # | Module | Purpose |
|---|---|---|
| 00 | Overview | Deployment map: what to read, what to copy, in what order (read first) |
| 01 | Prerequisites | Environment check: Node.js, Python, Git, indexer |
| 02 | Init | Role clarification, entity selection, folder creation |
| 03 | Pipeline | Python ingest script: conversion + NLP + source hashing |
| 04 | Review | AI review workflow for complex/ambiguous materials |
| 05 | Index | Context indexing, `[[wikilinks]]`, routing tables |
| 06 | Agents Template | AGENTS.md template with token budget |
| 07 | Interaction Loop | Self-learning + session capture + query writeback |
| 08 | Portable | Portability + Dynamic Context Enrichment |
| 09 | Lint | Health checks: Level 1 (Python) + Level 2 (AI) + `--metrics` |
| 10 | Log | Append-only operation chronology |
| 11 | Provenance | Source hash, span citations, regression tests |
| 12 | NLP Preprocess | NER + keyword extraction + entity resolution |
| 13 | Autorun | File watcher, git hooks, smart scheduling |
| 14 | Initial Population | Generate role-specific DATA_PLACEMENT_EXAMPLES.md |
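As one illustration of what a Level 1 (Python) lint check might do, here is a hedged sketch that flags broken `[[wikilinks]]`; the shipped `kb_lint.py` covers more cases, and its actual implementation may differ:

```python
import re
from pathlib import Path

WIKILINK = re.compile(r"\[\[([^\]|#]+)")  # [[Page]], [[Page|alias]], [[Page#section]]

def broken_wikilinks(kb_root: Path) -> list[tuple[str, str]]:
    """Return (file, link) pairs where [[link]] has no matching page."""
    pages = {p.stem for p in kb_root.rglob("*.md")}
    broken = []
    for md in kb_root.rglob("*.md"):
        for m in WIKILINK.finditer(md.read_text(encoding="utf-8")):
            target = m.group(1).strip()
            if target not in pages:
                broken.append((md.name, target))
    return broken
```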
Pre-configured examples in `knowledge-base/examples/`:
| Template | Role | Highlights |
|---|---|---|
| `programmer-senior.yml` | Senior Software Engineer | Architecture, debugging, tech stack, code principles |
| `marketing-director.yml` | Marketing Director | Strategy, brand, campaigns, audience analysis |
| `creative-hybrid.yml` | Creative Hybrid | Code + music production + indie gamedev |
| `product-manager.yml` | Product Manager | Prioritization, metrics, user research, PRDs |
| `researcher.yml` | Researcher / Analyst | Literature graph, hypotheses, methodology |
| `founder.yml` | Startup Founder | Investors, hiring, product, decision logs |
| `content-creator.yml` | Content Creator | Voice fingerprinting, audience, monetization |
| `fiction-writer.yml` | Fiction Writer | Craft theory, voice training from influences, draft critique |
Don't see your role? Tell the AI agent your profession, and it will generate a custom configuration with relevant entities, knowledge paths, and example workflows.
Tested and designed for:
| Agent | Status | Notes |
|---|---|---|
| Claude (Anthropic) | ✅ Fully supported | Cursor, API, Claude Desktop |
| GPT-4 / GPT-4o | ✅ Fully supported | Cursor, Copilot, ChatGPT |
| Codex CLI | ✅ Fully supported | OpenAI Codex |
| Gemini | ✅ Fully supported | JetBrains AI, Google AI Studio |
| Any Markdown-capable agent | ✅ Compatible | Must read `.md` files and run shell commands |
| Component | Minimum | Required for |
|---|---|---|
| Node.js | 20.0+ | Context indexer (Repomix) |
| Python | 3.11+ | Ingest pipeline, NLP, lint |
| Git | any | Hooks, history tracking |
| IDE with AI | required | Agent interaction |
```bash
# Ubuntu / Debian
sudo apt install -y pandoc poppler-utils tesseract-ocr

# macOS
brew install pandoc poppler tesseract
```

Contributions are welcome! Here's how you can help:
- **Translations** – translate instruction modules into other languages
- **Role Templates** – add `examples/*.yml` configs for new professions
- **Pipeline Scripts** – improve the Python ingest, NLP, and lint scripts
- **Documentation** – clarify instructions, add diagrams, fix typos
- **Testing** – try the framework with different AI agents and report compatibility
Please open an issue before starting major work to discuss the approach.
MIT – free for personal and commercial use.
Built for humans who talk to AI.
If this project helps you build a better knowledge workflow, give it a ⭐ star.