bionicle12/ai-knowledge-engine
# 🧠 AI Knowledge Engine

**Turn any AI coding agent into a self-learning knowledge engine.**

A modular, instruction-driven framework that teaches AI agents (Claude, GPT, Cursor, Codex, Gemini) to build, maintain, and evolve a structured personal knowledge base - powered by local-first NLP and automatic context indexing.

License: MIT · Python 3.11+ · Node.js 20+ · No cloud required

Quick Start · Features · Architecture · Examples · Russian


## What is this?

AI Knowledge Engine is a set of modular Markdown instructions that any AI coding agent can read and execute to:

1. 🗂️ **Index your codebase** - pack your project into a single AI-readable context file (5-minute setup)
2. 🧠 **Build a knowledge base** - create a full Raw-First Knowledge Pipeline with NLP enrichment, provenance tracking, self-learning, and automated maintenance

No SaaS. No API keys. No cloud. Everything runs locally on your machine.

**Indexer-agnostic:** ships with Repomix support out of the box, but the architecture is designed to work with any codebase-to-context tool.


## Two Modes

| Mode | What you get | Setup time |
|------|--------------|------------|
| **Lite** → `quick-start/` | AI-optimized codebase index with auto-update on git commit | ~5 minutes |
| **Full** → `knowledge-base/` | Personal knowledge engine with NLP, self-learning loop, AI review queue, health checks, and smart scheduling | ~30 minutes |

## Quick Start

### Lite Mode: Codebase Indexing

```bash
# 1. Install the indexer
npm install -g repomix

# 2. Copy quick-start/ into your project
cp -r quick-start/ /path/to/your-project/docs/ai-init/

# 3. Tell your AI agent:
#    "Read docs/ai-init/INIT_GUIDE.md and set up context indexing for this project"
```

Your AI agent will analyze the project structure, configure the indexer, set up git hooks, and generate the first context snapshot.
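The git-hook step the agent performs can be imagined as installing a post-commit hook along these lines. This is a hypothetical sketch written in Python for illustration; the agent configures the hook itself, and the exact hook contents and output path may differ in practice.

```python
# Hypothetical installer for a git post-commit hook that regenerates
# the Repomix context snapshot after every commit.
import stat
import tempfile
from pathlib import Path

# Hook body is an assumption: the real setup may pass different flags.
HOOK = (
    "#!/bin/sh\n"
    "# Regenerate the AI context file after each commit\n"
    "repomix --output docs/ai-init/output.xml\n"
)

def install_hook(repo_root: Path) -> Path:
    """Write an executable post-commit hook into .git/hooks/."""
    hook_path = repo_root / ".git" / "hooks" / "post-commit"
    hook_path.parent.mkdir(parents=True, exist_ok=True)
    hook_path.write_text(HOOK)
    hook_path.chmod(hook_path.stat().st_mode | stat.S_IXUSR)  # make executable
    return hook_path

path = install_hook(Path(tempfile.mkdtemp()))
print(path.name)  # post-commit
```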

### Full Mode: Knowledge Base

```bash
# 1. Install dependencies
npm install -g repomix
pip install pyyaml python-slugify python-docx python-pptx pypdf pandas openpyxl
pip install spacy rake-nltk keybert
python3 -m spacy download ru_core_news_md  # Russian NLP model (swap for your language)

# 2. Copy knowledge-base/ into your project
cp -r knowledge-base/ /path/to/your-project/setup/

# 3. Tell your AI agent:
#    "Read setup/README.md and deploy a knowledge base for [your role]"
```

The agent will ask clarifying questions about your role, create the folder structure, configure NLP pipelines, and run the first indexing pass.


## Features

### Lite Mode

- 📦 **One-command setup** - the AI agent handles the entire configuration
- 🔄 **Auto-update** - git hooks regenerate context on every commit
- 🎯 **Stack-aware** - pre-configured patterns for React, Rust, Python, Go, Node.js, and more
- 🔒 **Security scanning** - detects leaked secrets before indexing
- 📊 **Token budget control** - Tree-sitter compression reduces token count by 50-70%
- 📂 **Profile support** - separate context files per subsystem (backend, frontend, infra)

### Full Mode

- 🔬 **Raw-First Pipeline** - drop PDFs, DOCX, PPTX into `raw/` → auto-convert to Markdown → NLP enrichment → clean knowledge
- 🧠 **Self-Learning Loop** - `!save` sessions, `!reflect` for higher-level insights, `!audit` for comprehensive review
- 🔗 **Cross-Linked Knowledge** - `[[wikilinks]]` + routing tables for scalable navigation across hundreds of pages
- 📊 **NLP Enrichment** - named entity recognition, keyword extraction, entity resolution (spaCy + KeyBERT); zero tokens, pure CPU
- 📜 **Provenance Tracking** - every fact traced to its original source with hash verification and span-level citations
- 🔍 **Surprise Filter** - anti-duplication: only genuinely new information enters the knowledge base
- ⚕️ **Health Checks** - Python-based lint for stale pages, orphans, broken links, and contradictions
- ⏰ **Smart Scheduling** - auto-reflection when the importance threshold is reached; skips when idle to save tokens
- 🔐 **Privacy-by-Default** - raw data, reviews, and interaction logs are never indexed
- 🌍 **Fully Portable** - pure Markdown files, no databases, no servers; works on any machine with rsync
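To make the surprise filter concrete, here is a minimal stdlib-only sketch of the idea: admit a new text only if its keyword set overlaps every existing page below a threshold. This is a hypothetical illustration, not the actual `kb_ingest.py` logic - the real pipeline uses spaCy/KeyBERT for keyword extraction.

```python
# Hypothetical sketch of a keyword-overlap surprise filter.
# The real pipeline extracts keywords with spaCy/KeyBERT; this
# stdlib version just illustrates the anti-duplication idea.
import re

def keywords(text: str) -> set[str]:
    """Crude keyword extraction: lowercase words of 4+ letters."""
    return set(re.findall(r"[a-zA-Z]{4,}", text.lower()))

def is_surprising(new_text: str, known_pages: list[str], threshold: float = 0.5) -> bool:
    """Return True if new_text shares less than `threshold` of its
    keywords with every existing page (i.e. it is genuinely new)."""
    new_kw = keywords(new_text)
    if not new_kw:
        return False
    for page in known_pages:
        overlap = len(new_kw & keywords(page)) / len(new_kw)
        if overlap >= threshold:
            return False  # mostly duplicates an existing page
    return True

pages = ["Repomix packs the whole repository into one context file."]
print(is_surprising("Repomix packs the repository into one context file.", pages))  # False: duplicate
print(is_surprising("KeyBERT extracts document keywords with embeddings.", pages))  # True: novel
```

A real implementation would also need stemming and entity resolution; the threshold here is an arbitrary placeholder.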

## Architecture

```text
                    ┌─────────────┐
User ─────────────→ │   raw/      │  ← PDFs, DOCX, notes, chats, screenshots
                    └──────┬──────┘
                           │ kb_ingest.py (Python + NLP)
                           ▼
                    ┌─────────────┐
                    │ processed/  │  ← Markdown + NLP metadata (0 tokens)
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │knowledge/│ │ review/  │ │ assets/  │
        │ (clean)  │ │(complex) │ │(binary)  │
        └────┬─────┘ └──────────┘ └──────────┘
             │
             ▼ context indexer
        ┌──────────────┐
        │  output.xml  │  ← AI-ready context snapshot
        └──────────────┘
```

## Cost Model

Token consumption depends on the operating mode (`mode` in `kb.config.yml`):

| Tier | Operations | `default` | `super` |
|------|------------|-----------|---------|
| Python (free) | NLP, lint L1, conversion | 0 tok | 0 tok |
| Light AI | Importance scoring | ~500 | ~1-2K |
| Mode-switched | Surprise filter, annotations, entity resolution | 0 tok (Python) | ~3-9K tok (AI) |
| Heavy AI | Reflection, deep review, writeback | ~15-100K | ~15-100K |

| Mode | Daily (active) | Weekly |
|------|----------------|--------|
| `default` | ~3-4K tokens | ~20-30K |
| `super` | ~50-200K+ tokens | ~500K-1.5M |

## Knowledge Structure

Full mode creates a rich, opinionated folder hierarchy:

```text
your-project/
├── AGENTS.md                   # AI agent instructions (auto-generated)
├── kb.config.yml               # KB configuration (role, entities, rules)
├── watcher.sh                  # Watch mode (auto-process new files)
├── reindex.sh                  # Manual reindex trigger
├── scripts/                    # Python automation
│   ├── kb_ingest.py            # Raw → processed → knowledge pipeline
│   ├── kb_lint.py              # Health check (Level 1: Python)
│   ├── kb_reflect.py           # Reflection (synthesize insights)
│   ├── kb_watch.py             # File watcher daemon
│   └── kb_nlp_batch.py         # Batch NLP re-enrichment
│
├── raw/                        # Raw materials (NOT indexed)
│   ├── documents/unsorted/
│   ├── reference/unsorted/
│   └── media/unsorted/
│
├── knowledge/                  # ✅ Clean knowledge (INDEXED)
│   ├── domain/                 # Facts, market, technology
│   ├── playbooks/              # Repeatable workflows
│   ├── decisions/              # Immutable decision records
│   ├── principles/             # Rules, beliefs, standards
│   ├── insights/               # Higher-level synthesis
│   ├── opinions/               # Subjective assessments with confidence
│   ├── routing/                # Navigation tables for large bases
│   └── open-questions/         # Unresolved questions
│
├── interactions/               # Session logs (NOT indexed directly)
│   └── sessions/
│
└── review/                     # AI review queue (NOT indexed)
    ├── needs-classification/
    ├── needs-ai-decision/
    └── needs-redaction/
```
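This tree can be scaffolded in a few lines of Python. The snippet below is a hypothetical convenience assuming the default folder names; in practice the 02 Init module drives folder creation based on your role.

```python
# Hypothetical scaffold of the default KB folder layout.
# Folder names mirror the documented tree; the real init step
# is driven by the agent and may add role-specific paths.
import tempfile
from pathlib import Path

KB_DIRS = [
    "scripts",
    "raw/documents/unsorted", "raw/reference/unsorted", "raw/media/unsorted",
    "knowledge/domain", "knowledge/playbooks", "knowledge/decisions",
    "knowledge/principles", "knowledge/insights", "knowledge/opinions",
    "knowledge/routing", "knowledge/open-questions",
    "interactions/sessions",
    "review/needs-classification", "review/needs-ai-decision", "review/needs-redaction",
]

def scaffold(root: Path) -> None:
    """Create every KB directory, tolerating reruns."""
    for d in KB_DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)

root = Path(tempfile.mkdtemp())
scaffold(root)
print(sorted(p.name for p in (root / "knowledge").iterdir()))
```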

## User Commands

Commands you give your AI agent in the IDE chat:

| Command | What it does | Cost | When to use |
|---------|--------------|------|-------------|
| `!save` | Save a session summary with key decisions and insights | ~2K tokens | After productive sessions (45+ min) |
| `!reflect` | Synthesize higher-level insights from accumulated facts | ~15K tokens | Auto-triggered or on demand |
| `!audit` | Full AI review: contradictions, gaps, merge candidates | ~50-100K tokens | Every 2-4 weeks |
| `!super` | Toggle operating mode: `default` ↔ `super` | 0 tokens | When you need maximum learning speed |
| `!super on/off` | Explicitly enable/disable super mode | 0 tokens | See Operating Modes |
| `!super status` | Show the current operating mode | 0 tokens | Quick check |

## Operating Modes

The system supports two operating modes, controlled by the `!super` command:

| Mode | Paradigm | Token cost | Best for |
|------|----------|------------|----------|
| `default` | Python-first, throttled | ~3-4K/day | Limited token budgets, daily use |
| `super` | AI-first, on-demand | ~50-200K+/day | Unlimited plans, intensive knowledge building |

**Default mode** uses Python-first processing (NLP, heuristic filters) and throttled AI schedules. AI is called only for importance scoring and surprise checks on large documents.

**Super mode** replaces Python heuristics with full AI reasoning for every operation:

- 🔍 **Semantic surprise detection** - AI evaluates every ingest for genuinely new information (+40% accuracy vs. Python NLP overlap)
- 📝 **Intelligent annotations** - AI generates meaningful cross-references with suggested edits (+60% usefulness vs. template annotations)
- 🌐 **Cross-language entity resolution** - AI understands synonyms and multilingual variants (+30% coverage)
- ⚡ **On-demand reflection** - triggers after every significant ingest (importance ≥ 5) instead of on a weekly schedule
- 🧪 **Daily AI audit** - lint Level 2 runs automatically during consolidation
- 📥 **Auto review processing** - `review/needs-ai-decision/` is processed without waiting for `!audit`

> ⚠️ **Warning:** super mode can consume your entire daily token budget in a single active session. Use it only with unlimited or high-limit AI plans.

## Smart Triggers

The system tracks an importance score for each ingested item. When the cumulative score exceeds a threshold (default mode: 25, super mode: 5), reflection runs automatically. If nothing has changed, no tokens are spent.

```text
Days without reflection:  1  2  3  4  5  6  7  8  9
Changes?                  -  -  -  -  -  -  -  -  ✓
                                                   ↑
                                          Trigger! (>7 days + changes exist)
```
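The importance-score side of this trigger reduces to a small accumulator. The sketch below is a hypothetical illustration; only the thresholds (25 for default mode, 5 for super mode) come from the documentation, and the real scheduler also tracks the days-since-reflection condition shown above.

```python
# Hypothetical sketch of the importance-score half of the smart
# reflection trigger: accumulate scores, fire at the mode threshold.
THRESHOLDS = {"default": 25, "super": 5}  # documented defaults

class ReflectionTrigger:
    def __init__(self, mode: str = "default") -> None:
        self.threshold = THRESHOLDS[mode]
        self.score = 0

    def ingest(self, importance: int) -> bool:
        """Record an ingested item; return True when reflection should run."""
        self.score += importance
        if self.score >= self.threshold:
            self.score = 0  # reset after firing
            return True
        return False  # below threshold: stay idle, spend no tokens

t = ReflectionTrigger("default")
print([t.ingest(s) for s in [5, 8, 7, 6]])  # [False, False, False, True]
```

In super mode the same accumulator fires after almost every significant ingest, which is where the much higher daily token cost comes from.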

## Instruction Modules

The knowledge base system is built from modular instruction files that the AI agent reads sequentially:

| # | Module | Purpose |
|---|--------|---------|
| 00 | Overview | Deployment map: what to read, what to copy, in what order (read first) |
| 01 | Prerequisites | Environment check: Node.js, Python, Git, indexer |
| 02 | Init | Role clarification, entity selection, folder creation |
| 03 | Pipeline | Python ingest script: conversion + NLP + source hashing |
| 04 | Review | AI review workflow for complex/ambiguous materials |
| 05 | Index | Context indexing, `[[wikilinks]]`, routing tables |
| 06 | Agents Template | `AGENTS.md` template with token budget |
| 07 | Interaction Loop | Self-learning + session capture + query writeback |
| 08 | Portable | Portability + dynamic context enrichment |
| 09 | Lint | Health checks: Level 1 (Python) + Level 2 (AI) + `--metrics` |
| 10 | Log | Append-only operation chronology |
| 11 | Provenance | Source hash, span citations, regression tests |
| 12 | NLP Preprocess | NER + keyword extraction + entity resolution |
| 13 | Autorun | File watcher, git hooks, smart scheduling |
| 14 | Initial Population | Generate role-specific `DATA_PLACEMENT_EXAMPLES.md` |
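As an illustration of what a Level 1 (Python) health check from the Lint module might look like, here is a hypothetical broken-wikilink scan. The real `kb_lint.py` covers more ground (stale pages, orphans, contradictions); this sketch only shows the broken-link case.

```python
# Hypothetical Level 1 lint pass: find [[wikilinks]] pointing to
# pages that do not exist in the knowledge base.
import re

# Capture the target part of [[target]], [[target|label]], [[target#anchor]]
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def broken_links(pages: dict[str, str]) -> list[tuple[str, str]]:
    """pages maps page name -> Markdown body; returns (page, missing target) pairs."""
    issues = []
    for name, body in pages.items():
        for target in WIKILINK.findall(body):
            if target.strip() not in pages:
                issues.append((name, target.strip()))
    return issues

kb = {
    "repomix": "See [[token-budget]] for compression settings.",
    "token-budget": "Tree-sitter compression, see [[repomix]].",
    "decisions": "We chose [[repomix]] over [[custom-indexer]].",
}
print(broken_links(kb))  # [('decisions', 'custom-indexer')]
```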

## Role Templates

Pre-configured examples in `knowledge-base/examples/`:

| Template | Role | Highlights |
|----------|------|------------|
| `programmer-senior.yml` | Senior Software Engineer | Architecture, debugging, tech stack, code principles |
| `marketing-director.yml` | Marketing Director | Strategy, brand, campaigns, audience analysis |
| `creative-hybrid.yml` | Creative Hybrid | Code + music production + indie gamedev |
| `product-manager.yml` | Product Manager | Prioritization, metrics, user research, PRDs |
| `researcher.yml` | Researcher / Analyst | Literature graph, hypotheses, methodology |
| `founder.yml` | Startup Founder | Investors, hiring, product, decision logs |
| `content-creator.yml` | Content Creator | Voice fingerprinting, audience, monetization |
| `fiction-writer.yml` | Fiction Writer | Craft theory, voice training from influences, draft critique |

Don't see your role? Tell the AI agent your profession and it will generate a custom configuration with relevant entities, knowledge paths, and example workflows.


## Supported AI Agents

Tested and designed for:

| Agent | Status | Notes |
|-------|--------|-------|
| Claude (Anthropic) | ✅ Fully supported | Cursor, API, Claude Desktop |
| GPT-4 / GPT-4o | ✅ Fully supported | Cursor, Copilot, ChatGPT |
| Codex CLI | ✅ Fully supported | OpenAI Codex |
| Gemini | ✅ Fully supported | JetBrains AI, Google AI Studio |
| Any Markdown-capable agent | ✅ Compatible | Must read `.md` files and run shell commands |

Requirements

Component Minimum Required for
Node.js 20.0+ Context indexer (Repomix)
Python 3.11+ Ingest pipeline, NLP, lint
Git any Hooks, history tracking
IDE with AI required Agent interaction

Optional system tools

# Ubuntu / Debian
sudo apt install -y pandoc poppler-utils tesseract-ocr

# macOS
brew install pandoc poppler tesseract

## Contributing

Contributions are welcome! Here's how you can help:

- 🌍 **Translations** - translate instruction modules into other languages
- 📝 **Role Templates** - add `examples/*.yml` for new professions
- 🔧 **Pipeline Scripts** - improve the Python ingest, NLP, and lint scripts
- 📖 **Documentation** - clarify instructions, add diagrams, fix typos
- 🧪 **Testing** - try it with different AI agents and report compatibility

Please open an issue before starting major work to discuss the approach.


## License

MIT - free for personal and commercial use.

---

Built for humans who talk to AI.

If this project helps you build a better knowledge workflow, ⭐ give it a star.
