Data Directory

Storage for application data, models, and knowledge base files.

Structure

data/
├── app_state.enc        # Encrypted application state
├── knowledge_base/      # RAG knowledge base
│   ├── documentation/   # ODCS documentation
│   ├── examples/        # Contract examples
│   ├── best_practices/  # Best practices guides
│   ├── schemas/         # Schema definitions
│   ├── templates/       # Template examples
│   ├── tutorials/       # Tutorial content
│   ├── faiss_index/     # FAISS vector index
│   ├── generated/       # Generated documentation
│   ├── documents.json   # Document metadata
│   └── index.json       # Knowledge base index
├── models/              # ML models
│   └── models--sentence-transformers--all-MiniLM-L6-v2/   # Sentence transformer embedding model
└── templates/           # Contract templates
    ├── custom/          # Custom templates
    ├── education/       # Education industry
    ├── finance/         # Finance industry
    ├── government/      # Government sector
    ├── healthcare/      # Healthcare industry
    ├── manufacturing/   # Manufacturing
    ├── retail/          # Retail industry
    └── technology/      # Technology sector
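
To confirm that a checkout matches the layout above, the expected directories can be checked with a few lines of Python. This is only a sketch: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names that are not part of the project, and the list simply mirrors the tree.

```python
from pathlib import Path
from typing import List

# Directories copied from the tree above, relative to data/.
EXPECTED_DIRS = [
    "knowledge_base/documentation",
    "knowledge_base/examples",
    "knowledge_base/best_practices",
    "knowledge_base/schemas",
    "knowledge_base/templates",
    "knowledge_base/tutorials",
    "knowledge_base/faiss_index",
    "knowledge_base/generated",
    "models",
    "templates/custom",
    "templates/education",
    "templates/finance",
    "templates/government",
    "templates/healthcare",
    "templates/manufacturing",
    "templates/retail",
    "templates/technology",
]


def missing_dirs(data_root: str = "data") -> List[str]:
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]
```

An empty return value means every directory in the tree is present. Note that `models/` may legitimately be empty until the embedding model has been downloaded.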

Knowledge Base

The knowledge base contains:

  • Documentation: ODCS (Open Data Contract Standard) field descriptions and usage guides
  • Examples: Complete contract examples for reference
  • Best Practices: Validation rules and recommendations
  • Schemas: JSON schema definitions
  • Templates: Industry-specific templates

FAISS Index

Vector search index for semantic retrieval:

  • faiss.index - FAISS index file
  • metadata.pkl - Document metadata
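
For debugging, the two files can also be read directly, bypassing the project's own wrapper. The sketch below assumes the index was written with `faiss.write_index` and that `metadata.pkl` is an ordinary pickle; neither is confirmed here, and `load_metadata` and `load_index` are hypothetical helper names.

```python
import pickle
from pathlib import Path
from typing import Any

DEFAULT_INDEX_DIR = "data/knowledge_base/faiss_index"


def load_metadata(index_dir: str = DEFAULT_INDEX_DIR) -> Any:
    """Unpickle metadata.pkl. Only unpickle files from a trusted source."""
    with open(Path(index_dir) / "metadata.pkl", "rb") as fh:
        return pickle.load(fh)


def load_index(index_dir: str = DEFAULT_INDEX_DIR):
    """Read faiss.index with the faiss library (imported lazily)."""
    import faiss  # requires the faiss-cpu or faiss-gpu package

    return faiss.read_index(str(Path(index_dir) / "faiss.index"))
```

The structure of the unpickled metadata object depends on how the build script wrote it, so inspect it before relying on any particular shape.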

Generated Files

Auto-generated documentation:

  • field_documentation.json - Field-level docs
  • searchable_documents.json - Indexed documents
  • faiss_validation.json - Index validation results

Models

Sentence Transformers

Pre-trained embedding model for semantic search:

  • Model: all-MiniLM-L6-v2
  • Dimension: 384
  • Used for: Document embedding and retrieval

Downloaded automatically on first use.
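
As an illustration of how the model is typically used, the sketch below encodes text with the `sentence-transformers` package. Pointing `cache_folder` at `data/models` is an assumption inferred from the directory name in the tree, and `embed` and `has_expected_dim` are hypothetical helpers rather than project code.

```python
from typing import Sequence

EMBEDDING_DIM = 384  # dimension stated above for all-MiniLM-L6-v2


def embed(texts: Sequence[str], cache_folder: str = "data/models"):
    """Encode texts with all-MiniLM-L6-v2, downloading the model if it is not cached."""
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(
        "sentence-transformers/all-MiniLM-L6-v2", cache_folder=cache_folder
    )
    # encode() returns one row per input text, each with 384 values.
    return model.encode(list(texts))


def has_expected_dim(vector: Sequence[float]) -> bool:
    """Check that a vector has the dimension the FAISS index expects."""
    return len(vector) == EMBEDDING_DIM
```

Checking the dimension before a search catches the common mistake of querying the index with embeddings from a different model.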

Templates

Contract templates organized by industry:

  • Custom: User-defined templates
  • Education: Educational data contracts
  • Finance: Financial data contracts
  • Government: Government sector contracts
  • Healthcare: Healthcare data contracts
  • Manufacturing: Manufacturing data contracts
  • Retail: Retail data contracts
  • Technology: Technology sector contracts

Each template includes:

  • Metadata section
  • Scheduling configuration
  • Technical ingestion setup
  • Functional ingestion setup
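
A template can be checked for these four sections once it is loaded into a dictionary. The key names below are hypothetical, since the exact field names used in the template files are not given here; adjust `REQUIRED_SECTIONS` to match the real files.

```python
from typing import Any, Dict, List

# Hypothetical key names standing in for the four sections listed above.
REQUIRED_SECTIONS = (
    "metadata",
    "scheduling",
    "technical_ingestion",
    "functional_ingestion",
)


def missing_sections(template: Dict[str, Any]) -> List[str]:
    """Return the required section names that are absent from a template."""
    return [key for key in REQUIRED_SECTIONS if key not in template]
```

For example, `missing_sections({"metadata": {}, "scheduling": {}})` reports the two ingestion sections as missing.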

Application State

app_state.enc stores encrypted application state:

  • User preferences
  • Session information
  • Cached data

Usage

Loading Knowledge Base

from backend.rag.knowledge_base import KnowledgeBase

kb = KnowledgeBase(knowledge_base_path="data/knowledge_base")
results = kb.search("contract metadata", top_k=5)

Accessing Templates

from backend.storage.template_storage import TemplateStorage

storage = TemplateStorage(base_path="data/templates")
templates = storage.list_templates(category="finance")

Vector Search

from backend.rag.faiss_store import FAISSVectorStore

store = FAISSVectorStore(index_path="data/knowledge_base/faiss_index")
# query_embedding must be a 384-dimensional vector produced by all-MiniLM-L6-v2
results = store.search(query_embedding, top_k=10)

Maintenance

Rebuilding FAISS Index

python backend/scripts/build_faiss_index.py

Regenerating Documentation

python backend/scripts/generate_field_docs.py

Updating Knowledge Base

python backend/scripts/generate_rag_knowledge.py
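
When all three steps are needed, they can be chained from one Python entry point. The sketch below is hedged on one point in particular: the order (field docs, then knowledge base, then index) is an assumption, since the required order is not stated here. `maintenance_commands` and `run_all` are hypothetical names.

```python
import subprocess
import sys
from typing import Callable, List

# Assumed order: regenerate inputs first, rebuild the vector index last.
SCRIPTS = [
    "backend/scripts/generate_field_docs.py",
    "backend/scripts/generate_rag_knowledge.py",
    "backend/scripts/build_faiss_index.py",
]


def maintenance_commands(python: str = sys.executable) -> List[List[str]]:
    """Build one argv list per maintenance script."""
    return [[python, script] for script in SCRIPTS]


def run_all(runner: Callable[..., object] = subprocess.run) -> None:
    """Run each script in order; check=True stops at the first failure."""
    for cmd in maintenance_commands():
        runner(cmd, check=True)
```

Call `run_all()` from the repository root so that the relative script paths resolve.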

Storage

  • Local Development: Files are stored in the data/ directory
  • Production: Files are stored in an S3 bucket (configured via the environment)
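
One way such a switch can work is to resolve the storage location from environment variables. The variable names below (`DATA_S3_BUCKET`, `DATA_S3_PREFIX`, `DATA_DIR`) are hypothetical placeholders; the names the application actually reads are not documented here.

```python
import os
from typing import Mapping, Optional


def resolve_storage(env: Optional[Mapping[str, str]] = None) -> str:
    """Return an s3:// URI when a bucket is configured, else a local directory."""
    env = os.environ if env is None else env
    bucket = env.get("DATA_S3_BUCKET")  # hypothetical variable name
    if bucket:
        prefix = env.get("DATA_S3_PREFIX", "data").strip("/")  # hypothetical
        return f"s3://{bucket}/{prefix}"
    return env.get("DATA_DIR", "data")  # hypothetical; defaults to data/
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.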

Backup

Important files to back up:

  • knowledge_base/faiss_index/ - Vector index
  • templates/ - Custom templates
  • app_state.enc - Application state
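
The three paths above can be bundled into a single archive with the standard library. This is a minimal sketch for local development, not the project's backup procedure; `BACKUP_PATHS` and `create_backup` are hypothetical names.

```python
import tarfile
from pathlib import Path
from typing import List

# Paths from the list above, relative to data/.
BACKUP_PATHS = ["knowledge_base/faiss_index", "templates", "app_state.enc"]


def create_backup(data_root: str, archive_path: str) -> List[str]:
    """Write a gzip-compressed tarball of the backup paths; return those included."""
    root = Path(data_root)
    included = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for rel in BACKUP_PATHS:
            src = root / rel
            if src.exists():
                # Directories are added recursively under their relative name.
                tar.add(str(src), arcname=rel)
                included.append(rel)
    return included
```

Write the archive outside `data/` so that a later backup does not pick up an earlier one.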

Size Considerations

  • FAISS index: ~50-100 MB
  • Models: ~100-200 MB
  • Templates: ~10-20 MB
  • Total: ~200-300 MB
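
These figures are estimates, and actual usage can be measured on disk. The helper below is a small sketch (`dir_size_mb` is a hypothetical name) that sums file sizes under a path, using 1 MB = 1024 × 1024 bytes.

```python
from pathlib import Path


def dir_size_mb(path: str) -> float:
    """Total size in MB of a file, or of all regular files under a directory."""
    root = Path(path)
    if root.is_file():
        return root.stat().st_size / (1024 * 1024)
    total = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return total / (1024 * 1024)
```

For example, `dir_size_mb("data/knowledge_base/faiss_index")` gives the current size of the vector index for comparison with the estimate above.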

Performance

  • Vector search: <100ms for top-10 results
  • Template loading: <50ms
  • Knowledge base initialization: <1s
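
To check these targets on a given machine, a call can be wrapped in a simple timer. The sketch below uses only the standard library; `timed_ms` is a hypothetical helper and measures a single call, so repeat the measurement several times before drawing conclusions.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed_ms(fn: Callable[[], T]) -> Tuple[T, float]:
    """Call fn once and return its result with the elapsed time in milliseconds."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

For instance, `results, ms = timed_ms(lambda: store.search(query_embedding, top_k=10))` yields a value to compare against the 100 ms target for vector search.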

Documentation