Headless REST API for CSI MasterFormat specification document automation with round-trip DOCX support.
SpecR treats construction specification documents as structured data with true parent/child paragraph relationships — not opaque Word files. It parses DOCX and UFGS .SEC specifications into a canonical CSI AST, stores them in PostgreSQL, and will regenerate them with full numbering fidelity. It targets git-style 3-way merge when edited documents come back from reviewers.
The target: in a Web UI, a spec writer connects a Revit model, sees Part 2 (Products) sections auto-populate from equipment families, exports clean DOCX files, receives a redlined version from the Owner, and merges accepted changes back into the database. No manual transcription is involved, and the writer keeps full control through manual, bi-directional editing of paragraph language in the database.
Active development — Phase 2b complete, Phase 2c next.
| Phase | Description | Status |
|---|---|---|
| 0 | Foundation — scaffolding, DB schema, seed data, CRUD API, CI | ✅ Complete |
| 1a | UFGS .SEC parser + cross-reference model | ✅ Complete |
| 1b | Project + TOC management API | ✅ Complete |
| 1c-i | DOCX numbering.xml + styles.xml analyzers (Clippit-ported) | ✅ Complete (PR #17) |
| 1c-ii | 5-signal hierarchy inference engine + POST /parse async endpoint | ✅ Complete (PR #21) |
| 2a | MCP server (Streamable HTTP, read-only tools + resources) + Markdown renderer | ✅ Complete (PR #24) |
| 2b-i | AST → DOCX generator + 7-level CSI multilevel numbering | ✅ Complete (PR #26) |
| 2b-ii | w:sdt content control UUID injection (round-trip anchors) | ✅ Complete (PR #51) |
| 2b-iii | MCP tools: get_paragraph, parse_document, generate_docx | ✅ Complete (PR #55) |
| 2b-iv | Universal file loader: load:files, seed:corpus, load_files MCP tool | ✅ Complete (PR #58) |
| 1c-iii | Plaintext .txt parser: 4-signal hierarchy inference, read-only ingest | ✅ Complete (PR #66) |
| 2c | Firm style template engine (issue #20) | Planned |
| 3 | Round-trip merge engine | Planned |
| 4 | Revit integration | Planned |
| 5 | Web UI | Planned |
See ARCHITECTURE.md for the full specification and docs/research-executive-summary.md for the landscape analysis.
- UFGS `.SEC` parser: SpecsIntact XML → canonical `CsiTree` with `CsiNode` hierarchy. Extracts `<PRT>`/`<SPT>`/`<TXT>` elements into Part → Article → PR1–PR5 levels. Parses cross-references between sections at ingest time.
- Encoding-transparent ingest: `.sec` files are decoded via chardet + iconv-lite before parsing. `windows-1252`, `latin-1`, UTF-8, and ~100 other encodings are detected and transcoded automatically. No manual encoding flag needed. (`.docx` is a binary ZIP, so encoding is not a concern.)
- DOCX `numbering.xml` analyzer: builds the complete `abstractNum → num → paragraph style` linkage map. Handles `basedOn` inheritance chains, `lvlOverride` overrides, and the Clippit `ListItemRetriever` sentinel: `numId=0` as explicit numbering suppression (halts `basedOn` traversal rather than inheriting parent numbering). This correctly handles CPI continuation styles (`PR1lc`–`PR5lc`), which represent roughly one-third of document content in CPI samples.
- DOCX `styles.xml` analyzer: resolves full `basedOn` chains, identifies `numPr`-carrying styles, and propagates `suppressesNumbering` through style inheritance. Produces the style map consumed by the inference engine.
- DOCX `word/document.xml` extractor: walks the paragraph sequence via JSZip + fast-xml-parser and extracts text (multi-run concat), styleId, numId/ilvl, left indent, outlineLvl, and vanish flag. Merges style-inherited numPr when a paragraph has no own `w:numPr`.
- 5-signal hierarchy inference engine: a two-pass pipeline. Pass 1 classifies each paragraph using a priority chain (numbering XML > style chain > text regex > indentation, with document order as the continuation fallback), logging signal conflicts into `meta.conflicts` for MCP surfacing. Pass 2 builds the parent/child tree using a stack algorithm (handles ilvl gaps, jumps, continuation paragraphs, and hidden note nodes). The source template (`arcat`/`cpi`/`unknown`) is auto-detected from style names and numbering.xml heuristics.
- Extraction rules as typed data constants: numbering, style, and signal rules are defined as MCP-readable data structures, not code, enabling LLM agent exploration and parse explainability.
- Plaintext `.txt` parser: infers CSI hierarchy from text signals (`PART N` headings, `N.N` article numbers, `A.`/`1.`/`a.`/`1)`/`a)` prefix patterns), with leading-whitespace indentation depth as a fallback. Section and title are extracted from the `SECTION XX XX XX` header line (scans the first 10 non-blank lines), falling back to `inferSectionMeta`. Read-only: no round-trip merge anchors. `POST /parse` accepts `.txt` uploads; the `load_files` MCP tool and `pnpm load:files` CLI accept `**/*.txt` globs; the `parse_document` MCP tool accepts `.txt` filenames. The parse job result and MCP response include `capabilities: ["read-only"]`.
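The `numId=0` sentinel handling described for the numbering analyzer above can be sketched as follows. This is a minimal illustration, not SpecR's actual code: the `StyleDef` shape, the function name `resolveStyleNumbering`, and the sample style ids other than `PR1lc` are assumptions made for the example.

```typescript
// Hypothetical shapes: illustrative only, not SpecR's real types.
interface StyleDef {
  styleId: string;
  basedOn?: string; // parent style id, if any
  numPr?: { numId: number; ilvl: number }; // numbering declared on this style
}

interface ResolvedNumbering {
  numId: number;
  ilvl: number;
}

/**
 * Walk the basedOn chain until a style declares numPr.
 * numId === 0 is treated as explicit suppression: traversal stops and
 * null is returned instead of inheriting the parent's numbering.
 */
function resolveStyleNumbering(
  styleId: string,
  styles: Map<string, StyleDef>,
): ResolvedNumbering | null {
  const seen = new Set<string>(); // guards against cyclic basedOn chains
  let current = styles.get(styleId);
  while (current !== undefined && !seen.has(current.styleId)) {
    seen.add(current.styleId);
    if (current.numPr !== undefined) {
      if (current.numPr.numId === 0) return null; // suppressed: do not inherit
      return { numId: current.numPr.numId, ilvl: current.numPr.ilvl };
    }
    current = current.basedOn !== undefined ? styles.get(current.basedOn) : undefined;
  }
  return null; // no style in the chain carries numbering
}

// Sample data: a numbered style, a continuation style that suppresses
// numbering, and a plain derived style that inherits it.
const styles = new Map<string, StyleDef>([
  ["PR1", { styleId: "PR1", numPr: { numId: 3, ilvl: 2 } }],
  ["PR1lc", { styleId: "PR1lc", basedOn: "PR1", numPr: { numId: 0, ilvl: 0 } }],
  ["PR1Derived", { styleId: "PR1Derived", basedOn: "PR1" }],
]);

const inherited = resolveStyleNumbering("PR1Derived", styles);
const suppressed = resolveStyleNumbering("PR1lc", styles);
```

Here `inherited` resolves to the parent's numbering while `suppressed` is `null`, which is what lets a continuation paragraph attach to its owner instead of being counted as a new numbered item.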
- `POST /specs/:id/generate` → streams a DOCX buffer with 7-level CSI multilevel numbering
- Each paragraph is wrapped in a `w:sdt` content control with a `specr-uuid-<id>` UUID tag. These are the round-trip merge anchors per ADR-004. The Phase 3 merge engine reads these tags to map owner-redlined paragraphs back to `paragraphs.id`.
- The title paragraph is intentionally bare (synthetic, no DB id); the Phase 3 merge skips unwrapped paragraphs.
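To make the anchor idea concrete, the sketch below builds a `w:sdt` wrapper around one paragraph and then recovers the ids from generated XML. It is string-based for brevity and is an assumption about shape only: the helper names are hypothetical, the real generator uses a DOCX library, and real documents carry more properties inside `w:sdtPr` and `w:p`.

```typescript
const TAG_PREFIX = "specr-uuid-";

// Minimal XML escaping for text placed inside <w:t>.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Wrap a single paragraph in a content control whose tag carries the DB id.
// Assumes paragraphId is a UUID, so it needs no attribute escaping.
function wrapParagraph(paragraphId: string, text: string): string {
  return (
    "<w:sdt>" +
    `<w:sdtPr><w:tag w:val="${TAG_PREFIX}${paragraphId}"/></w:sdtPr>` +
    "<w:sdtContent><w:p><w:r>" +
    `<w:t xml:space="preserve">${escapeXml(text)}</w:t>` +
    "</w:r></w:p></w:sdtContent>" +
    "</w:sdt>"
  );
}

// Recover paragraph ids from document XML, in document order.
// Paragraphs without a wrapper (such as the title) are simply not reported.
function extractParagraphIds(documentXml: string): string[] {
  const pattern = /<w:tag w:val="specr-uuid-([^"]+)"\s*\/>/g;
  const ids: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(documentXml)) !== null) {
    const id = match[1];
    if (id !== undefined) ids.push(id);
  }
  return ids;
}

const sampleId = "1b4e28ba-2fa1-11d2-883f-0016d3cca427";
const sampleXml =
  "<w:p><w:r><w:t>SECTION TITLE</w:t></w:r></w:p>" + // bare title paragraph
  wrapParagraph(sampleId, "Coordinate the work of all trades & suppliers.");
const recoveredIds = extractParagraphIds(sampleXml);
```

A production reader would parse the XML properly rather than use a regular expression; the point is only that the tag value survives editing in Word and leads back to a database row.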
- `GET /health`: liveness check
- `POST /parse`: upload a `.docx` or `.sec` file; returns `202 { jobId }` immediately (async)
- `GET /parse/jobs/:jobId`: poll parse progress: `{ status, progress: { stage, pct }, result?, error? }`
- `GET /specs/:id`: retrieve a spec with its paragraph tree
- `POST /specs/:id/generate`: generate DOCX from stored spec AST
- `PATCH /specs/:id`: update spec metadata
- `POST /projects`: create a project
- `GET /projects/:id`: retrieve project with TOC
- `POST /projects/:id/specs`: add a spec section to a project TOC
- `DELETE /projects/:id/specs/:specId`: remove a section; cascades dangling cross-references
- `GET /projects/:id/references/broken`: surface broken cross-references for spec writer review
The async POST /parse pattern (202 + poll) is intentional — inference over large DOCX files takes measurable time, and the job endpoint is designed for Phase 5 Web UI progress bars without further backend changes.
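A client-side polling loop for the 202 + poll pattern might look like the sketch below. The job shape follows the endpoint description above, but the literal status values (`"completed"`, `"failed"`), the function name, and the injected `fetchJob` transport are assumptions made for illustration.

```typescript
// Job shape from the endpoint description; the status strings checked
// below are assumed, not confirmed by the API.
interface ParseJob {
  status: string;
  progress?: { stage: string; pct: number };
  result?: unknown;
  error?: string;
}

// Transport is injected so the loop can be exercised without a server.
type FetchJob = (jobId: string) => Promise<ParseJob>;

async function pollParseJob(
  jobId: string,
  fetchJob: FetchJob,
  intervalMs = 500,
  maxAttempts = 120,
): Promise<ParseJob> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchJob(jobId);
    // Stop on a terminal state; otherwise wait and poll again.
    if (job.status === "completed" || job.status === "failed") return job;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Parse job ${jobId} did not finish after ${maxAttempts} polls`);
}
```

In practice `fetchJob` would issue `GET /parse/jobs/:jobId` and return the parsed JSON body; a UI could read `job.progress` between polls to drive a progress bar.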
- `POST /mcp`: MCP JSON-RPC endpoint (Streamable HTTP, stateless, integrated into Express)
- Tool `search_library(query, division?, limit?)`: ILIKE paragraph search with optional CSI division filter. Returns `{ paragraphId, text, nodeType, specId, specSection, specTitle }[]`
- Tool `get_spec(specId)`: full spec tree + cross-reference resolution. Returns `{ tree: CsiTree, references: SpecReference[] }` where each reference has `isResolved: boolean` (whether the target spec is loaded in the DB)
- Tool `list_sections(division?)`: CSI MasterFormat section index with `inDatabase` flag
- Tool `get_paragraph(paragraphId)`: returns `{ node, ancestors }` for a single paragraph. `node` and each ancestor are `{ id, nodeType, text, vanish }`. Ancestors are ordered root → immediate parent.
- Tool `parse_document(filename, contentBase64)`: base64-decode a DOCX or SEC file, parse it, insert it into the database, and return `{ specId, section, title, nodeCount }`. Max 10 MB decoded. Encoding-transparent for `.sec` files.
- Tool `generate_docx(specId)`: generate DOCX from a stored spec, returned as base64 in `{ specId, section, title, sizeBytes, contentBase64 }`. Each paragraph is wrapped in a `w:sdt` UUID content control. Generated on demand from current DB state, not cached.
- Tool `load_files(glob?, paths?, dry_run?)`: bulk-load specs from a glob pattern or file path list. Accepts `.SEC` and `.docx` formats. Returns `{ total, succeeded, failed, errors[] }`. Idempotent: re-loading an existing spec updates it.
- Resource `specr://specs/{id}`: full spec as LLM-readable Markdown. Note/vanish nodes are rendered as `> **[NOTE]**` blockquotes (editor instructions visible to the spec writer, hidden from published output)
- Resource `specr://sections`: full CSI section index as a Markdown table with loaded (✓) flag
Configure in Claude Code via .mcp.json in the repo root (points to http://localhost:3000/mcp when pnpm dev is running).
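For clients other than Claude Code, a tool invocation is a JSON-RPC `tools/call` request posted to `/mcp`. The sketch below only builds the request body and headers. The helper name and the argument values (including the `"07"` division format) are illustrative assumptions, and whether the server expects an `initialize` handshake first depends on how it is configured.

```typescript
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Build a JSON-RPC 2.0 request that invokes one MCP tool.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Streamable HTTP clients accept either a JSON body or an SSE stream.
const MCP_HEADERS = {
  "Content-Type": "application/json",
  Accept: "application/json, text/event-stream",
};

// Example: search the library (argument values are made up for illustration).
const searchRequest = buildToolCall(1, "search_library", {
  query: "firestopping",
  division: "07",
  limit: 5,
});
const searchBody = JSON.stringify(searchRequest);
```

The serialized `searchBody` and `MCP_HEADERS` would then be sent with `POST http://localhost:3000/mcp` while `pnpm dev` is running.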
- PostgreSQL schema: `specs`, `paragraphs` (recursive parent/child), `versions`, `projects`, `project_specs`, `spec_references`
- 31 CSI MasterFormat divisions seeded from UFGS corpus as reference data
- Migration runner with reversible up/down migrations
- Style template engine — firm-specific fonts, spacing, numbering formats (Phase 2c, issue #20)
- Round-trip merge engine (Phase 3)
- Revit integration (Phase 4)
- Web UI with progress bars, live preview, diff/merge review (Phase 5)
- DOCX cross-reference extraction (Phase 1c-iii)
- Security hardening: concurrency cap on parse workers (piscina) — follow-up to issue #22
- MCP write tools (`add_paragraph`, `update_paragraph`, etc.): Phase 5
- MCP stateful sessions: Phase 5 upgrade
- MCP prompts (`review_spec`, `suggest_paragraphs`): Phase 6
DOCX files store paragraphs flat — parent/child hierarchy must be inferred. No single signal is reliable across all firms and authoring conventions. The inference engine combines five signals in a priority chain:
| Signal | Source | Reliability |
|---|---|---|
| 1. Numbering XML | `numbering.xml` abstractNum→num→pStyle map | Highest (what Word actually respects) |
| 2. Style chain | `styles.xml` basedOn traversal + numPr identification | High for clean documents |
| 3. Document order | Continuation fallback when no other signal fires | Always present |
| 4. Text content | Anchored regex for leading patterns (`^A\.\s`, `^1\.\s`, `^PART\s+\d+`) | Medium; anchoring guards against mid-word false positives |
| 5. Indentation | Left indent ÷ 576 twips ≈ CSI hierarchy level | Low-confidence fallback |
Signals that disagree with the winner are recorded in meta.conflicts per node — available for MCP surfacing and future confidence scoring. Built as a TypeScript port of Clippit's ListItemRetriever (C#, MIT), extended with signals 4 and 5 for real-world messy documents.
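The winner-plus-conflicts behaviour can be sketched as below. This is one reading of the description, not the engine's code: it assumes numbering, style, text, and indentation are ranked in that order and that document order only decides when none of them fires. All type and function names are hypothetical.

```typescript
type NodeType =
  | "part" | "article" | "pr1" | "pr2" | "pr3" | "pr4" | "pr5" | "continuation";
type SignalName = "numbering" | "style" | "order" | "text" | "indent";

interface SignalVote {
  signal: SignalName;
  nodeType: NodeType;
}
interface Conflict {
  signal: SignalName;
  proposed: NodeType;
}
interface Classification {
  nodeType: NodeType;
  decidedBy: SignalName;
  conflicts: Conflict[];
}

// Assumed ranking of the signals that can actively vote.
const PRIORITY: SignalName[] = ["numbering", "style", "text", "indent"];

function classify(votes: SignalVote[]): Classification {
  // Highest-priority vote first.
  const ranked = votes
    .slice()
    .sort((a, b) => PRIORITY.indexOf(a.signal) - PRIORITY.indexOf(b.signal));
  const winner = ranked[0];
  if (winner === undefined) {
    // No signal fired: fall back to document order (continuation).
    return { nodeType: "continuation", decidedBy: "order", conflicts: [] };
  }
  // Every losing vote that disagrees with the winner is recorded.
  const conflicts = ranked
    .slice(1)
    .filter((v) => v.nodeType !== winner.nodeType)
    .map((v) => ({ signal: v.signal, proposed: v.nodeType }));
  return { nodeType: winner.nodeType, decidedBy: winner.signal, conflicts };
}

const decision = classify([
  { signal: "indent", nodeType: "pr2" },
  { signal: "numbering", nodeType: "pr1" },
  { signal: "text", nodeType: "pr1" },
]);
```

In this example `decision` picks `pr1` from the numbering signal and records the indentation vote for `pr2` as the only conflict, which is the kind of entry that would end up in `meta.conflicts`.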
    pnpm tsx scripts/parse-debug.ts <file.docx>

Parses a DOCX file locally (no server, no DB) and prints the inferred hierarchy with signal attribution:

    Parsed: unknown — unknown
    Source: arcat
    Nodes: 57
    GENERAL [part, src:arcat]
      SECTION INCLUDES [article, src:arcat]
        Project Identification: ((Name and location)). [pr1, src:arcat]
          Existing site conditions and restrictions: (()) [pr2, src:arcat]
        Coordination: [pr1, src:arcat]
          Coordinate the work of all trades. [continuation, src:arcat]
Note: section and title show as unknown when docProps/core.xml is absent from the file — common in vendor-generated ARCAT specs. The Source: field and node type inference are unaffected.
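The nesting in that output comes from turning a flat, classified paragraph list into a tree. A minimal stack-based sketch of that step is shown below; it assumes each paragraph already carries a numeric depth (or `null` for a continuation), uses hypothetical type names, and leaves out hidden note nodes.

```typescript
interface FlatParagraph {
  text: string;
  level: number | null; // null marks a continuation paragraph
}
interface TreeNode {
  text: string;
  level: number;
  children: TreeNode[];
}

function buildTree(paragraphs: FlatParagraph[]): TreeNode[] {
  const roots: TreeNode[] = [];
  const stack: TreeNode[] = []; // current ancestor path, shallowest first

  for (const p of paragraphs) {
    let top = stack[stack.length - 1];

    if (p.level === null) {
      // Continuation: attach to the most recent node, never becomes an ancestor.
      const node: TreeNode = {
        text: p.text,
        level: top !== undefined ? top.level + 1 : 0,
        children: [],
      };
      (top !== undefined ? top.children : roots).push(node);
      continue;
    }

    // Pop until the top of the stack is strictly shallower than this paragraph.
    // Level gaps (e.g. 0 followed by 3) attach to the nearest shallower node.
    while (top !== undefined && top.level >= p.level) {
      stack.pop();
      top = stack[stack.length - 1];
    }

    const node: TreeNode = { text: p.text, level: p.level, children: [] };
    (top !== undefined ? top.children : roots).push(node);
    stack.push(node);
  }
  return roots;
}

// The sample hierarchy from the debug output, with assumed depths.
const sampleTree = buildTree([
  { text: "GENERAL", level: 0 },
  { text: "SECTION INCLUDES", level: 1 },
  { text: "Project Identification: ((Name and location)).", level: 2 },
  { text: "Existing site conditions and restrictions: (())", level: 3 },
  { text: "Coordination:", level: 2 },
  { text: "Coordinate the work of all trades.", level: null },
]);
```

Because continuations are attached without being pushed, the next numbered paragraph at the same depth becomes a sibling of the continuation's owner rather than its child.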
| Level | CSI Role | Format |
|---|---|---|
| Part | Part heading | PART 1 - GENERAL |
| Article | Section heading | 1.1 REFERENCES |
| PR1 | First tier | A. text |
| PR2 | Second tier | 1. text |
| PR3 | Third tier | a. text |
| PR4 | Fourth tier | 1) text |
| PR5 | Fifth tier | a) text |
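As a worked illustration of the table above, the sketch below formats one line per level. The function names are hypothetical, letters are limited to 26 items per level for simplicity, and the real generator relies on Word's multilevel numbering definitions rather than literal text.

```typescript
type CsiLevel = "part" | "article" | "pr1" | "pr2" | "pr3" | "pr4" | "pr5";

// Map 1..26 to A..Z or a..z; larger lists are out of scope for this sketch.
function toLetter(n: number, upper: boolean): string {
  if (!Number.isInteger(n) || n < 1 || n > 26) {
    throw new RangeError(`letter index out of range: ${n}`);
  }
  return String.fromCharCode((upper ? 65 : 97) + (n - 1));
}

/**
 * Format a paragraph the way the table describes each level.
 * `index` is the 1-based position among siblings; `partNumber` is only
 * used for article numbers such as 1.1 or 2.3.
 */
function formatLine(
  level: CsiLevel,
  index: number,
  text: string,
  partNumber = 1,
): string {
  switch (level) {
    case "part":
      return `PART ${index} - ${text}`;
    case "article":
      return `${partNumber}.${index} ${text}`;
    case "pr1":
      return `${toLetter(index, true)}. ${text}`;
    case "pr2":
      return `${index}. ${text}`;
    case "pr3":
      return `${toLetter(index, false)}. ${text}`;
    case "pr4":
      return `${index}) ${text}`;
    case "pr5":
      return `${toLetter(index, false)}) ${text}`;
  }
}

const partLine = formatLine("part", 1, "GENERAL");
const articleLine = formatLine("article", 1, "REFERENCES", 1);
```

With these inputs `partLine` and `articleLine` reproduce the first two example rows of the table.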
| Component | Technology |
|---|---|
| Language | TypeScript (strict mode) |
| Runtime | Node.js 22 LTS |
| API framework | Express |
| Database | PostgreSQL (recursive CTEs, JSONB) |
| Input validation | Zod |
| DOCX generation | dolanmiu/docx (Phase 2b) |
| DOCX parsing | JSZip + raw OOXML (no TS library does style inheritance) |
| SEC parsing | fast-xml-parser |
| MCP server | @modelcontextprotocol/sdk (Streamable HTTP, stateless) |
| Logging | pino |
    pnpm install
    # Requires PostgreSQL; start via Docker:
    docker compose up -d postgres
    pnpm dev                # Development server (hot reload)
    pnpm test               # Unit tests (no DB required)
    pnpm test:integration   # Integration tests (requires PostgreSQL)
    pnpm lint               # ESLint + tsc --noEmit
    pnpm format             # Prettier write
    pnpm migrate            # Run pending DB migrations

| Script | Description |
|---|---|
| `pnpm load:files <glob>` | Bulk-load spec files matching a glob pattern (`.SEC`, `.docx`) into the library |
| `pnpm seed:corpus` | Load all 665 UFGS `.SEC` files into the library; idempotent, safe to re-run |
- `docs/references/UFGS/`: Unified Facilities Guide Specifications (665 `.SEC` files, public domain)
- `docs/references/ARCAT/README.md`: download instructions for ARCAT guide specs (copyrighted, not included)
- `docs/references/MANUFACTURER_CPI/README.md`: download instructions for Chatsworth Products Inc. (CPI) telecom equipment manufacturer specs (copyrighted, not included)