eDOCr2 engineering-drawing OCR converted to CoreML for Apple Silicon.
Three CoreML ML-Program packages (FP16) targeting the Apple Neural Engine on M-series Macs, covering the full eDOCr2 OCR cascade:
| Stage | Architecture | Input | Output | .mlpackage size |
|---|---|---|---|---|
| Detector | CRAFT (VGG backbone) | (1, 1280, 1280, 3) RGB, ImageNet-normalised | (1, 640, 640, 2) region + affinity heatmap | ~40 MB |
| Recogniser | CRNN + STN | (1, 31, 200, 1) grayscale | (1, 48, 39) CTC softmax (38-char dimension alphabet + blank) | ~17 MB |
| GD&T classifier | CRNN + STN | (1, 31, 200, 1) grayscale | (1, 48, 40) CTC softmax (39-char GD&T alphabet + blank) | ~17 MB |
The "GD&T classifier" is architecturally identical to the recogniser — it uses a different alphabet and weights trained on engineering GD&T symbols (∅⌖⌒⌓⏤⏥⏊⌭⫽◎↗⌰⌯ and datum letters ⒺⒻⓁⓂⓅⓈⓉⓊ).
Post-processing (heatmap → bounding boxes, CTC decoding) runs on the
Swift side; see `test_ane.swift` for a complete loader + greedy decoder.
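For reference, the greedy CTC decode can be sketched in a few lines of Python (a minimal version, assuming the blank token occupies the last class index, `len(alphabet)`, as the converted models do):

```python
import numpy as np

def ctc_greedy_decode(probs: np.ndarray, alphabet: str) -> str:
    """Greedy CTC decode of a (timesteps, classes) softmax matrix.

    Assumes the blank token is the last class, i.e. index
    len(alphabet), matching the converted models.
    """
    blank = len(alphabet)
    best = probs.argmax(axis=-1)           # most likely class per timestep
    decoded, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:   # collapse repeats, drop blanks
            decoded.append(alphabet[int(idx)])
        prev = idx
    return "".join(decoded)

# Toy check: the path a, a, blank, b collapses to "ab".
probs = np.eye(3)[[0, 0, 2, 1]]
print(ctc_greedy_decode(probs, "ab"))      # -> ab
```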
From `test_ane.swift`, 20 iterations after a 3-iteration warm-up:
| Stage | Mean | Min | Std |
|---|---|---|---|
| detector | 102.3 ms | 97.6 ms | 3.3 ms |
| recogniser | 4.0 ms | 3.9 ms | 0.07 ms |
| gdt_classifier | 4.1 ms | 4.0 ms | 0.06 ms |
Worst-case end-to-end for a single text-bearing crop (one detector pass + ~5 recogniser calls): ~125 ms, well under the 200 ms target in the conversion plan.
The detector exceeds its original 40 ms per-stage target because
1280×1280 is a large input for a 21 M-parameter CNN on ANE. On smaller
tile sizes (e.g. 640×640) it drops to under 40 ms — if you need that,
rerun `convert.py` with `DETECTOR_H = DETECTOR_W = 640`.
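The change is just the pair of constants named above (hypothetical excerpt; the exact location inside `convert.py` may differ):

```python
# convert.py (excerpt): fixed detector input size baked into the
# converted ML Program. 1280 is the shipped default; 640 brings the
# ANE latency under the 40 ms per-stage target at reduced coverage
# per tile.
DETECTOR_H = DETECTOR_W = 640
```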
Max-abs-difference between tf.keras (FP32, CPU) and CoreML (FP16, ANE):
| Stage | Max abs diff | Mean rel diff |
|---|---|---|
| detector | 5.2e-3 | 1.6e-3 |
| recogniser | 3.1e-2 | 5.8e-3 |
| gdt_classifier | 6.9e-3 | 2.1e-3 |
Recogniser error is higher because the CTC softmax amplifies small logit differences. Greedy-decoded output on real test crops is identical between Keras and CoreML in spot-checks.
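The two metrics in the table can be reproduced with a small helper (a sketch: `ref` would be the FP32 Keras output, `test` the FP16 CoreML output, and `eps` a guard against division by zero):

```python
import numpy as np

def parity_stats(ref: np.ndarray, test: np.ndarray, eps: float = 1e-7):
    """Max absolute and mean relative difference between two model outputs."""
    diff = np.abs(ref.astype(np.float64) - test.astype(np.float64))
    max_abs = float(diff.max())
    mean_rel = float((diff / (np.abs(ref) + eps)).mean())
    return max_abs, mean_rel

# Toy usage with made-up tensors:
ref = np.array([1.0, 2.0, 4.0])
test = np.array([1.0, 2.1, 4.0])
print(parity_stats(ref, test))
```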
```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let detector = try MLModel(
    contentsOf: MLModel.compileModel(
        at: URL(fileURLWithPath: "artefacts/edocr2_detector.mlpackage")),
    configuration: config)
let recogniser = try MLModel(
    contentsOf: MLModel.compileModel(
        at: URL(fileURLWithPath: "artefacts/edocr2_recogniser.mlpackage")),
    configuration: config)

// feed: 1×1280×1280×3 float32, ImageNet-normalised
let det = try detector.prediction(from: ...)
// heatmap → bboxes in Swift/Python (see eDOCr2 upstream tools.getBoxes)

// for each detected bbox, crop to 31×200 grayscale, feed:
let rec = try recogniser.prediction(from: ...)
// rec output shape: [1, 48, 39]; run CTC greedy decode against
// the alphabet stored in mlmodel.metadata[.creatorDefinedKey]["alphabet"]
```

The full working example is in `test_ane.swift` — run it with
`swift test_ane.swift` from the repo root (macOS 15+, Xcode CLT
installed). It loads all three models on the ANE, benchmarks them,
and prints a CTC-decoded sample.
```sh
python3.12 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
python convert.py
```

This downloads upstream weights (~220 MB total) into `weights/` and
writes three `.mlpackage` bundles into `artefacts/`. End-to-end
conversion takes ~2 minutes on an M4.
See `convert.py` for the full pipeline:

- Detector — upstream `build_keras_model` (VGG backbone) loaded with `craft_mlt_25k.h5` from the `faustomorales/keras-ocr` v0.8.4 release, then monkey-patched to a fixed `(1, 1280, 1280, 3)` input so Core ML / ANE can dispatch it. Converted via `coremltools.convert(source="tensorflow", compute_precision=FLOAT16)`.
- Recogniser / GD&T classifier — upstream `Recognizer` class loaded with the matching `.keras` weight file from the eDOCr2 v1.0.0 release, using the alphabet from the sibling `.txt` file. The converted model stops at the CTC softmax layer; the `prediction_model`'s built-in `CTCDecoder` `Lambda` is dropped because `tf.keras.backend.ctc_decode` has no Core ML equivalent.
Both recognisers include a Spatial Transformer Network (STN) sub-graph,
which converts cleanly via the TensorFlow frontend — no hand-rewriting
was needed. The two `Lambda(_transform, …)` and `Lambda` (image-flip)
layers trace as raw TF ops (`matmul`, `gather`, `reshape`, `slice`,
`tile`, `add_n`, `clip_by_value`) and pass through Core ML intact.
Every model's alphabet, expected input shape, and blank index are
stashed in `mlmodel.user_defined_metadata` so the Swift side can
inspect them rather than hard-coding values. The alphabet is stored
in each `.mlpackage` under `metadata[.creatorDefinedKey]["alphabet"]`.
- Dimensions (`edocr2_recogniser`): `0123456789AaBCDRGHhMmnxZt(),.+-±:/°"⌀=` (38 chars + blank)
- GD&T (`edocr2_gdt_classifier`): `0123456789,.⌀ABCD⏤⏥○⌭⌒⌓⏊∠⫽⌯⌖◎↗⌰ⒺⒻⓁⓂⓅⓈⓉⓊ` (39 chars + blank)

Both end with a CTC blank token at index `len(alphabet)`.
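That convention can be sanity-checked in plain Python (alphabets copied from the list above; the metadata shipped inside each `.mlpackage` remains the source of truth):

```python
def blank_index(alphabet: str) -> int:
    # The CTC blank occupies the class slot one past the last real character.
    return len(alphabet)

# Copied from the alphabet list in this README.
DIM_ALPHABET = '0123456789AaBCDRGHhMmnxZt(),.+-±:/°"⌀='
GDT_ALPHABET = '0123456789,.⌀ABCD⏤⏥○⌭⌒⌓⏊∠⫽⌯⌖◎↗⌰ⒺⒻⓁⓂⓅⓈⓉⓊ'

# 38 + 1 and 39 + 1 classes match the (1, 48, 39) / (1, 48, 40) outputs.
assert blank_index(DIM_ALPHABET) == 38
assert blank_index(GDT_ALPHABET) == 39
```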
This repository converts eDOCr2 to CoreML format.
- Original work: eDOCr2 by Javier Villena Toro
- Paper: "eDOCr2: Engineering Drawing OCR" (MDPI Machines, 2025), DOI 10.2139/ssrn.5045921
- CRAFT detector weights: faustomorales/keras-ocr v0.8.4, originally clovaai/CRAFT-pytorch
- License: MIT (see `LICENSE` file)
The `.mlpackage` files in `artefacts/` are derived from the upstream
Keras weights; the conversion script (`convert.py`) and test harness
(`test_ane.swift`) are MIT-licensed.