
# coreml-edocr2

eDOCr2 engineering-drawing OCR converted to CoreML for Apple Silicon.

Three CoreML ML-Program packages (FP16) targeting the Apple Neural Engine on M-series Macs, covering the full eDOCr2 OCR cascade:

| Stage | Architecture | Input | Output | `.mlpackage` size |
|---|---|---|---|---|
| Detector | CRAFT (VGG backbone) | (1, 1280, 1280, 3) RGB, ImageNet-normalised | (1, 640, 640, 2) region + affinity heatmaps | ~40 MB |
| Recogniser | CRNN + STN | (1, 31, 200, 1) grayscale | (1, 48, 39) CTC softmax (38-char dimension alphabet + blank) | ~17 MB |
| GD&T classifier | CRNN + STN | (1, 31, 200, 1) grayscale | (1, 48, 40) CTC softmax (39-char GD&T alphabet + blank) | ~17 MB |

The "GD&T classifier" is architecturally identical to the recogniser — it uses a different alphabet and weights trained on engineering GD&T symbols (∅⌖⌒⌓⏤⏥⏊⌭⫽◎↗⌰⌯ and datum letters ⒺⒻⓁⓂⓅⓈⓉⓊ).

Post-processing (heatmap → bounding boxes, CTC decoding) runs on the Swift side; see `test_ane.swift` for a complete loader + greedy decoder.
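The greedy CTC decode in `test_ane.swift` follows the standard collapse-repeats-then-drop-blanks rule, with the blank at index `len(alphabet)` as described below. A Python equivalent (a sketch, not the repo's code) looks like this:

```python
import numpy as np

def ctc_greedy_decode(logits: np.ndarray, alphabet: str) -> str:
    """Greedy CTC decode of a (T, C) softmax/logit matrix.

    C == len(alphabet) + 1; the blank token sits at index len(alphabet),
    matching the converted eDOCr2 recognisers (e.g. T=48, C=39).
    """
    blank = len(alphabet)
    best = logits.argmax(axis=-1)          # best class per timestep
    decoded = []
    prev = blank
    for idx in best:
        if idx != prev and idx != blank:   # collapse repeats, drop blanks
            decoded.append(alphabet[idx])
        prev = idx
    return "".join(decoded)
```

For the recogniser output of shape (1, 48, 39) you would call `ctc_greedy_decode(out[0], alphabet)`.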

## Measured latency (Apple M4, `.cpuAndNeuralEngine`)

From `test_ane.swift`, 20 iterations after a 3-iteration warm-up:

| Stage | Mean | Min | Std |
|---|---|---|---|
| detector | 102.3 ms | 97.6 ms | 3.3 ms |
| recogniser | 4.0 ms | 3.9 ms | 0.07 ms |
| gdt_classifier | 4.1 ms | 4.0 ms | 0.06 ms |

Worst-case end-to-end for a single text-bearing crop (one detector pass + ~5 recogniser calls): ~125 ms, well under the 200 ms target in the conversion plan.

The detector exceeds its original 40 ms per-stage target because 1280×1280 is a large input for a 21M-parameter CNN on the ANE. On smaller tile sizes (e.g. 640×640) it drops to under 40 ms — if you need that, rerun `convert.py` with `DETECTOR_H = DETECTOR_W = 640`.

## Parity vs Keras (FP32)

Maximum absolute difference between `tf.keras` (FP32, CPU) and Core ML (FP16, ANE):

| Stage | Max abs diff | Mean rel diff |
|---|---|---|
| detector | 5.2e-3 | 1.6e-3 |
| recogniser | 3.1e-2 | 5.8e-3 |
| gdt_classifier | 6.9e-3 | 2.1e-3 |

Recogniser error is higher because the CTC softmax amplifies small logit differences. Greedy-decoded output on real test crops is identical between Keras and CoreML in spot-checks.
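Metrics like those in the table can be computed along these lines (a sketch; the exact epsilon/denominator convention the repo uses is an assumption):

```python
import numpy as np

def parity_metrics(ref: np.ndarray, test: np.ndarray, eps: float = 1e-6):
    """Max absolute and mean relative difference between two model outputs.

    `ref` is the FP32 Keras output, `test` the FP16 Core ML output. The
    eps guard in the denominator is an illustrative choice, not
    necessarily what the repo's comparison script does.
    """
    diff = np.abs(ref.astype(np.float64) - test.astype(np.float64))
    max_abs = float(diff.max())
    mean_rel = float((diff / (np.abs(ref.astype(np.float64)) + eps)).mean())
    return max_abs, mean_rel
```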

## Usage (Swift)

```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let detector = try MLModel(contentsOf: MLModel.compileModel(
    at: URL(fileURLWithPath: "artefacts/edocr2_detector.mlpackage")),
    configuration: config)

let recogniser = try MLModel(contentsOf: MLModel.compileModel(
    at: URL(fileURLWithPath: "artefacts/edocr2_recogniser.mlpackage")),
    configuration: config)

// feed: 1×1280×1280×3 float32, ImageNet-normalised
let det = try detector.prediction(from: ...)
// heatmap → bboxes in Swift/Python (see eDOCr2 upstream tools.getBoxes)

// for each detected bbox, crop to 31×200 grayscale, feed:
let rec = try recogniser.prediction(from: ...)
// rec output shape: [1, 48, 39]; run CTC greedy decode against
// the alphabet stored in mlmodel.metadata[.creatorDefinedKey]["alphabet"].
```

The full working example is in `test_ane.swift` — run it with `swift test_ane.swift` from the repo root (macOS 15+, Xcode CLT installed). It loads all three models on the ANE, benchmarks them, and prints a CTC-decoded sample.
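The snippet above defers heatmap post-processing to upstream `tools.getBoxes`. As an illustration only, here is a heavily simplified Python stand-in: threshold the region channel and emit one box per connected blob. The 0.35 threshold and 4-connectivity are assumptions; the real `getBoxes` also uses the affinity channel and more careful grouping.

```python
from collections import deque
import numpy as np

def heatmap_to_boxes(region: np.ndarray, thresh: float = 0.35):
    """Simplified stand-in for eDOCr2's tools.getBoxes: threshold the region
    heatmap and return one (x0, y0, x1, y1) box per 4-connected blob."""
    mask = region > thresh
    seen = np.zeros_like(mask, dtype=bool)
    boxes = []
    h, w = mask.shape
    for y in range(h):
        for x in range(w):
            if not mask[y, x] or seen[y, x]:
                continue
            # BFS over the connected component starting at (x, y)
            q = deque([(y, x)])
            seen[y, x] = True
            x0 = x1 = x
            y0 = y1 = y
            while q:
                cy, cx = q.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w \
                            and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        q.append((ny, nx))
            boxes.append((x0, y0, x1 + 1, y1 + 1))
    return boxes
```

Remember the detector heatmap is half the input resolution (640×640 for a 1280×1280 input), so scale box coordinates by 2 to map back to input pixels.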

## Reproducing the conversion

```sh
python3.12 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
python convert.py
```

This downloads upstream weights (~220 MB total) into `weights/` and writes three `.mlpackage` bundles into `artefacts/`. End-to-end conversion takes ~2 minutes on an M4.

See `convert.py` for the full pipeline:

1. **Detector** — upstream `build_keras_model` (VGG backbone) loaded with `craft_mlt_25k.h5` from the faustomorales/keras-ocr v0.8.4 release, then monkey-patched to a fixed (1, 1280, 1280, 3) input so Core ML / ANE can dispatch it. Converted via `coremltools.convert(source="tensorflow", compute_precision=FLOAT16)`.
2. **Recogniser / GD&T classifier** — upstream `Recognizer` class loaded with the matching `.keras` weight file from the eDOCr2 v1.0.0 release, using the alphabet from the sibling `.txt` file. The converted model stops at the CTC softmax layer; the prediction model's built-in `CTCDecoder` Lambda is dropped because `tf.keras.backend.ctc_decode` has no Core ML equivalent.

Both recognisers include a Spatial Transformer Network (STN) sub-graph, which converts cleanly via the TensorFlow frontend — no hand-rewriting was needed. The two `Lambda(_transform, …)` and `Lambda` (image-flip) layers trace as raw TF ops (`matmul`, `gather`, `reshape`, `slice`, `tile`, `add_n`, `clip_by_value`) and pass through Core ML conversion intact.

Every model's alphabet, expected input shape, and blank index are stashed in `mlmodel.user_defined_metadata` so the Swift side can inspect them rather than hard-coding values.
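Attaching that metadata at conversion time might look like the following sketch. Only the `alphabet` key is confirmed elsewhere in this README; the other key names and the layout are illustrative. Core ML user-defined metadata values must be strings, hence the `str`/`json.dumps` encoding:

```python
import json

def build_metadata(alphabet: str, input_shape) -> dict:
    """Hypothetical per-model metadata payload for user_defined_metadata.

    Core ML metadata values must be strings, so the shape is JSON-encoded.
    Only the "alphabet" key is confirmed by this README; the other key
    names are illustrative.
    """
    return {
        "alphabet": alphabet,
        "input_shape": json.dumps(list(input_shape)),
        "blank_index": str(len(alphabet)),  # CTC blank follows the alphabet
    }

# With coremltools this would be applied roughly as:
#   mlmodel.user_defined_metadata.update(
#       build_metadata(alphabet, (1, 31, 200, 1)))
```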

## Alphabets

Stored in each `.mlpackage` under `metadata[.creatorDefinedKey]["alphabet"]`.

- **Dimensions** (`edocr2_recogniser`): `0123456789AaBCDRGHhMmnxZt(),.+-±:/°"⌀=` (38 chars + blank)
- **GD&T** (`edocr2_gdt_classifier`): `0123456789,.⌀ABCD⏤⏥○⌭⌒⌓⏊∠⫽⌯⌖◎↗⌰ⒺⒻⓁⓂⓅⓈⓉⓊ` (39 chars + blank)

Both end with a CTC blank token at index `len(alphabet)`.

## Attribution

This repository converts eDOCr2 to Core ML format.

The `.mlpackage` files in `artefacts/` are derived from the upstream Keras weights; the conversion script `convert.py` and the benchmark harness `test_ane.swift` are MIT-licensed.
