Detects columns and indented lines in an hOCR file. This Python 3 script is used in the NYPL's NYC Space/Time Directory project to extract data from digitized city directories.
Most OCR tools can produce hOCR files; this project uses OCRopus. See https://github.com/nypl-spacetime/ocr-scripts for more details.
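To give a feel for the input, here is a minimal sketch of reading line bounding boxes from hOCR and grouping them by their left edge. It relies on the hOCR convention that each line is an element with class `ocr_line` whose `title` attribute contains `bbox x0 y0 x1 y1`. The sample markup, the function names (`line_boxes`, `group_by_left_edge`, `find_indented`) and the pixel thresholds are assumptions made up for this illustration; this is not the algorithm used in detect_columns.py.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 100 50 450 80".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Made-up sample markup, for illustration only.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 1000 800">
 <span class="ocr_line" title="bbox 100 50 450 80">Abbott John, grocer</span>
 <span class="ocr_line" title="bbox 130 90 450 120">12 Pearl St</span>
 <span class="ocr_line" title="bbox 600 50 950 80">Baker Mary, tailor</span>
</div>
"""


class LineBoxParser(HTMLParser):
    """Collects (x0, y0, x1, y1) for every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocr_line" in classes and match:
            self.boxes.append(tuple(int(v) for v in match.groups()))


def line_boxes(hocr):
    """Return the bounding boxes of all ocr_line elements in an hOCR string."""
    parser = LineBoxParser()
    parser.feed(hocr)
    return parser.boxes


def group_by_left_edge(boxes, tolerance=50):
    """Hypothetical grouping: a line joins the current column when its left
    edge is within `tolerance` pixels of that column's leftmost line."""
    columns = []
    for box in sorted(boxes):
        if columns and box[0] - columns[-1][0][0] <= tolerance:
            columns[-1].append(box)
        else:
            columns.append([box])
    return columns


def find_indented(column, min_indent=15):
    """Hypothetical check: lines starting at least `min_indent` pixels to the
    right of the column's leftmost line count as indented."""
    left = min(box[0] for box in column)
    return [box for box in column if box[0] - left >= min_indent]


if __name__ == "__main__":
    for number, column in enumerate(group_by_left_edge(line_boxes(SAMPLE)), 1):
        print(number, len(column), find_indented(column))
```

Real directory pages are noisier than this sample, so fixed thresholds like the ones above would likely need tuning per scan.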
hocr-detect-columns was built and tested using Python 3.5, and depends on the following packages:
python3 detect_columns.py /path/to/hocr.html
hocr-detect-columns will parse hocr.html and create three files in /path/to: bboxes.json, lines.txt, and visualization.html.
COMING SOON!
