hocr-detect-columns

Detects columns and indented lines in an hOCR file. This Python 3 script is used in the NYPL's NYC Space/Time Directory project to extract data from digitized city directories.

hOCR column detection

Most OCR tools can produce hOCR files; we are using OCRopus. See https://github.com/nypl-spacetime/ocr-scripts for more details.
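
For readers new to the format: hOCR is HTML in which each recognized element carries its bounding box in the `title` attribute, for example `title='bbox 100 50 480 80'` on an element with class `ocr_line`. The sketch below shows one way to read those line boxes using only the Python standard library. It is an illustration of the input format, not the code in detect_columns.py; the names `LineBBoxParser` and `line_bboxes` and the sample markup are made up for this example.

```python
import re
from html.parser import HTMLParser

# hOCR stores boxes as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class LineBBoxParser(HTMLParser):
    """Collect (x0, y0, x1, y1) for every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.bboxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" not in classes:
            return
        match = BBOX_RE.search(attrs.get("title") or "")
        if match:
            self.bboxes.append(tuple(int(v) for v in match.groups()))


def line_bboxes(hocr_text):
    """Return the bounding boxes of all ocr_line elements, in document order."""
    parser = LineBBoxParser()
    parser.feed(hocr_text)
    return parser.bboxes


# Hypothetical hOCR fragment: two lines in a left column, one in a right column.
sample = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
  <span class='ocr_line' title='bbox 100 50 480 80'>Abbott John, clerk</span>
  <span class='ocr_line' title='bbox 120 85 480 115'>h 12 Elm</span>
  <span class='ocr_line' title='bbox 520 50 900 80'>Baker Mary, teacher</span>
</div>
"""

print(line_bboxes(sample))
```

In this made-up sample, the left edges (100, 120, 520) hint at the kind of signal a column and indentation detector could work from: the first two lines share a column, with the second slightly indented, and the third starts a new column.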

Installation

hocr-detect-columns was built and tested using Python 3.5, and depends on the following packages:

Usage

python3 detect_columns.py /path/to/hocr.html

hocr-detect-columns parses hocr.html and creates three files in /path/to:

  • bboxes.json
  • lines.txt
  • visualization.html
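
When running the script over many pages, it can be handy to check that all three output files were written next to the input. The helper below is a small sketch that assumes only what is stated above: the outputs land in the same directory as the hOCR file and use these three file names. The function names `expected_outputs` and `missing_outputs` are hypothetical and not part of hocr-detect-columns.

```python
from pathlib import Path

# File names taken from the list above.
OUTPUT_NAMES = ("bboxes.json", "lines.txt", "visualization.html")


def expected_outputs(hocr_path):
    """Return the paths of the three output files for a given hOCR file."""
    directory = Path(hocr_path).parent
    return [directory / name for name in OUTPUT_NAMES]


def missing_outputs(hocr_path):
    """Return the expected output files that do not exist on disk."""
    return [path for path in expected_outputs(hocr_path) if not path.exists()]


for path in expected_outputs("/path/to/hocr.html"):
    print(path)
```

After a run, an empty list from `missing_outputs("/path/to/hocr.html")` would indicate that all three files are present.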

How does it work?

Coming soon!
