DataExMachina/ONU-Digitization-Challenge

This is my solution to the CrowdAI challenge on language detection in UN (ONU) documents.
Please follow the instructions below.
Data can be found here.

Initialization

  • git clone https://github.com/DataExMachina/ONU-Digitization-Challenge
  • cd ONU-Digitization-Challenge
  • make setup

Put images in this project

  • make cp_300dpi FROM=<path_from> TO=<path_to>

Example: make cp_300dpi FROM=/media/data/train/en/ TO=./data/img/train/en/
This copies the images and resizes them along the way, so that they are ready for Tesseract.

Get orientation from images

  • make ocr_metadata (gets the orientation of each image)
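
The Makefile target itself is not reproduced here. As a rough sketch of what reading an orientation can look like, the snippet below assumes the metadata comes from Tesseract's orientation and script detection mode (`--psm 0`), which prints "key: value" lines such as `Rotate: 90`. The helper names `osd_command`, `parse_osd` and `detect_orientation` are made up for this illustration and are not part of the project.

```python
import subprocess


def osd_command(image_path):
    # Assumed invocation: page segmentation mode 0 runs orientation and
    # script detection only; "stdout" sends the report to standard output.
    return ["tesseract", image_path, "stdout", "--psm", "0"]


def parse_osd(osd_text):
    # Turn the "key: value" lines of the report into a dictionary of strings.
    fields = {}
    for line in osd_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def detect_orientation(image_path):
    # Hypothetical wrapper: needs a local Tesseract install to actually run.
    result = subprocess.run(
        osd_command(image_path),
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_osd(result.stdout)


# Sample report in the shape Tesseract prints for --psm 0.
sample = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 2.05
Script: Latin
Script confidence: 1.11
"""
fields = parse_osd(sample)
print(fields["Rotate"])
```

The `Rotate` field is the clockwise correction to apply, which is the value a later rotation step would consume.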

Rotate

  • make create_rotation_requests (creates the bash instructions that rotate the images)
  • make auto_rotate (rotates the images that need it)

BE CAREFUL: run auto_rotate only once!
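
To illustrate the idea behind the rotation requests, here is a minimal sketch that turns detected rotations into bash lines. It assumes the rotation is done in place with ImageMagick's `mogrify -rotate`; the function name `rotation_requests` and the input mapping are hypothetical, and the real Makefile targets may work differently.

```python
import shlex


def rotation_requests(rotations):
    # rotations: mapping of image path -> clockwise correction in degrees.
    # Returns one bash line per image that actually needs rotating.
    lines = []
    for path, degrees in rotations.items():
        if degrees % 360 == 0:
            continue  # already upright, nothing to request
        # mogrify overwrites the file in place, so running the same line
        # twice rotates the image twice.
        lines.append("mogrify -rotate {} {}".format(degrees, shlex.quote(path)))
    return lines


requests = rotation_requests({"a.png": 0, "b c.png": 90})
print("\n".join(requests))
```

Under this assumption, an in-place rotation is not idempotent: a second run would rotate images that have already been corrected, which is one plausible reason for the warning above.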

Extract text

Grab a coffee and watch Netflix: this step takes a long time.

  • make extract_train_fr
  • make extract_train_en
  • make extract_test

Text to score

  • source crowdai-venv/bin/activate
  • jupyter notebook

Then, in ./app, run the two notebooks (in the right order, please). The first one converts the extracted text into dataframes. The second one preprocesses the data with the scikit-learn TF-IDF implementation and predicts the language of new documents using Follow The Regularized Leader (FTRL) from the wordbatch package.
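
The notebooks rely on scikit-learn and wordbatch. The sketch below is not taken from them: it is a small, dependency-free stand-in that shows the same two ideas, TF-IDF features followed by a per-coordinate FTRL-Proximal logistic regression. The toy sentences, the hyperparameters and every function and class name are assumptions made for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; \w also matches accented letters in Python 3.
    return re.findall(r"\w+", text.lower())


def fit_idf(docs):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    n = len(docs)
    return {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}


def tfidf(doc, idf):
    # Sparse, L2-normalised TF-IDF vector; words unseen in training are dropped.
    counts = Counter(tok for tok in tokenize(doc) if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression (illustrative)."""

    def __init__(self, alpha=0.1, beta=1.0, l1=0.01, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # accumulated adjusted gradients
        self.n = {}  # accumulated squared gradients

    def _weight(self, tok):
        z = self.z.get(tok, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # the L1 term keeps weak features at exactly zero
        sign = 1.0 if z > 0 else -1.0
        scale = (self.beta + math.sqrt(self.n.get(tok, 0.0))) / self.alpha + self.l2
        return -(z - sign * self.l1) / scale

    def predict_proba(self, x):
        score = sum(self._weight(tok) * v for tok, v in x.items())
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, x, y):
        p = self.predict_proba(x)
        for tok, v in x.items():
            g = (p - y) * v  # gradient of the log loss for this feature
            n = self.n.get(tok, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[tok] = self.z.get(tok, 0.0) + g - sigma * self._weight(tok)
            self.n[tok] = n + g * g


# Toy corpus (made up): label 0 is English, label 1 is French.
train = [
    ("the committee adopted the resolution", 0),
    ("the general assembly of the united nations", 0),
    ("report of the secretary general", 0),
    ("le comité a adopté la résolution", 1),
    ("l'assemblée générale des nations unies", 1),
    ("rapport du secrétaire général", 1),
]
idf = fit_idf([text for text, _ in train])
model = FTRLProximal()
for _ in range(10):  # a few passes over the tiny corpus
    for text, label in train:
        model.update(tfidf(text, idf), label)


def predict_lang(text):
    return "fr" if model.predict_proba(tfidf(text, idf)) > 0.5 else "en"


print(predict_lang("the report of the committee"))
print(predict_lang("le rapport du comité"))
```

FTRL updates one example at a time and keeps only two numbers per feature, which suits large sparse text features; the L1 term leaves rarely useful words at exactly zero weight.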

Copyrights

  • textcleaner: Fred Weinhaus, and more generally ImageMagick

  • ImageMagick:

    Copyright [2018]

    Licensed under the ImageMagick License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    https://imagemagick.org/script/license.php

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

  • tesseract: https://github.com/tesseract-ocr/tesseract

Support

  • Python 3.6
  • Tesseract V4
