This is my solution to the CrowdAI challenge on language detection in UN (ONU) documents.
Please follow the instructions below.
Data can be found here.
git clone https://github.com/DataExMachina/ONU-Digitization-Challenge
cd ONU-Digitization-Challenge
make setup
make cp_300dpi FROM=<path_from> TO=<path_to>
Example: make cp_300dpi FROM=/media/data/train/en/ TO=./data/img/train/en/
It copies the data, resizing each image to 300 dpi before Tesseract runs on it.
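A minimal Python sketch of what this step amounts to, assuming the source images carry a DPI tag (the real `cp_300dpi` target likely shells out to ImageMagick; the function names here are hypothetical, only the 300 dpi target comes from the Makefile):

```python
def scaled_size(width, height, src_dpi, target_dpi=300):
    """Pixel dimensions needed to re-render an image at target_dpi."""
    scale = target_dpi / src_dpi
    return round(width * scale), round(height * scale)

def copy_at_300dpi(src_path, dst_path, target_dpi=300):
    """Hypothetical stand-in for one cp_300dpi copy: resize, then save with a 300 dpi tag."""
    from PIL import Image  # Pillow; imported lazily so scaled_size stays dependency-free
    with Image.open(src_path) as img:
        src_dpi = img.info.get("dpi", (72, 72))[0]  # assume 72 dpi when untagged
        new_size = scaled_size(img.width, img.height, src_dpi, target_dpi)
        img.resize(new_size, Image.LANCZOS).save(dst_path, dpi=(target_dpi, target_dpi))
```

Upscaling to roughly 300 dpi matters because Tesseract's accuracy drops sharply on low-resolution scans.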
make ocr_metadata (gets the orientation of each image)
make create_rotation_requests (creates the bash instructions that rotate images)
make auto_rotate (rotates images when needed)
BE CAREFUL: you should run auto_rotate only once!
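The orientation step can be sketched as parsing the `Rotate:` line from Tesseract's orientation-and-script detection output (`tesseract <img> stdout --psm 0`) and turning it into a rotation command. The `mogrify` call is an assumption about what the generated bash instructions look like; the parsing of the OSD format is standard:

```python
import re

def rotation_from_osd(osd_text):
    """Read the 'Rotate: N' line from tesseract --psm 0 output (degrees to apply)."""
    m = re.search(r"^Rotate:\s*(\d+)", osd_text, re.MULTILINE)
    return int(m.group(1)) if m else 0

def rotation_command(img_path, osd_text):
    """Build an in-place ImageMagick rotation, or None when no rotation is needed."""
    deg = rotation_from_osd(osd_text)
    return "mogrify -rotate {} {}".format(deg, img_path) if deg else None
```

Because such a command rotates the file in place, running it twice rotates an already-corrected image again, which is exactly why auto_rotate must be run only once.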
Grab a coffee and watch Netflix; this step takes a long time.
make extract_train_fr
make extract_train_en
make extract_test
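Each extract target boils down to running Tesseract per image with the right language model. A sketch of the command these targets would issue (`fra`/`eng` are Tesseract's language codes; the exact flags in the Makefile may differ):

```python
import os
import subprocess

def tesseract_cmd(img_path, out_base, lang):
    """Command line for one OCR extraction; writes out_base.txt."""
    return ["tesseract", img_path, out_base, "-l", lang]

def extract_batch(img_paths, out_dir, lang):
    """Run OCR over a list of images, one text file per image (sketch of an extract target)."""
    for p in img_paths:
        base = os.path.join(out_dir, os.path.splitext(os.path.basename(p))[0])
        subprocess.run(tesseract_cmd(p, base, lang), check=True)
```

Picking the matching language model per split (fr for the French training set, en for the English one) is what makes the two separate extract_train targets necessary.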
source crowdai-venv/bin/activate
jupyter notebook
Then, in ./app, run the two notebooks (in the right order, please). The first converts the extracted text into dataframes. The second preprocesses the data with scikit-learn's TF-IDF implementation and predicts new documents using Follow The Regularized Leader (FTRL) from the wordbatch package.
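The modelling step can be sketched as a TF-IDF pipeline over a toy corpus; the toy documents and labels below are invented for illustration, and LogisticRegression stands in for wordbatch's FTRL learner used in the actual notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the OCR dataframes; labels are the target languages.
docs = [
    "le rapport de la commission",
    "les nations unies et le conseil",
    "the report of the committee",
    "the united nations and the council",
]
labels = ["fr", "fr", "en", "en"]

# Character n-grams are robust to OCR noise; LogisticRegression replaces
# wordbatch's FTRL here purely to keep the sketch self-contained.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(),
)
model.fit(docs, labels)
print(model.predict(["la commission des nations unies"]))
```

Character-level TF-IDF is a common choice for language identification because OCR errors corrupt whole words far more often than short character sequences.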
- textcleaner: Fred Weinhaus, and more generally ImageMagick
- ImageMagick:
Copyright [2018]
Licensed under the ImageMagick License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://imagemagick.org/script/license.php
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
- Python 3.6
- Tesseract V4