This is my solution to the CrowdAI challenge on language detection in UN (ONU) documents.
Please follow the instructions below.
Data can be found here.
git clone https://github.com/DataExMachina/ONU-Digitization-Challenge
cd ONU-Digitization-Challenge
make setup
make cp_300dpi FROM=<path_from> TO=<path_to>
Example: make cp_300dpi FROM=/media/data/train/en/ TO=./data/img/train/en/
It copies the data, resizing each image to 300 dpi before Tesseract runs on it.
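A minimal Python sketch of what this step amounts to, assuming the source images carry a DPI tag (the real `cp_300dpi` target likely shells out to ImageMagick; the function names here are hypothetical, only the 300 dpi target comes from the Makefile):

```python
def scaled_size(width, height, src_dpi, target_dpi=300):
    """Pixel dimensions needed to re-render an image at target_dpi."""
    scale = target_dpi / src_dpi
    return round(width * scale), round(height * scale)

def copy_at_300dpi(src_path, dst_path, target_dpi=300):
    """Hypothetical stand-in for one cp_300dpi copy: resize, then save with a 300 dpi tag."""
    from PIL import Image  # Pillow; imported lazily so scaled_size stays dependency-free
    with Image.open(src_path) as img:
        src_dpi = img.info.get("dpi", (72, 72))[0]  # assume 72 dpi when untagged
        new_size = scaled_size(img.width, img.height, src_dpi, target_dpi)
        img.resize(new_size, Image.LANCZOS).save(dst_path, dpi=(target_dpi, target_dpi))
```

Upscaling to roughly 300 dpi matters because Tesseract's accuracy drops sharply on low-resolution scans.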
make ocr_metadata (gets the orientation of each image)
make create_rotation_requests (creates the bash instructions that rotate images)
make auto_rotate (rotates images when needed)
BE CAREFUL: you should run auto_rotate only once!
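The orientation step can be sketched as parsing the `Rotate:` line from Tesseract's orientation-and-script detection output (`tesseract <img> stdout --psm 0`) and turning it into a rotation command. The `mogrify` call is an assumption about what the generated bash instructions look like; the parsing of the OSD format is standard:

```python
import re

def rotation_from_osd(osd_text):
    """Read the 'Rotate: N' line from tesseract --psm 0 output (degrees to apply)."""
    m = re.search(r"^Rotate:\s*(\d+)", osd_text, re.MULTILINE)
    return int(m.group(1)) if m else 0

def rotation_command(img_path, osd_text):
    """Build an in-place ImageMagick rotation, or None when no rotation is needed."""
    deg = rotation_from_osd(osd_text)
    return "mogrify -rotate {} {}".format(deg, img_path) if deg else None
```

Because such a command rotates the file in place, running it twice rotates an already-corrected image again, which is exactly why auto_rotate must be run only once.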
Grab a coffee and watch Netflix; this step takes a long time.
make extract_train_fr
make extract_train_en
make extract_test
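Each extract target boils down to running Tesseract per image with the right language model. A sketch of the command these targets would issue (`fra`/`eng` are Tesseract's language codes; the exact flags in the Makefile may differ):

```python
import os
import subprocess

def tesseract_cmd(img_path, out_base, lang):
    """Command line for one OCR extraction; writes out_base.txt."""
    return ["tesseract", img_path, out_base, "-l", lang]

def extract_batch(img_paths, out_dir, lang):
    """Run OCR over a list of images, one text file per image (sketch of an extract target)."""
    for p in img_paths:
        base = os.path.join(out_dir, os.path.splitext(os.path.basename(p))[0])
        subprocess.run(tesseract_cmd(p, base, lang), check=True)
```

Picking the matching language model per split (fr for the French training set, en for the English one) is what makes the two separate extract_train targets necessary.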
source crowdai-venv/bin/activate
jupyter notebook
Then, in ./app, run the two notebooks (in the right order, please). The first converts the extracted text into dataframes. The second preprocesses the data with scikit-learn's TF-IDF implementation and predicts new documents using Follow The Regularized Leader (FTRL) from the wordbatch package.
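The modelling step can be sketched as a TF-IDF pipeline over a toy corpus; the toy documents and labels below are invented for illustration, and LogisticRegression stands in for wordbatch's FTRL learner used in the actual notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the OCR dataframes; labels are the target languages.
docs = [
    "le rapport de la commission",
    "les nations unies et le conseil",
    "the report of the committee",
    "the united nations and the council",
]
labels = ["fr", "fr", "en", "en"]

# Character n-grams are robust to OCR noise; LogisticRegression replaces
# wordbatch's FTRL here purely to keep the sketch self-contained.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(),
)
model.fit(docs, labels)
print(model.predict(["la commission des nations unies"]))
```

Character-level TF-IDF is a common choice for language identification because OCR errors corrupt whole words far more often than short character sequences.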
- textcleaner: Fred Weinhaus, and more generally ImageMagick
- ImageMagick:
Copyright [2018]
Licensed under the ImageMagick License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://imagemagick.org/script/license.php
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
- Python 3.6
- Tesseract V4