Merged
for more information, see https://pre-commit.ci
….com:badGarnet/pytesseract into yao/allow-multiple-output-formats-in-one-run
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
summary
This PR resolves #304 by adding a new function `run_and_get_multiple_output` that can take multiple extensions (output formats) and return them after one invocation of `tesseract`. This saves compute time when the user tries to get multiple outputs from one input.

walkthrough
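For orientation, here is a minimal sketch of how a caller might use the new function and keep each result next to its extension. It assumes the results come back as a list in the same order as `extensions`, and that each item is either `bytes` (e.g. for `pdf`) or `str`. The helper names `save_outputs` and `demo`, the image path `test.png`, and the output stem `result` are hypothetical and only serve the illustration.

```python
from pathlib import Path


def save_outputs(outputs, extensions, stem):
    """Write each result to '<stem>.<extension>' and return the written paths.

    Hypothetical helper: it pairs results with extensions by position,
    which assumes the outputs follow the order of the requested extensions.
    """
    paths = []
    for data, extension in zip(outputs, extensions):
        path = Path(f"{stem}.{extension}")
        if isinstance(data, bytes):
            path.write_bytes(data)  # binary formats such as pdf
        else:
            path.write_text(data, encoding="utf-8")  # text formats such as txt
        paths.append(path)
    return paths


def demo(image_path="test.png"):
    """One tesseract run for both formats (needs pytesseract with this PR)."""
    import pytesseract  # imported lazily so the helper above works without it

    extensions = ["pdf", "txt"]
    outputs = pytesseract.run_and_get_multiple_output(
        image_path, extensions=extensions
    )
    return save_outputs(outputs, extensions, "result")
```

Calling `demo()` requires a local `tesseract` install and an image at the given path. `save_outputs` has no such dependency and can be tried with any list of `bytes` or `str` values.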
The main addition in this PR is the function `run_and_get_multiple_output`. It accepts a list of extensions like `['pdf', 'txt']`. Internally this function:

- assembles the config for the requested extensions (via `EXTENTION_TO_CONFIG`).
- calls `tesseract` just once to generate all the files needed
- reads the output for each of the `extensions`

Note that this PR only allows a subset of all supported extensions. This is to limit the config to those that are compatible when assembled together. E.g., the extension `osd` requires a different command line param, `--psm` instead of `-c`, and is therefore not supported yet by this new function.

This PR refactors the function `run_tesseract` so it can handle multiple extensions: the key change is to filter out extensions that do not need to be appended to the command line arguments.

This PR also refactors the code that reads the output into a helper `_read_output` so it can be reused by both the new `run_and_get_multiple_output` and the existing `run_and_get_output`.

test
This PR adds a unit test covering a few combinations of different extension lists. I'd encourage the reviewer to run the function locally with a simple example and compare its runtime to that of producing the same outputs with separate calls. This can be a common usage pattern for follow-up analysis on the OCR results.
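A minimal sketch of such a runtime comparison, under stated assumptions: the image path `test.png` and the helpers `time_call` and `compare` are hypothetical, and the "separate calls" baseline is taken to be `image_to_pdf_or_hocr` plus `image_to_string`, which is one plausible way to get the same two outputs without the new function.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare(image_path="test.png"):
    """Time one combined run against two separate runs (needs tesseract)."""
    import pytesseract  # imported lazily; requires the version with this PR

    def separate_calls():
        # Baseline assumption: one tesseract invocation per output format.
        pdf = pytesseract.image_to_pdf_or_hocr(image_path, extension="pdf")
        txt = pytesseract.image_to_string(image_path)
        return pdf, txt

    _, t_multi = time_call(
        pytesseract.run_and_get_multiple_output,
        image_path,
        extensions=["pdf", "txt"],
    )
    _, t_separate = time_call(separate_calls)
    print(f"one run: {t_multi:.2f}s, separate runs: {t_separate:.2f}s")
    return t_multi, t_separate
```

A single timing is noisy, so repeating `compare()` a few times on the same image gives a fairer picture than one measurement.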