I'm running into issues when I try to include a double quote (") in the tessedit_char_whitelist config variable. My guess is that this happens because " is also used by pytesseract to work out where the config ends.
I'm not sure whether this should be considered a bug.
I'm mostly looking for alternative solutions; I found nothing in the documentation on whether you can just pass a config file instead.
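One alternative I have been considering is sketched below. It writes the whitelist to a Tesseract config file, which uses one "variable value" pair per line, so no shell-style quoting is involved. The helper name write_whitelist_config and the file name whitelist.cfg are hypothetical, the whitelist is a shortened stand-in, and I have not confirmed that pytesseract forwards such a path to Tesseract correctly or that a value containing a space is read back intact.

```python
import tempfile
from pathlib import Path


def write_whitelist_config(whitelist: str, directory: str) -> Path:
    """Write a one-line Tesseract config file (hypothetical helper)."""
    config_path = Path(directory) / "whitelist.cfg"
    # Tesseract config files hold "<variable> <value>" on each line.
    config_path.write_text(
        f"tessedit_char_whitelist {whitelist}\n", encoding="utf-8"
    )
    return config_path


# Shortened stand-in for the real whitelist; it ends with a space and a double quote.
charwhitelist = "ABCabcÅÄÖåäö0123456789 " + '"'
config_path = write_whitelist_config(charwhitelist, tempfile.mkdtemp())

# Untested idea: the Tesseract CLI accepts config file names as trailing
# arguments, so the path might be appended to the pytesseract config string:
# config = f"--oem 3 --psm {psm_nr} {config_path.as_posix()}"
```

Whether this sidesteps the problem depends on how the path itself is handled on Windows, so I would treat it as something to try rather than a confirmed fix.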
charwhitelist = r'ABCDEFGHIJKLMNOPQRSTUVWZYXÅÄÖabcdefghijklmnopqrstuvwxyzåäö0123456789-()/=&%!?:;.,é ' + '\"'

# Import path to tesseract executable
with open('tesseract_install.txt', 'r') as file:
    install_path = file.read()
pytesseract.pytesseract.tesseract_cmd = install_path

files = list(filter(IsImage, input_dir))
with about_time() as t1:
    total_iterations = len(files)
    remaining_iterations = len(files)
    completed_iterations = 0
    print(f'Starting Tesseract using PSM {psm_nr}, there are {total_iterations} pages to read.')
    for file in files:
        print(f'Starting work on {file}')
        try:
            img_cv = cv2.imread(str(file))
            img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
            hocr = pytesseract.image_to_pdf_or_hocr(img_rgb, extension='hocr', lang='swe+script/Latin', config=f"--oem 3 --psm {psm_nr} -c tessedit_char_whitelist='{charwhitelist}'")
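To narrow down where the quote might get lost, the tokenization of the config string can be checked in isolation. The sketch below assumes that pytesseract splits the config string with Python's shlex.split in its default POSIX mode, which I have not verified for 0.3.10. The shortened whitelist and the value of psm_nr are stand-ins.

```python
import shlex

psm_nr = 6  # stand-in value for illustration

# Shortened stand-in for the real whitelist; it ends with a space and a double quote.
charwhitelist = r'ABCabc0123456789 ' + '"'
config = f"--oem 3 --psm {psm_nr} -c tessedit_char_whitelist='{charwhitelist}'"

# Assumption: the wrapper tokenizes the config roughly like this.
tokens = shlex.split(config)

for token in tokens:
    # repr() makes spaces and quote characters inside each token visible.
    print(repr(token))
```

If that assumption holds, the single quotes are consumed by the split and the double quote stays inside the last token, so the characters missing from the output may have a different cause than the quote itself.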
Example output while trying to use this whitelist:
Smhllshjlp1922
ArfreningenSmhllshjlpklsskmpsorgnistion?
Example output without whitelist (and also expected result):
Samhällshjälp 1922
Är föreningen Samhällshjälp klasskampsorganisation?
Python version: 3.10.6, run via the interpreter bundled in an executable
pytesseract version: 0.3.10
Tesseract version: UB Mannheim Windows binary, v5.3.0.20221214