
Can't pass quotation mark character into tessedit_char_whitelist #501

@natsukashiixo

I'm running into issues when I try to include " in the tessedit_char_whitelist config flag. This is most likely because " is also the character pytesseract uses to tell where the config string ends.
I don't know whether this should be considered a bug.
I'm mostly looking for alternative solutions; I found nothing in the documentation on whether a config file can be passed instead.
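
As a possible starting point for the config file route, here is a minimal sketch that writes the whitelist to a file instead of embedding it in the config string. It assumes Tesseract config files hold one "name value" pair per line, with everything after the first whitespace taken as the value; the helper name, the file name whitelist.cfg, and the shortened whitelist are hypothetical, and none of this has been tested against the setup in this issue.

```python
import tempfile
from pathlib import Path


def write_tesseract_config(path: Path, whitelist: str) -> Path:
    """Write a one-line config file setting tessedit_char_whitelist.

    Assumption: the file format is "name value", one pair per line.
    """
    path.write_text(f"tessedit_char_whitelist {whitelist}\n", encoding="utf-8")
    return path


# Shortened stand-in for the real whitelist; it ends in a space and a double quote.
whitelist = 'ABCabc0123456789 ' + '"'
config_path = write_tesseract_config(
    Path(tempfile.gettempdir()) / "whitelist.cfg", whitelist
)
print(config_path.read_text(encoding="utf-8"))

# Hypothetical usage, not verified here: Tesseract accepts config file names
# as trailing command-line arguments, so the path might be appended like this.
# A path containing spaces or backslashes would itself need careful quoting.
# config = f"--oem 3 --psm {psm_nr} {config_path}"
```

If this works, the quotation mark never passes through the config string at all, which sidesteps the quoting question entirely.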

charwhitelist = r'ABCDEFGHIJKLMNOPQRSTUVWZYXÅÄÖabcdefghijklmnopqrstuvwxyzåäö0123456789-()/=&%!?:;.,é ' + '\"'

# Import path to tesseract executable
with open('tesseract_install.txt', 'r') as file:
    install_path = file.read()

pytesseract.pytesseract.tesseract_cmd = install_path

files = list(filter(IsImage, input_dir))

with about_time() as t1:
    total_iterations = len(files)
    remaining_iterations = len(files)
    completed_iterations = 0
    print(f'Starting Tesseract using PSM {psm_nr}, there are {total_iterations} pages to read.')
    for file in files:
        print(f'Starting work on {file}')
        try:
            img_cv = cv2.imread(str(file))
            img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
            hocr = pytesseract.image_to_pdf_or_hocr(img_rgb, extension='hocr', lang='swe+script/Latin', config=f"--oem 3 --psm {psm_nr} -c tessedit_char_whitelist='{charwhitelist}'")
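
One way to check the quoting theory is to look at how the config string splits into separate arguments. The sketch below uses only Python's standard shlex module. It assumes the config is tokenized in a shlex-like, POSIX-style way, which is not confirmed in this issue and may differ on Windows. The build_config helper and the shortened whitelist are hypothetical.

```python
import shlex


def build_config(psm: int, whitelist: str) -> str:
    """Build a config string with the whole -c value quoted as one argument."""
    # shlex.quote wraps the value so that a POSIX-style split returns it
    # as a single token, even with spaces and double quotes inside.
    value = shlex.quote("tessedit_char_whitelist=" + whitelist)
    return f"--oem 3 --psm {psm} -c {value}"


# Shortened stand-in for the real whitelist; it ends in a space and a double quote.
whitelist = 'abc ' + '"'
config = build_config(6, whitelist)

# Split with POSIX rules (assumption: the wrapper splits the config similarly).
tokens = shlex.split(config)
for token in tokens:
    print(repr(token))
```

If the last token printed is not the full tessedit_char_whitelist=... value, the whitelist is being cut before it reaches Tesseract. If it is intact, the quoting is probably not the cause and the problem lies elsewhere.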

Example output while trying to use this whitelist:
Smhllshjlp1922
ArfreningenSmhllshjlpklsskmpsorgnistion?

Example output without whitelist (and also expected result):
Samhällshjälp 1922
Är föreningen Samhällshjälp klasskampsorganisation?

python version: 3.10.6 run via bundled interpreter in an executable
pytesseract version: 0.3.10
tesseract version: UB Mannheim windows binary, v5.3.0.20221214
