cpljames269/2DMatrixScanner

PDF Barcode Processor

This Python script is a utility for automatically processing PDF documents to extract structured information from barcodes (such as QR codes) located in the corners of each page. It is designed to work with documents that use a specific barcode format to define logical "sets" of pages, allowing for automated processing, splitting, or data extraction.

The script combines PyMuPDF, Pillow, and PyZXing to:

Identify a single PDF file in its directory.

Scan each page for barcodes in a specific format.

Interpret the decoded barcode data to understand page numbering, set size, and global page numbers.

Generate detailed reports in JSON format for further use.
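The first step, locating the PDF, can be sketched as follows. This is a minimal illustration rather than the script's actual code: the function name find_single_pdf is hypothetical, and raising an error when zero or several PDFs are present is an assumption about how that case is handled.

```python
from pathlib import Path


def find_single_pdf(directory):
    """Return the only PDF in `directory` (hypothetical helper).

    Assumption: anything other than exactly one PDF is treated as an error.
    """
    pdfs = sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if len(pdfs) != 1:
        raise FileNotFoundError(
            f"expected exactly one PDF in {directory}, found {len(pdfs)}"
        )
    return pdfs[0]
```

Calling find_single_pdf(".") from the script's directory would return the path of the one PDF to process.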

Features

Automatic PDF Detection: Automatically finds a single PDF file in the script's directory.

Barcode Scanning: Scans all four corners of each page for 1D or 2D barcodes.

High-Resolution Processing: Renders PDF pages at 600 DPI to ensure reliable barcode decoding.

Structured Output: Generates multiple JSON files with extracted data for easy integration with other systems.

sets_info.json: Detailed information on each document set.

pages_per_set.json: A list of page counts for each set.

global_page_numbers.json: A list of the global page numbers where each set begins.

check_to_fix.json: Flags potential errors where a barcode might not be on the first page of a set.

Intelligent Navigation: Skips pages within a set after successfully decoding the starting barcode, greatly speeding up the process for multi-page documents.

Error Handling: Provides clear output for missing or invalid PDFs and for decoding failures.

Self-Cleaning: Creates and manages a cropped_images folder to store temporary files, ensuring a clean working environment.
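The page-skipping behaviour and the list-style reports can be illustrated with the sketch below. It is not the script's implementation. The names scan_sets, build_reports, and decode_page are hypothetical, as are the dictionary keys. decode_page(n) stands in for "render page n, crop its corners, and decode", and is assumed to return (page_in_set, pages_in_set, global_start) or None.

```python
def scan_sets(decode_page, total_pages):
    """Walk the document, jumping past each set once a barcode is decoded."""
    sets, check_to_fix = [], []
    page = 1
    while page <= total_pages:
        decoded = decode_page(page)
        if decoded is None:
            page += 1  # nothing decoded here, try the next page
            continue
        page_in_set, pages_in_set, global_start = decoded
        if page_in_set != 1:
            # Barcode was not found on the first page of its set.
            check_to_fix.append(page)
        sets.append({
            "found_on_page": page,
            "pages_in_set": pages_in_set,
            "global_start": global_start,
        })
        # Skip the remaining pages of this set (always advance at least one).
        page += max(1, pages_in_set - page_in_set + 1)
    return sets, check_to_fix


def build_reports(sets):
    """Derive the two list-style reports from the collected sets."""
    return {
        "pages_per_set": [s["pages_in_set"] for s in sets],
        "global_page_numbers": [s["global_start"] for s in sets],
    }
```

With a 12-page set decoded on its first page, the loop advances 12 pages at once, so the remaining 11 pages are never rendered or decoded. Each report could then be written out with json.dump.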

Requirements

The script depends on several Python libraries. You can install them using pip:

pip install PyMuPDF Pillow pyzxing

PyMuPDF (fitz): For processing PDF documents.

Pillow (PIL): For image manipulation and cropping.

PyZXing: A wrapper for the ZXing barcode decoding library. This requires a working Java Runtime Environment (JRE) to be installed on your system.
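Because PyZXing needs Java, it can help to confirm that a java executable is on the PATH before running the script. The check below is an optional sketch using only the standard library. The function name java_available is hypothetical, and finding the executable does not guarantee that the installed runtime is a compatible version.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    if java_available():
        print("Java runtime found.")
    else:
        print("Java runtime not found; PyZXing will not be able to decode.")
```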

Usage

Placement: Place the Python script (.py file) in the same directory as the PDF file you want to process.

Execution: Open a terminal or command prompt in that directory and run the script:

python your_script_name.py

Review Output: The script will print its progress to the console and generate several JSON files and a cropped_images folder containing the temporary image files.

Barcode Format

The script is configured to look for a specific numeric barcode format: a 7- or 8-digit number structured as follows:

[2 digits] - Page number within the current set (e.g., 01, 02).

[2 digits] - Total pages in the current set (e.g., 12, 04).

[3-4 digits] - Global page number where the set starts.

For example, a barcode decoded as 01120005 would be interpreted as:

Page 1 of 12.

The set starts on global page 5.
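The format described above can be parsed along these lines. This is a sketch under the stated layout (2 + 2 + 3 or 4 digits), not the script's own parser. The names BARCODE_RE and parse_barcode are hypothetical, and returning None for anything that does not match is an assumption.

```python
import re

# Two digits (page in set), two digits (pages in set), 3-4 digits (global start).
BARCODE_RE = re.compile(r"([0-9]{2})([0-9]{2})([0-9]{3,4})")


def parse_barcode(text):
    """Split a decoded barcode string into its three numeric fields.

    Returns (page_in_set, pages_in_set, global_start), or None if the
    text is not a 7- or 8-digit number.
    """
    match = BARCODE_RE.fullmatch(text.strip())
    if match is None:
        return None
    page_in_set, pages_in_set, global_start = (int(g) for g in match.groups())
    return page_in_set, pages_in_set, global_start
```

For the example above, parse_barcode("01120005") gives (1, 12, 5): page 1 of 12, with the set starting on global page 5.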

Configuration

You can easily adjust the following parameters within the script:

output_folder: The name of the folder for temporary images.

corner_crop_size: The size of the square region (in pixels) to crop from each corner. Increasing this value (900 pixels or more) can improve decoding accuracy on low-quality PDFs, while a smaller value (400 pixels) is faster.
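To show how corner_crop_size relates to the rendered page, the sketch below computes the four corner regions for a page of a given pixel size. It is an illustration only. The function names are hypothetical, the boxes use Pillow's (left, upper, right, lower) convention for Image.crop, and clamping the crop to the page size is an assumption.

```python
def pixels_at_dpi(points, dpi=600):
    """Convert a PDF length in points (1/72 inch) to pixels at `dpi`."""
    return round(points * dpi / 72)


def corner_boxes(width, height, crop_size):
    """Return crop boxes for the four corners of a width x height image."""
    size = min(crop_size, width, height)  # assumption: never exceed the page
    return {
        "top_left": (0, 0, size, size),
        "top_right": (width - size, 0, width, size),
        "bottom_left": (0, height - size, size, height),
        "bottom_right": (width - size, height - size, width, height),
    }


# A US Letter page (612 x 792 points) rendered at 600 DPI.
width, height = pixels_at_dpi(612), pixels_at_dpi(792)
boxes = corner_boxes(width, height, crop_size=900)
```

At 600 DPI a 900-pixel crop covers a 1.5-inch square in each corner, while a 400-pixel crop covers about two thirds of an inch, so the smaller value only works when the barcode sits close to the page edge.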

Contribution

If you find a bug or have a suggestion for an improvement, feel free to open an issue or submit a pull request on GitHub.
