This project was developed with the assistance of modern AI-powered development tools, including Cursor IDE and Tongyi Qianwen. All code has been carefully reviewed to ensure originality and compliance with best practices. The implementation represents original work by the authors.
This repository contains the data and code for the paper FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation.
FinMMDocR is a new bilingual multimodal benchmark for evaluating MLLMs in financial numerical reasoning, featuring real-world scenarios, visually-rich documents, and multi-step computations.
- Environment Setup
- Dataset Overview
- Data Preprocessing
- RAG Retrieval System
- Inference Pipeline
- Evaluation Framework
- Project Structure
You can install the dependencies with the following command:
pip install -r requirements.txt
The FinMMDocR dataset is a comprehensive multimodal benchmark containing 1,200 carefully crafted financial reasoning questions derived from real-world financial research documents. Each question requires multimodal understanding, combining text analysis, visual document interpretation, and numerical reasoning.
Current Release: Due to AAAI submission requirements limiting supplementary materials to 50MB, we currently provide:
- Complete Dataset: Full test.json with all 1,200 questions and complete data structure
- Sample Images: One document (0000) with 10 associated images and all OCR results
- Full Dataset Images: Complete image dataset for all 1,200 questions will be publicly released after paper acceptance
What Will Be Released After Acceptance:
- Complete image dataset for all 1,200 questions (all document images)
- All OCR results for full document collection
- Additional preprocessing variants (images_50, images_15, etc.)
Note: The current release includes the complete question dataset with all 1,200 questions, but only provides sample images for one document due to size constraints. This allows reviewers to understand the complete data structure and run experiments with the available sample data, while the full image dataset will be released after paper acceptance.
- Total Questions: 1,200 questions (test-0 to test-1199)
- Document Types: 9 major categories of financial research
- Multimodal Components: Text + Images + Numerical data
- Language: Bilingual (Chinese and English)
- Domain: Financial analysis and numerical reasoning
Each question in the dataset follows this comprehensive structure:
{
"question_id": "test-0",
"doc_id": "0000",
"doc_type": "Market Interpretation",
"question": "Assume the Information Technology sector maintains its past 5-year annualized sales growth rate as reported over the next three years. Project the sector's total annual sales revenue three years from the report date. Use the sector's Total Market Cap and P/S ratio provided in the report for your calculations. Provide the answer in billions of USD, rounded to two decimal places.",
"evidence": {
"table": [9],
"image": [],
"plain_text": [],
"generalized_text (layout)": [],
"pie_chart": [],
"bar_chart": [],
"scatter_chart": [],
"line_chart": []
},
"python_solution": "import numpy as np\n\ndef solution():\n # Define variables with their values from the report\n it_market_cap_billion_usd = 15035.99 # Table 11, page 9\n it_ps_ratio = 5.15 # Table 11, page 9\n it_sales_growth_rate = 0.1673 # Table 11, page 9 (16.73%)\n\n # Calculate current annual sales\n current_sales = it_market_cap_billion_usd / it_ps_ratio\n\n # Calculate projected sales in 3 years\n num_years = 3\n projected_sales = current_sales * np.power(1 + it_sales_growth_rate, num_years)\n\n # Round final result to two decimal places\n answer = round(projected_sales, 2)\n\n # Return final result\n return answer",
"ground_truth": 4643.79,
"source_id": "0000-01",
"pages_num": 15,
"images": [
"/data/images/0000/page_1.png",
"/data/images/0000/page_2.png",
"/data/images/0000/page_3.png",
"/data/images/0000/page_4.png",
"/data/images/0000/page_5.png",
"/data/images/0000/page_6.png",
"/data/images/0000/page_7.png",
"/data/images/0000/page_8.png",
"/data/images/0000/page_9.png",
"/data/images/0000/page_10.png",
"/data/images/0000/page_11.png",
"/data/images/0000/page_12.png",
"/data/images/0000/page_13.png",
"/data/images/0000/page_14.png",
"/data/images/0000/page_15.png"
],
"texts": "/data/texts/0000.json"
}
- question_id: Unique identifier (test-0 to test-1199)
- doc_id: Source document identifier
- doc_type: Category of financial research
- question: Detailed financial reasoning question requiring multimodal analysis
The evidence field tracks which visual elements are relevant:
- table: Page numbers containing relevant tables
- image: General images (charts, graphs, etc.)
- plain_text: Text-only content
- generalized_text (layout): Layout-aware text extraction
- pie_chart, bar_chart, scatter_chart, line_chart: Specific chart types
- images: Array of image paths for each page of the source document
- texts: Path to extracted text content from the document
- pages_num: Total number of pages in the source document
- python_solution: Expert-written Python code with clear variable names and execution logic
- ground_truth: Numerical answer (typically float) derived from executing the solution
- source_id: Specific identifier linking to the source document
Question: "Assume the Information Technology sector maintains its past 5-year annualized sales growth rate as reported over the next three years. Project the sector's total annual sales revenue three years from the report date."
Required Skills:
- Document Understanding: Locate and extract relevant financial data from tables
- Numerical Reasoning: Apply growth rate calculations and projections
- Financial Knowledge: Understand P/S ratios, market cap, and sector analysis
- Multimodal Processing: Combine information from text and visual elements
Solution Approach:
- Extract IT sector market cap ($15,035.99 billion) and P/S ratio (5.15) from Table 11
- Calculate current annual sales: $15,035.99 ÷ 5.15 = $2,919.61 billion
- Apply 16.73% annual growth rate for 3 years: $2,919.61 × (1.1673)³
- Result: $4,643.79 billion
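The solution approach above can be checked with a short standalone script. This mirrors the python_solution field shown earlier, using plain Python exponentiation in place of numpy:

```python
# Values from Table 11 of document 0000, as used in the worked example
it_market_cap_billion_usd = 15035.99
it_ps_ratio = 5.15
it_sales_growth_rate = 0.1673  # 16.73%

# Current annual sales implied by market cap and P/S ratio
current_sales = it_market_cap_billion_usd / it_ps_ratio

# Compound the growth rate over three years
projected_sales = current_sales * (1 + it_sales_growth_rate) ** 3

answer = round(projected_sales, 2)
print(answer)  # 4643.79, matching the ground_truth
```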
The FinMMDocR dataset includes comprehensive preprocessing tools to handle multimodal document images with different resolution and quantity constraints.
The merge_image.py script manages image quantity through intelligent concatenation strategies:
Processing Strategy:
- ≤50 images: Direct copy without modification
- >50 images: Intelligent concatenation to reduce total image count
- Special handling: Different column layouts for specific document types
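The strategy can be illustrated with a small helper (a sketch; plan_concatenation is a hypothetical name, and the actual PIL tiling code is omitted):

```python
import math

def plan_concatenation(num_images, limit=50):
    """Return (k, concat_num): how many output images are produced and
    how many source pages are tiled into each, per the strategy above."""
    if num_images <= limit:
        return num_images, 1            # direct copy, one page per image
    k = math.ceil(num_images / limit)   # target image count
    concat_num = num_images // k        # images per concatenated result
    return k, concat_num

print(plan_concatenation(120))  # (3, 40): 120 pages -> 3 groups of 40
```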
Concatenation Parameters:
import math

# For documents with >50 images
k = math.ceil(len(image_list) / 50)  # Target image count
concat_num = len(image_list) // k    # Images per concatenated result
column_num = k                       # Column layout for concatenation
Output Structure:
data/
├── images/ # Original images (variable count)
└── images_50/ # Processed images (≤50 per document)
├── 0000/
│ ├── page_1.png
│ ├── page_2.png
│ └── ...
└── 0001/
├── page_1.png
└── ...
Usage:
python merge_image.py
The resize_image.py script handles multiple resolution requirements:
Supported Resolutions:
- Original: Maintains original image quality
- 3840px: High-resolution processing for detailed analysis
- 1920px: Standard resolution for balanced performance
Processing Logic:
from PIL import Image

def resize_to_1920(image_path, output_path=None):
    """
    Resize the longest side of the image to 1920px while maintaining aspect ratio
    """
    img = Image.open(image_path)
    width, height = img.size
    # Calculate scaling ratio from the longest side
    if width > height:
        scale = 1920 / width   # Width-based scaling
    else:
        scale = 1920 / height  # Height-based scaling
    new_width, new_height = int(width * scale), int(height * scale)
    # Apply high-quality resizing
    resized_img = img.resize((new_width, new_height), Image.LANCZOS)
    resized_img.save(output_path or image_path)
Resolution Variants:
data/
├── images_15/ # ≤15 images per document
│ ├── 0000/
│ └── 0001/
├── images_15_1920/ # 1920px resolution
│ ├── 0000/
│ └── 0001/
└── images_15_3840/ # 3840px resolution
├── 0000/
└── 0001/
Usage:
python resize_image.py
Complete Workflow:
1. Image Quantity Control:
   python merge_image.py    # Creates images_50/ with ≤50 images per document
2. Further Quantity Reduction (if needed):
   # Manual processing for ≤15 images; creates images_15/ directory
3. Resolution Processing:
   python resize_image.py   # Creates images_15_3840/ with high-resolution images
Configuration Options:
- Process Count: Control number of documents processed
- Parallel Processing: 32 concurrent processes for efficiency
- Memory Management: Automatic image cleanup to prevent memory overflow
- Error Handling: Comprehensive logging and retry mechanisms
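The parallel-processing setup can be sketched with the standard library (process_document here is a hypothetical placeholder for the per-document image work; the scripts themselves use 32 processes):

```python
from multiprocessing import Pool

def process_document(doc_id):
    # Placeholder for per-document work (merge/resize images, OCR, ...)
    return doc_id, "ok"

if __name__ == "__main__":
    doc_ids = [f"{i:04d}" for i in range(8)]
    # Scale the process count to your machine's cores and memory
    with Pool(processes=4) as pool:
        results = pool.map(process_document, doc_ids)
    print(results[0])  # ('0000', 'ok')
```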
We support inference with various LLM models through the inference.py script. The inference system supports both text and multimodal inputs with automatic retry mechanisms and rate limiting.
Direct Script Execution
python inference.py
Configuration Setup
Before running inference, you need to configure the following parameters in the Config class within inference.py:
1. Dataset Configuration:
   dataset_file: str = "path/to/your/dataset.json"  # Dataset file path
   outcome_dir: str = "path/to/output/directory"    # Root directory for saving output results
2. Model Configuration:
   # Choose one of the supported models:
   model_name: str = "qwen/qwen2.5-vl-32b-instruct"  # Currently active
   # model_name: str = "google/gemini-2.5-pro-preview-03-25"
   # model_name: str = "openai/gpt-4o"
   # model_name: str = "meta-llama/llama-4-maverick"
   # model_name: str = "anthropic/claude-3.7-sonnet"
   # And many more...
3. API Configuration:
   # OpenRouter (default)
   client: AsyncOpenAI = AsyncOpenAI(
       api_key="your_api_key",
       base_url="https://openrouter.ai/api/v1",
   )
   # Or use other providers: Alibaba DashScope, XAI Grok, Anthropic Claude
4. Processing Parameters:
   rpm: int = 100                       # Requests per minute limit
   max_no_improve_round_count: int = 3  # Maximum retry rounds for failed requests
   process_count: int = -1              # Number of items to process (-1 for all)
   max_input_images: int = 1000         # Maximum number of input images
Input Data Format
The script expects a JSON dataset with the following structure:
[
{
"question_id": "unique_id",
"question": "question text",
"system_input": "optional system prompt",
"user_input": "user question or instruction",
"images": ["path/to/image1.png", "path/to/image2.png"]
}
]
Output Structure
The inference process creates organized output directories:
output_dir/
├── dataset_name/
│ └── model_name/
│ └── timestamp_process_count_X/
│ ├── execution.log
│ ├── final_results.json
│ └── rounds_outcome/
│ ├── round_1.json
│ ├── round_2.json
│ └── ...
Key Features:
- Async Processing: Uses asyncio for efficient concurrent API calls
- Rate Limiting: Built-in rate limiting to respect API quotas
- Automatic Retries: Failed requests are automatically retried up to 3 rounds
- Multimodal Support: Handles both text and image inputs with base64 encoding
- Progress Tracking: Real-time progress bars and detailed logging
- Error Handling: Comprehensive error handling with detailed error messages
- Round-based Processing: Saves intermediate results after each round for recovery
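The rate-limiting idea can be sketched as follows (a minimal sketch with assumed names; the actual Config and inference loop carry more state):

```python
import asyncio
import time

class RateLimiter:
    """Space coroutine entries at least 60/rpm seconds apart."""
    def __init__(self, rpm):
        self.interval = 60.0 / rpm
        self._lock = asyncio.Lock()
        self._last = 0.0

    async def wait(self):
        async with self._lock:
            now = time.monotonic()
            delay = self._last + self.interval - now
            if delay > 0:
                await asyncio.sleep(delay)
            self._last = time.monotonic()

async def call_model(limiter, i):
    await limiter.wait()
    return i  # placeholder for an actual API call

async def main():
    limiter = RateLimiter(rpm=600)  # one request per 0.1 s
    return await asyncio.gather(*(call_model(limiter, i) for i in range(5)))

print(asyncio.run(main()))  # [0, 1, 2, 3, 4]
```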
The evaluation system supports both Chain-of-Thought (COT) and Program-of-Thought (POT) evaluation methods with comprehensive metrics and automated processing.
Individual Model Evaluation
python evaluation.py \
--prediction_path "outputs/test/raw_cot_outputs/model_output.json" \
--evaluation_output_dir "outputs/test/processed_cot_outputs" \
--prompt_type "cot" \
--ground_truth_file "data/test.json" \
--result_file "outputs/results/test_cot_results.json" \
--api_base "https://openrouter.ai/api/v1" \
--api_key "your_api_key"
Batch Evaluation Script
bash scripts/evaluate_all.sh
Evaluation Configuration
The evaluate_all.sh script supports the following configurations:
1. Prompt Types:
   - cot (Chain-of-Thought) - Currently active
   - pot (Program-of-Thought) - Commented out
2. Dataset Subsets:
   - test - Currently active
   - Can be extended to include train, validation
3. Model Outputs: The script includes a comprehensive list of supported model outputs:
   model_outputs=(
       "doubao_vision_rag_vidorag_0720.json"  # Currently active
       # "o4-mini-high.json"
       # "gpt4o.json"
       # "claude37_thinking.json"
       # "gemini.json"
       # "grok2.json"
       # "llama4.json"
       # "gemma3.json"
       # "mistral.json"
       # "qwen2_5vl.json"
       # And many more...
   )
Evaluation Methods
1. Chain-of-Thought (COT) Evaluation:
- Uses LLM-based answer extraction to parse numerical answers from model responses
- Supports automatic retry for failed extractions
- Calculates accuracy by comparing extracted answers with ground truth
- Provides execution rate metrics
2. Program-of-Thought (POT) Evaluation:
- Extracts executable Python code from model responses
- Executes code in isolated processes with timeout protection
- Compares execution results with ground truth
- Includes comprehensive error handling and security measures
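The isolated-execution idea behind POT evaluation can be sketched as follows (a simplified sketch; the project's evaluator adds further security measures such as disabling dangerous builtins):

```python
import multiprocessing

def _run(code, queue):
    env = {}
    try:
        exec(code, env)  # define solution() from the model's response
        queue.put(("ok", env["solution"]()))
    except Exception as exc:
        queue.put(("error", str(exc)))

def execute_with_timeout(code, timeout=10):
    """Run model-generated code in a separate process with a timeout."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run, args=(code, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # runaway code: kill and report timeout
        return ("timeout", None)
    return queue.get()

code = "def solution():\n    return round(15035.99 / 5.15, 2)"
print(execute_with_timeout(code))  # ('ok', 2919.61)
```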
Evaluation Metrics
The evaluation system provides the following metrics:
- Accuracy: Percentage of correct answers
- Execution Rate: Percentage of successfully processed responses
- Token Usage: Total completion tokens consumed
- Processing Statistics: Detailed breakdown by model and prompt type
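Numeric accuracy can be computed along these lines (a sketch; the relative-tolerance comparison is an assumption, and the repo's exact matching rules may differ):

```python
def accuracy(predictions, ground_truths, rel_tol=1e-2):
    """Fraction of predictions within a relative tolerance of the truth.
    `None` predictions count as failed extractions."""
    correct = 0
    for pred, truth in zip(predictions, ground_truths):
        if pred is None:
            continue
        denom = max(abs(truth), 1e-9)  # guard against division by zero
        if abs(pred - truth) / denom <= rel_tol:
            correct += 1
    return correct / len(ground_truths)

preds = [4643.79, 2919.60, None]
truths = [4643.79, 2919.61, 100.0]
print(accuracy(preds, truths))  # 2 of 3 within tolerance
```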
Output Structure
Evaluation results are organized as follows:
outputs/
├── test/
│ ├── raw_cot_outputs/ # Raw model outputs
│ ├── processed_cot_outputs/ # Processed evaluation results
│ └── raw_pot_outputs/ # Raw POT outputs
└── results/
    ├── test_cot_results.json        # Aggregated COT results
    └── test_pot_results.json        # Aggregated POT results
Key Features:
- Multi-process Safety: Code execution in isolated processes with timeout
- Security Measures: Disabled dangerous functions and system calls
- Automatic Retry: Failed answer extractions are automatically retried
- Comprehensive Logging: Detailed progress tracking and error reporting
- Flexible Configuration: Support for multiple models, prompt types, and datasets
- Batch Processing: Automated evaluation of multiple models simultaneously
The FinMMDocR project is organized into several key directories, each serving specific purposes in the multimodal document processing and evaluation pipeline.
FinMMDocR/
├── data/ # Dataset and preprocessing outputs
│ ├── images/ # Original document images (variable count)
│ ├── images_50/ # Processed images (≤50 per document)
│ ├── images_15/ # Further reduced images (≤15 per document)
│ ├── images_15_1920/ # 1920px resolution images
│ ├── images_15_3840/ # 3840px resolution images
│ ├── texts/ # OCR results for each document
│ └── test.json # Main dataset file (1,200 questions)
├── models/ # RAG embedding model parameters
├── outputs/ # Inference and evaluation results
│ ├── test/
│ │ ├── raw_cot_outputs/ # Raw Chain-of-Thought outputs
│ │ ├── processed_cot_outputs/ # Processed COT evaluation results
│ │ └── raw_pot_outputs/ # Raw Program-of-Thought outputs
│ └── results/ # Aggregated evaluation results
├── retrieved_results/ # RAG experiment results
├── scripts/ # Automation and evaluation scripts
└── utils/ # Utility functions and tools
- images/: Original document images with variable page counts
- images_50/: Images processed to ≤50 per document via concatenation
- images_15/: Further reduced to ≤15 images per document
- images_15_1920/: 1920px resolution variants for performance optimization
- images_15_3840/: 3840px resolution variants for high-quality analysis
- texts/: OCR results for each document, enabling text-based retrieval and analysis
- test.json: Main dataset containing 1,200 multimodal financial reasoning questions
- Purpose: Stores parameters for RAG (Retrieval-Augmented Generation) embedding models
- Usage: Enables semantic search and document retrieval for multimodal reasoning
- Content: Model weights, configurations, and embeddings for financial document understanding
- test/raw_cot_outputs/: Raw model responses for Chain-of-Thought evaluation
- test/processed_cot_outputs/: Processed and evaluated COT results
- test/raw_pot_outputs/: Raw model responses for Program-of-Thought evaluation
- results/: Aggregated evaluation metrics and comparative analysis
- Purpose: Stores results from RAG-based experiments
- Content: Retrieved document chunks, relevance scores, and retrieval-augmented responses
- Usage: Analysis of retrieval performance and multimodal document understanding
- evaluate_all.sh: Batch evaluation script for multiple models
- Purpose: Automated evaluation pipeline for comprehensive model comparison
- evaluation_utils.py: Evaluation metrics and answer extraction
- openai_utils.py: API integration and model communication
- bm25_utils.py: Text-based retrieval utilities
- BGE-M3.py: BGE-M3 embedding model integration
Preprocessing Pipeline:
Original Images → merge_image.py → images_50/ → resize_image.py → images_15_3840/
OCR Processing:
Document Images → OCR Processing → data/texts/ → RAG Retrieval
Experiment Pipeline:
Dataset → Inference → outputs/ → Evaluation → results/
RAG Models → Retrieved Results → retrieved_results/
The FinMMDocR project includes a comprehensive Retrieval-Augmented Generation (RAG) system that supports both parallel and serial processing for document retrieval experiments.
The RAG system enables semantic search and document retrieval for multimodal financial reasoning by:
- Text-based Retrieval: Using OCR results from data/texts/ for semantic search
- Multiple Embedding Models: Support for various embedding models (OpenAI, HuggingFace, BM25)
- Parallel/Serial Processing: Efficient processing options for different experimental needs
- Retrieval Results: Stored in retrieved_results/ for analysis
- Type: Traditional keyword-based retrieval
- Usage: Baseline retrieval method for comparison
- Implementation: utils/bm25_utils.py
- Features: TF-IDF based scoring with text preprocessing
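For reference, the BM25 scoring formula can be sketched in pure Python (a toy version with whitespace tokenization; utils/bm25_utils.py presumably applies proper tokenization and preprocessing):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score whitespace-tokenized docs against a query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()                      # document frequency per term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)                 # term frequency within this doc
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = ["market cap and sales growth", "weather report for today", "sales of the IT sector"]
scores = bm25_scores("sales growth", docs)
print(max(range(3), key=scores.__getitem__))  # 0: that doc matches both terms
```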
- Models: text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002
- Type: Semantic embedding-based retrieval
- Features: High-quality semantic understanding
- Usage: Advanced semantic search capabilities
- Models: contriever-msmarco and other custom models
- Type: Local embedding models
- Features: Offline processing, customizable
- Usage: Cost-effective semantic retrieval
python retrieve_parallel.py
Key Features:
- Multi-process Execution: 32 concurrent processes for high throughput
- Progress Tracking: Real-time progress bars with detailed logging
- Memory Optimization: Efficient batch processing and memory management
- Error Handling: Robust error recovery and retry mechanisms
Configuration:
# Model selection
model_name = "bm25" # or "text-embedding-3-small", "contriever-msmarco"
# Processing parameters
num_processes = 32 # Number of parallel processes
top_k = -1         # Retrieve all documents (-1) or a specific number
python retrieve_serial.py
Complete Workflow:
1. Load Dataset → 2. Process Queries → 3. Retrieve Documents → 4. Save Results
Detailed Process:
- Data Loading: Load questions from data/test.json
- Text Processing: Extract OCR results from data/texts/
- Embedding Generation: Create embeddings for queries and documents
- Similarity Search: Find top-k most relevant documents
- Result Storage: Save to retrieved_results/{model_name}.json
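The similarity-search step can be sketched with numpy (toy embeddings with assumed shapes; real ones come from the embedding models above):

```python
import numpy as np

def top_k_pages(query_emb, page_embs, k=2):
    """Rank pages by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    sims = p @ q                       # cosine similarity per page
    order = np.argsort(-sims)[:k]      # indices of the k best pages
    return [{"page": int(i) + 1, "score": float(sims[i])} for i in order]

rng = np.random.default_rng(0)
pages = rng.normal(size=(15, 8))               # 15 pages, 8-dim toy embeddings
query = pages[8] + 0.01 * rng.normal(size=8)   # query nearly identical to page 9
print(top_k_pages(query, pages)[0]["page"])  # 9
```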
Retrieval Results Structure:
[
{
"question_id": "test-0",
"question": "Financial reasoning question...",
"retrieved_results": [
{
"page": 1,
"score": 0.85
},
{
"page": 3,
"score": 0.72
}
]
}
]
Model Selection:
EMBEDDING_MODELS = {
"contriever-msmarco": {
"type": "hf",
"path": "model_path",
"projection_size": 768,
},
"text-embedding-3-small": {
"type": "openai",
"model": "text-embedding-3-small",
"projection_size": 1536,
},
"bm25": {
"type": "keyword",
"features": "TF-IDF based"
}
}
Processing Parameters:
- top_k: Number of documents to retrieve (-1 for all)
- batch_size: Embedding generation batch size
- max_tokens: Maximum tokens for text truncation
- n_subquantizers: Index compression parameters
Basic Retrieval:
# Parallel processing with BM25
python retrieve_parallel.py
# Serial processing with OpenAI embeddings
python retrieve_serial.py
Custom Configuration:
# Modify model selection in script
model_name = "text-embedding-3-large"
# Adjust processing parameters
num_processes = 16 # Reduce for memory constraints
top_k = 10         # Retrieve top 10 documents
Results Analysis:
import json

# Load retrieval results
with open("./retrieved_results/bm25.json", "r") as f:
    results = json.load(f)
# Analyze retrieval performance
for result in results:
print(f"Question: {result['question_id']}")
print(f"Retrieved {len(result['retrieved_results'])} documents")
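Retrieval quality against the evidence annotations can be measured along these lines (a sketch; field names follow the dataset structure shown earlier, and recall_at_k is a hypothetical helper):

```python
def recall_at_k(retrieved_results, evidence_pages, k=5):
    """Fraction of annotated evidence pages found among the top-k retrieved pages."""
    if not evidence_pages:
        return None  # no annotated evidence for this question
    top_pages = {r["page"] for r in retrieved_results[:k]}
    hits = sum(1 for p in evidence_pages if p in top_pages)
    return hits / len(evidence_pages)

retrieved = [{"page": 1, "score": 0.85}, {"page": 9, "score": 0.72}, {"page": 3, "score": 0.41}]
print(recall_at_k(retrieved, evidence_pages=[9], k=2))  # 1.0
```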