This project was developed with the assistance of modern AI-powered development tools, including Cursor IDE and Tongyi Qianwen. All code has been carefully reviewed to ensure originality and compliance with best practices. The implementation represents original work by the authors.
This repository contains the data and code for the paper FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation.
FinMMDocR is a new bilingual multimodal benchmark for evaluating MLLMs in financial numerical reasoning, featuring real-world scenarios, visually-rich documents, and multi-step computations.
- Environment Setup
- Dataset Overview
- Data Preprocessing
- RAG Retrieval System
- Inference Pipeline
- Evaluation Framework
- Project Structure
You can install the dependencies with the following command:
pip install -r requirements.txt
The FinMMDocR dataset is a comprehensive multimodal benchmark containing 1,200 carefully crafted financial reasoning questions derived from real-world financial research documents. Each question requires multimodal understanding, combining text analysis, visual document interpretation, and numerical reasoning.
Current Release: Due to AAAI submission requirements limiting supplementary materials to 50MB, we currently provide:
- Complete Dataset: Full test.json with all 1,200 questions and complete data structure
- Sample Images: One document (0000) with 10 associated images and all OCR results
- Full Dataset Images: Complete image dataset for all 1,200 questions will be publicly released after paper acceptance
What Will Be Released After Acceptance:
- Complete image dataset for all 1,200 questions (all document images)
- All OCR results for full document collection
- Additional preprocessing variants (images_50, images_15, etc.)
Note: The current release includes the complete question dataset with all 1,200 questions, but only provides sample images for one document due to size constraints. This allows reviewers to understand the complete data structure and run experiments with the available sample data, while the full image dataset will be released after paper acceptance.
- Total Questions: 1,200 questions (test-0 to test-1199)
- Document Types: 9 major categories of financial research
- Multimodal Components: Text + Images + Numerical data
- Language: Bilingual (Chinese and English)
- Domain: Financial analysis and numerical reasoning
Each question in the dataset follows this comprehensive structure:
{
"question_id": "test-0",
"doc_id": "0000",
"doc_type": "Market Interpretation",
"question": "Assume the Information Technology sector maintains its past 5-year annualized sales growth rate as reported over the next three years. Project the sector's total annual sales revenue three years from the report date. Use the sector's Total Market Cap and P/S ratio provided in the report for your calculations. Provide the answer in billions of USD, rounded to two decimal places.",
"evidence": {
"table": [9],
"image": [],
"plain_text": [],
"generalized_text (layout)": [],
"pie_chart": [],
"bar_chart": [],
"scatter_chart": [],
"line_chart": []
},
"python_solution": "import numpy as np\n\ndef solution():\n # Define variables with their values from the report\n it_market_cap_billion_usd = 15035.99 # Table 11, page 9\n it_ps_ratio = 5.15 # Table 11, page 9\n it_sales_growth_rate = 0.1673 # Table 11, page 9 (16.73%)\n\n # Calculate current annual sales\n current_sales = it_market_cap_billion_usd / it_ps_ratio\n\n # Calculate projected sales in 3 years\n num_years = 3\n projected_sales = current_sales * np.power(1 + it_sales_growth_rate, num_years)\n\n # Round final result to two decimal places\n answer = round(projected_sales, 2)\n\n # Return final result\n return answer",
"ground_truth": 4643.79,
"source_id": "0000-01",
"pages_num": 15,
"images": [
"/data/images/0000/page_1.png",
"/data/images/0000/page_2.png",
"/data/images/0000/page_3.png",
"/data/images/0000/page_4.png",
"/data/images/0000/page_5.png",
"/data/images/0000/page_6.png",
"/data/images/0000/page_7.png",
"/data/images/0000/page_8.png",
"/data/images/0000/page_9.png",
"/data/images/0000/page_10.png",
"/data/images/0000/page_11.png",
"/data/images/0000/page_12.png",
"/data/images/0000/page_13.png",
"/data/images/0000/page_14.png",
"/data/images/0000/page_15.png"
],
"texts": "/data/texts/0000.json"
}
- question_id: Unique identifier (test-0 to test-1199)
- doc_id: Source document identifier
- doc_type: Category of financial research
- question: Detailed financial reasoning question requiring multimodal analysis
The evidence field tracks which visual elements are relevant:
- table: Page numbers containing relevant tables
- image: General images (charts, graphs, etc.)
- plain_text: Text-only content
- generalized_text (layout): Layout-aware text extraction
- pie_chart, bar_chart, scatter_chart, line_chart: Specific chart types
- images: Array of image paths for each page of the source document
- texts: Path to extracted text content from the document
- pages_num: Total number of pages in the source document
- python_solution: Expert-written Python code with clear variable names and execution logic
- ground_truth: Numerical answer (typically float) derived from executing the solution
- source_id: Specific identifier linking to the source document
Question: "Assume the Information Technology sector maintains its past 5-year annualized sales growth rate as reported over the next three years. Project the sector's total annual sales revenue three years from the report date."
Required Skills:
- Document Understanding: Locate and extract relevant financial data from tables
- Numerical Reasoning: Apply growth rate calculations and projections
- Financial Knowledge: Understand P/S ratios, market cap, and sector analysis
- Multimodal Processing: Combine information from text and visual elements
Solution Approach:
- Extract IT sector market cap ($15,035.99 billion) and P/S ratio (5.15) from Table 11
- Calculate current annual sales: $15,035.99 ÷ 5.15 = $2,919.61 billion
- Apply 16.73% annual growth rate for 3 years: $2,919.61 × (1.1673)³
- Result: $4,643.79 billion
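The solution approach above can be checked with a short standalone script. This mirrors the python_solution field shown earlier, using plain Python exponentiation in place of numpy:

```python
# Values from Table 11 of document 0000, as used in the worked example
it_market_cap_billion_usd = 15035.99
it_ps_ratio = 5.15
it_sales_growth_rate = 0.1673  # 16.73%

# Current annual sales implied by market cap and P/S ratio
current_sales = it_market_cap_billion_usd / it_ps_ratio

# Compound the growth rate over three years
projected_sales = current_sales * (1 + it_sales_growth_rate) ** 3

answer = round(projected_sales, 2)
print(answer)  # 4643.79, matching the ground_truth
```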
The FinMMDocR dataset includes comprehensive preprocessing tools to handle multimodal document images with different resolution and quantity constraints.
The merge_image.py script manages image quantity through intelligent concatenation strategies:
Processing Strategy:
- ≤50 images: Direct copy without modification
- >50 images: Intelligent concatenation to reduce total image count
- Special handling: Different column layouts for specific document types
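The strategy can be illustrated with a small helper (a sketch; plan_concatenation is a hypothetical name, and the actual PIL tiling code is omitted):

```python
import math

def plan_concatenation(num_images, limit=50):
    """Return (k, concat_num): how many output images are produced and
    how many source pages are tiled into each, per the strategy above."""
    if num_images <= limit:
        return num_images, 1            # direct copy, one page per image
    k = math.ceil(num_images / limit)   # target image count
    concat_num = num_images // k        # images per concatenated result
    return k, concat_num

print(plan_concatenation(120))  # (3, 40): 120 pages -> 3 groups of 40
```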
Concatenation Parameters:
import math

# For documents with >50 images
k = math.ceil(len(image_list) / 50)  # Target image count
concat_num = len(image_list) // k    # Images per concatenated result
column_num = k                       # Column layout for concatenation
Output Structure:
data/
├── images/ # Original images (variable count)
└── images_50/ # Processed images (≤50 per document)
├── 0000/
│ ├── page_1.png
│ ├── page_2.png
│ └── ...
└── 0001/
├── page_1.png
└── ...
Usage:
python merge_image.py
The resize_image.py script handles multiple resolution requirements:
Supported Resolutions:
- Original: Maintains original image quality
- 3840px: High-resolution processing for detailed analysis
- 1920px: Standard resolution for balanced performance
Processing Logic:
from PIL import Image

def resize_to_1920(image_path, output_path=None):
    """
    Resize the longest side of the image to 1920px while maintaining aspect ratio
    """
    img = Image.open(image_path)
    width, height = img.size
    # Calculate scaling ratio from the longest side
    if width > height:
        scale = 1920 / width   # Width-based scaling
    else:
        scale = 1920 / height  # Height-based scaling
    new_width, new_height = int(width * scale), int(height * scale)
    # Apply high-quality resizing
    resized_img = img.resize((new_width, new_height), Image.LANCZOS)
    resized_img.save(output_path or image_path)
Resolution Variants:
data/
├── images_15/ # ≤15 images per document
│ ├── 0000/
│ └── 0001/
├── images_15_1920/ # 1920px resolution
│ ├── 0000/
│ └── 0001/
└── images_15_3840/ # 3840px resolution
├── 0000/
└── 0001/
Usage:
python resize_image.py
Complete Workflow:
1. Image Quantity Control:
   python merge_image.py    # Creates images_50/ with ≤50 images per document
2. Further Quantity Reduction (if needed):
   # Manual processing for ≤15 images; creates images_15/ directory
3. Resolution Processing:
   python resize_image.py   # Creates images_15_3840/ with high-resolution images
Configuration Options:
- Process Count: Control number of documents processed
- Parallel Processing: 32 concurrent processes for efficiency
- Memory Management: Automatic image cleanup to prevent memory overflow
- Error Handling: Comprehensive logging and retry mechanisms
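The parallel-processing setup can be sketched with the standard library (process_document here is a hypothetical placeholder for the per-document image work; the scripts themselves use 32 processes):

```python
from multiprocessing import Pool

def process_document(doc_id):
    # Placeholder for per-document work (merge/resize images, OCR, ...)
    return doc_id, "ok"

if __name__ == "__main__":
    doc_ids = [f"{i:04d}" for i in range(8)]
    # Scale the process count to your machine's cores and memory
    with Pool(processes=4) as pool:
        results = pool.map(process_document, doc_ids)
    print(results[0])  # ('0000', 'ok')
```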
We support inference with various LLM models through the inference.py script. The inference system supports both text and multimodal inputs with automatic retry mechanisms and rate limiting.
Direct Script Execution
python inference.py
Configuration Setup
Before running inference, you need to configure the following parameters in the Config class within inference.py:
1. Dataset Configuration:
   dataset_file: str = "path/to/your/dataset.json"  # Dataset file path
   outcome_dir: str = "path/to/output/directory"    # Root directory for saving output results
2. Model Configuration:
   # Choose one of the supported models:
   model_name: str = "qwen/qwen2.5-vl-32b-instruct"  # Currently active
   # model_name: str = "google/gemini-2.5-pro-preview-03-25"
   # model_name: str = "openai/gpt-4o"
   # model_name: str = "meta-llama/llama-4-maverick"
   # model_name: str = "anthropic/claude-3.7-sonnet"
   # And many more...
3. API Configuration:
   # OpenRouter (default)
   client: AsyncOpenAI = AsyncOpenAI(
       api_key="your_api_key",
       base_url="https://openrouter.ai/api/v1",
   )
   # Or use other providers: Alibaba DashScope, XAI Grok, Anthropic Claude
4. Processing Parameters:
   rpm: int = 100                       # Requests per minute limit
   max_no_improve_round_count: int = 3  # Maximum retry rounds for failed requests
   process_count: int = -1              # Number of items to process (-1 for all)
   max_input_images: int = 1000         # Maximum number of input images
Input Data Format
The script expects a JSON dataset with the following structure:
[
{
"question_id": "unique_id",
"question": "question text",
"system_input": "optional system prompt",
"user_input": "user question or instruction",
"images": ["path/to/image1.png", "path/to/image2.png"]
}
]
Output Structure
The inference process creates organized output directories:
output_dir/
├── dataset_name/
│ └── model_name/
│ └── timestamp_process_count_X/
│ ├── execution.log
│ ├── final_results.json
│ └── rounds_outcome/
│ ├── round_1.json
│ ├── round_2.json
│ └── ...
Key Features:
- Async Processing: Uses asyncio for efficient concurrent API calls
- Rate Limiting: Built-in rate limiting to respect API quotas
- Automatic Retries: Failed requests are automatically retried up to 3 rounds
- Multimodal Support: Handles both text and image inputs with base64 encoding
- Progress Tracking: Real-time progress bars and detailed logging
- Error Handling: Comprehensive error handling with detailed error messages
- Round-based Processing: Saves intermediate results after each round for recovery
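The rate-limiting idea can be sketched as follows (a minimal sketch with assumed names; the actual Config and inference loop carry more state):

```python
import asyncio
import time

class RateLimiter:
    """Space coroutine entries at least 60/rpm seconds apart."""
    def __init__(self, rpm):
        self.interval = 60.0 / rpm
        self._lock = asyncio.Lock()
        self._last = 0.0

    async def wait(self):
        async with self._lock:
            now = time.monotonic()
            delay = self._last + self.interval - now
            if delay > 0:
                await asyncio.sleep(delay)
            self._last = time.monotonic()

async def call_model(limiter, i):
    await limiter.wait()
    return i  # placeholder for an actual API call

async def main():
    limiter = RateLimiter(rpm=600)  # one request per 0.1 s
    return await asyncio.gather(*(call_model(limiter, i) for i in range(5)))

print(asyncio.run(main()))  # [0, 1, 2, 3, 4]
```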
The evaluation system supports both Chain-of-Thought (COT) and Program-of-Thought (POT) evaluation methods with comprehensive metrics and automated processing.
Individual Model Evaluation
python evaluation.py \
--prediction_path "outputs/test/raw_cot_outputs/model_output.json" \
--evaluation_output_dir "outputs/test/processed_cot_outputs" \
--prompt_type "cot" \
--ground_truth_file "data/test.json" \
--result_file "outputs/results/test_cot_results.json" \
--api_base "https://openrouter.ai/api/v1" \
--api_key "your_api_key"
Batch Evaluation Script
bash scripts/evaluate_all.sh
Evaluation Configuration
The evaluate_all.sh script supports the following configurations:
1. Prompt Types:
   - cot (Chain-of-Thought) - Currently active
   - pot (Program-of-Thought) - Commented out
2. Dataset Subsets:
   - test - Currently active
   - Can be extended to include train, validation
3. Model Outputs: The script includes a comprehensive list of supported model outputs:
   model_outputs=(
       "doubao_vision_rag_vidorag_0720.json"  # Currently active
       # "o4-mini-high.json"
       # "gpt4o.json"
       # "claude37_thinking.json"
       # "gemini.json"
       # "grok2.json"
       # "llama4.json"
       # "gemma3.json"
       # "mistral.json"
       # "qwen2_5vl.json"
       # And many more...
   )
Evaluation Methods
1. Chain-of-Thought (COT) Evaluation:
- Uses LLM-based answer extraction to parse numerical answers from model responses
- Supports automatic retry for failed extractions
- Calculates accuracy by comparing extracted answers with ground truth
- Provides execution rate metrics
2. Program-of-Thought (POT) Evaluation:
- Extracts executable Python code from model responses
- Executes code in isolated processes with timeout protection
- Compares execution results with ground truth
- Includes comprehensive error handling and security measures
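The isolated-execution idea behind POT evaluation can be sketched as follows (a simplified sketch; the project's evaluator adds further security measures such as disabling dangerous builtins):

```python
import multiprocessing

def _run(code, queue):
    env = {}
    try:
        exec(code, env)  # define solution() from the model's response
        queue.put(("ok", env["solution"]()))
    except Exception as exc:
        queue.put(("error", str(exc)))

def execute_with_timeout(code, timeout=10):
    """Run model-generated code in a separate process with a timeout."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run, args=(code, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # runaway code: kill and report timeout
        return ("timeout", None)
    return queue.get()

code = "def solution():\n    return round(15035.99 / 5.15, 2)"
print(execute_with_timeout(code))  # ('ok', 2919.61)
```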
Evaluation Metrics
The evaluation system provides the following metrics:
- Accuracy: Percentage of correct answers
- Execution Rate: Percentage of successfully processed responses
- Token Usage: Total completion tokens consumed
- Processing Statistics: Detailed breakdown by model and prompt type
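Numeric accuracy can be computed along these lines (a sketch; the relative-tolerance comparison is an assumption, and the repo's exact matching rules may differ):

```python
def accuracy(predictions, ground_truths, rel_tol=1e-2):
    """Fraction of predictions within a relative tolerance of the truth.
    `None` predictions count as failed extractions."""
    correct = 0
    for pred, truth in zip(predictions, ground_truths):
        if pred is None:
            continue
        denom = max(abs(truth), 1e-9)  # guard against division by zero
        if abs(pred - truth) / denom <= rel_tol:
            correct += 1
    return correct / len(ground_truths)

preds = [4643.79, 2919.60, None]
truths = [4643.79, 2919.61, 100.0]
print(accuracy(preds, truths))  # 2 of 3 within tolerance
```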
Output Structure
Evaluation results are organized as follows:
outputs/
├── test/
│ ├── raw_cot_outputs/ # Raw model outputs
│ ├── processed_cot_outputs/ # Processed evaluation results
│ └── raw_pot_outputs/ # Raw POT outputs
└── results/
    ├── test_cot_results.json        # Aggregated COT results
    └── test_pot_results.json        # Aggregated POT results
Key Features:
- Multi-process Safety: Code execution in isolated processes with timeout
- Security Measures: Disabled dangerous functions and system calls
- Automatic Retry: Failed answer extractions are automatically retried
- Comprehensive Logging: Detailed progress tracking and error reporting
- Flexible Configuration: Support for multiple models, prompt types, and datasets
- Batch Processing: Automated evaluation of multiple models simultaneously
The FinMMDocR project is organized into several key directories, each serving specific purposes in the multimodal document processing and evaluation pipeline.
FinMMDocR/
├── data/ # Dataset and preprocessing outputs
│ ├── images/ # Original document images (variable count)
│ ├── images_50/ # Processed images (≤50 per document)
│ ├── images_15/ # Further reduced images (≤15 per document)
│ ├── images_15_1920/ # 1920px resolution images
│ ├── images_15_3840/ # 3840px resolution images
│ ├── texts/ # OCR results for each document
│ └── test.json # Main dataset file (1,200 questions)
├── models/ # RAG embedding model parameters
├── outputs/ # Inference and evaluation results
│ ├── test/
│ │ ├── raw_cot_outputs/ # Raw Chain-of-Thought outputs
│ │ ├── processed_cot_outputs/ # Processed COT evaluation results
│ │ └── raw_pot_outputs/ # Raw Program-of-Thought outputs
│ └── results/ # Aggregated evaluation results
├── retrieved_results/ # RAG experiment results
├── scripts/ # Automation and evaluation scripts
└── utils/ # Utility functions and tools
- images/: Original document images with variable page counts
- images_50/: Images processed to ≤50 per document via concatenation
- images_15/: Further reduced to ≤15 images per document
- images_15_1920/: 1920px resolution variants for performance optimization
- images_15_3840/: 3840px resolution variants for high-quality analysis
- texts/: OCR results for each document, enabling text-based retrieval and analysis
- test.json: Main dataset containing 1,200 multimodal financial reasoning questions
- Purpose: Stores parameters for RAG (Retrieval-Augmented Generation) embedding models
- Usage: Enables semantic search and document retrieval for multimodal reasoning
- Content: Model weights, configurations, and embeddings for financial document understanding
- test/raw_cot_outputs/: Raw model responses for Chain-of-Thought evaluation
- test/processed_cot_outputs/: Processed and evaluated COT results
- test/raw_pot_outputs/: Raw model responses for Program-of-Thought evaluation
- results/: Aggregated evaluation metrics and comparative analysis
- Purpose: Stores results from RAG-based experiments
- Content: Retrieved document chunks, relevance scores, and retrieval-augmented responses
- Usage: Analysis of retrieval performance and multimodal document understanding
- evaluate_all.sh: Batch evaluation script for multiple models
- Purpose: Automated evaluation pipeline for comprehensive model comparison
- evaluation_utils.py: Evaluation metrics and answer extraction
- openai_utils.py: API integration and model communication
- bm25_utils.py: Text-based retrieval utilities
- BGE-M3.py: BGE-M3 embedding model integration
Preprocessing Pipeline:
Original Images → merge_image.py → images_50/ → resize_image.py → images_15_3840/
OCR Processing:
Document Images → OCR Processing → data/texts/ → RAG Retrieval
Experiment Pipeline:
Dataset → Inference → outputs/ → Evaluation → results/
RAG Models → Retrieved Results → retrieved_results/
The FinMMDocR project includes a comprehensive Retrieval-Augmented Generation (RAG) system that supports both parallel and serial processing for document retrieval experiments.
The RAG system enables semantic search and document retrieval for multimodal financial reasoning by:
- Text-based Retrieval: Using OCR results from data/texts/ for semantic search
- Multiple Embedding Models: Support for various embedding models (OpenAI, HuggingFace, BM25)
- Parallel/Serial Processing: Efficient processing options for different experimental needs
- Retrieval Results: Stored in retrieved_results/ for analysis
- Type: Traditional keyword-based retrieval
- Usage: Baseline retrieval method for comparison
- Implementation: utils/bm25_utils.py
- Features: TF-IDF based scoring with text preprocessing
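For reference, the BM25 scoring formula can be sketched in pure Python (a toy version with whitespace tokenization; utils/bm25_utils.py presumably applies proper tokenization and preprocessing):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score whitespace-tokenized docs against a query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()                      # document frequency per term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)                 # term frequency within this doc
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = ["market cap and sales growth", "weather report for today", "sales of the IT sector"]
scores = bm25_scores("sales growth", docs)
print(max(range(3), key=scores.__getitem__))  # 0: that doc matches both terms
```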
- Models: text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002
- Type: Semantic embedding-based retrieval
- Features: High-quality semantic understanding
- Usage: Advanced semantic search capabilities
- Models: contriever-msmarco and other custom models
- Type: Local embedding models
- Features: Offline processing, customizable
- Usage: Cost-effective semantic retrieval
python retrieve_parallel.py
Key Features:
- Multi-process Execution: 32 concurrent processes for high throughput
- Progress Tracking: Real-time progress bars with detailed logging
- Memory Optimization: Efficient batch processing and memory management
- Error Handling: Robust error recovery and retry mechanisms
Configuration:
# Model selection
model_name = "bm25" # or "text-embedding-3-small", "contriever-msmarco"
# Processing parameters
num_processes = 32 # Number of parallel processes
top_k = -1         # Retrieve all documents (-1) or a specific number
python retrieve_serial.py
Complete Workflow:
1. Load Dataset → 2. Process Queries → 3. Retrieve Documents → 4. Save Results
Detailed Process:
- Data Loading: Load questions from data/test.json
- Text Processing: Extract OCR results from data/texts/
- Embedding Generation: Create embeddings for queries and documents
- Similarity Search: Find top-k most relevant documents
- Result Storage: Save to retrieved_results/{model_name}.json
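The similarity-search step can be sketched with numpy (toy embeddings with assumed shapes; real ones come from the embedding models above):

```python
import numpy as np

def top_k_pages(query_emb, page_embs, k=2):
    """Rank pages by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    sims = p @ q                       # cosine similarity per page
    order = np.argsort(-sims)[:k]      # indices of the k best pages
    return [{"page": int(i) + 1, "score": float(sims[i])} for i in order]

rng = np.random.default_rng(0)
pages = rng.normal(size=(15, 8))               # 15 pages, 8-dim toy embeddings
query = pages[8] + 0.01 * rng.normal(size=8)   # query nearly identical to page 9
print(top_k_pages(query, pages)[0]["page"])  # 9
```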
Retrieval Results Structure:
[
{
"question_id": "test-0",
"question": "Financial reasoning question...",
"retrieved_results": [
{
"page": 1,
"score": 0.85
},
{
"page": 3,
"score": 0.72
}
]
}
]
Model Selection:
EMBEDDING_MODELS = {
"contriever-msmarco": {
"type": "hf",
"path": "model_path",
"projection_size": 768,
},
"text-embedding-3-small": {
"type": "openai",
"model": "text-embedding-3-small",
"projection_size": 1536,
},
"bm25": {
"type": "keyword",
"features": "TF-IDF based"
}
}
Processing Parameters:
- top_k: Number of documents to retrieve (-1 for all)
- batch_size: Embedding generation batch size
- max_tokens: Maximum tokens for text truncation
- n_subquantizers: Index compression parameters
Basic Retrieval:
# Parallel processing with BM25
python retrieve_parallel.py
# Serial processing with OpenAI embeddings
python retrieve_serial.py
Custom Configuration:
# Modify model selection in script
model_name = "text-embedding-3-large"
# Adjust processing parameters
num_processes = 16 # Reduce for memory constraints
top_k = 10         # Retrieve top 10 documents
Results Analysis:
import json

# Load retrieval results
with open("./retrieved_results/bm25.json", "r") as f:
    results = json.load(f)
# Analyze retrieval performance
for result in results:
print(f"Question: {result['question_id']}")
print(f"Retrieved {len(result['retrieved_results'])} documents")
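Retrieval quality against the evidence annotations can be measured along these lines (a sketch; field names follow the dataset structure shown earlier, and recall_at_k is a hypothetical helper):

```python
def recall_at_k(retrieved_results, evidence_pages, k=5):
    """Fraction of annotated evidence pages found among the top-k retrieved pages."""
    if not evidence_pages:
        return None  # no annotated evidence for this question
    top_pages = {r["page"] for r in retrieved_results[:k]}
    hits = sum(1 for p in evidence_pages if p in top_pages)
    return hits / len(evidence_pages)

retrieved = [{"page": 1, "score": 0.85}, {"page": 9, "score": 0.72}, {"page": 3, "score": 0.41}]
print(recall_at_k(retrieved, evidence_pages=[9], k=2))  # 1.0
```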