DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding [EACL 2026 (Main) 🔥]

Shubham Patle 1*   Sara Ghaboura 1*   Hania Tariq 2   Mohammad Usman Khan 3  
Omkar Thawakar 1   Rao M. Anwer 1  Salman Khan 1,4

1Mohamed bin Zayed University of AI    2NUCES    3NUST    4Australian National University

*Equal Contribution

If you like our project, please give us a star ⭐ on GitHub for the latest updates.



Latest Updates

🔥🔥 [04 Jan 2026] 🔥🔥 DuwatBench accepted to EACL 2026 Main track.
🔥 [22 Jan 2026] DuwatBench, the open-source Arabic Calligraphy Benchmark for Multimodal Understanding, is released.
🤗 [23 Jan 2026] DuwatBench dataset available on HuggingFace.




Overview

DuwatBench is a comprehensive benchmark for evaluating LMMs on Arabic calligraphy recognition. Arabic calligraphy represents one of the richest visual traditions of the Arabic language, blending linguistic meaning with artistic form. DuwatBench addresses the gap in evaluating how well modern AI systems can process stylized Arabic text.


Figure 1. Left: Proportional breakdown of calligraphic styles in the DuwatBench dataset. Right: Proportional breakdown of textual categories, covering religious and non-religious themes.



🌟 Key Features


  • 1,272 curated samples spanning 6 classical and modern calligraphic styles
  • Over 9.5k word instances with approximately 1,475 unique words spanning religious and cultural domains
  • Bounding box annotations for detection-level evaluation
  • Full text transcriptions with style and theme labels
  • Complex artistic backgrounds preserving real-world visual complexity


DuwatBench Creation Pipeline

The DuwatBench dataset is built through a structured pipeline that ensures accuracy, completeness, and contextual richness across styles and categories.


Figure 2. End-to-end pipeline for constructing DuwatBench, from data collection and manual transcription with bounding boxes to multi-tier verification and style/theme aggregation.


Calligraphic Styles

| Style | Arabic | Description |
|---|---|---|
| Thuluth | الثلث | Ornate script used in mosque decorations |
| Diwani | الديواني | Flowing Ottoman court script |
| Naskh | النسخ | Standard readable script |
| Kufic | الكوفي | Geometric, angular early Arabic script |
| Ruq'ah | الرقعة | Modern everyday handwriting |
| Nasta'liq | النستعليق | Persian-influenced flowing script |


🧐 DuwatBench Dataset Examples

Representative examples from the DuwatBench dataset across calligraphic styles and themes.


Installation

Requirements

  • Python 3.10+
  • CUDA-compatible GPU (recommended for open-source models)

Setup

```bash
# Clone the repository
git clone https://github.com/mbzuai-oryx/DuwatBench.git
cd DuwatBench

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt
```

API Keys Configuration

For closed-source models, set your API keys:

```bash
# Option 1: Environment variables
export GEMINI_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"

# Option 2: Create config file
cp src/config/api_keys.example.py src/config/api_keys.py
# Edit api_keys.py with your keys
```
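A minimal sketch of the precedence these two options imply: environment variables win, and the config file acts as a fallback. The `resolve_api_key` helper below is illustrative, not part of the repository.

```python
import os

def resolve_api_key(env_var, config_value=None):
    """Return the key from the environment if set, otherwise the fallback
    value (e.g. one read from src/config/api_keys.py)."""
    return os.environ.get(env_var) or config_value
```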


Dataset

Download

```bash
# Download from Hugging Face
huggingface-cli download MBZUAI/DuwatBench --local-dir ./data
```

```python
# Or use Python
from datasets import load_dataset
dataset = load_dataset("MBZUAI/DuwatBench")
```

Data Format

Each sample in the JSON manifest contains:

```json
{
  "image_id": "images/2_129.jpg",
  "Style": "Thuluth",
  "Text": ["صَدَقَ اللَّهُ الْعَظِيمُ"],
  "word_count": [3],
  "total_words": 3,
  "bboxes": [[34, 336, 900, 312]],
  "Category": "quranic"
}
```
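The fields above imply some useful invariants: one bbox and one per-region word count per text entry, and per-region counts summing to `total_words`. A minimal sanity-check sketch (the `validate_sample` helper is hypothetical, not part of the official tooling):

```python
import json

# The sample record from the data-format example above
sample = json.loads("""
{
  "image_id": "images/2_129.jpg",
  "Style": "Thuluth",
  "Text": ["صَدَقَ اللَّهُ الْعَظِيمُ"],
  "word_count": [3],
  "total_words": 3,
  "bboxes": [[34, 336, 900, 312]],
  "Category": "quranic"
}
""")

def validate_sample(s):
    # One bbox and one word count per text region
    if not (len(s["Text"]) == len(s["bboxes"]) == len(s["word_count"])):
        return False
    # Every bbox has four coordinates
    if any(len(b) != 4 for b in s["bboxes"]):
        return False
    # Per-region word counts must sum to the sample total
    return sum(s["word_count"]) == s["total_words"]
```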


Evaluation (Quick Start)

```bash
# Evaluate a single model
python src/evaluate.py --model gemini-2.5-flash --mode full_image

# Evaluate with bounding boxes
python src/evaluate.py --model gpt-4o-mini --mode with_bbox

# Evaluate both modes
python src/evaluate.py --model EasyOCR --mode both

# Resume interrupted evaluation
python src/evaluate.py --model claude-sonnet-4.5 --mode full_image --resume
```
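A plausible sketch of what `--resume` implies: sample ids already present in a results file are skipped on restart. The `pending_samples` helper and the JSON-lines results format are assumptions for illustration, not the repository's actual implementation.

```python
import json
import os

def pending_samples(all_ids, results_path):
    """Return ids not yet present in a JSON-lines results file, so an
    interrupted run can pick up where it left off."""
    done = set()
    if os.path.exists(results_path):
        with open(results_path) as f:
            for line in f:
                done.add(json.loads(line)["image_id"])
    return [i for i in all_ids if i not in done]
```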


🎯 Quantitative Evaluation and Results

Evaluation Metrics

| Metric | Description |
|---|---|
| CER | Character Error Rate: edit distance at the character level |
| WER | Word Error Rate: edit distance at the word level |
| chrF | Character n-gram F-score: robust to partial matches |
| ExactMatch | Strict full-sequence accuracy |
| NLD | Normalized Levenshtein Distance: balanced error measure |
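For intuition, CER and WER both reduce to normalized edit distance over different units (characters vs. whitespace-split words). An illustrative implementation; the repository's own metrics live in `src/metrics/evaluation_metrics.py` and may differ in normalization details.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over any two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: edit distance over reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word Error Rate: edit distance over whitespace-split tokens."""
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / max(len(r), 1)
```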

Open-Source Models

| Model | CER ↓ | WER ↓ | chrF ↑ | ExactMatch ↑ | NLD ↓ |
|---|---|---|---|---|---|
| MBZUAI/AIN* | 0.5494 | 0.6912 | 42.67 | 0.1895 | 0.5134 |
| Gemma-3-27B-IT | 0.5556 | 0.6591 | 51.53 | 0.2398 | 0.4741 |
| Qwen2.5-VL-72B | 0.5709 | 0.7039 | 43.98 | 0.1761 | 0.5298 |
| Qwen2.5-VL-7B | 0.6453 | 0.7768 | 36.97 | 0.1211 | 0.5984 |
| InternVL3-8B | 0.7588 | 0.8822 | 21.75 | 0.0574 | 0.7132 |
| EasyOCR | 0.8538 | 0.9895 | 12.30 | 0.0031 | 0.8163 |
| TrOCR-Arabic* | 0.9728 | 0.9998 | 1.79 | 0.0000 | 0.9632 |
| LLaVA-v1.6-Mistral-7B | 0.9932 | 0.9998 | 9.16 | 0.0000 | 0.9114 |

Closed-Source Models

| Model | CER ↓ | WER ↓ | chrF ↑ | ExactMatch ↑ | NLD ↓ |
|---|---|---|---|---|---|
| Gemini-2.5-flash | 0.3700 | 0.4478 | 71.82 | 0.4167 | 0.3166 |
| Gemini-1.5-flash | 0.3933 | 0.5112 | 63.28 | 0.3522 | 0.3659 |
| GPT-4o | 0.4766 | 0.5692 | 56.85 | 0.3388 | 0.4245 |
| GPT-4o-mini | 0.6039 | 0.7077 | 42.67 | 0.2115 | 0.5351 |
| Claude-Sonnet-4.5 | 0.6494 | 0.7255 | 42.97 | 0.2225 | 0.5599 |

* Arabic-specific models

Per-Style WER Performance (Full Image)

| Model | Kufic | Thuluth | Diwani | Naskh | Ruq'ah | Nasta'liq |
|---|---|---|---|---|---|---|
| Gemini-2.5-flash | 0.7067 | 0.3527 | 0.5698 | 0.4765 | 0.5817 | 0.5222 |
| Gemini-1.5-flash | 0.7212 | 0.4741 | 0.5783 | 0.4444 | 0.5445 | 0.5023 |
| GPT-4o | 0.8041 | 0.5540 | 0.6370 | 0.4189 | 0.5507 | 0.4434 |
| Gemma-3-27B-IT | 0.7802 | 0.6315 | 0.7326 | 0.5138 | 0.7571 | 0.6637 |
| MBZUAI/AIN | 0.7916 | 0.7036 | 0.7130 | 0.5367 | 0.6111 | 0.6916 |

Key Findings

  • Gemini-2.5-flash achieves the best overall performance with 41.67% exact match accuracy
  • Models perform best on Naskh and Ruq'ah (standardized strokes)
  • Diwani and Thuluth (ornate scripts with dense ligatures) remain challenging
  • Kufic records the lowest scores due to geometric rigidity
  • Bounding box localization improves performance across most models
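One plausible reading of the `with_bbox` mode is that each annotated region is cropped out before recognition, removing the decorative background. Assuming the `[x, y, width, height]` bbox format shown in the data-format example, converting to the `(left, upper, right, lower)` corner tuple that imaging libraries such as Pillow expect is a one-liner; this helper is illustrative, not the repository's code.

```python
def bbox_to_crop_box(bbox):
    """Convert an [x, y, width, height] annotation into the
    (left, upper, right, lower) tuple that e.g. PIL's Image.crop expects."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)
```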


Qualitative Evaluation and Results


Figure 3. Qualitative results comparing open- and closed-source models on DuwatBench calligraphy samples.


Project Structure

```text
DuwatBench/
├── README.md
├── requirements.txt
├── setup.py
├── LICENSE
├── CITATION.cff
├── data/
│   ├── images/                   # Calligraphy images
│   └── duwatbench.json           # Dataset manifest
├── src/
│   ├── evaluate.py               # Main evaluation script
│   ├── models/
│   │   └── model_wrapper.py      # Model implementations
│   ├── metrics/
│   │   └── evaluation_metrics.py # CER, WER, chrF, etc.
│   ├── utils/
│   │   ├── data_loader.py        # Dataset loading
│   │   └── arabic_normalization.py
│   └── config/
│       ├── eval_config.py
│       └── api_keys.example.py
├── scripts/
│   ├── download_data.sh
│   └── run_all_evaluations.sh
└── results/                      # Evaluation outputs
```


📚 Citation

If you use the DuwatBench dataset in your research, please consider citing:

```bibtex
@misc{patle2026duwatbench,
      title={DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding}, 
      author={Shubham Patle and Sara Ghaboura and Hania Tariq and Mohammad Usman Khan and Omkar Thawakar and Rao Muhammad Anwer and Salman Khan},
      year={2026},
      eprint={2601.19898},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.19898},
}
```


⚖️ License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

The dataset images are sourced from public digital archives and community repositories under their respective licenses.



Acknowledgments



Contact

For questions or issues, please:

  • Open an issue on GitHub
  • Contact the authors at: {shubham.patle, sara.ghaboura, omkar.thawakar}@mbzuai.ac.ae
