Shubham Patle 1*
Sara Ghaboura 1*
Hania Tariq 2
Mohammad Usman Khan 3
Omkar Thawakar 1
Rao M. Anwer 1
Salman Khan 1,4
1Mohamed bin Zayed University of AI 2NUCES 3NUST 4Australian National University
🔥🔥 [04 Jan 2026] 🔥🔥 DuwatBench accepted to the EACL 2026 Main track.
🔥 [22 Jan 2026] DuwatBench, the open-source Arabic Calligraphy Benchmark for Multimodal Understanding, is released.
🤗 [23 Jan 2026] DuwatBench dataset available on HuggingFace.
DuwatBench is a comprehensive benchmark for evaluating large multimodal models (LMMs) on Arabic calligraphy recognition. Arabic calligraphy represents one of the richest visual traditions of the Arabic language, blending linguistic meaning with artistic form. DuwatBench addresses the gap in evaluating how well modern AI systems can process stylized Arabic text.
Figure 1. Left: Proportional breakdown of calligraphic styles in the DuwatBench dataset. Right: Proportional breakdown of textual categories, covering religious and non-religious themes.
- 1,272 curated samples spanning 6 classical and modern calligraphic styles
- Over 9.5k word instances with approximately 1,475 unique words spanning religious and cultural domains
- Bounding box annotations for detection-level evaluation
- Full text transcriptions with style and theme labels
- Complex artistic backgrounds preserving real-world visual complexity
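The per-sample fields above can be summarized directly from the released manifest. The sketch below assumes `data/duwatbench.json` (see the repository layout) is a JSON array of records with the fields shown in the manifest example; adjust the loader if the release uses a different layout.

```python
# Sketch: summarize the DuwatBench manifest by style and theme.
# Assumes data/duwatbench.json holds a JSON array of records with the
# "Style", "Category", and "total_words" fields from the manifest example.
import json
from collections import Counter
from pathlib import Path


def summarize(records):
    """Per-style and per-category sample counts plus total word instances."""
    styles = Counter(r["Style"] for r in records)
    categories = Counter(r["Category"] for r in records)
    total_words = sum(r["total_words"] for r in records)
    return styles, categories, total_words


def load_manifest(path="data/duwatbench.json"):
    """Read the JSON manifest from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Typical usage would be `styles, categories, total = summarize(load_manifest())`.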
The DuwatBench dataset follows a structured construction pipeline to ensure accuracy, completeness, and contextual richness across styles and categories.
Figure 2. End-to-end pipeline for constructing DuwatBench, from data collection and manual transcription with bounding boxes to multi-tier verification and style/theme aggregation.
| Style | Arabic | Description |
|---|---|---|
| Thuluth | الثلث | Ornate script used in mosque decorations |
| Diwani | الديواني | Flowing Ottoman court script |
| Naskh | النسخ | Standard readable script |
| Kufic | الكوفي | Geometric angular early Arabic script |
| Ruq'ah | الرقعة | Modern everyday handwriting |
| Nasta'liq | النستعليق | Persian-influenced flowing script |
- Python 3.10+
- CUDA-compatible GPU (recommended for open-source models)
```bash
# Clone the repository
git clone https://github.com/mbzuai-oryx/DuwatBench.git
cd DuwatBench

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt
```

For closed-source models, set your API keys:
```bash
# Option 1: Environment variables
export GEMINI_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"

# Option 2: Create config file
cp src/config/api_keys.example.py src/config/api_keys.py
# Edit api_keys.py with your keys
```

```bash
# Download from Hugging Face
huggingface-cli download MBZUAI/DuwatBench --local-dir ./data
```

```python
# Or use Python
from datasets import load_dataset

dataset = load_dataset("MBZUAI/DuwatBench")
```

Each sample in the JSON manifest contains:
```json
{
  "image_id": "images/2_129.jpg",
  "Style": "Thuluth",
  "Text": ["صَدَقَ اللَّهُ الْعَظِيمُ"],
  "word_count": [3],
  "total_words": 3,
  "bboxes": [[34, 336, 900, 312]],
  "Category": "quranic"
}
```

```bash
# Evaluate a single model
python src/evaluate.py --model gemini-2.5-flash --mode full_image

# Evaluate with bounding boxes
python src/evaluate.py --model gpt-4o-mini --mode with_bbox

# Evaluate both modes
python src/evaluate.py --model EasyOCR --mode both

# Resume interrupted evaluation
python src/evaluate.py --model claude-sonnet-4.5 --mode full_image --resume
```

| Metric | Description |
|---|---|
| CER | Character Error Rate - edit distance at character level |
| WER | Word Error Rate - edit distance at word level |
| chrF | Character n-gram F-score - partial match robustness |
| ExactMatch | Strict full-sequence accuracy |
| NLD | Normalized Levenshtein Distance - balanced error measure |
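The metric definitions above can be sketched in a few lines: CER and WER are edit distance at the character and word level normalized by reference length, and NLD divides the edit distance by the longer of the two strings. This is an illustrative sketch; the actual implementation lives in `src/metrics/evaluation_metrics.py` and may differ in detail.

```python
# Illustrative implementations of CER, WER, and NLD via edit distance.
# The repository's own metrics code may handle normalization differently.

def levenshtein(ref, hyp):
    """Edit distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: character edits per reference character."""
    return levenshtein(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word Error Rate: word edits per reference word."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)


def nld(ref, hyp):
    """Normalized Levenshtein Distance: edits over the longer string."""
    return levenshtein(ref, hyp) / max(len(ref), len(hyp), 1)
```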
Open-source models:

| Model | CER ↓ | WER ↓ | chrF ↑ | ExactMatch ↑ | NLD ↓ |
|---|---|---|---|---|---|
| MBZUAI/AIN* | 0.5494 | 0.6912 | 42.67 | 0.1895 | 0.5134 |
| Gemma-3-27B-IT | 0.5556 | 0.6591 | 51.53 | 0.2398 | 0.4741 |
| Qwen2.5-VL-72B | 0.5709 | 0.7039 | 43.98 | 0.1761 | 0.5298 |
| Qwen2.5-VL-7B | 0.6453 | 0.7768 | 36.97 | 0.1211 | 0.5984 |
| InternVL3-8B | 0.7588 | 0.8822 | 21.75 | 0.0574 | 0.7132 |
| EasyOCR | 0.8538 | 0.9895 | 12.30 | 0.0031 | 0.8163 |
| TrOCR-Arabic* | 0.9728 | 0.9998 | 1.79 | 0.0000 | 0.9632 |
| LLaVA-v1.6-Mistral-7B | 0.9932 | 0.9998 | 9.16 | 0.0000 | 0.9114 |
Closed-source models:

| Model | CER ↓ | WER ↓ | chrF ↑ | ExactMatch ↑ | NLD ↓ |
|---|---|---|---|---|---|
| Gemini-2.5-flash | 0.3700 | 0.4478 | 71.82 | 0.4167 | 0.3166 |
| Gemini-1.5-flash | 0.3933 | 0.5112 | 63.28 | 0.3522 | 0.3659 |
| GPT-4o | 0.4766 | 0.5692 | 56.85 | 0.3388 | 0.4245 |
| GPT-4o-mini | 0.6039 | 0.7077 | 42.67 | 0.2115 | 0.5351 |
| Claude-Sonnet-4.5 | 0.6494 | 0.7255 | 42.97 | 0.2225 | 0.5599 |
* Arabic-specific models
| Model | Kufic | Thuluth | Diwani | Naskh | Ruq'ah | Nasta'liq |
|---|---|---|---|---|---|---|
| Gemini-2.5-flash | 0.7067 | 0.3527 | 0.5698 | 0.4765 | 0.5817 | 0.5222 |
| Gemini-1.5-flash | 0.7212 | 0.4741 | 0.5783 | 0.4444 | 0.5445 | 0.5023 |
| GPT-4o | 0.8041 | 0.5540 | 0.6370 | 0.4189 | 0.5507 | 0.4434 |
| Gemma-3-27B-IT | 0.7802 | 0.6315 | 0.7326 | 0.5138 | 0.7571 | 0.6637 |
| MBZUAI/AIN | 0.7916 | 0.7036 | 0.7130 | 0.5367 | 0.6111 | 0.6916 |
- Gemini-2.5-flash achieves the best overall performance with 41.67% exact match accuracy
- Models perform best on Naskh and Ruq'ah (standardized strokes)
- Diwani and Thuluth (ornate scripts with dense ligatures) remain challenging
- Kufic records the weakest results, likely owing to its geometric rigidity
- Bounding box localization improves performance across most models
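The last finding corresponds to the `with_bbox` evaluation mode, where each annotated text region is cropped before recognition. The helper below assumes the manifest boxes are `[x, y, width, height]`, which matches the sample values but should be verified against the released annotations.

```python
# Sketch of the with_bbox setting: convert a manifest box to the
# (left, top, right, bottom) form used by image-cropping APIs.
# [x, y, width, height] is an assumption about the annotation format.

def bbox_to_xyxy(bbox):
    """Convert an [x, y, width, height] box to (left, top, right, bottom)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)

# With Pillow installed, cropping a region would look like:
#   from PIL import Image
#   region = Image.open("data/images/2_129.jpg").crop(bbox_to_xyxy([34, 336, 900, 312]))
```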
Figure 3. Qualitative results comparing open- and closed-source models on DuwatBench calligraphy samples.
```
DuwatBench/
├── README.md
├── requirements.txt
├── setup.py
├── LICENSE
├── CITATION.cff
├── data/
│   ├── images/                    # Calligraphy images
│   └── duwatbench.json            # Dataset manifest
├── src/
│   ├── evaluate.py                # Main evaluation script
│   ├── models/
│   │   └── model_wrapper.py       # Model implementations
│   ├── metrics/
│   │   └── evaluation_metrics.py  # CER, WER, chrF, etc.
│   ├── utils/
│   │   ├── data_loader.py         # Dataset loading
│   │   └── arabic_normalization.py
│   └── config/
│       ├── eval_config.py
│       └── api_keys.example.py
├── scripts/
│   ├── download_data.sh
│   └── run_all_evaluations.sh
└── results/                       # Evaluation outputs
```
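The `arabic_normalization.py` utility suggests that predictions and references are normalized before scoring. A common recipe for Arabic OCR evaluation is to strip diacritics (harakat) and unify alef variants; the minimal stdlib sketch below illustrates that idea, though the repository's exact rules may differ.

```python
# Minimal Arabic normalization sketch: drop harakat and unify alef forms.
# Illustrative only; the repo's arabic_normalization.py may apply more
# (or different) rules, e.g. for ya or ta-marbuta variants.

# Tanwin, fatha, damma, kasra, shadda, sukun, superscript alef
HARAKAT = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0670")

# Hamza-on-alef, hamza-under-alef, alef-madda, alef-wasla -> bare alef
ALEF_VARIANTS = {
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0622": "\u0627",
    "\u0671": "\u0627",
}


def normalize_arabic(text):
    """Remove diacritics and map alef variants to the bare alef."""
    return "".join(
        ALEF_VARIANTS.get(ch, ch) for ch in text if ch not in HARAKAT
    )
```

For example, `normalize_arabic("صَدَقَ")` yields the undiacritized form `"صدق"`, so a model prediction without harakat is not penalized for missing vowel marks.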
If you use the DuwatBench dataset in your research, please consider citing:
```bibtex
@misc{patle2026duwatbench,
      title={DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding},
      author={Shubham Patle and Sara Ghaboura and Hania Tariq and Mohammad Usman Khan and Omkar Thawakar and Rao Muhammad Anwer and Salman Khan},
      year={2026},
      eprint={2601.19898},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.19898},
}
```

This project is licensed under the Apache License 2.0; see the LICENSE file for details.
The dataset images are sourced from public digital archives and community repositories under their respective licenses.
- Digital archives: Library of Congress, NYPL Digital Collections
- Community repositories: Calligraphy Qalam, Free Islamic Calligraphy, Pinterest
- Annotation tool: MakeSense.ai
- Arabic NLP tools: CAMeL Tools
For questions or issues, please:
- Open an issue on GitHub
- Contact the authors at: {shubham.patle, sara.ghaboura, omkar.thawakar}@mbzuai.ac.ae