Shubham Patle 1*
Sara Ghaboura 1*
Hania Tariq 2
Mohammad Usman Khan 3
Omkar Thawakar 1
Rao M. Anwer 1
Salman Khan 1,4
1Mohamed bin Zayed University of AI 2NUCES 3NUST 4Australian National University
🔥🔥 [04 Jan 2026] 🔥🔥 DuwatBench accepted to the EACL 2026 Main track.
🔥 [22 Jan 2026] DuwatBench, the open-source Arabic Calligraphy Benchmark for Multimodal Understanding, is released.
🤗 [23 Jan 2026] DuwatBench dataset available on HuggingFace.
DuwatBench is a comprehensive benchmark for evaluating large multimodal models (LMMs) on Arabic calligraphy recognition. Arabic calligraphy represents one of the richest visual traditions of the Arabic language, blending linguistic meaning with artistic form. DuwatBench addresses the gap in evaluating how well modern AI systems can process stylized Arabic text.
Figure 1. Left: Proportional breakdown of calligraphic styles in the DuwatBench dataset. Right: Proportional breakdown of textual categories, covering religious and non-religious themes.
- 1,272 curated samples spanning 6 classical and modern calligraphic styles
- Over 9.5k word instances with approximately 1,475 unique words spanning religious and cultural domains
- Bounding box annotations for detection-level evaluation
- Full text transcriptions with style and theme labels
- Complex artistic backgrounds preserving real-world visual complexity
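The per-sample fields above can be summarized directly from the released manifest. The sketch below assumes `data/duwatbench.json` (see the repository layout) is a JSON array of records with the fields shown in the manifest example; adjust the loader if the release uses a different layout.

```python
# Sketch: summarize the DuwatBench manifest by style and theme.
# Assumes data/duwatbench.json holds a JSON array of records with the
# "Style", "Category", and "total_words" fields from the manifest example.
import json
from collections import Counter
from pathlib import Path


def summarize(records):
    """Per-style and per-category sample counts plus total word instances."""
    styles = Counter(r["Style"] for r in records)
    categories = Counter(r["Category"] for r in records)
    total_words = sum(r["total_words"] for r in records)
    return styles, categories, total_words


def load_manifest(path="data/duwatbench.json"):
    """Read the JSON manifest from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Typical usage would be `styles, categories, total = summarize(load_manifest())`.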
The DuwatBench dataset follows a structured construction pipeline to ensure accuracy, completeness, and contextual richness across styles and categories.
Figure 2. End-to-end pipeline for constructing DuwatBench, from data collection and manual transcription with bounding boxes to multi-tier verification and style/theme aggregation.
| Style | Arabic | Description |
|---|---|---|
| Thuluth | الثلث | Ornate script used in mosque decorations |
| Diwani | الديواني | Flowing Ottoman court script |
| Naskh | النسخ | Standard readable script |
| Kufic | الكوفي | Geometric angular early Arabic script |
| Ruq'ah | الرقعة | Modern everyday handwriting |
| Nasta'liq | النستعليق | Persian-influenced flowing script |
- Python 3.10+
- CUDA-compatible GPU (recommended for open-source models)
```bash
# Clone the repository
git clone https://github.com/mbzuai-oryx/DuwatBench.git
cd DuwatBench

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt
```

For closed-source models, set your API keys:
```bash
# Option 1: Environment variables
export GEMINI_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"

# Option 2: Create config file
cp src/config/api_keys.example.py src/config/api_keys.py
# Edit api_keys.py with your keys
```

```bash
# Download from Hugging Face
huggingface-cli download MBZUAI/DuwatBench --local-dir ./data
```

```python
# Or use Python
from datasets import load_dataset

dataset = load_dataset("MBZUAI/DuwatBench")
```

Each sample in the JSON manifest contains:
```json
{
  "image_id": "images/2_129.jpg",
  "Style": "Thuluth",
  "Text": ["صَدَقَ اللَّهُ الْعَظِيمُ"],
  "word_count": [3],
  "total_words": 3,
  "bboxes": [[34, 336, 900, 312]],
  "Category": "quranic"
}
```

```bash
# Evaluate a single model
python src/evaluate.py --model gemini-2.5-flash --mode full_image

# Evaluate with bounding boxes
python src/evaluate.py --model gpt-4o-mini --mode with_bbox

# Evaluate both modes
python src/evaluate.py --model EasyOCR --mode both

# Resume interrupted evaluation
python src/evaluate.py --model claude-sonnet-4.5 --mode full_image --resume
```

| Metric | Description |
|---|---|
| CER | Character Error Rate - edit distance at character level |
| WER | Word Error Rate - edit distance at word level |
| chrF | Character n-gram F-score - partial match robustness |
| ExactMatch | Strict full-sequence accuracy |
| NLD | Normalized Levenshtein Distance - balanced error measure |
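The metric definitions above can be sketched in a few lines: CER and WER are edit distance at the character and word level normalized by reference length, and NLD divides the edit distance by the longer of the two strings. This is an illustrative sketch; the actual implementation lives in `src/metrics/evaluation_metrics.py` and may differ in detail.

```python
# Illustrative implementations of CER, WER, and NLD via edit distance.
# The repository's own metrics code may handle normalization differently.

def levenshtein(ref, hyp):
    """Edit distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: character edits per reference character."""
    return levenshtein(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word Error Rate: word edits per reference word."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)


def nld(ref, hyp):
    """Normalized Levenshtein Distance: edits over the longer string."""
    return levenshtein(ref, hyp) / max(len(ref), len(hyp), 1)
```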
Open-source models:

| Model | CER ↓ | WER ↓ | chrF ↑ | ExactMatch ↑ | NLD ↓ |
|---|---|---|---|---|---|
| MBZUAI/AIN* | 0.5494 | 0.6912 | 42.67 | 0.1895 | 0.5134 |
| Gemma-3-27B-IT | 0.5556 | 0.6591 | 51.53 | 0.2398 | 0.4741 |
| Qwen2.5-VL-72B | 0.5709 | 0.7039 | 43.98 | 0.1761 | 0.5298 |
| Qwen2.5-VL-7B | 0.6453 | 0.7768 | 36.97 | 0.1211 | 0.5984 |
| InternVL3-8B | 0.7588 | 0.8822 | 21.75 | 0.0574 | 0.7132 |
| EasyOCR | 0.8538 | 0.9895 | 12.30 | 0.0031 | 0.8163 |
| TrOCR-Arabic* | 0.9728 | 0.9998 | 1.79 | 0.0000 | 0.9632 |
| LLaVA-v1.6-Mistral-7B | 0.9932 | 0.9998 | 9.16 | 0.0000 | 0.9114 |
Closed-source models:

| Model | CER ↓ | WER ↓ | chrF ↑ | ExactMatch ↑ | NLD ↓ |
|---|---|---|---|---|---|
| Gemini-2.5-flash | 0.3700 | 0.4478 | 71.82 | 0.4167 | 0.3166 |
| Gemini-1.5-flash | 0.3933 | 0.5112 | 63.28 | 0.3522 | 0.3659 |
| GPT-4o | 0.4766 | 0.5692 | 56.85 | 0.3388 | 0.4245 |
| GPT-4o-mini | 0.6039 | 0.7077 | 42.67 | 0.2115 | 0.5351 |
| Claude-Sonnet-4.5 | 0.6494 | 0.7255 | 42.97 | 0.2225 | 0.5599 |
* Arabic-specific models
| Model | Kufic | Thuluth | Diwani | Naskh | Ruq'ah | Nasta'liq |
|---|---|---|---|---|---|---|
| Gemini-2.5-flash | 0.7067 | 0.3527 | 0.5698 | 0.4765 | 0.5817 | 0.5222 |
| Gemini-1.5-flash | 0.7212 | 0.4741 | 0.5783 | 0.4444 | 0.5445 | 0.5023 |
| GPT-4o | 0.8041 | 0.5540 | 0.6370 | 0.4189 | 0.5507 | 0.4434 |
| Gemma-3-27B-IT | 0.7802 | 0.6315 | 0.7326 | 0.5138 | 0.7571 | 0.6637 |
| MBZUAI/AIN | 0.7916 | 0.7036 | 0.7130 | 0.5367 | 0.6111 | 0.6916 |
- Gemini-2.5-flash achieves the best overall performance with 41.67% exact match accuracy
- Models perform best on Naskh and Ruq'ah (standardized strokes)
- Diwani and Thuluth (ornate scripts with dense ligatures) remain challenging
- Kufic records the weakest results, likely owing to its geometric rigidity
- Bounding box localization improves performance across most models
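The last finding corresponds to the `with_bbox` evaluation mode, where each annotated text region is cropped before recognition. The helper below assumes the manifest boxes are `[x, y, width, height]`, which matches the sample values but should be verified against the released annotations.

```python
# Sketch of the with_bbox setting: convert a manifest box to the
# (left, top, right, bottom) form used by image-cropping APIs.
# [x, y, width, height] is an assumption about the annotation format.

def bbox_to_xyxy(bbox):
    """Convert an [x, y, width, height] box to (left, top, right, bottom)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)

# With Pillow installed, cropping a region would look like:
#   from PIL import Image
#   region = Image.open("data/images/2_129.jpg").crop(bbox_to_xyxy([34, 336, 900, 312]))
```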
Figure 3. Qualitative results comparing open- and closed-source models on DuwatBench calligraphy samples.
```
DuwatBench/
├── README.md
├── requirements.txt
├── setup.py
├── LICENSE
├── CITATION.cff
├── data/
│   ├── images/                    # Calligraphy images
│   └── duwatbench.json            # Dataset manifest
├── src/
│   ├── evaluate.py                # Main evaluation script
│   ├── models/
│   │   └── model_wrapper.py       # Model implementations
│   ├── metrics/
│   │   └── evaluation_metrics.py  # CER, WER, chrF, etc.
│   ├── utils/
│   │   ├── data_loader.py         # Dataset loading
│   │   └── arabic_normalization.py
│   └── config/
│       ├── eval_config.py
│       └── api_keys.example.py
├── scripts/
│   ├── download_data.sh
│   └── run_all_evaluations.sh
└── results/                       # Evaluation outputs
```
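The `arabic_normalization.py` utility suggests that predictions and references are normalized before scoring. A common recipe for Arabic OCR evaluation is to strip diacritics (harakat) and unify alef variants; the minimal stdlib sketch below illustrates that idea, though the repository's exact rules may differ.

```python
# Minimal Arabic normalization sketch: drop harakat and unify alef forms.
# Illustrative only; the repo's arabic_normalization.py may apply more
# (or different) rules, e.g. for ya or ta-marbuta variants.

# Tanwin, fatha, damma, kasra, shadda, sukun, superscript alef
HARAKAT = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0670")

# Hamza-on-alef, hamza-under-alef, alef-madda, alef-wasla -> bare alef
ALEF_VARIANTS = {
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0622": "\u0627",
    "\u0671": "\u0627",
}


def normalize_arabic(text):
    """Remove diacritics and map alef variants to the bare alef."""
    return "".join(
        ALEF_VARIANTS.get(ch, ch) for ch in text if ch not in HARAKAT
    )
```

For example, `normalize_arabic("صَدَقَ")` yields the undiacritized form `"صدق"`, so a model prediction without harakat is not penalized for missing vowel marks.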
If you use the DuwatBench dataset in your research, please consider citing:
```bibtex
@misc{patle2026duwatbench,
      title={DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding},
      author={Shubham Patle and Sara Ghaboura and Hania Tariq and Mohammad Usman Khan and Omkar Thawakar and Rao Muhammad Anwer and Salman Khan},
      year={2026},
      eprint={2601.19898},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.19898},
}
```

This project is licensed under the Apache License 2.0; see the LICENSE file for details.
The dataset images are sourced from public digital archives and community repositories under their respective licenses.
- Digital archives: Library of Congress, NYPL Digital Collections
- Community repositories: Calligraphy Qalam, Free Islamic Calligraphy, Pinterest
- Annotation tool: MakeSense.ai
- Arabic NLP tools: CAMeL Tools
For questions or issues, please:
- Open an issue on GitHub
- Contact the authors at: {shubham.patle, sara.ghaboura, omkar.thawakar}@mbzuai.ac.ae