Zongyan Han1, Jiale Cao2, Shuo Chen3, Tong Wang1, Jorma Laaksonen4, Rao Muhammad Anwer1
1 Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), 2 Tianjin University, 3 Nanjing University, 4 Aalto University
[Paper]
Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks with simple forward inference, lacking explicit reasoning and interpretability. This makes it challenging for OVS models to distinguish similar categories in open-world settings, due to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step-by-step visual reasoning framework for open-vocabulary segmentation, named OpenSeg-R. The proposed OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image-specific reasoning for each image, forming structured triplets that explain the visual reasons for objects in a coarse-to-fine manner. Based on these reasoning steps, we compose detailed description prompts and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg-R is the first framework to introduce explicit step-by-step visual reasoning into OVS. Experimental results demonstrate that OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains across all metrics on open-vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning-guided framework in improving both segmentation precision and interpretability.
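As a rough illustration of the prompt-composition step described above, the sketch below flattens one coarse-to-fine reasoning triplet into a description prompt for the segmentor. The triplet fields and the template string are illustrative assumptions, not the exact format used by OpenSeg-R:

```python
# Illustrative sketch only: the triplet layout (category, generic reason,
# image-specific reason) and the template are assumptions for exposition,
# not the paper's actual prompt format.

def compose_prompt(category: str, generic_reason: str, image_reason: str) -> str:
    """Flatten one coarse-to-fine reasoning triplet into a text prompt."""
    return (
        f"a photo of a {category}, "
        f"which typically {generic_reason}; "
        f"in this image, {image_reason}"
    )

prompt = compose_prompt(
    "sheep",
    "has a woolly coat and four legs",
    "it is grazing on green grass near a fence",
)
print(prompt)
```

In the actual framework, such composed prompts replace the plain category names fed to the segmentor's text encoder.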
We utilize both MAFT+ and SED as open-vocabulary segmentors in our framework.
Here we provide the instructions for reproducing our results with MAFT+. We will release the code for SED in the future.
For OpenSeg-R w MAFT+, we follow all settings of MAFT+ as below:
- Clone the repository:
```shell
git clone https://github.com/Hanzy1996/OpenSeg-R.git
```
- Navigate to the project directory:
```shell
cd OpenSeg-R
```
- Install the dependencies:
```shell
bash install.sh
cd maft/modeling/pixel_decoder/ops
sh make.sh
```
First, download Qwen2.5-VL-72B-Instruct-AWQ and save it in ./llm with the following commands:
```shell
mkdir llm
cd llm
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct-AWQ
```
Next, generate the image-specific reasons and generic class reasoning using Qwen2.5-VL-72B-Instruct-AWQ, and save them in ./reason_data/image_reason and ./reason_data/generic_reason, respectively. We also recommend extracting the text features of the image-specific reasons in advance for faster evaluation; these features can be saved in ./reason_data/image_reason_feat. Here we provide the reasoning files and corresponding features for PC59, and we will release the reasoning results for the other datasets, along with the reasoning-generation code, in the future.
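The feature pre-extraction recommended above can be sketched as a simple disk-backed memoization of the text encoder. The helper below is a hypothetical illustration, not code from this repository; `encode_fn` stands in for the segmentor's text encoder (e.g. a CLIP ConvNeXt text tower), and the on-disk layout is an assumption:

```python
# Hypothetical sketch: cache text features of reasoning strings on disk so
# the text encoder runs at most once per string. Not the repository's code;
# `encode_fn` is a user-supplied stand-in for the real text encoder.
import hashlib
import os

import numpy as np


def cache_text_features(reasons, encode_fn, cache_dir="./reason_data/image_reason_feat"):
    """Encode each reasoning string once, memoizing the feature as a .npy file."""
    os.makedirs(cache_dir, exist_ok=True)
    feats = {}
    for text in reasons:
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        path = os.path.join(cache_dir, f"{key}.npy")
        if os.path.exists(path):
            feats[text] = np.load(path)  # cache hit: skip the encoder
        else:
            feats[text] = np.asarray(encode_fn(text))
            np.save(path, feats[text])   # cache miss: encode and persist
    return feats
```

At evaluation time, the cached features can then be loaded directly instead of re-running the text encoder for every image.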
Then, refer to MAFT+ for dataset preparation and organization. The data structure is similar to that of MAFT, with some modifications to accommodate the new dataset format and reasoning method.
```
reason_data/
  generic_reason/
  image_reason/
    ade150/
    ade846/
    pc59/
    pc459/
    voc20/
  image_reason_feat/
    ade150/
      Convnext-B/
      Convnext-L/
    ...
datasets/
  ade/
    ADEChallengeData2016/
      images/
      annotations_detectron2/
    ADE20K_2021_17_01/
      images/
      annotations_detectron2/
  coco/
    train2017/
    val2017/
    stuffthingmaps_detectron2/
  VOCdevkit/
    VOC2012/
      images_detectron2/
      annotations_ovs/
    VOC2010/
      images/
      annotations_detectron2_ovs/
      pc59_val/
      pc459_val/
```
- First, download the pre-trained models from MAFT+ and save them into ./pretrained. Then, evaluate OpenSeg-R with MAFT+ on the validation sets of the datasets using the following commands:
```shell
sh eval_reason_base.sh
sh eval_reason_large.sh
sh eval_reason_pano.sh
```
- The results of OpenSeg-R on the different datasets are shown below.
| Method | VLM | A-847 | PC-459 | A-150 | PC-59 | PAS-20 |
|---|---|---|---|---|---|---|
| OpenSeg-R w SED | ConvNeXt-B | 11.8 | 18.9 | 33.6 | 59.0 | 95.1 |
| OpenSeg-R w MAFT+ | ConvNeXt-B | 15.2 | 15.5 | 35.5 | 59.0 | 96.1 |
| OpenSeg-R w SED | ConvNeXt-L | 14.3 | 22.0 | 36.1 | 61.2 | 96.3 |
| OpenSeg-R w MAFT+ | ConvNeXt-L | 16.8 | 17.1 | 37.1 | 60.3 | 96.2 |
If this codebase is useful to you, please consider citing:
```bibtex
@article{han2025opensegr,
  title={OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning},
  author={Han, Zongyan and Cao, Jiale and Chen, Shuo and Wang, Tong and Laaksonen, Jorma and Anwer, Rao Muhammad},
  journal={arXiv preprint arXiv:2505.16974},
  year={2025}
}
```

