
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning


Introduction

This is the official repository for the paper "See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning".

Motivation

To mitigate the perceptual bottleneck in VLMs, recent approaches often rely on external tools or explicit intermediate visual cues (e.g., generated masks, bounding boxes, or latent tokens) during inference. However, these paradigms face three critical limitations:

  • Shape Rigidity: Coarse boxes or masks fail to capture irregular, fine-grained evidence (e.g., thin polylines or specific intersections in charts).

  • Limited Generalization: Task-specific tools generalize poorly across diverse domains.

  • Inference Overhead: Multi-step visual reasoning increases computation cost and latency.

BiPS takes a different route. Instead of using visual cues as inference-time crutches, we transform them into training signals that internalize perception.

Method: Bi-directional Perceptual Shaping

BiPS shapes the model's internal policy through a two-stage curriculum using programmatically generated views via chart code editing:

  • Consistency Stage: Minimizes divergence between the original image and an Evidence-Preserving View, teaching the model to focus on complete, supporting visual details.

  • Separation Stage: Maximizes divergence from an Evidence-Ablated View, penalizing the model for relying on text-only shortcuts when visual evidence is missing.

By strictly enforcing these constraints during training, BiPS achieves fine-grained visual grounding without any additional inference cost. Across 8 benchmarks, it boosts Qwen2.5-VL-7B by an average of 8.2%, demonstrating strong cross-domain generalization.
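The two stages can be viewed as opposite uses of the same divergence between the model's predictions on the original image and on an edited view. The sketch below illustrates this intuition with plain token distributions; the function name `bips_loss`, the hinge-with-margin form of the separation term, and all shapes are illustrative assumptions, not the paper's exact objective.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_div(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bips_loss(logits_orig, logits_view, stage, margin=1.0):
    """Illustrative bi-directional shaping objective (not the official loss).

    stage="consistency": the view preserves the evidence, so the two
        distributions should agree -> minimize their divergence.
    stage="separation": the view has the evidence ablated, so agreeing
        with it means the model is using a text-only shortcut ->
        penalize small divergence until a margin is reached.
    """
    p = softmax(logits_orig)   # prediction on the original image
    q = softmax(logits_view)   # prediction on the edited view
    kl = kl_div(p, q)
    if stage == "consistency":
        return kl                      # pull the two distributions together
    if stage == "separation":
        return max(0.0, margin - kl)   # push them apart until the margin is met
    raise ValueError(f"unknown stage: {stage}")
```

For identical predictions, the consistency loss is zero (nothing to fix), while the separation loss is at its maximum (the model is ignoring the missing evidence), which is exactly the asymmetry the curriculum exploits.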

📝 Citation

If you find this work helpful in your research, please cite our paper:

@article{zhang2025bips,
  title={See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning},
  author={Zhang, Shuoshuo and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Yang, Yujiu and Wang, Rui},
  journal={arXiv preprint arXiv:2512.22120},
  year={2025}
}

Acknowledgments

This codebase is built upon Verl. We thank the authors for their outstanding contribution to the community.

About

[CVPR 2026] See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
