
Vision as LoRA

Han Wang1, Yongjie Ye1, Bingru Li1, Yuxiang Nie1,
Jinghui Lu1, Jingqun Tang1, Yanjie Wang1, Can Huang1

1ByteDance Inc., 2University of Birmingham

(Animation) Demonstration of how VoRA works.

(Figure) Comparison of VoRA with an encoder-based MLLM.

🔔 News

🔥[2025-04-04]: We are excited to release VoRA weights and data 😆.

Introduction

We propose Vision as LoRA (VoRA), a novel paradigm that converts an LLM into a Multimodal Large Language Model (MLLM) by integrating vision-specific LoRA layers. Unlike conventional encoder-based MLLMs, VoRA eliminates the dependency on external vision modules through three key innovations:

  • Vision-LoRA Integration: Trainable LoRA layers encode visual information directly into the frozen LLM's parameters
  • Block-wise Distillation: Transfers visual priors from a ViT teacher via layer-aligned feature alignment
  • Bi-directional Attention: Replaces causal masking with bi-directional attention for vision tokens, improving visual context modeling

VoRA achieves comparable performance to encoder-based MLLMs while reducing inference overhead through parameter merging.

Core Methodology

1. Vision as LoRA

  • Integrates LoRA layers into the first N_vit blocks of the frozen LLM to stabilize training.
  • Only 2B parameters are trainable (LoRA layers plus a 6M-parameter visual embedding layer).
  • LoRA parameters merge seamlessly into the LLM after training (see the sketch below).
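
The snippet below is a minimal PyTorch sketch of this idea, not the released implementation: it wraps the linear projections of the first N_vit blocks of a frozen LLM with LoRA adapters and folds them back into the base weights after training. The class and function names, rank, and scaling factor are illustrative assumptions.

```python
# Minimal sketch (not the official VoRA code): add LoRA adapters to the linear
# projections of the first N_vit transformer blocks of a frozen LLM, then
# merge them back into the base weights after training.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base                      # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.scale = alpha / rank
        # Low-rank factors: B @ A has the same shape as the frozen weight.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold the low-rank update into the frozen weight (no extra inference cost)."""
        self.base.weight += (self.B @ self.A) * self.scale
        return self.base


def add_vision_lora(blocks: nn.ModuleList, n_vit: int, rank: int = 16):
    """Wrap every nn.Linear in the first `n_vit` blocks with a LoRA adapter."""
    for block in blocks[:n_vit]:
        targets = [(n, m) for n, m in block.named_modules() if isinstance(m, nn.Linear)]
        for name, module in targets:
            parent = block.get_submodule(name.rsplit(".", 1)[0]) if "." in name else block
            setattr(parent, name.rsplit(".", 1)[-1], LoRALinear(module, rank))
```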

2. Block-wise Distillation

  • Aligns LLM block features with those of a ViT teacher via a cosine-similarity loss.
  • Combines the distillation and language-modeling losses to accelerate convergence (see the sketch below).
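
A hedged sketch of how such a combined objective could look; the block-to-block mapping, the projection from LLM width to ViT width, and the loss weight `lam` are assumptions for illustration rather than VoRA's exact recipe.

```python
# Sketch of the combined objective (details assumed): align hidden states of the
# LoRA-equipped LLM blocks with the corresponding ViT teacher blocks through a
# cosine-similarity loss, and add it to the usual language-modeling loss.
import torch.nn.functional as F


def blockwise_distill_loss(llm_feats, vit_feats, proj):
    """llm_feats / vit_feats: lists of [batch, num_image_tokens, dim] tensors, one per aligned block."""
    loss = 0.0
    for h_llm, h_vit in zip(llm_feats, vit_feats):
        h_llm = proj(h_llm)                              # map LLM width to ViT width (assumed)
        cos = F.cosine_similarity(h_llm, h_vit, dim=-1)  # [batch, num_image_tokens]
        loss = loss + (1.0 - cos).mean()                 # 1 - cosine similarity per block
    return loss / len(llm_feats)


def total_loss(lm_loss, llm_feats, vit_feats, proj, lam=1.0):
    # Combined training objective: language modeling + block-wise distillation (weight assumed).
    return lm_loss + lam * blockwise_distill_loss(llm_feats, vit_feats, proj)
```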

3. Bi-directional Attention

  • Replaces causal masking with bi-directional attention for vision tokens.
  • Improves the average benchmark score by 2.4 points.
  • Retains causal masking for text generation (see the sketch below).
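
The sketch below shows one way to build such a mask, assuming the vision tokens form a prefix of the sequence; the exact masking scheme in VoRA may differ.

```python
# Sketch: attention mask that is bi-directional over image tokens and causal over
# text tokens (assumes image tokens come first in the sequence).
import torch


def vora_style_attention_mask(num_image_tokens: int, num_text_tokens: int) -> torch.Tensor:
    """Returns a [T, T] boolean mask where True means attention is allowed."""
    T = num_image_tokens + num_text_tokens
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))   # standard causal mask
    # Image tokens attend to all other image tokens (bi-directional block).
    mask[:num_image_tokens, :num_image_tokens] = True
    return mask


# Example: 4 image tokens followed by 3 text tokens.
print(vora_style_attention_mask(4, 3).int())
```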

(Figure) Architecture of VoRA.

Experimental Results

Key Findings

  • Achieves an average score of 55.6 across 8 benchmarks (TQA, POPE, MME, etc.). When trained with a sufficient scale of additional data, VoRA matches conventional encoder-based MLLMs in performance while reducing computational cost, demonstrating that LLMs can acquire native multimodal capabilities without external vision models. This challenges the widely perceived necessity of encoder-based architectures for multimodal tasks.
  • Vision as LoRA stabilizes the vision internalization process.
  • Supports native-resolution inputs (VoRA-AnyRes variant).
  • Extensive ablations validate the effectiveness of each component.

Examples

BibTeX


      @article{wang2025vision,
        title={Vision as LoRA},
        author={Wang, Han and Ye, Yongjie and Li, Bingru and Nie, Yuxiang and Lu, Jinghui and Tang, Jingqun and Wang, Yanjie and Huang, Can},
        journal={arXiv preprint arXiv:2503.20680},
        year={2025}
      }