1. Vision as LoRA
- Integrates LoRA layers into the first N_vit blocks of a frozen LLM to stabilize training.
- Only 2B parameters are trainable (the LoRA layers plus a 6M-parameter visual embedding layer).
- LoRA weights merge seamlessly into the base LLM after training, adding no inference overhead.
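The LoRA mechanism above can be illustrated with a minimal sketch. This is not the paper's implementation; the toy dimensions, zero-initialized up-projection, and weight names are illustrative assumptions. It shows the key property from the notes: the low-rank update can be merged into the frozen weight after training, so the merged model computes the same output with no extra parameters at inference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # toy hidden size and LoRA rank (assumed values)

W = rng.normal(size=(d, d))           # frozen base LLM weight
A = rng.normal(size=(r, d)) * 0.01    # trainable LoRA down-projection
B = rng.normal(size=(d, r)) * 0.01    # trainable LoRA up-projection
x = rng.normal(size=(d,))

# During training: output of frozen path plus low-rank adapter path.
y_train = W @ x + B @ (A @ x)

# Post-training: fold the low-rank product into the base weight.
W_merged = W + B @ A
y_merged = W_merged @ x

# The merged weight reproduces the adapted forward pass exactly.
assert np.allclose(y_train, y_merged)
```

Because `B @ A` has the same shape as `W`, merging leaves the model architecture unchanged, which is what makes the post-training merge "seamless".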
2. Block-wise Distillation
- Aligns intermediate LLM block features with those of a ViT teacher via a cosine-similarity loss.
- Combines the distillation loss with the language-modeling loss to accelerate convergence.
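A sketch of the block-wise alignment objective, under assumptions: the function name, toy shapes, and the simple sum of the two losses are illustrative, not the paper's exact formulation. Each aligned block pair contributes a per-token `1 - cos(h_llm, h_vit)` term, averaged over blocks and tokens.

```python
import numpy as np

def cosine_distill_loss(llm_feats, vit_feats, eps=1e-8):
    """Average 1 - cosine_similarity between matched LLM-block and
    ViT-teacher features (one array of shape [tokens, dim] per block)."""
    losses = []
    for h_l, h_v in zip(llm_feats, vit_feats):
        num = (h_l * h_v).sum(axis=-1)
        denom = np.linalg.norm(h_l, axis=-1) * np.linalg.norm(h_v, axis=-1) + eps
        losses.append(1.0 - num / denom)
    return float(np.mean(losses))

rng = np.random.default_rng(0)
# Toy setup: 2 aligned blocks, 4 vision tokens, feature dim 8.
llm = [rng.normal(size=(4, 8)) for _ in range(2)]
vit = [h.copy() for h in llm]   # identical features -> near-zero loss

assert cosine_distill_loss(llm, vit) < 1e-6

# Combined objective (lm_loss is a hypothetical language-modeling loss).
lm_loss = 2.3
total_loss = lm_loss + cosine_distill_loss(llm, vit)
```

In practice the two losses would share one backward pass; the distillation term only supervises the blocks that carry vision tokens.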
3. Bi-directional Attention
- Replaces the causal mask with bi-directional attention over vision tokens.
- Improves average benchmark scores by 2.4 points.
- Maintains causal masking for text generation.
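The hybrid masking scheme can be sketched as a boolean mask builder. The layout assumption here, vision tokens as a prefix before the text tokens, is mine; the notes only state that vision tokens attend bi-directionally while text generation stays causal.

```python
import numpy as np

def hybrid_mask(n_vision, n_text):
    """Boolean attention mask (True = may attend). Vision tokens occupy
    the prefix and attend bi-directionally among themselves; text tokens
    remain causal and can see all vision tokens."""
    n = n_vision + n_text
    mask = np.tril(np.ones((n, n), dtype=bool))   # causal baseline
    mask[:n_vision, :n_vision] = True             # open up the vision block
    return mask

m = hybrid_mask(3, 2)
assert m[0, 2]          # vision token 0 sees later vision token 2
assert not m[3, 4]      # text token 3 still cannot see future text token 4
assert m[4, :3].all()   # text tokens attend to every vision token
```

Keeping the lower-triangular structure for the text rows is what preserves autoregressive generation; only the vision-vision sub-block is opened up.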