Motivation

Building a concise yet powerful pixel-wise MLLM with strong scalability faces significant challenges:

  • Mask inputs and mask outputs cannot be modeled in a unified manner: mask input understanding relies on complex region-level feature-pooling designs, while mask output depends on carefully designed segmentation decoders. Unified modeling can be achieved through alternative representations such as bounding boxes or points, but only at the cost of reduced precision and added ambiguity.
  • Current state-of-the-art pixel-wise MLLMs cannot directly and concisely apply reinforcement learning (RL) to mask generation tasks, as they use continuous embeddings to connect the MLLM with the segmentation head.
  • The specially designed modules added for mask understanding and generation typically require co-training with the MLLM, and their distinct training losses and forward pipelines add substantial complexity when scaling training with VQA and pure-text data.
  • Although some explorations attempt to circumvent these issues by treating masks as special images or representing them as text in formats such as RLE encoding or polygons, this typically incurs enormous inference costs, with a single mask represented by dozens or even hundreds of tokens.

How can we non-intrusively endow base MLLMs (such as the QwenVL series) with pixel-wise capabilities, making the learning process as simple as VQA training—requiring only next-token prediction loss for supervised fine-tuning (SFT) and straightforward reinforcement learning (RL)?

Contributions

Figure 1 (teaser): SAMTok converts diverse masks into textual special tokens and accurately reconstructs the corresponding masks.

We propose SAMTok, a discrete mask tokenizer that tokenizes masks into textual special words and detokenizes those special words back into masks, turning masks into a new language that MLLMs can learn like regular text data. As shown in Figure 1, SAMTok converts diverse masks into textual special tokens and accurately reconstructs the corresponding masks. Through SAMTok, any MLLM can acquire powerful pixel-wise capabilities by learning from mask data the same way it learns from language data, via supervised fine-tuning and reinforcement learning, without any additional architectural modifications or specialized loss design.

In summary, our contributions are three-fold:

  • We propose a novel paradigm for MLLMs to model masks as a new language, enabling them to learn mask understanding and generation capabilities just like natural language without requiring architecture modifications or additional loss design.
  • We propose SAMTok, which accurately achieves bidirectional conversion between masks and textual special tokens. Based on SAMTok, the QwenVL series of MLLMs acquires strong pixel-wise capabilities through the next-token prediction loss alone, achieving SOTA performance across dozens of diverse benchmarks.
  • We design a textual answer-matching reward function that enables MLLMs to perform reinforcement learning on mask generation tasks just as on natural language data, demonstrating significant performance improvements (see the sketch after this list).
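
Below is a minimal Python sketch of what such a textual answer-matching reward could look like, assuming the reward compares the mask words produced in a rollout against those in the reference answer. The function name, the `<mask_k>` token format, and the partial-credit rule are illustrative assumptions, not the paper's exact design.

```python
import re

# Hypothetical mask-word pattern; the actual special-token format is an assumption.
MASK_WORD = re.compile(r"<mask_\d+>")

def answer_matching_reward(rollout: str, reference: str) -> float:
    """Textual answer-matching reward (sketch): 1.0 for an exact match of the
    generated mask words, otherwise the fraction of reference mask words that
    are reproduced in the same positions."""
    pred = MASK_WORD.findall(rollout)
    gold = MASK_WORD.findall(reference)
    if pred == gold:
        return 1.0
    hits = sum(p == g for p, g in zip(pred, gold))
    return hits / max(len(gold), 1)
```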

SAMTok

Figure 2: The SAMTok tokenizer, consisting of a SAM-based encoder, a vector quantizer with a codebook, and a SAM-based decoder.

SAMTok has an encoder \(f_{\text{enc}}\), a vector quantizer with codebook \(\mathcal{C}\), and a decoder \(f_{\text{dec}}\). Both \(f_{\text{enc}}\) and \(f_{\text{dec}}\) are instantiated with a SAM model, which includes an image backbone \(f_{\text{img}}\), a prompt encoder \(f_{\text{prm}}\), and a mask decoder \(f_{\text{msk}}\). Given an input image \(\mathcal{I}\) and a region \(\mathcal{M}\) (e.g., the area outlined in purple), the SAMTok encoder \(f_{\text{enc}}\) first encodes the 2D mask into a mask embedding \(\mathbf{z}\); the vector quantizer then performs two-stage quantization to obtain discrete mask embeddings \([\mathbf{e}_1, \mathbf{e}_2]\). The SAMTok decoder \(f_{\text{dec}}\) reconstructs the 2D mask \(\hat{\mathcal{M}}\) from the original image and the region's discrete mask embeddings.
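
To make this flow concrete, here is a minimal PyTorch sketch of the encode, quantize, decode round trip. The SAM-based components are replaced with toy stand-in layers, the two-stage quantization is assumed to be residual nearest-neighbour lookup into a shared codebook, and all class and method names are hypothetical rather than SAMTok's actual implementation.

```python
import torch
import torch.nn as nn


class ResidualQuantizer(nn.Module):
    """Two-stage (assumed residual) nearest-neighbour lookup into codebook C."""

    def __init__(self, codebook_size: int = 8192, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)  # codebook C

    def quantize(self, z: torch.Tensor):
        # z: (B, dim) continuous mask embedding from the encoder
        ids, codes, residual = [], [], z
        for _ in range(2):                                      # two stages -> [e1, e2]
            dist = torch.cdist(residual, self.codebook.weight)  # (B, K)
            idx = dist.argmin(dim=-1)                           # (B,)
            e = self.codebook(idx)                              # (B, dim)
            ids.append(idx)
            codes.append(e)
            residual = residual - e
        return torch.stack(ids, dim=1), codes                   # (B, 2) ids, [e1, e2]


class SAMTokSketch(nn.Module):
    """Toy stand-in for SAMTok: f_img / f_enc / f_dec are placeholder convs."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.f_img = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # image backbone
        self.f_enc = nn.Conv2d(1, dim, kernel_size=16, stride=16)  # mask encoder
        self.quantizer = ResidualQuantizer(dim=dim)
        self.f_dec = nn.Conv2d(dim, 1, kernel_size=1)              # mask decoder

    def tokenize(self, mask: torch.Tensor) -> torch.Tensor:
        z = self.f_enc(mask).flatten(2).mean(-1)   # (B, dim) mask embedding z
        ids, _ = self.quantizer.quantize(z)        # (B, 2) discrete code ids
        return ids

    def detokenize(self, image: torch.Tensor, ids: torch.Tensor) -> torch.Tensor:
        e1, e2 = self.quantizer.codebook(ids).unbind(dim=1)  # [e1, e2]
        cond = (e1 + e2)[:, :, None, None]                   # broadcast mask codes
        feats = self.f_img(image)                             # image features
        return torch.sigmoid(self.f_dec(feats + cond))        # reconstructed mask


# Round trip on dummy data: a 1024x1024 image and a binary region mask.
tok = SAMTokSketch()
image = torch.randn(1, 3, 1024, 1024)
mask = (torch.randn(1, 1, 1024, 1024) > 0).float()
ids = tok.tokenize(mask)            # two discrete codes per mask
recon = tok.detokenize(image, ids)  # low-resolution mask probabilities (sketch only)
```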

Figure 3: Using SAMTok with an MLLM for mask understanding and mask generation.

For the mask understanding task, SAMTok first tokenizes region masks into quantization codes, then formats them into mask words, which are used in the MLLM prompt to refer to the corresponding image regions. For the mask generation task, the MLLM first produces mask words according to the instruction, then maps these mask words to quantization codes, after which SAMTok reconstructs the 2D masks.
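
The snippet below sketches how this round trip might look in practice, assuming quantization codes are rendered as `<mask_k>` special tokens; the token naming and the two-codes-per-mask grouping are assumptions following the two-stage quantizer above, not the exact prompt format.

```python
import re

def codes_to_mask_words(codes: list[int]) -> str:
    """Format SAMTok quantization codes as textual mask words for the prompt."""
    return "".join(f"<mask_{c}>" for c in codes)

def mask_words_to_codes(text: str) -> list[list[int]]:
    """Recover quantization codes from the MLLM's generated text, grouping
    consecutive mask words into one mask (two codes per mask here, matching
    the two-stage quantization)."""
    ids = [int(c) for c in re.findall(r"<mask_(\d+)>", text)]
    return [ids[i:i + 2] for i in range(0, len(ids), 2)]

# Understanding: refer to an image region inside the prompt.
prompt = f"What is the object at {codes_to_mask_words([123, 456])} doing?"

# Generation: parse mask words out of the model's answer, then let SAMTok
# detokenize each code pair back into a 2D mask.
answer = "The dog <mask_87><mask_901> is chasing the ball <mask_12><mask_345>."
print(mask_words_to_codes(answer))  # [[87, 901], [12, 345]]
```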

Experiments

Visualizations

Citation

Please cite our paper if you find this project helpful.