SAMTok: Representing Any Mask with Two Words

Yikang Zhou1,2, Tao Zhang1, Dengxian Gong1, Yuanzheng Wu1, Haochen Wang2, Ye Tian2,
Haobo Yuan1, Jiacong Wang2, Lu Qi1, Hao Fei3, Anran Wang2,
Zhuochen Wang2, Yujing Wang2, Cheng Chen2,
Shunping Ji1,✉️, Xiangtai Li2,✉️
1Wuhan University    2ByteDance    3National University of Singapore

Why SAMTok?

Mask as Language

Unifies mask understanding and generation into a single linguistic framework. No need for separate complex decoders or pooling layers.

Extreme Efficiency

Represents any complex mask with just two discrete tokens, significantly reducing inference cost compared to polygon or RLE formats.

RL-Ready

Since masks are discrete tokens, standard Reinforcement Learning (RL) can be directly applied to optimize pixel-wise generation.

Plug-and-Play

Non-intrusive design allows any base MLLM (like QwenVL) to acquire pixel-wise capabilities via simple next-token prediction training.

Workflow

Figure 1
We propose SAMTok, a discrete mask tokenizer that converts masks into textual special words and detokenizes those words back into masks, turning masks into a new language that MLLMs can learn like regular text. As shown in Figure 1, SAMTok converts diverse masks into textual special tokens and accurately reconstructs the corresponding masks. With SAMTok, any MLLM can acquire powerful pixel-wise capabilities by learning masks as language data through supervised fine-tuning and reinforcement learning, without any architectural modification or specialized loss design.

SAMTok Architecture

Tokenizer
Figure 2
SAMTok has an encoder \(f_{\text{enc}}\), a vector quantizer with codebook \(\mathcal{C}\), and a decoder \(f_{\text{dec}}\). Both \(f_{\text{enc}}\) and \(f_{\text{dec}}\) are instantiated with a SAM model, which includes an image backbone \(f_{\text{img}}\), a prompt encoder \(f_{\text{prm}}\), and a mask decoder \(f_{\text{msk}}\). Given an input image \(\mathcal{I}\) and a region \(\mathcal{M}\) (e.g., the area outlined in purple), the SAMTok encoder \(f_{\text{enc}}\) first encodes the 2D mask into a mask embedding \(\mathbf{z}\), then performs two-stage quantization to obtain discrete mask embeddings \([\mathbf{e}_1, \mathbf{e}_2]\). The SAMTok decoder \(f_{\text{dec}}\) reconstructs the 2D mask \(\hat{\mathcal{M}}\) from the original image and the region's discrete mask embeddings.
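The two-stage quantization above can be sketched as residual vector quantization: the first codebook codes the mask embedding, and the second codes the leftover residual, yielding exactly two discrete indices. This is a minimal NumPy sketch under that reading; the codebook sizes (256), embedding dimension (8), and `quantize` helper are illustrative assumptions, not SAMTok's actual configuration.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbor lookup: return the index and codeword closest to z."""
    dists = np.linalg.norm(codebook - z, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

def two_stage_quantize(z, codebook1, codebook2):
    """Two-stage (residual) quantization: code the embedding, then its residual.
    Returns two discrete codes and the reconstructed embedding e1 + e2."""
    i1, e1 = quantize(z, codebook1)
    i2, e2 = quantize(z - e1, codebook2)  # second stage codes the residual
    return (i1, i2), e1 + e2

rng = np.random.default_rng(0)
cb1 = rng.normal(size=(256, 8))  # hypothetical codebook sizes
cb2 = rng.normal(size=(256, 8))
z = rng.normal(size=8)           # stand-in for the encoder's mask embedding
codes, z_hat = two_stage_quantize(z, cb1, cb2)
```

The pair `codes` is what becomes the two mask words; the decoder would consume the summed codewords `z_hat` together with the image features to reconstruct the 2D mask.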
Figure 3
For the mask understanding task, SAMTok first tokenizes region masks into quantization codes, then formats them into mask words, which are used in the MLLM prompt to refer to the corresponding image regions. For the mask generation task, the MLLM first produces mask words according to the instruction, then maps these mask words to quantization codes, after which SAMTok reconstructs the 2D masks.
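The code-to-word mapping described above can be sketched as simple string formatting and parsing: each quantization code pair is rendered as two special tokens that the MLLM reads or emits inline with ordinary text. The token format `<mask_a_i><mask_b_j>` below is a hypothetical vocabulary for illustration; the paper's actual special-word scheme may differ.

```python
import re

def codes_to_mask_words(codes):
    """Format a (stage-1, stage-2) quantization code pair as two mask words."""
    i1, i2 = codes
    return f"<mask_a_{i1}><mask_b_{i2}>"

def mask_words_to_codes(text):
    """Recover the quantization code pair from mask words in MLLM output."""
    m = re.search(r"<mask_a_(\d+)><mask_b_(\d+)>", text)
    if m is None:
        raise ValueError("no mask words found in text")
    return int(m.group(1)), int(m.group(2))

words = codes_to_mask_words((17, 203))
# Understanding: embed the words in a prompt referring to a region.
prompt = f"Describe the object at {words}."
# Generation: parse the model's output back into codes for the detokenizer.
codes = mask_words_to_codes(prompt)
```

After parsing, the recovered codes would be looked up in the codebooks and passed to the SAMTok decoder to reconstruct the 2D mask.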

Main Results

Visualizations


Citation

@article{samtok,
  title={SAMTok: Representing Any Mask with Two Words},
  author={Zhou, Yikang and Zhang, Tao and Gong, Dengxian and Wu, Yuanzheng and Tian, Ye and Wang, Haochen and Yuan, Haobo and Wang, Jiacong and Qi, Lu and Wang, Anran and Wang, Zhuochen and Wang, Yujing and Chen, Cheng and Ji, Shunping and Li, Xiangtai},
  journal={arXiv preprint arXiv:2601.16093},
  year={2026}
}