Unifies mask understanding and generation in a single linguistic framework, eliminating the need for separate complex decoders or pooling layers.
Represents any complex mask with just two discrete tokens, significantly reducing inference cost compared to polygon or RLE formats.
Since masks are discrete tokens, standard Reinforcement Learning (RL) can be directly applied to optimize pixel-wise generation.
Non-intrusive design allows any base MLLM (e.g., Qwen-VL) to acquire pixel-wise capabilities via simple next-token prediction training.
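The idea behind the two-token representation can be sketched with a toy vector-quantization example. Everything below is illustrative: the encoder, the codebook sizes, and the token names (`<mask_…>`) are assumptions for demonstration, not the actual SAMTok tokenizer. The point is only that an arbitrary binary mask collapses to a fixed pair of discrete indices, which an MLLM can then emit like ordinary words.

```python
import numpy as np

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 1024   # hypothetical codebook size
FEAT_DIM = 64          # hypothetical feature dimension
codebook_a = rng.normal(size=(CODEBOOK_SIZE, FEAT_DIM))
codebook_b = rng.normal(size=(CODEBOOK_SIZE, FEAT_DIM))

def encode_mask(mask: np.ndarray) -> np.ndarray:
    """Stand-in for a learned mask encoder: pools a binary mask into a
    fixed-size feature vector using crude hand-crafted statistics."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return np.zeros(FEAT_DIM)
    feats = np.array([
        len(ys) / (h * w),            # area
        ys.mean() / h, xs.mean() / w, # centroid
        ys.min() / h, ys.max() / h,   # vertical extent
        xs.min() / w, xs.max() / w,   # horizontal extent
    ])
    return np.pad(feats, (0, FEAT_DIM - len(feats)))

def quantize(vec: np.ndarray, codebook: np.ndarray) -> int:
    """Nearest-neighbour lookup: index of the closest codebook entry."""
    return int(np.argmin(np.linalg.norm(codebook - vec, axis=1)))

def mask_to_tokens(mask: np.ndarray) -> str:
    """Any mask, regardless of complexity, becomes two discrete 'words'."""
    vec = encode_mask(mask)
    ia, ib = quantize(vec, codebook_a), quantize(vec, codebook_b)
    return f"<mask_{ia}><mask_{ib}>"

# Example: a rectangular mask maps to a two-token string.
mask = np.zeros((32, 32), dtype=bool)
mask[8:20, 10:25] = True
print(mask_to_tokens(mask))
```

Because the output is just two vocabulary items, the same next-token prediction loss (and, downstream, standard RL over token sequences) applies to masks exactly as it does to text.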
@article{samtok,
  title={SAMTok: Representing Any Mask with Two Words},
  author={Zhou, Yikang and Zhang, Tao and Gong, Dengxian and Wu, Yuanzheng and Tian, Ye and Wang, Haochen and Yuan, Haobo and Wang, Jiacong and Qi, Lu and Wang, Anran and Wang, Zhuochen and Wang, Yujing and Chen, Cheng and Ji, Shunping and Li, Xiangtai},
  journal={arXiv preprint arXiv:2601.16093},
  year={2026}
}