Building a concise yet powerful pixel-wise MLLM with strong scalability poses significant challenges:
We propose SAMTok, a discrete mask tokenizer that tokenizes masks into textual special tokens and detokenizes those tokens back into masks, turning masks into a new language that MLLMs can learn just like regular text data. As shown in Figure 1, SAMTok converts diverse masks into textual special tokens and accurately reconstructs the corresponding masks. With SAMTok, any MLLM can acquire powerful pixel-wise capabilities by treating masks as language data during supervised fine-tuning and reinforcement learning, without any additional architectural modifications or specialized loss design.
In summary, our contributions are three-fold:
SAMTok has an encoder \(f_{\text{enc}}\), a vector quantizer with codebook \(\mathcal{C}\), and a decoder \(f_{\text{dec}}\). Both \(f_{\text{enc}}\) and \(f_{\text{dec}}\) are instantiated with a SAM model, which includes an image backbone \(f_{\text{img}}\), a prompt encoder \(f_{\text{prm}}\), and a mask decoder \(f_{\text{msk}}\). Given an input image \(\mathcal{I}\) and a region \(\mathcal{M}\) (e.g., the area outlined in purple), the SAMTok encoder \(f_{\text{enc}}\) first encodes the 2D mask into a mask embedding \(\mathbf{z}\), then performs two-stage quantization to obtain discrete mask embeddings \([\mathbf{e}_1, \mathbf{e}_2]\). The SAMTok decoder \(f_{\text{dec}}\) reconstructs the 2D mask \(\hat{\mathcal{M}}\) from the original image and the region's discrete mask embeddings.
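The two-stage quantization step can be illustrated with a minimal sketch. Note the assumptions: the paper only states that two discrete embeddings \([\mathbf{e}_1, \mathbf{e}_2]\) are produced; the residual-VQ formulation below (quantize \(\mathbf{z}\), then quantize the residual against a shared codebook) is one plausible instantiation, and the codebook size and embedding dimension are illustrative, not the paper's values.

```python
import numpy as np

# Hedged sketch of SAMTok's two-stage quantization. Residual VQ with a
# shared codebook is an ASSUMPTION; the paper only specifies that a mask
# embedding z is mapped to two discrete embeddings [e_1, e_2].

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 16))  # illustrative: |C| = 256, dim = 16
codebook[0] = 0.0                       # include a zero entry so stage 2 never hurts

def quantize_two_stage(z, C):
    """Map an embedding z to two codes: nearest entry, then nearest to the residual."""
    i1 = int(np.argmin(np.linalg.norm(C - z, axis=1)))
    residual = z - C[i1]
    i2 = int(np.argmin(np.linalg.norm(C - residual, axis=1)))
    return (i1, i2), C[i1] + C[i2]      # discrete codes and quantized embedding

z = rng.normal(size=16)
codes, z_hat = quantize_two_stage(z, codebook)
err_one_stage = np.linalg.norm(z - codebook[codes[0]])
err_two_stage = np.linalg.norm(z - z_hat)
```

Because the zero entry is a fallback candidate for the residual, the two-stage reconstruction error is never worse than the single-stage error, which is the usual motivation for a second quantization stage.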
For the mask understanding task, SAMTok first tokenizes region masks into quantization codes, then formats them into mask words, which are used in the MLLM prompt to refer to the corresponding image regions. For the mask generation task, the MLLM first produces mask words according to the instruction, then maps these mask words to quantization codes, after which SAMTok reconstructs the 2D masks.
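The round trip between quantization codes and mask words can be sketched as follows. The concrete token format (`<mask_K>`) is an assumption for illustration; the paper does not specify the surface form of its mask words.

```python
import re

# Hedged sketch of the codes <-> mask-words round trip. The token format
# "<mask_K>" is an ASSUMPTION, not SAMTok's actual vocabulary.

def codes_to_mask_words(codes):
    """Understanding path: render quantization codes as mask words for the prompt."""
    return "".join(f"<mask_{c}>" for c in codes)

def mask_words_to_codes(text):
    """Generation path: recover quantization codes from mask words in MLLM output."""
    return [int(c) for c in re.findall(r"<mask_(\d+)>", text)]

words = codes_to_mask_words((17, 203))
prompt = f"Describe the object at {words}."
recovered = mask_words_to_codes(prompt)
```

In the understanding direction the mask words are embedded in the prompt to refer to a region; in the generation direction the MLLM emits them as ordinary text, and the parsed codes are handed to the SAMTok decoder to reconstruct the 2D mask.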