OneCAT: Unified Decoder-Only Auto-Regressive Model

Han Li*, Xinyu Peng*, Yaoming Wang†, Zelin Peng, Xin Chen,
Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, Hongkai Xiong
*Equal contribution, †Project lead

We introduce OneCAT, an open-source unified multimodal model that integrates understanding, generation, and editing within a pure decoder-only transformer architecture. The framework needs no external components such as Vision Transformers (ViT) or visual tokenizers during inference, which yields significant efficiency gains alongside strong results on the understanding, generation, and editing benchmarks reported below.

Overview

Pure Decoder-Only Design

Eliminates external vision encoders and VAE tokenizers during inference, using only a lightweight patch embedding layer for raw image processing.
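
As a rough illustration of what a lightweight patch embedding layer does, the sketch below splits a raw image into non-overlapping patches and applies one linear projection per patch. It is a toy under stated assumptions, not OneCAT's implementation: the patch size, the single-channel image, and the `weight` matrix are hypothetical values chosen so the example stays small.

```python
from typing import List

Matrix = List[List[float]]


def patchify(image: Matrix, patch: int) -> Matrix:
    """Split an H x W image into flattened, non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one tile in row-major order.
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def embed(patches: Matrix, weight: Matrix) -> Matrix:
    """Project each flattened patch with a linear map (one weight row per output feature)."""
    return [[sum(w * p for w, p in zip(row, patch_vec)) for row in weight]
            for patch_vec in patches]


# Hypothetical 4 x 4 single-channel image and a 1-dimensional projection.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens = embed(patchify(image, patch=2), weight=[[1.0, 1.0, 1.0, 1.0]])
print(len(tokens), tokens)
```

Each resulting vector plays the role of one visual token fed directly to the decoder, which is why no separate vision encoder is needed at inference time.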

Mixture-of-Experts (MoE)

Three specialized FFN experts: Text FFN for language comprehension, Understanding FFN for visual tokens, and Generation FFN for image synthesis.
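
A minimal sketch of how three modality-specific experts could be selected, assuming each token carries a tag for its type and is routed to exactly one expert. The hard routing rule, the tag names, and the toy `make_toy_ffn` experts are assumptions made for illustration; the real FFNs are learned layers.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def make_toy_ffn(scale: float) -> Callable[[Vector], Vector]:
    """Stand-in for an FFN expert: multiplies every feature by a constant (hypothetical)."""
    def ffn(x: Vector) -> Vector:
        return [scale * v for v in x]
    return ffn


# One expert per token type, mirroring the Text / Understanding / Generation split.
EXPERTS: Dict[str, Callable[[Vector], Vector]] = {
    "text": make_toy_ffn(1.0),
    "understanding": make_toy_ffn(2.0),
    "generation": make_toy_ffn(3.0),
}


def route_tokens(tokens: List[Tuple[str, Vector]]) -> List[Vector]:
    """Send each (kind, vector) token through the expert that matches its kind."""
    return [EXPERTS[kind](vec) for kind, vec in tokens]


sequence = [("text", [1.0]), ("understanding", [1.0]), ("generation", [1.0])]
print(route_tokens(sequence))
```

Under this assumption only one expert runs per token, so the activated parameter count per token stays close to that of a dense model of the same width.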

Multi-Scale Autoregressive

Pioneers a Next-Scale Prediction paradigm that generates images coarse-to-fine, greatly reducing the number of generation steps compared with diffusion models.
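
To make the step count concrete, the sketch below compares the number of forward passes needed when a whole scale is predicted per pass with the number needed when tokens are predicted one at a time. The scale schedule `SIDES` is a hypothetical coarse-to-fine schedule chosen for illustration; the schedule OneCAT actually uses is not stated here.

```python
from typing import List

# Hypothetical side lengths of the token maps, from coarsest to finest.
SIDES: List[int] = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]


def next_scale_passes(sides: List[int]) -> int:
    """One forward pass emits every token of one scale."""
    return len(sides)


def tokens_across_scales(sides: List[int]) -> int:
    """Total number of tokens produced over all scales."""
    return sum(side * side for side in sides)


def next_token_passes(final_side: int) -> int:
    """Raster-order next-token decoding needs one pass per token of the final map."""
    return final_side * final_side


print(next_scale_passes(SIDES), tokens_across_scales(SIDES), next_token_passes(SIDES[-1]))
```

With this illustrative schedule, 10 passes produce 680 tokens in total, whereas token-by-token decoding of the final 16 x 16 map alone would take 256 passes.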

OneCAT unified architecture overview.

Example Showcase

Chat & Visual Question Answering

Input for the model

What is the price for tomatoes?

To determine the price for tomatoes, we need to carefully examine the image provided. The image shows a variety of fruits and vegetables with their respective prices displayed on small signs.
1. Identify the tomatoes in the image: The tomatoes are located in the lower left section of the image.
2. Locate the price sign for tomatoes: The price sign for tomatoes is positioned near the tomatoes.
3. Read the price on the sign: The sign clearly states "8.00" for the price of tomatoes.
Therefore, the price for tomatoes is 8.00.

Input for the model

What is d in the last equation?

To find the value of d in the last equation, we need to solve the equation 7d - 4 = 11d - 9.
Step 1: Subtract 7d from both sides of the equation:
7d - 4 - 7d = 11d - 9 - 7d
-4 = 4d - 9
Step 2: Add 9 to both sides of the equation:
-4 + 9 = 4d - 9 + 9
5 = 4d
Step 3: Divide both sides by 4:
$\frac{5}{4} = \frac{4d}{4}$
$d = \frac{5}{4}$
Therefore, the value of d is $\frac{5}{4}$.

Input for the model

Please describe this image in detail.

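This is a movie poster. The background is a stormy sea, with rolling waves and flying spray. Several giant dragons are on the sea, baring their fangs and claws as if locked in a fierce battle. At the center of the poster is a boy holding a spear; he stands among the waves with his back to the viewer, facing the dragons. The boy's hair blows in the wind, and he wears a red sash tied with red ribbons. The tip of his spear is aflame, and the fire flickers among the waves, making it especially eye-catching. At the top of the poster is a line of red text, “魔童归来 风火开年” ("The demon child returns; a new year opens with wind and fire"), indicating the film's theme and release time. At the bottom is the film's title, “哪吒之魔童闹海”, along with director and release-date information. The text “2025 大年初一” below the title indicates that the film will be released on the first day of the 2025 Lunar New Year. At the very bottom are two logos, one for “可可豆动画” and the other for “COLD ROOM”. From this information, one can infer that this is an animated film themed on “哪吒” (Ne Zha), telling the story of Ne Zha battling giant dragons at sea. Through its vivid depiction of the waves and dragons, the poster creates a tense, thrilling atmosphere that draws the viewer's attention.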

Input for the model

What are all the scene text in the image?

The text in the image reads: "HAVE yourself A MERRY LITTLE Christmas." The text is stylized with a mix of cursive and print fonts, and it is adorned with decorative elements such as holly leaves and berries, which are traditional symbols associated with Christmas. The overall design conveys a festive and cheerful holiday message.

Input for the model

What are all the Chinese scene text in the image?

马曳 著 细细密密的光 每个人都有自己的困境, 却总会被光亮吸引,身不由己地慢慢靠近

Input for the model

Why are the persons wearing black clothes? Answer it and give the rationale.

The individuals in the image are wearing black clothes, which is a common choice for funeral attire. This tradition is rooted in several cultural and historical reasons. Firstly, black clothing is often associated with mourning and grief, symbolizing the seriousness and solemnity of the occasion. It is a visual cue to others that the person is in a state of mourning and should be treated with respect and dignity.

Secondly, black clothing can also be a practical choice for funerals. It is a color that does not show stains or dirt easily, which is important when dealing with the physical aspects of a funeral, such as handling the body and participating in the service. Additionally, black clothing can help to create a sense of unity and solidarity among the mourners, as it is a color that is commonly worn by those who are grieving.

In summary, the individuals in the image are wearing black clothes because it is a traditional and practical choice for funerals, symbolizing mourning and grief while also providing a sense of unity and respect among the mourners.

Input for the model

Please solve the problem in the image step by step.

Given the equation x + y = 17, find the value of (x - 8)^3 + (y - 9)^3. First, from x + y = 17, we can express y as y = 17 - x. Next, substituting this expression for y into (x - 8)^3 + (y - 9)^3 gives (x - 8)^3 + (17 - x - 9)^3, which simplifies to (x - 8)^3 + (8 - x)^3. Since (a - b)^3 and (b - a)^3 have equal absolute values, (x - 8)^3 + (8 - x)^3 = 0. Therefore, the value of (x - 8)^3 + (y - 9)^3 is 0.
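
The cancellation in the final step of the answer above holds because cubing preserves sign, so the two cubes are exact opposites and not merely equal in absolute value. A worked version of that step, using only the quantities defined in the problem:

```latex
\begin{aligned}
y - 9 &= (17 - x) - 9 = 8 - x = -(x - 8),\\
(y - 9)^3 &= \bigl(-(x - 8)\bigr)^3 = -(x - 8)^3,\\
(x - 8)^3 + (y - 9)^3 &= (x - 8)^3 - (x - 8)^3 = 0.
\end{aligned}
```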

Input for the model

What is the original price for pork belly before discount?

To find the original price for pork belly before the discount, we need to use the given information and perform a simple calculation.
1. Identify the Discounted Price and Discount Percentage:
  • The discounted price of pork belly is 9.80.
  • The discount percentage is 30%.
2. Understand the Relationship Between Original Price, Discount, and Discounted Price:
  • The discounted price is the original price minus the discount amount.
  • The discount amount is calculated as a percentage of the original price.
3. Calculate the Discount Amount:
  • Let P be the original price.
  • The discount amount is $0.30 \times P$ (since 30% of P is the discount).
4. Set Up the Equation: The discounted price is the original price minus the discount amount: $9.80 = P - (0.30 \times P)$
5. Solve for P:
  • Combine like terms: $9.80 = P - 0.30P$, so $9.80 = 0.70P$
  • Divide both sides by 0.70 to isolate P: $P = \frac{9.80}{0.70}$, so $P = 14$
Therefore, the original price for pork belly before the discount is 14.

Input for the model

Can you give a short introduction to this person?

The image depicts a portrait of Galileo Galilei, an Italian astronomer, physicist, and engineer, sometimes described as a polymath. He is widely recognized for his contributions to the scientific revolution of the 17th century. Galileo is often referred to as the "father of observational astronomy" and the "father of modern physics." His most notable achievements include the observation of the four largest moons of Jupiter (Io, Europa, Ganymede, and Callisto), the observation of the phases of Venus, and the measurement of the acceleration due to gravity. His work laid the foundation for the scientific method and significantly advanced the fields of astronomy and physics.

Input for the model

Please describe this image in detail.

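This is a movie poster. The main subject is a close-up frontal portrait of a woman; her expression looks pained and exhausted, and there are visible wounds and bloodstains on the right side of her face. The background is mainly warm yellow, and the overall mood is heavy. In the upper-left corner, large text reads “0315 全国上映” ("0315, in theaters nationwide"), indicating that the film will be released nationwide on March 15. Next, at the lower left is a vertical line of text, “刘浩存 饰 徐嘉怡” ("Liu Haocun as Xu Jiayi"), indicating that the film's lead is played by Liu Haocun and that the character is named Xu Jiayi. Below the woman's image is a slanted line of words, “跨 过 漫 长 岁 月 我 们 终 会 重 逢” ("Across the long years, we will meet again in the end"), suggesting that the film's theme may relate to time and reunion. The words are spaced apart, and some are connected by vertical lines, which gives the overall design strong visual impact. At the bottom center is the film's title, “灿烂的她”, with the English text “Remember me” and the director credit “导演 徐伟” ("Directed by Xu Wei"). This information gives the film's title, director, and lead actor, and gives viewers a first impression of the film. The lower-right corner has the production company's logos and some related information, indicating the film's production and distribution companies. The lower-right corner is also marked “@刘浩存”, which should be a social media account or tag related to the film. Taken together, these elements show that the film 《灿烂的她》 combines text and imagery to convey a strong emotional atmosphere, hinting at a deeply moving story.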

Text-to-Image Generation

Van Gogh's Starry Night reimagined with neon cityscape

Output for the model

Charcoal sketch of an old wizard’s study, ancient books and potion bottles, dramatic shadows

Output for the model

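Hyper-realistic glacier cave, sunlight shining through blue ice, icicles like crystal chandeliers, reflections on an underground river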

Output for the model

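A man wearing a hat, close-up shot, black-and-white photo, high contrast, sharp facial details, blurred background, wearing a dark coat, beard and short hair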

Output for the model

a photo of a red stop sign right of a blue book

Output for the model

A striking image depicting an imagined distant galaxy teeming with life, inspired by the work of the renowned artist Caspar David Friedrich, is showcased in high quality on Artstation. The matte painting presents a vibrant scene filled with ethereal colors and pulsating with otherworldly energy. As viewers delve into the scene, they encounter intricate details of celestial bodies and the silhouettes of people populating this fantasy cosmos.

Output for the model

Film noir detective close-up, venetian blind shadows, cigarette smoke swirls

Output for the model

Fashion model with iridescent makeup, prismatic light reflections, high-fashion studio setting

Output for the model

Quantum portal opening in desert ruins, fractal energy waves, archaeologists in exosuits.

Output for the model

A digitally rendered image of the iconic Monalisa capturing a selfie, boasting 8K resolution and hyper-realistic details that highlight the delicate textures of her skin and the intricate fibers of her clothing. The scene is illuminated with cinematic lighting that casts soft shadows and enhances the depth of field, giving a three-dimensional quality to the image. The background is blurred artfully, drawing full attention to her enigmatic expression and the modern device in her hands, all created with the precision of an Octane render engine.

Output for the model

A meticulously crafted Art Nouveau screenprint featuring a dog's face, characterized by its remarkable symmetry and elaborate detailing. The canine visage, which is the central motif of the piece, exhibits intricate linework and stylized features typical of the Art Nouveau aesthetic. The artwork is deftly rendered in a harmonious palette, with each element of the design echoing the balanced and ornate nature of the style.

Output for the model

Sci-fi warrior woman with glowing visor, electric sparks, metallic reflections, futuristic armor.

Output for the model

Magical library pixel scene, floating books, glowing runes, enchanted atmosphere.

Output for the model

Turtle swimming near ocean surface with sun rays penetrating water, peaceful marine life scene.

Output for the model

Penguin sliding on ice under aurora lights, comical pose, arctic environment with colorful sky reflection.

Output for the model

Instruction-based Image Editing

Input for the model

Change this image into Adorable 3D Character style.

Output for the model
Input for the model

Change this image into a loose, flowing watercolor-wash style.

Output for the model
Input for the model

Replace the mountain goat in the image with a rabbit.

Output for the model
Input for the model

Replace the deer in the image with a lion standing majestically in the same forest setting, under the glowing golden light and light snowflakes.

Output for the model
Input for the model

Change the color of the vehicle to red.

Output for the model
Input for the model

Remove the horse in the foreground.

Output for the model
Input for the model

Remove the human figure from the image, ensuring the background is restored to appear natural and seamless.

Output for the model
Input for the model

Change the blurred environment in the background to an autumn forest with orange and yellow leaves on the trees.

Output for the model
Input for the model

Change this image into classic impasto oil-painting style.

Output for the model
Input for the model

Change this image into 8-bit pixel-art style.

Output for the model
Input for the model

Change this image into classic stained-glass cathedral-window style.

Output for the model
Input for the model

Change this image into Graffiti Art style.

Output for the model

Training Pipeline

OneCAT training stages overview
Training stages overview.

Stage 1: Separate Pretraining

Understanding Distillation: 436M image-text pairs, trained with teacher-student distillation from an InternViT teacher model

Generation Pretraining: 51M text-to-image samples with Next-Scale Prediction loss

Stage 2: Unified Mid-Training

Unified training across all tasks with Scale-Aware Adapter integration for generation. Native resolution strategy for both understanding and generation with dynamic aspect ratios.

Stage 3: Supervised Fine-Tuning

High-quality instruction-following data with expanded generation resolution support

Multimodal Understanding

Model A-LLM Vis. MME-P ↑ MME-S ↑ MMBench ↑ MMMU ↑ MM-Vet ↑ MathVista ↑ SEED ↑
Encoder-based Understanding Only Models
InternVL2 1.8B 0.3B 1440 1877 73.2 34.3 44.6 46.4 71.6
InternVL2.5 1.8B 0.3B 2138 74.7 43.6 60.8 51.3
Qwen2-VL 1.5B 0.6B 1872 74.9 41.1 49.5 43.0
Qwen2.5-VL 3B 0.6B 2157 79.1 53.1 61.8 62.3
Encoder-free Understanding Only Models
Mono-InternVL 1.8B / 1875 65.5 33.7 40.1 45.7 67.4
EvE 7B / 1628 52.3 32.6 25.7 64.6
EvEv2.0 7B / 1709 66.3 39.3 45.0 71.4
HoVLE 2.6B / 1864 71.9 33.7 44.3 46.2 70.7
VoRA 7B / 1363 1674 64.2 32.2 33.7 - 67.5
SAIL 7B / - 1719 70.1 - 46.3 57.0 72.9
Unified Models
Chameleon 7B - 35.7 28.4 8.3 30.6
Emu3 8B 0.3B 58.5 31.6 37.2 68.2
Harmon 1.5B 0.9B 1155 1476 65.5 38.9 67.1
Show-o2 (1.5B) 1.5B 0.5B 1450 67.4 37.1 65.6
Janus-Pro (1.5B) 1.5B 0.3B 1444 75.5 36.3 39.8
ILLUME+ 3B 0.6B 1414 80.8 44.3 40.3 73.3
VILA-U 7B 0.4B 1401 33.5 59.0
Janus-Pro (7B) 7B 0.3B 1567 79.2 41.0 50.0
Tar 7B 0.4B 1571 1926 74.4 39.0
Show-o2 (7B) 7B 0.5B 1620 79.3 48.9 69.8
OneCAT-3B 3B / 1630 2051 78.8 41.9 52.2 61.7 72.5

Higher is better. Best and second-best are highlighted (across unified models).

Text-to-Image Generation

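| Model | Params | GenEval Overall | DPG-Bench Overall |
| --- | --- | --- | --- |
| Emu3-8B | 8B | 0.66 | 81.60 |
| Janus-Pro-7B | 7B | 0.80 | 84.19 |
| Mogao-7B | 7B | 0.89 | 84.33 |
| BAGEL-7B | 7B | 0.82 | - |
| BAGEL-7B† | 7B | 0.88 | - |
| Show-o2-7B | 7B | 0.76 | 86.14 |
| Tar-7B | 7B | 0.84 | 84.19 |
| BLIP3-o-8B | 8B | 0.84 | 81.60 |
| OneCAT-3B | 3B | 0.90 | 84.53 |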

GenEval and DPG-Bench results. Best and second-best are highlighted. † indicates prompt rewriting where applicable.

Image Editing

| Model | Add | Adjust | Extract | Replace | Remove | Background | Style | Hybrid | Action | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MagicBrush | 2.84 | 1.58 | 1.51 | 1.97 | 1.58 | 1.75 | 2.38 | 1.62 | 1.22 | 1.90 |
| Instruct-Pix2Pix | 2.45 | 1.83 | 1.44 | 2.01 | 1.50 | 1.44 | 3.55 | 1.20 | 1.46 | 1.88 |
| AnyEdit | 3.18 | 2.95 | 1.88 | 2.47 | 2.23 | 2.24 | 2.85 | 1.56 | 2.65 | 2.45 |
| UltraEdit | 3.44 | 2.81 | 2.13 | 2.96 | 1.45 | 2.83 | 3.76 | 1.91 | 2.98 | 2.70 |
| Step1X-Edit | 3.88 | 3.14 | 1.76 | 3.40 | 2.41 | 3.16 | 4.63 | 2.64 | 2.52 | 3.06 |
| ICEdit | 3.58 | 3.39 | 1.73 | 3.15 | 2.93 | 3.08 | 3.84 | 2.04 | 3.68 | 3.05 |
| OmniGen | 3.47 | 3.04 | 1.71 | 2.94 | 2.43 | 3.21 | 4.19 | 2.24 | 3.38 | 2.96 |
| OmniGen2 | 3.57 | 3.06 | 1.77 | 3.74 | 3.20 | 3.57 | 4.81 | 2.52 | 4.68 | 3.44 |
| BAGEL-7B | 3.56 | 3.31 | 1.70 | 3.30 | 2.62 | 3.24 | 4.49 | 2.38 | 4.17 | 3.20 |
| UniWorld-V1-20B | 3.82 | 3.64 | 2.27 | 3.47 | 3.24 | 2.99 | 4.21 | 2.96 | 2.74 | 3.26 |
| OneCAT-3B | 3.65 | 3.70 | 2.42 | 3.92 | 3.00 | 3.79 | 4.61 | 2.23 | 3.53 | 3.43 |

ImgEdit-Bench results. Best and second-best are highlighted.

Efficiency Analysis

| Model | Input Image Resolution | #Text Tokens | #Visual Tokens | TTFT (s) | TTFT Reduction vs. Qwen2.5-VL-3B |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-3B | 768 × 768 | 24 | 731 | 0.135 | - |
| OneCAT-3B | 768 × 768 | 24 | 731 + 256 | 0.067 | 50.4% |
| Qwen2.5-VL-3B | 1024 × 1024 | 24 | 1395 | 0.216 | - |
| OneCAT-3B | 1024 × 1024 | 24 | 1395 + 256 | 0.092 | 57.4% |
| Qwen2.5-VL-3B | 1792 × 1792 | 24 | 4098 | 0.583 | - |
| OneCAT-3B | 1792 × 1792 | 24 | 4098 + 256 | 0.225 | 61.4% |

Efficiency comparison of OneCAT-3B and Qwen2.5-VL-3B (Prefilling). TTFT = Time To First Token.
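
The reduction column can be reproduced from the TTFT values in the table as (baseline - model) / baseline. The short check below does that; the function and dictionary names are illustrative only.

```python
def ttft_reduction(baseline_s: float, model_s: float) -> float:
    """Percentage reduction in time to first token relative to the baseline."""
    return round(100.0 * (baseline_s - model_s) / baseline_s, 1)


# (Qwen2.5-VL-3B TTFT, OneCAT-3B TTFT) in seconds, copied from the table.
TTFT_SECONDS = {
    "768 x 768": (0.135, 0.067),
    "1024 x 1024": (0.216, 0.092),
    "1792 x 1792": (0.583, 0.225),
}

for resolution, (baseline, onecat) in TTFT_SECONDS.items():
    print(resolution, ttft_reduction(baseline, onecat))
```

The computed values match the reported 50.4%, 57.4%, and 61.4%, and show the gap widening as input resolution grows.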

| Model | Generated Image Resolution | T2I Inference Time (s) | Edit Inference Time (s) |
| --- | --- | --- | --- |
| BAGEL-7B | 512 × 512 | 8.762 | 13.447 |
| OneCAT-3B | 512 × 512 | 1.40 | 2.03 |
| BAGEL-7B | 1024 × 1024 | 26.293 | 46.444 |
| OneCAT-3B | 1024 × 1024 | 2.85 | 4.61 |

Generation efficiency comparison of OneCAT-3B and BAGEL-7B. T2I = Text-to-Image.
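
For a quick reading of these timings, the speedup of OneCAT-3B over BAGEL-7B is the ratio of the two inference times. The check below computes it from the table values; the names are illustrative only.

```python
def speedup(baseline_s: float, model_s: float) -> float:
    """How many times faster the model is than the baseline."""
    return round(baseline_s / model_s, 2)


# (BAGEL-7B time, OneCAT-3B time) in seconds, copied from the table.
TIMINGS = {
    ("512 x 512", "T2I"): (8.762, 1.40),
    ("512 x 512", "Edit"): (13.447, 2.03),
    ("1024 x 1024", "T2I"): (26.293, 2.85),
    ("1024 x 1024", "Edit"): (46.444, 4.61),
}

for (resolution, task), (bagel, onecat) in TIMINGS.items():
    print(resolution, task, speedup(bagel, onecat))
```

This gives roughly 6x at 512 × 512 and roughly 9x to 10x at 1024 × 1024 for both text-to-image generation and editing.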