OneCAT: The Unified Multimodal Model

We introduce OneCAT, an open-source unified multimodal model that integrates understanding, generation, and editing within a pure decoder-only transformer architecture. The framework needs no external components such as a Vision Transformer (ViT) or a visual tokenizer during inference, which yields significant efficiency gains and sets a new performance standard for unified multimodal intelligence.

Overview

Pure Decoder-Only Design

Eliminates external vision encoders and VAE tokenizers during inference, using only a lightweight patch embedding layer for raw image processing.
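To make the idea of a lightweight patch embedding layer concrete, here is a minimal sketch that splits a raw image into non-overlapping patches and projects each flattened patch linearly. The patch size, embedding width, and the names `patchify` and `linear_embed` are hypothetical and chosen for illustration; they are not OneCAT's actual layer.

```python
import random


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            patches.append(flat)
    return patches


def linear_embed(patches, weight):
    """Project each flattened patch with a (patch * patch) x dim weight matrix."""
    dim = len(weight[0])
    return [
        [sum(p[i] * weight[i][j] for i in range(len(p))) for j in range(dim)]
        for p in patches
    ]


# Hypothetical sizes: an 8x8 image, 4x4 patches, 3-dimensional embeddings.
random.seed(0)
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
weight = [[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(16)]
tokens = linear_embed(patchify(image, 4), weight)
print(len(tokens), len(tokens[0]))  # 4 3
```

Each resulting vector plays the role of one visual token fed directly to the decoder, with no separate encoder in between.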

Mixture-of-Experts (MoE)

Three specialized FFN experts: Text FFN for language comprehension, Understanding FFN for visual tokens, and Generation FFN for image synthesis.
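The three-expert layout can be sketched as a dispatch on token type. This sketch assumes each token is sent to exactly one expert according to its modality, which is an assumption made for illustration; the expert functions below are trivial stand-ins, not real feed-forward networks, and all names are hypothetical.

```python
from enum import Enum


class TokenKind(Enum):
    TEXT = "text"
    UNDERSTANDING = "understanding"  # visual tokens from an input image
    GENERATION = "generation"        # visual tokens being synthesized


# Stand-in experts: each "FFN" is a distinct elementwise function.
def text_ffn(x):
    return [v + 1.0 for v in x]


def understanding_ffn(x):
    return [v * 2.0 for v in x]


def generation_ffn(x):
    return [-v for v in x]


EXPERTS = {
    TokenKind.TEXT: text_ffn,
    TokenKind.UNDERSTANDING: understanding_ffn,
    TokenKind.GENERATION: generation_ffn,
}


def moe_layer(hidden_states, kinds):
    """Send every token's hidden state through the expert matching its kind."""
    if len(hidden_states) != len(kinds):
        raise ValueError("one kind is required per token")
    return [EXPERTS[kind](h) for h, kind in zip(hidden_states, kinds)]


hidden = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
kinds = [TokenKind.TEXT, TokenKind.UNDERSTANDING, TokenKind.GENERATION]
routed = moe_layer(hidden, kinds)
print(routed)  # [[2.0, 3.0], [2.0, 4.0], [-1.0, -2.0]]
```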

Multi-Scale Autoregressive

Pioneers a Next-Scale Prediction paradigm that generates images coarse-to-fine, drastically reducing the number of generation steps compared to diffusion models.
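A small sketch of why coarse-to-fine generation needs few steps: each step emits every token of one scale in parallel, so the step count equals the number of scales rather than the number of tokens. The scale schedule below is hypothetical and is not OneCAT's actual configuration, and the comparison shown is against token-by-token decoding at the final resolution, used here only as an illustrative baseline.

```python
def next_scale_schedule(sides):
    """Return (step, side, tokens_emitted_in_parallel) for each scale, coarse to fine."""
    return [(i + 1, side, side * side) for i, side in enumerate(sides)]


# Hypothetical side lengths of the token maps, from coarsest to finest.
sides = [1, 2, 3, 4, 6, 8, 12, 16]
schedule = next_scale_schedule(sides)

steps_next_scale = len(schedule)                      # one forward step per scale
steps_next_token = sides[-1] ** 2                     # one step per token at the final scale
total_tokens = sum(tokens for _, _, tokens in schedule)

for step, side, tokens in schedule:
    print(f"step {step}: {side}x{side} map, {tokens} tokens in parallel")
print(steps_next_scale, steps_next_token, total_tokens)  # 8 256 530
```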

OneCAT unified architecture overview.

Example Showcase


Chat & Visual Question Answering

What is the price for tomatoes?

To determine the price for tomatoes, we need to carefully examine the image provided. The image shows a variety of fruits and vegetables with their respective prices displayed on small signs.
1. Identify the tomatoes in the image: The tomatoes are located in the lower left section of the image.
2. Locate the price sign for tomatoes: The price sign for tomatoes is positioned near the tomatoes.
3. Read the price on the sign: The sign clearly states "8.00" for the price of tomatoes.
Therefore, the price for tomatoes is 8.00.

What is d in the last equation?

To find the value of $d$ in the last equation, we need to solve the equation $7d - 4 = 11d - 9$ .
Step 1: Subtract $7d$ from both sides of the equation:

7d - 4 - 7d = 11d - 9 - 7d

-4 = 4d - 9

Step 2: Add 9 to both sides of the equation:

-4 + 9 = 4d - 9 + 9

5 = 4d

Step 3: Divide both sides by 4:

\frac{5}{4} = \frac{4d}{4}

d = \frac{5}{4}

Therefore, the value of $d$ is $\frac{5}{4}$.

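Please describe this image in detail.

This is a movie poster. The background is a stormy sea, with rolling waves and flying spray. Several giant dragons rise from the sea, baring fangs and claws, seemingly locked in fierce battle. At the center of the poster is a boy holding a spear; he stands among the waves with his back to the viewer, facing the dragons. The boy's hair streams in the wind, and he wears a red sash tied with red ribbons. The tip of his spear is aflame, and the fire flickers amid the waves, making it especially eye-catching. At the top of the poster is a line of red text, "魔童归来风火开年", indicating the film's theme and release timing. At the bottom of the poster is the film's title, "哪吒之魔童闹海", along with director and release-date information. The "2025 大年初一" below the title indicates that the film will be released on the first day of the Lunar New Year in 2025. At the very bottom of the poster are two logos, one reading "可可豆动画" and the other "COLD ROOM". From this information, one can infer that this is an animated film themed on Nezha, telling the story of Nezha battling giant dragons at sea. Through its vivid depiction of the waves and dragons, the poster creates a tense, thrilling atmosphere that draws the viewer's attention.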

What are all the scene text in the image?

The text in the image reads: "HAVE yourself A MERRY LITTLE Christmas." The text is stylized with a mix of cursive and print fonts, and it is adorned with decorative elements such as holly leaves and berries, which are traditional symbols associated with Christmas. The overall design conveys a festive and cheerful holiday message.

What are all the Chinese scene text in the image?

马曳著细细密密的光每个人都有自己的困境，却总会被光亮吸引，身不由己地慢慢靠近

Why are the persons wearing black clothes? Answer it and give the rationale.

The individuals in the image are wearing black clothes, which is a common choice for funeral attire. This tradition is rooted in several cultural and historical reasons. Firstly, black clothing is often associated with mourning and grief, symbolizing the seriousness and solemnity of the occasion. It is a visual cue to others that the person is in a state of mourning and should be treated with respect and dignity.

Secondly, black clothing can also be a practical choice for funerals. It is a color that does not show stains or dirt easily, which is important when dealing with the physical aspects of a funeral, such as handling the body and participating in the service. Additionally, black clothing can help to create a sense of unity and solidarity among the mourners, as it is a color that is commonly worn by those who are grieving.

In summary, the individuals in the image are wearing black clothes because it is a traditional and practical choice for funerals, symbolizing mourning and grief while also providing a sense of unity and respect among the mourners.

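Please solve the problem in the image step by step.

Given the equation x + y = 17, find the value of (x - 8)^3 + (y - 9)^3. First, from the equation x + y = 17, we can express y as y = 17 - x. Next, substituting this expression for y into (x - 8)^3 + (y - 9)^3 gives (x - 8)^3 + (17 - x - 9)^3, which simplifies to (x - 8)^3 + (8 - x)^3. Since (a - b)^3 and (b - a)^3 have equal absolute values, (x - 8)^3 + (8 - x)^3 = 0. Therefore, the value of (x - 8)^3 + (y - 9)^3 is 0.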

What is the original price for pork belly before discount?

1. Identify the Discounted Price and Discount Percentage:

The discounted price of pork belly is 9.80.
The discount percentage is 30%.

2. Understand the Relationship Between Original Price, Discount, and Discounted Price:

The discounted price is the original price minus the discount amount.
The discount amount is calculated as a percentage of the original price.

3. Calculate the Discount Amount:

Let $P$ be the original price.
The discount amount is $0.30 \times P$ (since 30% of $P$ is the discount).

4. Set Up the Equation:

$9.80 = P - 0.30 \times P$

5. Solve for P:

Combine like terms: $9.80 = P - 0.30P$, so $9.80 = 0.70P$.
Divide both sides by 0.70 to isolate $P$: $P = \frac{9.80}{0.70} = 14$.

Can you give a short introduction to this person?

The image depicts a portrait of Galileo Galilei, an Italian astronomer, physicist, and engineer, sometimes described as a polymath. He is widely recognized for his contributions to the scientific revolution of the 17th century. Galileo is often referred to as the "father of observational astronomy" and the "father of modern physics." His most notable achievements include the observation of the four largest moons of Jupiter (Io, Europa, Ganymede, and Callisto), the observation of the phases of Venus, and the measurement of the acceleration due to gravity. His work laid the foundation for the scientific method and significantly advanced the fields of astronomy and physics.

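Please describe this image in detail.

This is a movie poster. The main subject is a close-up frontal portrait of a woman whose expression appears pained and exhausted, with visible wounds and bloodstains on the right side of her face. The background is mainly warm yellow, and the overall mood feels heavy. In the upper-left corner, large type reads "0315 全国上映", indicating that the film opens nationwide on March 15. Next, at the lower left there is a line of vertical text, "刘浩存饰徐嘉怡", indicating that the lead role, a character named 徐嘉怡, is played by 刘浩存. Below the woman's image is a diagonal line of words, "跨过漫长岁月我们终会重逢", suggesting that the film's theme may relate to time and reunion. There are gaps between these words, and some are connected by vertical lines, which gives the overall design strong visual impact. At the bottom center of the poster is the film's title, "灿烂的她", along with the English text "Remember me" and the director credit "导演徐伟". This information gives the film's name, director, and lead actor, and offers the audience a first impression of the film. The lower-right corner carries production company logos and related information, indicating the companies that produced and distributed the film. The lower-right corner is also marked "@刘浩存", presumably a social media account or mark related to the film. Taken together, these elements show that the film "灿烂的她" combines text and imagery to convey a strong emotional atmosphere, hinting at a deeply moving story.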

Text-to-Image Generation

Van Gogh's Starry Night reimagined with neon cityscape

Charcoal sketch of an old wizard’s study, ancient books and potion bottles, dramatic shadows

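Hyper-realistic glacier cave, sunlight shining through blue ice, icicles like crystal chandeliers, reflections on an underground river

A man wearing a hat, close-up shot, black-and-white photo, high contrast, clear facial details, blurred background, wearing a dark coat, beard and short hair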

a photo of a red stop sign right of a blue book

A striking image depicting an imagined distant galaxy teeming with life, inspired by the work of the renowned artist Caspar David Friedrich, is showcased in high quality on Artstation. The matte painting presents a vibrant scene filled with ethereal colors and pulsating with otherworldly energy. As viewers delve into the scene, they encounter intricate details of celestial bodies and the silhouettes of people populating this fantasy cosmos.

Film noir detective close-up, venetian blind shadows, cigarette smoke swirls

Fashion model with iridescent makeup, prismatic light reflections, high-fashion studio setting

Quantum portal opening in desert ruins, fractal energy waves, archaeologists in exosuits.

A digitally rendered image of the iconic Monalisa capturing a selfie, boasting 8K resolution and hyper-realistic details that highlight the delicate textures of her skin and the intricate fibers of her clothing. The scene is illuminated with cinematic lighting that casts soft shadows and enhances the depth of field, giving a three-dimensional quality to the image. The background is blurred artfully, drawing full attention to her enigmatic expression and the modern device in her hands, all created with the precision of an Octane render engine.

A meticulously crafted Art Nouveau screenprint featuring a dog's face, characterized by its remarkable symmetry and elaborate detailing. The canine visage, which is the central motif of the piece, exhibits intricate linework and stylized features typical of the Art Nouveau aesthetic. The artwork is deftly rendered in a harmonious palette, with each element of the design echoing the balanced and ornate nature of the style.

Sci-fi warrior woman with glowing visor, electric sparks, metallic reflections, futuristic armor.

Magical library pixel scene, floating books, glowing runes, enchanted atmosphere.

Turtle swimming near ocean surface with sun rays penetrating water, peaceful marine life scene.

Penguin sliding on ice under aurora lights, comical pose, arctic environment with colorful sky reflection.

Instruction-based Image Editing

Change this image into Adorable 3D Character style.

Change this image into a loose, flowing watercolor-wash style.

Replace the mountain goat in the image with a rabbit.

Replace the deer in the image with a lion standing majestically in the same forest setting, under the glowing golden light and light snowflakes.

Change the color of the vehicle to red.

Remove the horse in the foreground.

Remove the human figure from the image, ensuring the background is restored to appear natural and seamless.

Change the blurred environment in the background to an autumn forest with orange and yellow leaves on the trees.

Change this image into classic impasto oil-painting style.

Change this image into 8-bit pixel-art style.

Change this image into classic stained-glass cathedral-window style.

Change this image into Graffiti Art style.

Training Pipeline

Training stages overview.

Stage 1: Separate Pretraining

Understanding Distillation: 436M image-text pairs with teacher-student distillation using an InternViT teacher model

Generation Pretraining: 51M text-to-image samples with Next-Scale Prediction loss

Stage 2: Unified Mid-Training

Unified training across all tasks with Scale-Aware Adapter integration for generation. Native resolution strategy for both understanding and generation with dynamic aspect ratios.

Stage 3: Supervised Fine-Tuning

High-quality instruction-following data with expanded generation resolution support
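The three stages above can be summarized as a small configuration sketch. The `Stage` class and its field names are hypothetical and exist only for this illustration; the values are copied from the stage descriptions, and anything the descriptions do not state is marked as unspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical record describing one training stage."""
    name: str
    objective: str
    data: str


PIPELINE = [
    Stage("Stage 1: Understanding Distillation",
          "teacher-student distillation",
          "436M image-text pairs"),
    Stage("Stage 1: Generation Pretraining",
          "Next-Scale Prediction loss",
          "51M text-to-image samples"),
    Stage("Stage 2: Unified Mid-Training",
          "unified training across all tasks",
          "unspecified"),
    Stage("Stage 3: Supervised Fine-Tuning",
          "instruction following",
          "high-quality instruction-following data"),
]

for stage in PIPELINE:
    print(f"{stage.name} | {stage.objective} | {stage.data}")
```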

Multimodal Understanding

Model	A-LLM	Vis.	MME-P ↑	MME-S ↑	MMBench ↑	MMMU ↑	MM-Vet ↑	MathVista ↑	SEED ↑
Encoder-based Understanding Only Models
InternVL2	1.8B	0.3B	1440	1877	73.2	34.3	44.6	46.4	71.6
InternVL2.5	1.8B	0.3B	—	2138	74.7	43.6	60.8	51.3	—
Qwen2-VL	1.5B	0.6B	—	1872	74.9	41.1	49.5	43.0	—
Qwen2.5-VL	3B	0.6B	—	2157	79.1	53.1	61.8	62.3	—
Encoder-free Understanding Only Models
Mono-InternVL	1.8B	/	—	1875	65.5	33.7	40.1	45.7	67.4
EvE	7B	/	—	1628	52.3	32.6	25.7	—	64.6
EvEv2.0	7B	/	—	1709	66.3	39.3	45.0	—	71.4
HoVLE	2.6B	/	—	1864	71.9	33.7	44.3	46.2	70.7
VoRA	7B	/	1363	1674	64.2	32.2	33.7	-	67.5
SAIL	7B	/	-	1719	70.1	-	46.3	57.0	72.9
Unified Models
Chameleon	7B	-	—	—	35.7	28.4	8.3	—	30.6
Emu3	8B	0.3B	—	—	58.5	31.6	37.2	—	68.2
Harmon	1.5B	0.9B	1155	1476	65.5	38.9	—	—	67.1
Show-o2 (1.5B)	1.5B	0.5B	1450	—	67.4	37.1	—	—	65.6
Janus-Pro (1.5B)	1.5B	0.3B	1444	—	75.5	36.3	39.8	—	—
ILLUME+	3B	0.6B	1414	—	80.8	44.3	40.3	—	73.3
VILA-U	7B	0.4B	1401	—	—	—	33.5	—	59.0
Janus-Pro (7B)	7B	0.3B	1567	—	79.2	41.0	50.0	—	—
Tar	7B	0.4B	1571	1926	74.4	39.0	—	—	—
Show-o2 (7B)	7B	0.5B	1620	—	79.3	48.9	—	—	69.8
OneCAT-3B	3B	/	1630	2051	78.8	41.9	52.2	61.7	72.5

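Higher is better. Best and second-best are highlighted (across unified models). A-LLM is the number of activated LLM parameters and Vis. is the size of the vision encoder or tokenizer; "/" marks a model that has none, and a dash marks a score that is not reported.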

Text-to-Image Generation

Model	Params	GenEval Overall	DPG-Bench Overall
Emu3-8B	8B	0.66	81.60
Janus-Pro-7B	7B	0.80	84.19
Mogao-7B	7B	0.89	84.33
BAGEL-7B	7B	0.82	—
BAGEL-7B†	7B	0.88	—
Show-o2-7B	7B	0.76	86.14
Tar-7B	7B	0.84	84.19
BLIP3-o-8B	8B	0.84	81.60
OneCAT-3B	3B	0.90	84.53

GenEval and DPG-Bench results. Best and second-best are highlighted. † indicates prompt rewriting where applicable.

Image Editing

Model	Add	Adjust	Extract	Replace	Remove	Background	Style	Hybrid	Action	Overall
MagicBrush	2.84	1.58	1.51	1.97	1.58	1.75	2.38	1.62	1.22	1.90
Instruct-Pix2Pix	2.45	1.83	1.44	2.01	1.50	1.44	3.55	1.20	1.46	1.88
AnyEdit	3.18	2.95	1.88	2.47	2.23	2.24	2.85	1.56	2.65	2.45
UltraEdit	3.44	2.81	2.13	2.96	1.45	2.83	3.76	1.91	2.98	2.70
Step1X-Edit	3.88	3.14	1.76	3.40	2.41	3.16	4.63	2.64	2.52	3.06
ICEdit	3.58	3.39	1.73	3.15	2.93	3.08	3.84	2.04	3.68	3.05
OmniGen	3.47	3.04	1.71	2.94	2.43	3.21	4.19	2.24	3.38	2.96
OmniGen2	3.57	3.06	1.77	3.74	3.20	3.57	4.81	2.52	4.68	3.44
BAGEL-7B	3.56	3.31	1.70	3.30	2.62	3.24	4.49	2.38	4.17	3.20
UniWorld-V1-20B	3.82	3.64	2.27	3.47	3.24	2.99	4.21	2.96	2.74	3.26
OneCAT-3B	3.65	3.70	2.42	3.92	3.00	3.79	4.61	2.23	3.53	3.43

ImgEdit-Bench results. Best and second-best are highlighted.
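For the rows checked below, the Overall column is consistent with an unweighted mean of the nine category scores rounded to two decimals. This is an observation from the table values, not a statement of how the benchmark officially aggregates, and the helper name `mean_score` is hypothetical.

```python
# Category scores (Add ... Action) and the reported Overall, copied from the table.
rows = {
    "OmniGen2": ([3.57, 3.06, 1.77, 3.74, 3.20, 3.57, 4.81, 2.52, 4.68], 3.44),
    "BAGEL-7B": ([3.56, 3.31, 1.70, 3.30, 2.62, 3.24, 4.49, 2.38, 4.17], 3.20),
    "OneCAT-3B": ([3.65, 3.70, 2.42, 3.92, 3.00, 3.79, 4.61, 2.23, 3.53], 3.43),
}


def mean_score(scores):
    """Unweighted mean of the category scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


for model, (scores, reported) in rows.items():
    print(model, mean_score(scores), reported)
```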

Efficiency Analysis

Model	Resolution of Input Image	#Text Tokens	#Visual Tokens	TTFT (s)	Reduction
Qwen2.5-VL-3B	768 × 768	24	731	0.135	-
OneCAT-3B	768 × 768	24	731 + 256	0.067	50.4%
Qwen2.5-VL-3B	1024 × 1024	24	1395	0.216	-
OneCAT-3B	1024 × 1024	24	1395 + 256	0.092	57.4%
Qwen2.5-VL-3B	1792 × 1792	24	4098	0.583	-
OneCAT-3B	1792 × 1792	24	4098 + 256	0.225	61.4%

Efficiency comparison of OneCAT-3B and Qwen2.5-VL-3B (Prefilling). TTFT = Time To First Token.
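The Reduction figures follow from the TTFT values as the relative drop from the Qwen2.5-VL-3B baseline to OneCAT-3B at the same resolution. The short check below reproduces them; the function name `reduction_pct` is hypothetical.

```python
# (baseline TTFT, OneCAT TTFT) in seconds, copied from the table.
ttft = {
    "768 x 768": (0.135, 0.067),
    "1024 x 1024": (0.216, 0.092),
    "1792 x 1792": (0.583, 0.225),
}


def reduction_pct(baseline, ours):
    """Relative TTFT reduction versus the baseline, in percent, to one decimal."""
    return round(100.0 * (baseline - ours) / baseline, 1)


reductions = {res: reduction_pct(b, o) for res, (b, o) in ttft.items()}
print(reductions)  # {'768 x 768': 50.4, '1024 x 1024': 57.4, '1792 x 1792': 61.4}
```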

Model	Resolution of Generated Image	T2I Infer. Time (s)	Edit Infer. Time (s)
BAGEL-7B	512 × 512	8.762	13.447
OneCAT-3B	512 × 512	1.40	2.03
BAGEL-7B	1024 × 1024	26.293	46.444
OneCAT-3B	1024 × 1024	2.85	4.61

Generation efficiency comparison of OneCAT-3B and BAGEL-7B. T2I = Text-to-Image.
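The same timings can be read as speedup factors, that is, BAGEL-7B inference time divided by OneCAT-3B inference time at the same resolution. The sketch below derives them from the table; the name `speedup` is hypothetical, and the factors are simple ratios rather than figures reported by the authors.

```python
# (BAGEL-7B seconds, OneCAT-3B seconds), copied from the table.
timings = {
    ("512 x 512", "T2I"): (8.762, 1.40),
    ("512 x 512", "Edit"): (13.447, 2.03),
    ("1024 x 1024", "T2I"): (26.293, 2.85),
    ("1024 x 1024", "Edit"): (46.444, 4.61),
}


def speedup(baseline, ours):
    """How many times faster `ours` is than `baseline`, to one decimal."""
    return round(baseline / ours, 1)


speedups = {key: speedup(b, o) for key, (b, o) in timings.items()}
for (resolution, task), factor in speedups.items():
    print(f"{resolution} {task}: {factor}x")
# 512 x 512 T2I: 6.3x
# 512 x 512 Edit: 6.6x
# 1024 x 1024 T2I: 9.2x
# 1024 x 1024 Edit: 10.1x
```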