Today, let's talk about GLM-Image, a newly open-sourced model that is making waves. Simply put, it is one of the first industrial-grade "discrete auto-regressive" image generation models.
Its most distinctive feature is its "hybrid" architecture: Auto-regressive + Diffusion Decoder.
Why the "Hybrid" Approach?
The short answer: to understand the complex logic in your prompts while keeping the generated images as sharp as photographs.
The reason is that current diffusion models, while stable in image quality and strong in generalization, tend to "hallucinate" or drop details when faced with complex instructions and "knowledge-intensive" scenarios; their semantic alignment can fall short. Auto-regressive models (like GPT), on the other hand, are naturally good at this kind of sequential logic and have a deeper grasp of semantics.
So, GLM-Image adopts a clear division of labor:
- Auto-regressive Part: Based on and initialized from the 9-billion-parameter GLM-4-9B. It acts as the "skeleton," generating visual tokens that encode the image's semantics and focusing on understanding complex text instructions.
- Diffusion Decoder: A 7-billion-parameter single-stream DiT based on CogView4. It acts as the "skin," rendering those tokens into high-fidelity detail and ensuring exquisite image quality.
For a concrete example, ask for "a blackboard displaying complex quantum physics formulas, next to a steaming cup of coffee, set in a futuristic laboratory." GLM-Image handles the text rendering (the formulas), the compositional logic (blackboard, coffee, background), and the atmosphere noticeably better than a diffusion-only model typically would. The auto-regressive module plans the content of the formulas and the placement of objects, while the diffusion module renders the texture of the chalk and the steam of the coffee vividly.
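To make the division of labor concrete, here is a minimal, deliberately toy sketch of such a two-stage pipeline in PyTorch. The class names (ToyARPlanner, ToyDiffusionDecoder), sizes, and interfaces are illustrative assumptions only; the real model uses a GLM-4-9B backbone and a CogView4-based DiT, not anything this small.

```python
import torch
import torch.nn as nn

class ToyARPlanner(nn.Module):
    """Tiny decoder-only stand-in for the AR backbone: emits discrete
    visual tokens one at a time, conditioned on the text prompt."""
    def __init__(self, vocab=4096, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    @torch.no_grad()
    def generate(self, prompt_ids, n_visual_tokens):
        seq = prompt_ids
        for _ in range(n_visual_tokens):
            mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
            h = self.block(self.embed(seq), src_mask=mask)        # (B, L, dim)
            next_tok = self.head(h[:, -1]).argmax(-1, keepdim=True)
            seq = torch.cat([seq, next_tok], dim=1)
        return seq[:, prompt_ids.size(1):]                         # only the visual tokens

class ToyDiffusionDecoder(nn.Module):
    """Tiny DiT-like stand-in for the diffusion decoder: iteratively
    refines noise into pixels, conditioned on the visual tokens."""
    def __init__(self, vocab=4096, dim=256, img=64):
        super().__init__()
        self.cond = nn.Embedding(vocab, dim)
        self.denoise = nn.Linear(dim + 3, 3)
        self.img = img

    @torch.no_grad()
    def sample(self, visual_tokens, steps=10):
        b = visual_tokens.size(0)
        c = self.cond(visual_tokens).mean(dim=1)                   # pooled token condition
        x = torch.randn(b, self.img, self.img, 3)                  # start from pure noise
        cond = c[:, None, None, :].expand(b, self.img, self.img, c.size(-1))
        for _ in range(steps):                                     # iterative denoising
            x = x - 0.1 * self.denoise(torch.cat([x, cond], dim=-1))
        return x.permute(0, 3, 1, 2)                               # (B, 3, H, W)

# Stage 1 plans the semantics as tokens; stage 2 renders the pixels.
planner, decoder = ToyARPlanner(), ToyDiffusionDecoder()
prompt_ids = torch.randint(0, 4096, (1, 16))                       # stand-in for tokenized text
visual_tokens = planner.generate(prompt_ids, n_visual_tokens=64)
image = decoder.sample(visual_tokens)
print(image.shape)                                                  # torch.Size([1, 3, 64, 64])
```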
A Technical "Trick"
In terms of technical implementation, one detail is worth mentioning: Visual Token Selection.
Previous models, when converting images to tokens, prioritized either high reconstruction fidelity (as in VQ-VAE) or strong semantics. For image generation, it turns out that semantic alignment is what matters most for convergence and quality.
Therefore, GLM-Image chose semantic VQ as its primary token strategy. Although it may look less "complete" than some alternatives from an information-theory perspective, in actual training-loss comparisons it converges much better (a difference on the order of 7 versus 3). In practice, this means the model learns to turn text into images faster and more accurately.
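To make "semantic VQ" a bit more tangible, here is a rough sketch of the quantization step it implies: each semantic patch feature is snapped to its nearest codebook entry, and the resulting indices become the discrete visual tokens. The SemanticVQ class, codebook size, and feature dimension below are illustrative placeholders, not GLM-Image's actual tokenizer.

```python
import torch
import torch.nn as nn

class SemanticVQ(nn.Module):
    """Snap semantic patch features to their nearest codebook entries.
    Codebook size and feature dim here are arbitrary toy values."""
    def __init__(self, codebook_size=1024, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, patch_features):                  # (B, N, dim)
        cb = self.codebook.weight                       # (K, dim)
        # Squared L2 distance via ||x||^2 - 2 x.c + ||c||^2 (standard VQ trick).
        d = (patch_features.pow(2).sum(-1, keepdim=True)
             - 2 * patch_features @ cb.t()
             + cb.pow(2).sum(-1))                       # (B, N, K)
        token_ids = d.argmin(-1)                        # discrete visual tokens
        quantized = self.codebook(token_ids)            # their semantic embeddings
        return token_ids, quantized

vq = SemanticVQ()
feats = torch.randn(2, 256, 64)    # stand-in for a semantic image encoder's patch features
ids, q = vq(feats)
print(ids.shape, q.shape)          # torch.Size([2, 256]) torch.Size([2, 256, 64])
```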
Progressive Training, Not Rushing
Training such a massive model wasn't done in one go, but rather in three stages:
- 256px Stage: Learning to draw small images first.
- 512px Stage: Increasing the resolution.
- Mixed Resolution Stage: From 512px to 1024px, with final outputs upscalable to 2048px.
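For intuition, the curriculum can be written down as a simple resolution schedule. The snippet below merely restates the three stages in config form; the field names and the round-robin sampler are entirely hypothetical.

```python
# Hypothetical resolution curriculum restating the three stages above.
TRAINING_STAGES = [
    {"name": "stage1_256px", "resolutions": [256],       "goal": "learn composition on small images"},
    {"name": "stage2_512px", "resolutions": [512],       "goal": "increase resolution"},
    {"name": "stage3_mixed", "resolutions": [512, 1024], "goal": "mixed-resolution training"},
]

def pick_resolution(stage_idx: int, step: int) -> int:
    """Choose a training resolution for this step, cycling through the stage's mix."""
    res = TRAINING_STAGES[stage_idx]["resolutions"]
    return res[step % len(res)]

print([pick_resolution(2, s) for s in range(4)])   # [512, 1024, 512, 1024]
```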
Especially when generating high-definition images, a "Progressive Generation" strategy is used: first generate about 256 tokens that define the overall framework and composition of the image, then fill in the high-frequency details. It's like how a painter works: sketch the outline first, then polish the details. This effectively avoids the problem where "local parts look good, but the overall structure collapses."
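Here is a minimal sketch of that coarse-to-fine idea, reusing the toy planner and decoder from the earlier pipeline sketch. The two-pass split and the 256-token figure come from the description above; the function name and the token counts in the example call are placeholders.

```python
import torch

@torch.no_grad()
def progressive_generate(planner, decoder, prompt_ids,
                         coarse_tokens=256, total_tokens=1024):
    """Two passes: a short token prefix fixes the global layout,
    then the remaining tokens fill in high-frequency detail."""
    # Pass 1: ~256 tokens establish the overall framework and composition.
    coarse = planner.generate(prompt_ids, n_visual_tokens=coarse_tokens)

    # Pass 2: continue from the coarse plan to emit the detail tokens.
    detail = planner.generate(torch.cat([prompt_ids, coarse], dim=1),
                              n_visual_tokens=total_tokens - coarse_tokens)

    # Decode the full coarse+detail token sequence into the final image.
    return decoder.sample(torch.cat([coarse, detail], dim=1))

# Reusing the toy classes from the earlier sketch (small sizes to keep it fast):
img = progressive_generate(ToyARPlanner(), ToyDiffusionDecoder(),
                           torch.randint(0, 4096, (1, 16)),
                           coarse_tokens=16, total_tokens=64)
print(img.shape)   # torch.Size([1, 3, 64, 64])
```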
Thanks to its deep semantic understanding, GLM-Image currently performs well not only in text-to-image but also in image-to-image, style transfer, and even "editing consistency" tasks. You can try it out directly in the Generator on this site.

