At the recent Computer Vision and Pattern Recognition Conference (CVPR) 2025 in Nashville, Tennessee, Kuaishou’s Kling AI division made a splash, showcasing its advancements in video generation and world model research. Dr. Wan Pengfei, representing the team, presented a dedicated tutorial, “An Introduction to Kling and Our Research towards More Powerful Video Generation Models,” offering a deep dive into the team’s progress. The tutorial explored four key technical pillars: model architecture and generation algorithms, interactive and controllable capabilities, effect evaluation and alignment mechanisms, and multimodal understanding and reasoning.
Let’s break down what these advancements could mean for the future of AI-driven video creation.
**Advanced Model Architectures and Generation Algorithms**
Kling’s team is tackling scaling laws – the rules that govern how model performance improves with increased size and computational resources. While scaling laws are well-established in large language models, the video generation domain lags. Kling’s research has established precise mathematical relationships between hyperparameters, model scale, and compute budgets. This allows for more efficient use of computational power and data, leading to better model performance. *See: Towards Precise Scaling Laws For Video Diffusion Transformers*.
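To make the idea concrete, the sketch below fits a generic power-law relationship between compute and validation loss, the kind of curve that scaling-law studies extrapolate from small pilot runs. The numbers and the functional form are illustrative assumptions, not figures from the Kling paper.

```python
# Illustrative sketch only: a generic power-law fit of validation loss vs. compute,
# in the spirit of scaling-law studies. All constants below are made up.
import numpy as np

# Hypothetical (compute, loss) measurements from small-scale training runs.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])   # training FLOPs
loss    = np.array([2.10, 1.95, 1.82, 1.71, 1.62])   # validation loss

# A common simplification: fit a straight line in log-log space,
# log(loss) ~ intercept + slope * log(compute), with slope < 0.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)

def predict_loss(c_flops: float) -> float:
    """Extrapolate validation loss at a larger compute budget from the log-log fit."""
    return float(np.exp(intercept) * c_flops ** slope)

print(f"fitted exponent: {slope:.3f}")
print(f"predicted loss at 1e21 FLOPs: {predict_loss(1e21):.3f}")
```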
Further innovating in efficiency, the team unveiled DiffMoE, a Mixture of Experts (MoE) architecture tailored for diffusion models. DiffMoE uses a global token selection mechanism and optimized inference strategies that allocate computing resources more effectively across the different stages of the diffusion process. Impressively, this approach requires only a third of the parameters of a comparable dense model, yet delivers performance that rivals a model three times its size. *See: DiffMoE: Dynamic Token Selection For Scalable Diffusion Transformers*.
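For intuition, here is a minimal, hypothetical sketch of batch-wide (“global”) token-to-expert routing, where each expert selects the tokens that score highest for it across the whole batch rather than each token independently choosing experts. It illustrates the general idea only, not DiffMoE’s actual implementation.

```python
# Hypothetical sketch of global token selection in an MoE layer (not DiffMoE itself).
import torch
import torch.nn as nn

class GlobalTokenMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, capacity_factor: float = 1.0):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.capacity_factor = capacity_factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) -> flatten into one global token pool.
        b, s, d = x.shape
        tokens = x.reshape(b * s, d)
        scores = self.router(tokens).softmax(dim=-1)          # (num_tokens, num_experts)
        capacity = int(self.capacity_factor * tokens.shape[0] / len(self.experts))

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Each expert takes the tokens that score highest for it, batch-wide.
            top = scores[:, e].topk(capacity).indices
            out[top] += scores[top, e : e + 1] * expert(tokens[top])
        return out.reshape(b, s, d)

# Quick shape check.
moe = GlobalTokenMoE(dim=64)
print(moe(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```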
**Powerful Interaction and Control Capabilities**
Kling is also focusing on how users can control the output of the models. The team’s FullDiT framework integrates all spatiotemporal conditions within a unified Diffusion Transformer architecture. FullDiT simplifies the process by removing the need to alter the model structure for different tasks, thus minimizing control conflicts, and the team reports that the approach scales well. *See: FullDiT: Multi-Task Video Generative Foundation Model with Full Attention*.
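The sketch below illustrates the general “everything through full attention” pattern: condition tokens (camera trajectory, depth, and so on) are concatenated with the video latent tokens and processed by a single self-attention block. Module names and sizes here are assumptions for illustration, not FullDiT’s actual architecture.

```python
# Hypothetical sketch: unify heterogeneous conditions by concatenating their tokens
# with video latent tokens and letting one full-attention block see everything.
import torch
import torch.nn as nn

class UnifiedConditionBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens: torch.Tensor, condition_tokens: torch.Tensor) -> torch.Tensor:
        # Joint sequence: [conditions | video]; full attention covers all of it,
        # so no per-task adapter or structural change is needed per condition type.
        seq = torch.cat([condition_tokens, video_tokens], dim=1)
        h = self.norm(seq)
        seq = seq + self.attn(h, h, h, need_weights=False)[0]
        seq = seq + self.mlp(self.norm(seq))
        # Return only the video part for the diffusion loss.
        return seq[:, condition_tokens.shape[1]:]

video = torch.randn(1, 128, 256)   # latent video tokens (illustrative)
camera = torch.randn(1, 16, 256)   # camera-trajectory tokens (illustrative)
depth = torch.randn(1, 32, 256)    # depth-map tokens (illustrative)
block = UnifiedConditionBlock()
print(block(video, torch.cat([camera, depth], dim=1)).shape)  # torch.Size([1, 128, 256])
```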
For interactive video creation, the team’s GameFactory framework enables control over both continuous and discrete actions, such as mouse movements and keyboard commands, while using only limited video training data. This makes it possible to generate interactive video content that adapts to a variety of game scenarios. *See: GameFactory: Creating New Games with Generative Interactive Videos*.
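As a rough illustration of mixed action conditioning, the hypothetical encoder below embeds discrete key presses through a lookup table and continuous mouse deltas through a small MLP, then fuses them into per-frame conditioning vectors. It is a generic pattern, not GameFactory’s actual interface.

```python
# Hypothetical action encoder: discrete (keyboard) and continuous (mouse) inputs
# fused into per-frame conditioning vectors for a video generator.
import torch
import torch.nn as nn

class ActionEncoder(nn.Module):
    def __init__(self, num_keys: int = 32, dim: int = 128):
        super().__init__()
        self.key_embed = nn.Embedding(num_keys, dim)  # discrete: key presses
        self.mouse_mlp = nn.Sequential(               # continuous: (dx, dy) deltas
            nn.Linear(2, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, key_ids: torch.Tensor, mouse_delta: torch.Tensor) -> torch.Tensor:
        # key_ids: (batch, frames) int64; mouse_delta: (batch, frames, 2) float
        k = self.key_embed(key_ids)
        m = self.mouse_mlp(mouse_delta)
        return self.fuse(torch.cat([k, m], dim=-1))   # (batch, frames, dim)

enc = ActionEncoder()
keys = torch.randint(0, 32, (1, 8))
mouse = torch.randn(1, 8, 2)
print(enc(keys, mouse).shape)  # torch.Size([1, 8, 128]) -> injected into the video model
```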
**Accurate Evaluation and Alignment Mechanisms**
To ensure quality and adherence to user intent, Kling is developing better evaluation and alignment tools. They’ve built a reinforcement learning from human feedback (RLHF)-based video generation framework, comprising multi-dimensional preference data construction, a reward model based on a vision-language model (VLM), and various alignment algorithms. This offers one of the first systematic explorations of RLHF in the video generation space. *See: Improving Video Generation with Human Feedback*.
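A core ingredient of any such pipeline is a reward model trained on preference pairs. The snippet below shows the standard pairwise (Bradley–Terry) objective commonly used for this step, with made-up scalar rewards standing in for the paper’s multi-dimensional, VLM-based reward model.

```python
# Standard pairwise preference objective for reward-model training:
# the preferred sample's reward should exceed the rejected sample's.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical scalar rewards for a batch of (preferred, rejected) video pairs.
r_chosen = torch.tensor([1.2, 0.7, 0.9])
r_rejected = torch.tensor([0.3, 0.8, 0.1])
print(preference_loss(r_chosen, r_rejected))  # lower when the reward model ranks pairs correctly
```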
Additionally, to bring online reinforcement learning (RL) to flow matching models, the team has introduced Flow-GRPO, an online RL algorithm that integrates GRPO into flow matching and has demonstrated its effectiveness on image generation tasks. *See: Flow-GRPO: Training Flow Matching Models via Online RL*.
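For readers unfamiliar with GRPO, its defining trick is the group-relative advantage: sample several generations per prompt, score them with a reward model, and normalize the rewards within each group so that no separate value network is needed. The sketch below shows only that computation, not the flow-matching sampling itself.

```python
# Group-relative advantages in the GRPO style: normalize rewards within each
# per-prompt group of samples; no learned value function required.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, group_size) reward-model scores for sampled generations.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[0.2, 0.8, 0.5, 0.9],
                        [0.1, 0.1, 0.4, 0.2]])
print(group_relative_advantages(rewards))
# Positive entries are reinforced and negative ones suppressed when reweighting
# the policy update of the generator.
```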
**Multimodal Perception and Reasoning**
Kling’s ambitions extend beyond video generation to encompass a deeper understanding of the underlying data. A key area of focus is developing better video captioning models. The team has developed VidCapBench, a video captioning evaluation framework designed to be stable and reliable and to correlate strongly with final video generation outcomes. *See: VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation*.
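As a toy illustration of what “correlates with final video generation outcomes” means in practice, one can rank captioning models by a benchmark score and check how well that ranking agrees with the quality of videos generated from their captions, for example via Spearman correlation. All numbers below are invented.

```python
# Toy illustration: does a caption benchmark's ranking of captioners predict
# downstream text-to-video quality? Spearman correlation compares the two rankings.
from scipy.stats import spearmanr

benchmark_scores = [72.1, 65.4, 80.3, 58.9]   # hypothetical captioner scores on a benchmark
t2v_quality      = [0.61, 0.55, 0.70, 0.49]   # hypothetical downstream video-quality scores

rho, _ = spearmanr(benchmark_scores, t2v_quality)
print(f"Spearman correlation: {rho:.2f}")  # close to 1.0 means the benchmark is predictive
```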
For understanding user intentions, the team’s Any2Caption interprets multimodal user input to generate semantically rich, structured descriptions, enhancing the success rate of video generation. *See: Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation*.
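To picture what a “semantically rich, structured description” might look like, here is a hypothetical schema: a typed record that a downstream text-to-video model could consume reliably. The fields are purely illustrative and are not Any2Caption’s actual output format.

```python
# Hypothetical structured-caption schema (illustrative fields only).
from dataclasses import dataclass, field

@dataclass
class StructuredCaption:
    subject: str                      # main subject, from text or a reference image
    action: str                       # what happens over the clip
    scene: str                        # environment / background
    camera: str = "static"            # e.g. "slow pan left", derived from motion hints
    style: str = "photorealistic"     # rendering style inferred from the inputs
    extra_details: list[str] = field(default_factory=list)

caption = StructuredCaption(
    subject="a golden retriever",
    action="catches a frisbee mid-air",
    scene="a sunlit park in autumn",
    camera="tracking shot from the side",
    extra_details=["shallow depth of field"],
)
print(caption)
```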
Beyond the tutorial, Kuaishou’s Kling team had an impressive showing at CVPR 2025, with seven papers accepted, covering a broad spectrum of research areas including video model scaling laws, video datasets, controllable generation, portrait generation, high-definition generation, and 4D generation. These developments suggest the company is focused on remaining a significant player in the competitive world of visual AI.
Original article, Author: Tobias. If you wish to reprint this article, please indicate the source: https://aicnbc.com/3303.html