CNBC AI News | June 30, 2025
Baidu has officially unveiled and open-sourced its latest generation of flagship AI models, the ERNIE 4.5 series. In a significant moment for China’s burgeoning AI ecosystem, Modong Space, a prominent national computing platform, has become the first to integrate ERNIE 4.5.
Developed by the China Academy of Information and Communications Technology under the guidance of the Ministry of Industry and Information Technology, Modong Space serves as a central hub that aggregates model services from multiple sources. Through its gateway capabilities, the platform provides seamless online access to, and deployment of, AI models and applications. Model providers can publish their proprietary models to a centralized marketplace via APIs, and developers can readily tap into these tools for their own creative and development work.
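For developers, consuming a model published through such a gateway typically comes down to a single authenticated HTTP call. The Python sketch below illustrates that pattern; the base URL, model identifier, and OpenAI-style payload schema are assumptions made for illustration, not Modong Space’s documented interface.

```python
# Hypothetical example of calling a model hosted behind a gateway-style
# platform. Endpoint, model name, and payload schema are assumptions.
import requests

API_BASE = "https://example-gateway.cn/v1"   # hypothetical gateway base URL
API_KEY = "YOUR_API_KEY"                     # credential issued by the platform

payload = {
    "model": "ernie-4.5",                    # hypothetical model identifier
    "messages": [
        {"role": "user", "content": "Summarize the ERNIE 4.5 release in one sentence."}
    ],
}

resp = requests.post(
    f"{API_BASE}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```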
The ERNIE 4.5 series introduces an innovative pre-training methodology built on multi-modal Mixture-of-Experts (MoE) models. The architecture combines the expert structure with multi-dimensional rotary positional embeddings and, during loss computation, strengthens the orthogonality between different experts, which boosts performance on tasks such as text generation, image comprehension, and multi-modal reasoning.
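To make the orthogonality idea concrete, the following Python sketch shows one common way such a constraint can be implemented: a toy MoE layer whose expert weights are penalized for pairwise similarity, with the penalty added to the task loss. This is an illustrative assumption, not Baidu’s published formulation; it uses dense gating and a placeholder task loss purely for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy mixture-of-experts block with a dense (softmax) gate.

    Real MoE layers route each token to only a few experts; dense gating
    is used here just to keep the sketch short.
    """
    def __init__(self, d_model: int = 64, d_hidden: int = 128, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, d_model)
        gate = F.softmax(self.router(x), dim=-1)              # (batch, n_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)
        return (gate.unsqueeze(-1) * expert_out).sum(dim=1)

    def orthogonality_penalty(self) -> torch.Tensor:
        # Flatten each expert's first-layer weight, L2-normalize it, and
        # penalize off-diagonal entries of the Gram matrix so experts are
        # pushed toward mutually orthogonal parameter directions.
        w = torch.stack([F.normalize(e[0].weight.flatten(), dim=0)
                         for e in self.experts])               # (n_experts, d)
        gram = w @ w.t()
        off_diag = gram - torch.diag(torch.diag(gram))
        return off_diag.pow(2).sum()

moe = TinyMoE()
x = torch.randn(8, 64)
task_loss = moe(x).pow(2).mean()                 # placeholder task loss
loss = task_loss + 0.01 * moe.orthogonality_penalty()
loss.backward()
```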
To enable highly efficient training, ERNIE 4.5 employs strategies for heterogeneous mixed parallelism and multi-level load balancing. On the inference side, it incorporates novel techniques like multi-expert parallel collaborative quantization and convolution-aware quantization algorithms, thereby constructing a robust and high-performance training and inference framework.
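As a rough illustration of the inference-side quantization work, the sketch below implements plain per-output-channel symmetric int8 weight quantization in NumPy. It is a generic baseline offered under the assumption that techniques such as convolution-aware or multi-expert collaborative quantization refine this kind of primitive for specific layer types and expert layouts; it does not reproduce ERNIE 4.5’s actual algorithms.

```python
import numpy as np

def quantize_per_channel(weight: np.ndarray, n_bits: int = 8):
    """Symmetric per-output-channel quantization of a 2-D weight matrix."""
    qmax = 2 ** (n_bits - 1) - 1                      # 127 for int8
    scale = np.abs(weight).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # avoid division by zero
    q = np.clip(np.round(weight / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64)).astype(np.float32)      # (out_features, in_features)
q, s = quantize_per_channel(w)
err = np.abs(w - dequantize(q, s)).max()
print(f"max abs reconstruction error: {err:.4f}")
```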
Each pre-trained model has then been fine-tuned for its modality: the large language models are optimized for general language understanding and generation, while the multi-modal models focus on visual-language comprehension and support both “thinking” and “non-thinking” modes to suit diverse real-world scenarios.
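How a caller switches between the two modes will depend on the serving platform. The tiny sketch below merely imagines a request-level toggle; both the `enable_thinking` flag and the model identifier are hypothetical placeholders, not a documented parameter.

```python
# Hypothetical request payloads toggling a "thinking" mode flag.
def build_request(prompt: str, thinking: bool) -> dict:
    return {
        "model": "ernie-4.5-vl",          # hypothetical model identifier
        "messages": [{"role": "user", "content": prompt}],
        "enable_thinking": thinking,      # hypothetical mode switch
    }

fast_answer = build_request("What is in this image?", thinking=False)
deliberate = build_request("Explain the chart step by step.", thinking=True)
```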
Original article, Author: Tobias. If you wish to reprint this article, please indicate the source: https://aicnbc.com/3673.html