Huawei Unveils “Digital Wind Tunnel” for AI: A Virtual Reality Check for Model Training
Huawei is making waves in the AI world with the unveiling of its “digital wind tunnel,” a groundbreaking technology poised to revolutionize the way complex AI models are trained. This innovative platform provides a virtual environment for “rehearsing” AI model training before the models are deployed in the real world, promising significant efficiency gains and cost savings.
This isn’t science fiction – though it certainly has a Matrix-esque feel. The technology, developed by Huawei’s Markov modeling and simulation team, allows a 10,000-card cluster solution to be pre-simulated within hours. The key motivation for this pre-emptive step? Huawei researchers found that over 60% of computing power is wasted due to hardware resource misallocation and system incompatibility.
Think of it like this: just as automotive engineers use wind tunnels to test car performance, Huawei’s platform simulates the training and inference processes of large AI models on a computer. This allows them to identify and address potential problems early, optimizing configurations and ultimately saving both time and computational resources.
In essence, Huawei’s digital wind tunnel is designed to help AI models avoid performance bottlenecks and run faster, resulting in more efficient AI model training and inference.
To understand the advantages, consider the challenges of running large models through an automotive analogy:
* **Training Phase:** Like flooring the accelerator, efficiency plummets if computing power, memory, and communication aren’t perfectly aligned.
* **Inference Phase:** Tasks vary dramatically, from short queries (like a city sports car) to extensive text generation (like an off-road endurance race), and it is difficult for a single hardware configuration to satisfy both.
* **Multi-Card Clusters:** Similar to managing a vast car fleet, it’s vital to prevent “traffic jams” and failures to ensure long-term stability.
The digital wind tunnel acts as an intelligent expert, ensuring that AI computing power “avoids pitfalls and runs faster and more stably,” tackling the above three pain points.
**Sim2Train: Automated Optimization in Hours**
Training large models is growing ever more complex. As the number of parameters increases, so do the hardware requirements, and traditional scheduling strategies are unable to maximize the potential of these devices.
To tackle this, Huawei’s team developed Sim2Train, a platform designed to simulate model training, identify optimal hardware configurations, and find the most effective training strategies, helping utilize Ascend devices as efficiently as possible.
The platform focuses on two key aspects:
**First, simulating the training process.**
This is achieved through a large-scale training cluster modeling simulation that combines dynamic and static elements. The process allows for modular AI task flow assembly and the flexible construction of complex models, akin to assembling LEGO bricks. It enables comprehensive, quick analysis of compute, memory, and communication resource consumption.
Coupled with in-depth compatibility with Ascend hardware, the system leverages static planning and dynamic optimization to precisely boost the operating efficiency of large-scale training clusters.
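Huawei has not published Sim2Train’s internals, but the kind of static compute/memory/communication analysis described above can be sketched with a back-of-the-envelope cost model. Everything in this sketch (the config fields, the 6 × params × tokens FLOPs rule of thumb, the ring all-reduce formula, and the sample numbers) is an illustrative assumption, not Ascend-specific data:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    params_b: float        # model parameters, in billions
    batch_tokens: int      # tokens per global batch
    n_devices: int         # accelerator cards in the cluster
    tflops_per_dev: float  # sustained throughput per card, TFLOP/s
    link_gbps: float       # interconnect bandwidth per card, GB/s

def estimate_step(cfg: TrainConfig) -> dict:
    """Rough static estimate of one training step's cost.

    Uses the common ~6 * params * tokens FLOPs rule of thumb for a
    forward+backward pass, and a ring all-reduce of fp16 gradients
    for communication. Purely illustrative.
    """
    flops = 6 * cfg.params_b * 1e9 * cfg.batch_tokens
    compute_s = flops / (cfg.n_devices * cfg.tflops_per_dev * 1e12)
    # a ring all-reduce moves ~2 * (N-1)/N of the gradient bytes per card
    grad_bytes = cfg.params_b * 1e9 * 2  # fp16: 2 bytes per parameter
    comm_s = (2 * (cfg.n_devices - 1) / cfg.n_devices * grad_bytes
              / (cfg.link_gbps * 1e9))
    # weights + grads + Adam optimizer states: ~16 bytes/param, sharded
    mem_gb = cfg.params_b * 16 / cfg.n_devices
    return {"compute_s": compute_s, "comm_s": comm_s,
            "mem_gb_per_dev": mem_gb,
            # assume perfect compute/communication overlap
            "step_s": max(compute_s, comm_s)}

cfg = TrainConfig(params_b=70, batch_tokens=4_000_000,
                  n_devices=1024, tflops_per_dev=150, link_gbps=50)
print(estimate_step(cfg))
```

Even this crude model shows whether a given configuration is compute-bound or communication-bound, which is exactly the kind of misalignment the article says wastes over 60% of computing power; a real simulator would refine this with operator-level schedules and cluster topology.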
**Second, automatically identifying the best solution.**
The Sim2Train system can intelligently explore and optimize model structures for Ascend platforms, achieving optimal balance between model performance and functional capabilities.
Faced with the complex topology of CloudMatrix Ascend supernodes, Sim2Train can perform full-stack architecture modeling and joint strategy optimization at the chip, topology, and load levels. In addition, using real-time data collection and automatic feedback-calibration mechanisms, the system builds fine-grained abstract models of the hardware, supporting Ascend clusters across a variety of load scenarios.
**Sim2Infer: Minute-Level Dynamic Acceleration**
Huawei’s innovation extends to the inference phase with Sim2Infer, which offers a 30% improvement in end-to-end inference performance.
It is a multi-tiered simulation of the inference system, with five primary capabilities:
* **Simulating Load Characteristics:** Mathematically modeling the compute, memory access, and communication needs of different models and input data. For example, in MoE models, it records how frequently different experts are activated and the volume of data transmitted between devices.
* **Analyzing Hardware Architecture:** Simulating hardware performance across various areas, including chip microarchitecture (e.g., 3D Cube Tensor Acceleration Engine) and entire cluster network topology (e.g., how multiple servers are interconnected).
* **Describing Deployment Strategies:** Supporting the configuration of a variety of inference strategies, such as data-parallelism (multiple devices process different data) and tensor parallelism (splitting compute tasks) to determine which approach is most efficient on Ascend.
* **Driving Simulation Execution:** Using “discrete events” to simulate the inference process. This includes determining when an operator starts calculation and when data is transmitted, providing an accurate calculation of the time required for the entire inference process.
* **Automatic Search Optimization:** Given certain constraints (e.g., a low latency limit of 20ms), the software automatically searches for the best model structure, deployment strategy, and hardware configurations.
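The “discrete events” idea can be illustrated with a minimal event-driven simulator. The two-stage pipeline below (compute on a device, then transmit the result over a link), the single shared resources, and all the timings are invented for illustration; they are not Sim2Infer’s actual model:

```python
import heapq

def simulate(requests, compute_s, comm_s):
    """Minimal discrete-event simulation of a two-stage inference
    pipeline: each request is computed on a device, then its result
    is transmitted over a link. One compute unit and one link, each
    serving one job at a time. Returns total makespan in seconds.
    """
    # event = (time, kind, request_id); processed in time order
    events = [(arrival, "arrive", rid) for rid, arrival in requests]
    heapq.heapify(events)
    compute_free = 0.0   # time when the compute unit is next idle
    link_free = 0.0      # time when the link is next idle
    finish = 0.0
    while events:
        t, kind, rid = heapq.heappop(events)
        if kind == "arrive":
            start = max(t, compute_free)      # queue for the device
            compute_free = start + compute_s
            heapq.heappush(events, (compute_free, "send", rid))
        else:  # "send": transmit the finished result over the link
            start = max(t, link_free)
            link_free = start + comm_s
            finish = max(finish, link_free)
    return finish

# three requests arriving at t = 0.0, 0.1, 0.2 seconds
print(simulate([(0, 0.0), (1, 0.1), (2, 0.2)], compute_s=0.5, comm_s=0.2))
```

Extending such a simulator with per-operator timings and multiple devices, then sweeping deployment strategies under a latency constraint, mirrors the automatic search loop described above.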
Furthermore, Sim2Infer drives a series of system innovations and optimizations through modeling simulations that integrate hardware and software. These include:
* Modeling and analyzing the correlations between system parameters and model design factors, and proposing MoE model architectures that are friendly to Ascend inference.
* Optimizing inference deployment for MoE models in large expert-parallel (EP) scenarios.
* Achieving software-hardware collaborative inference acceleration on Ascend platforms through multi-dimensional cost modeling spanning memory-access optimization, load balancing, compute-communication overlap, and operator fusion.
**Sim2Availability: Second-Level Fault Localization**
Beyond training and inference, ensuring that large models run stably and effectively, especially in vast multi-card clusters, is crucial and demands high availability.
Huawei addresses this using a simulation framework, Sim2Availability.
Similar to computer-generated weather simulations, this framework utilizes Markov models to “virtually” create a cluster within the computer, simulating various failures, and assessing impacts and recovery processes to determine how to enhance availability.
Key features of the Sim2Availability simulation:
* **Failure “Generator”:** Simulates hardware failures such as NPU errors, memory errors, and optical module interruptions, and can simulate the simultaneous occurrence of multiple failures.
* **Failure “Detector”:** Simulates how to detect these failures, such as algorithms for determining network slowdowns or hardware anomalies, and the accuracy of detection influences recovery efficiency.
* **Failure “Impact Analysis”:** For example, an NPU failure can interrupt training and require a restart, while an optical module failure can slow down network transmission, thus decreasing training speed.
* **Recovery “Strategy Library”:** Designed with different recovery methods for different failures, such as “step-level rollback” (reverting only one step of the training data), “process-level recovery” (restarting only the process with the issue), and “full recovery” (restarting the entire job).
Together, these components build an efficient and precise “state monitor” for the cluster’s compute, storage, and network resources. Markov chains characterize the system’s randomness: the system is discretized into a finite set of states (e.g., “healthy”, “sub-healthy”, “faulty”), and state-transition models are built from triggering events, giving a macroscopic view of the overall hardware status.
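This Markov-chain formulation can be sketched in a few lines. The three states are taken from the article; the transition probabilities below are invented placeholders, not measured Ascend cluster rates:

```python
import random

# States from the article; per-step transition probabilities are
# illustrative placeholders, not measured cluster failure rates.
STATES = ["healthy", "sub-healthy", "faulty"]
TRANSITIONS = {
    "healthy":     {"healthy": 0.98, "sub-healthy": 0.015, "faulty": 0.005},
    "sub-healthy": {"healthy": 0.60, "sub-healthy": 0.35,  "faulty": 0.05},
    "faulty":      {"healthy": 0.90, "sub-healthy": 0.0,   "faulty": 0.10},
}

def simulate_availability(steps, seed=0):
    """Run the Markov chain and report the fraction of time the
    cluster spends in each state (a long-run availability estimate)."""
    rng = random.Random(seed)
    state = "healthy"
    counts = {s: 0 for s in STATES}
    for _ in range(steps):
        counts[state] += 1
        probs = TRANSITIONS[state]
        state = rng.choices(STATES, weights=[probs[s] for s in STATES])[0]
    return {s: counts[s] / steps for s in STATES}

print(simulate_availability(100_000))
```

On top of such a chain, one can attach the failure generator, detector, and recovery strategies described above, and compare how each recovery method (step-level rollback, process-level recovery, full restart) shifts time out of the “faulty” state.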
Original article, Author: Tobias. If you wish to reprint this article, please indicate the source: https://aicnbc.com/2219.html