Stanford researchers have stumbled upon something remarkable: AI-generated CUDA kernels that outperform versions meticulously optimized by human experts. The unexpected breakthrough could reshape how deep learning workloads are tuned, and it has the tech world buzzing.
The numbers are impressive. The AI-generated kernels delivered substantial speedups across a range of common deep learning operations, in several cases exceeding native PyTorch by a wide margin:
* Matrix Multiplication (Matmul, FP32): Achieved 101.3% performance compared to PyTorch’s torch.matmul.
* 2D Convolution (Conv2D): Reached 179.9% of torch.nn.Conv2d’s performance.
* Softmax: Demonstrated 111.8% of torch.softmax’s efficiency.
* Layer Normalization (LayerNorm): Soared to 484.4% of torch.nn.LayerNorm’s capabilities.
* Conv2D+ReLU+MaxPool Combos: Reached 290.1% of the PyTorch reference implementation’s performance and 189.0% of torch.compile()’s.
_(Performance percentages were benchmarked on an NVIDIA L40S GPU and are defined as the reference time divided by the generated kernel time.)_
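As a rough illustration of that definition (a minimal sketch, not the team’s actual benchmarking harness), the percentage can be measured with CUDA events in PyTorch; here a torch.compile-generated softmax merely stands in for an AI-generated kernel:

```python
import torch

def bench_ms(fn, *args, warmup=10, iters=100):
    """Average GPU time of fn(*args) in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x = torch.randn(4096, 4096, device="cuda")
reference = lambda t: torch.softmax(t, dim=-1)
candidate = torch.compile(reference)  # stand-in for a generated kernel

ref_ms, gen_ms = bench_ms(reference, x), bench_ms(candidate, x)
print(f"candidate performance: {100.0 * ref_ms / gen_ms:.1f}% of torch.softmax")
```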
The truly astonishing aspect? The achievement was, in the researchers’ own words, an “accident.” They initially set out to generate synthetic data for training a kernel-generation model, only to find that the synthetic generation process itself, run purely at test time, already produced remarkably high-performing kernels.
The news has sparked a flurry of excitement among industry watchers. The question on everyone’s mind: Is this the dawn of AI replacing kernel engineers?
Beyond the impressive performance gains, what sets this research apart is the novel methodology employed by the Stanford team.
Rather than relying on a straightforward, iterative optimization process (akin to a hill-climbing algorithm), the researchers incorporated a language reasoning step between each iteration. This strategically encouraged diverse exploration within the search space.
In essence, the system “thought” about how to improve with each refinement, sparking new ideas and leading to superior solutions.
Here’s a closer look at how they achieved this.
**From Natural Language to Code: The Optimization Approach**
As described in the team’s blog, the concept behind their kernel generation is elegantly simple: Given a PyTorch (torch) code snippet, the system is tasked with writing custom kernels to replace the original operators.
These kernels are written in pure CUDA-C, eliminating the need for specialized libraries and domain-specific languages (DSLs) such as CUTLASS and Triton. What distinguishes this approach from traditional methods is that natural language is used to generate optimization ideas first; those ideas are then translated into new code variants.
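To make the setup concrete, here is a minimal hand-written sketch (not one of the generated kernels) of how a pure CUDA-C kernel can replace a PyTorch operator via torch.utils.cpp_extension.load_inline; the elementwise ReLU below simply stands in for the kind of substitution the system produces:

```python
import torch
from torch.utils.cpp_extension import load_inline

# A deliberately simple CUDA-C kernel standing in for a generated one:
# an elementwise ReLU that replaces torch.relu.
cuda_src = r"""
#include <torch/extension.h>

__global__ void relu_kernel(const float* __restrict__ in,
                            float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(in[i], 0.0f);
}

torch::Tensor my_relu(torch::Tensor x) {
    auto y = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    relu_kernel<<<blocks, threads>>>(x.data_ptr<float>(), y.data_ptr<float>(), n);
    return y;
}
"""

ext = load_inline(
    name="my_relu_ext",
    cpp_sources="torch::Tensor my_relu(torch::Tensor x);",
    cuda_sources=cuda_src,
    functions=["my_relu"],
)

x = torch.randn(1 << 20, device="cuda")
assert torch.allclose(ext.my_relu(x), torch.relu(x))  # matches the original operator
```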
The team reasoned that a “sequential modification” strategy would lack diversity, leading to the risk of getting stuck in local minima, revisiting the same transformations, or endlessly optimizing fruitless paths.
To amplify the diversity of ideas, the Stanford team then embraced a multi-branch exploration model. Rather than refining only a single candidate at each step, they broadened the approach by letting each idea spawn multiple implementations, with the highest-performing kernel serving as the launchpad for the next round of refinement.
The team used OpenAI’s o3 and Gemini 2.5 Pro to tackle ten problems from Level 1 of the KernelBench benchmark, running the search for five rounds per problem. Most of the best kernels emerged in the later iterations, typically rounds 4 or 5.
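In pseudocode terms, the search the blog describes looks roughly like the sketch below. The helper callables (propose_ideas, implement, is_correct, benchmark) are hypothetical stand-ins for the LLM calls and the correctness/timing harness, not the team’s actual code:

```python
def optimize_kernel(baseline, propose_ideas, implement, is_correct, benchmark,
                    rounds=5, ideas_per_round=4, impls_per_idea=3):
    """Hypothetical sketch of the natural-language-guided branching search.

    propose_ideas(kernel, n)   -> n natural-language optimization ideas (LLM call)
    implement(kernel, idea, n) -> n candidate CUDA kernels realizing one idea (LLM call)
    is_correct(kernel)         -> functional check against the reference operator
    benchmark(kernel)          -> measured runtime in milliseconds
    """
    best = baseline                                   # start from a correct reference kernel
    for _ in range(rounds):
        candidates = []
        # 1) Reason in natural language about how the current best kernel could improve.
        for idea in propose_ideas(best, ideas_per_round):
            # 2) Let each idea spawn several concrete implementations.
            for kernel in implement(best, idea, impls_per_idea):
                if is_correct(kernel):                # discard anything that breaks numerics
                    candidates.append((benchmark(kernel), kernel))
        # 3) The fastest correct kernel seeds the next round of refinement.
        if candidates:
            best = min(candidates, key=lambda c: c[0])[1]
    return best
```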
KernelBench, a benchmark developed by the Stanford team for testing AI-generated kernels, is organized into three levels. Level 1 covers single primitive operations that form the building blocks of deep learning, including convolutions, matrix-vector and matrix-matrix multiplication, loss functions, activation functions, and layer normalization.
This latest discovery, alongside breakthroughs like DeepMind’s AlphaEvolve and o3’s discovery of a zero-day vulnerability in the Linux kernel, has led some observers to suggest that Gemini 2.5 Pro and o3 have reached a new level of sophistication.
Returning to the Stanford project, the process of generating the CUDA kernels reveals that the model’s approach is beginning to reflect the intuition of human experts:
* **Memory Access Optimization:** Enhancing the efficiency of data movement across the memory hierarchy (global memory, shared memory, and registers) and ensuring data access maximizes bandwidth while minimizing conflicts (a concrete sketch follows this list).
* **Asynchronous Operations and Latency Hiding:** By overlapping slower operations (such as accessing global memory) with calculations or other memory transfers, the system “hides” the latency of those slower operations.
* **Data Type and Precision Optimization:** Leveraging lower-precision data types (e.g., FP16 or BF16) to reduce memory bandwidth requirements and improve caching efficiency.
* **Compute and Instruction Optimization:** Boosting the efficiency of arithmetic calculations themselves by reducing the number of instructions or deploying specialized hardware instructions.
* **Parallelism and Occupancy Enhancement:** Maximizing the number of active threads on streaming multiprocessors (SMs) to better hide latency and improve overall throughput.
* **Control Flow and Loop Optimization:** Reducing the overhead associated with loops, branching, and index calculations.
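As a textbook illustration of the first category above (and emphatically not one of the generated kernels), the CUDA-C source below uses shared-memory tiling: each block stages sub-tiles of the inputs in fast on-chip memory instead of re-reading global memory for every output element. It could be compiled into PyTorch with load_inline just like the earlier sketch:

```python
# Assumes square N x N row-major matrices and TILE x TILE thread blocks.
tiled_matmul_src = r"""
#define TILE 32
__global__ void tiled_matmul(const float* __restrict__ A,
                             const float* __restrict__ B,
                             float* __restrict__ C, int N) {
    __shared__ float As[TILE][TILE];   // staged sub-tile of A
    __shared__ float Bs[TILE][TILE];   // staged sub-tile of B
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N; t += TILE) {
        // Cooperative, coalesced loads from global into shared memory (zero-pad at edges).
        As[threadIdx.y][threadIdx.x] =
            (row < N && t + threadIdx.x < N) ? A[row * N + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < N && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();
        // Each global element is now read once per tile rather than once per output.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N) C[row * N + col] = acc;
}
"""
```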
The Stanford team also presented a detailed collection of optimization paths. These reveal that not every individual optimization step yields a speedup; combining multiple steps, however, can substantially improve kernel efficiency, ultimately surpassing PyTorch.
Regarding the practical implementation, the researchers were asked whether the natural-language optimization suggestions were actually translated into corresponding code, or merely served to trigger random exploration.
The authors answered that, although they had not yet performed rigorous systematic verification, the generated CUDA implementations did match the proposed optimization suggestions in the cases they inspected manually.
In other words, the AI wasn’t simply getting lucky; it was genuinely attempting to implement its own proposed strategies.
**The Team Behind the Breakthrough**
This groundbreaking research was spearheaded by a trio of brilliant minds: Anne Ouyang, Azalia Mirhoseini, and Percy Liang.
Ouyang is a doctoral candidate at the Stanford AI Lab (SAIL). She earned her bachelor’s and master’s degrees from MIT and previously worked on NVIDIA’s cuDNN team.
Percy Liang is an Associate Professor of Computer Science at Stanford University and directs the Stanford Center for Research on Foundation Models (CRFM). He has also collaborated on research with Fei-Fei Li.
Azalia Mirhoseini is an Assistant Professor of Computer Science at Stanford and leads the Scaling Intelligence Lab. She previously worked at DeepMind, Google Brain, and Anthropic, and her earlier research includes mixture-of-experts (MoE) models and AlphaChip, a reinforcement-learning approach to chip design.
As noted above, the discovery came while the researchers were trying to create data for training kernel-generation models: they found that the synthetic generation process itself, run at test time, already produced high-performing kernels. Because these kernels exploit advanced optimizations and hardware features previously thought too complex to automate, the team chose to publish the findings in a blog post now, even though the data-generation method itself is not yet being released.
Most importantly, these techniques have already demonstrated significant capability.
The team also sees the discovery as consistent with recent trends, such as the view that large-scale retraining may not always be necessary. Careful search and branching strategies, backed by a verifier, can at times be what opens the door to scientific innovation on complex problems. Combined with strong reasoning abilities and aggressive parallel exploration of many candidate scenarios at once, the results look promising; much the same could be said of DeepMind’s AlphaEvolve and of Gemini 2.5 Pro itself.
The research team suggests there’s still plenty of room for improvement. They are currently working on two problems:
* FP16 Matmul: 52% of torch.matmul’s performance
* FP16 Flash Attention: 9% of torch.nn.functional.scaled_dot_product_attention’s performance
FP32 operations typically receive less optimization attention on recent hardware than FP16 or BF16, which is why beating PyTorch is easier for FP32 kernels.
The team believes that while limitations remain, the future looks bright. At the outset they couldn’t generate a functional flash attention kernel at all, but through relentless refinement of their search methodology they have steadily pushed its performance upward.
Notably, the entire search operation only consumed a modest 3 million tokens for input and 4 million for output.
**One More Thing**
In an interesting aside, Stanford isn’t the only group working on kernel large language models.
In May, Cognition, the developers of Devin, open-sourced Kevin-32B, the first large language model trained to write CUDA kernels through reinforcement learning. Kevin-32B builds on QwQ-32B and was trained on the KernelBench dataset using GRPO over multiple reinforcement learning rounds, outperforming o3 and o4-mini.
Original article by Tobias. If you wish to reprint this article, please indicate the source: https://aicnbc.com/1392.html