Tech giants Meta and Oracle are significantly upgrading their AI data centers, leveraging NVIDIA’s Spectrum-X Ethernet networking switches – a technology specifically engineered to meet the escalating demands of large-scale AI systems. This strategic adoption of Spectrum-X reflects a broader industry trend toward open networking frameworks, aimed at dramatically improving AI training efficiency and accelerating deployment across massive, interconnected compute clusters.
NVIDIA CEO Jensen Huang has characterized these advancements as transforming data centers into “giga-scale AI factories,” with Spectrum-X serving as the critical “nervous system” that interconnects millions of GPUs. This infrastructure is essential for training the increasingly complex AI models that are driving innovation across various sectors.
Oracle’s deployment of Spectrum-X Ethernet within the NVIDIA Vera Rubin architecture signals a substantial commitment to building these large-scale AI factories. Mahesh Thiagarajan, Executive Vice President at Oracle Cloud Infrastructure, emphasizes that the new configuration will let Oracle connect millions of GPUs more efficiently, significantly accelerating the training and deployment of next-generation AI models for its customers. The move positions Oracle to compete more effectively in the rapidly expanding AI cloud services market.
Meta is expanding its AI infrastructure by integrating Spectrum-X Ethernet switches into its Facebook Open Switching System (FBOSS), the company’s in-house platform for managing network switches at scale. Gaya Nagarajan, Meta’s Vice President of Networking Engineering, underscores the need for an open, efficient network to support increasingly sophisticated AI models and to deliver seamless services to billions of users. The integration is central to Meta’s ongoing investment in AI across its social media platforms and metaverse initiatives.
Building Flexible AI Systems
As data centers become more intricate and demanding, flexibility is pivotal, according to Joe DeLaere, who leads NVIDIA’s Accelerated Computing Solution Portfolio for Data Center. He highlighted that NVIDIA’s MGX system, with its modular, building-block design, offers partners the flexibility to combine a diverse range of CPUs, GPUs, storage solutions, and networking components according to specific needs.
The MGX system is designed for interoperability, allowing organizations to keep a consistent design across hardware generations. This yields greater flexibility, faster time to market, and a measure of future-proofing for organizations investing in leading-edge technology.
With the growing size and complexity of AI models, power efficiency has become a central challenge for the sustainability of data centers. NVIDIA is addressing this issue “from chip to grid” to increase energy efficiency and scalability. This involves collaborative work with power and cooling vendors to achieve maximum performance per watt.
A notable example is the shift to 800-volt DC power delivery, designed to minimize heat loss and improve overall efficiency. NVIDIA is also introducing power-smoothing technology to reduce grid stress: it lowers peak power requirements by as much as 30 percent, freeing room for greater computational capacity within the same power footprint.
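As a rough illustration of why a 30 percent cut in peak power translates into more compute, the sketch below works through the capacity arithmetic; the grid allocation and per-rack figures are assumptions invented for the example, not NVIDIA’s numbers.

```python
# Back-of-envelope: how much extra compute fits in a fixed grid
# allocation when power smoothing cuts peak draw by 30 percent.
# All figures are illustrative assumptions.

GRID_ALLOCATION_MW = 100   # assumed fixed utility feed for the site
RACK_PEAK_KW = 120         # assumed unsmoothed peak draw per rack

racks_unsmoothed = (GRID_ALLOCATION_MW * 1000) // RACK_PEAK_KW

# Power smoothing shaves synchronized peaks (e.g. between training
# steps) by up to 30%, so each rack reserves less headroom.
smoothed_peak_kw = RACK_PEAK_KW * (1 - 0.30)
racks_smoothed = (GRID_ALLOCATION_MW * 1000) // smoothed_peak_kw

print(f"racks without smoothing: {racks_unsmoothed:.0f}")
print(f"racks with smoothing:    {racks_smoothed:.0f}")  # ~43% more
```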
Scaling Up, Out, and Across
NVIDIA’s MGX system is central to data center scalability. According to Gilad Shainer, Senior Vice President of Networking at NVIDIA, MGX racks can host both compute and switching components, supporting NVLink for scale-up connectivity and Spectrum-X Ethernet for scale-out growth.
MGX also supports interconnecting multiple AI data centers as a unified system, enabling the massive distributed training operations that companies such as Meta require. Depending on the distance involved, sites can be linked using dark fiber or additional MGX-based switches, ensuring high-speed connections across regions. This cross-regional flexibility is essential for maintaining consistent performance in distributed AI operations.
Meta’s decision to adopt Spectrum-X highlights the growing importance of open networking in building scalable, adaptable AI infrastructure. Shainer confirmed that Meta will use FBOSS as its network operating system, and noted that Spectrum-X supports several other network operating systems, including Cumulus and SONiC, as well as Cisco’s NOS through specific partnerships. This flexibility lets hyperscalers and enterprises standardize their infrastructure on the systems best suited to their environments.
Expanding the AI Ecosystem
NVIDIA positions Spectrum-X as a way to make AI infrastructure more accessible and efficient. Shainer noted that the Ethernet platform was created specifically for AI workloads, both training and inference, and delivers as much as 95 percent effective bandwidth, significantly outperforming traditional Ethernet.
Strategic partnerships with companies including Cisco, xAI, Meta, and Oracle Cloud Infrastructure are helping bring Spectrum-X into a broader range of environments, from the largest hyperscalers to small and medium enterprises.
Preparing for Vera Rubin and Beyond
NVIDIA has said its upcoming Vera Rubin architecture is slated for commercial availability in the second half of 2026, with the Rubin CPX product arriving by the end of that year. Rubin and Rubin CPX will work together with Spectrum-X networking and MGX systems to support next-generation AI factories.
Shainer clarified that Spectrum-X and XGS use the same hardware but run different algorithms optimized for different distances: Spectrum-X inside the data center, and XGS for communication between data centers. This approach reduces latency issues and allows multiple sites to work together as a single massive AI supercomputer.
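NVIDIA has not published the algorithms themselves, but the split can be pictured as one platform loading different congestion-control profiles depending on link reach. The sketch below is hypothetical: the profile fields, thresholds, and the select_profile function are invented for illustration.

```python
# Hypothetical illustration of "same hardware, two algorithm profiles":
# a port facing links inside the data center gets a Spectrum-X-style
# tuning, while a long-haul port between sites gets an XGS-style one.
# Thresholds and profile contents are invented for this example.

def select_profile(link_km: float) -> dict:
    if link_km < 1.0:
        # Intra-data-center: microsecond RTTs, react aggressively.
        return {"name": "spectrum-x", "rtt_budget_us": 10,
                "buffer_target": "shallow"}
    # Inter-data-center: fiber adds roughly 5 us of one-way latency
    # per km, so pacing must tolerate far more data in flight.
    return {"name": "xgs", "rtt_budget_us": int(link_km * 5 * 2),
            "buffer_target": "deep"}

print(select_profile(0.2))    # leaf-spine link inside the hall
print(select_profile(80.0))   # metro link to a sister site
```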
Collaborating Across the Power Chain
NVIDIA is collaborating with a number of partners to support the 800-volt DC transition from the silicon up to the power grid: Onsemi and Infineon on power components; Flex, Delta, and Lite-On at the rack level; and Siemens and Schneider Electric on data center designs. NVIDIA plans to release a technical white paper detailing the approach during the OCP Summit.
DeLaere described the concept as a “holistic design from silicon to power delivery” that ensures all systems work together effectively in the high-density AI environments that companies such as Oracle and Meta operate.
Performance Advantages for Hyperscalers
Spectrum-X Ethernet was designed specifically for distributed computing and AI workloads. Shainer said it offers adaptive routing and telemetry-based congestion control to eliminate network hotspots and deliver steady performance. These features are intended to speed up both training and inference, and to let multiple workloads run concurrently without interfering with one another.
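To make the idea concrete, here is a minimal sketch of telemetry-driven adaptive routing, steering each flowlet to the least-congested of several equal-cost paths. The Path structure, queue-depth metric, and pick_path function are hypothetical stand-ins; Spectrum-X’s actual mechanisms are not public.

```python
# Illustrative sketch of telemetry-based adaptive routing: each flowlet
# is steered to the least-congested of several equal-cost paths, using
# live queue-depth telemetry instead of a static hash.

from dataclasses import dataclass

@dataclass
class Path:
    name: str
    queue_depth: int  # telemetry: packets currently queued on this path

def pick_path(paths: list[Path]) -> Path:
    """Adaptive routing: send the next flowlet down the path whose
    telemetry reports the shallowest queue, avoiding hotspots."""
    return min(paths, key=lambda p: p.queue_depth)

def send_flowlet(paths: list[Path], size: int) -> None:
    path = pick_path(paths)
    path.queue_depth += size  # update telemetry after enqueueing
    print(f"flowlet ({size} pkts) -> {path.name}")

# Four equal-cost spine paths with uneven load; traffic migrates toward
# the idle ones instead of colliding on a single hashed path.
spines = [Path("spine-0", 12), Path("spine-1", 3),
          Path("spine-2", 40), Path("spine-3", 3)]
for size in (8, 8, 8, 8):
    send_flowlet(spines, size)
```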
Shainer added that Spectrum-X is the only Ethernet technology certified to scale at extreme levels, helping organizations extract maximum performance and the best return on their GPU investments. For hyperscalers such as Meta, this scalability is critical for handling growing AI training demands while keeping infrastructure efficient.
Hardware and Software Working Together
While NVIDIA’s focus tends to be on hardware, DeLaere stressed that software optimization is equally vital. NVIDIA continues to refine performance through co-design, aligning software and hardware development to maximize AI system performance.
NVIDIA is investing in FP4 kernels, frameworks such as Dynamo and TensorRT-LLM, and algorithms such as speculative decoding to increase throughput and model performance. These improvements are intended to help systems such as Blackwell deliver better results over time for hyperscalers such as Meta that depend on sustained AI performance.
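Speculative decoding itself is a published technique: a small draft model proposes several tokens and the large target model verifies them together, rather than generating one token at a time. Below is a minimal sketch of the greedy-acceptance variant with toy stand-in models; draft_model and target_model are placeholders invented for the example, not real NVIDIA components.

```python
import random

# Toy stand-ins for a small draft LLM and a large target LLM. Both are
# placeholders: they deterministically map a context to a "next token".
def draft_model(context):
    random.seed(hash(context) % 1000)
    return random.randrange(5)

def target_model(context):
    random.seed(hash(context) % 997)
    return random.randrange(5)

def speculative_decode(context, k=4, steps=3):
    """Greedy speculative decoding: the draft proposes k tokens, the
    target checks them, and we keep the longest agreeing prefix plus
    one corrected token whenever the models diverge."""
    out = list(context)
    for _ in range(steps):
        # 1. Draft proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_model(tuple(out + proposal)))
        # 2. Verify against the target. In a real system all k positions
        #    are scored in one batched forward pass (the source of the
        #    speedup); here the toy model is simply called per position.
        accepted = []
        for tok in proposal:
            expected = target_model(tuple(out + accepted))
            if tok == expected:
                accepted.append(tok)       # draft agreed with target
            else:
                accepted.append(expected)  # take target's token, stop
                break
        out.extend(accepted)
    return out

print(speculative_decode((1, 2, 3)))
```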
Networking for the Trillion-Parameter Era
The Spectrum-X platform, which includes Ethernet switches and SuperNICs, is NVIDIA’s first Ethernet system purpose-built for AI workloads. It is designed to connect millions of GPUs efficiently while maintaining predictable performance across AI data centers.
Spectrum-X uses congestion-control technology to achieve up to 95 percent data throughput, a marked advance over standard Ethernet, which often reaches only about 60 percent due to flow collisions. Spectrum-X also enables long-distance connections between AI data centers, allowing separate facilities to be linked into unified “AI super factories.”
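For communication-bound training, that throughput gap feeds directly into step time. The back-of-envelope sketch below applies the two efficiency figures quoted above; the link speed and per-step traffic volume are illustrative assumptions.

```python
# Back-of-envelope comparison of the effective-throughput figures
# quoted above. The link speed and data volume are illustrative
# assumptions, not measured values.

LINK_GBPS = 400        # assumed per-GPU Ethernet link (Gbit/s)
PAYLOAD_GBITS = 8_000  # assumed gradient traffic per training step

def transfer_seconds(efficiency):
    """Time to move the payload at a given effective-bandwidth fraction."""
    return PAYLOAD_GBITS / (LINK_GBPS * efficiency)

t_standard = transfer_seconds(0.60)  # ~60% effective: flow collisions
t_spectrum = transfer_seconds(0.95)  # ~95% effective: congestion control

print(f"standard Ethernet : {t_standard:.1f} s per step")
print(f"Spectrum-X        : {t_spectrum:.1f} s per step")
print(f"speedup on comms  : {t_standard / t_spectrum:.2f}x")
```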
By interconnecting NVIDIA’s full technology stack, including CPUs, GPUs, NVLink, and software, Spectrum-X delivers the consistent performance required to support trillion-parameter models and the next generation of generative AI workloads.