Chipmaker Nvidia announced Monday that its Spectrum-X networking technology has helped expand startup xAI’s Colossus supercomputer, now recognized as the largest AI training cluster in the world.
Located in Memphis, Tennessee, Colossus serves as the training ground for the third generation of Grok, xAI’s suite of large language models developed to power chatbot features for X Premium subscribers.
Colossus, completed in just 122 days, began training its first models 19 days after installation. Tech billionaire Elon Musk’s startup xAI plans to double the system’s capacity to 200,000 GPUs, Nvidia said in a statement on Monday.
At its core, Colossus is a giant interconnected system of GPUs, each specialized in processing large datasets. When Grok models are trained, they must analyze enormous amounts of text, images, and other data to improve their responses.
Touted by Musk as the most powerful AI training cluster in the world, Colossus connects 100,000 Nvidia Hopper GPUs over a unified Remote Direct Memory Access (RDMA) network. Nvidia’s Hopper GPUs handle complex tasks by splitting a workload across many chips and processing it in parallel.
That architecture lets data move directly between nodes, bypassing the operating system to deliver low latency and high throughput for large-scale AI training jobs.
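This design mirrors standard data-parallel training: every GPU holds a replica of the model, works on its own slice of each batch, and gradients are averaged over the interconnect before the weights update. The sketch below illustrates the pattern with PyTorch’s DistributedDataParallel, whose NCCL backend rides on RDMA transports (InfiniBand verbs or RoCE) where the fabric supports them. It is a generic illustration, not xAI’s training code, and the tiny model and hyperparameters are placeholders.

```python
# Generic data-parallel training loop (illustrative; not xAI's code).
# Each process drives one GPU; DDP averages gradients across all ranks
# during backward(), with NCCL moving the data over RDMA when available.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # GPU collective backend
    local_rank = int(os.environ["LOCAL_RANK"])   # set by the torchrun launcher
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real LLM layers tensor and pipeline parallelism
    # on top, but the gradient all-reduce below is the same core idea.
    model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank),
                device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)  # this rank's data shard
        loss = model(x).square().mean()
        loss.backward()        # gradients all-reduced across GPUs here
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, say, `torchrun --nproc_per_node=8 train.py`, the same script scales from one node to many; the quality of the network fabric then determines how much of each training step is spent waiting on that gradient exchange.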
While traditional Ethernet networks often suffer from congestion and packet loss, limiting throughput to 60%, Spectrum-X achieves 95% data throughput without latency degradation.
In other words, Spectrum-X lets vast numbers of GPUs communicate smoothly with one another, where a traditional network would get bogged down by the sheer volume of data.
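To put those percentages in perspective, here is a rough back-of-the-envelope comparison; the 400 Gb/s port speed and 1 TB payload are illustrative assumptions, not figures from Nvidia’s announcement.

```python
# Rough effect of 60% vs. 95% effective throughput on moving 1 TB of
# gradient or checkpoint data over one link (assumed numbers throughout).
LINK_GBPS = 400      # assumed per-port line rate, gigabits per second
DATA_GB = 1_000      # assumed payload: 1 TB

for fabric, efficiency in [("congested Ethernet", 0.60), ("Spectrum-X", 0.95)]:
    effective = LINK_GBPS * efficiency    # usable gigabits per second
    seconds = DATA_GB * 8 / effective     # gigabytes -> gigabits, then divide
    print(f"{fabric}: {effective:.0f} Gb/s effective, ~{seconds:.0f} s per TB")
```

At cluster scale that gap compounds across every gradient exchange, which is why fabric efficiency shows up directly in training time.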
The technology allows Grok to be trained faster and more accurately, which is essential for building AI models that respond effectively to human interactions.
Monday’s announcement had little effect on Nvidia’s stock, which dipped slightly. Shares traded at $141 as of Monday, giving the company a market cap of $3.45 trillion.
Edited by Sebastian Sinclair