NVIDIA supercharged the AI industry with the launch of its revolutionary H100 chips in 2022, which have since added more than $1 trillion to the company’s value. Demand surged instantly, and companies rushed to buy the latest technology, in turn creating waiting times of up to six months. Once those investments were made, the next competition was to deploy the chips into active use, sparking an AI race unlike any seen before.
We at Northern Data Group were among the first in Europe to invest in NVIDIA H100 GPUs and to provide innovators with access to this incredible compute power via our AI cloud platform, Taiga Cloud. This €400M investment was announced in September 2023 and was quickly followed by an additional €330M investment to power Europe’s largest Generative AI Cloud.
By Q2 2024, Taiga Cloud will have deployed more than 10,000 H100 and H800 GPUs via our clean, secure, and compliant GPU network – but how did we get to this stage?
Getting started with the right network
The first and most important step is a strong network. If your network is not up to standard, you will inevitably run into issues when scaling your services. It needs to stay reliable no matter the level of demand. Our customers want to get the very best out of the latest hardware, so every technical decision we make about our cloud architecture and orchestration reflects this.
We have invested significantly in our networking capabilities. Our ultra-fast GPU network is powered by best-in-class hardware, optimized for performance. It can also scale as far as needed without blocking factors, spinning capacity up and down quickly via our portal, so the network remains non-blocking and users can get on with other tasks. Each of our customers therefore gets a tailored service catering to their unique needs.
Our internal Ethernet operates the same way, meaning all of our networks are non-blocking. Each server has two network interface controllers (NICs) with two links each: one BlueField DPU running DOCA HBN software for communications, and one BlueField DPU running Mellanox SNAP for storage.
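A network is "non-blocking" when each switch's server-facing bandwidth does not exceed its fabric-facing bandwidth, i.e. a 1:1 oversubscription ratio. The sketch below illustrates that check; the port counts and link speeds are illustrative assumptions, not Taiga Cloud's actual specs.

```python
# Sketch of a non-blocking (1:1 oversubscription) check for a leaf switch.
# Port counts and link speeds below are assumed for illustration only.

def oversubscription(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of server-facing to spine-facing bandwidth on a leaf switch.
    A ratio of 1.0 or lower means the fabric is non-blocking."""
    return downlink_gbps / uplink_gbps

# Assumed leaf: 32 server-facing ports and 32 spine-facing ports, all 400 Gb/s.
server_ports, spine_ports, link_speed = 32, 32, 400
ratio = oversubscription(server_ports * link_speed, spine_ports * link_speed)
print(f"Oversubscription ratio: {ratio:.1f}")  # 1.0 -> non-blocking
```

Anything above 1.0 (e.g. 48 server ports feeding 16 uplinks) means traffic can contend for fabric bandwidth under full load.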
Meeting the right standards
Taiga Cloud has always leveraged state-of-the-art cloud architecture for all of our Generative AI solutions, and the H100s were no different. Much of our existing approach to orchestration carried over from our work with A100s; however, we reworked and upgraded our data halls to accommodate the new H100s. It was a rigorous process, but one that ensured our infrastructure matched the high quality our customers expect.
When it came to deploying the H100s, we built on our experience with the A100s. We opted for an island configuration with 2,048 GPUs per island, which maximized performance while keeping the cost and complexity of the InfiniBand fabric down. Once set up, NVIDIA reviewed and approved our InfiniBand orchestration approach – something we’re very proud of.
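As a rough sketch of what the island sizing above implies: at 2,048 GPUs per island, the fleet of 10,000+ GPUs works out to a handful of islands. The 8-GPUs-per-server figure below is an assumption (a typical HGX H100 chassis), not a number stated in this article.

```python
# Back-of-the-envelope island sizing. GPUS_PER_ISLAND comes from the article;
# GPUS_PER_SERVER is an assumed HGX-style 8-GPU chassis.
import math

GPUS_PER_ISLAND = 2048   # per the island configuration described above
GPUS_PER_SERVER = 8      # assumption: typical HGX H100 server
FLEET_SIZE = 10_000      # "more than 10,000" GPUs by Q2 2024

servers_per_island = GPUS_PER_ISLAND // GPUS_PER_SERVER
islands_for_fleet = math.ceil(FLEET_SIZE / GPUS_PER_ISLAND)

print(f"{servers_per_island} servers per island, "
      f"~{islands_for_fleet} islands for a {FLEET_SIZE:,}-GPU fleet")
```

Keeping each InfiniBand fabric scoped to one island is what bounds the switching tiers and cabling each fabric needs.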
We also leverage cross-team coordination on this front. The infrastructure division works closely with the development and front-end teams to deliver the H100s. We also have an in-house group of AI experts who act as our own ‘customers’. Having a dedicated AI team means we can see firsthand how our clusters are performing. They fine-tune our infrastructure and benchmark our speed and performance against other providers to ensure we’re always offering a competitive service. We intend to grow this division as we scale our services on top of our current offering.
Powering up
When adopting new hardware, you must make sure the decision fits your wider business strategy. For us, that means putting sustainability at the forefront. We only use data centers with access to renewable energy and a low Power Usage Effectiveness (PUE). The industry-standard PUE is 1.55; we always commit to 1.2 or lower.
Fortunately, Taiga Cloud has a partnership with our sister company, Ardent Data Centers – Northern Data Group’s cutting-edge colocation provider. This means we have easy access to facilities that meet our performance optimization and energy efficiency standards. For example, our data center in Boden, Sweden, runs on 100% carbon-free hydropower, with a PUE of 1.06. Ardent has housed a portion of our H100 GPU islands in its state-of-the-art, liquid-cooled data center since December 2023.
Manpower is another challenge. You might think that when deploying such in-demand technology at speed, it’s optimal to onboard a larger team, but this is in fact not the case. A large, hastily hired team can lead to inefficiencies and discrepancies in the way the technology is handled, approached, and maintained. Our team is tight-knit, with a very specific set of complementary skills that suit our clients’ needs, and when onboarding we are always vigilant to ensure we never dilute that. For us, part of the solution has also been to scale our capabilities and meet demand through automation.
The overarching theme that has helped us deploy our NVIDIA H100s at pace has been an ongoing commitment to providing even our smallest customers with a solution that fits their business. We haven’t tried to implement a ‘one size fits all’ approach, and this agility has helped us deliver a tailored client experience that turns our work with customers into a partnership. Our customers know that we understand their ambitions and needs down to the smallest detail, creating an unparalleled quality of experience that ultimately contributes to the success of our deployment – that, and the highly skilled, diversified teams we have working on the ground.
You can register for access to Taiga Cloud’s NVIDIA H100s now.