Enterprise Strategy Group | Getting to the bigger truth.™

ESG WHITE PAPER

Enable AI at Scale with NVIDIA and Equinix

Embracing a Hybrid Cloud AI Future

By Mike Leone, ESG Senior Analyst
January 2022

This ESG White Paper was commissioned by NVIDIA and Equinix
and is distributed under license from ESG.

                                     © 2022 by The Enterprise Strategy Group, Inc. All Rights Reserved.

Contents

Introduction
AI Infrastructure Challenges
Rethinking a Cloud-first AI Approach
     The Impact of Data Gravity on Training and Inference
Hybrid Cloud AI to the Rescue
     Rise of Colocation AI Services
Follow the AI Leaders
NVIDIA LaunchPad with Equinix
The Bigger Truth
     Understand AI infrastructure requirements across all business units involved in AI development.
     Understand AI infrastructure requirements based on what is needed to support the entire lifecycle of AI—from prototyping and experimentation to production training at scale and inference at the edge.
     Understand the potential benefits of embracing data gravity by moving compute to where data resides.
     Understand the role the public cloud plays in the long-term success of AI.


Introduction
The impact AI is having on organizations is profound. Whether they are turning to AI to provide more predictive insights
into future scenarios/outcomes or developing AI-based products and services to capture new revenue opportunities,
businesses are continuing to emphasize the importance of AI adoption as a game changer for the modern business. ESG
research shows that 45% of organizations currently have AI projects in production using specialized infrastructure, with
another 39% at the piloting/POC stage, as organizations look for smarter and faster ways to gain value from data.1 While
the usage of AI offers potentially eye-opening benefits, challenges continue to arise that cause roadblocks, delays, and
outright failures in achieving AI success. Between infrastructure shortcomings throughout the AI lifecycle, an inability to
cost-effectively scale AI, and the increasing force of data gravity, even organizations that have seen early success by starting
their AI journeys in the public cloud are rethinking their cloud-first/cloud-only strategies. As organizations plan for AI to become more pervasive throughout the business, they are realizing that a hybrid cloud approach will be a requirement for achieving true AI success.

AI Infrastructure Challenges
One of the greatest challenges organizations face today in adopting AI stems from the infrastructure stack. Simply put, many CIOs don't have the right IT platform/infrastructure in place to satisfy AI workload requirements. Between inadequate processing power, storage capacity, and networking capabilities, and an inability to properly manage resource allocation, infrastructure readiness is proving to be a significant obstacle to keeping up with the performance and concurrency demands of diverse AI workloads. Those workloads include data analysis and experimentation, feature engineering, model training, model serving, and inference within a deployed application. Each of these workloads has different infrastructure requirements. Organizations are spending millions of dollars to stitch together AI components in a DIY fashion or turning to public cloud-based as-a-service offerings, but in both cases, the foundational infrastructure is rooted in general-purpose components. This is a major reason that 98% of recently surveyed AI adopters
identified or anticipated a weak component somewhere in their AI infrastructure stack (see Figure 1). More specifically, 86%
identified at least one of the following areas as a weak link: GPU processing, CPU processing, data storage, networking,
resource sharing, or integrated development environments.

Figure 1. Top 8 Weakest Links in the AI Infrastructure Stack

Which parts of the infrastructure stack do you believe are or will be the weakest links in your organization's ability to deliver an effective AI environment? (Percent of respondents, N=325, three responses accepted)

     Resource sharing: 26%
     Integrated development environment (IDE): 25%
     GPU processing: 25%
     CPU processing: 25%
     Data storage: 22%
     Databases: 22%
     Multi-tenancy: 21%
     Networking: 20%

Source: Enterprise Strategy Group

1 Source: ESG Master Survey Results, Supporting AI/ML Initiatives with a Modern Infrastructure Stack, May 2021. All ESG research references and charts in this white paper have been taken from this master survey results set unless otherwise noted.


Another interesting dimension of the infrastructure stack challenge is the diversity of personas that may require access to the system, from data-centric personas like data scientists and data engineers to application developers and IT staff responsible for resource allocation or maintenance. Availability of not only the system but also the tools, technologies, and underlying data creates several bottlenecks, all of which impact the time to value. Organizations will need to embrace new infrastructure purpose-built for diverse AI workloads and will need to carefully consider how to onboard such platforms, especially if their data center is not optimized for accelerated computing infrastructure or if they've moved away from data centers altogether. One example of an overlooked requirement is power and cooling: AI infrastructure consists of dense HPC hardware that requires reliable power and proper cooling that simply cannot be provided in every data center.

Rethinking a Cloud-first AI Approach
The public cloud creates a low barrier to entry for AI by providing an as-a-service model to satisfy short-term AI needs. End-users gain access to the right tools, technologies, and resources to get started with AI faster and more cost-effectively than anywhere else. And while it's appealing to have a controlled environment in which to experiment and learn the best ways to leverage data for an AI use case, challenges remain. As organizations experiment on their AI models in the cloud, model complexity, increased compute and storage requirements, and exponential data growth introduce rapidly escalating costs for tightly budgeted organizations. So, while it may be easy to scale cloud AI deployments quickly, cost becomes a deterrent, forcing organizations to make tradeoffs in how they deploy AI and deliver AI-specific resources to a growing number of stakeholders. According to ESG research, this is driving the repatriation of AI workloads, with organizations citing areas such as an inability to meet scalability/elasticity expectations, poor or unpredictable performance, and high cost as drivers for repatriation. In fact, 57% of IT organizations have repatriated workloads (inclusive of AI workloads) from the public cloud back to on-premises environments.2
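The cost tipping point described above can be made concrete with a simple break-even calculation. The sketch below is purely illustrative: the cloud rate and colocation figure are assumed numbers, not quoted prices from any provider.

```python
# Hypothetical break-even sketch: at what monthly GPU-hour volume does a
# fixed-cost private deployment undercut pay-per-use cloud pricing?
# All figures are illustrative assumptions, not real vendor prices.

CLOUD_RATE_PER_GPU_HOUR = 3.00      # assumed on-demand cloud price ($/GPU-hour)
COLO_FIXED_COST_PER_MONTH = 12_000  # assumed amortized colo cost (hardware, space, power)

def cloud_cost(gpu_hours_per_month: float) -> float:
    """Pay-per-use cloud spend scales linearly with usage."""
    return CLOUD_RATE_PER_GPU_HOUR * gpu_hours_per_month

def breakeven_gpu_hours() -> float:
    """Usage level above which the fixed-cost deployment becomes cheaper."""
    return COLO_FIXED_COST_PER_MONTH / CLOUD_RATE_PER_GPU_HOUR

print(f"Break-even: {breakeven_gpu_hours():,.0f} GPU-hours/month")  # → 4,000
```

Below the break-even point, elastic cloud pricing wins; above it, sustained AI workloads make a fixed-cost footprint the more economical choice, which is the dynamic driving the repatriation figures ESG reports.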

The Impact of Data Gravity on Training and Inference
Driving the repatriation of AI workloads is the idea of data gravity. Data gravity is the ability of a large data set to attract
applications, processing power, services, and other data. The force of gravity, in this context, can be thought of as the way
these other entities are drawn to data relative to its mass.

Data gravity is particularly challenging in AI training efforts, which often require high-performance, and therefore higher-cost, compute, storage, and networking. Knowing that AI training is fueled by massive volumes of data, workflows where data must be moved from one location to another, such as from a private deployment to a public cloud environment, require a significant amount of time and effort. And this is before factoring in the tangible cost of processing the data using general-purpose compute. As organizations look to offset the inefficiencies, time delays, and increased costs of moving TBs of data into a public cloud environment, embracing the idea of moving compute to where data is generated or stored is on the rise. This is also known as training where the data lands. In fact, when it comes to training, 67% of organizations have embraced training models on-premises, whether in a data center or at an edge location.
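A back-of-the-envelope transfer-time estimate shows why moving TBs of training data is costly in time alone. The link speed and efficiency figures below are assumptions chosen for illustration.

```python
# Rough sketch of data gravity in practice: time to move a training data set
# into a public cloud over a network link. All figures are assumptions.

def transfer_hours(dataset_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours to move `dataset_tb` terabytes over a `link_gbps` link that
    sustains `efficiency` of its nominal throughput."""
    bits = dataset_tb * 1e12 * 8                     # decimal TB -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)  # bits / effective bps
    return seconds / 3600

# Moving 100 TB over a 10 Gbps link sustaining 80% throughput:
print(f"{transfer_hours(100, 10):.1f} hours")  # → 27.8 hours
```

At this assumed rate, every retraining cycle that starts with a bulk upload adds more than a day of wall-clock delay, which is the inefficiency that training where the data lands avoids.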

2 Source: ESG Master Survey Results, 2021 Data Infrastructure Trends, September 2021.


While training is more focused on the development of algorithms and processing large volumes of data to ensure high
accuracy, enterprises must then deploy algorithms to production environments for inferencing. Inferencing has a different
set of requirements than training in that it does not require heavy processing resources, but for many use cases, it does
require fast execution of data analysis, returning a result in as close to real time as possible. This is a major reason why
organizations continue to search for ways to implement inferencing in line with the incoming data flow. As organizations
struggle to deploy developed algorithms to production environments, the fact that these environments are increasingly
falling outside of the public cloud is adding to complexity. This means that organizations will be looking to deploy models
directly on an edge device or at an edge aggregation point. And while aggregating data to a central location may introduce
unnecessary latency, several use cases require multiple data sources for inferencing, forcing organizations to architect a
well-networked and high-performance edge aggregation point.
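The case for placing inference close to the data can be framed as a latency budget. The sketch below uses assumed round-trip, queueing, and inference times, not measured values, to show how the network hop dominates when serving from a distant central site.

```python
# Illustrative latency-budget sketch for near-real-time inference.
# All millisecond figures are assumptions for comparison, not measurements.

def end_to_end_ms(network_rtt_ms: float, queue_ms: float, inference_ms: float) -> float:
    """Total response time the application sees: network + queueing + model compute."""
    return network_rtt_ms + queue_ms + inference_ms

# Assumed placements: a distant central site vs. a nearby edge aggregation point.
central = end_to_end_ms(network_rtt_ms=80.0, queue_ms=5.0, inference_ms=10.0)
edge = end_to_end_ms(network_rtt_ms=5.0, queue_ms=5.0, inference_ms=10.0)

print(f"central: {central:.0f} ms, edge: {edge:.0f} ms")  # → central: 95 ms, edge: 20 ms
```

Under these assumptions, a 100 ms SLA leaves almost no headroom for the central placement, while the edge aggregation point leaves ample budget for data aggregation from multiple sources.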

Hybrid Cloud AI to the Rescue
The public cloud will continue to enable organizations to ramp up AI initiatives quickly, especially for early adopters that
do not have immediate access to critical AI infrastructure components. However, when the AI cloud tipping point is
reached because model complexity, scale, or data gravity has dictated a need for a private AI footprint, the availability of
private AI infrastructure resources will be essential. Organizations will require a fixed-cost infrastructure with right-sized
resources that supports the diversity of AI workloads, from experimentation and rapid model training at scale to multi-
environment deployment and management. This is driving the need for a hybrid AI architecture to set organizations up for
AI success. In fact, when ESG asked organizations about the most important considerations of infrastructure solutions used
to support their AI initiatives throughout the AI lifecycle, the top response was hybrid cloud capabilities (see Figure 2).

Figure 2. Most Important AI Lifecycle Infrastructure Considerations

For all aspects of the AI lifecycle, which of the following are–or likely will be–the most important in your organization's consideration of infrastructure solution(s) used to support its AI initiatives? (Percent of respondents, N=325, multiple responses accepted)

     Hybrid/multi-cloud capability: 18%
     Data security/governance: 17%
     Maximizing hardware/infrastructure utilization: 16%
     Integrated development environment (IDE): 15%
     Model management and monitoring: 15%
     Integration with GPU: 14%
     Data durability/high availability: 14%
     Speed of deployment/provisioning: 14%
     Management simplicity: 14%
     Lowest possible latency: 13%
     Data movement: 13%
     Data traceability: 13%

Source: Enterprise Strategy Group


Organizations are recognizing that hybrid cloud AI enables them to avoid common failure points in accelerating AI development and effectively operationalizing AI. Knowing that organizations do not want to be mired in a proof-of-concept stage across a mix of siloed projects, all with escalating cloud costs, hybrid AI is helping address the amount of "model debt" companies incur by closing the widening gap between developed models and deployed models. Hybrid cloud AI enables organizations to embrace an end-to-end platform on which enterprises can efficiently traverse the AI lifecycle, from development to deployment of AI applications, ensuring reasonable ROI as AI scales throughout the business.

Rise of Colocation AI Services
Interest in colocation services is on the rise as a path to hybrid cloud AI success, as organizations look for solutions that let them embrace data gravity, bring the compute and software stack to where massive data sets reside, and eliminate the barriers to achieving the deterministic performance of on-premises AI systems. Colocation service offerings can help businesses overcome a lack of AI-optimized facilities and infrastructure by providing the right AI infrastructure resources to support all the workloads throughout the AI lifecycle. Low-latency connectivity to all major cloud providers, improved data locality that keeps data either inside or near colo facilities, and the availability of right-sized resources based on AI workload demand enable colo service offerings to improve time to insight while enabling effortless mobility of AI development workloads. For use cases where latency is critical, deploying inference infrastructure at macro-edge locations can also serve as an elegant solution to ensure AI SLAs are consistently met. For AI use cases where data aggregation is required, an interconnected colocation facility can offer faster access to multiple data sources and therefore enrich AI development.

Follow the AI Leaders
Forward-leaning enterprises are achieving AI success by moving compute to where the data lives, overcoming the impact of data gravity and eliminating escalating I/O costs while enabling an affordable compute cost model that lowers the barrier to AI entry. AI leaders are looking to colocation facilities that provide access to modern AI infrastructure and a fully
optimized platform that supports the end-to-end AI lifecycle, from development to deployment, and all AI workloads,
including analytics, training, and inference. Powerful compute nodes, scalable compute clusters, high-performance
storage, and right-sized resource availability are being leveraged to deliver deterministic performance. To offset AI
operational burdens and shadow AI silos commonly experienced by IT staff tasked with AI resource delivery, leaders are
looking to new operating models that will empower IT to consolidate operational AI silos, simplify capacity planning, and
ensure resources are optimally delivered based on AI workload requirements. Leaders are embracing hybrid AI
architectures optimized for cost-effective AI development, yielding higher levels of efficiency, as well as faster
experimentation and traversal of the iterative AI lifecycle.

NVIDIA LaunchPad with Equinix
NVIDIA and Equinix recognize the power of leveraging the best AI hardware and software infrastructure in a seamless, easy,
and cost-effective way. Together they are building an AI ecosystem of technology providers, ISVs, tool developers, data
brokers, and network providers all with a goal of democratizing AI. To deliver a complete development-to-deployment AI
infrastructure solution, NVIDIA and Equinix have partnered to deliver the NVIDIA LaunchPad solution on Platform Equinix.
NVIDIA LaunchPad is a free service for enterprise customers to try NVIDIA AI. With NVIDIA LaunchPad, enterprises can get
immediate, short-term access to NVIDIA AI running on private accelerated compute infrastructure to power critical AI
initiatives. As organizations gain experience and see success, enterprises can move to a consumption-based model on a
subscription basis by deploying and scaling their AI infrastructure within an Equinix data center.


With NVIDIA and Equinix, organizations gain access to an end-to-end solution that provides core AI infrastructure for model training with NVIDIA Base Command™ as well as inference and edge AI infrastructure with mainstream NVIDIA-Certified Systems enabled by the Equinix Metal service. Foundational to the Base Command offering is NVIDIA DGX Foundry, a high-performance AI training infrastructure based on NVIDIA DGX SuperPOD™, comprising NVIDIA DGX™ A100 systems, NetApp storage, and NVIDIA networking. Each DGX A100 integrates eight NVIDIA A100 Tensor Core GPUs and two 2nd Gen AMD EPYC™ processors, powered by a full-stack AI-optimized architecture purpose-built for the unique demands of AI workloads, from analytics and experimentation to training and inference. NVIDIA DGX systems are optimized at every layer for delivering the fastest time-to-solution on the most complex AI workloads. AI researchers and innovators don't have to waste time integrating, troubleshooting, and supporting hardware and software. Data scientists can confidently utilize resources across their end-to-end workflows, from development to training at scale.

The Equinix Fabric provides high-speed and secure connectivity between these distributed training and inference locations. The software-defined interconnection services provide fast and secure data transfer from distributed data sources to the NVIDIA AI model training stack. The same private interconnection solution also enables the transfer of newly developed AI models to the NVIDIA AI edge infrastructure at Equinix. Enterprises can deploy their AI training and edge infrastructure on Platform Equinix in more than 64 metro markets across more than 26 countries on five continents. All these distributed Equinix sites are interconnected via Equinix Fabric's high-speed, low-latency, and secure virtual connections. And most metros have been verified by NVIDIA to meet the power and cooling requirements of next-generation AI hardware.

In addition to providing the AI compute, network, and storage infrastructure, NVIDIA LaunchPad provides the necessary software-based orchestration services to move data and AI models between the distributed sites in a seamless manner using cloud technologies. Customers can manage their AI development workflow with NVIDIA Base Command™, NVIDIA Fleet Command™, and the NVIDIA AI Enterprise suite, which provides easy, secure management and deployment of AI at the edge. The Equinix infrastructure deploys in minutes, providing enterprises with immediate access to an entire spectrum of NVIDIA resources that support virtually every aspect of AI, from data center training and inference to full-scale deployment at the edge.


The Bigger Truth
As organizations look to embrace AI throughout the business, they are becoming increasingly aware of the challenges
preventing greater success. Distributed data sets, data gravity, operational silos, and the need for right-sized access to
powerful, yet cost-effective infrastructure are forcing organizations to make tradeoffs in how they best leverage AI
infrastructure solutions and capabilities across hybrid environments.

Together, NVIDIA and Equinix are looking to help organizations embrace AI at scale using a colocation model that removes the ongoing data center capital burden, eliminates data center redesign, minimizes AI operational silos, and enables teams to benefit from the best of both worlds: the simplicity of the cloud with the deterministic performance needed to support production AI workloads at scale.

Organizations looking to transform the business by scaling the use of AI can set themselves up for success by considering
the following areas and questions that should be asked of key stakeholders.

Understand AI infrastructure requirements across all business units involved in AI development.
  • What initiatives are driving the deployment of AI development environments?
  • Where are AI development deployments emerging?
  • Who is running them?

Understand AI infrastructure requirements based on what is needed to support the entire lifecycle of AI—
from prototyping and experimentation to production training at scale and inference at the edge.
  • Where does underlying data that is leveraged to support AI initiatives reside?
  • How is the AI data pipeline constructed?
  • How much of the data workflow is dependent on data movement?
  • Is inline inference required, for latency or data volume reasons?
  • What role does/should the edge play in the AI lifecycle?

Understand the potential benefits of embracing data gravity by moving compute to where data resides.
  • Where is training done today?
  • How rapidly are datasets supporting model development growing?
  • How does data movement impact the timeliness of training results?
  • How much cost goes into the support of dataset I/O, data movement, data hosting, and compute/storage resources?
  • Is data aggregation optimized for inference and training?

Understand the role the public cloud plays in the long-term success of AI.
  • Where are you leveraging the public cloud today in support of AI initiatives?
  • What are your current cloud costs in association with AI workloads?
  • Has the public cloud become the single hammer for every nail in your enterprise?


All trademark names are property of their respective companies. Information contained in this publication has been obtained from sources The Enterprise Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of The Enterprise Strategy Group, Inc., is in violation of U.S. copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions, please contact ESG Client Relations at 508.482.0188.

                          Enterprise Strategy Group is an IT analyst, research, validation, and strategy firm that provides market
                          intelligence and actionable insight to the global IT community.

www.esg-global.com | contact@esg-global.com | 508.482.0188

                                                    © 2022 by The Enterprise Strategy Group, Inc. All Rights Reserved.