TOP 6 DATA SCIENCE AND ANALYTICS TRENDS FOR 2021 - How the Data Cloud accelerates machine learning

Data science has evolved dramatically over the last 10 years. However, very few organizations have experienced the full business impact or competitive advantage from their advanced analytics, despite significant investments in data science and machine learning (ML). The reason? Many of the tools needed to scale ML are too complicated, and necessary skill sets are in short supply. But change is now afoot. Technology advancements in 2021 will significantly impact the way in which data scientists and data analysts work. In 2021, six trends have the potential to accelerate ML and move organizations from descriptive and diagnostic analytics (explaining what happened and why) towards predictive and prescriptive analytics that forecast what will happen and also provide powerful pointers on how to change the future.

In this ebook, you will learn how:
• Easy-to-use ML tools and consolidated data platforms empower data analysts and bridge the gap between analytics and ML
• Snowflake’s Data Cloud can expand data access and data sharing through a secure ecosystem with access to ready-to-use third-party data
• Data engineering tools remove the burden of data prep for data scientists and make repurposing existing work easy
• New distributed training frameworks offer an alternative superior to Spark while delivering up to 2,000x faster processing
• Rapid advancements in ML libraries, tools, and frameworks demonstrate the need for a solution that future-proofs data science and ML investments

In 2021, the field of data science is poised to finally live up to the high expectations that many organizations have held for years. Over the last 10 years, huge investments have been made in data science and ML, guided by the hope that it would transform the way companies do business. However, many organizations continue to feel challenged to drive real business impact with analytics.

Recent studies lend credence to this common sentiment. A report from MIT Sloan Management Review and Boston Consulting Group found that only 10% of organizations are seeing significant financial benefits from their investments in AI,1 and VentureBeat AI reported that 87% of data science projects never make it into production.2 And in 2019, Gartner pointed to one of the bigger challenges organizations face: “Through 2020, 80% of AI projects will remain alchemy, run by wizards whose talents will not scale in the organization.”3

Organizations invest in data science because it promises to bring competitive advantages, but many of the tools and skill sets needed to scale ML have been missing or in short supply. Data scientists continue to be a sought-after and expensive resource, and their valuable efforts tend to be relegated to time-consuming tasks such as data selection and data preparation. Conversely, data analysts are in abundant supply in most companies and already know how to address business problems directly, but they lack the technical background required to make the jump from analytics to data science to build their own ML models.

Remarkably, advancements made in 2020 point to six exciting trends for data science and ML in 2021. New tools and technologies have emerged, and continue to be released every month, that accelerate the work of data scientists while also empowering data analysts to move beyond descriptive analytics and conduct light data science and ML.

Underlying this acceleration is the cloud. Data scientists and data analysts benefit from cloud technologies that provide virtually unlimited amounts of compute resources. In addition, the cloud enables the elimination of data silos by consolidating data lakes, data warehouses, and data marts for fast, secure, and easy data sharing and analysis in a single location.

In short, data is transforming into an actionable asset, and new tools are using that reality to move the needle with ML. As a result, organizations are on the brink of mobilizing data to not only predict the future but to also increase the likelihood of certain outcomes through prescriptive analytics.

Here are six trends that will shape data science in 2021 and continue the evolution of analytics towards ML.

Most organizations employ an abundance of data analysts and a limited number of data scientists, due in large part to the limited supply and high costs associated with data scientists. Since analysts lack the data science expertise required to build ML models, data scientists have become the de facto bottleneck for broadening the use of ML.

However, new and improved ML tools are opening the floodgates on ML by automating the technical aspects of data science. Data analysts are empowered with access to powerful models without needing to manually build them. Specifically, automated machine learning (AutoML) and AI services via APIs are removing the need to manually prepare data and then build and train models.

AUTOML
AutoML tools are aptly named: They automate the tasks associated with developing and deploying ML models. Their development is game changing for both data scientists and data analysts. By automating tasks traditionally done by data scientists, AutoML tools enable data analysts to access models through an entirely graphical interface, without requiring a data scientist’s involvement or the need to write code.

But AutoML isn’t just for analysts. Thanks to its power, AutoML is making a huge difference for data scientists by addressing the busy work that can take up to 80% of their time, according to Harvard Business Review: loading, selecting, preparing, and cleaning data.4 By eliminating these time-consuming data chores, AutoML increases data scientists’ productivity and provides more time to conduct analyses. Additionally, human errors found in manual modeling processes are eliminated, which improves accuracy.

In addition, the one historical flaw of AutoML was that it was seen as a black box, but that challenge has been solved. AutoML services now provide transparency and explanations for their models, which is key for auditing and detecting bias. For data scientists, AutoML transforms how quickly they can build and test multiple models simultaneously.

In 2020, AutoML tools from providers such as DataRobot, Dataiku, and H2O saw significant advancements, and new solutions were introduced such as Amazon SageMaker Autopilot.

AI SERVICES VIA AN API
Another approach growing in popularity is AI services, which are ready-made models available through APIs. Rather than use your own data to build and train models for common activities, organizations access pre-trained models that accomplish specific tasks. Whether an organization needs natural language processing (NLP), automatic speech recognition (ASR), or image recognition, AI services simply plug-and-play into an application through an API, which requires no involvement from a data scientist.

Amazon provides a variety of fully managed AI services, including Amazon Lex, Polly, Rekognition, Forecast, and Translate.5 To illustrate the value, Rekognition allows an image to be sent from an app through the API to Amazon; the AI service then returns a classification and description of what the image is. These types of utilities not only save time and effort, but they also free up data scientists to focus on building and training models that are highly customized to their business, rather than re-creating commonly used services.

AutoML tools and AI services lower the barrier to entry for ML, so almost anyone can now access and use data science without requiring an academic background. However, the true power of these tools is unleashed when they are integrated seamlessly with your existing technologies. With Snowflake Partner Connect, organizations can receive faster insights from their data through pre-built integrations between Snowflake and technology partners’ products. Snowflake Partner Connect makes it fast and easy to try new ML tools and services and then adopt those that best meet your business needs.
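The Rekognition flow described in this section (an app sends an image through the API and gets labels back) can be sketched roughly as follows. This is a minimal sketch, assuming the AWS SDK for Python (boto3) and its Rekognition `detect_labels` call. The helper names `label_image` and `demo_with_aws`, the file name `photo.jpg`, the region, and the thresholds are illustrative assumptions, not details from this ebook. The client is passed in as an argument so the helper can be exercised without AWS credentials.

```python
from typing import Any, Dict, List


def label_image(client: Any, image_bytes: bytes,
                max_labels: int = 5,
                min_confidence: float = 80.0) -> List[Dict[str, Any]]:
    """Send raw image bytes to a Rekognition-style client and
    return the detected labels as name/confidence dictionaries."""
    response = client.detect_labels(
        Image={"Bytes": image_bytes},
        MaxLabels=max_labels,
        MinConfidence=min_confidence,
    )
    return [
        {"name": label["Name"], "confidence": round(label["Confidence"], 1)}
        for label in response.get("Labels", [])
    ]


def demo_with_aws(path: str = "photo.jpg") -> None:
    """Hypothetical usage; assumes boto3 is installed and AWS
    credentials are configured. Not called automatically."""
    import boto3  # imported lazily so the helper above has no hard dependency

    client = boto3.client("rekognition", region_name="us-east-1")
    with open(path, "rb") as f:
        for label in label_image(client, f.read()):
            print(label["name"], label["confidence"])
```

No model is built or trained here: the application only sends bytes and reads back the labels, which is why no data scientist needs to be involved.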

Everyone knows data silos exist within and across organizations. However, few realize that these silos also take the form of “analytics silos,” particularly between data scientists and data analysts. These analytics silos have formed as a result of the different ways the two roles work and their respective skill sets. Data silos are just one part of the difference: Data scientists and data analysts use different data (raw versus processed), data sources (data lakes versus warehouses and marts), languages (Python and Spark versus SQL), and tools (ML versus BI).

Much like organizational silos, analytics silos thwart collaboration and integration opportunities between data scientists and data analysts. This situation results in organizations missing out on the combined power of these two teams, which is exponentially stronger than simply the sum of the two parts.

For example, data analysts leverage data to provide key business metrics and answer questions around why something happened. According to Sisu, data analysts’ superpower is speed, and they use it to analyze data sets quickly and work with business stakeholders to uncover potential insights.6 While their goals are to help companies monetize market opportunities and improve competitive advantage, most of the work data analysts do is backward-looking because they lack the data science skills necessary to build predictive ML models. Instead, data analysts rely on BI tools whose dashboards have built-in limitations. While they can use data to understand what has already happened, it’s challenging for data analysts to be proactive and explore data deeply to figure out what will happen and how to influence it.

Conversely, data scientists have the ability to build ML models that not only predict but also influence business outcomes. However, they are not as well versed in the dynamic and fluctuating business environment as data analysts are. Sisu describes data scientists as “narrow-and-deep workers,” and that depth frequently leads organizations to aim data science efforts at known problems (often uncovered by data analysts) to maximize their value and contributions rather than potentially wasting time and effort on the unknown.7

Snowflake’s Data Cloud provides the tools to help deliver stronger outcomes and scale. Through Snowflake, analytics silos are eliminated. The same consistent, governed metrics and data are available for both analytics and ML through a shared feature store and reuse of data engineering pipelines. When data science insights are shared in Snowflake’s platform, data analysts can access and incorporate them into dashboards and analysis, thus broadening the scope of impact of the models the data science team builds.

An IDC report sponsored by Seagate projects that by 2025, 175 zettabytes of data will be created worldwide each year,8 which is approximately four times the amount produced in 2020 according to the World Economic Forum.9 The explosion of data produced by technologies such as IoT, social media, and mobile devices is opening up vast opportunities for data-driven insights into every area of inquiry.

Of course, it’s virtually impossible for any organization to produce or collect all the data needed to uncover business and competitive trends. Increasingly, the ability to share and join data sets, both within and across organizations, is viewed as the best way to derive more value from data. That’s why data scientists and data analysts are continually on the hunt for external data to supplement their ML models and analyses and improve the accuracy of results.

Snowflake enables secure, governed, compliant, and seamless access to third-party data in three ways.

1. Snowflake’s Data Cloud is an ecosystem where Snowflake customers, partners, data providers, and data service providers connect to their own data and seamlessly share and consume data and data services shared by other users. Underpinned by Snowflake’s platform, the Data Cloud eliminates barriers presented by siloed data and enables organizations to unify and connect to a single copy of data. In addition, the Data Cloud is a seamless way to derive value from rapidly growing commercialized data sets with fast, easy, and governed access.

2. Empowering the Data Cloud is Snowflake Secure Data Sharing, which removes traditional data transfer barriers. With Snowflake, data is generally never copied and transmitted. Instead, users can share live data from its original location. Those granted access simply reference data in a controlled and secure manner without latency or contention from concurrent users. Because changes to data are made to a single version, data remains up to date for all consumers, which ensures data models are always using the latest version.

3. Snowflake Secure Data Sharing is the technology foundation for Snowflake Data Marketplace, which serves as a single location to access live, ready-to-query data. Secure, governed data can be shared with, and received from, an ecosystem of business partners, suppliers, and customers, or from third-party data providers and data service providers. Snowflake Data Marketplace removes the arduous processes involved in locating the right data sets, signing contracts with vendors, and managing the data to make it compatible with internal data. Instead, data scientists and data analysts can source new data with ease. In addition to Snowflake Data Marketplace, organizations can use private data exchanges to share data with trusted partners, suppliers, vendors, and customers through Snowflake Secure Data Sharing.

External data is available and accessible to all Data Cloud users with just a few clicks. Once it’s in the Data Cloud, data is ready to be shared and consumed. There’s no sending of CSV files or manual version control. Data scientists can enrich models with seamless access to almost-unlimited data on any topic, including real-time and evolving circumstances. For example, in 2020, COVID data sets were universally in high demand across all sectors and industries because organizations needed to analyze the impact of the virus on an hour-by-hour basis.

Data engineering tasks require an inordinately large amount of a data scientist’s attention. The time commitment for data preparation varies. It can range from 45%, according to the Anaconda “2020 State of Data Science” survey reported by Datanami,10 to 80%, according to a survey conducted by CrowdFlower and reported by Forbes,11 but there is little disagreement among data scientists that collecting, organizing, and cleaning data are the least enjoyable tasks they undertake. Minimizing this burden is paramount not only for keeping data scientists productive and happy but also for broadening access to ML.

Finally, direct integrations between data engineering tools and ML tools are starting to bring these two worlds together. For example, DataRobot acquired Paxata to add data preparation tools alongside its AutoML offering,12 and Alteryx is shifting its focus from data prep towards an automated, assisted ML modeling offering.13 In addition, Amazon SageMaker launched two new services: Data Wrangler,14 to accelerate data prep for ML, and Pipelines,15 as a continuous integration and continuous delivery (CI/CD) service for ML.

Data scientists are also benefiting from “feature stores,” which make it easy to repurpose existing work. For example, once a data scientist has converted raw data into a metric (for example, “cost of goods sold”), this universal metric can be found quickly and used by everyone else for quick analysis against that data. Not only does this practice save data scientists time and effort, but it also reinforces BI metrics, maintains data governance, and ensures there are no discrepancies across the work done by data analysts and data scientists.

Data engineering tools not only enable reuse of prepared data by anyone within an organization, but they can leverage the scalability and efficiency of Snowflake to provide the processing. Snowflake works with various partners that provide feature store solutions to ensure data scientists and data analysts can reuse consistent features.
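The feature store idea in this section, defining a metric once and letting everyone reuse the same definition, can be sketched in a few lines. This is a toy in-memory sketch, not how any particular feature store product works. The `FeatureStore` class, its methods, the `cost_of_goods_sold` formula, and the sample order records are hypothetical and introduced only for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]


class FeatureStore:
    """Toy registry mapping a feature name to the function that computes it."""

    def __init__(self) -> None:
        self._features: Dict[str, Callable[[List[Row]], float]] = {}

    def register(self, name: str, fn: Callable[[List[Row]], float]) -> None:
        # Refuse duplicate names so there is exactly one definition per metric.
        if name in self._features:
            raise ValueError(f"feature {name!r} is already defined")
        self._features[name] = fn

    def compute(self, name: str, rows: List[Row]) -> float:
        return self._features[name](rows)


def cost_of_goods_sold(rows: List[Row]) -> float:
    # Hypothetical definition: units sold times unit cost, summed over orders.
    return sum(r["units"] * r["unit_cost"] for r in rows)


store = FeatureStore()
store.register("cost_of_goods_sold", cost_of_goods_sold)

orders = [
    {"units": 10, "unit_cost": 2.5},
    {"units": 4, "unit_cost": 10.0},
]

# A dashboard and a model both go through the same registered definition,
# so their numbers cannot drift apart.
dashboard_value = store.compute("cost_of_goods_sold", orders)
model_input = store.compute("cost_of_goods_sold", orders)
print(dashboard_value, model_input)
```

Rejecting a second registration under the same name is a design choice in this sketch: it enforces the single shared definition that keeps analyst and data scientist results consistent.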

Data scientists are always looking for strategic ways to inject efficiency into training and deploying models. Recently, a new generation of distributed training engines has surfaced that delivers on that goal by providing tremendous speed and performance gains over Apache Spark.

One approach that’s gaining attention is Dask, a distributed training framework built in Python.16 Dask is designed to enable data scientists to improve model accuracy faster. Data scientists can do everything in Python end to end, which means they no longer need to convert their code to execute in Spark. The result is reduced complexity and increased efficiency.

Another open source Python framework is RAPIDS, which is built on top of Dask.17 RAPIDS optimizes compute time and speed by providing data pipelines and executing data science code entirely on graphics processing units (GPUs) rather than CPUs. Saturn Cloud recently compared RAPIDS to Spark and discovered that model training with RAPIDS took one second on a 20-node GPU cluster, while Spark took 37 minutes on a similarly priced 20-node CPU cluster. Saturn Cloud concluded that RAPIDS enables 2,000x faster processing using GPUs while costing a fraction of the price.18

The impact of these distributed training frameworks is already being seen in the real world. Walmart uses RAPIDS with Dask and XGBoost (an ML algorithm) for its data analytics and ML, and NVIDIA reports that Walmart has found that “one GPU server requires only four percent of the time needed to run the same forecasting models vs a 20-node CPU server.”19 That translates to Walmart running models in four hours that previously took several weeks using CPUs.

While organizations are thinking strategically about training frameworks, some have run into barriers in the past. Today, new technologies are unlocking what’s possible and demonstrating how much faster things can be when everything is done directly with Python. By eliminating the need to convert models into Spark, organizations are reducing complexity and increasing efficiency. And it’s easy to try different distributed training frameworks on Snowflake’s platform to find what works best.
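The speed figures quoted in this section reduce to simple ratios. As a quick check, taking the reported timings at face value: 37 minutes against one second works out to about 2,220x, in line with the rounded 2,000x figure, and needing four percent of the time corresponds to a 25x speedup. The function name `speedup` is illustrative.

```python
def speedup(baseline: float, new: float) -> float:
    """Return how many times faster the new run is than the baseline.
    Both arguments must be in the same unit."""
    return baseline / new


# Saturn Cloud comparison: Spark took 37 minutes, RAPIDS took 1 second.
spark_seconds = 37 * 60
rapids_seconds = 1
print(speedup(spark_seconds, rapids_seconds))  # 2220.0

# Walmart quote: the GPU server needs 4 percent of the CPU cluster's time.
cpu_time_percent = 100
gpu_time_percent = 4
print(speedup(cpu_time_percent, gpu_time_percent))  # 25.0
```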

The field of data science is evolving rapidly. Not only are new ML and AI developments released every month, but new startups, tools, and solutions emerge regularly. With the rapid pace of innovation occurring in this space, it’s imperative not to get locked into using a single tool.

That’s why it’s important to select a platform that is vendor-, framework-, and algorithm-agnostic. By choosing a future-proof platform, you ensure that upcoming ML tools will continue to work seamlessly with the platform you have. After all, the last thing you want to do is re-platform in order to use the next generation of tools.

What makes Snowflake’s platform unique is its modern architecture. Designed with separate, but logically integrated, compute and storage, Snowflake eliminates the manual cluster-building efforts that other systems must perform to make separate layers work together. As a result, Snowflake provides a multi-cluster, shared data architecture that provides nearly infinite scalability, instant elasticity, and extremely high levels of concurrency to power the Data Cloud.

In addition to the underlying architecture, Snowflake supports data science in a variety of ways.
• Snowflake’s External Functions allow any third-party, hosted, or custom ML service to be accessed easily using SQL.
• Recognizing that various teams may prefer languages other than SQL, Snowpark extends language support for Java, Scala, and, soon, Python. Snowpark allows data scientists to write code in their language of choice using familiar programming concepts, such as DataFrames, and then execute data preparation and workloads directly on Snowflake.
• Java user-defined functions (UDF) are supported to enable trained models to run within Snowflake. That means models built and trained in an ML partner’s technology can be brought into and run directly on Snowflake resources.

IN 2021
It’s remarkable how quickly data science has become mainstream. In the last 10 years, companies have shifted their focus from reporting and historical analysis to conducting data science with advanced mathematical models and ML. The cloud changed everything. With the ability to inexpensively collect and store more and more data came the need to build data models powered by ML.

Today, a modern platform is a necessity if you want to analyze and share data quickly and scalably with security and governance built in. Snowflake provides an architecture that enables data consolidation, efficient data preparation, and an extensive partner ecosystem. Your data is mobilized, which allows you to benefit immediately from new trends in data science and ML.

With Snowflake, limitations on data science are removed. Are you ready to accelerate your machine learning?

Snowflake delivers the Data Cloud, a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Wherever data or users live, Snowflake delivers a single and seamless experience across multiple public clouds. Snowflake’s platform is the engine that powers and provides access to the Data Cloud, creating a solution for data warehousing, data lakes, data engineering, data science, data application development, and data sharing. Join Snowflake customers, partners, and data providers already taking their businesses to new frontiers in the Data Cloud.

