Data Centre Infrastructure Monitoring - Partnership for Advanced Computing in Europe - prace

 
Data Centre Infrastructure Monitoring - Partnership for Advanced Computing in Europe - prace
Available online at www.prace-ri.eu

                              Partnership for Advanced Computing in Europe

                               Data Centre Infrastructure Monitoring

                                                      Norbert Meyer *
                                 Poznań Supercomputing and Networking Center, PSNC, Poland

Contributors: Andrzej Gośliński (PSNC), Radosław Januszewski (PSNC), Damian Kaliszan (PSNC), Ioannis Liabotis
(GRNET), Jean-Philippe Nominé (CEA), François Robin (CEA), Gert Svensson (KTH, PDC), Torsten Wilde (LRZ), All
PRACE partners contributing to the DCIM survey

Abstract

Any data centre, especially an HPC centre, requires an advanced infrastructure which supports the efficient work of
computing and data resources. Usually, the supporting environment is of the same complexity as other parts of the HPC
data centre. The supporting environment includes chillers, coolers, pumps, valves, heat pumps, electrical distributions,
UPSs, high and low voltage systems, dryers and air conditioning systems, flood, smoke and heat detectors, fire
prevention systems and more. The variety of supporting equipment is very high, even higher than the IT infrastructure.
This dictates the necessity to collect, integrate and monitor the instrumentation. In addition to monitoring, inventory
system should be a part of each data centre. This report provides a summary of a DCIM survey collected from the most
important HPC centres in Europe, analysis of controlling and monitoring software platforms available on the market with
an assessment of the most wanted functionality from the users’ point of view. The analysis of requirements and
potentially available functionality was summarised by a set of recommendations. Another critical issue is the policy and
definition of the procedures to be implemented by the data centre owner and service provider to keep the necessary
Service Level Agreement (SLA). Parts of the SLA should be reflected in the data centre infrastructure management.
Apart from reliability and high availability you need to consider minimizing maintenance and operating costs, and the
DCIM systems are very helpful for this purpose as well. The best practice information was presented at the “7th
European Workshop on HPC Centre Infrastructures” organised in Garching (Germany) on April 2016. The list of
recommendation and the conclusions chapters describes the essence of what should be expected from a well-designed
DCIM system.

  *
      Corresponding author: meyer@man.poznan.pl
Data Centre Infrastructure Monitoring - Partnership for Advanced Computing in Europe - prace
Introduction

By implementing the right combination of hardware and software solutions to manage the data centre infrastructure, it is
possible to:

    •    Reduce day-to-day costs and the time involved in operating and maintaining the data centre
    •    Enable better management decisions by getting the right information to the right people at the right time
    •    Make better use of existing IT and infrastructure resources, optimise data centre performance and improve
         system availability.
Data centre managers need to have insight into their IT and physical infrastructure, Figure 1

                               Figure 1: The IT and physical infrastructure in a data centre (by Emerson)

The report focuses on the physical infrastructure. The diversity of equipment located in data centres (DC) makes any DC
continuity (business continuity) impossible without a proper monitoring and management software platform.
The term DCIM means Data Centre Infrastructure Monitoring. However, the literature also uses the abbreviation to
denote Data Centre Infrastructure Management. This report uses the first meaning.
A DCIM platform can help to:

    •    Identify actively issues before they escalate into problems,
    •    Gain insight into asset/system performance to improve processes, increase productivity and maximise
         operational and capital resources,
    •    Model the data centre’s behaviour based on its configuration,
    •    Incorporate real-world data to optimise performance and decisions,

                                                                                                                       2
•    Plan for future IT needs.
The authors prepared a questionnaire to survey the current status of implemented solutions in HPC centres in Europe.
The result is described in Section 1. The state of the art of monitoring and control platforms is described in Section 2.
Software platforms available on the market can be divided into commercial products and those developed under an open
source license and free of charge. This section shows the pros and cons of a number monitoring and control applications
and makes suggestions and recommendations to choose the best product. It is important to take into account the users
requirements, as only this makes sure the recommendations are relevant. Section 2 was divided into two parts to describe
both monitoring and control tools as well as asset management packages. Section 3 gives an overview of communication
protocols e.g. BACnet, sCEBus or PROFIBus. The second part of Section 3 presents APIs like Redfish, PowerAPI and
PowerDAM.
It is an interesting topic whether and how DCIM can be integrated with computing resource management platforms and
potential synergies between these platforms. This is discussed in Section 4. Sections 5 Recommendations and Section 6
Conclusions complete this white paper report.

1. Survey results from PRACE

The current state of the art in HPC technologies and DC maintenance are reflected by PRACE partners in Europe. It was
reflected in the results collected from a survey distributed in the first quarter of 2016. The results of the survey were
presented at the 7th European Workshop on HPC Centre Infrastructures [1] in April 2016. The data centre technology
used in academic institutions is highly advanced, at least in terms of the size of computing resources, the energy usage
and density of Flops/Watt and Flops/m2. However the size of data centres and capacity of resources require additional
monitoring software.
This section contains information based on reports received from large data centres (PRACE hosting partners) and large
infrastructures of the PRACE general partners, which gives a good overview of the data centres functionality. The goal of
the questionnaire was to collect information about the current state of the infrastructure monitoring systems supporting
daily work in the PRACE data centres.
At the moment, in HPC Data Centres, there are plenty of different monitoring platforms to choose from, for example:
    •    Tridium JACE system – Niagara
    •    Siemens Desigo
    •    Graphite/Graphana
    •    Pronix POC (Power Operation Centre)
    •    Check_MK
    •    Messdas
    •    InTouch
    •    UGM 2040
    •    Desigo Inside
    •    Johnson Controls Metasys
    •    Siemens WinCC
    •    SCADA PANORAMA
    •    victor Web LT
    •    StruxureWare Power Monitoring Expert Data Center Edition.
However, such a big selection of management systems does not mean that each management system provides all the
services. Each platform is often responsible only for a part of the monitored data centres’ infrastructures and its
functionalities, e.g. separate tools are used for power and cooling facilities. Separate hardware units are often monitored
by a separate monitoring infrastructure. The most common solution is three or more independent software platforms
used for various types of equipment. For example, one platform is responsible for heat exchangers (HxB), Air
Conditioning (Chillers and CRAHs) and generators; another for webcams, TV, fire protection, door access codes and
other security features installed in the building; and a third one for live status, event logging, triggering of alarms and
generating problem tickets.
It is practically impossible to take care of a DC utilising just one monitoring platform. Only DCs developed from scratch
allow integrating the DCIM into one global system. This encourages searching for complementing solutions and to create
complementary platform consisting of several systems for specific requirements. But then the components are
independent without any common interface.

                                                                                                                         3
Despite the diversity of used solutions, the lack of a global view is the biggest problem but, on the other hand, the easiest
to solve in comparison with creating one platform/program for the entire infrastructure. This lack makes it
difficult/demanding to have full control over the system and optimise it using information that is related and acquired
from different platforms. Structured view - “full-stack” combined dashboards covering everything from the queue system
down to the infrastructure level is a must.
However, one well-prepared platform would meet such relevant needs like monitoring water temperatures of all cooling
loops, energy optimization functionality, tight monitoring and early detection of problems, logging, view of historical
data, historical trend monitoring and easy programming. Such a platform should also support, among others, asset and
capacity management, workflow, Application Programming Interface (API), mobile & web-based interface, real-time
monitoring, trends & reporting, data centre floor visualization and simulations.
This leads to defining the general shortcomings of the current monitoring system infrastructure. Beyond the
standardization which would allow for integration of management systems and software, and as a result the above-
mentioned full control over the data and all processes taking place in the centre, there is also a need for a system which
provides detailed information about the status of the e-infrastructure.
The current multitude of software necessary for relatively precise control over the data centre does not only affect the
amount of collected information, but also the number of people (FTE) dedicated to maintaining and managing the data
centre hardware. There is a correlation between complexity of the complementary software used in a data centre, its size
and the number of employees. The more complex the system is the more people are needed, although according to
studies an average of 2 FTE staff is enough to support the DC infrastructure and monitoring. It should be emphasized that
FTE excluded functions such as managing computing, data infrastructures and services correlated with these
infrastructures.
The complexity of a monitoring system is shown by the wide range of equipment connected. This includes: electrical
distributions, cooling systems (including data from chillers, coolers, pumps, valves, heat pump and other technology),
electrical distributions, UPS, high-voltage system, low-voltage system, dryers and air conditioning systems for IT
technology, flood detectors, smoke and heat detectors, fire prevention system, air quality systems, temperature and
humidity sensors, oxygen and carbon dioxide sensors, fuel storage and management, basic security systems (video,
access control, door status indication) and more.
An inventory software platform plays a very important role in data centres. The HPC centres can use their own
implementations, where an inventory is handled via MySQL databases or an inventory data catalogue is based on an
Oracle database. Also, applications such as web-based Python/Django, CommScope iTRACS are being deployed. Open
source platforms like Ralph (based on Django framework) or Argos system are also used.
The monitored data are usually collected via a separated internal LAN network dedicated for that purpose. At the
moment, there are two common types of access to the system monitoring data: LAN (internal) and through VPN from
outside. The monitoring network is a physically and logically separated network, not directly connected to the building
network or any other public network, protected by VPN and ACLs from the HPC systems network. The network is
separated and not connected to the outside world (exceptionally access can be achieved only by VPN).
Furthermore, the mentioned solutions which secure access to the monitoring and controlling network are safe enough.
Physical separation from the building and public network is a sufficient protection against external threats and
unauthorised access.
As the monitoring systems become more and more complicated they require additional servers and storage to capture and
analyse data. To support the monitoring platform most DCs use additional IT systems: one up to 4 servers. There are
individual approaches in each HPC centre. However, it does not consume very large hardware resources. A much more
important feature is reliability and HA (high availability), which drives the need to build an HA-cluster consisting of at
least 2 machines.

2. State of the art

Most frequently implemented functions in DCIM software packages:

    •    Asset management
         Most popular DCIM solutions have the ability to catalogue IT assets such as network equipment, servers, etc.,
         which are installed in rack cabinets in the data centre. Also, there are some packages that map all connections
         between devices and from where a particular device is supplied with electricity. In addition, a DCIM solution
         provides tools to generate QR code or barcode labels. Labels can then be scanned by an administrator using a
         smartphone or tablet.
                                                                                                                           4
•   Capacity management
        Most of DCIM software available on the market is able to inform the user if there is enough space, available
        network ports or electric sockets in rack cabinets for new equipment.
        Furthermore, some DCIM software can detect that equipment is too heavy for a particular rack cabinet, or that
        the power supply or the cooling system reached its limits and new equipment cannot be added.

    •   Real-time monitoring
        Complex DCIM applications gather data and inform if equipment parameters exceed thresholds so that
        administrators can react to the situation immediately.

    •   Trends & reporting
        DCIM systems that monitor equipment parameters in real time are often able to save that data to visualize
        changes of parameters in the form of graphs or reports.

    •   Data centre floor visualization
        In most cases, DCIM systems provide 2D or 3D visualization of server rooms. Some solutions are even able to
        calculate heat dissipation and display it as a thermal-vision view of the server room.

    •   Simulations
        Most complex DCIM solutions have functions to calculate “what if” scenarios, such as a power failure, a
        cooling system malfunction or the deployment of new equipment, to better understand consequences of such
        events and take appropriate actions.

    •   Workflow
        Some DCIM software implements ticket systems to help automate and document change requests for assets in a
        data centre. Thus, only qualified personnel are responsible for performing changes to IT assets.

    •   Remote control
        Parts of DCIM solutions allow storing login credentials in an encrypted database to help administrators cope
        with a huge number of IT assets which require remote control.

    •   Application Programming Interface (API)
        Some DCIM applications are able to communicate with third party software by providing an Application
        Programming Interface.

    •   Mobile & web-based interface
        More and more often developers add a possibility to manage a data centre through mobile or web-based
        applications for ease-of-use from anywhere.

2.1. Data centre monitoring and controlling products

Most monitoring products are coming from the commercial world, developed by companies, usually with an idea to
support an internal set of products (like Schneider Electric or Emerson). However, lately, new ideas to integrate more
advanced functionality and customer requirements of heterogeneous physical data centre infrastructure have been
developed.

2.1.1. ABB

Decathlon – a complex solution for data centres which provides tools to manage the physical infrastructure and the
workflow. Key features of Decathlon® system are [2]:
    •   Visualisation, monitoring and alerts for IT asset parameters such as equipment status, energy consumption and
        environmental data,
    •   Help to maximise capacity of cooling, space and power by calculating “what if” scenarios and “before and after”
        data centre comparisons,
    •   Calculation of PUE, carbon emissions, cost of kWh and other statistics,
                                                                                                                      5
•   Management of IT assets.

2.1.2. CA Technologies

CA DCIM – the solution consists of two products CA ecoMeter and CA Visual Infrastructure.
Key features of CA ecoMeter are as follows [3]:
    •   Capturing information about a data centre, such as energy usage and cooling in real-time,
    •   Recognition of anomalies or deviations from normal patterns and sending alerts,
    •   Ability to calculate popular data centre metrics such as PUE,
    •   Ability to integrate with the existing Building Management System (BMS),
    •   Ability to gather data via SNMP, Modbus and BACnet protocols without additional hardware,
    •   Identification of inefficient devices,
    •   Remote control over facility equipment.
CA Visual infrastructure is designed for real-time monitoring and 3D visualisation of data centre assets. Users can
generate “what if” scenarios for outage analysis. The application tree view shows power and network connections
between devices. Furthermore, users can attach technical documentation, user guides, diagrams or warranty
documents[4][5].

2.1.3. CommScope

The product iTRACS is a complex DCIM solution which is utilized by many world Fortune 500 companies. Main
benefits of using iTRACS are [6]:

    •   Asset management - iTRACS has a built in database to catalogue information about data centre assets, such as
        model, dimensions, age, weight, purchasing details, physical location, and network and power connectivity. The
        ability to auto-discover assets decreases the margin of user errors and accelerates deployment of the iTRACS
        system.
    •   Power management - iTRACS pulls power consumption data directly from devices, such as intelligent PDUs
        and UPSs. The iTRACS 3D visualization allows users to follow the power chain from IT assets to its source.
        The iTRACS application provides users with views of total energy consumption of data centre and alerts when
        power thresholds are exceeded.
    •   Space management - CommScope's solution is capable of generating 3D models of data centre floors to better
        understand where each asset is located. It is possible to generate a simulation which will show the impact of
        each planned change.
    •   Predictions - iTRACS FutureView tool is capable of comparing the current state of a data centre with the past
        and future states. With this tool it is easy to understand the effect of each change in a data centre.
    •   iTRACS can be easily integrated with the existing BMS system, VMware, HP Insight Manager solutions and
        other third-party software.
    •   CommScope’s software offers templates to quickly create a variety of reports. Different reports can be generated
        for each user, group or zone. Reports can contain information about the total energy usage, thermal conditions,
        network utilisation, etc. The myAnalytics application offers insight into what is happening in a data centre and
        why [7].

2.1.4. Cormant

The DCIM solution has been developed for more than 11 years. Cormant-CS can be used by small and large companies
due to asset-based licensing. The application has an intuitive user interface that enables management of data centres
without any training [8].
With Cormant-CS users can manage data centres through a desktop, mobile or web-based application. Furthermore, the
solution can be integrated with third party programs using API.
For each asset Cormant-CS records purchase information, location, support and configuration details. Users can trace
network and power connections throughout the data centre. Power usage and cooling needs are monitored in real time to
better understand how much capacity is available. The solution provides users with a detailed 2D graphical representation
of the entire infrastructure. Cormant-CS provides a workflow management tool to record changes made to the data
centre’s assets. A built-in scripting engine allows users to send commands to any device within the data centre. Another
                                                                                                                        6
function of Cormant-CS is the ability to generate reports and send them directly to users by e-mail; the reports can
contain statistics such as PUE or DCIE. The Cormant-CS software alerts users when parameters exceed the set values.
Administrators with Android and Apple devices can manage assets from the data centre floor through a mobile
application using barcode labels or RFID tags to find device information in the systems database [9].

2.1.5. Device42

Device42 provides detailed views of IT assets in rack cabinets. For each IT asset users can attach information about
network configuration, purchase, license agreements for software installed on machines, user manuals, patch panel
documentation or support contacts [10].Device42 has a built-in tool to document physical, virtual, blade, clustered,
switches and more device types. The application is able to generate QR code labels which can be printed and attached to
a physical device for easy identification in the future. The Device42 solution provides interactive 2D views of rack
cabinets, data centre floors, heat dissipation and network connections between devices. Device42 enables power usage
and cooling needs management by providing tools to generate reports and monitor device parameters in real time.
DCIM solution has a built-in password database so that users can save login credentials for IT assets in a secure way.
Passwords are stored and encrypted using AES256 cipher. It is possible to integrate Device42 with third-party systems
using RESTful API [11].

2.1.6. Emerson Network Power

Emerson Network Power provides a few separate solutions for data centre infrastructure management. Only three
solutions will be presented below [12].
Trellis Enterprise Solutions are robust solutions for demanding clients with complex data centres. This product consists
of the following components:
    •    Trellis Inventory Manager is used to catalogue assets that are placed in a data centre. With this tool users are
         able to understand how much space, power and cooling is available in a data centre. The application provides
         users with an ability to quickly find any equipment in the data centre, simplify decision making by providing
         users with more knowledge about assets, and minimise errors due to outdated or conflicting spreadsheet lists of
         assets by maintaining one centralised database. Furthermore, views of data centre floor, rack cabinets and cable
         connections are available to the users.
    •    Trellis Site Manager monitors and reports on the status of the facility critical devices by polling data, such as
         power usage, temperature, humidity, air flow, leak detection from devices located in the data centre in real-time.
    •    Trellis Power System Manager allows a comprehensive view of the data centre power system from utility
         entrance down to rack power distribution helping managers and engineers cut energy costs and maximise
         capacity. With this tool engineers can gain knowledge that helps make proper decisions to minimise costs and
         ensure that there will be no interruptions of asset operations by better understanding dependencies in the power
         network.
    •    Trellis Thermal System Manager module is responsible for monitoring the cooling system from chillers and
         coolers down to end-point CRAC units. This tool helps prevent over-cooling and eliminate hot spots that
         increase the risk of downtime. From the application interface you can adjust fan speeds, temperatures or even
         turn off unused CRAC units. The thermal system manager is able to simulate a 3D thermal image of the data
         centre floors.
    •    Trellis Process Manager provides a function to organise work and eliminate errors associated with installing,
         decommissioning, moving or renaming equipment. An automated ticket system ensures that changes made to
         the data centre are performed by qualified personnel.
    •    Trellis Mobile Suite users that have the Apple iPad device can install a mobile application that provides access
         to the Trellis platform from anywhere.
Trellis Quick Start Solutions are low-cost, pre-packaged solutions targeted to small or new data centres. This product
provides a low-risk entry point to DCIM software. With a modular architecture, it is possible to expand in the future. This
product consists of:
    •    Trellis Data Center Monitoring Solution provides software, hardware and services for monitoring a data
         centre in a complete package. With this package, users are able to monitor power usage and environmental

                                                                                                                          7
parameters of a data centre infrastructure. Furthermore, it is possible to configure alarms and notifications when
         parameters exceed set values, manage capacity, power chain, cooling efficiency and device performance in real-
         time.
    •    Trellis Capacity Planning Solution provides software to help with data centre expansion planning and capacity
         management. Additionally, this tool can help maximise space in the rack cabinets and optimise cooling.
    •    Trellis Asset Management Solution provides users with a tool to improve device uptime performance, quickly
         search for devices and its description, visualize data centre floors, help identify an ideal place for new
         equipment, help plan device placement for future projects.
    •    Trellis Energy Management Solution is focused on power and cooling management. This tool lets users see
         power distribution and cooling needs of assets in a data centre; thus it is easier to find places where changes
         would bring some benefits, such as cost reduction.
Avocent solutions provide secure, centralized infrastructure, network monitoring and management tools. Avocent
solutions are used to achieve better performance and high availability of IT assets in a data centre. This product consists
of:
    •    Avocent DSView management software is used in conjunction with KVM appliances, serial console appliances,
         service processor gateways and PDU’s. DSView allows IT administrators to remotely access, monitor and
         control target IT assets from anywhere. It is possible to authenticate users with existing authentication services
         such as LDAP, Active Directory, NT Domain, TACACS+, RADIUS and RSA SecureID. Confidentiality is
         ensured by encryption using AES, DES, 3DES or 128bit SSL.
    •    Avocent Data Centre Planner provides visualisation of data centre floors that is dynamically updated as
         equipment is installed. Sophisticated modelling and predictive analysis capabilities, “what-if” scenarios, are
         easily created using a drag and drop interface. This solution helps understand the impact of every change
         performed in the data centre before they occur.

2.1.7. FNT

The products FNT Command and FNT Monitoring are the core of FNT DCIM solution [13].
    •    FNT Command is a modular infrastructure management software. In one database FNT Command stores data
         related to planning and management of the telecommunication and network infrastructure. The database stores
         device attributes, relationships, the responsible person, support contact, etc. With FNT Command users can trace
         cable connection and network configuration on devices. With this product users can view the current data centre
         state and also plan ahead and generate “what if” scenarios to understand the impact of changes or future
         expansion of data centre. FNT Command enables 3D comprehensive view of networks, servers, storage,
         software, services, power cooling and a data centre floor visualisation [14].
    •    FNT Monitoring enables real-time device parameters monitoring. It is possible to configure alerts and
         notifications when parameters deviate from set values. A detailed analysis helps with the identification of
         capacity bottlenecks [15].

2.1.8. Future Facilities

6SigmaDCX is a computational fluid dynamics (CFD) simulation tool that can help with troubleshooting and optimising
heat dissipation in data centres. The solution provides a 3D thermal image of a data centre floor and a knowledge base
about assets installed within the DC [16], [17].

2.1.9. Geist

Geist DCIM solutions consist of the Environet and Racknet products. Geist DCIM is equipped with a rich 3D graphical
interface for data centre assets monitoring [18].The application communicates in real-time with devices that support
SNMP, Modbus, BACnet or LONworks protocols. Furthermore, the application is accessible remotely through a web-
based interface [19]. By uncovering stranded capacity the Geist DCIM solution helps cut energy costs. Also, it is capable
of tracking existing or future asset details and predict impact of data center changes or expansion. Available to users are

                                                                                                                          8
multileveled views of data centre floors, such as a global view, thermal view, equipment view, energy spectrum view,
rack manager view, side-by-side equipment comparisons and more [20], [21], [22].

2.1.10. Modius

OpenData is a scalable DCIM solution. Modius provides three versions of the OpenData platform depending on
complexity of the data centre [23]:
    •   OpenData Enterprise Edition – a solution for large or multiple data centres,
    •   OpenData Standard Edition – a solution for small and medium single site data centres,
    •   OpenData Machine Edition – a low-cost solution for a limited number of devices in a remote location.
DCIM solutions from Modius are capable of real-time monitoring of any device that is able to connect to a computer
network or by serial connection. OpenData has a built-in extensive library of communication templates for devices.
Devices or sensors are identified and OpenData templates are used to pull the appropriate data points from each device.
OpenData Environmental Management module is responsible for collecting sensor data, such as temperature,
humidity or leak detection. Based on this data intelligent alarms with multiple tiers of escalation can be configured.
OpenData Asset Management module provides administrators with the ability to visualise assets and infrastructure,
manage the placement of equipment and make proper capacity management decisions as new devices are installed or
when old equipment is retired. The asset management module is capable of auto-discovery of assets including
relationships and dependencies between them. Also, the module provides complete visualisation of data centre floors.
OpenData Power Capacity Management module captures data from power meters, smart PDUs, busways, raceways
and other components in the power chain to monitor total energy consumption in a data centre.
OpenData Reporting Package allows end-users to quickly and easily create custom reports from monitored devices and
sensory data [24].

2.1.11. Nlyte Software

Nlyte solution is a combination of DCIM and DCSM (Data Center Service Management) software.
The world’s largest and most complex data centres rely on Nlyte to manage processes, resources, assets and their
dependencies. Key features of Nlyte are [25]:
    •   Capacity Planning – Nlyte helps manage and optimise space, energy usage, cooling and asset performance,
    •   Asset Management – the application catalogues all data centres equipment. All assets are visualised in details on
        data centre floors’ and rack cabinet views,
    •   Monitoring, alarming and reporting – the solution provides tools to generate complex reports, monitor assets
        parameters in real time, ability to configure alarms and notifications. It is possible to generate “what-if”
        scenarios to better understand the capacity of a data centre,
    •   Connection Management – Nlyte can document network and power connections between assets,
    •   Workflow Management – with Nlyte administrators can control changes performed in a data centre,
    •   Virtualisation Integration – administrators are able to integrate VMware, Microsoft and Citrix solutions with
        Nlyte,
    •   CMDB Integration – it is possible to synchronise configuration item information via connectors for BMC, HP
        and ServiceNow [26].

2.1.12. OpenDCIM

OpenDCIM is an open source DCIM solution. OpenDCIM is capable of creating views of data centre floors and rack
cabinets. From the data centre floor view, users can observe available space, and the temperature in the rack cabinets.
Furthermore, network and power connection can be mapped. The basic workflow system can be used to manage asset
change requests. Hosting costs can be calculated based on cost per U or cost per Watt. Open DCIM is able to alert users
about device faults and perform impact simulations of power outages. In the near future, administrators will be able to
integrate OpenDCIM with third-party software by using RESTful API [27], [28].

                                                                                                                        9
2.1.13. Optimum Path

Optimum Path provides two separate DCIM solutions: Visual Asset Manager and Visual Data Center software.
Visual Asset Manager is a SaaS (Software as a Service) based asset management tool used to improve tracking of assets
in a data centre [29]. Key features of the VAM solution are:
    •    Visual 3D representation of data centre floors and assets,
    •    Intuitive rack views,
    •    Centralised repository of port and connection information for network, power and fibre connections,
    •    Calendar tools to track scheduled service activity, planned downtime, warranty expiration and more,
    •    A task and work order module to document installation, move, decommission, connect, disconnect activities
         with assets,
    •    Definition of user-specific reports [30].
The Visual Data Center is a DCIM software package that provides the following features [31]:
    •    Complex 3D visualisation of entire physical infrastructure with a library of high definition 3D graphics,
    •    Capacity planning of space, power, cooling and cable connections,
    •    Impact simulation on power and cooling prior to deployment,
    •    A task and work order module to document installation, move, decommission, connect, disconnect activities
         with assets,
    •    A centralised repository of port and connection information for network, power and fibre connections,
    •    Definition of user-specific reports and trends,
    •    Real-time temperature and humidity sensor monitoring,
    •    Centralised IP camera views across multiple locations,
    •    A complete inventory of all data centre assets including all device properties and software.

2.1.14. Panduit

SmartZone is a suite of DCIM solutions designed to help you manage risks and changes within the physical
infrastructure by providing real-time data on the status of power, cooling, connectivity and security [32].
SmartZone provides visualization of critical metrics in one place to make fast, informed decisions.
Users are able to view historical data in the form of graphs or reports. Asset and capacity management functionality built-
in SmartZone provides users with information where to place new assets to optimise space, power and cooling capacity.
Furthermore, the application identifies capacity lost due to lack of resources and creates work orders to make the
necessary changes to reclaim this stranded capacity. The solution is able to automatically discover assets and connection
between them and automatically updates assets information when changes are made [33].

2.1.15. Rackwise

RACKWISE DCiM X includes an easy-to-use suite of DCIM capabilities within a central, scalable platform. Features
of the application are as follows:
    •    Real-time monitoring of network accessible attributes,
    •    2D data centre floors and rack cabinets visualisation,
    •    Calculation of popular metrics such as PUE or DCiE, carbon footprint and displaying them as histograms,
    •    Calculation and reporting of current consumption and remaining capacity of power equipment or circuits
         throughout the entire data centre power chain,
    •    Ability to build models of future consumption based on planned equipment moves, installation, or changes,
    •    Impact simulation of failures of components in the power infrastructure,
    •    Trending of asset parameters,
    •    Complex database of data centre assets,
    •    Network and power cable management,
    •    Apart from physical assets Rackwise’s solution is able to document virtual machines and their parameters,
                                                                                                                        10
•   Integration with CMDB systems such as BMC and ServiceNow,
    •   WEB PORTAL – web-based interface [34].

2.1.16. Schneider Electric

StruxureWare is a DCIM software that provides users with the following features [35]:
    •   Real-time monitoring of parameters of network accessible devices,
    •   Asset management and documentation of data centre infrastructure,
    •   Overview of data centre network paths and their interconnections for reduction of human error,
    •   Active cooling control for optimization of resource usage, extended equipment lifetime and energy savings,
    •   Insight into energy usage by providing historical and current values of metrics such as PUE, DCiE or carbon
        footprint,
    •   Insight into rack utilization for maximizing space capacity in a data centre,
    •   Software solution that eliminates the need for hardware KVM switches,
    •   Virtual store room for tracking new equipment that is not deployed yet,
    •   Multiple tenant assets management that provides reports about power usage and tenant impact in an event of an
        outage,
    •   Customisable report generator,
    •   Mobile application VIZOR that provides users with an ability to manage a data centre from smartphone or tablet
        devices,
    •   Handheld wireless barcode scanner for easy asset identification, based on Motorola MC75 hardware,
    •   Simulation, optimisation and planning of infrastructure capabilities,
    •   3D airflow visualisation for easy identification of overcooling and hot spots,
    •   Ability to create work orders for changes made in a data centre,
    •   Support for virtual machines [36].

2.1.17. Sunbird (Raritan)

Sunbird’s solution consists of dcTrack and Power IQ software.
dcTrack is an easy-to-use DCIM solution for maintaining an accurate inventory of data centre assets and real-time views
of data centre floors, including equipment in rack cabinets such as servers, storage, networking equipment, PDUs, patch
panels and even applications. Furthermore, it is possible to monitor branch circuit panels, UPS and CRAC units. dcTrack
provides a map of all network and power connections between assets. The Sunbird capacity management solution
provides views of capacity, including space, power, network port availability and capacity of devices such as UPSs,
CRACs, PDUs and circuit panels. The Sunbird DCIM solution enables to manage changes performed in a data centre by
providing a complex workflow automation tool.
The Power IQ solution monitors and measures energy usage in a data centre, providing dashboards and reports to help
find savings in a data centre [37].

2.2. Asset Management Systems

Asset management is one of the feature which enriches the DCIM functionality. Ralph is an example of a lightweight
asset management system. It is based on the Django framework initially developed by the Allegro company for internal
purposes but finally made available as open source software to let other users develop, extend and tweak it.
For simplicity reasons Ralph is divided into separate modules called Assets, storing information about a data centre and
back office assets, Scan responsible for new hardware discovery or hardware changes, CMDB - a database for
configuration management data presentation.
The System layout is quite simple. Users can navigate from the DC view, through racks, down to a single asset. Most of
the configurable items are accessible through the Django administration module. The following figure presents a bird’s-
eye view of one of the rooms of the PSNC data centre as seen in Ralph. Free and occupied slots in racks are represented
by green and red colours. By clicking on any rack a user enters the rack view (Figure 2) with the assets deployed.

                                                                                                                      11
Despite the physical location (DC, room, rack) each asset has to be matched with project(s) in which it is used and the
environment (e.g. production, test). Additional information includes purchase details, depreciation rate, etc. If filled in,
this information is used in the accounting module of Ralph (Figure 3).

                                           Figure 2: Data Center view by Ralph system

                                                                                                                         12
Figure 3: Rack view with front and back panel

2.2.1. Costs management

In terms of costs management there is a dedicated accounting module called Scrooge. It is a combination of IT
management and accounting software. It brings billing functionality to Ralph.
Scrooge provides information and aggregated data on the use of resources from many other systems and charges other
services. Scrooge generates flexible and accurate cost reports and lets you refer to historical usages and costs.
In the administration module of Scrooge a user can set costs of team(s) or staff, support, licences, unit costs for selected
usage types, i.e. depreciation, power consumption or even usage types pulled from Openstack software (Ralph provides a
plugin for this): simple usage, tenants.
Ralph scrooge features also include reports presenting costs per service, costs and usages per single device, and
dependency structure (Figure 4).

                                                                                                                         13
Figure 4: Example of simple service cost report

All the information provided by Scrooge can help optimise the costs of internal services and departments by reviewing
their structure and dependencies.

2.2.2. Integration with external services

Ralph with its plugin-based architecture allows users to write their own plugins to talk to external services. It is mostly
used to gather information from external services, not the other way.
This feature has for instance been implemented at PSNC where a dedicated plugin was written to pull data (Energy
usage) from PDUs (Figure 5) and then to calculate the costs of the energy used by selected racks/projects. In this way, the
user retrieves information on how much money is spent to pay the bill to the energy provider.

                                            Figure 5: Energy usage for selected PDUs

In terms of Ralph integration with other services supporting DC management, where the data need to be sent, it has to be
taken into account that Ralph internal plugins operate on a daily basis (e.g. using operating system job scheduler - cron).
However, it is imaginable to tweak the system by writing a plugin(s) that would run / check desirable parameters’ values
(e.g. energy consumed by a cluster) and – in case of reaching a threshold value – send a message to the external system
using its API. Despite Ralph is not a typical monitoring system some basic functionality of this kind could be introduced.
An example of implementation is a data centre BMS system. If not already equipped with such functionality, the BMS
could be triggered (alerted) by Ralph in the case of high energy consumption, implicating a chain reaction other
activities, e.g. sending an email to the cluster admins to suspend or reschedule tasks or even switch off a part of the
cluster.
                                                                                                                        14
3. Information exchange

3.1. Communication protocols

3.1.1. BACnet (Building Automation and Control Network)

BACnet is a communication protocol for building automation and control networks. It is an ASHRAE, ANSI, and ISO
16484-5 standard protocol.
BACnet was designed to allow communication of building automation and control systems for applications such as
heating, ventilating, and air-conditioning control (HVAC), lighting control, access control, and fire detection systems and
their associated equipment. The BACnet protocol provides mechanisms for computerized building automation devices to
exchange information, regardless of the particular building service they perform.
It may be used by head-end computers, general-purpose direct digital controllers, application specific or unitary
controllers, and other sensor or actuator devices. IT defines data communication services and protocol for computer
equipment used for monitoring and controlling HVAC, refrigeration and other building systems. In addition, abstract,
object-oriented representation of information communicated between such equipment is defined. BACnet also provides a
comprehensive set of messages for conveying encoded binary, analogue, and alphanumeric data between devices.
It may use Ethernet devices, ARCnet devices, or a specially-defined RS-485 asynchronous protocol for twisted-pair
wiring that will operate at up to 78,000 bits per second as the communication medium.

3.1.2. CEBus - Consumer Electronics Bus

CEBus(r), (Consumer Electronics Bus), also known as EIA-600, is a set of electrical standards and communication
protocols for electronic devices to transmit commands and data. It is suitable for devices in households and offices to use
and might be useful for utility interface and light industrial applications.
An interim standard (IS-60) is under development by the Electronics Industries Association (EIA) primarily for use in the
residential market.
A special language deemed particularly suitable for home automation functions has been developed and is part of the
interim standard.
Main goals are to accommodate a variety of communication media such as power lines, twisted-pair wiring, infrared
signalling, and radio frequency signalling, to allow appliances with varying capabilities to use subsets of the CEBus
facilities, to encourage the development of low-cost interface devices, and to separate the intelligent operation of devices
from the communication infrastructure.
On the physical layer, the protocol signalling devices use a proprietary signalling method devised by the CEBus
committee that operates at 6700 bits per second. Since the same signalling method is used on all transmission media, it is
felt this scheme is good for implementing home automation in both new home construction and retrofit construction.
Both the protocol software and the application language are designed to be implemented on standard microprocessors and
to use CEBus-specified transmission media.

3.1.3. KNX

KNX is a standardised (EN 50090, ISO/IEC 14543), OSI-based network communications protocol for building
automation. KNX is the successor to, and convergence of, three previous standards: the European Home Systems
Protocol (EHS), BatiBUS, and the European Installation Bus (EIB or Instabus). The KNX standard is administered by the
KNX Association.
KNX is approved as an open standard to:
    •    International standard (ISO/IEC 14543-3)
    •    Canadian standard (CSA-ISO/IEC 14543-3)
    •    European Standard (CENELEC EN 50090 and CEN EN 13321-1)
    •    China Guo Biao (GB/T 20965)
KNX defines several physical communication media:
    •    Twisted pair wiring
                                                                                                                         15
•    Powerline networking
    •    Radio (KNX-RF)
    •    Infrared
    •    Ethernet (also known as EIBnet/IP or KNXnet/IP)
The access to the KNX specification used to be restricted. However, as of January 2016, the access is free.
KNX allows different bus topologies: Tree, line and star topologies. These topologies can be mixed as needed. However,
ring topologies are not allowed.
KNX is a fully distributed network which accommodates up to 65,536 devices in a 16 bit individual address space. The
logical topology or sub-network structure allows 256 devices on one line. Lines may be grouped together with a main
line into an area. Up to 15 Lines can be connected to a main line via a Line Coupler (LC) for a total of 16 lines.
One line consists of a maximum of 4 line segments, each with a maximum of 64 bus devices. Each segment requires an
appropriate power supply. Maximum segment length is 1000 m. 4 segments may be connected with line repeaters to
establish a network length of 4000 m and 256 devices. The actual number of devices is dependent on the power supply
selected and the power input of the individual devices. Note: Line repeaters may not be used on the backbone or main
lines. Note that KNX KNXnet/IP optionally allows the integration of KNX sub-networks via IP. As shown above, this
topology is reflected in the numerical structure of the individual addresses, which (with few exceptions) uniquely identify
each node on the network. Note: Each line, including the main line, must have its own power supply unit. Installation
restrictions may depend on implementation (medium, transceiver types, power supply capacity) and environmental
(electromagnetic noise etc.) factors. Installation and product guidelines shall be taken into account.

3.1.4. LonWorks

The technology has its origins with chip designs, power line and twisted pair, signalling technology, routers, network
management software, and other products from Echelon Corporation. In 1999 the communications protocol (then known
as LonTalk) was submitted to ANSI and accepted as a standard for control networking (ANSI/CEA-709.1-B). Echelon's
power line and twisted pair signalling technology was also submitted to ANSI for standardization and accepted. Since
then, ANSI/CEA-709.1 has been accepted as the basis for IEEE 1473-L (in-train controls), AAR electro-pneumatic
braking systems for freight trains, IFSF (European petrol station control), SEMI (semiconductor equipment
manufacturing), and in 2005 as EN 14908 (European building automation standard). The protocol is also one of several
data link/physical layers of the BACnet ASHRAE/ANSI standard for building automation.
Two physical-layer signalling technologies, twisted pair "free topology" and power line carrier, are typically included in
each of the standards created around the LonWorks technology. The two-wire layer operates at 78 kbit/s using
differential Manchester encoding, while the power line achieves either 5.4 or 3.6 kbit/s, depending on frequency.
China ratified the technology as a national controls standard, GB/Z 20177.1-2006 and as a building and intelligent
community standard, GB/T 20299.4-2006; and in 2007 CECED, the European Committee of Domestic Equipment
Manufacturers, adopted the protocol as part of its Household Appliances Control and Monitoring – Application
Interworking Specification (AIS) standards.
During 2008 ISO and IEC granted the communications protocol, twisted pair signalling technology, power line signalling
technology, and Internet Protocol (IP) compatibility standard numbers ISO/IEC 14908-1, -2, -3, and -4.

3.1.5. PROFIbus - (Process Field Bus)

The protocol is developed by Siemens, Bosch, and Klockner Moeller and was agreed upon as a German National
Standard Proposal by ZVEI, the German National (DIN) Trial Use Standard in August 1987. Work continued with
support from BMFT (Federal Ministry for Research and Technology) and expanded to include 13 manufacturers and 5
technical institutes.
Its goal is to allow low-cost support of a broad range of services and to provide interoperability and a transparent
application program environment for Field bus devices.
Physically it is a combination of token passing ring and master-slave data link control software that can be implemented
on general-purpose microprocessors and supports data rates of 9.6, 19.2, 90, 187.5, and 500 kbps. Separate application
processors are needed for the higher speed data rates. The initial specifications propose a 2, 4 or 6-wire physical bus
using EIA RS-485 compatible links. All transmissions are asynchronous character transmission.

                                                                                                                        16
It designed to support up to 127 physical addresses, but has had several extensions proposed to add additional physical
bus segments.

3.1.6. Modbus

Modbus is a serial communications protocol originally published by Modicon (now Schneider Electric) in 1979 for use
with its programmable logic controllers (PLCs). Simple and robust, it has since become a de facto standard
communication protocol, and it is now a commonly available means of connecting industrial electronic devices. The
main reasons for the use of Modbus in the industrial environment are:
    •    Developed with industrial applications in mind,
    •    Openly published and royalty-free,
    •    Easy to deploy and maintain.
It is a very low-level protocol that operates on raw bits or words without placing many restrictions on vendors.
Modbus enables communication among many devices connected to the same network, for example, a system that
measures temperature and humidity and communicates the results to a computer. Modbus is often used to connect a
supervisory computer with a remote terminal unit (RTU) in supervisory control and data acquisition (SCADA) systems.
Many of the data types are named from its use in driving relays: a single-bit physical output is called a coil, and a single-
bit physical input is called a discrete input or a contact.
The development and update of Modbus protocols have been managed by the Modbus Organization since April 2004,
when Schneider Electric transferred rights to that organization. The Modbus Organization is an association of users and
suppliers of Modbus compliant devices that seeks to drive the adoption and evolution of Modbus.

3.1.7. Meter-Bus

The M-Bus (Meter-Bus) is a European standard for remote reading of heat meters and it is also usable for all other types
of consumption meters as well as for various sensors and actuators [38].
With its standardisation as a galvanic interface for remote readout of heat meters, this bus will be of great importance for
the energy industry as relevant users.
The remote reading of heat meters can take place in different ways, beginning with the classical method - manual reading
by the personnel of the providers - up to the remotely controlled collection of all the meter values for a complete housing
unit. The latter is a logical continuation/extension of the technical development of consumption meters and is realisable
with the help of the M-Bus.
The standardisation of the M-bus results in further technical possibilities. In particular devices of different manufacturers
can be operated on the same bus. The users are, therefore, free in the choice of the manufacturer. On the other hand, an
increase of the market can be expected, also regarding other M-bus based counters, so that with the highly variable
configuration options, even difficult problems can be solved.

3.2. Description of APIs

In any given data centre, sensor measurements and other valuable information can be gathered from different sources at
different levels of the infrastructure: the building infrastructure, the IT system hardware, the system software, and the
application level. On any of these levels, a wide variety of different physical and electrical specifications and protocols –
many of them proprietary – exist which make data collection from all those different sources a tedious task. There exist,
however, a few approaches to combine them in a single API in order to standardise and ease access: Redfish, PowerAPI,
PowerDAM. However, while all of these solutions cover multiple levels of the infrastructure, none covers all. So for
now, a comprehensive solution is still not available.

3.2.1. Redfish

Redfish is a specification by the Distributed Management Task Force, Inc. It is backed by large system vendors such as
Dell, HPE, Fujitsu, and Lenovo but also by other players in the server business like Βroadcom, Intel, SuperMicro, and
VMW. The focus is on scalable monitoring and management of IT equipment, and v1.0 of the Redfish specification
basically is a drop-in replacement for the industry standard IPMI. It allows for retrieving basic server information and
sensor data (e.g., serial numbers, temperatures, fans, power supply) and facilitates remote management tasks such as

                                                                                                                          17
reset, power cycle, and remote console. The main benefit of Redfish over IPMI is its use of an OData-compliant RESTful
API with JSON data representation. This allows for easy to implement and interoperable clients as suitable libraries are
available for most architectures and platforms.
As of now, the Redfish specification covers only IT equipment. However, by design Redfish is easily extendible as its
data model is independent of the protocol and defined by a machine-readable schema. There are plans to extend the scope
to cover also power and cooling.

3.2.2. Power API

Power API is an initiative driven mainly by Sandia National Laboratories and National Renewable Energy Laboratory
with support from industrial partners such as Cray, HPE, IBM, and Intel. The main focus is to develop a comprehensive
system software API for interfacing with power measurement and control hardware. As such it covers the IT system
hardware, the system software, and the application level. It defines multiple roles like facility manager, HPC user, and
HPC admin, but also non-human roles such as HPC applications or workload managers. Each role interacts with the
system at different levels and each such interaction represents an interface in the Power API.
A system in Power API is described by a hierarchical representation of objects, where objects can be everything from a
cabinet or chassis down to a CPU and its individual cores. Objects can be heterogeneous and may extend to custom
objects types. Each object has certain attributes (e.g., power cap, voltage) that can be accessed depending on the role of
the requester. Get/Set functions enable basic measurement and control for the exposed object attributes, whereas the
statistics interface allows for gathering data on one or more attributes of an object over time. The metadata interface
provides information quality, frequency, and other characteristics of measurement or control. Objects can be grouped to
allow for bulk access to their attributes and statistics or trigger parallel actions.

3.2.3. PowerDAM

PowerDAM [11] is a tool developed at LRZ to allow the collection and analysis of data from different data centre
systems. Its approach is different from Redfish or the PowerAPI in the sense that PowerDAM is not a specification nor
does it try to replace the functionality of already existing tools. Rather PowerDAM is used to centralise all sensor data
and measurement information related to power and energy from all data centre systems (covering Pillar 1 – data centre
building infrastructure, Pillar 2 – HPC systems, and Pillar 3 – system software, mainly scheduling information, of the 4
Pillar Framework).
PowerDAM provides a plug- in infrastructure for system data collection and for reports over the collected data allowing
for a relatively easy extension of the tool. It also provides a framework for defining virtual sensors which can be a
combination of multiple physical as well as other virtual sensors. For example, the CooLMUC cluster power
consumption is the sum of the power consumption of all nodes, of the networking equipment, and of the internal cooling
circuit. Currently, the collected data is used to calculate the Energy-to-Solution of applications, to calculate data centre
Key Performance Indicators (KPI’s) such as Power Usage Effectiveness (PUE) and Data centre Workload Power
Efficiency (DWPE), and for power and energy prediction of applications.

4. Integration of Infrastructure and HPC resource management systems

A current trend in data centre and IT infrastructure management practice is the integration of infrastructure and IT
management solutions. This includes tools for monitoring, measuring, managing and/or controlling IT assets usage,
energy consumption of the IT equipment (servers, storage and network switches) as well as the facilities and
infrastructure components (power distribution units, air conditioners, temperature and humidity sensors). In the literature
[1] this is referenced as Data Centre Infrastructure Management or DCIM.
DCIM uses various inputs such as data from the Building Management System – BMS, usage reports, batch system
accounting, logging reports, etc. Then it performs analysis of such data in order to create a new set of data that is
actionable data. Such data contain information that can be used to manage the data centre.
The set of data that is of interest for DCIM can be categorised as follows:
    1.   Cooling facilities data, Building management system data,
    2.   Environmental data, temperature humidity, etc.,
    3.   IT hardware assets and their configuration,
    4.   Power and energy consumption related data,
    5.   Power management and power capping,
                                                                                                                         18
You can also read
Next slide ... Cancel