The simple rule of data centre power management is that actions have consequences, and consequences require action. The BA example demonstrates once again that misunderstanding power is a common problem. Two-thirds of data centre professionals in Eaton's research were not fully confident about power, and until organisations get to grips with power management we can expect to see more power-related outages. There is also a profound concern around skills availability: it is hard to acquire and retain the relevant expertise, whether for designing for energy efficiency, managing consumption on an ongoing basis, or dealing with power-related failures quickly and effectively enough to avoid or mitigate outages.

Have you tried switching it off and on again?

Should a full power outage occur, it is absolutely imperative to have a disaster recovery process in place that clearly defines the steps to be taken when re-energising the data centre, detailing which systems must be brought back online first. In a full outage, where people are in a state of panic and under pressure to resume normal service, staggering the re-energisation of systems may seem counter-intuitive – the goal, after all, is to get back online as quickly as possible – but such a process helps to avoid prolonging the outage. Restoring a data centre after it has gone black needs to be done gently and in a clearly defined, methodical fashion; simply trying to get everything back up in a hasty, unplanned way will only cause in-rush that can trigger further outages, quickly crippling the data centre again.

Power management is all about understanding the dependencies between the different parts of the power system and the IT load, and having appropriate levels of resilience in the hardware, software and processes. Recovering from an outage requires patience and a systematic process – two things that were seemingly missing, according to reports on BA's outage. No data centre professional has ever asked 'have you tried switching it off and on again?' The skill is to pace oneself and follow each step in turn, controlling and monitoring a phased restart so that batches of systems are only brought online when it is safe to do so and one is sure of the correct phase balancing and loads. Skipping steps in the rush to get back online can create a power surge, overloading circuits, tripping breakers and, to put it mildly, causing chaos.
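To make the idea of a controlled, phased restart concrete, the sketch below shows one way such a runbook might be automated. It is only an illustration: the restart batches, the 80% load threshold and the check_feed_load() and power_on() helpers are hypothetical stand-ins for whatever monitoring and control interfaces a particular facility actually exposes.

```python
import time

# Hypothetical restart plan: bring systems online in dependency order,
# in small batches, and only when the power feeds have headroom.
RESTART_BATCHES = [
    ["core-network", "dns", "dhcp"],        # connectivity first
    ["san-a", "san-b", "backup-target"],    # storage before compute
    ["hypervisor-01", "hypervisor-02"],     # virtualisation hosts
    ["booking-app", "crm", "mail"],         # business applications last
]
MAX_FEED_LOAD = 0.80   # assumed safety margin: 80% of feed rating
SETTLE_SECONDS = 120   # let in-rush and phase loads settle between batches

def check_feed_load() -> float:
    """Placeholder: worst-case utilisation across all feeds and phases."""
    return 0.5  # replace with a real query to the PDUs / building management system

def power_on(system: str) -> None:
    """Placeholder: energise one system via its PDU outlet or management card."""
    print(f"Energising {system}")

def phased_restart() -> None:
    for batch in RESTART_BATCHES:
        # Refuse to energise the next batch until every feed has headroom.
        while check_feed_load() > MAX_FEED_LOAD:
            print("Feed load too high, waiting...")
            time.sleep(30)
        for system in batch:
            power_on(system)
        time.sleep(SETTLE_SECONDS)  # allow loads to stabilise and be verified

if __name__ == "__main__":
    phased_restart()
```

The specific names do not matter; the discipline does. Dependencies are expressed explicitly, and nothing is energised until the measured load says it is safe to do so.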
Resilience and infrastructure upgrades

Alongside skills and processes, the facilities infrastructure itself often needs upgrading to meet today's efficiency, reliability and flexibility expectations. Around half of the respondents in Eaton's survey report that their core IT infrastructure needs strengthening, and the figure is closer to two-thirds when it comes to facilities such as power and cooling. Power management is increasingly becoming a software-defined activity. Given the skills gap, software can play an important role in bridging the divide between IT and power by presenting power management options in dashboards that are familiar to an IT audience, making power easier to understand and even automating the management of the power infrastructure. This could have prevented the outage that faced BA, as automated processes would have brought systems back online in a controlled and monitored fashion.

We have moved towards more virtualised environments in data centres, and IT and data centre professionals are familiar with using virtualisation to maintain hardware, so why not apply the same principles to power? It is important that power distribution designs, and the associated resiliency software tools, are compatible with all the major virtualisation vendors to future-proof the infrastructure. This approach enables data centre professionals to carry out concurrent maintenance, mitigating the risks of infrastructure maintenance and upgrades.
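As a rough illustration of that principle, the sketch below shows the kind of logic a power-aware resiliency tool might apply: when a UPS reports that it is running on battery, the virtual machines on the affected hosts are live-migrated to hosts fed from a healthy supply before any maintenance or shutdown proceeds. The ups_status(), vms_on_host() and live_migrate() calls are hypothetical placeholders, not any particular vendor's API.

```python
# Sketch: applying virtualisation principles to power events.
# Every function below is a hypothetical placeholder for whatever
# UPS-monitoring and hypervisor APIs are actually in use.

UPS_TO_HOSTS = {
    "ups-a": ["hypervisor-01", "hypervisor-02"],
    "ups-b": ["hypervisor-03", "hypervisor-04"],
}

def ups_status(ups: str) -> str:
    """Placeholder: return 'online' or 'on-battery' from the UPS management card."""
    return "online"

def vms_on_host(host: str) -> list[str]:
    """Placeholder: list the virtual machines currently running on a host."""
    return []

def live_migrate(vm: str, source: str, target: str) -> None:
    """Placeholder: ask the hypervisor to live-migrate a VM between hosts."""
    print(f"Migrating {vm}: {source} -> {target}")

def evacuate_at_risk_hosts() -> None:
    # Hosts whose UPS is still on mains power can accept migrated workloads.
    healthy = [h for ups, hosts in UPS_TO_HOSTS.items()
               if ups_status(ups) == "online" for h in hosts]
    for ups, hosts in UPS_TO_HOSTS.items():
        if ups_status(ups) != "on-battery":
            continue
        for host in hosts:
            for vm in vms_on_host(host):
                if healthy:
                    # Move workloads off the at-risk feed before the battery
                    # runs out or planned power maintenance begins.
                    live_migrate(vm, host, healthy[0])

if __name__ == "__main__":
    evacuate_at_risk_hosts()
```

Whether the trigger is a planned maintenance window or a UPS alarm, the effect is the same: power work can proceed concurrently with normal operation because the workloads have already been moved out of harm's way.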
Learning lessons

While we may never fully understand what happened inside BA's data centre, it is near guaranteed that it will not be an isolated incident across the wider data centre industry, even if we are unlikely to see anything on the same scale for a long time. The issue comes down to poor preparation or poor implementation of disaster recovery. Better preparation of the disaster recovery process would have seen it designed with resilience in mind: first, the DR site should have kicked in to cover demand during the outage; second, the hardware and applications should have been restarted in a far more controlled manner. Reintroducing power to systems in a slow, phased manner would have allowed a smooth and steady recovery. We, as a data centre industry, need to make sure we all learn the lessons from BA's high-profile outage and act to ensure that effective power management is a 'must have', not a 'nice to have'.