Introduction to Hybrid IT Service Management Challenges
A Cloud service is rarely implemented as one service. It usually needs to be integrated into the existing technology architecture and operating model. This results in different organizations being responsible for managing the service level. In addition to this change, the overall service level expectations have also changed.
For example, the only acceptable service level today for an Application service is:
• 100% available
• 100% reliable
• Always fast, measured at the screen of the user, no matter the device.
• Always secure
The challenge we face today is how to manage a service to achieve this service level in a Cloud/Hybrid IT environment when an application service is delivered through multiple partners and technologies. The compound impact on the reliability of a service as a whole increases with each additional component or service provider added to the service. For example, assume an application service consists of the following elements:
• Mobile application
• Mobile device
• Mobile carrier
• Mobile network
• Data Center network
• Application Server
• Database Server
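There are seven components in this system. Let’s assume the reliability of each element is 99.9% and that the service works only when all seven do. The reliability of the whole is then 99.9% to the seventh power, or about 99.3%, which amounts to roughly 61 hours of downtime per year.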
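The arithmetic behind this compound effect can be sketched in a few lines of Python. This is a minimal illustration, assuming the seven components sit in series and fail independently; the function names and the 8,760-hour year are assumptions made for the example.

```python
# Assumption: a non-leap year of 365 days.
HOURS_PER_YEAR = 365 * 24

def compound_availability(availabilities):
    """Availability of a serial chain, where every component must be up."""
    total = 1.0
    for availability in availabilities:
        total *= availability
    return total

def annual_downtime_hours(availability):
    """Expected hours per year during which the service is unavailable."""
    return (1.0 - availability) * HOURS_PER_YEAR

# The seven elements from the example, each assumed to be 99.9% reliable.
components = {
    "Mobile application": 0.999,
    "Mobile device": 0.999,
    "Mobile carrier": 0.999,
    "Mobile network": 0.999,
    "Data Center network": 0.999,
    "Application Server": 0.999,
    "Database Server": 0.999,
}

service_availability = compound_availability(components.values())
downtime_hours = annual_downtime_hours(service_availability)

print(f"Service availability: {service_availability:.2%}")
print(f"Expected downtime: {downtime_hours:.1f} hours per year")
```

Adding an eighth component at 99.9% to the dictionary lowers the result further, which is the point of the example: every additional provider in the chain reduces what the service as a whole can deliver.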
Furthermore, application performance has now become even more critical to manage. Systems are considered down if they get slow. Now consider again a Hybrid IT/Cloud application service with multiple service providers that all contribute variation in performance. Figuring out why a service is slow is virtually impossible the way IT is currently managed.
Areas where a change in Service Management is required
The historical approach to Service Management is that each component in the architecture is managed independently. Unfortunately, this approach will not allow us to achieve the new service levels we now must deliver. We need to control and monitor the service as a whole and integrate the Service Management processes, Service Management responsibilities, and Service Monitoring technologies into one integrated system. To accomplish this, we need to make changes in the following three specific areas:
1. Service Management processes and roles
2. People performance measurement and compensation systems
3. Service monitoring tools
1. Service management processes and roles
The following graphic outlines the work, roles, and OACA domains that service management involves. The overall responsibilities of the Shared Responsibility model are as follows:
• Ensure security of applications, infrastructure, data, facilities, and networks
• Facilitate unified demand management
• Facilitate unified service level management
• Facilitate information lifecycle management
• Facilitate the efficient use of assets
The work required to accomplish this is:
• Security management
• Demand management
• Service management
• Operations management
• Information lifecycle management
Historically, these groups operated independently. However, they now need to be integrated into one unified force that looks at the service as a whole. It starts with everyone agreeing that services need to be measured at the user’s screen and not at the individual parts.
In the new model, a Service Manager needs to be responsible for the service as measured from the user’s point of view. The SLA should also be developed from the customer’s point of view. Generally, an SLA will be related to the application’s performance, or the response time for resolving an incident on that application, even though the service is dependent on several cloud or underpinning service providers.
To appreciate the magnitude of this change, consider our previous application example. A Service Manager now needs to negotiate and understand the service level requirements for all of these components of the service:
• Mobile application
• Mobile device
• Mobile carrier
• Mobile network
• Data Center network
• Application Server
• Database Server
The Service Manager also needs to forecast the demand for the new application service and the impact on all underpinning components to ensure that sufficient supply is proactively in place as the overall service demand increases.
Next, the service level agreements need to be rewritten to reflect the new measurement criteria. Each application needs to be held to performance, availability, and reliability expectations measured at the screen of a user.
It is recommended that a set of key transactions that a user needs be the basis of the SLA, and that the Response Time performance, Service Availability, and Service Reliability be defined and included in the SLA. This will give the service manager a common measurement test for understanding the impact on each of the underpinning components in the application’s architecture. This may seem simple enough, except that some cloud providers do not guarantee the performance of their services. They only provide an SLA on uptime, so it may not be possible for a Service Manager to develop a performance service level measure in the SLA.
This limitation may require the Service Manager to recommend that a Cloud Provider not be used and that the service be delivered using an alternative method, such as a dedicated server in a company’s data center, or be rebuilt using a scale-out architecture. A service cannot achieve a service level if it is not initially designed to meet that service level. A Service Manager needs to detect issues with the proposed service architecture and identify where changes need to be implemented before it goes into production.
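One way to picture an SLA built on key transactions is as a small set of targets checked against measurements taken at the user's screen. The sketch below is illustrative only: the class names, the field names, the choice to judge response time by the slowest observed sample, and the sample figures are all assumptions, not a prescribed SLA format.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TransactionSla:
    """Targets for one key user transaction (hypothetical structure)."""
    transaction: str
    max_response_seconds: float
    min_availability: float  # share of attempts that got any response
    min_reliability: float   # share of responses that were correct

@dataclass
class Sample:
    """One measurement taken at the user's screen."""
    responded: bool
    correct: bool
    response_seconds: Optional[float] = None

def evaluate(sla: TransactionSla, samples: List[Sample]) -> dict:
    """Compare measured samples against the three SLA targets."""
    if not samples:
        raise ValueError("at least one sample is required")
    responded = [s for s in samples if s.responded]
    availability = len(responded) / len(samples)
    reliability = (
        sum(1 for s in responded if s.correct) / len(responded)
        if responded else 0.0
    )
    # Assumption: response time is judged on the slowest observed response.
    slowest = max((s.response_seconds for s in responded), default=float("inf"))
    return {
        "availability_ok": availability >= sla.min_availability,
        "reliability_ok": reliability >= sla.min_reliability,
        "response_ok": slowest <= sla.max_response_seconds,
    }

sla = TransactionSla("Submit order", max_response_seconds=2.0,
                     min_availability=0.99, min_reliability=0.99)
samples = [
    Sample(responded=True, correct=True, response_seconds=0.8),
    Sample(responded=True, correct=True, response_seconds=1.4),
    Sample(responded=True, correct=True, response_seconds=2.6),
    Sample(responded=False, correct=False),
]
result = evaluate(sla, samples)
print(result)
```

Because the same transaction test is applied no matter which provider sits behind it, the result can be traced back through each underpinning component when a target is missed.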
Service performance is an Achilles heel for many cloud providers. A Service Manager must deeply understand the limitations of all cloud providers used in their service.
It is also the responsibility of the Service Manager to set the expectations of the service as a whole for a customer in their SLA.
2. People performance measurement and compensation systems
A primary challenge in many IT organizations is that many IT people work in silos and have never been responsible for delivering an end-to-end service. Technology personnel need to start thinking like a cell phone provider or electricity provider. Everyone knows that electricity’s service level is measured at the user’s “switch” and it is either working or not. When the power goes out, the transmission line department doesn’t say it’s not their problem because the transmission lines are performing at 100%. Instead, the electricity provider knows the blackout area immediately and how many people are impacted and can triage and troubleshoot the incident down to the component level.
Once the failed component is repaired, the electricity is not turned on for everyone at once, because that would merely cause another outage due to power surges. Instead, the electricity provider knows their customer base and first restores power to the portions of the grid that contain critical customers such as hospitals.
In most Technology organizations, the measurement and compensation systems reward each group for its own area of responsibility. In the Cloud/Hybrid IT paradigm, Technology staff should be paid on the performance of the service as a whole, and not on the performance of their own individual components. All Technology staff also need to be transformed into an integrated, unified force that is focused on managing the service as a whole.
The level of collaboration needs to be much more than merely transactional using email. It needs to be an integrated partnership with groups wholly dependent on one another. Here are some examples of this type of operating model:
• Collaboration between internal/external groups to develop joint or integrated monitoring systems
• Facilitating operational reviews for performance & SLA metrics
• Joint efforts for troubleshooting incidents and analyzing the cause
The service levels need to become the overall measure of all the individuals in the architecture. No longer will it be okay to achieve a 99.9% reliability metric on a network or a database or some component in the architecture and still be deemed as doing the job well. In the new Shared Responsibility model, if the overall service level measured at the user screen fails, everyone fails.
David Packard of Hewlett Packard fame once said, “Tell me how a person is measured and I will tell you how they will behave.” This statement is very accurate in the case of service management. If employees are measured only on the performance of their part of the architecture, then they will not be concerned about managing the service as a whole.
The measurement and compensation systems need to be changed to reward the required service levels. This one change alone will do the most to focus attention on achieving the service level measured from the user’s point of view. All staff, including managers, need to be compensated based on the Application service level measured at the user screen. Bonuses and compensation should be directly connected to reaching the application service levels.
3. Service management architecture
The next area of change is to the technical architecture of the service monitoring and management tools. Historically, tools have been chosen to enable personnel within a specific functional domain. For example, the networking team chose their tools; the database team chose their tools. The data collected from these tools was kept within that domain and not shared widely. In fact, in most cases, if the information is needed in another area, it has to be requested. There is no access to that data outside that group.
Also, in most cases, the data is not used for predictive analytics. To achieve a 100% reliability service level, the service management team needs to be able to foresee incidents before they occur. This capability will require a change in the way the data is collected, integrated, and analyzed.
These demands will also require a significant change in the willingness of the different groups to move to:
• One common service management architecture
• One common service management database and
• One common set of service management analytic tools
Below is an example of a technical architecture that reinforces a common Service Management Architecture. The tools are designed to look at a service as a whole, not just the individual components.
In most firms, this will require a significant overhaul of the current system management architecture. Technology will also need to integrate internal service management systems with cloud providers.
Most cloud providers will not provide service management information on their internal systems, so the new service management architecture needs to be able to detect the performance, availability, and reliability of cloud providers by treating them as black boxes. This will require the addition of monitoring tools for each cloud provider, linked to the transactions delivered by that provider.
Historically, each System Management tool has stored its data in its own repository. In the new model, all System Management tool data is integrated into one database and analyzed by one analytic toolset. An essential addition to this new architecture is the set of data analytics applications that will allow the service management team to do advanced predictive analytics and foresee incidents before they occur.
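Black-box monitoring of this kind can be sketched as a probe that runs a transaction against the provider from the outside and records only what a user would see: whether it worked and how long it took. The following is a minimal sketch; the `probe` function, its parameters, and the two stand-in transactions are hypothetical and do not represent any provider's API.

```python
import time
from typing import Callable, Dict

def probe(transaction: Callable[[], None], target_seconds: float) -> Dict[str, object]:
    """Run one transaction from the outside and record the observed outcome."""
    start = time.perf_counter()
    try:
        transaction()
        succeeded = True
    except Exception:
        # From the outside, any error simply counts as a failed transaction.
        succeeded = False
    elapsed = time.perf_counter() - start
    return {
        "succeeded": succeeded,
        "elapsed_seconds": elapsed,
        "within_target": succeeded and elapsed <= target_seconds,
    }

# Stand-ins for real calls to a provider (assumptions for the example).
def healthy_transaction() -> None:
    time.sleep(0.01)

def failing_transaction() -> None:
    raise ConnectionError("provider unreachable")

healthy_result = probe(healthy_transaction, target_seconds=1.0)
failing_result = probe(failing_transaction, target_seconds=1.0)
print(healthy_result)
print(failing_result)
```

Running such probes on a schedule for each provider, and storing the results alongside internal monitoring data, gives the service management team a view of provider behavior without needing access to the provider's internal systems.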
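As a very simple illustration of foreseeing an incident, the sketch below fits a straight-line trend to response-time measurements for one key transaction and estimates how long until the trend crosses the SLA threshold. Real predictive analytics would be considerably more sophisticated; the function name, the least-squares approach, and the sample data are assumptions chosen for the example.

```python
from typing import List, Optional, Tuple

def hours_until_threshold(
    samples: List[Tuple[float, float]], threshold: float
) -> Optional[float]:
    """Estimate hours from the last sample until a linear trend reaches threshold.

    samples is a list of (hour, value) pairs. Returns None when the trend is
    flat or improving, or when there is too little data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    if sxx == 0:
        return None
    slope = sxy / sxx                      # least-squares slope
    intercept = mean_y - slope * mean_x    # least-squares intercept
    if slope <= 0:
        return None
    crossing_hour = (threshold - intercept) / slope
    last_hour = samples[-1][0]
    return max(0.0, crossing_hour - last_hour)

# Hypothetical hourly response times in milliseconds for one key transaction.
response_times_ms = [(0, 1000), (1, 1100), (2, 1200), (3, 1300), (4, 1400)]
remaining = hours_until_threshold(response_times_ms, threshold=2000)
print(f"Estimated hours until the 2000 ms threshold is reached: {remaining}")
```

The value of the integrated database is that a trend like this can be correlated with data from every component in the service, rather than being visible only to the team that owns one tool.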
Close
The transformation to a shared responsibility service management model is one of the most challenging activities a Technology organization has to achieve, due to the magnitude of the changes in process, people, and technology. However, to meet the service levels that are now required, we need to accomplish these changes.
For additional information on the OACA Cloud Maturity Model, go to: https://www.oaca-project.org/cmm40/
About the Author:
Bill is the Digital Strategist for Liam Associates Inc. Formerly the Cloud Chief Technologist for Hewlett-Packard Enterprise Canada, Bill has provided Hybrid IT and IoT Strategic Planning advisory and planning services to over fifty Private and Public sector clients to help them migrate to a Hybrid IT Cloud Operating model. These transformation plans have helped both government and industry reduce the cost of IT, re-engineer their IT governance models, and reduce the overall complexity of IT. Bill is also a member of the Open Alliance for Cloud Adoption Team and has co-authored several documents on Cloud Maturity and Hybrid IT implementation.