DevOps, SRE or ITIL – Know Before You Leap!

In an era of Continuous Integration, Continuous Delivery and Automation, implementing a solid IT Service Management strategy is important for organizations to succeed at Digital Transformation. There are several IT Service Management frameworks available today, and the possibilities and processes stemming from each framework are often overwhelming. While all the Service Management methodologies are closely connected, we will discuss DevOps, SRE and the latest ITIL 4 service management framework, and how they compare with each other: the vision and values governing each framework, and the guidance each provides when embarking on this journey.


DevOps is an umbrella concept that advocates a collaborative working relationship between Development and Operations. It aims to achieve an adequate velocity of software and services for the line of business (i.e. high deploy rates) while simultaneously increasing the reliability, stability, resilience and security of the production environment.


SRE, or Site Reliability Engineering, is Google's approach to service management and emphasizes the development of systems and software that increase the reliability and performance of applications and services.


ITIL 4 is the latest evolution of the well-known service management framework from Axelos. By adding the new service value system to ITIL's core guiding principles, it emphasizes service quality and consistency, and aims to improve stakeholder satisfaction by ensuring value from the stakeholders' perspective.


We will discuss how an organization can decide which service management methodology to adopt to best deliver business value and to ensure a successful transformation powered by operational excellence.


All three methodologies can coexist; however, adopting DevOps, SRE or ITIL is as much a cultural and behavioral transformation for the organization and its people as it is a technological and process change. Organizations need to continuously adapt and adopt, upskill and scale up to keep pace in the continuously evolving digital world.


Meenal Meenaakshi

Product Landscape Owner, SAP

Transcript

00:00:13

Hello, and a warm welcome to all of you at DevOps Enterprise Summit US. Let me give a short introduction about myself. I'm a product landscape owner at SAP Labs in India. I have close to two decades of experience in service and product delivery management, where I have led several digital transformation projects within my organization. I love to follow digital transformation journeys across industries and organizations, and to speak at forums like this one about the information and experiences I gain along the way. One of the topics that has often come out as being of paramount importance is the implementation of a solid IT service management strategy within organizations for a successful digital transformation. So today I would like to take this opportunity to give a short overview of the evolution of IT service management frameworks themselves, and take a deeper look at the three most well-known IT service management methodologies widely adopted across organizations: DevOps, SRE and ITIL.

00:01:37

What are the vision, values and guiding principles they offer? How do they compare with each other? What are the commonalities and where are the key differences? And finally, how can an organization decide which methodology best fits its requirements, in alignment with its strategic goals and objectives? And why is there a need to continuously adapt and adopt? Now, if you look at traditional IT, it was seen more as a service and support organization delivering technology solutions. But in the current era of industrial revolution, IT is no longer delivering only technology solutions to the business; it is the business itself. IT and business are fast converging. The digital services being offered today are all customer oriented and value driven, and they need to be managed in a way that not only supports the business in fulfilling its requirements but also supports its growth, because your product will have value for customers and for the business only when it delivers the expected services and provides the outcome the business expects.

00:03:00

So how do we know whether or not our services deliver business value to the customer? What is business value? Business value is the differentiated experience that customers get on consuming our products and services. As an organization, it is extremely important to continuously keep a check on whether or not we have been delivering value to the customer. Value could be cost value, experience value or platform value. You could be offering a product or service at a much lower cost for the business or for consumers; you could be providing such an experience that consumers are ready to pay additional costs just for that experience; or you could be offering such a vast and robust platform that a customer would not want to switch to any other platform or service. But how do we know this?

00:04:06

How do we know whether our product or service is adding value to the customer? How do we know what the customer's future demands and needs will be? This requires data. This requires analysis, because only when we have data can we take informed decisions. This is where we need the support of all the digital tools and technologies available: AI/ML, reporting, dashboards and metrics. Armed with this information, once we have the status, we need to really think and decide: are we doing the right thing? We need to be hyper aware. We need to be able to take informed decisions so that we can execute on them fast. We really need to demonstrate digital agility, because the current need of the customer may no longer be valid tomorrow. We have to continuously adapt and adopt according to the business requirements, because if we do not, our competitors will.

00:05:10

And that would mean the end of business for us. When we know that, yes, we are on the right path, then we need to understand: are we doing things right? Will we really be able to deliver what we have gathered from the information collected, take an informed decision, execute fast and deliver it to the customers and the business? This is where the guiding principles, processes and technologies of IT service management, and the different frameworks available, guide and support us in delivering our services faster, with velocity and with quality. And be it any IT service management framework we use today, Lean, DevOps, SRE or ITIL, it all finally boils down to delivering value to customers. So with this background, let us now go one step deeper to understand how DevOps and SRE compare with each other and what they have to offer. DevOps is an umbrella concept that advocates different teams coming together and working as one system.

00:06:29

It involves people from across the organization and from different teams, be it development, design, UI, security, documentation, testing or operations: all the teams that contribute to the value stream come together to work as one common team. This spans everything from planning to building, continuous integration, deployment, operations and continuous feedback, which flows back into your development pipeline. DevOps suggests three ways to achieve this, well known as the Three Ways of DevOps. The First Way talks about thinking of it as a system, where work should flow as fast as possible from left to right: from development, through testing, QA, regression, security and the different nodes across the development life cycle, to operations, and finally to the customer, because that is where value gets created.

00:07:45

And value is created, or seen, only for the finished product; work in progress adds no value. Hence it is extremely important to keep work in smaller pieces flowing through a smaller development pipeline, which ensures faster flow. Keep the development life cycle continuously moving with smaller chunks of work that are developed and shipped to the customer. This is what increases the flow from left to right, and the value gets created at the customer's end. The Second Way talks about amplifying the feedback loops, because when you have increased the flow of work, you also have to ensure that you get fast feedback flowing from right to left, not only from the customers at the extreme right but from each and every node of the development life cycle.

00:08:42

So this means not only an increase in the number of feedback loops across your development pipeline, but also an increase in the frequency at which you receive feedback and continuously improve upon it. The Third Way talks about creating a culture of continuous experimentation and learning: fail often and fail early, because failure should be considered an opportunity to improve and to innovate. If we inculcate this culture of continuous experimentation and learning, we not only enable risk taking but also build a lot of confidence within our teams. The system then becomes a melting pot of new ideas and innovations, which further increases the velocity and quality of the work and services being delivered. So DevOps can be considered a culture where people from different disciplines work together to design, develop, deploy and run our systems.

00:09:48

So if we are trying to implement a DevOps setup within an organization as part of our digital transformation journey, what are the guiding principles that DevOps offers and that should be considered? Collaboration is key, because here we are talking about bringing together different teams that earlier used to work in silos and were responsible only for their own area of work. Since all the teams are coming together, it is extremely important to build a culture of collaboration where every team works together in sync. Every team has to take end-to-end responsibility and accountability for the entire work delivered to the customer. Every team should focus on automating anything and everything possible by treating everything as code.

00:10:50

This not only helps in improving the flow but also enables faster CI/CD. It helps in faster resolution of issues, since all the teams are working together for a common purpose, and it also helps ensure a stable environment. It adds a lot of technical value within the organization and provides a lot of business value for the customers, because it increases both the quality and the velocity of work. You spend less effort fixing issues detected or raised by customers and more on new features, thus reducing the overall TCO. Knowledge and upskilling: invest in your people, invest in sharing knowledge. This is extremely important because DevOps talks about different teams coming together, with each and every team responsible and accountable for the entire work and the entire system.

00:11:52

So it is extremely important that we do not only raise specialists in one particular domain; we really need well-rounded people with cross-functional skills. Fail to learn: as we already discussed, fail early and fail often, because failure should be considered an opportunity to improve and further optimize our processes and systems. Continuous improvement: based on the feedback we receive through the feedback pipelines set up at every node, with an increased frequency and an increased number of feedback loops, we bring continuous improvement into our service delivery. This brings a lot of cultural value into your organization; this people-first approach brings a lot of innovation, motivation and efficiency, leading towards a successful digital transformation. Now, let us look at how Google talks about addressing these customer values.

00:12:59

Site reliability engineering is Google's take on delivering value to customers. It talks about building end-to-end reliability. Site reliability engineering is more of a post-production set of processes and activities for systems at scale, and it operates on the principles of prevent, recover and optimize. Prevent: do whatever you can to keep an issue from reaching production. This could begin right from providing input to the architecture, influencing and ensuring resilient development and resilient testing, intelligent alerting mechanisms and proper health-check monitoring, so that we can identify and fix issues before they reach production. Recover: once an incident occurs, ensure with intelligent alerting and self-healing mechanisms that your system is robust enough to minimize the mean time to restore and the mean time to recover, so that you can recover as fast as possible. Optimize: once you have recovered, focus on optimization and do a post-mortem of the issue that happened.

00:14:15

Do the root cause analysis, find the issue, share the learnings and ensure that it is completely removed from the system so that it does not happen again, once more by providing input to development and architecture. One of the biggest contributions of optimization is eliminating toil. Toil is anything and everything that involves a continuous, repetitive or manual set of actions and activities. This is the engineering part of site reliability engineering, where we continuously engineer and re-engineer the system to make it more robust, more optimized and more stable. And this can happen only when you take a software engineering approach and apply it to infrastructure and operations. Site reliability engineering is considered a discipline that incorporates concepts of software engineering and applies them to infrastructure and operations problems.

00:15:31

So what are the guiding principles and values that site reliability engineering has to offer? As in DevOps, collaboration is key here too. Site reliability engineers have to collaborate with other engineers, product owners, customers and other stakeholders to come up with an aligned service level objective (SLO) for the service being delivered. SRE encourages you to have defined and aligned objectives for each level of service you are offering, because this is what sets a benchmark and pushes you to focus and bring in the flow to reach it. Once you have a defined SLO in place, plan how much change can be delivered and at what frequency, and stop when you have used up the error budget. Automation is key. It is key anyway when you plan to deliver continuous development to customers.

00:16:42

But automation is also key in maintaining the resiliency and robustness of the IT environment and systems you are supporting. Balance: keep self-regulated control over the development of new features versus the stability of the system; decide where to stop, what to deliver to the customer, and what will have an impact on the overall stability of the system. And finally, fail to learn, because a failure that happens in the production system is not really caused by an individual or a team; it is a failure of the entire system, and every team is responsible for it. SRE really encourages blameless post-mortems: it is not any one person's fault, but the entire system is at fault and needs to be corrected. So these are the kinds of guiding principles you follow and adopt when you set out on this journey of establishing site reliability engineering within your organization.

00:17:47

Because, as I mentioned, it's a separate team that needs to be set up. This is what brings a lot of technical, business and cultural value into your organization: ensuring a robust system that is resilient and reliable, and automating operations, which further improves reliability. It mainly helps in reducing the error rates customers experience, because you are able to improve the service you provide and the overall ecosystem, and it brings a lot of cultural value into the organization.
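To make the error budget from the SRE guiding principles concrete, here is a minimal Python sketch. It assumes an availability SLO expressed as the fraction of requests that must succeed over a fixed window; the `SloWindow` class, its methods and the release gate are hypothetical illustrations, not part of any SRE tool or of the speaker's material.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window (hypothetical structure, e.g. 30 days)."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int

    def error_budget(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent; negative once overspent."""
        budget = self.error_budget()
        if budget <= 0:
            # A 100% SLO (or an empty window) leaves no budget at all.
            return 0.0
        return 1.0 - self.failed_requests / budget

    def may_release(self) -> bool:
        """Assumed release gate: keep shipping changes only while budget remains."""
        return self.budget_remaining() > 0.0


# Usage: a 99.9% SLO over one million requests tolerates about 1,000 failures.
window = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"{window.budget_remaining():.1%} of the error budget left")  # 60.0% ...
print("release allowed:", window.may_release())                     # True
```

Counting the budget in failed requests is one design choice; a time-based SLO would work the same way with minutes of downtime in place of request counts.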

00:18:33

Now, let us look at how ITIL addresses these topics of delivering value to customers. ITIL is by far one of the most widely known, accepted and adopted IT service management frameworks. ITIL 4 is the latest evolution from Axelos, and it talks about the co-creation of value. ITIL has always kept pace with the ever-changing demands of the industry: starting from a more process-centric approach, moving to a service lifecycle set up around these processes in ITIL v3, and now to the service value system in ITIL 4, which embeds the service lifecycle within its value chain. ITIL 4 encourages effective collaboration with the business and with the customers at each and every phase of development, so that a demand we believe will add value to the customer is also perceived as valuable from the customer's perspective. So we start together with the demand, the change that needs to be shipped to the customer, and design, build, deliver and support it together with the customer until it is really adding value to them, gaining continuous feedback from the customers so that you can then work on your next input and next demand.

00:20:25

So ITIL 4 can be considered a digital operating model that believes in the co-creation of value for its IT-supported products and services: you co-create value together with the customer and the business. ITIL 4 also provides a very robust set of guiding principles and values that should be taken into consideration when embarking on this journey. Start where you are: look within your organization for the area that really needs improvement, and start from there. Collaboration, as with any other IT service management framework, is really key, because you have to ensure that everyone within your organization is with you on this journey and agrees that, yes, this is the change and this is what needs to be delivered. This is ensured by an effective communication and collaboration mechanism put in place within the organization.

00:21:32

Keep it simple and practical. Always break down the requirement into small deliverable changes and progress iteratively: once a small change has been delivered to the customer, gain feedback, and accordingly start again with the next one. You need to work with a holistic view so that your change does not disrupt the stability of the environment or something that was working fine before; you really have to have a complete, holistic view of the entire system. And finally, you always have to automate and continuously optimize your system to ensure that you are delivering value that is also perceived as valuable by the customer. When you keep the customer in focus, delivering value to them from the beginning until your service or product has been delivered, you are able to improve customer satisfaction and the quality of the service you provide. This helps in delivering business value to the customer, and it also improves the technical value and cultural value within your organization.

00:23:00

Now, if you look at the three IT service management frameworks we just discussed, we see that they are more or less aligned. Each of them is ultimately focused on delivering value, which is the final goal of any IT service management framework. Each talks about how we can increase the flow, which means increasing the speed of execution and delivering faster to the business and to our customers. Each talks about continuous improvement through a really solid feedback pipeline and feedback mechanism, and encourages proactively giving and receiving feedback. This is what helps improve the quality of the software and services being developed. Each places an ever-increasing focus on automation, because this increases not only the speed but also the quality. And above all, what we see as most important is the people-first approach: encouraging experimentation and learning, and giving individuals and teams the opportunity to take informed decisions, to learn from failures, to spread knowledge and to upskill themselves, thereby improving motivation, efficiency and overall productivity within the organization.

00:24:39

Now, having said this, let us also take a closer look at the key differences: where do these three IT service management frameworks, DevOps, ITIL and SRE, part ways, and how would you as an organization decide which methodology to adopt? It is equally important to really understand the key differences. If you look at the overall architecture, how DevOps, SRE and ITIL function and the guiding principles they offer, you see that the final goal of each is slightly different. DevOps really focuses on the speed and quality of delivery, site reliability engineering focuses more on scale, uptime and robustness of the system, and ITIL focuses more on delivering service with quality and consistency. This also means that the way change is managed differs between these methodologies. DevOps follows delivering gradual changes via continuous integration and continuous delivery.

00:25:58

Site reliability engineering, on the other hand, focuses on delivering quick changes governed by the error budget: as long as you are within the error budget, keep delivering your changes. ITIL delivers changes via a well-defined governance model. This also means that error handling differs for DevOps, SRE and ITIL. For DevOps, error handling happens at a pre-failure stage, where we try to remove the error even before it reaches production. For site reliability engineering, it is a post-failure set of activities: you do an RCA after the failure has occurred so that it never occurs again. For ITIL, error handling is part of the problem management phase of the development life cycle. The operating models of DevOps, SRE and ITIL also differ. Coming to the team topology that DevOps, SRE and ITIL recommend: as we discussed, DevOps talks about bringing together teams from different disciplines that were earlier working in silos.

00:27:13

So DevOps talks about breaking the silos and bringing all the teams together, while SRE is a defined team: a separate team with defined roles of site reliability engineers, who are basically software engineers applying engineering concepts to infrastructure and operations problems. And then we have ITIL, which focuses on establishing a symbiotic relationship between IT, the stakeholders and the business, and does not really require setting up any new or separate team. If you look at the entire value chain, DevOps starts right from development: we move from development into production, trying to identify and remove issues and errors that occur during the life cycle before they even reach production. Site reliability engineering starts from production: once a failure has occurred, it brings the corrections and improvements back into the development life cycle and the value chain. ITIL is wrapped around the service value chain. The way you measure the success of DevOps, SRE or ITIL is also different, and the metrics are quite different for DevOps.

00:28:28

It depends more on deployment frequency, because here we talk more about continuous integration and continuous delivery: how frequently have you been able to deploy successfully? So the metrics are deployment frequency, lead time for changes and change failure percentage. For SRE, it is more about how well you have met your SLOs, measured through SLIs, what your mean time to recovery has been, and so on. For ITIL, it mostly depends on the SLAs in place, the change success rate, the ticket volume and the overall cost of running the service. Now, even after gaining this understanding of how ITIL, SRE and DevOps really operate, how would you as an organization decide which way is the right way? Or is there really a right way or a wrong way for an organization?
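Several of the success metrics listed for each methodology can be derived from a simple deployment log. The following Python sketch assumes a hypothetical `Deployment` record (not any specific tool's schema) and computes deployment frequency and change failure rate (DevOps-style), change success rate (ITIL-style) and mean time to restore (SRE-style). Metrics such as SLA or SLO attainment, ticket volume and cost need other data sources and are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """One production change (hypothetical record, not any specific tool's schema)."""

    deployed_at: datetime
    failed: bool = False                    # did this change cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def deploy_frequency_per_day(deps: list[Deployment], days: int) -> float:
    """DevOps-style metric: average deployments per day over the period."""
    return len(deps) / days


def change_failure_rate(deps: list[Deployment]) -> float:
    """DevOps-style metric: share of deployments that caused a failure."""
    if not deps:
        return 0.0
    return sum(d.failed for d in deps) / len(deps)


def change_success_rate(deps: list[Deployment]) -> float:
    """ITIL-style metric: share of changes that did not fail."""
    if not deps:
        return 0.0
    return 1.0 - change_failure_rate(deps)


def mean_time_to_restore(deps: list[Deployment]) -> timedelta:
    """SRE-style metric: average time from a failed change to restored service.

    Simplification: the clock starts at deployment time; real incident
    tracking would usually start from detection or incident start instead.
    """
    outages = [d.restored_at - d.deployed_at
               for d in deps if d.failed and d.restored_at is not None]
    if not outages:
        return timedelta(0)
    return sum(outages, timedelta(0)) / len(outages)


# Usage with four made-up deployments over four days, two of which failed.
start = datetime(2024, 1, 1)
history = [
    Deployment(start),
    Deployment(start + timedelta(days=1), failed=True,
               restored_at=start + timedelta(days=1, minutes=30)),
    Deployment(start + timedelta(days=2)),
    Deployment(start + timedelta(days=3), failed=True,
               restored_at=start + timedelta(days=3, minutes=90)),
]
print(deploy_frequency_per_day(history, days=4))  # 1.0
print(change_failure_rate(history))               # 0.5
print(change_success_rate(history))               # 0.5
print(mean_time_to_restore(history))              # 1:00:00
```

Deriving all three views from the same log illustrates the point made in the talk: the data can be shared, while each methodology emphasizes a different measure of success.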

00:29:30

It is not really an either-or kind of situation. It entirely depends on your organization's specific needs and requirements. What is the target that your organization wants to reach? Digital transformation is not only about implementing new technologies and tools; it has to align completely with your organization's objectives. So we talk about the five Ws. We talk about the power of purpose. When you have the purpose clear in front of you, when you know the target you want to achieve, that is what will help you decide which methodology or methodologies to adopt. It could be one of them, or a combination of them in different parts of your organization, because you really need to identify the problem you are trying to address.

00:30:32

And when you have that, you need to find the right hammer for your nail: you need to know what your nail is, and then you can find the right tool for it. So always ask these five Ws before deciding on and adopting a methodology or an IT service management framework. What is the problem that you're trying to solve? There can be different solutions to different problems. Is your entire organization in sync and in alignment with you that, yes, this is the problem and it needs to be solved? And if there are several problems, in what priority should they be solved? What is the scope of the problem? You really need a bird's-eye, helicopter view of your organization to understand the scope and scale of the transformation your organization will have to go through, because digital transformation is not only about implementing new tools and technology; it is as much about the behavioral and cultural changes that need to be implemented within the organization.

00:31:45

What is the solution, and why? There can be different solutions to different problems, but it is equally important to understand what is already running well in your organization. Whatever is running into issues certainly needs improvement and needs to be addressed, but you do not want to disturb what is already running well. This information also helps you determine the right way and what your starting point should be. Look for those areas within your IT organization that really need improvement and where adopting these transformations can have a massive impact, and start from there. Start small, and always have defined and aligned KPIs and success factors, so that you have the right KPIs for measuring the success of your digital transformation journey.

00:32:48

And when you have these five Ws defined and decided, you know that you have the nail and you have found the right hammer for it. And when you have nailed it, you are ready to scale it, because it is extremely important that you are continuously ready to adopt and adapt in this transformational journey. Always remember, digital transformation cannot be brought about in a day. It is a continuous journey, and it sits right at the heart of cultural and behavioral transformation, because it is as much about human and behavioral transformation as it is about implementing new tools and technology. You really need the right mindset in place to bring in that transformation, because, always remember, IT service management is only a means to an end, not the end itself.

00:33:53

So finally, whichever methodology you plan to adopt, be it ITIL, SRE or DevOps, it will only be successful if you really have the right mindset, and if you are able to nail the cultural and human aspects of it, where people really agree to change and to adopt and adapt to the new environment. The future of IT service management is definitely bright, whether you look at it through an IT lens or a business lens, and the digital agility that the IT organization has always shown, and needs to keep showing, is extremely important for success in this digital transformation journey. Having said this, I would like to say thanks a lot for your patient hearing. I hope you have been able to take some information out of this talk, and that it helps you decide which path to adopt and which direction to go in while embarking on your digital transformation journey. Thanks a lot once again, and thanks a lot to DevOps Enterprise Summit for this opportunity to share my thoughts. Thank you.