Virtual US 2022

Establishing SRE Foundations: Aligning The Organization On Ops Concerns Using SRE Team Topologies

Establishing SRE in a larger software delivery organization requires an SRE organizational structure. The structure needs to support the organization with the goal of aligning everyone involved on operational concerns in a sustainable manner. The goal can be achieved using different organizational structures. Deciding on the appropriate organizational structure for SRE is not a straightforward task. It requires weighing different dimensions against each other. A core dimension to consider is who should run the services. This gives rise to 4 core options:


- You build it, you run it

- You build it, you & SRE run it

- You build it, SRE run it

- You build it, ops run it


The options differ in incentives for the development teams to implement reliability. The incentives are maximized with “you build it, you run it” because developers are in full control of their own operations workload, which might require them to wake up in the middle of the night. The incentives diminish as dedicated SREs share or take over the operations responsibilities from the developers.


Another dimension to consider is the organizational function to place the SRE teams and SRE infrastructure teams in. There are 3 core options here:

- Development organization

- Operations organization

- SRE organization


Sensible permutations of all the options above give rise to 9 SRE team topologies presented in the talk:


SRE Team Topology 1

- Development organization: You build it, you run it with no dedicated SRE role. Every developer is an SRE on rotation

- Operations organization: SRE infrastructure team

- SRE organization: none


SRE Team Topology 2

- Development organization: You build it, you run it with a dedicated SRE role in the team

- Operations organization: SRE infrastructure team

- SRE organization: none


SRE Team Topology 3

- Development organization: You build it, you run it with a dedicated SRE role in the team and a dedicated developer on rotation

- Operations organization: SRE infrastructure team

- SRE organization: none


SRE Team Topology 4

- Development organization: You build it, you & SRE run it with a dedicated SRE team

- Operations organization: SRE infrastructure team

- SRE organization: none


SRE Team Topology 5

- Development organization: You build it, you & SRE run it

- Operations organization: Dedicated SRE team and SRE infrastructure team

- SRE organization: none


SRE Team Topology 6

- Development organization: You build it, you & SRE run it

- Operations organization: SRE tool chain procurement and administration

- SRE organization: Dedicated SRE team and SRE infrastructure team


SRE Team Topology 7

- Development organization: You build it, SRE run it with a dedicated SRE team

- Operations organization: Dedicated SRE infrastructure team

- SRE organization: none


SRE Team Topology 8

- Development organization: You build it, SRE run it

- Operations organization: Dedicated SRE team and SRE infrastructure team

- SRE organization: none


SRE Team Topology 9

- Development organization: You build it, SRE run it

- Operations organization: SRE tool chain procurement and administration

- SRE organization: Dedicated SRE team and a dedicated SRE infrastructure team
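The nine topologies can also be read as a small table with two questions per row: who runs the services, and which organization hosts the SRE team and the SRE infrastructure. The Python sketch below encodes the list above in that form. The class and field names (`RunModel`, `Org`, `Topology`, `staffing`) are illustrative assumptions and are not terms from the talk. The fourth run option, you build it, ops run it, is left out because it does not appear in the nine topologies listed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RunModel(Enum):
    YOU_RUN = "you build it, you run it"
    YOU_AND_SRE_RUN = "you build it, you & SRE run it"
    SRE_RUN = "you build it, SRE run it"


class Org(Enum):
    DEV = "development organization"
    OPS = "operations organization"
    SRE = "SRE organization"


@dataclass(frozen=True)
class Topology:
    number: int
    run_model: RunModel
    sre_team_org: Optional[Org]  # None: no dedicated SRE team (topologies 1-3)
    sre_infra_org: Org           # organization hosting the SRE infrastructure team
    staffing: str                # how SRE work is staffed, paraphrased from the list


TOPOLOGIES = [
    Topology(1, RunModel.YOU_RUN, None, Org.OPS, "every developer is an SRE on rotation"),
    Topology(2, RunModel.YOU_RUN, None, Org.OPS, "dedicated SRE role in the team"),
    Topology(3, RunModel.YOU_RUN, None, Org.OPS, "dedicated SRE role plus developer on rotation"),
    Topology(4, RunModel.YOU_AND_SRE_RUN, Org.DEV, Org.OPS, "dedicated SRE team"),
    Topology(5, RunModel.YOU_AND_SRE_RUN, Org.OPS, Org.OPS, "dedicated SRE team"),
    Topology(6, RunModel.YOU_AND_SRE_RUN, Org.SRE, Org.SRE, "dedicated SRE team"),
    Topology(7, RunModel.SRE_RUN, Org.DEV, Org.OPS, "dedicated SRE team"),
    Topology(8, RunModel.SRE_RUN, Org.OPS, Org.OPS, "dedicated SRE team"),
    Topology(9, RunModel.SRE_RUN, Org.SRE, Org.SRE, "dedicated SRE team"),
]


def topologies_using(org: Org) -> list[int]:
    """Return the numbers of the topologies that place an SRE team or SRE infrastructure in org."""
    return [t.number for t in TOPOLOGIES if org in (t.sre_team_org, t.sre_infra_org)]
```

For example, `topologies_using(Org.SRE)` returns the two topologies that require a separate SRE organization, which in the list above are topologies 6 and 9.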


The team topologies have different reporting lines and produce different cultural identities for SRE. The cultural identities are based on a triangle:

- product-centric identity vs.

- reliability user experience-centric identity vs.

- incident-centric identity.


Depending on the reporting lines, the SREs lean more towards one of the SRE cultural triangle vertices. A comparison of the 9 SRE team topologies above will put listeners into a position to evaluate the options well, helping to drive better SRE organizational decisions in their companies.


Dr. Vladyslav Ukis

Head of R&D, Teamplay Digital Health Platform, Siemens Healthineers


Full transcript

The complete talk, organized by section.

Dr. Vladyslav Ukis

Hello, welcome to this DevOps Enterprise Summit. I am really happy that you decided to attend this talk. Thank you very much for this. My name is Vladyslav Ukis and I work for Siemens Healthineers in Germany.

We have got a big digital health platform, which we develop and operate. It is called teamplay. You can Google it. And today I would like to share our experience with operating the platform using the Site Reliability Engineering, or SRE, methodology. Specifically, I would like to delve into the question: how do you organize for SRE?

With that, let me share my screen and then we can get started with the presentation. So sharing the screen in full-screen mode now. I hope you can see it and we can get started.

To give you some background of the context we operate in, let me share some insights about the healthcare industry. As you can see, in the healthcare industry there is a tremendous growth of data. There is nearly 50% year-by-year growth in healthcare data, but lots of that data is actually lost, and additionally only one in three hospitals in the US were able to electronically exchange that data. And all that against the backdrop of severe staff shortages, and therefore more and more work falling on the shoulders of the existing medical personnel.

In order to help that situation, we created a platform that empowers connectedness of the healthcare participants. The platform sits on top of a data layer, where the medical images and other data are generated. Then we have got lots of digital services that can number-crunch that data and with that enable the outcomes in different categories: performance outcomes, diagnostic outcomes, and collaborative outcomes.

And all that leads to an entire suite of applications available on the platform. So currently more than 75. The advantage is that they are all available from within a single user interface for the users. And there are also different deployments: cloud, on-edge, and hybrid, to serve the specific data protection and security needs in hospitals around the world. And by now we managed to connect more than 6,000 institutions in more than 90 countries to the platform, and it is growing on a daily basis. So it is a significant undertaking, and the question now is how to operate such a beast, how to operate such a platform.

There is a way suggested by Google that we employed, and that is called Site Reliability Engineering. This is a discipline that has got several principles, with the ones on top being kind of the most impactful ones. You see everything in operations through the lens of software engineering. You set service level objectives for your services, and you then establish procedures that ensure that if your service level objectives are broken, then you prioritize your reliability work over feature work. And you work also to minimize the manual effort, or toil, that is required around your services.
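To make the link between a service level objective and the reliability-over-features rule concrete, here is a minimal Python sketch of the usual error budget arithmetic. The 99.9% target, the 30-day window, and the function names are illustrative assumptions and do not describe teamplay's actual SLOs.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO allows over the given window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Return the unspent error budget; a negative value means the SLO is broken."""
    return error_budget(slo_target, window) - downtime


# Assumed example: a 99.9% availability SLO over a 30-day window
# allows roughly 43 minutes of downtime.
budget = error_budget(0.999, timedelta(days=30))

# One hour of downtime overspends that budget, which is the point at which
# reliability work would be prioritized over feature work.
prioritize_reliability = budget_remaining(
    0.999, timedelta(days=30), timedelta(hours=1)
) < timedelta(0)
```

The design choice here is to express the budget as a duration, which suits availability SLOs. A request-based SLO would count failed requests instead of minutes.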

So now, how to organize teams in order to fulfill those principles? There is a misconception in the industry that this can only be done when you have got a central SRE team. And I hope to convince you by the end of this talk that nothing is further from the truth. You can have several setups and arrangements of teams and still fulfill the SRE principles. So how to do it? What are the options? What is the best option for you? Let us explore that step by step.

First of all, you need to answer a fundamental question: who builds the services and who runs them? And for that there is a spectrum. On the left-hand side we have got you build it, you run it, where developers are fully responsible for developing and also operating the services. And then at the other end of this spectrum, we have got a complete opposite: the developers build the services, then they hand over the services to the operations department, and they run them.

There are also a couple of options in between. You can have SREs or SRE teams also doing operations, supporting operations. Here you can have a shared setup where the developers build the services and then together with SREs they run them, or it can be a setup where the developers build the services and then they hand over the services to the SREs who run them. So it is an entire spectrum, and you can make a choice on that spectrum also, not necessarily for the entire organization, but you can do this also per team or per digital service. In any case, you need to make a decision.

And that decision also then drives the reliability incentives that are there within the development teams to implement reliability as they implement features. You can see the options here on the x-axis, and on the y-axis we have got the incentives for the dev teams to implement reliability. They are the maximum when the developers have got full skin in the game for running the services, and the minimum when the developers just hand over the services for operations to the Ops team. And therefore they do not really have any skin in the game of running the services in production.

Then we have got you build it, you and SRE run it, where the incentives are slightly lower for the developers. And then, with you build it, SRE runs it, they are again a bit lower than with you build it, you run it and with you build it, you and SRE run it. So that needs to be kept in mind when making those decisions. It depends on what kind of behavior you want to generate within the development teams, what kind of attitude you want to generate within development teams to reliability, depending on the choice you make.

So then what you need to do next: you need to also compare the models from the spectrum. It is not just the reliability incentives but also other factors that we need to take into account when making that decision. These are things like, for example, knowledge synchronization between the teams, incident resolution times, service handover for operations, whether you want to establish a distinct SRE organization or not, who owns the SRE infrastructure, and so on and so forth. So there is a lot to compare there.

And once you have done that comparison, you can then analyze possible SRE team topologies that you might want to employ.

So let us look at you build it, you run it. We have got a development organization here and we have got the operations organization here. The operations organization has got an SRE infrastructure team, and here the development of the SRE infrastructure happens, and also the same engineers run the SRE infrastructure. That enables the development teams to do you build it, you run it.

That can be done also in three different ways. The development organization uses the SRE infrastructure provided by the Ops organization. In development team one, there are just developers and they are doing development and SRE on a rotation. Then there is another team, development team two. Here they have got a dedicated SRE role. So the SRE does SRE and the developers do development, but they are part of one team. Therefore you have got team-internal knowledge sharing. Or you could have another option here in development team three, where there is an SRE doing only SRE, a dedicated one, and there is also a developer on rotation doing SRE together with the dedicated SRE. So you have got team-internal knowledge sharing here and you have got all the developers taking part in the operations activities on rotation.

Then there is another option, where you have got you build it, you and SRE run it, and you put SRE within the dev organization to create a dedicated SRE team within the dev organization. And then you lend the people from this very team to development team one and development team two. So you have got also developers on rotation here and developers on rotation here. We have got the cross-team knowledge sharing on operations topics within the teams, and the operations organization is like in the previous option: it provides the SRE infrastructure, it builds the SRE infrastructure, and it runs it.

Another way of doing this is to say, okay, we put the SRE team into the Ops organization, and then they do the same thing. So they support the dev teams. That is another option. Yet another option would be to say, okay fine, we create an SRE organization. We have got the SRE infrastructure team there and the SRE team that lends SREs into the development teams. They then kind of cross the organizational border and support the teams this way.

Then, you build it, SRE runs it. We can again have the SRE team within the development organization. So then we have got a so-called error budget policy, which is a concept from the SRE methodology that states what we will do when the services are outside of their error budget, that is, when they break their SLOs. If you put the SRE team here, then the SREs as such stay inside the SRE team. You basically do not lend the SREs into the development teams like in the previous options, but they are then an autonomous team, and you need to do cross-team knowledge sharing here between the teams. But the error budget policy governs whether you still get the support of the SREs or not.

So basically, according to the error budget policy, if your services break the SLOs, then the SREs can, as it is called, return the pager to the dev teams, and they will then refuse to support you as long as the services do not fulfill a certain service level objective. Then the operations organization still has the SRE infrastructure team that provides the SRE infrastructure and runs it for all the dev teams and the SRE team to use.

Then, you build it, SRE runs it. Again, the SRE team can be put also into the operations organization, and then the cross-team knowledge sharing will need to happen cross-organizationally. You can see an SRE doing SRE for dev team one, an SRE doing SRE for dev team two, and it is all governed also by the error budget policy again here. If your services do not fulfill the service levels, then the SREs will return the pager and you will have to, as a development team, run your services on your own until your services again have got a certain service level which then enables the SRE team to actually run your services for you.
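The pager rule described here can be sketched in a few lines of Python. This is a hypothetical simplification, assuming that an error budget policy reduces to a single availability comparison per service; a real policy would typically add review steps, grace periods, and escalation rules. The `ServiceStatus` type and the `pager_holder` function are illustrative names, not part of any tool mentioned in the talk.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    name: str
    slo_target: float  # agreed availability, e.g. 0.999
    measured: float    # availability measured over the same window


def pager_holder(service: ServiceStatus) -> str:
    """Return who is on call for the service under the assumed policy."""
    if service.measured < service.slo_target:
        # SLO broken: the SREs return the pager to the development team.
        return "development team"
    # SLO met: the SRE team runs the service on behalf of the developers.
    return "SRE team"
```

With these assumptions, a service measured at 99.5% against a 99.9% target is handed back to its development team, and the SRE team takes it over again once the measured level meets the target.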

So you build it, SRE runs it, but the SRE team is in a dedicated SRE organization. So here we have got this SRE organization and this SRE team is in there. And again, there is the cross-team knowledge sharing that is required across the organizations in order to ensure that SREs have got enough knowledge to run the services.

As you have seen these different options, you are probably thinking, okay, so what is actually the advantage of one or another? But the important thing to know is that all those options are viable options. You can choose the option that is most suitable to you from the cultural standpoint and also from other dimensions that I mentioned earlier when we were talking about the comparison of the models from the you build it, you run it spectrum. But here additionally, also by creating a certain organizational structure, you inspire a certain SRE identity. And that is important, and that of course is also shaping the culture. That is what I would like to talk about now.

So with SREs there is a so-called SRE identity triangle. SREs can have either an identity that leans more toward a product, so product-centric identity, where they really identify themselves with the specific product. Or their identity can lean more toward reliability user experience, kind of regardless of the product. They care that the reliability user experience is great. Or the identity can lean more toward incidents. So regardless of the product and regardless of the reliability user experience, they want to have as few incidents as possible. This is kind of a triangle, and it is interesting how you can inspire a certain kind of weight either toward this or toward that by creating a certain organizational structure.

So in a product-centric SRE identity, which can easily happen, so to speak, which can be facilitated if you put the SREs inside the development organization, you may create a culture where the SREs will actually identify themselves more and more with the particular product. They still of course care about the incidents. They still care about the reliability user experience, but they actually identify themselves a lot with the product.

Then what can happen within the Ops organization: if you put SREs in there, then they may easily lean more toward incident-centric identity, because it is all about the incidents inside the Ops org. Or if you create its own SRE org for SREs, then the SRE concepts will be front and center in the minds in the organization. Therefore the identity may lean more toward the reliability user experience. Still, of course, they will care about the products they take care of. They will of course look into the incidents as well. But actually the kind of the weight of the culture will be reliability user experience. That means setting the SLOs well. That means ensuring that the error budgets are not consumed prematurely and things like that.

So once you have made your decision, you need to do a transition from where you are right now to the selected setup, and then you need to set up the org. Here you have got the models from the you build it, you run it spectrum. So you transition from one of those, and then you have got again the models from the you build it, you run it spectrum. That is where you can transition to. As you can see here, these are the possible options for how you would run that transition. It is important to not just declare that this is where we are going, this is our kind of target setup, but really deliberately accompany the organization on a transition from A to B, because that transition will be different depending on your source model and depending on your destination model, and there is a lot of facilitation that is required in all the transitions from the source model to the target one.

The decision is two-dimensional. One is you need to decide on who runs the services, and that is actually easy to change, because you can also select this not for the entire organization but for a set of services or even for a service. You can say, okay, today we do you build it, you run it, and tomorrow we do you build it, SRE runs it. You build it, SRE runs it has also a return option built in, which is codified or set up in the error budget policy. So if your services fall below certain service levels, then the SREs will return the pager, and therefore you will then fall back into the you build it, you run it where you might be coming from. That is kind of easy to change. But the organizational setup is difficult to change. This is where you really need to think carefully about setting up an appropriate organization for your teams and taking into account all the various points that I mentioned before.

So what do we do at Siemens Healthineers teamplay digital health platform? We have got subscriptions to digital services that we offer, and we run this at scale. We have got six data centers, 130 countries, we have got more than 6,000 hospitals connected, and we have got a growing customer base and a growing feature deployment frequency. So in terms of continuous delivery, we are improving.

And our setup is such that Ops own the SRE infrastructure used by developers. The developers are on call 8/5, which means eight hours a day, five days a week, so we have got normal working hours here. We have got also Ops on call 24/5 for a few core use cases, and this is done in such a way that we have got a couple of people in different geographies, and therefore we can cover 24 hours using follow-the-sun methodology. And we have got an incident response process that guides how we mobilize the teams in case of a serious emergency. So overall, we can say that we are on a you build it, you run it setup, because our developers are fully on call for the services.
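Whether a follow-the-sun rotation really covers all 24 hours can be checked with a short calculation. The Python sketch below assumes that each site's on-call shift is expressed as a start and end hour in UTC; the shift times used in the example are invented for illustration and are not the actual teamplay rota.

```python
def uncovered_hours(shifts: list[tuple[int, int]]) -> list[int]:
    """Return the UTC hours (0-23) that no on-call shift covers.

    Each shift is (start_hour, end_hour) in UTC, with the end hour excluded.
    A shift may wrap past midnight, e.g. (16, 0) or (22, 6).
    """
    covered = set()
    for start, end in shifts:
        if not (0 <= start < 24 and 0 <= end < 24):
            raise ValueError("shift hours must be between 0 and 23")
        hour = start
        while hour != end:
            covered.add(hour)
            hour = (hour + 1) % 24
    return [h for h in range(24) if h not in covered]


# Three assumed eight-hour shifts in different geographies leave no gap.
full_rota = [(0, 8), (8, 16), (16, 0)]

# Two assumed shifts leave the night hours in UTC uncovered.
partial_rota = [(6, 14), (14, 22)]
gaps = uncovered_hours(partial_rota)
```

An empty result means the shifts cover the full day. Any hours returned are gaps in which nobody would be on call.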

And our decision to go there was to maximize the incentives for the developers to implement reliability with the features as the features are getting created. For our organization that was the right decision because we came from the history of not operating services. So there was no experience in the organization before we set up the platform how to operate the services. Therefore for us, it was important not to dilute the incentives that the developers have got to run the services on their own.

We also rolled up our experience with setting up SRE in the organization in a recent book called Establishing SRE Foundations. It is a step-by-step guide to how you can introduce SRE in your software delivery organizations. In this presentation I rushed through the pictures. I rushed through the different setups organizationally that you can have. All that is described in depth in the book, and I am really looking forward to you going through the book and providing any feedback you might have.

And with that I would like to finish and say thank you for being in the presentation. If you would like to know more, then let us have a great chat in the Slack that is provided. Also looking forward to connecting with you on LinkedIn, and also looking forward to elevating the industry in terms of operations by applying SRE to organizations, because in the end this is what improves production and this is where the customers get in touch with the services and applications that we provide. Therefore it is really important.

Thank you very much for your attention, and looking forward to great chats in Slack. Thank you. Bye.