Shifting Left on Production Excellence with Observability

As DevOps leaders, we want to empower developers to own their code in production. Production ownership is the best way for developers to deepen their skills and deliver business value, but many leaders hesitate because it involves on-call and firefighting. How do we set up our teams for success and not burnout?


Join Shelby Spees and Liz Fong-Jones as they share how Honeycomb established and evolved a culture of production excellence—maintaining reliability and scalability while shipping features at the rate of a company ten times the size. They will detail how the Honeycomb team uses Service Level Objectives to measure how their services are doing and which users are having a bad time. And they will walk through how the team combines observability with progressive delivery to create tight feedback loops and guardrails for experimentation.


Attendees will walk away with strategies for establishing these practices across their organizations with strong buy-in at the practitioner level. Cultural change isn't just top-down or bottom-up; it needs support at every level.


Shelby Spees

Site Reliability Engineer, Equinix Metal


Liz Fong-Jones

Principal Developer Advocate, Honeycomb.io

Transcript

00:00:12

As you have undoubtedly noticed, one of the prominent themes at this conference is next-generation infrastructure and operations. We already heard from the teams from Vanguard and Comcast, as well as the team from Google SRE. Anyone who has even a remote interest in SRE, observability, or the technologies that enable it is likely familiar with the work of Liz Fong-Jones. I first got to meet her when she was a staff site reliability engineer at Google, and she shared some of those experiences at DevOps Enterprise in 2018. She is currently a principal developer advocate at Honeycomb. And I'm so delighted that she will be presenting with Shelby Spees, who is now a site reliability engineer at Equinix Metal. They will be diving deep into what observability is, what its goals and underlying principles are, and, in some fantastic ways that anyone who cares about delivering reliable services will care about, how to make modern production environments something that we actually want to live in. Shelby and Liz.

00:01:17

Thank you, Gene. Welcome to Shifting Left on Production Excellence with Observability. We're here to talk about how observability enables production ownership sustainably at scale, so that your engineering org can support your business's need to move quickly in the market without forcing your software teams into a shitty, abusive relationship with their systems. Your business wants to move faster. You need to be able to respond quickly to changes in the market, as well as ever-evolving security and data privacy requirements. That velocity requires tight feedback loops, because it doesn't matter how fast you're going if you're pointed in the wrong direction. Like an airplane, you need to keep adjusting your heading in response to changes in your environment. You need to be able to adapt. That feedback loop is what we're here to talk about today. We need to shift production left. Breaking down that wall empowers developers to learn from production and not only build better software; it also positions them to identify how that software can better support your business's needs. I'm Shelby Spees. I recently joined Equinix Metal as a site reliability engineer, and before that, I worked with Liz Fong-Jones as a developer advocate at Honeycomb.

00:02:24

And I used to work with Shelby when we were both developer advocates, and we've drawn on our many years of expertise as site reliability engineers to figure out what the key lessons are that you need to learn when you're thinking about this journey towards shifting your production environment left.

00:02:43

As an industry, we've already come a long way. Y'all deserve credit for that. You've probably adopted DevOps strategies like continuous integration and infrastructure as code in the parts of your org where it's feasible and where it makes sense. That helps a lot with the work that used to be manual and error-prone. We're catching things like build issues long before they have a chance to hit production. We've already gained so much, but there's still more we can do. So what's the next step? Shifting left. One question that frequently comes up is whether developers should be on call. It's true that if we want our developers to benefit from the feedback loop of production, putting teams on call for the software they write is very effective, but it's critical that we're not setting our people up for burnout. The thing is, production has become so complex.

00:03:30

This is especially true in large enterprise organizations, where you might have some parts of your org moving to the cloud, adopting progressive delivery, and implementing all kinds of modern deployment practices, while other teams are holding it down on legacy systems and managing integrations with newer components. There's a huge range of technologies and skill sets. All of this makes production really intimidating. On top of that, traditional monitoring tools are inscrutable. They're speaking in the language of individual hosts, but they can't tell you in which part of the application things are going wrong. This is true even for seasoned ops engineers, who often have to make decisions based on correlations between a blip in a graph and whatever they can find in the logs around that timestamp. That expertise is valuable, but it only takes us so far, because prod is always encountering new issues. We've done such a good job solving for so many known failure modes

00:04:22

that what remains are these novel, emergent failure modes, where there's rarely a singular root cause. It's not just the latest change; it's often that latest change interacting with some change from two months ago or two years ago, and maybe it only appears on certain kinds of traffic. Or, another example: some external dependency has changed, and now the ground has shifted beneath us and we have to scramble to update our stuff in response. Stripe's 2018 report found that developers are spending 42% of their time on bad code and tech debt. Our teams are always fighting fires. We're constantly in this reactive state, and so we can't make forward progress. We can't invest in improving systemic issues. When something needs to be upgraded or migrated to a new technology, it takes forever, which really hurts trust between software teams and business stakeholders. Meanwhile, our heroes are exhausted.

00:05:15

There are a few go-to people we rely on to keep things running, but they're too busy holding things together. They don't have time for knowledge transfer, so we maintain this low bus factor. It makes our sociotechnical systems fragile, and people get so burnt out that they end up leaving their jobs, even jobs they loved. It's a huge waste all around. So while we've made amazing progress in the last decade or so of the DevOps movement, we shouldn't stop here. What we need is production excellence. All teams need production excellence. Those of us here at DevOps Enterprise Summit have benefited from all these improvements, so as we continue to level up our organizations, we have a responsibility to pay it forward so that every software team can have a healthy relationship with their systems in production. Production excellence isn't just about what technologies you're using. Buying the alphabet soup doesn't guarantee better outcomes, because you can't buy DevOps.

00:06:08

Rather, once you have better feedback loops, then you can start making data-driven decisions about, for example, whether adopting Kubernetes can actually help your teams improve feature velocity and maintain reliability. Is it worth the complexity? Production excellence isn't about how many nines you have. It's about investing in people, culture, and process, because the people building and operating your software systems are the lifeblood of your organization. It's a sociotechnical system, where the tools are there to enable and empower your people to apply their expertise. This is why production excellence is business excellence. You want to invest engineering effort where it'll have the greatest impact for the business, not just this next quarter, but for the long haul. This is your north star, your guiding principle. So what's the first step? Start with observability. Let's level-set a little bit.

00:07:00

Observability is the ability to inspect and understand the system's internal state using the telemetry data it's already generating, even if that internal state is something you didn't anticipate. For example, this isn't about going in and flipping your log level to debug in production. Those debug logs don't help you for the incident that started 30 minutes ago and then inexplicably resolved itself. Observability also isn't about going in and adding new timers around the blocks of code you think might be introducing latency and then deploying that change. You don't want a deploy between asking your question and getting an answer. It's not that debug logs or adding timers aren't important or valuable; they just don't give you observability into what's happening right now. Observability is for these hard-to-debug problems, these novel, emergent failure modes that you can't possibly predict in advance. You can't know what dimensions, what attributes are going to be important someday. That sort of data is prohibitively expensive to store as tags on time series metrics, and prohibitively difficult to parse and query with traditional log aggregation. Observability data gives us the answers.

00:08:07

For distributed systems, we have traces. Identifying bottlenecks is a lot easier when every single event tracks its duration. Observability means making it cheap to capture lots of dimensions at write time, and then we can slice and dice it and filter it down at query time. And while we're capturing a much richer picture, all of it is still read-only. Here's an example event. It includes the sort of data that we often see in flat logs, but parsing those logs is finicky and expensive. Labeling our attributes makes querying much easier. On top of that, we can link events together into a trace. Each event has an ID, and it can point to another event ID to say, "Hey, that other event called me; that's my parent." It's a directed graph. Now at runtime, you can capture data from across all parts of the stack, from the build ID to the user agent to which payment processor was used in that particular checkout transaction. All of this data is worth capturing, because technical decisions are business decisions. Let me turn it over to Liz to talk about making the business case.
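To make that example event concrete, here is a minimal sketch of what emitting one of these wide events might look like using the OpenTelemetry Go API. The service name, span name, attribute keys, and helper function are illustrative assumptions, not the exact instrumentation described in the talk:

```go
package checkout

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// handleCheckout emits one wide event (a span) per checkout request. The
// trace context carried in ctx holds the parent span ID, which is how events
// point at each other and link up into a directed graph: a trace.
func handleCheckout(ctx context.Context, userAgent, processor, buildID string) error {
	ctx, span := otel.Tracer("checkout-service").Start(ctx, "checkout.process")
	defer span.End() // the event's duration is recorded when the span ends

	// Capture as many dimensions as are useful at write time; they can be
	// sliced, diced, and filtered at query time.
	span.SetAttributes(
		attribute.String("build.id", buildID),
		attribute.String("http.user_agent", userAgent),
		attribute.String("checkout.payment_processor", processor),
	)

	return processPayment(ctx, processor)
}

// processPayment stands in for the real work; its own child spans would
// appear as further events in the same trace.
func processPayment(ctx context.Context, processor string) error {
	return nil
}
```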

00:09:07

Thank you, Shelby. So you can capture all of this data, but does it really matter unless it's having an impact on the business? Let's talk about how you translate capturing observability data into actionable business insights. So how do we figure out what the impact of observability is? Well, we need to think about where our business needs us to invest our effort. How do we drive our business forward? So often when we're making decisions as engineers, we're trying to do things like increase scalability, or pay down technical debt, or introduce new features. So we need to have a mechanism for deciding when we should speed up and when we should slow down and focus on the fundamentals. Service level objectives, which are a concept from the discipline of site reliability engineering, can really help us get on the same page about what level of reliability we're targeting and whether we're achieving the results that we want.

00:10:10

They're a way for us to describe, in a common language between business, engineering, and customer stakeholders, what success means for our customers, and to help us measure it with the telemetry data coming out of our systems for the entire life cycle of the customer journey. So there are a couple of books that I would recommend, specifically the Site Reliability Engineering series from O'Reilly, as well as the service level objectives book by Alex Hidalgo. But let's level-set briefly about what a service level indicator is. A service level indicator is a mechanism for encapsulating all of our critical user journeys, things that have customer impact, things like homepage loads, API calls, or user queries. And the good news is that if you've invested in observability as a foundation, you already have that rich data about customer workflows being captured inside of your application and emitted as telemetry data.

00:11:13

So our service level indicator transforms the flow of events coming into our system into a categorization of those events into good events and bad events. We're able to set thresholds to say things like "the homepage is expected to load within 500 milliseconds." And then for each homepage load that's executed against our service, we can determine whether it met that threshold, whether it was successful and fast enough. And then we can broaden the view from a service level indicator to a service level objective, and zoom out and say, "I want 99.9% of the events over the past 60 days to succeed," for instance. And then we can compare our actual data against that target and see how we're doing. That's the invariant that we're trying to maintain about our systems, because it doesn't matter how many features we ship

00:12:11

if our customers cannot trust in the reliability of our service. So we might want to have views such as historical SLO compliance: how are we doing on a rolling 30-day basis? How were we doing a week ago, over the window from day 37 back to day 7? This enables us to understand the long-term trajectory of our service and gives us guidelines as to where we should invest our time. Now, some of you may ask, why not a hundred percent? The answer is that we need to keep our users just happy enough. It is true that you can invest in infinite nines, but if you do so, you're trading away your ability to do any kind of feature development, and you're delivering so much reliability that external factors will be the primary driver of your customers' experienced level of reliability, not the investment that you've made into your service.
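As a rough illustration of the SLI and SLO mechanics just described, here is a minimal sketch in Go. The event type, the 500 ms threshold, and the 99.9% target mirror the homepage example above; everything else is an assumption rather than any particular vendor's implementation:

```go
package slo

import "time"

// Event is a simplified stand-in for one homepage-load event from telemetry.
type Event struct {
	Duration time.Duration
	Errored  bool
}

// isGood is the service level indicator: the event counts as good if it
// returned without error and within the 500 ms latency threshold.
func isGood(e Event) bool {
	return !e.Errored && e.Duration <= 500*time.Millisecond
}

// Compliance reduces a window of events (say, the past 60 days) to the
// fraction of good events, which is then compared against the SLO target,
// e.g. Compliance(events) >= 0.999.
func Compliance(events []Event) float64 {
	if len(events) == 0 {
		return 1.0
	}
	good := 0
	for _, e := range events {
		if isGood(e) {
			good++
		}
	}
	return float64(good) / float64(len(events))
}
```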

00:13:07

For instance, many of us access services via our cell phones, and your cell phone is only about 99.9% reliable. So why build a service that is 99.99999% reliable if customers will never experience it, and all that effort will have been for naught? Most services are not life or death. So if your service is not life or death, it is okay for users to occasionally have to press the reload button. That's the price of progress and of having your service be sustainable and affordable. And we can use the difference between a hundred percent and your service level objective as a guideline to the amount of allowed unavailability that we can have. This is the idea of an error budget. So, for instance, at a 99.9% service level objective, we are allowed to have one in 1,000 requests fail. So if we're serving a million requests per month, a thousand of them can fail, whether by being too slow or by being an error.
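The arithmetic behind that error budget is simple enough to write down; a small sketch, with illustrative function names:

```go
package slo

import "math"

// ErrorBudget returns how many requests are allowed to fail in a window for
// a given SLO target. For a 99.9% SLO over 1,000,000 requests per month:
// 1,000,000 * (1 - 0.999) = 1,000 requests may fail.
func ErrorBudget(totalRequests int, sloTarget float64) int {
	return int(math.Round(float64(totalRequests) * (1 - sloTarget)))
}

// BudgetRemaining reports how much budget is left after the failures seen so
// far in the window; a negative value means the SLO has been blown.
func BudgetRemaining(totalRequests, failedSoFar int, sloTarget float64) int {
	return ErrorBudget(totalRequests, sloTarget) - failedSoFar
}
```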

00:14:13

And that's okay. We just need to make sure we're managing the rate of burn, to make sure that we're not spending it all on one big outage, or worse yet, blowing it entirely. So this is the idea of the error budget burndown, and this helps us really think about when it's okay to take risks. If we've barely touched our error budget for the month, then we can release the chaos monkeys. On the other hand, if we've overspent our error budget, that's a sign to us that we need to slow down. There's no point in hanging on to extra error budget. It's like keeping extra cash under your mattress that you could be investing in your innovation and in your business. It's a missed opportunity. But let's suppose that you've had a really, really rough couple of weeks. In that case, instead of launching brand new services, we need to invest in reliability and pay down our technical debt.

00:15:06

And then we'll see better SLO performance over the following month. So in order to proactively manage our service level objectives and our error budget, we need to think about the idea of burndown alerts. We need to think about predicting, based off of the most recent few hours of data, whether I am going to run out of error budget in the next four hours. And if so, I need to wake someone up, because otherwise, without intervention, I'm going to end up with unhappy customers. On the flip side, taking this mentality means that we no longer have to treat every single little problem as a life-ending emergency. If I'm not going to run out of error budget in the next couple of hours, this is a problem that can wait until the next daylight hours, certainly if not the next working day. By switching to alerting on actual symptoms of user pain from our error budget burndown, instead of alerting every time a CPU spikes above 90%, this really helps us have much more tolerable lives as developers.
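Here is one way to sketch that burndown-alert logic. This is an assumed, simplified projection (a linear extrapolation of the recent burn rate), not the actual alerting implementation behind the talk:

```go
package slo

import "time"

// WillExhaustBudget projects the recent burn rate forward. budgetRemaining is
// how many failed requests the window can still tolerate, recentFailures is
// how many failures were observed over the lookback period, and horizon is
// how far ahead we care about (e.g. 4 hours). If the projection says the
// budget runs out inside the horizon, it's worth waking someone up;
// otherwise the problem can wait for daylight.
func WillExhaustBudget(budgetRemaining, recentFailures int, lookback, horizon time.Duration) bool {
	if budgetRemaining <= 0 {
		return true // already blown
	}
	if recentFailures == 0 {
		return false // nothing is burning right now
	}
	burnPerHour := float64(recentFailures) / lookback.Hours()
	hoursUntilExhausted := float64(budgetRemaining) / burnPerHour
	return hoursUntilExhausted <= horizon.Hours()
}
```

For example, WillExhaustBudget(250, 200, 2*time.Hour, 4*time.Hour) returns true: at 100 failures per hour, 250 remaining failures last only two and a half hours, so someone should be paged.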

00:16:13

And that enables us to feel comfortable taking on the pager rather than living in fear of it. But let's talk for a second about how we actually take this idea of service level objectives, and this idea of observability, and use that investment and that error budget to get faster feedback cycles, so that we can get higher velocity and higher reliability at the same time. This is where we get to the idea of observability-driven development, something that we weave into our development flow as a whole, rather than just bolting it on at the end. And this is why we talk about the idea of shifting observability left. So at Honeycomb, where I work, we practice continuous delivery, and we've gotten exceptionally good at it. We deliver code to production a dozen times or more per day. There are many steps in this, but I want to focus on just three specific elements that I think are most key to our success.

00:17:19

First of all, we instrument our code as we write that code. For each change we ask: how is this going to behave in production? How will I know that it's working? Just like I would not commit code without unit tests, I also do not commit code without instrumentation that helps me understand its production performance and behavior. And then once I have that code written, and once it's appropriately instrumented, I ensure really fast feedback loops. We use CircleCI to build on every commit on every branch. And then we also think about being able to deploy to production on demand, because if we keep the build in a green state, if we ensure that every build from the main branch is in a releasable state, that keeps us nimble and on the ball, rather than batching up commits a hundred at a time to release into production.
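A small sketch of what "instrument as you write the code" can look like in practice. The span name, attribute keys, and the CIRCLE_SHA1 environment variable are assumptions (use whatever your CI exposes for the commit that produced the build); the point is that the telemetry needed to answer "how is my change behaving?" ships together with the change itself:

```go
package reporting

import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// renderReport wraps a block of code we suspect might introduce latency in
// its own span, written at the same time as the feature code, and tags it
// with the build that produced it so that, minutes after a deploy, we can
// ask how the code we just shipped is behaving in production.
func renderReport(ctx context.Context, reportID string) error {
	_, span := otel.Tracer("reporting").Start(ctx, "report.render")
	defer span.End()

	span.SetAttributes(
		attribute.String("report.id", reportID),
		attribute.String("build.id", os.Getenv("CIRCLE_SHA1")),
	)

	// ... the actual rendering work goes here ...
	return nil
}
```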

00:18:17

But the most important thing is actually closing that feedback loop. If I wrote the telemetry as I went along, then it's important for me to look at that telemetry when my code reaches production, because it's a lot easier to find a critical problem right after it's reached production, maybe an hour after I wrote the code. That state is fresh in my head, and that helps me debug it. Whereas if I waited hours or days to actually release it, and if I weren't looking at it as it went out, I might have no idea what was going on or what I was thinking when I wrote the code. So by observing the behavior of our changes in production, we're able to verify and validate that they work the way that we expect, according to what we engineered into the code and according to what our users are actually doing with it.

00:19:07

So we use telemetry data about the production traffic that our customers are sending to us in order to verify not just correctness, but also things like usage patterns. Are people making use of the new feature that we added? Because if they aren't, then all of our effort was for naught. We have to really think about whether we are designing features that delight users, and whether there are things we can do to decrease the risk in case there is a problem. So this means that we often do things like A/B testing for things that are a little bit more experimental, so that instead of delivering large units of work slowly, we deliver small units of work fast, knowing that we can always roll back if there's a problem with the experiment (there's a sketch of that kind of guardrail below). So Honeycomb does this with 50 developers, but how do you make this practice of production excellence scale for an enterprise company with hundreds or thousands of developers? The answer is to invest in the right places in rolling out the culture. Your systems and teams are unique, so you're going to want to pick the most fertile place to invest, rather than trying to do it by fiat all over the board. That's why you need to think about finding the right teams to start with. And the first thing you're going to want to do with one of those teams is to use automatic instrumentation to generate the data that those teams are going to need.
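The guardrail mentioned above might look roughly like this. The Flags interface and flag name are hypothetical stand-ins for whatever flagging system you use; the idea is that the telemetry records which variant served each request, and rolling back an unhappy experiment is just flipping the flag:

```go
package search

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// Flags is a hypothetical feature-flag lookup; any flagging system fits here.
type Flags interface {
	Enabled(ctx context.Context, flag string) bool
}

// handleSearch gates the experimental ranking behind a flag and records which
// code path ran, so observability data can show whether the experiment helps
// or hurts before we commit to it.
func handleSearch(ctx context.Context, flags Flags, query string) ([]string, error) {
	span := trace.SpanFromContext(ctx)

	if flags.Enabled(ctx, "search.new-ranking") {
		span.SetAttributes(attribute.String("search.ranking", "experimental"))
		return newRanking(ctx, query)
	}
	span.SetAttributes(attribute.String("search.ranking", "baseline"))
	return oldRanking(ctx, query)
}

// Placeholder implementations for the two ranking paths.
func newRanking(ctx context.Context, q string) ([]string, error) { return nil, nil }
func oldRanking(ctx context.Context, q string) ([]string, error) { return nil, nil }
```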

00:20:38

The second ingredient that we feel is really critical is decreasing that feedback time: making sure that when you are developing code, you're getting it tested as quickly as possible and rolling it to production as quickly as possible. And therefore we think it's important to instrument your builds, so that you know what is stopping you from rolling out software every 10 or 15 minutes, rather than waiting hours or days for a build to be completed and graduated to production. Finally, we think it's really important to have vocal executive support, to have both a champion and a sponsor. You want someone with a strong sense of ownership, someone who really lives and breathes the DevOps mindset of shifting that knowledge left, of sharing everything they learn and leveling up the people around them. You want to help those people develop their observability skills and really focus on upleveling the rest of the organization around them. And you're going to need an executive sponsor who is willing to prioritize the investments in making those development teams able to move faster with the power of observability.

00:21:53

Now, some don'ts. Don't treat it as one and done; instead, think about iterating over time. For instance, maybe you don't pick the right service level objective to start with. That's okay. You can always go back and change that target to something that's more realistic. Similarly, it's okay to start off capturing a subset of the data that you need, as long as you feel like you're empowered to move forward and add additional instrumentation and telemetry. Next, you're going to want to make sure that you're able to scale things up to your organization, that what works on one pilot team can be rolled out across dozens of teams. How do you do that? You need to prioritize the developer experience. For instance, if you have central libraries that every team uses, adding OpenTelemetry to those libraries will make them automatically instrumented across your entire organization.
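For instance, if teams build their services on a shared internal HTTP library, instrumenting that one library gives every team telemetry for free. A minimal sketch along those lines, assuming the OpenTelemetry Go contrib otelhttp package (the wrapper name and package are illustrative):

```go
package httpkit

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// WrapHandler is the kind of helper a central platform team might expose
// from a shared library: every service that builds its server through it
// gets a span per incoming request automatically, with no per-team
// instrumentation work.
func WrapHandler(operation string, h http.Handler) http.Handler {
	return otelhttp.NewHandler(h, operation)
}
```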

00:22:51

You need to make sure that people are onboarded and given help as you start giving teams responsibility for their own pagers. You want to make sure that you have config-as-code setups so that teams are able to get the right set of dashboards and the right set of graphs, and also have the ability to dig in and evolve beyond the canned graphs. So in order to substantiate this idea that organizations much larger than 50 people can do this, I wanted to talk specifically about Vanguard, and you'll be hearing more from Christina Yakomin from Vanguard later in this conference. Observability at Vanguard really succeeded because of four key elements. They had observability champions in the form of Rich Anacor and Christina. They prioritized knowledge transfer and sharing among their teams. They adopted OpenTelemetry early on. And they went all in on service level objectives to help free up their teams' time: they turned off a lot of old, stale alerts that were too noisy and replaced them with SLOs. And by replacing their old noisy alerts with SLOs, they enabled their teams to have confidence in the changes they were shipping and to spend less time firefighting.

00:24:09

So it's one thing that Vanguard was able to do this successfully, and that's in no small part because of Christina and Rich. But how do we deal with the fact that there are tens of millions of software development positions that are going to be open over the next decade at large enterprises? How do we solve this? Well, we can clone Christina, right?

00:24:31

We can't clone Christina, and we can't clone the original contingent of SREs and ops engineers who've been doing all this work. So let's talk about the next generation of developers. More and more software developers are doing remote CS programs, like what I did, or they're graduating from boot camps, or they're self-taught. So they don't have the chance to learn operational skills the way the last generation of SREs got to learn. We have to grow them. And so we have this opportunity to make it a much better experience for this new generation, with production excellence. They don't have to burn out. They don't have to live in on-call hell.

00:25:09

So that's what we have prepared for you today: how you can adopt the production excellence and observability-driven development mindsets in order to shift production left in your organizations. And now I'd like to invite Shelby to share how you can help and how you can find us.

00:25:28

Thanks, Liz. We're really, really lucky to have a super active observability community. You can find us in the CNCF Slack in the TAG Observability channel, and OpenTelemetry contributors are very active in the CNCF OpenTelemetry channel. There's also a new community growing around OpenSLO; check out openslo.com, where you can find the link to join the Slack. Finally, we'd love to hear from you on Twitter about your stories of developing production excellence. Thank you so much.

00:25:58

Thank you for joining us. Have a great conference, and we'll see you in the Slack.