Las Vegas 2020

Adobe’s DevOps Journey: Finding — and Measuring — Customer Happiness

Adobe’s Digital Experience Operations has experienced more than a decade of extraordinary growth, both in volume (more than 8000% growth since 2009, with 377 trillion transactions in 2019) and complexity (nearly 3000 services, most of which are interconnected and interdependent).


With that incredible growth have come incredible challenges: how do we continue our mind-boggling (in size and sophistication) trajectory, while keeping individual customers happy? How, in short, do we scale?


Our answer? A shift from a “hero firefighter” culture to a “build awesome fire suppression systems” culture. In other words, a DevOps transformation.


This is the story of how the Adobe Cloud Engineering organization has embraced (and still is embracing!) a DevOps culture by leading from the roots and from the top, how we’re measuring our progress, where we’ve succeeded (and still are working to succeed), and what we’ve learned along the way.


Brandon Pulsipher

Vice President of Adobe Cloud Engineering, Adobe

Transcript

00:00:12

Hello. It's great to be here. I'm coming to you from the Adobe building here in Utah. I thought I would be with you live in Vegas, and then I thought I'd be with you from my living room. And then we realized that there was this beautiful building down the street that was pretty empty, and so we were able to take advantage and film this session from here. But I look forward to being with you in Vegas next year, and I'm excited to be here today. I'm Brandon Pulsipher. I am with Adobe, and I'm excited to share about the Adobe Experience Cloud journey around DevOps today. So again, happy to share a little bit about our DevOps journey and hopefully some things that are helpful to you. I thought I'd give you a little context on my background. I studied computer science.

00:00:58

I've been a QA engineer, a software developer. I've been an IT admin and a network administrator. And so really I've spent the last 25 years watching this hybrid of IT and operations and software development come together. This has been really exciting for me, and especially over the last few years, to see the fusion of these worlds is just really exciting. It's been a lot of fun, and it's been really amazing to see how that can impact the experience for our employees, our customers, and everyone around us. So I'm excited to share that with you today. I've spent the last 10 years with Adobe leading our technical operations, which is our cloud operations and infrastructure organization. And we've now transformed ourselves into a cloud engineering organization. So you can see that even the way that we name things and think about things changes the way that we act and behave every day.

00:01:52

I wanted to first just take a minute and share a little bit of context with you. Most people are familiar with Adobe and our long history in the Creative Cloud space and the creative space. You're probably well familiar with Photoshop and Illustrator and these products that have become verbs in our life and help create the beautiful content around me. The Document Cloud is something most people know, and especially through this COVID time, as people are doing even more online and more digital and doing more digital signatures, it's a really fundamental part of our business. I'm going to share with you today a little about our Adobe Experience Cloud. This is what started about 10 or 11 years ago as Adobe's entry into digital marketing, and it has really evolved into an exciting space around customer experience management and the ability to create a unique and personal experience for every consumer in every engagement.

00:02:51

And the awesome thing is this is built entirely in the cloud. So as I share a little bit of that journey, let me ground you in, frankly, some of our challenges and the aspects that really led us here. If I look back at, first of all, our software engineering and our operations teams, these have grown organically, as we've built and developed new products internally, but we've also made more than a dozen acquisitions. And so we have a variety of cultures, of geographies, of maturities: companies that are very early startups, that are mature startups, that are public companies. We've really had to bring together this very diverse set of cultures and practices into our organization. And not just that, but as we've been making that transformation, we've also been on this incredible growth journey.

00:03:41

And we now process more than a trillion customer-facing transactions per day. That's not backend database queries and things like that. These are really valuable, customer-facing transactions. And so dealing with these two things at the same time has created some really fun but unique challenges, and I'm excited to share with you our journey around that. So a lot of times our customer looks at our offering as this single box or a single entity. And this is the way our Experience Cloud solution often shows up to our customers. They love the vision. They want to accelerate their digital transformation, especially in COVID times when we've all had to move to doing more and more things online. And this is where we start. But if we know much about what's behind the scenes, it's never this simple.

00:04:36

And what looks like a simple app is actually a very complex set of technologies and services that have to work together. Some are very large and some are very small, and they're globally distributed. This is a little bit of a view into our landscape and all of the services that make up our Experience Cloud offering. And if we go even a little further down into the way this is built and the way this works, and I think this is more and more common, certainly in a cloud-enabled environment, we map out all the dependencies and the way everything has to interconnect and interoperate. And this creates just a massive amount of complexity and really makes our DevOps challenge even harder. So not only do we have this complexity of all these services that have to interoperate, they're also globally distributed across data centers and colos and public cloud environments.

00:05:33

And this is a bit of insight into our globally distributed footprint. So we have a massive footprint with massive scale and massive complexity, and that really sets up the challenge that we had. What we started to see, as we brought all these solutions together, is that we saw a lot of things that we liked, but we saw a few things that we didn't like. And these are some of the symptoms. We've all talked about symptoms and diagnosis a lot more in the last six months, probably, than we wanted to, but the symptoms of our service delivery started to show up in a few ways. We saw some quality issues that we weren't pleased about. We saw some ownership confusion: something goes wrong, and is it the developer? Is it the code base? Is it the operational implementation, or is it the rollout?

00:06:16

And we started to click on that a little deeper and see that these are the symptoms. We attacked some of the symptoms and we saw some incremental progress, but really what we started to see were some themes. And we stepped back and said, really what we think this is, is a deeper lack of alignment between our operations teams and our engineering teams, and a lack of common goals. And so that really is what initiated our focus on DevOps: to say, we are going to not just attack this incrementally, in an evolutionary way, but make a real revolutionary step forward in how we build and deliver our cloud solutions. So for us, it really was about finding our why. And if the illness is the lack of alignment, then the why became our cure. Finding our motivating principles that all of the teams could align around, that could become our rallying cry, was really exciting.

00:07:09

And for us, that became the customer experience. That became the center of everything that we were doing. While there are other great things that we do, great innovation, great features, efficiency, all of these things, it really came down to centrally identifying our why and rallying the team around that. And so that is where we started. Our hypothesis really was that as we did that, we would see the right outcomes in our scale and the collaboration of the teams, and certainly around reliability, security, and efficiency. And I'm really pleased to say that's played out in a lot of positive ways, and we'll share some of that. So how did we go about this journey? Well, first, again, we started with alignment around the principles. I think every organization has people that are passionate about this, and we had a handful of developers who actually wrote what they called the DevOps manifesto.

00:08:04

Now, a manifesto isn't something that's necessarily motivational to an entire organization. But we took the concepts in that manifesto and we turned them into a set of DevOps principles. We brought leaders together along with the engineering champions and really unified that into a common set of DevOps principles that we all agreed on. And from there, not only did we take that and publish it via PowerPoint or a document, but we put it into GitHub. We made it part of our code base. And I think it's so critical that we speak developer language when we're talking about DevOps. The next step that we took was, once we had our principles in mind, we said, okay, let's put this to work. Now, we could have applied this across 200 teams and services.

00:08:52

And we said, that's a lot to take on and a lot to manage. So let's start, let's prove this out, and then we'll show the teams the success and how to accomplish that within their organizations and their services. We didn't want to just pick a bunch of easy solutions that we knew would work. It's certainly easier to take new services like our Adobe Experience Platform, which we built from the ground up, where we don't have cultural or code legacy concepts to battle, and that was on our list. But we identified about a dozen services, 13 actually. And that's a variety of services: ones that have been around for a long time and do have some of the cultural legacy challenges, as well as new solutions, really a diverse portfolio, so that we could tell whether the principles and concepts we're applying are effective at any level.

00:09:44

And then we asked the teams to put some skin in the game. I think this is so critical. Everyone has to be aligned. We have to commit, we have to pivot the responsibilities to match the principles, and then hold everyone accountable to ensure we deliver a great customer experience. So I'm really excited about the way that's played out and getting everybody aligned. And in order to pivot and get everybody aligned, "unified" became our word: a unified everything, a unified engineering approach. Maybe it's the summary slide at the end, but I think this is the money slide, because this is really where things went from concept and principles to results and the changes that we had to apply. So we started with a unified engineering approach, and that meant unifying our on-call. No longer were we going to bring ops into the war room first if there was a problem, and then figure out if we need engineering and call them. When someone hits the big red button and we bring the teams in, our system automatically calls out to the engineering point and the operations point. They both come in and help solve the problem together.

00:10:55

And that is really, really valuable. It's created a ton of value and a ton of insight for the engineering teams to be able to learn some of what operations had to go through, and for operations to start to think more with an engineering mindset. We unified our code. We unified everything about our code, not just the features and functionality and the application stack, but all the way down to the infrastructure: our testing and config models, our automation, our documentation. Everything needed to live in our code repository. And that's really powerful. And then we unified our backlog, and this was another transformational change. We used to have a number of buckets that we put some of this work into. They sat side by side: features and innovation along with operational improvements, and then maybe cost efficiency and security. These things were all very close, but slightly different.

00:11:53

And it was so powerful once we said, we are going to unify all this into a single backlog and make prioritized decisions with our operations team, our product management team, and our engineering teams around the most important issues facing our customers. It became really powerful, and we saw some interesting results. For example, when we did have events and outages, we've always conducted root cause analyses (RCAs) and created action tickets out of that. And typically what we would see is teams would pick up the first few tickets that are really meaningful and impactful, and then the rest would kind of sit in JIRA, which is the system we use, and they may or may not get actioned over time. But what we saw with a unified backlog approach was that teams were actually excited about capturing this and solving this.

00:12:38

And we saw our problem resolution ticket counts really go up, because teams started to see the opportunity to solve things, even opportunistically. Maybe as they're working on a re-architecture or a feature, they could say, oh, I can solve that problem there. And we found there really was no additional work, or very little additional work, to solving that along the way. That visibility was so powerful. So on to some of the tools that we had to put in place. This sort of people, process, technology framing certainly plays out in this space, and the technology and the tools were definitely a part of that. So we started with service level targets. We need to know what our target is. If we don't know what we're trying to reach, we're never going to get there. If we don't know where our destination is, how are we ever going to know when we arrive?

00:13:26

So service level targets were our fundamental: what does good performance look like? Not just in nines, but holistically. So we started defining that. And then once we've defined that, we have to measure that. So we put instrumentation and indicators in place via the SLI approach. And then once we had the measurement in place, we could define what was acceptable, and anything that was acceptable was our budget, our error budget, as many in the DevOps and SRE space are familiar with. And once we cross that budget, we have to change our behavior and change our action. And so we have to measure and report and constantly inspect this. Going through each of those a little bit: I think the SLT is the king. It defines the customer experience. Again, if we haven't defined the experience we're trying to deliver, we're going to argue all day about what's good enough.

00:14:15

It becomes subjective. It becomes debatable. We're going to have some customers happy and some that aren't. And so this really takes some discipline. It takes engagement from the entire product organization and the business, and then it takes discipline and focus to stick to this. This really will test your leaders in terms of their commitment to say, this is what quality looks like, and this is the experience we want to deliver. And again, once you've defined that, you have to instrument, because it's not good enough just to know what our target is; we've got to measure our progress against that. And I think traditionally we've all looked at this as availability. What's our success rate, right? Does the transaction complete or not? And that might result in a number of nines, but it's not enough.

00:15:02

What's the response time experience going to be? What do we want it to be? When someone clicks on a button or takes an action, should that happen in 10 seconds, 10 milliseconds, somewhere in between? What's acceptable? So we decided we needed to define that. We needed to understand our throughput and traffic levels, because that was important to us in delivering a quality experience. And then we had to manage capacity. We have to know where our system is going to fail so that we can stay ahead of that and proactively address it. And we found a really valuable element of that on capacity utilization, which was: the better we understand that, the better we can scale our system down when we don't need that capacity, and we can save the business a lot of money.
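As an illustration of the indicators described here (success rate, response time, throughput, and capacity utilization), the following is a minimal Python sketch of how such SLIs might be computed over one measurement window. It is not Adobe's implementation: the `Request` record, the `golden_signals` function, and every parameter name are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One customer-facing transaction (hypothetical record shape)."""
    succeeded: bool     # did the transaction complete?
    latency_ms: float   # response time observed for this transaction


def golden_signals(requests: List[Request],
                   window_seconds: float,
                   latency_target_ms: float,
                   capacity_rps: float) -> Dict[str, float]:
    """Compute four illustrative SLIs for one measurement window.

    All thresholds are inputs; nothing here reflects real Adobe targets.
    """
    if not requests or window_seconds <= 0 or capacity_rps <= 0:
        raise ValueError("need at least one request and positive window/capacity")

    total = len(requests)
    ok = sum(1 for r in requests if r.succeeded)
    fast = sum(1 for r in requests if r.latency_ms <= latency_target_ms)
    throughput = total / window_seconds  # requests per second

    return {
        "success_rate": ok / total,                   # availability
        "within_latency_target": fast / total,        # response time
        "throughput_rps": throughput,                 # traffic level
        "capacity_utilization": throughput / capacity_rps,  # headroom
    }


sample = [
    Request(True, 50.0),
    Request(True, 100.0),
    Request(True, 150.0),
    Request(False, 400.0),
]
signals = golden_signals(sample, window_seconds=2.0,
                         latency_target_ms=200.0, capacity_rps=10.0)
print(signals)
```

A low `capacity_utilization` value is the kind of signal that could justify scaling a system down, which is the cost-saving point made in the talk.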

00:15:46

And so this has been really powerful. We identified these as our four golden signals. These may or may not be yours, but these are what we found was a good fit for us. We debated, I think, about six or eight different elements, and ultimately this is where we landed. So once SLTs and SLIs are in place, we really have to manage against them. And again, this is another money slide, I would say. These error budgets are so important, because once we've defined what our target is, we can also define what's going to happen when we miss our target. Making this decision ahead of time is so much easier than making it in the heat of the moment and in the heat of the battle. So the approach around error budgets and the concept is pretty simple, which is: when you are above your error budget and things are operating well, then you can continue to work on all the innovation and the features and the improvements that the organization needs. When you fall below that threshold, everybody in the organization, and this comes back to unified engineering, really then has to stop what they're doing and work on addressing that issue.
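The error budget policy described above can be sketched in a few lines of Python. This is an illustrative example only: the function names, the 99.9% target, and the request counts are assumptions for the sketch, not figures from the talk.

```python
def error_budget_remaining(slt: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget for a period.

    slt is the service level target as a fraction (0.999 means 99.9%).
    A result of 1.0 means no budget spent; 0.0 or below means exhausted.
    """
    if not 0.0 < slt < 1.0:
        raise ValueError("slt must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1.0 - slt) * total  # the budget, in failed requests
    return 1.0 - failed / allowed_failures


def next_action(remaining: float) -> str:
    """Apply the policy agreed ahead of time (hypothetical wording)."""
    if remaining > 0.0:
        return "continue feature work"
    return "stop and fix reliability"


# Assumed numbers: 99.9% target, one million requests in the period.
healthy = error_budget_remaining(slt=0.999, total=1_000_000, failed=250)
overspent = error_budget_remaining(slt=0.999, total=1_000_000, failed=1_500)

print(next_action(healthy))    # budget left, so feature work continues
print(next_action(overspent))  # budget exhausted, so everyone pivots
```

With these assumed numbers, the first period has spent roughly a quarter of its budget and the second has overspent it by about half. The point of the design is that `next_action` is decided before any incident, so nobody has to negotiate priorities in the heat of the moment.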

00:16:49

To me, this is kind of like, well, I have a couple of teenagers, and I think about this like their grades. Every Friday I get a report in an email that tells me what their grades are. And if their grades are an A or a B, or whatever we've agreed they're going to be, based on their targets and maybe the classes they're taking, I'm like, go out, have a great weekend, hang out with your friends. But if they come home and they have a C minus, then they've missed their budget, and they're now going to stay in on Saturday and do homework. And not just that, but I'm going to sit down with them and go through that and figure out what they need to do and what they need to change in order to get back on track. And so I think this is a really valuable concept. We apply this in other places in our lives, but we don't always bring it into the software development life cycle. And I think it's such a valuable and applicable concept.

00:17:38

And again, once you've got all that in place, you've got to measure, you've got to report, you've got to have clear metrics. So we took all these SLTs and SLIs and we built a DevOps quality scorecard. We shared that out with the teams. We agreed we're going to sit down and look at that as an executive committee every week. And not just a bunch of executives sitting in the room. We actually started with that, and we said, well, we're not really facing the problems on the front line. So we brought in our engineering and operations leaders who understood the technical issues. And frankly, it forced them to better internalize and understand the issues their teams were facing, and for us to make decisions together. So this was really powerful, as we measure, report, and inspect on an ongoing basis.

00:18:22

And this is the exciting part for us. So this led us to a new normal. If we look back at our 2019, you can kind of see that, as I mentioned earlier, we go through these growth cycles and continue up and to the right. And we have periodic customer events or product launches or sporting events or world events, but we see this sort of normal growth pattern. We've seen DevOps help us through some of these times, but what we saw this year, as we entered COVID-19 and we all moved to doing everything digitally (you're now ordering your food, you're ordering supplies, all the things that we're doing online), is a new normal in our traffic. And our new normal was that every day of the week, we were exceeding our holiday traffic levels.

00:19:13

And that just continued to go up and up and up. So this has been really exciting, our ability to handle this because we have the foundation in place. And because we are operating with DevOps principles, patterns, and culture in mind, the work-from-home transition that we made was very seamless. People could join; we already had all the facilitation we needed for anybody to join from anywhere. We had the tools, we had the structure, we had the responsibility in place, and we've been able to adapt in real time. And it has been so exciting to see how successful we've been able to help our customers be with these outcomes. So as we look at this journey and what's changed and how we've had to adapt, this certainly is an iterative process.

00:19:59

And while you want to stay committed to the approach you've outlined, as you start to see things, you definitely need to adapt and change. And what we saw is a lot of things fell into place. With our unified engineering, we went through a little period of struggle to get our engineers and every team on call. And we heard every excuse in the book: I'm busy. I'm a software developer, I don't do that. My country's works council doesn't allow me to be on call. So we had some complex issues to work through, but we got through that. And then what we started to see were some challenges like observability. We saw that while we have good tools and good data, we have some gaps. And so we had to double down and invest in our metrics and observability.

00:20:45

We had to better automate our response and our on-call work. And then we started to see the opportunities to say, well, we can automate the on-call and pulling people in, but what if we actually went a step further and just auto-remediated this problem? And that's been really exciting, to solve customer problems even faster. So we've stayed true to these principles. We've really continued to focus on the measurements and the data, and we've maintained that accountability across our teams. And that's been an exciting journey to see. I think the big question that I hear a lot is: who owns DevOps, right? Is it engineering? Is it ops? Is it the executives? How do we drive this? How do we get started? How do we begin? Is it a movement that's a groundswell activity, or is it an executive mandate?

00:21:30

And I think the answer has to be both. We have to find the champions and the passionate engineering and operational leaders that want to do things differently. And then we've got to have the right leadership alignment. Bringing these two things together was really what made our journey finally accelerate. While we were making incremental progress, a few of the changes that we made and the way we approached this, again, to bring a common set of principles and to spend the time to get alignment on those, really allowed us to bootstrap and get our program going, and then just see it take off. So I kind of talked a little about this, but we had good alignment from the leadership. We went a step further and we regionalized that and said, we're going to find a sponsor in each of the regions where we have a major presence for our engineering and operations teams.

00:22:22

And then we also said, we're going to go find champions in each of the teams. So as we went beyond those first 12 or 13 services and started to expand, we identified 120 champions, and these were our DevOps champions. And we've got, in this area of the business, about 2,000 to 2,500 engineers. And so, you know, three to five percent, and we said, you're our champions. It was probably a little bit of volunteerism and being asked to volunteer, but we found those people who had passion and wanted to see things change and were committed around this. And with those champions really living this, breathing this, acting this every day, talking to their peers, we saw this change. And again, this whole cycle of action and accountability at an individual level, and then leadership inspection and alignment on the outcomes, was really exciting.

00:23:16

We even made some organizational changes. We took a bunch of our SRE folks and, as we made progress through this journey, we actually embedded them into our engineering teams and sort of completed that journey around unified engineering. Now, organizations are all different. What we did, and there's a bunch of different models out there around DevOps, may not work for you, but sometimes organizational change can be another catalyst to help people see we're really serious about this, and we're going to make some changes. So with that, I'll just do a brief recap. I think, first, again, you've got to find your why, and that's got to become your rallying cry. You've got to be passionate and deeply committed to your why, and be able to stand up in front of the organization and link arms as leaders, and then see the teams link arms and commit to winning these battles.

00:24:13

Getting alignment across all these teams, showing a commitment to DevOps, not just talking about it occasionally: it's got to be built into the way we run the business. And seeing that unified experience and this unified engineering concept really start to shift as we bring together software development and QA testing and site reliability engineering and user experience, and we start to see these things merge together and everybody really be responsible. You've got to find your champions. It's so important to have the experts on the ground, embedded in the team, that are going to drive that change. Pick a lighthouse. It doesn't make sense to try and solve all this at once. Pick a lighthouse, start with one, then two, then five. Pick a few services where you can apply this in your organization, with your culture, and achieve some success.

00:25:10

And then you can champion and highlight that success across the business and let it be a living process. It's okay if it changes and tweaks a little bit. This is the beginning of a journey, and it should continue and evolve as your world changes, as your culture grows, as your customers grow, really, as we go through this time. This is such an interesting time that we're all in the middle of, and I think we are fortunate that we were able to start this journey before and put ourselves in a really good place to ride out this, I don't even know what we call this anymore, but to ride out this pandemic and all that goes along with it in a really successful way. But I've also reflected on: what if we hadn't done this then? Would we do it now?

00:25:59

And I think the answer is yes. It may be a little harder. We may have to go about it a little differently. But I think if you're not there, don't wait to start this. You shouldn't wait another year. Continue or get started on your DevOps journey, and I think you'll see those benefits and you'll see that value happen. So I encourage you to really take the time to think about it, to apply it, and let it happen. So with that, again, thank you for your time. I wish I could be together with you in person. I'd love to connect. You're welcome to reach out to me and connect directly with questions or comments or whatever you want to add. There's always things I can learn and we can learn. I will be on Slack to answer questions as well for the next little bit. So I look forward to engaging and hearing from all of you, and wish you luck in your DevOps journey.