The Shift to a DevOps Model While Building Our Cloud Platform - You Build it, You Run it! (US 2021)

At Discover, we've built the next generation of container platform based on Red Hat's OpenShift Container Platform, which uses Kubernetes as its core orchestrator. This platform is the nucleus of a larger container ecosystem (a.k.a. "Tupperware"). Tupperware groups numerous products and services together to provide container builds and deployments, software-defined networking, and brokered public cloud commodity services for relational databases, object storage, and caching. Discover's multi-cloud architecture is centered on Tupperware's cloud abstraction design and processes. In this talk, we will share how we transformed our engineering practices from a siloed model to one that combined the development and operations teams into a single DevOps team. Before DevOps was implemented, the development and operations teams worked as two independent squads, each with its own goals and objectives. The differences and lack of communication between these teams often impacted the product, which in turn affected the consumers of the platform and Discover. There was a lack of ownership and accountability, and the lack of a feedback loop from operations back into our product backlog was affecting us adversely. We'll talk about the techniques we used to drive the cultural transformation towards a DevOps-oriented approach, the practices we followed, and what worked for us. We'll also cover the lessons learned in taking our teams through this DevOps transformation and the failures that we learned from. We'll discuss how we measured the efficacy of the transformation and our success criteria, centered on consumer feedback and metrics like system uptime, number of incidents that caused production downtime, number of customer issues, etc. This shift in approach improved our system uptime, improved our team ownership and accountability, and resulted in fewer consumer complaints.
We'll also cover how we leveraged training to upskill engineers, and blameless postmortem analysis techniques to drive root cause analysis when "incidents" in production caused an outage. We'll talk about our practice of Site Reliability Engineering and the focus on reliability that was central to our transformation. You'll walk away with an understanding of what it takes to truly transform the practices within your product team and embrace a true DevOps mindset, and of the benefits that this approach entails. Throughout the transformation journey from a siloed model to a collaborative DevOps model, one of the key aspects we focused on was the culture of the organization. Our goal ultimately was to create better outcomes for the consumers of the platform by offering improved reliability and better customer service. We created a set of teams that were self-organized and empowered to make the right decisions. We modeled this behavior within the teams by encouraging team members to make decisions for the scope of work they are responsible for, and we provided support as needed. We recognized that mindset is everything when it comes to these transformations and new ways of working. We provided incentives in the form of "bravo" awards and recognition for team members to embrace the "You build it, you run it" mindset. When "incidents" happened in production, we leveraged blameless postmortem analysis to drive root cause analysis and take action. To summarize, these are some of the key aspects of how we embraced culture as an enabler of the "You build it, You run it" mentality within our teams:
• Empowering team members
• Encouraging self-organization and decision making
• You Build it, You Run it: Mindset is everything
• Blameless Postmortem Analysis

Tags: US, Las Vegas, breakout, 2021

Sakthi Kasiramalingam

Director – Cloud Platforms, Discover Financial Services

Bryan Payton

Senior Principal Enterprise Architect, Discover Financial Services

TRANSCRIPT

00:00:13

Welcome to the DevOps Enterprise Summit. My name is Bryan, presenting "You Build It, You Run It: The Shift to a DevOps Model" with Discover Financial Services. Presenting with me today is Sakthi. Sakthi is the director of cloud and application platforms at Discover Financial Services, whereas I'm the senior principal enterprise architect focused on application platforms. Let's take a little deeper dive into our presenters. As I mentioned before, my name is Bryan Payton. I'm a technology strategist and engineer. I spent the majority of my career focused on government intelligence and data analytics, working a lot in future-state and applied strategy and engineering. I enjoy playing sports, doing anything outdoors, and hanging out with my two boys, pictured below. As you can tell, they are a handful and they are a lot of fun. You can reach me at my Discover email address. Sakthi?

00:01:19

Thank you, Bryan. Hello everyone. I'm Sakthi. I'm a director in our infrastructure products area here at Discover Financial Services. I've been with Discover for the last two and a half years. Before coming to Discover, I spent the last 16 years of my career in various engineering roles in product development organizations. I'm a mother of two girls, and being a mother has taught me how to ruthlessly prioritize my time, so I'm maximizing my time for myself, my family, and my teams at work. I'm a lifelong learner, always looking to learn something new and adding to my toolset on a daily basis. You can reach me through my Discover email address.

00:02:05

As Bryan said, we both work for Discover, and Discover offers award-winning credit card and personal banking offerings. As you might be aware, while we are a financial services company, Discover is very much a technology firm that leverages a technology-first approach to offer the best experience for our customers. We also embrace a culture that is deeply rooted in innovation and volunteerism. I would encourage you to check out our technology openings at the careers website that we've linked here. And as a reminder, all the views expressed in this presentation are ours individually, and not those of our employer. With that, I'll pass it over to Bryan to get us started.

00:02:52

Thank you, Sakthi. So what we want to talk about today, our DevOps journey, really focuses in on our application platform here at Discover Financial Services. Discover's application platform can be summarized as a container-based ecosystem that we've overlaid on top of public and private cloud infrastructure. This overlay ecosystem allows us to have a consistent capability, product, and operating-model environment for our developers' experience. Here are some of the key attributes and highlights of this infrastructure. As you can see represented across the globe, there is a full network and service mesh topology: a network mesh and a service mesh overlaid on top of the application platform, on all these little bubbles that you see presented. This gives us a very high level of availability and disaster recovery. All the dependencies of the products and capabilities, and the deployment and delivery of those products and capabilities across this environment, are abstracted through a common set of APIs and are all packaged and available through Helm.

00:04:11

So our developer experience, in that fashion, is always consistent and always delivering the same experience to the deployment and CI/CD processes. All of our common operations for these dependencies, such as database backup and recovery, are also centralized on that common set of APIs. So again, this really allows us to delegate the operations back to the application community. It gives them one control plane, one set of APIs, and one experience for everything that they do within our environments. And it gives us a very nice way to confine and consolidate our security architecture and enforce all of our applied standards equally on all these different private and public cloud infrastructures. So this application platform has a lot of highlights. It sounds very challenging. It sounds very rewarding. And in that, there have been some struggles. That's really what led us into our DevOps journey, which we want to take a look at next.

00:05:32

So this DevOps transformation that we went on was really rooted in a core set of problems. The problems are highlighted here on this page and illustrated with this car engine that is fuming, which is something we can all relate to as a very inconvenient thing to happen to you in the middle of nowhere, right? That's kind of how we felt with our application platform as we've been developing it: we don't want to leave our application teams feeling stranded or feeling vulnerable, not knowing what happened and why. So here are some of the things that we ran into. Decreased platform reliability was the first indicator that we needed to do something. The reliability of the platform can be measured as the uptime rate, the experience of the consumers, and their applications' uptime rate as well, right?

00:06:34

That kept going down and up, down and up over time. It really was a wide set of problems and a wide set of different issues that resulted in that. But nevertheless, it was consistently inconsistent. Lack of ownership and accountability was another thing that we identified pretty early on in our journey that we needed to correct. What this really resulted in on a day-to-day basis was this: in most environments, and ours was not unique, there is a segregation between operations support and the platform engineering or core engineering teams. That turnover isn't always great, and that communication isn't always great. In our environment, that resulted in a lot of finger pointing: "Hey, we didn't get the proper turnover," or "a change was made and it wasn't communicated to us," or "you didn't fix this as per the SOP, or per the documentation, or using our automation," and just a lot of that back and forth.

00:07:36

And so there wasn't really a good sense of ownership of these problems and of the things that were causing our platform to be inconsistent. Lack of product orientation surfaced itself as well: teams were focused on troubleshooting and not really focused on the brand of our platform, on promoting that brand and building around that product so that we were differentiating ourselves in our environment and to our consumers, saying "here are the benefits, and here's your mindset into this platform." The feedback loop into the product backlog is where we were lacking. That means taking the lessons learned from those different areas of issues that we were just describing and putting them back into our agile backlog, so that we could generate some successful remediation of those issues, or take a deeper dive and really find a root cause that maybe we weren't able to identify during a postmortem or during a fallout. We were lacking that and constantly facing the same things over and over. And then time to market suffered. Our time to market for new products, delivering new capabilities into this platform or delivering the platform itself, was starting to slow down because of our time spent in all these other areas. So we narrowed it down to these areas and said, these are our core problem areas that we want to focus on in this transformation. So how are we going to start facing and correcting these problems?

00:09:27

Well, we had to identify the techniques we were going to use, right? We sat down and said, what tools are we going to take out of our toolbox to actually solve these problems? There are a lot of different ways that we can handle this. What's going to be the most effective? Let's sit down and think about this before we just present a bunch of problems to the team and say, start fixing these problems. One thing we started off with is a single backlog for development and operations. We consolidated our operations and our development engineering teams. What we sought to do with that is reestablish that accountability and that ownership. When you build something and you have to provide support for it after you build it, it really makes you focus on building it the right way, and on building automation, remediation tasks, documentation, those sorts of things around it.

00:10:23

Being put into the situation of supporting that product really makes you focus as an engineer. So we are rotating our folks through that process. Operations as a rotational responsibility was the segue from that. Now that we've collapsed the teams and they can see both sides of the picture, we don't want anyone to sit in operations too long, and we don't want anybody to sit in engineering too long and lose that focus. So we rotate our people through operations cycles and engineering cycles on different sprint cadences in our agile framework. That has allowed them to, again, understand the issue, take it in, and put it in the backlog, and also, when they're developing, to see how this is going to be handled on the operations side. Single operating mechanism and scrum ceremonies: this goes back to our agile framework adoption here at Discover.

00:11:22

Making sure that our operations teams are following the same procedures, and making sure that everyone is included in the same agile processes and scrum ceremonies, allowed people to voice things earlier, such as "you need to consider this in the user story" or "you need to consider this during development." It gave more voices the chance to be heard earlier on, and consolidated that, instead of waiting until it's too late or waiting until a problem occurred for somebody to address it. Upskilling through training and exercises: we've established a very rigorous training environment here at Discover, and this is meant to develop our internal staff. We've built many different courses, including developer courses, administration courses, engineering courses, and management courses, all focused on educating and improving our engineering across different domains, and within our application platform in particular. Exercises were a way for us to make people get comfortable being uncomfortable.

00:12:36

What is meant by that is we would set up mock environments in the lab and we would break things. In breaking those things, certain people would have to respond and correct them. It was a training exercise to evaluate how you handle the stress of fixing something when it breaks, as you would if you got paged. What documentation are you using when you fix it? What pipelines, what automation are you using around that when you fix it? The goal is to make sure that people understand where our resources are and that people are getting comfortable responding to issues in those situations. It really just became a big learning environment for people, and we do that on a quarterly basis. Blameless postmortem analysis: this was our way to level set after an incident and make sure that we got rid of the finger pointing and really focused on what the issue was, how we are moving forward from it, and how we are generating work to avoid this happening again. Operations review meeting: this is a weekly thing that we do to evaluate how everything is going on the operations side. And again, because of our collapsed engineering and operations teams, the amount of feedback and activity in those sessions has really increased. So now that we've identified the techniques, we had to make sure that they're successful, and to make sure they're successful, we have to measure them.

00:14:06

These are some of the things that we felt were important for us to measure as we progressed through the DevOps journey. I'll cover a couple of highlights, and there are more that anyone can include in their journey, but a couple that were helpful for us are these. Exercise scores and metrics: we talked about doing these training exercises. We gave scores based on: was it corrected? Was it corrected in the right way? Was it a band-aid fix? Was it following a process? So we treated this as a pass or fail type of environment. That really gave our guys a sense of accomplishment and showed them areas where, hey, you need to do this the right way, because it has consequences, right? Team exhaustion and inclusion surveys: that's a way for us to understand, are we pushing on people too much?

00:15:00

Are we asking them to do too many things? Do we need to scale back capacity in our planning and make sure people have time to be effective in their jobs, and effective in both operations and engineering work? And then the number of incidents that were escalated: the reduction in open issues and in production incidents opened by our consumers, because that's really a key indicator of whether we are taking what we've learned and generating new ways to automate or correct that long term. These are just some of the metrics that we decided to evaluate and measure throughout our journey. But it's not just technical issues, techniques, and measurements that were going to make us successful in this journey. We understood that there has to be a focus on culture. There has to be a feedback loop between lessons learned and bumps in the road. And to talk more about that, here is Sakthi.
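The uptime and escalated-incident measures mentioned above could be tracked with something like the following minimal Python sketch. The `Incident` record, its fields, and both function names are assumptions made for illustration; the talk does not describe how Discover actually computes these numbers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One production incident (hypothetical record shape, not from the talk)."""
    start: datetime
    end: datetime
    escalated: bool  # True if the incident was escalated to the platform team


def uptime_percent(incidents, window_start, window_end):
    """Percentage of the window with no incident in progress.

    Assumes incidents do not overlap in time; overlapping outages
    would be counted twice by this simple sum.
    """
    window_seconds = (window_end - window_start).total_seconds()
    if window_seconds <= 0:
        raise ValueError("window_end must be after window_start")
    down_seconds = 0.0
    for incident in incidents:
        # Clip each incident to the reporting window.
        start = max(incident.start, window_start)
        end = min(incident.end, window_end)
        if end > start:
            down_seconds += (end - start).total_seconds()
    return 100.0 * (window_seconds - down_seconds) / window_seconds


def escalated_count(incidents):
    """Number of incidents flagged as escalated."""
    return sum(1 for incident in incidents if incident.escalated)


# Example data: a ten-day reporting window with one outage inside it
# and one incident that ended before the window began.
window_start = datetime(2021, 1, 1)
window_end = window_start + timedelta(days=10)
incidents = [
    Incident(datetime(2021, 1, 3, 9, 0), datetime(2021, 1, 3, 11, 24), escalated=True),
    Incident(datetime(2020, 12, 31, 23, 0), datetime(2020, 12, 31, 23, 30), escalated=False),
]
print(f"uptime: {uptime_percent(incidents, window_start, window_end):.2f}%")
print(f"escalated incidents: {escalated_count(incidents)}")
```

Comparing these two numbers from one reporting window to the next is one simple way to see whether reliability is trending in the right direction.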

00:16:03

Have you heard of the saying "culture eats strategy for breakfast"? Well, I think if we're not careful, it's true that culture eats strategy for breakfast, lunch, and dinner. So far, Bryan talked about the specific challenges we faced initially that drove us to do something different. He talked about the specific techniques we applied to overcome them, along with how we measured the efficacy of the transformation, centered around certain objectives and key results. While we focused on applying specific techniques to drive this transformation, as well as measuring what matters, we recognized that culture and mindset are everything when it comes to change and transformation. The biggest challenge around DevOps and transforming to a new way of working is not the technology or the metrics; it is the people and the behaviors exhibited on a daily basis. So we decided to focus on our people and leverage culture as a key enabler for this transformation.

00:17:14

So you might wonder, how did we go about creating a culture within the team that enables this transformation? What did that even mean for us? First, I purposefully moved away from a command-and-control type of approach and instead focused on creating a set of teams that were self-organized and empowered to make the right decisions. We modeled this behavior within the teams by encouraging team members to make decisions for the scope of work they are responsible for, and we provided support as needed. I embraced an approach of leading with questions instead of answers, to guide the team towards self-organization. When incidents happened in production, to take the finger pointing and the blame out of the picture, as Bryan talked about earlier, we used techniques like blameless postmortem analysis and the five whys to drive root cause analysis and to learn from the failures. When handling routine operational support, our product owners encouraged our team members to get to the root cause of these incidents.

00:18:23

So they are resolved once and for all. Most importantly, as a leadership team, we focused on establishing psychological safety within the teams by being open and engaged, and by listening and responding to team members' feedback. We centered our practices on the philosophy of "you build it, you run it" in order to break the knowledge silos that existed between our development and operations teams. We helped establish a weekly learning series where our product teams came together on Friday afternoons for upskilling and cross-training, to eliminate the single points of failure that existed within the team. This led to a culture of being a learning organization that continuously learns and improves.

00:19:14

While we had a fair share of success and excitement about the new way of working, we also faced some significant challenges, especially initially. There were some initial failures that we faced, but we quickly inspected and adapted our approach to make some tweaks as required. This helped us turn our failures into stepping stones for success. So let's take a look at some of our early failures and how we overcame them. Number one, our technical exercises resulted in failures and lower morale within our team. Our technical exercises were nothing but simulated chaos tests of possible incidents that could happen in production, so that our team members could practice, learn from the experience, and become ready to handle real incidents. Initially, these exercises led to failures because our team members were not able to get to the root cause of the incident. So we had to restructure our technical exercises with clearly outlined steps, misconfiguration injections, expected outcomes, and lessons learned.

00:20:25

We also recognized the top performers from these exercises to motivate the team. Secondly, the training material that was delivered was not very effective in meeting the needs of our people. So we slowed down the training program to ensure quality training materials and lessons were delivered. We reviewed the material prior to delivery. We also started loading these as tasks in our agile planner to help keep us on track. Last, our engineers were being overworked, balancing operations and product delivery. Based on the feedback from our team, we made adjustments to on-call rotations so that engineers were not on support 24/7 when they were on call. We also adjusted our product development cycles so that engineers on call were not allocated to product development initiatives during that particular time period. This helped our teams maintain a better work-life balance. The onset of the pandemic and the shift to completely remote work also led to increased burnout for some of our team members. So we helped establish forums like virtual happy hours, virtual watercooler team check-ins, and informal coffee chats to help our teams have some social forums outside of our day-to-day operating model and product delivery initiatives.
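The rotation described above, where the engineers on operations duty in a given sprint are kept off product development for that sprint, could be sketched roughly as follows in Python. The function name, the round-robin order, and the one-engineer-per-sprint default are all assumptions for illustration; the talk does not say how the actual schedule is built.

```python
def rotation_schedule(engineers, sprints, on_call_per_sprint=1):
    """Build a simple round-robin operations/engineering schedule.

    Each sprint, `on_call_per_sprint` engineers take operations duty and
    everyone else is allocated to engineering work. This is a hypothetical
    policy used only to illustrate the idea of a rotational responsibility.
    """
    if not engineers:
        raise ValueError("at least one engineer is required")
    if not 1 <= on_call_per_sprint <= len(engineers):
        raise ValueError("on_call_per_sprint must be between 1 and the team size")

    schedule = []
    team_size = len(engineers)
    for sprint_index in range(sprints):
        # Advance through the roster so operations duty moves to the next people.
        first = sprint_index * on_call_per_sprint
        operations = [engineers[(first + k) % team_size] for k in range(on_call_per_sprint)]
        # Engineers on operations duty are not allocated to product development.
        engineering = [name for name in engineers if name not in operations]
        schedule.append(
            {"sprint": sprint_index + 1, "operations": operations, "engineering": engineering}
        )
    return schedule


# Example with a made-up three-person roster over four sprints.
for entry in rotation_schedule(["Ana", "Ben", "Chi"], sprints=4):
    print(entry["sprint"], "ops:", entry["operations"], "eng:", entry["engineering"])
```

A real schedule would also need to account for time off, time zones, and uneven team sizes, which this sketch ignores.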

00:21:52

So in addition to the initial failures, let's take a look at some of the key obstacles we faced and, most importantly, how we overcame them. First, not all engineers were comfortable with operational responsibility. We overcame this by gradually exposing our engineers to customer issues through our tickets and incidents, as well as recurring chaos test exercises. Using these approaches helped us instill the confidence and the experience that the team needs to really embrace the approach of "you build it, you run it." Secondly, there was a lack of prioritization for the team's training, exercises, and documentation. With an extremely busy product backlog, and I'm sure you can relate to that, it was very challenging for us to prioritize the time for team training and documentation. Feature development and product delivery initiatives took a higher priority over training the team, but we had to make a conscious effort to be a learning organization and make this training a priority to ensure our team's overall success.

00:23:03

This sometimes meant being willing to put a deliverable back in our backlog so that we could properly train our team and mature our documentation and practices. Finally, addressing technical debt. With the huge volume of support issues that started coming our way and the demand from the enterprise, it was very easy for our team to just resolve the issue at hand and call it done. But as part of the new operating model and the new way of working, we helped establish a weekly operations review meeting, where our team developed a new muscle of looking at this operational backlog of issues through a different lens, to see what repetitive patterns of issues came along, and then how we could put capabilities in place to improve automation and reduce technical debt.

00:23:57

So what are some key learnings that you can take away from our experience that can help you in your journey? First, of course, it's people first. Get to know your people, what motivates them, and how they like to be rewarded. There is no one size that fits all when it comes to motivating your teams and providing leadership. Building a culture that is centered on trust and shared accountability takes a lot of time and energy, but it's very well worth it. So invest in your people and center your transformation on a people-first approach. Second, be flexible and learn from mistakes. When you do something new, there is a possibility that you might fail, especially initially, but that's okay. Set a clear vision and goal to help you stay focused on the outcomes, but remain flexible on how you will get to your desired outcomes.

00:24:59

Next, focus on metrics that matter. It's important to measure what matters to ensure that you're making progress. It's also important to provide visibility into these metrics to your entire team and your leadership. When you face obstacles, use the objectives and key results as a guiding light to keep the focus on where you need to go. Leverage data to make decisions whenever possible. Next, be customer obsessed. Keep your focus on the customers of your product and focus relentlessly on delivering value for them. Find a way to get periodic customer feedback through surveys and other feedback channels, and make this a regular practice for your product. Finally, evolve and grow. You're never done. Any change or transformation is a journey, not a destination. So allow yourselves and your teams to fail. As much as we've had a successful transformation to a DevOps way of working over the last couple of years, we are not done. We constantly have a newer set of challenges pop up, but it's about looking at these challenges through the lens of continuous improvement to see how we can iterate, evolve, and grow through that process. So those are some of our key learnings from going through this DevOps transformation. Hopefully you learned something from our experience that can help you in your environments as well.

00:26:36

Feel free to reach out to Bryan or me through the emails that we shared. We would love to hear from you if you have any feedback or if you want to ask us any questions. Thank you.