Pragmatic DevOps

"Universal law is for lackeys. Context is for kings." The best practices are the ones that make sense for you. Embracing DevOps culture is not a binary decision, and treating it as one often leads to inaction. Should we use distributed tracing? Should there be 100% unit test coverage? Should we use Infrastructure as Code? We should, when the right time comes.


In this session, we will try to explore what DevOps practices make sense for companies at different stages operating in different industries, DevOps maturity models, and the role of platform engineering in all of this.


Vaidik will present the story of DevOps maturity over time at Blinkit (formerly Grofers) as well as their internal microservices maturity model deployed to help teams prioritize areas of focus that help improve DevOps practices across teams.

Vaidik Kapoor

Technology Consultant,

Transcript

00:00:00

<silence>

00:00:13

Hi everyone. I hope you're enjoying DevOps Enterprise Summit. My name is Vaidik Kapoor. I'm a technology consultant. Before this, I spent six years at Grofers leading DevOps, platform engineering, and security. I love sharing stories and experiences with an emphasis on the why of many things that happened in those stories. When Grofers started, we were far from perfect. A lot of things went wrong in how we set up our technology teams, and thankfully, we managed to course correct in the following years. Today, I want to talk about how and why many things kept going in the wrong direction for us, and how we managed to course correct. What lessons of pragmatism did we learn out of that journey? And finally, we will discuss the much disputed topic of maturity models and how we used them at Grofers. Hopefully, along the way, we will engage in conversations on Slack and later in other forums during the conference.

00:01:01

Let me start with our story. Grofers started in 2013 as a hyper-local grocery marketplace delivering orders in 90 minutes. You could go on our app, find the nearest grocery store, place an order, and we would deliver to your doorstep. We started with a simple architecture: three services, or applications, if you will. One was the backend for our consumer-facing applications, one was for catalog management, and one was for everything to do with order fulfillment, that is, order tracking, support, et cetera. This was simple and good enough. It worked fine for us initially to build quickly and roll out new features. The business was also simple enough. Everything was on AWS from day one. As the business grew, as we gained more traction and solved more problems, and especially as more people joined our tech team, it became hard for us to work across these applications.

00:01:50

There were a lot of problems being worked on in parallel, so more people working on the same codebases at the same time usually meant overstepping and involved handshakes at some level. Bandwidth seemed like the biggest bottleneck, so we acquired a company to double our engineering strength. The timing was such that we needed to move really fast. We just couldn't take a pause and reflect on how we were going to collaborate on these codebases. We couldn't build the tooling, practices, and developer experience that would allow all teams to move fast with this setup. While there are companies that have managed to successfully work with monolithic codebases with far more developers than Grofers, I think we were a much younger team, not mature enough to do that, especially with the growth pressure we had. Being able to divide and parallelize seemed much simpler than figuring out how to make monoliths work for us.

00:02:44

So adopting a microservices architecture seemed like the next best step. We started breaking applications into microservices to enable teams to work on problems independently. Every time we could see a new problem that could be worked on independently by a team in a domain, without dealing with the complexity and chaos of our existing codebase, we would spin it off into a microservice. Teams starting new microservices would choose their own stack to attack the problems at hand. To make our teams truly autonomous, in early 2016 we felt it was important that we give our teams ownership of systems end to end. Every time someone needs to run a tech experiment, if they get blocked on something because they don't have access to it, we are not moving fast enough. So we pushed the idea of developer and team autonomy as far as we could, adopted the "you build it, you run it" philosophy, and enabled our teams to make their own technical decisions across the entire stack, including infrastructure and operations like configuration management, scalability, resilience, and even handling incidents.

00:03:43

The first bottleneck was provisioning infrastructure resources. We were on AWS but not really leveraging it. Requests to launch and configure EC2 instances and set up databases would come directly to the extremely under-resourced infrastructure team. With the product engineering group growing rapidly, it was impossible for us to keep up with the incoming requests. So we decided to get out of the way of developers as quickly as possible. We built tools that allowed us to automate these end-to-end flows of provisioning infrastructure safely and without any intervention from the DevOps teams. It really opened up possibilities for our teams to quickly try out things in test environments and put them in production. There was pressure to move fast, and a big artificial bottleneck was out of the way. DevOps teams were responsible for governance and for providing processes and tools for developers to own the entire application lifecycle.

00:04:38

And one big responsibility for the DevOps teams was to coach developers when they did not have the right experience, to help them architect for the cloud. For example, we felt that configuration management as a practice was important for us. It would help manage changes better. It would help with CI/CD, and it would provide enough automation that we could implement auto scaling for services easily. To be able to scale config management well, we coached all the developers on how to use Ansible for their applications for configuration management, continuous integration, and auto scaling. It got to a stage where pretty much every developer could work with Ansible, to the extent that almost every application was managed using Ansible. So this was a great place to be.

00:05:23

We were proud that we were able to build the right behaviors by coaching and working with developers very closely. And all this worked really well, or at least that's what we believed was happening. In early 2018, we realized that we had an illusion of agility. Teams were working independently on their microservices, deploying multiple times a day, but there were not enough guardrails for quality. So a lot of deployments would lead to bugs and incidents in production. We were creating waste. We were shipping poor quality products that were frustrating customers, internal users, and management. Our engineers were burning out as they were busy firefighting rather than shipping value to customers. Systems had become so complex that technical debt was getting worse. Writing code was a terrible experience for most teams. We used to think that solving for just autonomy by creating boundaries, saying "you build it, you run it", is enough.

00:06:19

And our teams would own the quality of what they ship. To an extent, it happened. Our teams did what they felt was right and was within their control and boundaries, with the best of intentions, but they did not have a systemic view of what was going on. And as a leadership team, we failed to provide oversight over our entire architecture. We ended up with serious problems that changed the course of how we worked at Grofers. We had a proliferation of microservices because of too much freedom and absolutely no guardrails. Teams could create new microservices as they saw fit, but we were continuously making our systems more complex. Microservices eventually became hard to develop, test, release, and monitor in production. In many cases, the boundaries between teams were not clear enough, leading to handoffs, slow releases, and a complete lack of ownership.

00:07:10

Our quality feedback loops became extremely poor, so poor that we were mostly getting to know about bugs from customers, customer support, and often directly from the CEO. We also ended up with an extremely diverse tech stack. This slide doesn't paint the entire picture because it is honestly not easy to list out everything. Since technical decisions were localized and democratized, we ended up with a pretty diverse tech stack. You name it, we had it. We had several tools that fulfilled the same purpose. This unnecessary diversity stopped us from achieving economies of scale because of a lack of standards and common tooling and, most importantly, a lack of mastery over anything. Every tech stack required a unique way of thinking about continuous delivery, which made our journey a lot more painful and expensive. The worst was that it took us a lot of time to figure out what really happened.

00:08:03

When we realized that quality was an issue for us, we immediately created an organization-wide focus on improving quality. The entire technology leadership was driving quality as an agenda. Teams were excited about improving quality. Writing tests was largely believed to be the tech debt that we must pay off. We were using OKRs at the time for setting goals, so we would have OKRs like "improve test coverage to something like 80% in all services", and nothing would actually get done. We were like, all right, maybe this is our first step. Let's try once again and be more realistic. So we took another attempt with more realistic goals and a reduced scope of services. We made almost meaningless progress. Maybe two teams made some progress, which we celebrated, but that was also not the state we wanted to be in. With all the organizational support and alignment, we couldn't make meaningful progress, and it was quite demotivating for all of our teams because they wanted to improve things but were not able to do it.

00:09:06

We felt we had taken on a big goal that was impossible to achieve in the timeframe. So we decided to take another quarter with realistic goals and run another experiment. We reduced the coverage target and focused only on a few critical services, but we made one more change: we allowed teams to pick up their localized problems. We had a very interesting observation. We saw some teams make progress on some fronts. Some teams worked on improving documentation. Onboarding new engineers was a problem for them, so READMEs got better. Integration across services and sharing of contracts was getting hard, so some API docs were written using Swagger. Recurring issues in production were hard to debug, so we improved logs and monitoring. But testing did not get better at all. For some reason, we were not able to make progress on testing while we were able to improve on other fronts.

00:10:02

The next quarter, we were even able to attack new problems like load testing and fixing architectural issues to support our largest ever online sale, where we expected 3x our regular traffic. We made things happen, and the sale was extremely successful. So this was the story I wanted to tell, with as much detail as I could in the little time we have, but of course there's a lot more that I skipped. What I really want to share next is some of the lessons we learned in this journey. The first lesson is that there's no such thing as a best practice that you must follow. Best practices should probably be called recommended practices. We almost never achieved any of the goals where we wanted to fix a practice across all the services.

00:10:53

Anything like "let's get test coverage to 80%" or "let's define SLOs for all the services" was never really achieved. In retrospect, we didn't get anything done because we didn't need to follow all of those practices at the time in all the places. The value was not clear and the effort was not worth it. For example, we started managing our infrastructure as code many years back, but it was not necessarily done with everything we did. It was usually the parts that were fast moving, frequently changing or expected to change frequently in the future, too critical for manual error, or had to be democratized, and that was good enough. Another story was with config management. Before Grofers, I was coming from the world of Puppet. I loved Puppet for what it was, and Puppet was probably a better technology for configuration management, in my opinion.

00:11:43

When I introduced Puppet at Grofers, our teams really struggled to get started with it quickly. Our reality back then pushed us to look at something that was simpler for our teams to understand, could get adopted quickly, and was extensible for most people, and Ansible was a better choice. DevOps practices that have a clear plan for adoption get adopted faster, especially when the plan is attached to outcomes. Case in point: the time when our teams decided to improve documentation. If you don't have a culture of documentation, you have to be careful about how you introduce it and change the culture. What problems are you really trying to solve? We went from saying "we need to improve documentation everywhere" to "we need to improve documentation to help onboard new engineers faster." It was a specific, clear problem. Our teams felt that without minimal documentation, onboarding new engineers was becoming a problem.

00:12:37

It was affecting the teams directly. The outcomes and the associated tasks were clear enough. Every repo should have a README with a brief description, clear and well-tested setup instructions, the recommended tooling for development, and clearly defined owners. And so it got done without a lot of stress. We made good progress. At the other extreme of this was testing. There were several holes in our plan to get better at testing. One big reason why we were not able to progress on testing was that most engineers on our teams did not know what tests were valuable enough. Unit versus functional testing was a constant debate. And the people who were driving testing as an initiative took it for granted that everybody would understand, or that it was not a complex topic for them to understand. Another big challenge was that getting better at testing was a complex problem, deeply rooted in the problems of our microservices architecture, which required a completely different strategy.

00:13:34

We figured this out after constantly retrospecting over our many failed attempts. I spoke about some of these challenges at DevOps Enterprise Summit last year. We found ourselves prioritizing instead of blindly following all the practices across the services. The cost of paying off technical debt all together was very high. But whenever something felt like it came in the way of delivering value or was a big risk, there was usually someone on our team pushing hard for solving it, and then problems would get solved. Phrases like "critical services" became common in our conversations, and that meant something. Failures pushed us to adopt practices in critical services instead of all services. And even if we wanted to make changes in all services together, without a clear execution strategy nothing would ever get done to an acceptable level. So having some prioritization framework helps convey the urgency and helps you make progress.

00:14:33

And progress is far more important than being perfect. Every team, and by extension the services and codebases owned by them, could be dealing with different problems and might have different needs. The solutions to those problems need to be looked at differently as well, or the prioritization of problems to solve can be different. I've often seen teams get stuck in objectives like standardization. While standardization is a great idea, standards and systems can also come in the way of moving fast or delivering what is most important. The level to which you standardize should depend on the economies of scale you want to achieve, and not on doing things the same way just because that's how it should be. And often there could be something better to do than just <inaudible>. For example, our consumer-facing systems had scale-related challenges, whereas our supply chain systems had the challenges of correctness and reliability. Every time we decided on an <inaudible> technology investment that was not a real priority for everyone, like adopting SLOs, we would make progress where it was a priority, but other teams might not be able to keep up. In this case, our consumer teams were able to implement SLOs much more quickly compared to our supply chain teams.

00:15:52

So reflecting on our journey got us to learn some of the places we were going wrong. We had to figure out where we go from here and how we internalize these learnings in our execution across teams. Unfortunately, we couldn't think of an easy way. So we started monitoring. We felt that we needed to learn. And this is the point where we got introduced to the concept of DevOps maturity, mostly by reading a bunch of really nice books that this community already knows about. So here's the first maturity model someone on my team shared on a Slack channel. It's the continuous delivery maturity model from the book Continuous Delivery by Dave Farley and Jez Humble. In this, we found a way to articulate what we had learned. DevOps practices do not get adopted on day one.

00:16:36

You move towards a vision, and there are intermediate steps. This framework highlights the importance of different aspects of continuous delivery to turn the concept into execution. Each of those rows is an area that is important for practicing CD, and the columns from basic to expert are levels of maturity. So you start from the left, and the expectation is that you're moving towards the right on each of the rows, hence maturing in practice. An important callout in this framework is the first row: culture. Maturing in engineering practices is not just about maturing how you use tools and technologies, but also your ways of working. With a framework like this, you can clearly define those intermediate steps and also use them as internal or external benchmarks. This was a good direction and it made sense, but we couldn't really take this to our teams and expect them to use it, because it was too high level and not prescriptive enough about practices in a specific context. Solutions were missing, and it is aspirational in the sense that following engineering practices can become a goal in itself rather than delivering business value. So the question we were asking was: how do you operationalize a maturity model? How do you make yourself go from "hey, we wish to be an elite team" to a plan and a system that pushes you to get better every day?

00:17:55

Here's probably one sixth of the maturity model we developed at Grofers. Sorry, I couldn't share the entirety of it because the document is quite large. But yeah, it is inspired by other maturity models and it incorporates some of our learnings. We call this the microservices maturity model. The idea was to look at all the practices while building systems instead of just one practice like continuous delivery. From a distance, it seems similar to the one we just saw before, but there are quite a few differences here worth noting. But let's look at what we have here first. On the left side, we have pillars in the first column. In the second column from the left, we have areas within these pillars. This way it is not as high level and gets into a little more detail on what kind of practices we want to see being followed in the organization.

00:18:54

And then from the third column from the left onward, we have levels of maturity, level one through level four, level four being the most mature state. So structurally it is very similar to the previous maturity model. The pillars in the first column are sort of macro engineering practices, while the areas within these pillars are more specific practices, right? So we have these five pillars, and they have their several areas of practices under them. This way it is not as high level as the previous maturity model and adds a little more detail. These are things you can borrow from other maturity models, like we did. But the key thing to understand here is that what you decide to put in your model has to be important for your business.

00:19:44

Instead of focusing on everything, remember it's a journey. Progress matters, not perfection. So depending upon your business, industry, and journey, you can craft a maturity model that focuses on practices that are important for you today and the ones that are infinite games you must start playing. Maybe you're an e-commerce business like Grofers. Things like the ability to release fast and run many experiments in parallel without breaking customer experience are important. So you create a focus on agility, reliability, experimentation, quality, and resilience. Maybe you're a FinTech business. Then things like correctness, transactional guarantees, security, and compliance matter a lot more. Maybe you're a B2B SaaS business. Then maybe reliability with compliance is more important. Maybe cost is very important for you, so you create a systemic focus on that.

00:20:39

From the third column on, we have levels, just like in the continuous delivery maturity model. But a difference here is that these columns have two sub-columns in them. One is called "expectation" and the other is called "support at Grofers". I will come to what expectation means later, but for now, let's just read it like we did the previous maturity model. For example, synthetic monitoring at level two says the expectation is that synthetic monitoring is used in production with alerting, and the adjacent "support at Grofers" column specifies a recommended way to meet that expectation. In this case, we suggest that services must implement a well-defined smoke suite with P1 test cases that can run in all the environments and can be periodically run in production using Jenkins. So we don't just set the expectation, but also prescribe how those expectations can be met.
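Editor's note: the layout described here (pillars, areas within pillars, and per-level "expectation" and "support" entries) can be sketched as a small data structure. This is only an illustrative sketch in Python, not the actual Grofers document: the class names, the `lookup` helper, and the pillar name "Monitoring" are assumptions, while the synthetic monitoring text at level two paraphrases the talk.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LevelEntry:
    expectation: str  # what a service at this level is expected to do
    support: str      # recommended way to meet that expectation


@dataclass
class Area:
    name: str
    levels: dict[int, LevelEntry] = field(default_factory=dict)  # keyed by level 1..4


@dataclass
class Pillar:
    name: str
    areas: dict[str, Area] = field(default_factory=dict)  # keyed by area name


def lookup(pillar: Pillar, area_name: str, level: int) -> Optional[LevelEntry]:
    """Return the entry for an area at a level, or None if the model has none."""
    area = pillar.areas.get(area_name)
    return area.levels.get(level) if area else None


# Hypothetical pillar name; the area and its level-two entry follow the talk.
monitoring = Pillar(
    name="Monitoring",
    areas={
        "Synthetic monitoring": Area(
            name="Synthetic monitoring",
            levels={
                2: LevelEntry(
                    expectation="Synthetic monitoring is used in production with alerting",
                    support=(
                        "Smoke suite of P1 test cases that runs in all environments "
                        "and is run periodically in production using Jenkins"
                    ),
                )
            },
        )
    },
)

entry = lookup(monitoring, "Synthetic monitoring", 2)
print(entry.expectation)
print(entry.support)
```

Keeping "expectation" and "support" as separate fields mirrors the point made in the talk: the first says where a service has to get to, the second suggests one way to get there.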

00:21:36

That's what helps make a maturity model more prescriptive than open-ended. When teams look at this, they know where they have to go and how they can get there in Grofers' context. One of the key differences in our approach, which comes from my learning, is that the maturity model is not aspirational. It's actually risk driven. We don't try to make our services and teams more mature just because they should become more mature. It's not like a career growth path. We get better because our business needs us to get better. This is where we factor in different kinds of risks for our assessment, and this is what becomes a business requirement. So the levels in the columns are not levels that you try to get to. The levels are prescribed for every service, because they come from the criticality of that service in our architecture. This was one of the key learnings we implemented.

00:22:26

We are not going to get better because we should get better. We will get better because we need to get better in certain areas. And this is why we called the first sub-column on the previous slide "expectation". A service at a given level is expected to follow certain practices. For example, we have an area called service resilience, under which a level three service is expected to have circuit breakers implemented to avoid cascading failures, while a level four service must practice chaos engineering to continuously validate that failures don't lead to cascading failures. The levels are pre-calculated on multiple parameters, like frequency of code changes, number of active collaborators, whether the service is in the critical path of serving the customer, and other parameters like that. We try to mostly calculate the risk automatically and centrally, using common logic, and assign a level to every service, so that the teams understand where their microservices are level-wise and see where they fall in the DevOps journey.
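Editor's note: the talk does not give the actual formula used to pre-calculate levels, so the following Python sketch only illustrates the idea of deriving a level centrally from risk signals. The three inputs are the parameters named in the talk; the thresholds, the equal weighting, and the names `ServiceSignals` and `risk_level` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ServiceSignals:
    commits_per_month: int      # proxy for frequency of code changes
    active_collaborators: int   # people actively changing the service
    on_critical_path: bool      # is it in the path of serving the customer?


def risk_level(signals: ServiceSignals) -> int:
    """Map risk signals to a maturity level from 1 (lowest) to 4 (highest).

    Thresholds are made up for illustration; each risk signal that fires
    raises the expected level by one.
    """
    score = 0
    if signals.commits_per_month >= 20:
        score += 1
    if signals.active_collaborators >= 5:
        score += 1
    if signals.on_critical_path:
        score += 1
    return 1 + score


checkout = ServiceSignals(commits_per_month=50, active_collaborators=8, on_critical_path=True)
internal_report = ServiceSignals(commits_per_month=2, active_collaborators=1, on_critical_path=False)

print(risk_level(checkout))
print(risk_level(internal_report))
```

Running one shared function over every service is what makes the assignment central and consistent: teams do not negotiate their own level, they inherit it from how risky the service is.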

00:23:31

Once the levels are assigned, teams can self-assess and set out on the journey to get to the level of expectation as defined by the maturity model. This started to make a lot of sense. It tied in really well with a structure where microservices are owned by teams. After teams did a few self-assessments for their microservices, it started to become clear to them what areas they needed to focus on, depending upon the nature of the service. Right after, we had the first quarter where most teams organically arrived at the most relevant goals that matched their reality, with minimal handholding. And again, just prescribing is not enough. Teams today have to deal with so many decisions and so many different kinds of tools and technologies. There's so much cognitive overload. Expecting everyone to make the best decision for everything they have to do is unjustified.
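Editor's note: a self-assessment of this kind boils down to comparing what a service already does against what its assigned level expects. The Python sketch below is an assumption-laden illustration, not the tooling used at Grofers: the function name, the dictionary layout, and the rule that a level-N service should meet every expectation up to level N are all hypothetical. The expectation texts paraphrase examples from the talk.

```python
def assessment_gaps(expected_level, expectations, met):
    """List unmet expectations for a service.

    expected_level: level assigned to the service (1..4)
    expectations:   {area: {level: expectation text}}
    met:            set of (area, level) pairs the team says it already meets
    Assumes (hypothetically) that expectations are cumulative up to the level.
    """
    gaps = []
    for area in sorted(expectations):
        for level in sorted(expectations[area]):
            if level <= expected_level and (area, level) not in met:
                gaps.append((area, level, expectations[area][level]))
    return gaps


model = {
    "service resilience": {
        3: "Circuit breakers implemented to avoid cascading failures",
        4: "Chaos engineering practiced to validate failures do not cascade",
    },
    "synthetic monitoring": {
        2: "Synthetic monitoring used in production with alerting",
    },
}

# A level-three service that already has synthetic monitoring in place.
gaps = assessment_gaps(3, model, met={("synthetic monitoring", 2)})
for area, level, text in gaps:
    print(f"{area} (level {level}): {text}")
```

The output of such a check is a short, service-specific to-do list, which is what lets teams arrive at relevant goals without much handholding.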

00:24:23

This is where platform thinking comes in: clearly defining how we can help teams adopt various DevOps practices without expecting them to spend a lot of time making decisions, and reducing the cost of transformation by achieving economies of scale. All the things that you see highlighted in red and yellow in the "support at Grofers" column are possible solutions that the platform teams came up with that could potentially help the teams adopt practices easily, but these solutions are not productionized at Grofers yet. This way, platform teams got a clear roadmap of things that they needed to build. And of course, we made sure that we are not tied to the solution, so this is more like an outcome roadmap for our platform teams. It also stopped a lot of debate of "we should do this" or "we should do that".

00:25:12

We now had a framework to accept or reject ideas and focus on platform execution. I feel that is extremely important for platform teams because the impact of their work is usually not clearly visible, sometimes even to themselves. An outcome-driven framework like this can help keep the platform teams aligned with the product engineering teams and the business. The idea of maturity models has been debated before, regarding their utility and effectiveness. So the doubt arises naturally: do maturity models really work? In 2017, Dr. Nicole Forsgren presented research at DevOps Enterprise Summit where she said that maturity models don't work because they go out of date too fast, as technology and the landscape change really quickly. While I don't disagree with the point about technology moving too fast these days, doesn't everything that we do today get outdated quickly as well?

00:26:07

Isn't that true of all the technologies that we use, with or without maturity models? Of ways of working and organization policies? Are maturity models effective in the way we deployed them at Grofers? I think only time will tell. But we committed to doing this and also committed to revising the maturity model itself over time. And because platform teams derive their goals from this model, the relevance of everything the model says has been reviewed and questioned several times since we released the first version. What we also noticed is that practices that have stood the test of time don't really change. The technologies supporting the practices change, and that's fine, because we deal with that kind of change anyway. Adopting a maturity model was a solution to help us scale engineering management with a team that was young and lacked the experience it needed to build systems at scale. Your reasons could be different. You'll have to see for yourself if this works for you. And that's it, folks. I hope you enjoyed the session as much as I enjoyed presenting it. I would love to take questions on Slack or enjoy a virtual coffee or beer to dive deep on this topic with you.