Pragmatic DevOps

"Universal law is for lackeys. Context is for kings." The best practices are the ones that make sense for you. Embracing DevOps culture is not a binary decision. This thinking often leads to inaction. Should we use distributed tracing? Should there be 100% unit test coverage? Should we use Infrastructure-As-Code? We should when the right time comes.


In this session, we explore which DevOps practices make sense for companies at different stages and in different industries, DevOps maturity models, and the role of platform engineering in all of this.


Vaidik will present the story of DevOps maturity over time at Blinkit (formerly Grofers), as well as the internal microservices maturity model they deployed to help teams prioritize areas of focus and improve DevOps practices across teams.


Vaidik Kapoor

Technology Consultant

Transcript

00:00:13

Hi, everyone. I hope you're enjoying DevOps Enterprise Summit. My name is Vaidik Kapoor. I'm a technology consultant. Before this, I spent six years at Grofers, leading DevOps, platform engineering and security. I love sharing stories and experiences with an emphasis on the why of many things that happen in those stories. When Grofers started, we were far from perfect. A lot of things went wrong in how we set up our technology teams, and thankfully we managed to course correct in the following years. Today, I want to talk about why many things kept going in the wrong direction for us, how we managed to course correct, and what lessons we learned from that journey. Finally, we will discuss the much disputed topic of maturity models and how we use them at Grofers. Along the way, let's engage on Slack, and later in other forums. Let's start with our story. Grofers started in around 2013 as a hyperlocal grocery marketplace, delivering orders in 90 minutes. You could go on our app, find the nearest grocery store, place an order, and we would deliver.

00:01:15

We started with a simple architecture: three services, or applications if you will. One was the backend for our consumer-facing applications, one for catalog management, and one for everything to do with order fulfilment, that is order tracking, support, et cetera. This was simple and good enough for us initially to build quickly and roll out new features. The business was also simple enough, and everything was on AWS from day one. As the business grew and we gained more traction, we solved more problems, and especially as more people joined our tech team, it became hard for us to work across these applications. There were a lot of problems being worked on in parallel, so more people working on the same codebases at the same time usually meant overstepping and handshakes at some level. Bandwidth seemed like the biggest bottleneck, so we acquired a company to double up our engineering strength.

00:02:06

The timing was such that we needed to move really fast. We just couldn't take a pause and reflect on how we were going to collaborate on these codebases, or build the tooling, practices and developer experience that would allow all teams to move fast with this setup. There are companies that have managed to work successfully with monoliths with far more developers than us, but I think we were a much younger team, not mature enough to do that, especially with the growth pressure we had. Being able to divide and parallelize seemed much simpler than figuring out how to make the monolith work for us. So adopting a microservices architecture seemed like the next best step. We started breaking applications into microservices to enable teams to work on problems independently. Every time we saw a new problem that could be worked on independently by a team within a domain, without dealing with the complexity and chaos of our existing codebase, we would spin it off into a microservice. Teams starting new microservices would choose their own stack to attack the problems at hand, to make our teams truly autonomous.

00:03:11

In early 2016, we felt it was important that we give our teams ownership of systems end to end. Every time someone needs to run a tech experiment, if they get blocked on something because they don't have access to it, we are not moving fast enough. So we pushed the idea of developer and team autonomy as far as we could, adopted the "you build it, you run it" philosophy, and enabled our teams to make their own technical decisions across the entire stack, including infrastructure and operations like configuration management, scalability, resilience, and even handling incidents. The first bottleneck was provisioning infrastructure resources. We were on AWS, but not really leveraging it. Requests to launch and configure EC2 instances and set up databases would come directly to the extremely under-resourced infrastructure team, and with the product engineering group growing rapidly, it was impossible for us to keep up with the incoming requests.

00:04:05

So we decided to get out of the way of developers as quickly as possible. We built tools that allowed us to automate these end-to-end flows of provisioning infrastructure safely and without any intervention from the DevOps teams. It really opened up possibilities for our teams to quickly try out things in test environments and put them in production. There was pressure to move fast, and a big artificial bottleneck was out of the way. DevOps teams were responsible for governance, for providing processes and tools for developers to really own the entire application lifecycle. One big responsibility for the DevOps teams was to coach developers when they did not have the right experience, and to help them architect for the cloud. For example, we felt that configuration management as a practice was important for us; it would help manage changes better.
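The talk doesn't show the internal provisioning tooling, but a minimal sketch of what a self-service flow like this could look like on AWS is shown below (using boto3; this is not Grofers' actual tool, and the guardrail values and tag names are hypothetical):

```python
# Minimal sketch of a self-service provisioning flow: a developer requests an
# instance, the tool enforces guardrails (allowed sizes, mandatory ownership
# tags) and provisions it without the infrastructure team in the loop.
# Assumes boto3 and AWS credentials are available; policy values are illustrative.
import boto3

ALLOWED_INSTANCE_TYPES = {"t3.micro", "t3.small", "t3.medium"}  # hypothetical policy

def provision_instance(team: str, service: str, instance_type: str, ami_id: str) -> str:
    if instance_type not in ALLOWED_INSTANCE_TYPES:
        raise ValueError(f"{instance_type} is not allowed by the provisioning policy")

    ec2 = boto3.client("ec2")
    response = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [
                {"Key": "team", "Value": team},        # ownership is always recorded
                {"Key": "service", "Value": service},
            ],
        }],
    )
    return response["Instances"][0]["InstanceId"]
```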

00:04:54

It would help with CI/CD, and it would provide enough automation that we could implement auto scaling for services easily. To be able to scale configuration management, we coached all the developers on how to use Ansible for their applications: for configuration management, CI/CD integration and auto scaling. We got to a stage where pretty much every developer could work with Ansible, to the extent that almost every application was managed using Ansible. So this was a great place to be. We were proud that we were able to build the right behaviors by coaching and working with developers very closely, and all this worked really well, or at least that's what we believed was happening. Then, in early 2018, we realized that we had an illusion of agility: teams were working independently on their microservices, deploying multiple times a day, but there were not enough guardrails for quality.

00:05:47

A lot of deployments would lead to bugs and incidents in production. We were creating waste. We were shipping poor quality products that affected customers, internal users and management. Our engineers were burning out, as they were busy firefighting rather than shipping value to customers. Systems had become so complex that technical debt was getting worse. Writing code was a terrible experience for most teams. We used to think that solving for just autonomy by creating boundaries and saying "you build it, you run it" is enough, and our teams will own the quality of what they ship. To an extent it happened: our teams did what they felt was right and was within their control and boundaries, with the best of intentions. But they did not have a systemic view of what was going on, and as a leadership team, we failed to provide oversight over our entire architecture.

00:06:40

We ended up with serious problems that changed the course of how we worked at Grofers. We had a proliferation of microservices: because of too much freedom and absolutely no guardrails, teams could create new microservices as they saw fit, but we were continuously making our systems more complex. Microservices eventually became hard to develop, test, release and monitor in production. In many cases, the boundaries between teams were not clear enough, leading to handoffs, slow releases and a complete lack of ownership. Our quality feedback loops became extremely poor, so poor that we were mostly getting to know about bugs from customers, customer support, and often directly from the CEO. We also ended up with an extremely diverse tech stack. This slide doesn't paint the entire picture, because honestly it is not easy to put together a list of everything. Since technical decisions were localized and democratized, we ended up with a pretty diverse tech stack.

00:07:36

You name it, we had it. We had several tools that fulfilled the same purpose. This unnecessary diversity stopped us from achieving economies of scale because of a lack of standards, a lack of common tooling, and most importantly, a lack of mastery over anything. Every tech stack required a unique way of thinking about continuous delivery, which made our journey a lot more painful and expensive. The worst part was that it took us a lot of time to figure out what was really happening. When we realized that quality was an issue for us, we immediately created an organizational focus on improving quality. The entire technology leadership was driving quality as an agenda, and teams were excited about improving quality. Writing tests was largely believed to be the tech debt that we must pay off. We were using OKRs at the time for setting goals, so we would have OKRs like "improve test coverage", with key results like "80% coverage in all services", and nothing would actually get done. We were like, all right, maybe this is our first step; let's try once again and be more realistic. So we took another attempt, and with more realistic goals, reduced the scope of services. We made almost meaningless progress. Maybe two teams made some progress, which we celebrated, but that was also not the state we wanted to be in. With all the organizational support and alignment, we couldn't make meaningful progress, and it was quite demotivating for all of our teams, because they wanted to improve things but were not able to do it.
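The talk doesn't describe the tooling behind these key results, but as a minimal sketch, a Python team enforcing an 80% coverage target in CI with pytest and coverage.py (the package name "myservice" and the path "tests/" are hypothetical) might run something like this:

```python
# Minimal sketch: run the test suite under coverage and fail the build if the
# OKR-style coverage target is missed. Package and path names are hypothetical.
import sys
import coverage
import pytest

TARGET = 80.0  # the coverage key result discussed in the talk

cov = coverage.Coverage(source=["myservice"])
cov.start()
exit_code = pytest.main(["tests/"])   # run the tests programmatically
cov.stop()
cov.save()

total = cov.report()                  # prints a report and returns total coverage %
if exit_code != 0 or total < TARGET:
    print(f"Coverage {total:.1f}% is below the {TARGET}% target")
    sys.exit(1)
```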

00:09:06

We felt we had taken on a big goal that was impossible to achieve in the timeframe. So we decided to take another quarter with realistic goals and run another experiment. We reduced the coverage target and focused only on a few critical services, but we made one more change: we allowed teams to pick up their own localized problems. We had a very interesting observation: we saw some teams make progress on some fronts. Some teams worked on improving documentation, because onboarding new engineers was a problem for them, so READMEs got better. Integration across services and sharing of contracts was getting hard, so API docs were written using Swagger. Recurring issues in production were hard to debug, so we improved logs and monitoring. But testing did not get better at all. For some reason, we were not able to make progress on testing while we were able to improve on other fronts.

00:10:02

The next quarter, we were even able to attack new problems, like load testing and fixing architectural issues, to support our largest ever online sale, where we expected three times our regular traffic. We made things happen, and the sale was extremely successful. So this was the story I wanted to tell with as much detail as I could in the little time we have, but of course there's a lot more to it. What I really want to share next is some of the lessons we learned in this journey. The first lesson is that there's no such thing as a best practice that you must follow; best practices should probably be called recommended practices. We almost never achieved any of the goals where we wanted to fix a practice across all the services. Anything like "let's get test coverage to 80%" or "let's define SLOs for all the services" was never really achieved.

00:11:01

In retrospect, we didn't get anything done because we didn't need to follow all of those practices at the time, in all the places; the value was not clear and the effort was not worth it. For example, we started managing our infrastructure as code many years back, but it was not always or necessarily done for everything we did. It was usually the parts that were fast moving, frequently changing or expected to change frequently in the future, too critical to manage manually, or that had to be democratized. And that was good enough. Another story was with config management. Before Grofers, I was coming from the world of Puppet, and I left Puppet behind even though Puppet was, in my opinion, probably the better technology for configuration management. When I introduced it, our teams really struggled to pick it up quickly. Our reality back then pushed us to look at something that was simpler for our teams to understand, could get adopted quickly, and was extensible for most people.

00:11:58

And Ansible was the better choice. DevOps practices that have a clear plan for adoption get adopted faster, especially when the plan is attached to outcomes. Case in point: the time when our teams decided to improve documentation. If you don't have a culture of documentation, you have to be careful about how you introduce it and change the culture, and about what problems you are trying to solve with it. We went from saying "we need to improve documentation everywhere" to "we need to improve documentation to help onboard new engineers faster". It was a specific, clear problem. Our teams felt that without minimal documentation, onboarding new engineers was becoming a problem. It was affecting the teams directly, and the outcomes and the associated tasks were clear enough: every service should have a README with a brief description, a clear and well tested set of instructions, recommended tooling for development, and clearly defined owners.
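The talk doesn't show how that README expectation was verified, but a minimal sketch of a check that a service repository's README covers those points could look like the following (the required section names are hypothetical, not Grofers' actual template):

```python
# Minimal sketch: verify that a service repository's README covers the agreed
# onboarding outcomes (description, setup instructions, tooling, owners).
# The section headings are hypothetical; adapt them to your own template.
import sys
from pathlib import Path

REQUIRED_SECTIONS = ["Description", "Getting started", "Development tooling", "Owners"]

def check_readme(repo_path: str) -> list[str]:
    """Return a list of missing items (an empty list means the README passes)."""
    readme = Path(repo_path) / "README.md"
    if not readme.exists():
        return ["README.md is missing"]
    text = readme.read_text(encoding="utf-8").lower()
    return [section for section in REQUIRED_SECTIONS if section.lower() not in text]

if __name__ == "__main__":
    missing = check_readme(sys.argv[1] if len(sys.argv) > 1 else ".")
    if missing:
        print("README check failed:", ", ".join(missing))
        sys.exit(1)
    print("README check passed")
```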

00:12:51

And so it got done without a lot of stress; we made good progress. The other extreme of this was testing. There were several holes in our plan to get better at testing. One big reason we were not able to progress on testing was that most engineers on our teams didn't know which tests were valuable enough; unit versus functional testing was a constant debate. And the people who were driving testing as an initiative took it for granted that everybody would understand, as if it were not a complex topic to understand. Another big challenge was that getting better at testing was a complex problem, deeply rooted in the problems of our microservices architecture, which required a completely different strategy. We figured this out after constantly retrospecting over many failed attempts; I spoke about some of these challenges at DevOps Enterprise Summit last year. We found ourselves prioritizing instead of blindly following all the practices across all the services; the cost of paying off technical debt all at once was very high.

00:13:54

But whenever something felt like it was getting in the way of delivering value, or was a big risk, there was usually someone on our team pushing hard for solving it, and then those problems would get solved. Phrases like "critical services" became common in our conversations, and that meant something: failures pushed us towards adopting practices in critical services instead of all services. And even if we wanted to make changes in all services together, without a clear execution strategy nothing would ever get done to an accepted level. Having some prioritization framework helps convey the urgency and helps you make progress, and progress is far more important than being perfect. Every team, and by extension the services and the part of the business they own, could be dealing with different problems and might have different needs. The solutions to those problems need to be looked at differently as well, and the prioritization of which problems to solve can be different.

00:14:52

I've often seen teams get stuck on objectives like standardization. While standardization is a great idea, standards and systems can also come in the way of moving fast or delivering what is most important. To what level you should standardize depends on the economies of scale you want to achieve, not on doing things the same way just because that's how it should be; often there could be something better to do. For example, our consumer-facing systems had scale-related challenges, whereas our supply chain systems had challenges of correctness and reliability. Every time we decided on an organization-wide technology investment that was not a real priority for everyone, like adopting SLOs, we would make progress where it was a priority, but other teams might not be able to keep up. In this case, our consumer teams were able to implement SLOs much more quickly than our supply chain teams.
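SLOs are only mentioned in passing here; for readers unfamiliar with them, a minimal sketch of an availability SLO and its error budget is shown below (the target and request counts are illustrative, not figures from the talk):

```python
# Minimal sketch of an availability SLO and error-budget check.
# The target and request counts are illustrative, not figures from the talk.
SLO_TARGET = 0.999            # 99.9% of requests should succeed over the window

def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget left for this window (negative means SLO breached)."""
    allowed_failures = (1 - SLO_TARGET) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1 - (failed_requests / allowed_failures)

# Example: 2,000,000 requests this month with 1,200 failures gives a budget of
# 2,000 allowed failures, so 40% of the error budget remains.
print(f"{error_budget_remaining(2_000_000, 1_200):.0%} of the error budget remains")
```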

00:15:52

So reflecting on our journey got us to learn some of the places where we were going wrong. We had to figure out where we go from here and how we internalize these learnings in our execution going forward. Unfortunately, we couldn't think of an easy way, so we started reading; we felt that we needed to learn. This is the point where we got introduced to the concept of DevOps maturity, mostly by reading a bunch of really nice books that this community already knows about. Here's the first maturity model someone in my team shared on a Slack channel: the continuous delivery maturity model from the book Continuous Delivery by Dave Farley and Jez Humble. In this, we found a way to articulate what we had learned: DevOps practices do not get adopted on day one.

00:16:36

You move towards the vision, and there are intermediate steps. This framework lays out the importance of different aspects of continuous delivery to turn the concept into execution. Each of the rows is an area that is important for practicing CD, and the columns, from basic to expert, are levels of maturity. You start from the left, and the expectation is that you move towards the right on each of the rows, hence maturing in the practice. An important call-out in this framework is the first row, culture: maturing in engineering practices is not just about maturing in how you use tools and technologies, but also in your ways of working. With a framework like this, you can clearly define those intermediate steps and also use them as internal or external benchmarks. This was a good direction and it made sense, but we couldn't really take it to our teams and expect them to use it, because it was too high level and not prescriptive enough about practices in a specific context; solutions were missing. And it is aspirational, in the sense that following engineering practices can become a goal in itself rather than delivering business value. So the question we were asking was: how do you operationalize a maturity model? How do you go from "hey, we wish to be an elite team" to a plan and a system that pushes you to get better every day?

00:17:55

Here's probably one sixth of the maturity model we developed at Grofers. I'm sorry I couldn't share the entirety of it, because the document is quite large, but it is inspired by these maturity models and incorporates some of our learnings. We call this the microservices maturity model. The idea was to look at all the practices involved in building systems, instead of just one practice like continuous delivery. From a distance it seems similar to the one we just saw, but there are quite a few differences here. Let's look at what we have here first. On the left side, we have pillars in the first column. In the second column from the left, we have areas within these pillars. This way it is not as high level, and it gets into a little more detail about what kind of practices we want to see being followed in the organization.

00:18:54

Then, from the third column onwards, we have levels of maturity, level one through level four, level four being the most mature state. Structurally it is very similar to the previous maturity model. The categories in the first column, the pillars, are sort of macro engineering practices, while the areas within these pillars are more specific practices. We have these five pillars, and they each have several areas of practice under them. Again, this is not as high level as the previous maturity model and adds a little more detail. These are things you can borrow from other maturity models, like we did. But the key thing to understand here is that what you decide to put in your model has to be important for your business, instead of focusing on everything; remember, it's a journey, and progress matters, not perfection.
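The slide itself isn't reproduced here, but as a rough sketch of the structure being described (pillars containing areas, each area defining per-level expectations alongside a supported-at-Grofers column), the model could be represented as data like this; the pillar and area names are illustrative, not the full Grofers document:

```python
# Rough sketch of the microservices maturity model's shape as data: pillars
# contain areas; each area defines, per level, an expectation and the
# platform-supported way to meet it. Entries here are illustrative only.
from dataclasses import dataclass

@dataclass
class LevelDefinition:
    level: int
    expectation: str          # what a service at this level is expected to do
    supported_by: str         # the recommended / platform-supported way to do it

@dataclass
class Area:
    name: str
    levels: list[LevelDefinition]

@dataclass
class Pillar:
    name: str
    areas: list[Area]

model = [
    Pillar(
        name="Observability",
        areas=[
            Area(
                name="Synthetic monitoring",
                levels=[
                    LevelDefinition(
                        level=2,
                        expectation="Synthetic monitoring used in production with alerting",
                        supported_by="P1 smoke suite runnable in all environments, scheduled via Jenkins",
                    ),
                ],
            ),
        ],
    ),
]
```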

00:19:51

So depending upon your business, industry and journey, you can craft your maturity model to focus on the practices that are important for you today and the ones that are infinite games you must start playing. Maybe you're an e-commerce business like Grofers, where things like the ability to release fast and run many experiments in parallel without breaking the customer experience are important, so you create a focus on agility, reliability, experimentation, quality and resilience. Maybe you are a fintech business, where things like correctness, transactional guarantees, security and compliance matter a lot more. Maybe you are a B2B business, then perhaps reliability along with compliance is more important. Maybe cost is very important for you, so you create a systemic focus on that.

00:20:39

From the third column onwards we have levels, just like in the continuous delivery maturity model, but a difference here is that these columns have two sub-columns in them: one is called "expectation" and the other is called "supported at Grofers". I will come to what expectation means later, but for now, let's just read it like we did the previous maturity model. For example, synthetic monitoring at level two says the expectation is that synthetic monitoring is used in production with alerting, and the "supported at Grofers" column specifies a recommended way to meet that expectation. In this case, we suggest that services must implement a well defined smoke suite with P1 test cases that can run in all the environments and can be run periodically in production using Jenkins. So we don't just set the expectation, but also describe how those expectations can be met. That's what helps make a maturity model more prescriptive than open ended. When teams look at this, they know where they have to go and how they can get there in their own context.
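The smoke suite itself isn't shown in the talk; as a minimal sketch, a P1 smoke test that can run against any environment might look like the following (the base URL and endpoints are hypothetical; a scheduler such as a Jenkins job would run it periodically against production and alert on failures):

```python
# Minimal sketch of a P1 smoke test that can run in any environment by pointing
# SERVICE_BASE_URL at that environment. Endpoints are hypothetical; a Jenkins
# job (or any scheduler) would run this periodically in production and alert
# when it fails.
import os
import requests

BASE_URL = os.environ.get("SERVICE_BASE_URL", "https://staging.example.internal")

def test_health_endpoint():
    resp = requests.get(f"{BASE_URL}/health", timeout=5)
    assert resp.status_code == 200

def test_catalog_search_returns_results():
    resp = requests.get(f"{BASE_URL}/v1/catalog/search", params={"q": "milk"}, timeout=5)
    assert resp.status_code == 200
    assert len(resp.json().get("products", [])) > 0
```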

00:21:48

One of the key differences in our approach, which comes from our learning, is that the maturity model is not aspirational; it's actually risk driven. We don't try to make our services and teams more mature just because they should become more mature. It's not like a career growth path. We get better because our business needs us to get better, and this is where we factor in different kinds of risks in our assessment, and this is what becomes a business requirement. So the levels in the columns are not levels that you try to get to; the levels are pre-decided for every service, because they come from the criticality of that service in our architecture. This was one of the key learnings we implemented: we are not going to get better because we should get better, we will get better because we need to get better in certain areas.

00:22:33

And this is why we call the first sub-column on the previous slide "expectations": a service at a given level is expected to follow certain practices. For example, we have an area called service resilience, under which a level three service is expected to have circuit breakers implemented to avoid cascading failures, while a level four service must practice chaos engineering to continuously validate that failures don't lead to cascading failures. The levels are pre-calculated on multiple parameters, like frequency of code changes, number of active collaborators, whether the service is in the critical path to serving the customer, and other parameters like that. We try to calculate the risk mostly automatically and centrally, using common logic, and assign a level to every service, so that teams understand where their microservices stand level-wise and see where they fall in the DevOps journey.
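The actual scoring logic isn't described beyond naming these parameters; a rough sketch of centrally assigning a level from them might look like the following (the weights and thresholds are invented for illustration, not Grofers' formula):

```python
# Rough sketch of risk-driven level assignment from the parameters named in the
# talk: change frequency, number of active collaborators, and whether the
# service is on the critical path to the customer. Weights/thresholds invented.
from dataclasses import dataclass

@dataclass
class ServiceRisk:
    name: str
    deploys_per_week: int
    active_collaborators: int
    on_critical_path: bool

def assign_level(svc: ServiceRisk) -> int:
    """Return the expected maturity level (1-4); a higher level means more is expected."""
    score = 0
    score += min(svc.deploys_per_week, 10)       # fast-changing services carry more risk
    score += min(svc.active_collaborators, 10)   # more contributors, more coordination risk
    if svc.on_critical_path:
        score += 15                               # customer-facing critical path dominates
    if score >= 25:
        return 4
    if score >= 15:
        return 3
    if score >= 8:
        return 2
    return 1

checkout = ServiceRisk("checkout", deploys_per_week=12, active_collaborators=8, on_critical_path=True)
print(assign_level(checkout))  # -> 4
```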

00:23:31

Once the levels are assigned, teams can self-assess and set their journey to get to the expected level, as guided by the maturity model. This started to make a lot of sense. It tied in really well with a structure where microservices are owned by teams. After teams did a few self-assessments for their microservices, it started to become clear to them which areas they needed to focus on, depending upon the nature of the service. Right after, we had the first quarter where most teams organically arrived at the most relevant goals that matched their reality, with minimal handholding.

00:24:11

And again, just prescribing is not enough. Teams today have to deal with so many decisions and so many different kinds of tools and technologies; there is so much cognitive load that expecting everyone to make the best decision for everything they have to do is unjust. This is where platform thinking comes in: clearly defining how we can help teams adopt various DevOps practices without expecting them to spend a lot of time making decisions, and reducing the cost of transformation by achieving economies of scale. All the things you see highlighted in red and yellow in the "supported at Grofers" column are possible solutions that the platform teams came up with that could potentially help teams adopt practices easily, but these solutions are not productionized at Grofers yet. This way, platform teams got a clear roadmap of things they needed to build.

00:25:00

And of course, we made sure that we are not tied to the solutions, so this is more like an outcomes roadmap for our platform teams. It also stopped a lot of the debate of "we should do this" or "we should do that"; we now had a framework to accept or reject ideas and focus on platform execution. I feel that is extremely important for platform teams, because the impact of their work is usually not clearly visible, sometimes even to themselves, and an outcomes-driven framework like this can help keep the platform teams aligned with the product engineering teams and the business. The idea of maturity models has been debated before, in terms of their utility and effectiveness. So the doubt comes naturally, right: do maturity models really work? In 2017, Dr. Nicole Forsgren presented research at DevOps Enterprise Summit where she said that maturity models don't work because they go out of date too fast.

00:25:52

Technology is changing really quickly, and the landscape changes really quickly. While I don't disagree with the point about technology moving too fast these days, doesn't everything that we do today get outdated quickly as well? Isn't that true of all the technologies we use, with or without maturity models, of our ways of working, of organization policies? Are maturity models effective in the way we deployed them at Grofers? I think only time will tell, but we committed to doing this and also committed to revising the maturity model itself over time. And because platform teams derive their goals out of this model, the relevance of everything the model says has been reviewed and questioned several times since we released the first version. What we also noticed is that practices that have stood the test of time don't really change; the technology supporting the practices changes, and that's fine, because we deal with that kind of change anyway. We approached a maturity model as a solution to help us scale engineering management with a team that was young and lacked the experience it needed to build systems at scale. Your reasons could be different; you'll have to see for yourself if this works for you. And that's it, folks. I hope you enjoyed the session as much as I did presenting it. I would love to take questions on Slack or enjoy a virtual coffee to dive deeper on the topic.