Pragmatic DevOps

"Universal law is for lackeys. Context is for kings." The best practices are the ones that make sense for you. Embracing DevOps culture is not a binary decision, and treating it as one often leads to inaction. Should we use distributed tracing? Should there be 100% unit test coverage? Should we use infrastructure as code? We should, when the right time comes.


In this session, we will explore which DevOps practices make sense for companies at different stages and in different industries, look at DevOps maturity models, and discuss the role of platform engineering in all of this.


Vaidik will present the story of DevOps maturity over time at Blinkit (formerly Grofers) as well as their internal microservices maturity model deployed to help teams prioritize areas of focus that help improve DevOps practices across teams.


Vaidik Kapoor

Technology Consultant,


Full transcript

The complete talk, organized by section.

Vaidik Kapoor

Hi everyone. I hope you're enjoying DevOps Enterprise Summit. My name is Vaidik Kapoor. I'm a technology consultant. Before this, I spent six years at Grofers leading DevOps, platform engineering, and security.

I love sharing stories and experiences with an emphasis on the why of many things that happen in those stories. When Grofers started, we were far from perfect. A lot of things went wrong in how we set up our technology teams, and thankfully, we managed to course-correct in the following years.

Today, I want to talk about why many things kept going in the wrong direction for us and how we managed to course-correct. What lessons of pragmatism did we learn out of that journey? And finally, we will discuss the much-disputed topic of maturity models and how we used them at Grofers. Hopefully along the way, we will engage in conversations on Slack and later in other forums during the conference.

Let me start with our story. Grofers started in 2013 as a hyper-local grocery marketplace, delivering orders in 90 minutes. You could go on our app, find the nearest grocery store, place an order, and we would deliver it to your doorstep.

We started with a simple architecture: three services, or applications if you will. One was the back-end for our consumer-facing applications, one was for catalog management, and one was for everything to do with order fulfillment: order tracking, support, et cetera. This was simple and good enough. It worked fine for us initially to build quickly and roll out new features. The business was also simple enough. Everything was on AWS from day one.

As the business grew, as we gained more traction, solved more problems, and especially as more people joined our tech team, it became hard for us to work across these applications. Many problems were being worked on in parallel, so more people working on the same code bases at the same time usually meant stepping on each other's toes and needing handshakes at some level.

Bandwidth seemed like the biggest bottleneck, so we acquired a company to double our engineering strength. The timing was such that we needed to move really fast. We just couldn't take a pause and reflect on how we were going to collaborate on these code bases. We couldn't build the tooling, practices, and developer experience that would allow all teams to move fast with this setup.

While there are companies that have managed to successfully work with monolithic code bases with far more developers than Grofers, I think we were a much younger team, not mature enough to do that, especially with the growth pressure we had. Being able to divide and parallelize seemed much simpler than being able to figure out how to make monoliths work for us. So adopting microservices architecture seemed like the next best step.

We started breaking applications into microservices to enable teams to work on problems independently. Every time we could see a new problem that could be worked on independently by a team in a domain without dealing with the complexity and chaos of our existing code base, we would spin off a microservice. Teams starting new microservices would choose their own stack to attack the problems at hand.

To make our teams truly autonomous, in early 2016, we felt it was important that we give our teams ownership of systems end-to-end. Every time someone needed to run a tech experiment, if they got blocked on something because they didn't have the access for it, we were not moving fast enough. So we pushed the idea of developer and team autonomy as far as we could, adopted the "you build it, you run it" philosophy, and enabled our teams to make their own technical decisions and manage the entire stack, including infrastructure and operations like configuration management, scalability, resilience, and even handling incidents.

The first bottleneck was provisioning infrastructure resources. We were on AWS but not really leveraging it. Requests to launch and configure EC2 instances and set up databases would come directly to the extremely under-resourced infrastructure team. With the product engineering group growing rapidly, it was impossible for us to keep up with the incoming requests. So we decided to get out of the way of developers as quickly as possible.

We built tools that allowed us to automate these end-to-end flows of provisioning infrastructure safely and without any intervention from the DevOps teams. It really opened up possibilities for our teams to quickly try out things in test environments and put them in production. There was pressure to move fast, and a big artificial bottleneck was out of the way. DevOps teams were responsible for governance, for providing processes and tools for developers to really own the entire application lifecycle.

One big responsibility for the DevOps teams was to coach developers when they did not have the right experience to help them architect for the cloud. For example, we felt that configuration management as a practice was important for us. It would help manage changes better, it would help with CI/CD, and it would provide enough automation that we could implement auto-scaling for services easily.

To be able to scale config management well, we trained all the developers to use Ansible for their applications for configuration management, continuous integration, and auto-scaling. It got to a stage where pretty much every developer could work with Ansible, to the extent that almost every application was managed using Ansible. This was a great place to be. We were proud that we were able to build the right behaviors by coaching and working with developers very closely.

All this worked really well, or at least that's what we believed was happening. In early 2018, we realized that we had an illusion of agility. Teams were working independently on their microservices, deploying multiple times a day. But there were not enough guardrails for quality, so a lot of deployments would lead to bugs and incidents in production.

We were creating waste. We were shipping poor-quality products that were frustrating customers, internal users, and management. Our engineers were burning out as they were busy firefighting rather than shipping value to customers. Systems had become so complex that technical debt was getting worse. Writing code was a terrible experience for most teams.

We used to think that solving for just autonomy by creating boundaries and saying, "You build it, you run it," was enough, and our teams would own the quality of what they shipped. To an extent, it happened. Our teams did what they felt was right and was within their control and boundaries with the best of their intentions. But they did not have a systemic view of what was going on. As a leadership team, we failed to provide oversight over our entire architecture.

We ended up with serious problems that changed the course of how we worked at Grofers. We had a proliferation of microservices because of too much freedom and absolutely no guardrails. Teams could create new microservices as they saw fit, but we were continuously making our systems more complex. Microservices eventually became hard to develop, test, release, and monitor in production. In many cases, the boundaries between teams were not clear enough, leading to hand-offs, slow releases, and a complete lack of ownership.

Our quality feedback loops became extremely poor, so poor that we were mostly getting to know about bugs through customers, customer support, and often directly from the CEO.

We also ended up with an extremely diverse tech stack. This slide doesn't paint the entire picture because it is not easy, honestly, to list out everything. Since technical decisions were localized and democratized, we ended up with a pretty diverse tech stack. You name it, we had it. We had several tools that fulfilled the same purpose. This unnecessary diversity stopped us from achieving economies of scale because of lack of standards, common tooling, and most importantly, lack of mastery over anything. Every tech stack required a unique way of thinking about continuous delivery, which made our journey a lot more painful and expensive.

The worst part was that it took us a lot of time to figure out what really happened. When we realized that quality was an issue for us, we immediately created organizational focus to improve quality. The entire technology leadership was driving quality as an agenda. Teams were excited about improving quality. Writing tests was largely believed to be the tech debt that we must pay off.

We were using OKRs at the time for setting goals. We would have objectives like improving test coverage, with key results like 80% coverage in all the services, and nothing would actually get done. We were like, "All right, maybe this is our first attempt. Let's try once again, be more realistic." So we took another attempt, with more realistic goals, reducing the scope of services. We made almost no meaningful progress. Maybe two teams made some progress, which we celebrated, but that was also not the state we wanted to be in.

With all the organizational support and alignment, we couldn't make meaningful progress, and it was quite demotivating for all of our teams because they wanted to improve things but were not able to do it.

We felt we took a big goal that was impossible to achieve in the timeframe. So we decided to take another quarter with realistic goals and run another experiment. We reduced the coverage target and focused only on a few critical services, but we made one more change: we allowed teams to pick up their localized problems. We had a very interesting observation.

We saw some teams make progress on some fronts. Some teams worked on improving documentation. Onboarding new engineers was a problem for them, so READMEs got better. Integration across services and sharing of contracts was getting hard, so some API docs were written using Swagger. Recurring issues in production were hard to debug, so we improved on our logs and monitoring.

But testing did not get better at all. For some reason, we were not able to make progress on testing while we were able to improve on other fronts. The next quarter, we were even able to attack new problems like load testing and fixing architecture issues to support our largest ever online sale, where we expected three times our regular traffic. We made things happen, and the sale was extremely successful.

This was the story I wanted to tell with as much detail as I could in the little time we have. Of course, there's a lot more that I skipped. But what I really want to share next is some of the lessons we learned in this journey.

The first lesson is that there's no such thing as a best practice that you must follow. Best practices should probably be called recommended practices. We almost never achieved any of the goals where we wanted to fix a practice across all the services. Goals like, "Let's get test coverage to 80%," or "Let's define SLOs for all the services," were never really achieved. In retrospect, we didn't get anything done because we didn't need to follow all of those practices at the time in all the places. The value was not clear, and the effort was not worth it.

For example, we started managing our infrastructure as code many years back, but we did not necessarily apply it to everything we did. We usually applied it to the parts that were fast-moving, frequently changing or expected to change frequently in the future, too critical to risk manual error, or had to be democratized, and that was good enough.

Another story was with config management. Before Grofers, I was coming from the world of Puppet. I loved Puppet for what it was. Even though Puppet was probably a better technology for configuration management, in my opinion, when I introduced Puppet at Grofers, our teams really struggled to get started with it quickly. Our reality back then pushed us to look for something that was simpler for our teams to understand, could be adopted quickly, and was extensible for most people, and Ansible was a better choice.

DevOps practices that have a clear plan for adoption get adopted faster, especially when the plan is attached to outcomes. Case in point: the time when our teams decided to improve documentation. If you don't have a culture for documentation, you have to be careful about how you introduce it and change the culture. What problems are you trying to solve with it?

We went from saying, "We need to improve documentation everywhere," to "We need to improve documentation to help onboard new engineers faster." It was a specific, clear problem. Our teams felt that without minimal documentation, onboarding new engineers was becoming a problem. It was affecting the teams directly. The outcomes and the associated tasks were clear enough. Every repo should have a README with a brief description, clear and well-tested setup instructions, recommended tooling for development, and clearly defined owners. And so it got done without a lot of stress. We made good progress.
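A README expectation this concrete is easy to check automatically. Here is a minimal sketch of such a check, not the tooling Grofers used. It assumes READMEs are markdown files and that each of the four items appears as a heading. The section keywords and the function name `missing_readme_sections` are hypothetical.

```python
import re

# Hypothetical keywords: the talk lists what a README should contain,
# not the exact headings teams used.
REQUIRED_SECTIONS = ("description", "setup", "tooling", "owners")


def missing_readme_sections(readme_text: str) -> list[str]:
    """Return the required sections that have no matching markdown heading."""
    headings = [
        match.group(1).strip().lower()
        for match in re.finditer(r"^#{1,6}\s+(.+)$", readme_text, flags=re.MULTILINE)
    ]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(section in heading for heading in headings)
    ]


sample = "# Orders service\n\n## Description\nHandles orders.\n\n## Setup\nRun make dev.\n"
print(missing_readme_sections(sample))
```

A check like this only confirms that the sections exist. Whether the setup instructions are actually "clear and well-tested" still needs a person following them on a fresh machine.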

At the other extreme of this was testing. There were several holes in our plan to get better at testing. One big reason why we were not able to progress on testing was that most engineers on our teams didn't know what tests were valuable to the cause. Unit versus functional testing was a constant debate. The people who were driving testing as an initiative took it for granted that everybody would understand, or that it was not a complex topic for them to understand.

Another big challenge for getting better at testing was a complex problem deeply rooted in the problems of our microservices architecture, which required a completely different strategy. We figured this out after constantly retrospecting over our many failed attempts. I spoke about some of these challenges at DevOps Enterprise Summit last year.

We found ourselves prioritizing instead of blindly following all the practices across the services. The cost of paying off technical debt altogether was very high. But whatever felt like it came in the way of delivering value or was a big risk, there was usually someone on our team pushing hard to solve it, and then problems would get solved. Phrases like "critical services" became common in our conversations, and that meant something. Failures pushed us to adopt practices in critical services instead of all services.

Even if we wanted to make changes in all services together, without a clear execution strategy, nothing would ever get done to an acceptable level. Having some prioritization framework helps convey the urgency and helps you make progress. Progress is far more important than being perfect.

Every team, and by extension the services and code bases owned by them, could be dealing with different problems and might have different needs. The solutions to those problems need to be looked at differently as well, and the prioritization of problems to solve can be different.

I've often seen teams get stuck in objectives like standardization. While standardization is a great idea, standards in systems can also get in the way of moving fast or delivering what is most important. How far you standardize should depend on the economies of scale you want to achieve, not on doing things the same way just because that's how it should be. Often there could be something better to do than just standardizing.

For example, our consumer-facing systems had scale-related challenges, while our supply chain systems had challenges of correctness and reliability. Every time we decided on an org-wide technology investment that was not a real priority for every team, like adopting SLOs, we would make progress where it was a priority, but other teams might not be able to keep up. In this case, our consumer teams were able to implement SLOs much more quickly than our supply chain teams.

Reflecting on our journey got us to learn some of the places we were going wrong. We had to figure out where we go from here and how we internalize these learnings in our execution across teams. Unfortunately, we couldn't think of an easy way, so we started wondering. We felt that we needed to learn. This is the point where we got introduced to the concept of DevOps maturity, mostly by reading a bunch of really nice books that this community already knows about.

Here is the first maturity model someone in my team shared on the Slack channel. There's a continuous delivery maturity model from the book Continuous Delivery by Dave Farley and Jez Humble. In this, we found a way to articulate what we had learned. DevOps practices do not get adopted on day one. You move towards a vision and there are intermediate steps.

This framework highlights the importance of different aspects of continuous delivery to turn the concept into execution. Each of those rows is an area that is important for practicing CD, and the columns from basic to expert are levels of maturity. You start from left and the expectation is that you're moving towards right on each of the rows, hence maturing as you practice.

An important call-out in this framework is the first row, culture. Maturing in engineering practices is not just about maturing how you use tools and technologies, but also your ways of working. With a framework like this, you can clearly define those intermediate steps and also use them as internal or external benchmarks.

This was a good direction and it made sense. But we couldn't really take this to our teams and expect them to use it because it was too high level, not prescriptive enough about practices in a specific context. Solutions were missing. It is aspirational in the sense that following engineering practices can become a goal in itself, rather than delivering business value.

The question we were asking was: how do you operationalize a maturity model? How do you make yourself go from, "Hey, we wish to be an elite team," to a plan and a system that pushes you to get better every day?

Here is probably one-sixth of the maturity model we developed at Grofers. Sorry, I couldn't share the entirety of it because the document is quite large. But it is inspired by other maturity models and it incorporates some of our learnings. We call this the microservices maturity model. The idea was to look at all the practices while building systems instead of just one practice like continuous delivery.

From a distance, it seems similar to the one we just saw before, but there are quite a few differences here worth noting. Let's look at what we have here first. On the left side, we have pillars in the first column. On the second column from left, we have areas within these pillars. This way it is not as high level and gets into a little more detail on what kind of practices we want to see followed in the organization.

Then, in the third column from left and onward, we have levels of maturity: level one through level four, level four being the most mature state. So structurally, it is very similar to the previous maturity model.

The first-column categories, or pillars, are sort of macro engineering practices, while the areas within these pillars are more specific practices within the pillars. We have these five pillars, and then they have several areas of practices under them. This way it is not as high level as the previous maturity model and adds a little more detail.

These are things you can borrow from other maturity models, like we did. But the key thing to understand here is that what you decide to put in your model has to be important for your business, instead of focusing on everything. Remember, it's a journey. Progress matters, not perfection.

Depending upon your business, industry, and journey, you can craft your maturity model that focuses on practices that are important for you today and the ones that are infinite games you must start playing. Maybe you're an e-commerce business like Grofers. Things like the ability to release fast and run many experiments in parallel without breaking customer experience are important. So you create a focus on agility, releasability, experimentation, quality, and resilience.

Maybe you're a fintech business. Then things like correctness, transactional guarantees, security, and compliance matter a lot more. Maybe you're a B2B SaaS business. Then maybe reliability with compliance is more important. Maybe cost is very important for you, so you create a systemic focus on that.

The third column onward has levels, just like in the continuous delivery maturity model. But a difference here is that these columns have two sub-columns in them. One is called expectation, and the other is called "supported at Grofers." I will come to what expectation means later, but for now, let's just read it like we did the previous maturity model.

For example, synthetic monitoring on level two says, "The expectation is that synthetic monitoring be used in production with alerting," and the adjacent "supported at Grofers" column specifies a recommended way to meet that expectation. In this case, we suggest that services must implement a well-defined smoke suite with P1 test cases that can run in all the environments and can be periodically run in production using Jenkins. So we don't just set the expectation, but also prescribe how those expectations can be met.

That's what helps make a maturity model more prescriptive than open-ended. When a team looks at this, they know where they have to go and how they can get there in the Grofers context.
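The structure described above (pillars, areas, levels, and an expectation paired with a supported solution) can be captured in a small data model. This is a sketch under stated assumptions, not the actual Grofers document. The class names are hypothetical, the pillar name "observability" is a placeholder because the talk does not name the pillar for synthetic monitoring, and treating levels as cumulative is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LevelEntry:
    expectation: str  # what a service at this level is expected to do
    supported: str    # the recommended, platform-supported way to do it


@dataclass
class Area:
    pillar: str
    name: str
    levels: dict[int, LevelEntry] = field(default_factory=dict)

    def expectations_up_to(self, level: int) -> list[LevelEntry]:
        """Entries a service assigned `level` should satisfy.

        Assumes levels are cumulative: a level 3 service also meets levels 1 and 2.
        """
        return [self.levels[lvl] for lvl in sorted(self.levels) if lvl <= level]


# The one cell quoted in the talk; "observability" is a placeholder pillar name.
synthetic_monitoring = Area(
    pillar="observability",
    name="synthetic monitoring",
    levels={
        2: LevelEntry(
            expectation="Synthetic monitoring is used in production with alerting",
            supported="Smoke suite with P1 test cases, run periodically in production using Jenkins",
        )
    },
)

for entry in synthetic_monitoring.expectations_up_to(3):
    print(entry.expectation, "->", entry.supported)
```

Keeping the expectation and the supported solution in the same record mirrors the two sub-columns: a team reading one always sees the other.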

One of the key differences in our approach, which comes from our learning, is that the maturity model is not aspirational. It's actually risk-driven. We don't try to make our services and teams more mature just because they should become more mature. It's not like a career growth path. We get better because our business needs us to get better, and this is where we factor in different kinds of risks for our assessment. This is what becomes a business requirement.

The levels in the columns are not the levels that you try to get to. The levels are pre-decided for every service because that comes from the criticality of that service in our architecture. This was one of the key learnings we implemented. We're not going to get better because we should get better. We will get better because we need to get better in certain areas.

This is why we call the first column on the previous slide expectations. A service at a level is expected to follow certain practices. For example, we have an area called service resilience, under which a level three service is expected to have circuit breakers implemented to avoid cascading failures, while a level four service must practice chaos engineering to continuously validate that failures don't lead to cascading failures.
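For readers unfamiliar with the level three expectation, here is a minimal sketch of a circuit breaker. It is illustrative only and not what Grofers ran in production. The class name, the thresholds, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the breaker is testable
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling more load on a struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

Wrapping each downstream call in `breaker.call(...)` means that once a dependency has failed repeatedly, callers get an immediate error rather than waiting on timeouts, which is what stops one failure from cascading.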

The levels are pre-calculated on multiple parameters like frequency of code changes, number of active collaborators, whether it is in the critical path to serving the customer, and other parameters like that. We try to mostly calculate the risk automatically and centrally to use a common logic and assign a level to every service, so that teams now understand where their microservices are level-wise and see where they're falling in the DevOps journey.
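A central level calculation along these lines could look like the sketch below. The three signals come from the talk, but the thresholds, the equal weighting, and the names `ServiceSignals` and `assign_level` are assumptions for illustration. The real logic used more parameters than these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSignals:
    commits_per_month: int      # proxy for frequency of code changes
    active_collaborators: int
    on_critical_path: bool      # is it needed to serve the customer?


def assign_level(signals: ServiceSignals) -> int:
    """Map risk signals to a maturity level from 1 to 4.

    Thresholds are illustrative: each risk signal that fires raises the
    expected level by one.
    """
    score = 0
    if signals.commits_per_month >= 20:
        score += 1
    if signals.active_collaborators >= 5:
        score += 1
    if signals.on_critical_path:
        score += 1
    return 1 + score


print(assign_level(ServiceSignals(commits_per_month=50, active_collaborators=8, on_critical_path=True)))
```

Computing the level in one place with one rule keeps the assignment consistent across teams, so a debate about a service's level becomes a debate about its inputs.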

Once the levels are assigned, teams can self-assess and set their journey to get to the level of expectation as guided by the maturity model. This started to make a lot of sense. It was getting tied really well into a structure where microservices are owned by teams. After teams did a few self-assessments for their microservices, it started to become clear to them what areas they needed to focus on depending upon the nature of the service. Right after, we had the first quarter where most teams organically arrived at the most relevant goals that matched their reality with minimal hand-holding.
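The self-assessment step amounts to comparing the assigned level with where the service stands in each area. A minimal sketch follows, assuming teams score each area from 1 to 4. The function name `assessment_gaps` and the "documentation" area name are hypothetical.

```python
def assessment_gaps(expected_level: int, current: dict[str, int]) -> list[tuple[str, int]]:
    """List areas below the expected level, largest gap first."""
    gaps = [
        (area, expected_level - level)
        for area, level in current.items()
        if level < expected_level
    ]
    # Sort by gap size (descending), then by area name for a stable order.
    return sorted(gaps, key=lambda gap: (-gap[1], gap[0]))


self_assessment = {
    "service resilience": 1,
    "synthetic monitoring": 2,
    "documentation": 3,  # hypothetical area name
}
print(assessment_gaps(3, self_assessment))
```

The ordered list of gaps is a natural starting point for a team's quarterly goals: the areas furthest below the expected level come first.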

Again, just prescribing is not enough. Teams today have to deal with so many decisions and so many different kinds of tools and technologies. There's so much cognitive overload. Expecting everyone to make the best decision for everything they have to do is unjustified. This is where platform thinking comes in: clearly defining how we can help teams adopt various DevOps practices without expecting them to spend a lot of time making decisions, and reducing the cost of transformation by achieving economies of scale.

All the things that you see highlighted in red and yellow in the "supported at Grofers" column are possible solutions that the platform teams came up with that could potentially help the teams adopt practices easily. But these solutions are not yet productionized at Grofers. This way, platform teams got a clear roadmap of things that they needed to build. Of course, we made sure that we are not tied to the solution, so this is more like an outcome roadmap for our platform teams.

It also stopped a lot of debate of, "We should do this," or "We should do that." We now had a framework to accept or reject ideas and focus on platform execution. I feel that is extremely important for platform teams because the impact of their work is usually not clearly visible, sometimes even to themselves. An outcomes-driven framework like this can help keep the platform teams aligned with the product engineering teams and the business.

The idea of maturity models has been debated before, about their utility and effectiveness. So the Disciplined Agile folks write, "Do maturity models really work?" In 2017, Dr. Nicole Forsgren presented her research at DevOps Enterprise Summit, where she said that maturity models don't work because they go out of date too fast as technology is changing really quickly and the landscape changes really quickly.

While I don't disagree with the point of technology moving too fast these days, doesn't everything that we do today get outdated quickly as well? Isn't that true, with or without maturity models, of all the technologies we use, our ways of working, and our organization policies?

Are maturity models effective in the way we deployed them at Grofers? I think only time will tell, but we committed to doing this and also committed to revising the maturity model itself with time. Because platform teams derive their goals out of this model, the relevance of everything the model says has been reviewed and questioned several times after we released the first version.

What we also noticed is that practices that have stood the test of time don't really change. The technology supporting the practices changes, and that's fine because we deal with that kind of change anyway.

Approaching a maturity model was our solution to help us scale engineering management with a team that was young and lacked the experience it needed to build systems at scale. Your reasons could be different. You'll have to see for yourself if this works for you.

And that's it, folks. I hope you enjoyed the session as much as I enjoyed presenting it. I would love to take questions on Slack or enjoy a virtual coffee or beer to dive deep on this topic with you.