Think Again Before Migrating to Kubernetes

Kubernetes has really gained mindshare with technology teams. There is a strong interest and desire in teams to migrate to Kubernetes and other new tech in the cloud-native ecosystem. However, the journey to a cloud-native stack is long and complex, with a lot of hidden costs.


A little over two years back, we took the decision to leave behind our Ansible-based configuration management setup for deploying applications on EC2 and move towards containerisation and orchestration of applications using Kubernetes. We have since migrated most of our infrastructure to Kubernetes. It was a big undertaking and had its own challenges — from the technical challenges of running a hybrid infrastructure until most of the migration was done, to organizational challenges such as training the entire team on a completely new paradigm of operations.


This talk is our experience report of adopting Kubernetes and other cloud-native technologies: the challenges that came our way, dealing with the complexity of legacy systems, delivering a great developer platform, and the way forward for us.


By attending this session, attendees will get insights into:


- possible reasons to migrate to Kubernetes and how to decide whether or not to migrate (with possible alternatives)

- the practical challenges of migrating to Kubernetes

- the factors to consider before deciding to adopt Kubernetes

- how to prepare themselves and their teams for the journey (culturally and functionally)


This talk is not meant to deter you, but merely to help you understand your reasons more clearly before you make the decision.


Vaidik Kapoor

VP Engineering (DevOps & Security), Grofers

Transcript

00:00:15

Kubernetes has been getting a lot of attention and there are good reasons behind that. At Grofers, we decided to migrate to Kubernetes almost two years back. Today I want to talk about why we went in that direction and how it worked out for us. I work at Grofers. We are one of the largest online grocery services in India. We have been around for eight years. We started as a hyperlocal marketplace of grocery stores where customers could go online and get their orders within 90 minutes. There was a need for a convenience service like this in India, and we were solving this important need. We grew rapidly. Like most startups, we started with a crazy idea and faced multiple challenges along the way. We changed our core value proposition and business model four times in the last eight years, moving from a hyperlocal marketplace to a centralized warehousing model for fulfillment to becoming a marketplace again that now delivers orders in under 10 minutes.

00:01:06

Each of these changes was pivotal in our journey. We would not have been able to survive and be where we are today without the tremendous agility built into our organization. We realized that our agility and speed are among our core strengths that let us continue to be relevant and keep transforming ourselves. Today we have about 2,000 employees directly working at Grofers. The technology function is about 250 people, including engineering, product design, and data. Technology is truly at the heart of everything that we do. Pretty early on, we realized that giving our developers the power to make their own technology decisions would be vital for our agility. We pushed the idea of developer and team autonomy as far as we could, adopted the "you build it, you run it" philosophy, and enabled teams to take their own technical decisions across the entire stack, including infrastructure and operations like configuration management, scalability, resilience, and even handling production incidents.

00:01:59

DevOps teams are essentially responsible for taking care of governance and for providing processes and tools for developers to really own the entire application lifecycle. Before Kubernetes, we were deploying to EC2 instances using Ansible and Jenkins. Common tooling was standardized but not locked down, so that also gave teams the autonomy to go ahead and explore other tooling that might be better suited for certain use cases. This setup led to a very diverse stack that grew organically and rapidly. Public cloud, AWS in our case, made this autonomy possible to quite an extent. And then we moved to Kubernetes. Here's what we have done with Kubernetes at a high level: in the last two years, 75% of our targeted production services have been migrated to Kubernetes. By "targeted" we mean that we don't intend to migrate everything, like stateful services and some extremely slow-moving legacy services.

00:02:52

There's a lot of benefit that comes with using Kubernetes for development and CI/CD. We practically develop in the cloud. Developers treat the staging Kubernetes cluster as an extension of their laptops. Complex integrated dev environments with more than 20 microservices can be created by anyone on demand in under 10 minutes. What this means for our developers is that we use a combination of tools built on open standards that work well for specific purposes, but at the same time, we limit them as well. While we have moved to Kubernetes, we still continue to use our Ansible-based tooling for certain kinds of workloads where we feel it works best in the larger scheme of things. Platform teams continue to build abstractions that reduce operational overhead around cost, security, and reliability so that developers can truly own things end-to-end. It took us a couple of years of engineering time from product and platform engineering, and hundreds of thousands of dollars, to get to this.

00:03:49

It was a big investment, so was it worth it? In early 2018, we realized that we had an illusion of agility. Teams were working independently on their microservices, deploying multiple times a day, but there were not enough guardrails for quality. We were creating waste and shipping poor-quality products that were frustrating customers, internal users, and management. Our engineers were burning out as they were busy firefighting instead of shipping value to customers. We used to think that solving just for infrastructure management and the delivery pipeline was enough, that saying "you build it, you run it" was enough, and that our teams would write tests and own quality as they saw appropriate. And to an extent, that happened. However, what we ended up with was a proliferation of microservices that teams created to manage technology within their boundaries as they understood them. And more often than not, the teams were not considering the impact of introducing new microservices on our overall architecture.

00:04:40

We ended up with autonomous teams creating microservices independently to solve problems within their boundaries and under their control. Due to this missing guidance of an overall architecture, we ended up with microservices that were hard to develop, test, release, and monitor in production. In many cases, the boundaries were not clear enough, leading to slow releases. Our quality feedback loops were extremely poor, so poor that we were mostly getting to know about bugs from customers, customer support, and sometimes directly from the CEO. This was just unacceptable. Lack of technical oversight made this microservices architecture complex. While the infrastructure-level processes were decently managed, application releases were manually tested and orchestrated, and testing for certain kinds of behaviors was just not possible. Whenever we would try to approach testing microservices, we would not be able to make any meaningful progress, as testing any one microservice was not enough. Our microservices had lost their well-defined boundaries.

00:05:37

Over time, we figured out that we were now dealing with a distributed monolith that had become hard to reason about. The one lesson we learned in all of this is that if you allow your teams to launch more components autonomously and independently through self-serve tooling, they will do it. But if there's no framework to surface problems in your architecture and engineering practices, the complexity will become too hard to comprehend and more and more mess will just pile on. Our developers were unhappy because of poor developer experience. We were regularly dealing with bugs and incidents in production due to lack of testing. It was a stressful atmosphere at that point of time. In March 2018, we decided to slow down to find a viable way forward: to ship fast enough without continuously compromising on quality, and to resolve the mess we had created ourselves.

00:06:29

Since our microservices were not really independent, independently testing them was not enough. We decided to run automated regression tests on the distributed monolith to ensure that a change in any microservice would not break the product experience, which essentially meant running behavior tests on the entire backend for every little change. Our bet was that this would help us increase our deployment frequency again, without compromising on quality. At the same time, it would give us a safety net to re-architect. We called this initiative Project Ship It, and we felt like we knew how to get where we wanted to go. We started experimenting with Docker and Docker Compose to build on-demand CI environments in Feb 2018. In about a month's time, we could orchestrate our complex backend and run tests over it, but it was all too slow and unstable for any real use. We made efforts to stabilize it and got to some acceptable level of stability to keep pushing, until we realized that we now had a new problem at hand.
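To make that concrete, here is a minimal sketch of the kind of Compose file such an on-demand CI environment might be built from; the service names, images, and test command are illustrative assumptions, not the actual Grofers stack.

```yaml
# Hypothetical Compose file for an on-demand CI environment: bring up a slice
# of the backend plus its dependencies, then run behavior tests against it.
version: "3.8"
services:
  postgres:
    image: postgres:11
    environment:
      - POSTGRES_PASSWORD=postgres
  orders-api:                         # illustrative microservice under test
    image: registry.example.com/orders-api:${GIT_SHA}
    environment:
      - DATABASE_URL=postgres://postgres:postgres@postgres:5432/orders
    depends_on:
      - postgres
  regression-tests:                   # behavior tests over the composed backend
    image: registry.example.com/regression-tests:${GIT_SHA}
    command: ["pytest", "tests/behavior"]
    depends_on:
      - orders-api
```

A CI job could then run something like `docker-compose up --exit-code-from regression-tests` to gate the pipeline on the test result.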

00:07:30

Dev-prod disparity. We were using Ansible to deploy to production and Docker Compose in our test environments. This led to tests passing in CI, but deployments causing outages and bugs. Finally, in September 2018, we realized that our tooling was not going to work. We were building container orchestration with a tool that was built for local development and not production. We needed something production-grade. The industry's momentum was towards Kubernetes, so that's where we decided to go as well. By Jan 2019, we had put a few critical services in production to prove scale. Finally, by March 2019, we started migrating away from our Docker Compose-based CI setup, and also started migrating those services to production. We were finally on the path to running tests end-to-end on a stable CI environment with dev-prod parity, and also reusing the tooling for streamlining the overall dev experience. This is how and why we started using Kubernetes. We moved to Kubernetes for a better dev experience and more agility, and Kubernetes was a way to do that in our team's context.

00:08:37

So while we migrated to Kubernetes, we didn't migrate to it "just because." Our goal to release high-quality software while being able to re-architect indirectly pushed us to adopt Kubernetes, the process of which was quite complex and expensive. The learning I want to share here is: be clear about your reasons before you move to Kubernetes. Is it more agility or better reliability, like faster auto-scaling, streamlined operations, resilience? Or is it cost optimization? Or maybe is it portability across heterogeneous infrastructure? Be clear about your reasons and align them to some business outcomes. The interesting thing is that Kubernetes may not be the only way to achieve all of this. It's most likely just as possible to do these things with your existing infrastructure tooling. Kubernetes just makes things easier so that you don't have to engineer your way through most things.

00:09:29

But remember that there are examples of high-performing engineering teams where Kubernetes is still not in use, or made it to production pretty late in their journey. What it comes down to is the amount of effort you may have to put in with your existing tooling, which might be significant; but at the same time, even migrating to Kubernetes will require you to pretty much rebuild everything that your DevOps teams have built so far. So the choice really is between redoing everything that you have done so far versus committing to improving your existing platform. We committed to Kubernetes. Was it worth it? Absolutely. We get value out of it every day, and it has made so many things so much simpler for us. It has opened up more options for our future, but I think that question isn't complete. The right question is: did we go about adopting Kubernetes the right way?

00:10:16

And the answer to that is that we have mixed feelings. It took us a year and a half to build a baseline Kubernetes platform at par with what we could do on EC2 with Ansible, and to be able to provision our complex, messed-up backend on demand for developers in under 10 minutes. So we did achieve the goal of faster CI loops and safer releases, but rolling out Kubernetes in production for achieving dev-prod parity made the problem a lot more complex. Running Kubernetes in production is not straightforward, even with managed services like EKS or GKE. We started believing that our journey from Docker to Docker Compose to Kubernetes was an obvious progression into the container world and the only way to solve our problem. We actually never really cared to look back and re-evaluate if we really needed Kubernetes or containers to solve our problems.

00:11:05

In retrospect, we could have taken two tracks. A short-term track, probably a quarter, to build on top of our EC2, Ansible, and Consul-based setup: streamline our usage of service discovery, speed up Ansible, and build some glue tooling to spin up new VMs for every developer and set up the complex microservices environment on them. This would not have been an ideal experience overall, but it would most likely have worked to help us get to value faster. And we could have started another long-term track to build a rock-solid Kubernetes platform: simplify operations, get rid of complex in-house tooling, bring stability, operate at low cost, and, most importantly, build on open source and open standards instead of building a ton of our own in-house tooling, which will never be as good as what the community will be building. We went through a ton of complexity, which honestly couldn't have been avoided whether we were migrating to Kubernetes now or in the future, but we didn't care to evaluate if it was the right time for us to look into it.

00:12:02

The point being: Kubernetes comes with a lot of benefits, from developer experience to operations to better cost to security and governance. But does your company really need all of it? There's usually a way you can convince yourself that it will help, because it will. But then, does your company really need it now? It's a big undertaking. So if you do feel the need to adopt Kubernetes, you have to adjust your mindset a bit before getting onto it. Kubernetes in itself does not solve any problem unless you understand what it really brings to the table. If your teams are accustomed to doing things manually, or infrastructure management is siloed between development and operations teams, Kubernetes may not help at all, or it could even make things worse. To a large extent, there's an overlap with the problems of migrating to the cloud from on-premise.

00:12:51

So the principles of cloud adoption apply to Kubernetes as well. For some teams, adopting cloud engineering practices before migrating to Kubernetes may actually be a better approach to get a taste of operations and automation at scale, and to truly appreciate the abstractions that Kubernetes can provide. Operations on Kubernetes is different in its own way. There's a learning curve for your cloud engineering team and development teams, and it is well worth it for your teams to go through that learning curve to unlock the full potential of Kubernetes. We first invested in the infrastructure platform team to learn about Kubernetes and the paradigm, with dedicated learning time, learning sessions, sponsored trainings, and Kubernetes certifications, which by the way are great. But soon we realized that also involving the engineering leadership outside the platform scope, along with some senior engineers, would have paid off in having them understand the paradigm faster and co-build a Kubernetes platform with us that was suited to our needs.

00:13:42

It doesn't matter if Kubernetes is set up by your team on bare metal or if it is Kubernetes set up by a major cloud provider; out of the box, Kubernetes is just never enough. Kubernetes is more like what a Linux operating system is for VMs. You have to configure it with all the bells and whistles to suit the needs of your teams. If you think that dropping in Kubernetes and then migrating all applications to it is enough, you're going to struggle later. The collection of components in the Kubernetes ecosystem, tied together to craft a certain kind of developer experience, is what makes a PaaS, not Kubernetes alone. One example of an opinionated Kubernetes distribution is OpenShift by Red Hat, which comes with a lot of components pre-configured. Another one is VMware Tanzu. For some teams, it may really be worth it to explore these options instead of setting up everything from scratch.

00:14:32

One of the mistakes we made was to not figure out ingress and to instead directly use an AWS Elastic Load Balancer for every application. While getting started was easy, as we understood how ELB works, this led to a high number of load balancers and high operational cost due to applications that were not getting that much traffic in production. The costs were even higher in staging and development environments, where we were spinning up load balancers for non-production environments, multiplied by the number of active developers in a day. While building the Kubernetes platform, we realized that a lot of factors of how you work as a business and team end up shaping how things look. It is essentially a reflection of what your business and product context is, what your engineering teams are capable of, how your existing applications and their infrastructure are, and your ability to spend money to migrate to a new way of working.
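For illustration, this is roughly what the ingress-based alternative looks like: one shared ingress controller, and hence one load balancer, routing to many services by hostname. The hostnames, service names, and the nginx ingress class are assumptions for the sketch, not a description of the Grofers setup.

```yaml
# A single Ingress fronting multiple services through one shared load balancer,
# instead of provisioning an ELB per application.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shared-ingress
spec:
  ingressClassName: nginx             # assumes an nginx ingress controller is installed
  rules:
    - host: orders.staging.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: orders-api      # illustrative service names
                port:
                  number: 80
    - host: catalog.staging.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: catalog-api
                port:
                  number: 80
```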

00:15:26

When talking about the bells and whistles, there are tons of things to take care of. Here's a list of the things that we remembered to take care of, and I'm sure I've missed out some things from this list as well. Each of the items on the right is the corresponding solution for the item on the left. Those are the decisions we had to consciously make. A special mention is managing cost: in Kubernetes, you're going to lose visibility big time unless you invest in it, and you're still going to be limited because the ecosystem lacks enough tools. Today you'll find a ton of tutorials, posts, and conference talks showing what is possible with Kubernetes, but almost none of them account for legacy applications, which most likely do not closely follow the 12-factor pattern.

00:16:12

This can be really limiting. We found ourselves balancing between changing our applications and giving up our vision of an ideal platform. Doing both would most likely mean more time and higher costs. An example of this in our journey was dealing with configuration management. We could not use ConfigMaps and Secrets in Kubernetes, which required applications to read configurations in a certain way, namely environment variables. Our applications had their configurations externalized using configuration files, which were rendered by Ansible; that was our old way. Moving to the Kubernetes way would mean rewriting configuration management across our polyglot ecosystem. In the end, we found that using consul-template with Consul would solve configuration management for us without the overhead of changing every application and how it was handling configuration. So let's say you have figured out what a Kubernetes setup should look like, and now you have a working cluster with a couple of example applications.
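As a rough illustration of that pattern (not the exact Grofers setup), a consul-template sidecar can keep rendering the same config files the application already reads, with Consul as the source of truth; the image tag, paths, and template name below are assumptions.

```yaml
# Sidecar sketch: consul-template renders app.conf from Consul into a shared
# volume, so the app keeps reading a file instead of environment variables.
apiVersion: v1
kind: Pod
metadata:
  name: orders-api
spec:
  volumes:
    - name: rendered-config
      emptyDir: {}
  containers:
    - name: consul-template
      image: hashicorp/consul-template:0.25.2          # assumed tag
      args:
        - "-consul-addr=consul.internal:8500"
        - "-template=/templates/app.conf.tpl:/config/app.conf"
      volumeMounts:
        - name: rendered-config
          mountPath: /config
      # the template itself would be baked into the image or mounted separately
    - name: orders-api
      image: registry.example.com/orders-api:latest    # illustrative app image
      volumeMounts:
        - name: rendered-config
          mountPath: /etc/orders-api   # app reads /etc/orders-api/app.conf as before
```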

00:17:07

How do you get it adopted and migrated to? We had an interesting challenge: even with the blessing of the CTO and the teams taking up quarterly goals to migrate their applications, nothing would actually move. We would struggle month after month with getting any serious traction to migrate any microservices to production. Managers and teams would start a new quarter over and over, but not make any significant progress taking their workloads to production. We would even get as close as writing some manifests and completing training courses, but whatever we expected to achieve would not get done. We were frustrated about this stalled progress. On looking for the reasons, we learned a bunch of things that we were taking for granted in undertaking a transition of this complexity and scale. Everyone had agreed to migrating to Kubernetes, but only superficially. It's not that our teams did not want to explore Kubernetes; rather, as the teams got closer to execution, they could not clearly see the value in migrating.

00:18:03

The path to getting value involved migrating the 20 most critical microservices, owned by five teams that were also working on new features. It doesn't matter whose blessing is behind a project; it was impossible to get any of our teams to do anything unless they were clear about the benefits. Like I said earlier, the first version of our Kubernetes setup was just about as good as our EC2 setup. The risk of change was too high, the value unclear. Even exciting new tech was not enough motivation to migrate. A learning here was that a significantly better platform, one which can deliver some value (more reliability, or better dev experience, or better observability, or better CI/CD, or whatever), is a better incentive to increase intrinsic motivation and take on any risk. The debate will always be about balancing value versus risk.

00:19:00

We expected the teams to learn Kubernetes using popular online resources, which we had shortlisted, while delivering product features, which usually get a lot more attention than a migration project. To support our teams, we offered office hours, but that didn't seem to be effective. The absence of adequate migration and post-migration support added to the fear of what might happen when something goes wrong; we feel we could have done a much better job here of offering proactive support to help with the migration. While we were confident in Kubernetes overall, there were things like service discovery, configuration management, and immutability of infrastructure that we had to figure out how they would work out for us. That made us run parallel stacks in production, which brought another level of complications in terms of technical decisions and operational challenges. We had to support existing tooling to deploy changes to both stacks and then be able to roll back from both stacks as well.

00:19:52

Monitoring became complex as well. When we started using Kubernetes, AWS did not offer a managed Kubernetes service in our region, so we had to self-manage our cluster. It was a nightmare. Getting started may not be that hard, but the real issues come when you deploy workloads. We had issues ranging from the cluster not auto-scaling at the right time, to overlay networking issues, to unreasonably high DNS latencies. We've seen many painful issues, especially with networking, because our team is not expert in networking and we didn't want to become that either. So we always recommend going for a managed Kubernetes service. But even with a managed Kubernetes service, you have to worry about upgrades: the community is moving really fast, and cloud providers are also deprecating old versions very fast. Cost of workloads becomes hard to measure as well. Like I said earlier, this space is still maturing.

00:20:42

Existing cloud cost tools just don't work, and you are left scratching your head at the end of the month: why does this cost us so much more? Infrastructure right-sizing requires you to look at tuning resource requests and limits, concepts you would not have had to think about before. Again, it's a different paradigm. We do see some benefits, like running pretty much our entire environment on spot instances, but then there are some challenges around running spot instances in production; I say that with a disclaimer, there are gotchas with that. Security and governance is a different game altogether. In the case of Kubernetes, the attack vectors increase significantly, but you also have a lot of control through automation. So again, embracing the new paradigm is really important to be effective. Speaking of embracing the new paradigm: CRDs, operators, controllers, admission controllers, and mutating webhooks are important concepts for really pushing your platform to provide a superior experience.
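To illustrate what right-sizing touches, here is the kind of per-container tuning involved; the pod name, image, and numbers are placeholders for the sketch, not recommendations from the talk.

```yaml
# Requests drive scheduling (and effectively cost); limits cap runtime usage.
apiVersion: v1
kind: Pod
metadata:
  name: orders-api
spec:
  containers:
    - name: orders-api
      image: registry.example.com/orders-api:latest   # illustrative image
      resources:
        requests:
          cpu: 250m          # what the scheduler reserves on a node
          memory: 256Mi
        limits:
          cpu: 500m          # CPU is throttled above this
          memory: 512Mi      # container is OOM-killed above this
```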

00:21:42

These concepts allow platform engineers and operators to build abstractions over existing processes to make infrastructure management more declarative and streamline operations for developers. We feel that Kubernetes is great, but it's even better when you're using these features to simplify daily operations and provide a more integrated experience to your developers. We use CRDs to turn some of our workflows into a more declarative style of infrastructure management. Today we have a handful of use cases specific to our setup that we have abstracted using CRDs and custom controllers, and many governance and security policies built using admission controllers and mutating webhooks that make the lives of product and platform engineering teams easy. With the abstractions built on Kubernetes, developers get a more consistent user experience. That's the beauty of it. So our philosophy is to manage as much as possible using kubectl, to give our developers a simple and consistent user experience.
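As a hypothetical example of such an abstraction (the group, kind, and fields are invented for illustration, not the actual Grofers CRDs), a custom resource could let a developer declare an on-demand environment that a custom controller then reconciles.

```yaml
# Minimal CRD sketch: developers create DevEnvironment objects with kubectl,
# and a custom controller provisions the requested microservices behind the scenes.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: devenvironments.platform.example.com
spec:
  group: platform.example.com
  scope: Namespaced
  names:
    kind: DevEnvironment
    plural: devenvironments
    singular: devenvironment
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                services:          # which microservices to spin up
                  type: array
                  items:
                    type: string
                ttlHours:          # tear the environment down automatically
                  type: integer
```

A developer would then interact with it through the same kubectl workflow as any built-in resource, for example `kubectl apply -f my-env.yaml` and `kubectl get devenvironments`, which is what keeps the experience consistent.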

00:22:41

Well, I've shared what we've learned from our journey. There's a lot of value in doing all of this, but I want to remind you to ask yourselves if all of this is necessary for your business right now. Do you have the liberty to make a long-term investment, or are there more short-term priorities that you must address first? Can you find a middle ground with your existing stack instead of delaying value by making an investment this big? That's a decision that you as technical leaders must make. So where are we going now? The platform must evolve; in fact, it is still evolving for us. We treat Kubernetes as a product for developers, so we are always out there discovering new use cases and unmet needs. These needs change and will continue to change in the localized context of a team and the global context of Grofers, and new solutions will arise.

00:23:28

There's really no one good answer for a platform, only one that works today in your organization's context. As for us, we have built a multi-year Kubernetes roadmap where the architecture of applications and of Kubernetes itself is changing together. We are building on top of the concept of a golden path that solves for 80% of the use cases out of the box, while leaving the flexibility for developers to choose for the remaining 20%, and also for power users who would like to actually go and explore more ways of doing the same thing, to learn and to build technical expertise. And we are trying to bake resilience and security into all of this. We are excited for what is coming next. That was our journey. My name is Vaidik. You can find me on Twitter, LinkedIn, and Medium, and I'd love to take questions.