Think Again Before Migrating to Kubernetes (US 2021)

Kubernetes has gained real mindshare with technology teams. There is strong interest among teams in migrating to Kubernetes and other new tech in the cloud-native ecosystem. However, the journey to a cloud-native stack is long and complex and has a lot of hidden costs. A little over two years back, we took the decision to leave behind our Ansible-based configuration management setup for deploying applications on EC2 and move towards containerisation and orchestration of applications using Kubernetes. We have migrated most of our infrastructure to Kubernetes. It was a big undertaking and had its own challenges, from the technical challenges of running a hybrid infrastructure until most of the migration was done to organizational challenges such as training the entire team on a completely new paradigm of operations. This talk is our experience report of adopting Kubernetes and other cloud-native technologies: the challenges that came our way, dealing with the complexity of legacy, delivering a great developer platform, and the way forward for us.

By attending this session, attendees will get insights into:
- possible reasons to migrate to Kubernetes and how to decide whether to migrate or not (with possible alternatives)
- the practical challenges of migrating to Kubernetes
- the factors to consider before deciding to adopt Kubernetes
- how to prepare themselves and their teams before setting out on the journey (culturally and functionally)

This talk is not meant to deter you, but merely to help you understand your reasons more clearly before you make the decision.

breakout, us, las vegas, vegas2021
Vaidik Kapoor

VP Engineering (DevOps & Security), Grofers

TRANSCRIPT

00:00:15

Kubernetes has been getting a lot of attention, and there are good reasons behind that. At Grofers, we decided to migrate to Kubernetes almost two years back. Today, I want to talk about why we went in that direction and how it worked out for us. I work at Grofers. We are one of the largest online grocery services in India. We have been around for eight years. We started as a hyper-local marketplace of grocery stores where customers could order online and get their orders within 90 minutes. There was a need for a convenient service like this in India, and we were solving this important need. We grew rapidly. Like most startups, we started with a crazy idea and faced multiple challenges along the way. We changed our core value proposition and business model four times in the last eight years, going from a hyper-local marketplace to a centralized warehousing model for fulfillment to becoming a marketplace again that now delivers orders in under 10 minutes. Each of these changes was pivotal in our journey.

00:01:10

We would not have been able to survive and be where we are today without the tremendous agility built into our organization. We realized that our agility and speed are among our core strengths, letting us continue to be relevant and keep transforming ourselves. Today, we have about 2,000 employees working directly at Grofers. The technology function is about 250 people, including engineering, product, design and data. Technology is truly at the heart of everything that we do. Really early, we realized that giving power to our developers to make their own technology decisions would be vital for agility. We pushed the idea of developer and team autonomy as far as we could. We adopted the "you build it, you run it" philosophy and enabled teams to take their own technical decisions across the entire stack, including infrastructure and operations like configuration management, scalability, resilience, and even handling production incidents. DevOps teams are essentially responsible for taking care of governance and for providing processes and tools for developers to really own the entire application life cycle.

00:02:08

Before Kubernetes, we were deploying to EC2 instances using Ansible and Jenkins. Common tooling was standardized, but not locked down. So that also gave teams the autonomy to go ahead and explore other tooling that might be better suited for certain use cases. This setup led to a very diverse stack that grew organically and rapidly. Public cloud, AWS in our case, made this autonomy possible to quite an extent. And then we moved to Kubernetes. Here's what we have been able to do with Kubernetes at a high level. In the last two years, 75% of our targeted production services have been migrated to Kubernetes. By "targeted", we mean that we don't intend to migrate everything, such as stateful services and some extremely slow-moving legacy services. There's a lot of benefit that comes from using Kubernetes for development and CI/CD. We practically develop in the cloud. Developers treat the stage Kubernetes cluster as an extension of their laptops. Complex integrated dev environments

00:03:05

with more than 20 microservices can be created by anyone on demand in under 10 minutes. What this means for our developers is that we use a combination of tools built on open standards that work well for specific purposes, but at the same time, we limit them as well. While we have moved to Kubernetes, we still continue to use Ansible and to build our Ansible-based tooling for certain kinds of workloads where we feel it works best. In the larger scheme of things, platform teams continue to build abstractions that reduce ops overhead and address security and reliability so that developers can truly own things end to end. It took us a couple of years, engineering time from product and platform engineering, and hundreds of thousands of dollars to get to this. It was a big investment. So was it worth it? In early 2018, we realized that we had an illusion of agility. Teams were working independently on their microservices, deploying multiple times a day, but there were not enough guardrails for quality. We were creating waste and shipping bugs.

00:04:05

Buggy products were frustrating customers, internal users and management. Our engineers were burning out as they were busy firefighting and not shipping value to customers. We used to think that solving just for infrastructure management and the delivery pipeline was enough, that saying "you build it, you run it" was enough, and that our teams would write tests and own quality as they saw appropriate. To an extent, it happened. However, what we ended up with was a proliferation of microservices that teams created to manage technology within their boundaries, as they understood them. More often than not, the teams were not considering the impact of introducing new microservices on the overall architecture. We ended up with autonomous teams creating microservices independently to solve problems within their boundaries and under their control. Due to this missing guidance of an overall architecture, we ended up with microservices that were hard to develop, test, release and monitor in production.

00:04:57

In many cases, the boundaries were not clear enough, leading to slow releases of poor quality. Feedback loops were extremely poor, so poor that we were mostly getting to know about bugs from customers, customer support, and sometimes directly from the CEO. This was just unacceptable. Lack of technical oversight made this microservices architecture complex. While the infrastructure-level processes were decently managed, application releases were manually tested and orchestrated, and testing for certain kinds of behaviors was just not possible. Whenever we tried to push for testing in microservices, we were not able to make any meaningful progress, as testing any one microservice was not enough. Microservices lost their well-defined boundaries over time. We figured that we were now dealing with a distributed monolith that had become hard to reason about. The one lesson we learned in all of this is that if you allow your teams to launch more components autonomously and independently through self-serve tooling, they will do it.

00:05:56

But if there's no way to surface problems in the architecture and engineering practices, the complexity will become too hard to comprehend and more and more mess will just pile on. Developers were not happy because of poor developer experience. They were regularly dealing with bugs and incidents in production due to lack of testing. It was a stressful atmosphere at that point of time. In March 2018, we decided to slow down to find a viable way forward to ship fast enough without continuously compromising on quality, and to resolve the mess we had created ourselves. Since our microservices were not really independent, independently testing them was not enough. We decided to run automated regression tests on the distributed monolith to ensure that a change in any microservice would not break the product experience, which essentially meant running behavior tests on the entire backend for every little change. Our bet was that this would help us increase our deployment frequency again without compromising on quality.

00:06:53

At the same time, it would give us a safety net to rearchitect. We called this initiative Project Ship It, and we felt like we knew how to get where we wanted to go. We started experimenting with Docker and Docker Compose to build on-demand CI environments. In fact, in 2018, in about a month's time, we could orchestrate a complex backend and run tests over it, but it was all too slow and unstable for any real use. We made efforts to fine-tune and stabilize it and got to some acceptable level of stability to give us a push, until we realized that we now had a new problem at hand: dev/prod disparity. We were using Ansible to deploy to production and Docker Compose in our test environments. This led to tests passing in CI but deployments causing outages and bugs. Finally, in September 2018, we realized that our tooling was not going to work.

00:07:46

We were building container orchestration with a tool that was built for local development and not production. We needed something production grade. The industry's momentum was towards Kubernetes, so that's where we decided to put our bets as well. By early 2019, we had put a few critical services in production to prove scale. Finally, by March 2019, we started migrating away from our Docker Compose CI setup and also started migrating those services to production. We were finally on the path to running tests end to end on a stable environment with dev/prod parity, and also reusing the tooling for streamlining the local dev experience. This is how and why we started using Kubernetes. We moved to Kubernetes for a better dev experience and for more agility, and Kubernetes was a way to do that in our team's context.

00:08:37

So while we migrated to Kubernetes, we didn't migrate to it just for the sake of it. Our goal to release high-quality software while being able to rearchitect indirectly pushed us to adopt Kubernetes, the process of which was quite complex and expensive. The learning I want to share here is: be clear about your reasons before you move to Kubernetes. Is it more agility, or better reliability, like faster scaling, streamlined operations and resilience? Is it cost optimization? Or maybe is it portability across heterogeneous infrastructure? Be clear about your reasons and align them to some business outcomes. The interesting thing is that Kubernetes may not be the only way to achieve all of this. It's most likely just as possible to do these things with your existing infrastructure tooling. Kubernetes does make things easy so that you don't have to engineer your way through most things. But remember that there are examples of high-performance engineering teams where Kubernetes is still not in use or made it to production pretty late in their journey.

00:09:37

What it comes down to is the amount of effort you may have to put in with your existing tooling, which might be significant. But at the same time, even migrating to Kubernetes will require you to pretty much rebuild everything that your DevOps teams have built so far. So the call really is between redoing everything that you have done so far versus committing to improving your existing platform. We committed to Kubernetes. Was it worth it? Absolutely. We get value out of it every day, and it has made so many things so much simpler for us. It has opened more options for us for the future. But I think that question isn't complete. The right question is: did we go about adopting Kubernetes the right way? And the answer to that is that we have mixed feelings. It took us a year and a half to build a baseline Kubernetes platform at par with what we could do on EC2 with Ansible and be able to provision a complex, messed-up backend on demand for developers in under 10 minutes.

00:10:31

So yes, we did achieve the goal of faster feedback loops and safe releases, but rolling out Kubernetes in production for achieving dev/prod parity made the problem a lot more complex. Running Kubernetes in production is not straightforward, even with managed services like EKS or GKE. We started believing that our journey from Docker to Docker Compose to Kubernetes was all a progression into the container world and the only way to solve our problem. We actually never really got to look back and evaluate if we really needed Kubernetes or containers to solve our problems. In retrospect, we could have taken two tracks. The first is a short-term track, probably a quarter, to build on top of our EC2, Ansible and Consul setup: streamline our usage of service discovery, speed things up a little bit, and add some glue tooling to spin up new VMs for every developer to set up the complex microservices environment on them.

00:11:22

This would not have been an ideal experience at all, but it would have most likely worked and helped us get to value faster. And we could have started another long-term track to build a rock-solid Kubernetes platform to simplify operations, get rid of complex in-house tooling, bring stability, operate at low cost and, most importantly, build on open source and open standards instead of building a ton of our own in-house tooling, which will never be as good as what the community will be building. We went through a ton of complexity, which honestly could have been avoided if we were migrating to Kubernetes now or in the future, but we didn't get to evaluate if it was time for us to look into it at that point of time. The point being, Kubernetes comes with a lot of benefits, from developer experience to operations, to better cost, to security and governance.

00:12:10

But does your company really need all of it? There's usually a way you can convince yourself that it will help, because it will, but then does the company really need it now? It's a big undertaking. So if you do feel the need or the opportunity, you have to adjust your mindset a bit before getting onto it. Kubernetes in itself does not solve any problem unless you understand what it really brings to the table. If your teams are accustomed to doing things manually, or infrastructure management is siloed between development and operations teams, Kubernetes may not help at all, and it could even make things worse. To a large extent, there's an overlap with the problems of migrating to the cloud from on-premise. So the principles of cloud adoption apply to Kubernetes as well. For some teams, adopting cloud engineering practices before migrating to Kubernetes may actually be a better approach to get a taste of operations and automation at scale

00:13:02

and to truly appreciate the abstractions that Kubernetes can provide. Operations on Kubernetes are different in their own way. There's a learning curve for your cloud engineering team and development teams, but it is well worth it for your teams to go through the learning curve to unlock the full potential of Kubernetes. We first invested in the infrastructure platform team to learn about Kubernetes and the paradigm: dedicated learning time, learning sessions, sponsored trainings and Kubernetes certifications, which by the way are great. But soon we realized that even having the engineering leadership outside the platform team, along with some senior engineers, go through this would have paid off in having them understand the paradigm faster and co-build with us a Kubernetes platform that was suited to our needs. It doesn't matter if Kubernetes is set up by your team on bare metal or by a major cloud provider: out of the box, Kubernetes is just never enough.

00:13:53

Kubernetes is more like what a Linux operating system is for users: you have to configure it with all the bells and whistles to suit the needs of your teams. If you think that dropping in Kubernetes and then migrating all applications to it is enough, you're going to struggle. The collection of components in the Kubernetes ecosystem, tied together to craft a certain kind of developer experience, is what makes it a PaaS, not Kubernetes alone. One example of a distribution is OpenShift by Red Hat, which comes with a lot of components pre-configured. Another one is VMware Tanzu. For some teams, it may really be worth it to explore these options instead of setting up everything from scratch. One of the mistakes we made was to not figure out ingress and instead directly use an AWS Elastic Load Balancer for every application. Getting started was easy, as we understood how ELB works.

00:14:43

This led to a high number of load balancers and high operational cost due to applications that were not getting much traffic in production. The cost was even higher in staging and development environments, where the load balancers we were spinning up for non-production environments increased by a factor of the number of active developers in a day. While building the Kubernetes platform, we realized that a lot of factors of how you work as a business and team end up shaping how things may end up looking. It is essentially a reflection of what your business and product context is, what your engineering teams are capable of, how your existing applications and their infrastructure are, and your ability to spend money to migrate to a new way of working. When talking about the bells and whistles, there are tons of things to take care of. There's a list of things that we had to remember to take care of.
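
The load balancer arithmetic in this segment can be made concrete with a rough sketch. Everything in it is an illustrative assumption rather than a figure from the talk: the service and developer counts, the flat hourly price (a placeholder, not an actual AWS rate), and the comparison against a single shared ingress load balancer per environment.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    services: int            # applications that each get their own load balancer
    active_developers: int   # developers with an on-demand environment that day
    lb_hourly_cost: float    # assumed flat hourly price per load balancer (placeholder)


def lb_count_per_app(fleet: Fleet) -> int:
    """One load balancer per application: production plus one copy per developer environment."""
    return fleet.services * (1 + fleet.active_developers)


def lb_count_shared_ingress(fleet: Fleet) -> int:
    """One shared ingress load balancer per environment, regardless of service count."""
    return 1 + fleet.active_developers


def daily_cost(lb_count: int, fleet: Fleet) -> float:
    """Cost of keeping lb_count load balancers running for 24 hours."""
    return lb_count * fleet.lb_hourly_cost * 24


# Hypothetical numbers, chosen only to show how the count scales.
fleet = Fleet(services=20, active_developers=30, lb_hourly_cost=0.025)
per_app = lb_count_per_app(fleet)         # 20 * (1 + 30) = 620
shared = lb_count_shared_ingress(fleet)   # 1 + 30 = 31
print(per_app, shared)
print(daily_cost(per_app, fleet), daily_cost(shared, fleet))
```

Under these assumptions the per-application approach grows with services times environments, while a shared entry point grows only with the number of environments, which is why the cost showed up most sharply outside production.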

00:15:36

And I'm sure that I've missed some things in this list as well. The items on the right are the corresponding solutions for each of the items on the left. Those are the decisions we had to consciously make. A special mention goes to managing costs on Kubernetes: you're going to lose visibility big time unless you invest in it. And you're still going to be limited because the ecosystem lacks enough tools. Today, you'll find a ton of tutorials, posts and conference talks showing what is possible with Kubernetes, but almost none of them account for legacy applications, which most likely do not closely follow the perfect pattern. This can be really limiting. We found ourselves balancing between changing applications or giving up our vision of an ideal platform. Doing both would most likely mean more time and higher costs. An example of this in our journey was dealing with configuration management.

00:16:28

We could not use ConfigMaps and Secrets in Kubernetes, which required applications to read configuration in a certain way, which is environment variables. Our applications had their configuration externalized using configuration files, which were rendered by Ansible. So that was the old way. Moving to the Kubernetes way would mean rewriting configuration management across our system. In the end, we found that using consul-template with Consul would help us solve configuration management without the overhead of changing every application and how it was handling configuration. So let's say you figured out what the Kubernetes setup should look like, and now you have a working cluster with a couple of example applications. How do you get it adopted and get applications migrated? We had an interesting challenge. Even with the blessings of our CTO, the setup ready, and the teams taking on goals to migrate their applications, nothing would actually move.
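
The idea behind the configuration approach in this segment, rendering the file a legacy application already reads from values held in a key-value store, can be sketched as follows. This is a minimal illustration, not the speaker's setup: the in-memory dictionary stands in for Consul, the key names and file layout are hypothetical, and Python's `string.Template` is used instead of consul-template's own template language.

```python
from pathlib import Path
from string import Template
import tempfile

# Stand-in for a key-value store such as Consul; keys and values are hypothetical.
KV_STORE = {
    "orders/db_host": "db.internal",
    "orders/db_port": "5432",
}

# Template for the file the legacy application already knows how to read.
# This uses string.Template syntax, not consul-template's template language.
CONFIG_TEMPLATE = Template("[database]\nhost = $db_host\nport = $db_port\n")


def render_config(service: str, kv: dict[str, str]) -> str:
    """Collect the service's keys from the store and fill in the template."""
    prefix = f"{service}/"
    values = {k[len(prefix):]: v for k, v in kv.items() if k.startswith(prefix)}
    return CONFIG_TEMPLATE.substitute(values)


def write_config(path: Path, content: str) -> bool:
    """Write the file only when the content changed; return True if it was rewritten."""
    if path.exists() and path.read_text() == content:
        return False
    path.write_text(content)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "app.ini"
    rendered = render_config("orders", KV_STORE)
    first = write_config(target, rendered)    # file did not exist yet
    second = write_config(target, rendered)   # unchanged content, no rewrite
    print(first, second)
```

The application keeps reading a plain file, so nothing in its code has to change; only the process that produces the file moves from the deployment tool to a renderer running next to the application.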

00:17:19

We would struggle month after month with getting any serious traction to migrate any microservices to production. Managers and teams would start anew every quarter, over and over, but not make any significant progress taking their workloads to production. We would even come as close as writing manifests and completing training courses, but whatever we expected to achieve would not get done. We were frustrated about this slow progress. On digging for the reasons, we learned a bunch of things that we had taken for granted while undertaking a transition of this complexity and scale. Everyone agreed to migrating to Kubernetes, but only superficially. It's not that teams did not want to explore Kubernetes, but as the teams got closer to execution, they were not able to clearly see value in migrating. The path to getting value involved migrating the 20 most critical microservices by five teams that were also working on new features.

00:18:15

It doesn't matter whose blessings are behind the project. It was impossible to get any of our teams to do anything unless they were clear about the benefits. Like I said earlier, the first version of our Kubernetes setup was just as good as our EC2 setup. The risk of change was too high and the value unclear. Even exciting, cool new tech was not enough motivation to migrate. A significant learning here was that a significantly better platform that delivers some value, whether more reliability, better dev experience, better observability, better CI/CD or whatever, is a better incentive to increase intrinsic motivation and take risks. The debate will always be about balancing value versus risk. We expected the teams to learn Kubernetes using popular online resources, which we had shortlisted, while delivering product features, which usually get a lot more attention than a migration project. To support them, we offered office hours.

00:19:13

But that didn't seem to be effective. The absence of adequate migration and post-migration support introduced fear of what might happen when something goes wrong. We feel we could have done a much better job here of offering proactive support to help with the migration. Even though we were confident in Kubernetes overall, there were things like service discovery, configuration management and immutability of infrastructure for which we had to figure out how they would work for us. That made us run parallel stacks in production, which brought another level of complications in terms of technical decisions and operational challenges. We had to support existing tooling to deploy changes to both stacks, and then be able to roll back from both stacks as well. Monitoring became complex as well.

00:19:56

When we started using Kubernetes, AWS did not offer a managed Kubernetes service in our region, so we had to self-manage a cluster. It was a nightmare. Getting started may not be as hard, but the real problems show up when you deploy workloads. We had issues ranging from the cluster not auto-scaling at the right time to overlay networking issues to high DNS latencies. We have seen many painful issues, especially with networking, because our team is not an expert in networking and we didn't want to become that either. So we always recommend going for a managed Kubernetes service. But even with a managed Kubernetes service, you have to worry about upgrades. The community is moving really fast, and cloud providers are also deprecating old versions very fast. Cost of workloads becomes hard to measure as well. Like I said earlier, the space is still maturing, and existing cloud cost tools just don't work.

00:20:45

And you're left scratching your head at the end of the month: why does this cost so much more? Infrastructure right-sizing would require you to look at tuning resource requests and limits, concepts you did not have to think of before. Again, it's a different paradigm. We do see some benefits, like running pretty much our entire environment on spot instances, but then there are some challenges on the point of running spot instances in production. I say that with a disclaimer that there are gotchas with that. Security and governance is a different game altogether in the case of Kubernetes. The attack vectors increase significantly, but you also have a lot of control through automation. So again, embracing the new paradigm is really important to be effective. Speaking of embracing the new paradigm, again, the concepts of CRDs, operators, controllers, admission controllers and mutating webhooks,

00:21:36

etc., are important concepts to really push your Kubernetes platform to provide a superior experience. These concepts allow platform engineers and operators to build abstractions of existing processes, to make infrastructure management more declarative, and to streamline operations for developers. We feel that Kubernetes is great, but it's even better when you're using these features to simplify daily operations and provide a more integrated experience to your developers. We use CRDs to turn some of our workflows into a more declarative style of infrastructure management. Today, we have a handful of use cases specific to our setup that we have abstracted using CRDs and custom controllers, and many governance and security policies built using admission controllers and mutating webhooks that make the lives of product and platform engineering teams easy. With these abstractions built into Kubernetes, developers get a more consistent user experience.
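
As a rough illustration of what a mutating admission webhook does, the sketch below builds the `AdmissionReview` response for a request and attaches a base64-encoded JSON Patch. The policy itself is an assumption made up for this example (default a `team` label to `unassigned`), as are the function names; the talk does not describe the specific policies in use. Serving this over HTTPS and registering it with the cluster are left out.

```python
import base64
import json


def build_patch(obj: dict, default_team: str = "unassigned") -> list[dict]:
    """Return JSON Patch operations that add a 'team' label when it is missing."""
    labels = obj.get("metadata", {}).get("labels")
    if labels is None:
        # No labels map at all: create it together with the default label.
        return [{"op": "add", "path": "/metadata/labels", "value": {"team": default_team}}]
    if "team" not in labels:
        return [{"op": "add", "path": "/metadata/labels/team", "value": default_team}]
    return []


def review_response(review: dict) -> dict:
    """Build the AdmissionReview response for an incoming AdmissionReview request."""
    request = review["request"]
    response = {"uid": request["uid"], "allowed": True}
    patch = build_patch(request["object"])
    if patch:
        # The API server expects the patch as base64-encoded JSON Patch.
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}


# Hypothetical request for an object that has no labels yet.
incoming = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {"uid": "example-uid", "object": {"metadata": {"name": "demo"}}},
}
outgoing = review_response(incoming)
decoded = json.loads(base64.b64decode(outgoing["response"]["patch"]))
print(decoded)
```

A validating policy would follow the same shape but set `allowed` to `False` with a message instead of returning a patch, which is one way governance rules can be enforced without developers changing how they submit manifests.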

00:22:31

That's the beauty of it. So our philosophy is to manage as much as possible using kubectl to give our developers a simple and consistent user experience. While I've shared what we have learned from our journey, and there's a lot of value in doing all of this, I want to remind you to ask yourselves if all this is necessary for your business right now. Do you have the liberty to make a long-term investment, or are there more short-term priorities that you must invest in first? Can you find a middle ground with your existing stack instead of delaying value by making an investment this big? That's a decision that you as technical leaders must make. So where are we going now? The platform must evolve. In fact, it is still evolving for us. We treat Kubernetes as a product for developers, so we are always out there to discover new use cases and unmet needs.

00:23:20

These needs change and will continue to change in the local context of teams and the global context of Grofers, and so will the solutions. There's really no one good answer for a platform, only one that works today in your organization's context. As for us, we have built a multi-year Kubernetes roadmap where the architecture of applications and of Kubernetes itself is changing together. We are building on top of the concept of golden paths that solve for 80% of the use cases out of the box while leaving flexibility for developers to choose for the remaining 20%, and also for power users who would like to actually go and explore more ways of doing the same thing, to learn and to build technical expertise. And we're trying to bake resilience and security into all of this. We are excited for what is coming next. That was our journey. My name is Vaidik. You can find me on Twitter, LinkedIn and Medium, and I'd love to take questions.