Scaling Enterprise BizOps by Automating DevOps Practices

Imagine you have to scale operations to cater to more students than the largest education board in the country. Imagine the company is growing at breakneck speed, adding more services while increasing its customer base. Imagine that a pandemic hits and the scale has to increase even further.


That scenario became reality a year ago.


Since then, we have scaled BizOps seamlessly, mainly because of the automated Enterprise DevOps practices that we followed. The scaling was not restricted to the systems alone: the Dev team doubled, and automated DevOps made it easy for the teams to build, test, and deploy, which at the peak reached 800 new builds a week.


This talk shares an Enterprise BizOps story of how DevOps prepared while the weather was fine, anticipating rainy days. It has been an arduous and overwhelming journey with many learnings, including our failures and the small incremental successes that led to the scale BYJU'S is at now: the world's largest EdTech company.

Prashanth 'Praz' BN

Asst. Vice President of Technology, BYJU'S

Akash Mahajan

Founder/DevSecOps Team Lead, Appsecco

Ramesh Karra

VP Business Development, BYJU'S

Transcript

00:00:13

Hello, everyone. I hope you are all doing well in these tough times. We are excited to be here at the DevOps Enterprise Summit 2021. Today I will talk about how we scaled enterprise business operations by automating the majority of our DevOps practices while taking care of [unclear] and data security. Along with me, I have Akash Mahajan, who is the CEO and founder of Kloudle and was the lead cloud native security architect, and Ramesh Karra, the VP of GTM and business advisor, who has helped us understand the business needs. My name is Prashanth BN, but I go by the name Praz. It was my team who was tasked with the scaling of business operations during the massive global danger that we are facing right now. It is my privilege and honor to share our learnings and approach with all of you today. I especially want to mention [name unclear], who guided us with empathy for us to get here in front of you. In the interest of time, I will lead the presentation and hand over to Ramesh and Akash wherever necessary. Over to Ramesh. Yeah, [unclear] the largest market, right? And the reason is quite simple: it has the largest growing population in the world, with close to 60 million students in India. But the quality of learning is very much [unclear]; we still rank [unclear].

00:01:41

[unclear] The reason, I believe, is basically that there are not enough high-quality teachers. [unclear] content, and the right access, is a big challenge. Obviously, with [unclear] broadband penetration and smartphone penetration, there is a great opportunity to address this. Let's go back to April 22. This was an unfortunate time, but it was also a very critical time for us. What happened was that India was put under lockdown, as were a lot of countries across the world, right? [unclear] the pandemic was pretty intense. And apart from the fact that everybody was confined to their homes, one of the key things that happened was that schools were shut and students were stuck at home. [unclear] That is when BYJU'S decided that we had to step in; it was more of a responsibility. Looking at the mission going forward, we did a number of things. One was, of course, that we have the app, which had access for some number of people, but we opened it up for pretty much [unclear].

00:02:55

It was an after-school solution for students. In this case, it pretty much became the [unclear] solution, given that schools were pretty much shut, [unclear] so that students get access to online classes from the best teachers. But at the same time, the organization was confined. Everybody was sitting at home, and pretty much this entire mission had to be done remotely. It challenged all the stakeholders, but especially DevOps. DevOps was pretty much the backbone that had to ensure that this mission could come to life. That's pretty much the context. I'll hand it back to Praz to walk you through how we addressed it. Thank you, Ramesh, for sharing the business perspective with us. Now all of you have an idea of the enormity of making this happen.

00:03:45

Here's what we figured we would do. We were clear from the beginning that these are our three pillars. From our frame of reference, everything that we do is bound within these three pillars. The pillar of data security meant that all aspects of DevSecOps, security alerts, preparing for an incident in case one happens, how we react when the bad stuff happens, and benchmarking against global standards were encompassed within the data security pillar. The pillar of continuous delivery meant automating all of the builds and deployments to servers without humans manually running commands on servers, with logs of all of these builds available to all the developers. Docker as our application container was also made available because of continuous delivery. The pillar of continuous integration meant approval-based merges for production, semantic versioning for all the applications and artifacts, verifiable releases by following the entire build pipeline with integrity checks, and monitoring the entire pipeline for its performance and build time.
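The talk does not say how the integrity checks were implemented. As a minimal sketch of the general idea, assuming a SHA-256 digest is recorded at build time and compared again before deployment (the function names and sample bytes are hypothetical):

```python
import hashlib
import hmac


def sha256_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_digest: str) -> bool:
    """Compare the artifact's digest with the one recorded at build time."""
    # compare_digest avoids leaking where the two strings differ.
    return hmac.compare_digest(sha256_digest(data), expected_digest.lower())


# Hypothetical artifact content, standing in for a real build output.
artifact = b"example build output"
recorded = sha256_digest(artifact)  # stored alongside the release

print(verify_artifact(artifact, recorded))
print(verify_artifact(artifact + b" tampered", recorded))
```

A deploy job could refuse to continue whenever `verify_artifact` returns `False`, which is one simple way to make a release verifiable end to end.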

00:04:52

This was enabled through the continuous integration pillar. Before we move further: I believe in show-and-tell, so here are the results, as a simple customer-focused metric. The impact of our work meant that 12,000 students could simultaneously attend live teacher-led classes during the nationwide lockdown, with all the schools shut. I'm not sure how big the schools are in Europe, but for us, this is equal to about 10 full-fledged schools running. When the weather was actually fine, we were preparing for rainfall. But what we did not expect was a massive thunderstorm called COVID that took over all of our lives in the last one year. We were never planning to go fully remote as an engineering team, but we had to do that, and this was unexpected. We were used to using tools like Slack, Hangouts, Google Meet, Zoom and all that, but only when working across geographies or when we couldn't travel.

00:05:48

We had to move our entire sales team of a thousand people from a feet-on-street model to a work-from-home model. The entire business operations team had to use digital tools that we had to develop and deploy during the pandemic, at this massive scale, while dealing with a completely new way of working. [unclear] the senior leaders have had the experience. The average age of our engineers was 25 years, and this was the first time ever that they would be without mentors, or someone onboarding them, physically present with them. This was a new challenge for us.

00:06:27

And the way we did this is by templatizing the pipeline. When anyone new joined, it was easy to get their code merged into the development branch. Once the code was merged by the tech lead, the developers could test their code. That development environment was maintained with auto-generated data: we used custom scripts to push the data into the development databases so that no PII existed there, and the environment acted as a testing ground for new features. Once the developers had tested their code in this environment, they could request it to be merged to the main branch. Code merges to the main branch could be done by multiple teams who had access to these repositories. Code merged to the master branch triggered a build on the staging environment, which was meant for automated testing by the QAs. Once a release tag was added to the main branch, the code deployment was triggered with the semantic version of the code deployed. Tags could also be added to a particular release for later use.
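The release-tag step can be illustrated with a small sketch. It assumes tags of the form `v1.4.2` (the `v` prefix and the function names are assumptions, not something stated in the talk) and shows how a pipeline might decide whether a tag is a valid semantic version that is newer than what is already deployed:

```python
import re
from typing import NamedTuple, Optional

# Assumed tag format: optional "v" prefix, then MAJOR.MINOR.PATCH.
SEMVER_TAG = re.compile(r"v?(\d+)\.(\d+)\.(\d+)")


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_release_tag(tag: str) -> Optional[Version]:
    """Return the Version for a well-formed release tag, else None."""
    match = SEMVER_TAG.fullmatch(tag)
    if match is None:
        return None
    return Version(*(int(part) for part in match.groups()))


def should_deploy(tag: str, deployed: Version) -> bool:
    """Deploy only for a valid tag that is newer than the deployed version."""
    version = parse_release_tag(tag)
    # Tuples compare numerically field by field, so 1.10.0 > 1.9.3.
    return version is not None and version > deployed


print(should_deploy("v1.10.0", Version(1, 9, 3)))
```

Comparing versions as integer tuples rather than strings avoids the classic mistake of ranking `1.9.3` above `1.10.0`.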

00:07:24

Alerts on Slack would ensure that the deployed code could be quickly tested on all the various environments, so we were not affecting the day-to-day operations. With this approach, we were able to achieve more than 500 deployments in a week at the peak. If you remember, for us the first pillar is about data security, so I am handing over to Akash to take us through this. Thanks, Praz. We realized that if COVID goes on for some time, everything is going to scale, and scale even more, and the time to solve for security was at the beginning, when we were just embarking on this. We came up with some principles, or guardrails if you will, on how we proceed. Relying on globally accepted standards for application security, like OWASP, and for cloud security, like the CIS benchmarks, we felt was the way to go.

00:08:19

And by adding a step of approval to the pipelines, the team leads and managers felt reassured that they won't get nasty surprises, that suddenly something will break in prod. So they had that. We were already very big on automation, so creating a hard boundary between prod and the rest of the entire infrastructure was a no-brainer. This allows us to reduce the attack surface and limit the blast radius if things do go horribly wrong. We additionally deployed a centralized log solution based on Elastic Cloud, with data masking and access controls, to make sure production environments don't require the ops to remotely log in to manage anything. Most of what we did, we documented in a Markdown-based knowledge base, using continuous delivery, so that was fun. And the important thing to remember is that these are principles that we want to be guided by, but we don't want to be too dogmatic and let them become a blocker.

00:09:24

Right? So keeping in mind the principles and being pragmatic, we designed the pipeline to be a simple release-based model. The primary interface for engineers was access-controlled GitHub commits, with pull requests that team leads approved and merged once satisfied. Anyone can make a release to staging by using a specific tag, and this automatically triggers a build, right? And this would be the non-production Jenkins. Once an app has been staged and the team is happy, it can flow to production, right? That's the approval step here. We were deploying to AWS Fargate after the QA was done. The way we were doing it is by creating an artifact of the already running application and writing it to a secure S3 bucket, and this particular Jenkins only had access to write, with no listing.
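The exact permissions used are not given in the talk. As a hedged sketch of what "write, with no listing" could look like, here is an IAM-style policy document for a hypothetical bucket named `example-release-artifacts`, plus a deliberately simplified checker (it only looks at `Allow` statements and ignores resources, conditions, and `Deny`, so it is not a substitute for IAM's real evaluation logic):

```python
import json
from fnmatch import fnmatchcase

ARTIFACT_BUCKET = "example-release-artifacts"  # hypothetical bucket name

# Write-only: objects can be uploaded, but nothing can be listed or read.
WRITE_ONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{ARTIFACT_BUCKET}/*",
        }
    ],
}


def allows(policy: dict, action: str) -> bool:
    """Simplified check: does any Allow statement name this action?"""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        # Action names are matched case-insensitively, with * wildcards.
        if any(fnmatchcase(action.lower(), p.lower()) for p in actions):
            return True
    return False


print(json.dumps(WRITE_ONLY_POLICY, indent=2))
print(allows(WRITE_ONLY_POLICY, "s3:PutObject"))
print(allows(WRITE_ONLY_POLICY, "s3:ListBucket"))
```

With a policy shaped like this, a compromised build server could add artifacts but could not enumerate or download what is already in the bucket.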

00:10:27

That was done, I think, with IAM permissions. Once the artifact was fully copied to S3, this triggers an event to start a job in the production Jenkins, and no developer has access to this. You will note that while the applications are running on a managed service, Fargate, we chose to keep the Jenkins servers running on EC2. It works without hitches, and there is a ton of documentation we can refer to if required. This is us being pragmatic here. The important thing to highlight is that all the secrets required for production were being added by the pipeline during the job. So we would have [unclear], and the CI server would interact with the secrets manager service, which is a cloud native thing, and figure out the dev and test secrets. Once you have the artifact...

00:11:19

And now the CI server is going to get the production secret, right? It has access to that, it will put it in the application, and the container will run in prod. Thanks, Akash. Those are some great takeaways. Moving on: for us, the real triumph of the shared culture, automation, and monitoring aspects of DevOps was that when the business was geared to scale massively, our team not only delivered, and continued to deliver, all the apps that were already working as before, but was also able to release and massively scale new applications without having to train the developers on a new way of development. You see, on the one hand, the business was trying to launch new features, build new applications, and scale all the applications that existed. On the other hand, our development and DevOps team ensured, through the built-in efficiency in the system, the continuous monitoring, and the shared knowledge that was already there, that all of these could happen without disrupting any of our existing applications.

00:12:28

This also meant that people were not stressed. In fact, by not having to commute to the office, many of them experienced a more fulfilling workday. Imagine a hundred-plus engineers solving for a higher purpose of education for all, well-rested, staying secure in their homes, enjoying building applications for the future, and not getting bogged down by the toil that gets introduced as a security measure or by the heartburn of DevOps transformations. I personally didn't lose any sleep, because the transition was so smooth for us. I'll hand over the security aspects to Akash again. The approach that Praz and the business have towards making sure all their teams are taken care of is a great example to emulate. I love working with these folks for these reasons. Now that we are past questions such as: how do I transform my business digitally?

00:13:21

What role would COVID play in that transformation? I think COVID basically accelerated digital transformation, where every month we are talking about security around that. We have seen that going digital and leveraging the public cloud has become a baseline for everyone who is not limited by industry or regulatory compliance. But at the same time, COVID has stress-tested the security of the best of security companies. At Kloudle, we ended up working with a bunch of customers who were compromised when their privileged users started working from home, or where remote IP-address-based security wasn't really a reality anymore. And this is where the maturity of the BYJU'S BizOps team actually showed through, because of the focus on data security and automation. Our level of maturity extends to something as commonplace as an SSH login to a production server, which is sometimes required for production troubleshooting.

00:14:22

When someone logs in using SSH on a Linux server, it triggers an alert for the team, and the person who needed to SSH becomes part of a mandatory root cause analysis process. When everyone is dealing with massive change, this allows us, the security team, to educate everyone and make them aware of their responsibilities and the level of risk they may be facing without realizing it themselves. This is a great example of what we managed to do as part of the cloud transformation. We envisioned the security team to always be an enabler instead of a blocker. When employees are well-rested, they not only perform better but are more likely to adhere to the processes that keep data secure. This reduces their potential stress levels, and all of us can do with that in these times.
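How the SSH alert was wired up is not described in the talk. One possible sketch, assuming the alert is derived from standard OpenSSH `Accepted ...` syslog lines and sent as a JSON payload with a `text` field (the host name, user, and message wording below are made up for illustration, and nothing is actually posted anywhere):

```python
import json
import re
from typing import Optional

# Matches a syslog-style sshd line for a successful login.
SSH_ACCEPTED = re.compile(
    r"^\w{3}\s+\d+\s[\d:]+\s(?P<host>\S+)\ssshd\[\d+\]:\s"
    r"Accepted\s(?P<method>\S+)\sfor\s(?P<user>\S+)"
    r"\sfrom\s(?P<ip>\S+)\sport\s\d+"
)


def build_ssh_alert(log_line: str) -> Optional[dict]:
    """Return an alert payload for a successful SSH login, else None."""
    match = SSH_ACCEPTED.match(log_line)
    if match is None:
        return None
    text = (
        "SSH login on {host}: user {user} from {ip} via {method}. "
        "Please record the reason for the follow-up review."
    ).format(**match.groupdict())
    return {"text": text}


# Hypothetical log line in the usual OpenSSH format.
sample = (
    "May 10 09:15:02 prod-app-1 sshd[2114]: "
    "Accepted publickey for deploy from 203.0.113.7 port 52144 ssh2"
)
print(json.dumps(build_ssh_alert(sample)))
```

Failed logins and unrelated log lines return `None`, so only successful sessions, the ones that need a follow-up conversation, produce an alert.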

00:15:13

And this is what Praz had to say when we were able to achieve all this: "Kloudle monitored our entire cloud infra and provided a full view of security, and enabled us with critical alerts that required immediate attention. Reducing alert fatigue is a huge thing. Our engineering team grew 10X, but we still follow the straightforward GitHub release patterns while the security magic happens natively." That's a great testimonial for me, validating us being enablers instead of blockers. We always envisioned the security team to be an enabler instead of a blocker. We did this by building a continuous monitoring stream with automation, to ensure that it didn't matter where the admin was sitting while doing the work, because [unclear] a stream of data. We reduced toil by having issues added to the issue tracker automatically: if security issues were being reported, they were added to the vulnerability management tracker automatically. This really helps with reducing the grunt work. We were also updating Slack when a build got deployed. By bringing a lot of these moving parts of operations, including issue tracking, to a single pane, we enabled discussions and conversations within the team's Slack. While everyone may or may not participate, there is a culture of openness.

00:16:43

And new members of our team feel that they can contribute to discussions as soon as they're ready, because it's all happening in front of them. What I want to highlight here is that there's a lot of culture, automation, and sharing that we paid attention to. We are seen as enablers, unlike the traditional sentiment about security teams being gatekeepers and just making everything difficult for everyone. As part of the security aspect, we had to basically continuously monitor all the native services for security. The reason to do that is that we needed visibility, because without visibility, how would we know: do we have data security? Do we have it now? The cloud is a big, big API, right? So does it have the security now, or has someone made a change? This is problematic.

00:17:34

So we want to answer questions like: do we have EBS, Elastic Block Store, volumes with encryption? Because if we don't, then we may not be compliant. In some cases we ended up migrating older, legacy unencrypted EBS volumes by automating the copying of the volume snapshots. Do we have incident response readiness? Are we ready to face security incidents? Is the process that we have based on the incident response guidance provided by AWS or the NIST Cybersecurity Framework? They can overlap. Security alerts: we need visibility for that. What happens if a developer disables [unclear] for the AWS account? Do we get alerted? Should we get alerted? Is the alerting working? Do we get alerted if an S3 bucket becomes public, or if a new security group is open to the world with a management port available? We also wanted to do some automated remediation using CloudWatch Events.
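Two of these questions can be sketched as pure functions. The dictionaries below mirror the shape of the EC2 `DescribeSecurityGroups` and `DescribeVolumes` responses (`IpPermissions`, `IpRanges`, `Encrypted`), but the sample data, the choice of ports 22 and 3389 as "management ports", and the function names are assumptions for illustration, not the team's actual checks:

```python
MANAGEMENT_PORTS = {22, 3389}  # assumed: SSH and RDP


def world_open_management_ports(group: dict) -> list:
    """Return management ports that a security group exposes to the world."""
    exposed = set()
    for perm in group.get("IpPermissions", []):
        open_v4 = any(r.get("CidrIp") == "0.0.0.0/0"
                      for r in perm.get("IpRanges", []))
        open_v6 = any(r.get("CidrIpv6") == "::/0"
                      for r in perm.get("Ipv6Ranges", []))
        if not (open_v4 or open_v6):
            continue
        protocol = perm.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and all ports
            exposed |= MANAGEMENT_PORTS
            continue
        if protocol not in ("tcp", "6"):
            continue
        low, high = perm.get("FromPort"), perm.get("ToPort")
        if low is None or high is None:
            continue
        exposed |= {p for p in MANAGEMENT_PORTS if low <= p <= high}
    return sorted(exposed)


def unencrypted_volume_ids(volumes: list) -> list:
    """Return the IDs of volumes that are not encrypted."""
    return [v["VolumeId"] for v in volumes if not v.get("Encrypted", False)]


# Hypothetical sample data in the API response shape.
group = {"GroupId": "sg-example", "IpPermissions": [
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]}
print(world_open_management_ports(group))
```

Keeping the checks free of network calls means they can be unit tested offline and then fed with live API responses from whatever event-driven job performs the monitoring.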

00:18:37

And this was great because we ended up doing this completely cloud native, with CloudWatch Events, Lambdas, and Fargate-based containers. What are we remediating? Disable console access of a user if [unclear]; make a public S3 bucket private if it doesn't have a specific kind of tag that we wanted. So primarily we're monitoring for security, and we wanted to focus on that. That's why we were doing it the cloud native way, and not with anything else readily available in the market. We wanted to enforce security processes. You can call them guardrails or guidelines, whatever it is, which have to be enforced at runtime during operations. Can we make sure that no virtual machine can come up without IMDSv2, which prevents problematic attacks against applications, like server-side request forgery?
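The IMDSv2 guardrail can be sketched as a check over instance descriptions. The `MetadataOptions`, `HttpTokens`, and `HttpEndpoint` fields follow the EC2 `DescribeInstances` response shape; the function name, the sample instances, and the defaults used when a field is missing are assumptions, and the talk does not say how the team enforced this:

```python
def violates_imdsv2_guardrail(instance: dict) -> bool:
    """Return True if the instance still accepts IMDSv1 requests."""
    options = instance.get("MetadataOptions", {})
    # With the metadata endpoint disabled there is nothing to protect.
    if options.get("HttpEndpoint", "enabled") == "disabled":
        return False
    # IMDSv2-only instances require session tokens on every request.
    return options.get("HttpTokens", "optional") != "required"


# Hypothetical instances in the API response shape.
instances = [
    {"InstanceId": "i-aaa", "MetadataOptions": {"HttpTokens": "required",
                                                "HttpEndpoint": "enabled"}},
    {"InstanceId": "i-bbb", "MetadataOptions": {"HttpTokens": "optional",
                                                "HttpEndpoint": "enabled"}},
]
flagged = [i["InstanceId"] for i in instances if violates_imdsv2_guardrail(i)]
print(flagged)
```

A remediation job could take the flagged IDs and either alert on them or switch them to token-required mode, depending on how strict the guardrail is meant to be.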

00:19:28

It's a huge thing, a huge challenge. We wanted to do coding standards checking in the pipeline. We want to see if a dev is committing secrets in the wrong place and they are getting committed to GitHub; it's a problem because people are really pushing [unclear]. Vulnerabilities discovered go to open source vulnerability management software like DefectDojo, and we used tags to ensure that resources had proper metadata or they were removed. We also ended up doing a lot of regular security audits, whether it was the GitHub access control, the AWS scan for the cloud perimeter, the internal scans, or just auditing the privileges of IAM roles and policies regularly. The continuous part is the keyword. This got reflected in many lessons we learned. We also had to get past some unique constraints that required new thinking.
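The talk does not name the tool used to catch committed secrets. As a rough sketch of the idea, the snippet below scans the added lines of a unified diff for strings shaped like AWS access key IDs (`AKIA` or `ASIA` followed by 16 uppercase letters or digits). The function name is hypothetical, and a real scanner would cover many more secret types:

```python
import re

# Shape of an AWS access key ID: a 4-letter prefix plus 16 characters.
AWS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_leaked_key_ids(diff_text: str) -> list:
    """Return (line_number, key_id) pairs found in added diff lines."""
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only added lines matter; "+++" is the file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for match in AWS_KEY_ID.finditer(line):
            findings.append((number, match.group(0)))
    return findings


# The key below is the placeholder used in AWS documentation examples.
sample_diff = "\n".join([
    "+++ b/config.py",
    "+AWS_KEY = 'AKIAIOSFODNN7EXAMPLE'",
    "-OLD_KEY = 'AKIAIOSFODNN7EXAMPLE'",
    "+DEBUG = False",
])
print(find_leaked_key_ids(sample_diff))
```

Run as a pipeline step or pre-commit hook, a non-empty result would fail the build before the secret reaches the shared repository.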

00:20:28

In India, we encountered the fact that teams were spread across, what, 25-plus states, with varying levels of internet access and quality. The scale of this may be a bit difficult to imagine for the audience. Just for reference, [unclear] the entire European Union is just about 30% bigger than India: all those countries in the European Union, and the landmass is 30% bigger. So India is massive, and the internet connection, the quality of access where you are, whether you're on mobile 4G, all of that can have an impact. So using the old-fashioned way of doing access control at the network layer, the jump hosts, the multi-hop kind of bastions, which may require proprietary VPNs or software, was not really an option.

00:21:21

We wanted to move fast. Everyone is working remote. We deployed WireGuard, using infrastructure as code, whenever there was a need to provide access, so that we could say: for the database that someone needs access to, we don't know what static IP they will be coming from, but we allow access with WireGuard. We made sure that we catered to low-bandwidth situations for multiple different internal services and databases, NoSQL databases, and even a hosted Kafka, all provided with access control like this. Thank you, Akash. We continue to grow in terms of the number of engineers, with 40 added just in April, the applications being used by our business, and the number of deployments, which is ever growing. We've been in a unique position over the last one year. Our DevOps and the way we work with an automation-first approach have taught us that this movement has been truly transformational for us.

00:22:22

We are able to support the incredible goal of education for all while a lot of people are in crisis all around us. We feel that it is our privilege and our duty to support and share what we were able to. I'd like to thank Akash and Kloudle. They have provided us with the ability to monitor and take action without having to change how we do things. This makes it easy for us to teach new team members, and that has proven to be the biggest advantage that we have got out of them. Thanks to Ramesh for sharing the business need and enlightening us with the thought process behind the business decisions that were taken in the last one year. Thanks to the organizers for giving us this big opportunity to present in front of all of you. Thank you, everyone, for sticking around. We'll be happy to take questions on the Slack channel. See you all there.