DevOps Story of a Crisis & Conquest (US 2021)

The ongoing pandemic hasn't just challenged the health and movement of people worldwide; its sheer unpredictability has also taken a toll on businesses, their strategies and their technical delivery. The Platform team at STARZPLAY, one of the largest video streaming providers in the MENA region, has some bittersweet memories and lots of learnings from this time. On one hand there was a sudden surge in viewership of almost 3 times (fueled by the lockdowns, as people started binge watching), which was a good sign for the business. On the other hand there were the challenges of scaling up the infrastructure while keeping software delivery schedules intact. Adding to the woes were spikes in cyber attacks on the service, aimed at exploiting the vulnerable times, and the challenge of equipping the organization with remote-working capabilities. This talk is a true story about the challenges presented to an engineering team amid the COVID-19 pandemic and how DevOps principles guided them out of troubled waters. The lessons learnt will go a long way in setting the right organizational culture to overcome a crisis.

breakout, us, las vegas, vegas2021

Prasanjit Singh

Engineering Manager (DevOps), STARZPLAY

TRANSCRIPT

00:00:12

Hello and welcome to the DevOps Enterprise Summit 2021. My name is Prasanjit Singh, and today I'll tell you a DevOps story of crisis and conquest. This is an account of the challenges faced by a cloud infrastructure team serving millions of customers in the face of the pandemic. We'll also talk about the lessons that this crisis taught us and how DevOps principles helped us overcome these challenges. So before we go on with the story, let me quickly introduce myself. My name is Prasanjit Singh and I am the director of cloud and DevOps practice at STARZPLAY, along with platform engineering. I'm passionate about learning and teaching and have trained over 20,000 students at Coursera and other tech platforms. And with my team at STARZPLAY, I am responsible for deployments and site reliability. Now, STARZPLAY is a video-on-demand service that streams movies, TV series, documentaries, kids' entertainment and live sports for 20-plus countries across the Middle East and North Africa region.

00:01:44

And we also support other OTT platforms across the Asia Pacific region. In addition, we have a live subscriber base of more than 2 million and apps installed on more than 10 million devices worldwide. So that brings us to more than 50,000 requests ready to bombard the platform every minute, around the clock. And that is a large number. Now, let me tell you a story, or rather you can call it a diary account of an infrastructure engineer, with the incidents that occurred over a month after the pandemic broke out in 2020. So here we go. 11th March 2020: World Health Organization declares the COVID-19 outbreak a pandemic. Well, the WHO on March 11 declared COVID-19 a pandemic, pointing to over 118,000 cases of the coronavirus illness in over 110 countries and territories around the world and the sustained risk of further global spread.

00:03:05

Well, "this is not just a public health crisis. It is a crisis that will touch every sector." I quote this from Dr. Tedros, the WHO Director-General, at a media briefing. So, he said, every sector and every individual must be involved in this fight. Well, that's not something you hear every other day. This was happening, and this was real. We were unsure how this would affect us as a company. Being from the media and entertainment industry and serving the traffic, the servers, the infrastructure, we were not sure what to expect. We knew that we would have to be careful health-wise, but we had no clue we would have to scale up our servers as well. 16th March 2020: half of the employees are asked to work from home. Looking at the severity of the spread, we decided to have 50% of the workforce at the office and the other 50% start working from home.

00:04:20

So that would reduce the chances of the spread by half, at least for the employees in our office. That's how we thought about it then. And with this announcement, we, the infrastructure engineers and DevOps advocates in the company, found ourselves in the limelight to help maintain the connectivity and the communications that would hold the organization together during this great corporate dispersal. We had our deployment pipelines in place and DevOps tools for our software development life cycle processes. And it was then that we realized the value of these more than we ever did before. The productivity and the continuous delivery of services by IT departments through the crisis rested on our shoulders, in addition to aligning and syncing developer initiatives with operations cadence. Everything was on us. And that reminded me of the DevOps principle of end-to-end responsibility.

00:05:39

So as a practitioner of DevOps, you have to be responsible for everything. You can call yourself a jack of all trades, and you have to do it. But we had no clue yet of what was about to come. However, we just reacted: prepared our VPN servers for the half of the people who were going to work from home, scaled up our user accounts for Zoom and Google Meet, and rejigged our layouts to highlight binge-watching of movies like I Am Legend, Virus and Contagion, because these were in high demand in those times. But then comes 4th April 2020: nationwide lockdown declared. By now, the pandemic was wreaking havoc in Spain and Italy especially, with a huge increase in numbers from all over the world. And on 4th of April, a nationwide lockdown was declared and the entire city of Dubai came to a standstill too. We were all locked in our homes, and our work had to keep going. A hundred percent work from home for a media and entertainment house means complete dependence on the technology team.

00:07:09

So that's when we had to roll up our sleeves and get into action. And in this time of crisis, DevOps practices really came in handy. I'll tell you how. Well, we have a DevOps principle: automate everything you can. As organizations sent employees home to work, it suddenly became a lot more apparent how many manual processes are at work within the organization. We realized that, in spite of having DevOps pipelines and automations in place, there were lots of places where manual processes were still at work. We were dependent upon people being at their desks, running certain scripts, and being able to talk to each other directly to work out those challenges. The more methods you have within your organization that are not automated, that require manual intervention, that may even require a physical presence, the more pain you're going to experience during this rapid shift to the new work paradigm, even if that shift is temporary in nature. So in order to reduce this pain, the only option was to move to a DevOps paradigm, and that is to put people, processes and technology to work in order to eliminate these manual processes, manual bottlenecks, and the need for a team to be sitting in the same room in order to deliver functional software and safe database changes.

00:09:12

Well, think of automation not only as a software development process (continuous delivery, including continuous integration and continuous deployment) but also as covering the whole infrastructure landscape, one that allows infrastructure to be versioned and treated as code as well. To automate a process, it needs to be converted into basic code. All processes normally actioned through the console have to be transformed into a series of API calls and script run commands. So that's what we did. Whatever was being done by people in the console was converted into scripts, if it wasn't already. That helped reduce the manual tasks and make things automated. So when people worked from home, they could conveniently just trigger the jobs, so there were fewer dependencies on team members. Even though they were dispersed, the processes were in place and they proceeded as desired.
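
As an illustration of turning a console task into a script, here is a minimal sketch. It is not STARZPLAY's actual tooling: `FakeCloudApi`, `add_web_server` and the server name `web-04` are hypothetical, and the in-memory class only stands in for a real cloud provider API so that the example runs anywhere.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeCloudApi:
    """In-memory stand-in for a cloud provider API (hypothetical)."""
    servers: Dict[str, dict] = field(default_factory=dict)
    calls: List[str] = field(default_factory=list)

    def create_server(self, name: str, size: str) -> None:
        self.calls.append(f"create_server({name}, {size})")
        self.servers[name] = {"size": size, "in_lb": False}

    def attach_to_load_balancer(self, name: str) -> None:
        self.calls.append(f"attach_to_load_balancer({name})")
        self.servers[name]["in_lb"] = True


def add_web_server(api: FakeCloudApi, name: str, size: str = "medium") -> None:
    """Scripted version of a task someone would otherwise click through in a console."""
    if name in api.servers:
        # Idempotent: re-running the job does not create a duplicate server.
        return
    api.create_server(name, size)
    api.attach_to_load_balancer(name)


api = FakeCloudApi()
add_web_server(api, "web-04")
add_web_server(api, "web-04")  # safe to trigger again, e.g. from a remote job runner
print(api.calls)
```

Because the steps live in one function, anyone on the team can trigger them from a job runner at home, and the order of the API calls no longer depends on a person remembering it.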

00:10:21

So with these things, slowly everything got settled, and we were getting used to the new way of working. By now people had started calling this new way of working the new normal. That brings us to 6th April 2020. That was when we decoupled our deployments and releases. Now, this is a lesson that we learned from certain challenges that we faced. Let me tell you more about it. There's a DevOps principle that calls for continuous improvement, and in a bid to improve our productivity in this newfound scheme of things, we made certain changes to our delivery pipelines. The concept of decoupling deployment from release is a key thing for any DevOps team to be aware of, and it is important to understand how feature flags can make that possible. Before we delve in a bit more, let's start by looking at what decoupling is.

00:11:39

Well, deployment is pushing your code into some part of the infrastructure, and release is exposing that code to execution. The ability to decouple deploy from release means that you're able to push code anywhere without exposing the code, and therefore without impacting your users. This then allows you to gradually release the new feature to assist in internal testing, dogfooding and a progressive rollout. But what is most impressive is that, if done correctly, you can compare the health of the system, its metrics and user behavior between the users who have access to the new feature and the users who do not yet have access to it, and thus learn much sooner if there are any issues. So this gives you the ability to roll out features faster and get them tested by the live users themselves, because you're not exposing the entire user base.
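
The comparison between users with and without the new feature can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the request and error counts, and the 0.5 percentage point tolerance are all hypothetical, and a real setup would pull these numbers from its monitoring system and likely apply a proper statistical test.

```python
from typing import Tuple


def error_rate(requests: int, errors: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def rollout_is_healthy(
    control: Tuple[int, int],
    treatment: Tuple[int, int],
    max_increase: float = 0.005,
) -> bool:
    """Compare (requests, errors) of users without and with the new feature.

    Returns False when the treatment group's error rate exceeds the control
    group's by more than max_increase (an assumed tolerance).
    """
    return error_rate(*treatment) - error_rate(*control) <= max_increase


# Hypothetical numbers: 100,000 requests from users without the feature,
# 5,000 requests from the fraction of users who have it.
print(rollout_is_healthy(control=(100_000, 200), treatment=(5_000, 11)))
print(rollout_is_healthy(control=(100_000, 200), treatment=(5_000, 40)))
```

A failing check like the second one is the signal to stop widening the rollout while the rest of the user base is still unaffected.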

00:12:49

You're exposing a certain fraction of the user base. Now, feature flags are what make decoupling deployment from feature release possible. A feature flag, or feature toggle, is implemented as a function call that controls access to a particular piece of code. But unlike traditional compile-time flags, command-line flags or configuration file entries, feature flags operate on a user-by-user basis, not per server image, and are remotely controlled from outside the application, which means you can toggle a feature just by changing a property, without pushing code. So that is how we started working on this. There are many reasons one would want to do this. One I already mentioned: it makes deployments faster. It is also important if your DevOps team uses trunk-based development, where all code is committed to master at least once every day. Without decoupling, all the work-in-progress code would go live.
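
A feature flag evaluated per user can be sketched as below. This is a minimal illustration, not the flag system used in the talk: the flag name `new-player-ui`, the `FLAGS` dictionary and the helper names are hypothetical, and in practice the flag data would be fetched from a remote flag service rather than defined in the code.

```python
import hashlib

# Hypothetical flag data; assumed to be fetched from a remote service in practice,
# so it can change without a new deployment.
FLAGS = {"new-player-ui": {"enabled": True, "rollout_percent": 10}}


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for this flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, flags: dict = FLAGS) -> bool:
    """The function call that guards access to the new code path."""
    flag = flags.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    return bucket(flag_name, user_id) < flag["rollout_percent"]


def render_player(user_id: str) -> str:
    # Both code paths are deployed; only the flag decides which one is released.
    if is_enabled("new-player-ui", user_id):
        return "new player"
    return "old player"


print(render_player("user-42"))
```

Hashing the user ID keeps each user in the same bucket between requests, so raising `rollout_percent` only ever adds users to the new experience, and setting `enabled` to False acts as an immediate switch-off.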

00:14:05

So you should have feature flags, which you can toggle on and then switch off when required. And the second reason is to enable safe testing in production, so that your entire user base doesn't suffer. Well, we found that this worked very well for us in the given situation, and it resulted in getting code to production sooner, even if the whole team wasn't in the office working together. Then comes 8th of April, when we enabled another feature toggling mechanism. In software development, a feature toggle is a mechanism that allows code to be turned on or off remotely. Feature toggles are commonly used by product, engineering and DevOps teams for releases. So we did that, and it allowed us, again, to toggle on features quickly whenever we needed them. So that is the work we did. And one more thing we did to help the dispersed team function was to divide it into autonomous squads.

00:15:20

So we divided the teams into squads, each of which had one member with expertise in one particular area. For example, there would be a squad with one database admin, one person who is very well conversant with DevOps tools and pipelines, a person who is a developer, someone who can merge code, and so on. So every team would be self-sufficient, and even if they were working remotely, they would work in tandem with each other, so that not everyone had to jump into a call. Say, for example, you have a big team with 30 people: having everyone in sync is difficult when working remotely. Rather, have smaller teams with one expert from every area, and this squad becomes more functional and more productive than bigger teams. So that is another thing we did. Well, that brings us to 10th of April 2020. And because of this nationwide lockdown.

00:16:32

We found that most people were at home. By that time, two-thirds of the whole world was indoors, in fact. And as expected, the one respite for everyone was watching movies and web series, and that applied to adults and kids alike. This led to a huge surge in traffic. We were seeing more than three times what we see on a normal day, and this meant scaling our infrastructure, obviously: databases, caching systems, load balancers, everything that we have at the infrastructure level, because the traffic just grew three times and we were being hit by lots of requests, all organic requests. So that's when, again, the DevOps practice of infrastructure as code came to bail us out, and, being on the cloud, the elasticity of the cloud also came to the rescue here.

00:17:49

So the cloud's API-driven model facilitated the exchange between developers and system administrators. We used APIs and tools like Terraform and CloudFormation to rapidly scale up our systems, and we used Ansible to configure our systems on the go. Because we had this practice in place, we could take on this challenge of 3x traffic within a couple of hours and scale up to serve our customers well. Since infrastructure is mainly code, engineers are able to interact with the infrastructure, and so those infrastructures could be duplicated or updated continually as per requirement. We could ramp up our servers in no time to deal with the surge in traffic that we saw. And again, I would say bravo to DevOps. But that leads us to 11th of April, and this was exactly one month from when the worldwide pandemic was declared.
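
To make the scaling arithmetic concrete, here is a rough sketch of how a desired server count could be derived before feeding it into an infrastructure-as-code variable. The baseline of 50,000 requests per minute and the 3x surge come from the talk; the per-instance capacity of 2,000 requests per minute and the 30% headroom are assumptions chosen purely for illustration.

```python
import math


def instances_needed(
    requests_per_minute: int,
    per_instance_rpm: int,
    headroom_percent: int = 30,
) -> int:
    """Number of instances required to serve the load with spare headroom."""
    required = requests_per_minute * (100 + headroom_percent)
    return math.ceil(required / (100 * per_instance_rpm))


BASELINE_RPM = 50_000      # figure quoted in the talk
SURGE_FACTOR = 3           # traffic grew roughly three times
PER_INSTANCE_RPM = 2_000   # assumed capacity of one instance

before = instances_needed(BASELINE_RPM, PER_INSTANCE_RPM)
after = instances_needed(BASELINE_RPM * SURGE_FACTOR, PER_INSTANCE_RPM)
print(f"baseline: {before} instances, surge: {after} instances")
```

With infrastructure defined as code, a number like this becomes a one-line change to a desired-capacity setting that is reviewed and applied through the pipeline, rather than servers being added by hand.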

00:19:10

And that's when we saw that, as if this triple surge of traffic wasn't enough, we were now seeing malicious attempts on our servers, DDoS attacks, and unwarranted stress on our systems. Most of these threats intensified because of the opportunities that arose during the COVID-19 outbreak for hackers to try different things, because they already knew systems were under stress. However, having proper DevOps practices in place helped bail us out of this problem as well. DevOps focuses on examining the entire process. So we had this objective of monitoring and detecting troublesome areas of the process and analyzing the feedback from the team and end users to note recurring problems and improve quality. Having good observability practices in place let us quickly detect the threats and mitigate them. So we were able to solve those problems as well, with our monitoring dashboards and monitoring tools in place to quickly detect the attacks and block them at the platform to stop them from happening.
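
One simple way a monitoring tool can flag abusive traffic is a sliding-window request counter per client. The sketch below is an illustration only and is not the detection logic described in the talk: `RateMonitor`, the window size and the threshold are hypothetical, and real DDoS mitigation typically happens at the CDN, WAF or load balancer rather than in application code.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RateMonitor:
    """Flags clients that send more than max_requests within window_seconds."""

    def __init__(self, window_seconds: float = 60, max_requests: int = 100) -> None:
        self.window_seconds = window_seconds
        self.max_requests = max_requests
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, client_ip: str, timestamp: float) -> bool:
        """Record one request; return True if the client is over the threshold."""
        hits = self._hits[client_ip]
        hits.append(timestamp)
        # Drop requests that have fallen out of the sliding window.
        while hits and hits[0] <= timestamp - self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


monitor = RateMonitor(window_seconds=10, max_requests=5)
# Eight requests in eight seconds from one address (documentation IP range).
flags = [monitor.record("203.0.113.7", t) for t in range(8)]
print(flags)
```

The flagged addresses would then be passed to whatever blocking mechanism the platform provides, and the same counters can feed a dashboard so the spike is visible to the team.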

00:20:45

Okay, well, so that was how the month went, right from the time the pandemic was declared. And then we moved on, tested ourselves, and found ourselves in a much better place exactly one month into the pandemic, if I speak about 11th of April 2020. And now it has been more than one and a half years since then. Looking back, I am happy that we were able to sail through, and I'm here to share the story of our experiences. So in spite of the challenges, we overcame these times, and having DevOps practices in place helped us immensely. And I would say the pandemic is a cautionary tale to businesses and companies that refuse to evolve and change the core of their existence. DevOps is here to stay, and the advantages it possesses will ultimately give your business the things that it needs to overcome an unexpected future such as this. And I would say the events of the past year have only made DevOps more relevant. Thank you.