The DevOps Journey in an Enterprise (Europe 2021)

Agility, CI/CD and DevOps are often challenging for large enterprises. In this session, Anders presents his 10+ year story of how 100+ developers moved from 2-3 releases per year to 30+ production deployments per day. With Anders' cartoon presenter style you will probably have some fun, but also bring home some learnings from his success and failure stories.


Anders Lundsgård

Cloud Solutions Architect Consultant, tretton37



Thank you, Gene, for having me at the DevOps Enterprise Summit. This conference, along with many of your reports and books, has given me and my colleagues so much inspiration, so when you asked me to join as a speaker, I of course just couldn't say no. Hi, I'm Anders Lundsgård from tretton37, the Stockholm office. I am a cloud solutions architect with a real passion for DevOps, and I'm here to share my story about the DevOps journey in an enterprise. That enterprise is Scania, a truck, bus and engine manufacturer.


My passion in my working life is to enable developers to work much more efficiently than I have ever been able to myself. I teach developers how to operate their code in cloud environments, and I also try to teach the operations guys how to code, which has shown to be a little bit more complicated. I have been involved in the cloud adoption at Scania for six to seven years, and I see cloud as an enabler, a technical enabler, for DevOps. A disclaimer before I continue: this is not an attempt by me to define DevOps. I will share the story that I and hundreds of other colleagues at Scania Connected Services have lived through during a 10-year journey to work more efficiently with how we produce and put software in the hands of the end users. We have had DevOps as a buzzword to Google about and to have opinions about, and I think that is really good. We have great books like The Phoenix Project and The DevOps Handbook, but we do not have a single page like, for example, the Agile Manifesto. But that's fine. I think it's good that everyone can have their own opinions, and I guess that's the reason we have these kinds of conferences.


If we step back in time, my very first job was at a startup. I was the only guy at a company of two who knew how to write code. So I wrote the code, I wrote the tests, I made some kind of deployment pipeline to push the code out to the web server, and I also put up some key metrics so I could see that my end users were generating some web traffic. At the very least, I was the guy the end users called when something went wrong. DevOps was not a term back then, but it was kind of a DevOps situation, because I worked with the code and I ensured that the end users could use that code. In 2008 I decided to move to the enterprise of Scania, and I was faced with a totally new situation for me as a developer. During my first week it was explained to me that the code I was writing right now could not be deployed into production until the next year: nine months after I wrote the code, it would go live in production. I thought that maybe this is just how it works in large enterprises, but over the years I have come to understand that it does not have to be that way. I have also come to understand that the tug of war we see on the screen, between developers with one set of key metrics and operations with another, is not a technical challenge. Of course there are technical challenges, but most of the challenges have absolutely been cultural, and I will try to focus on those in this talk.


Here's a picture taken at a very big moment: the very last manual deployment to production, back in 2015. It was me and about 10 other engineers who went into the office on a Saturday morning to do the release. This was, I would say, a very non-DevOps situation, for many reasons. To the left we had the release plan, stating what each of us should do to have a successful release that Saturday morning. At the top of the release plan it says that we should call the network guy to take the servers out of the load balancer and turn on the maintenance page, because the fact was that we had downtime for the end users while we were doing the release. That was the reason we went in so early on a Saturday morning. Then on the screens we have four of the actually six web servers that me and my colleagues copied files to, to put the new version of the code live in production. Behind me we had the DBA, who at some point in the release plan got the responsibility to do the schema changes in the database.


In the middle we have a small post-it, and on that post-it was the phone number of the engineering manager, because when things went wrong, which sadly happened sometimes, we needed to escalate and call in some developers who could help us troubleshoot and find the issue. The Red Bulls to the right might just be a symbol of the fact that this was a big event: we needed to go in on a Saturday morning, we did it only once a month, and it was a stressful situation.


So we had this setup that probably many of you have seen: a wall between the development and operations departments. At Scania it was actually not even one office space. I was in one building, and to meet the operations guys I had to travel 20 minutes to another department. We in development needed to get more features out, more rapidly, and the operations guys needed to keep up the stability. These two goals have shown to be very often in conflict with each other. The guy at the top of the wall here was sometimes actually me, pointing the finger at the operations guys, because they were the bottleneck in our release improvement process.


Today I have turned out to be that security guy. Since I moved into cloud adoption I have also worked a lot with security configurations, but I still have that developer on my shoulder telling me that we really need to increase the pace and reduce the batch size of each release to production. On the other hand, I have also seen that if you just let loose all configurations in a cloud infrastructure, bad things can happen. If we step back to the days of manual deployments: we had a vision in our department that an engineer should one day be able to wake up with an idea for a change, big or small (it could just be changing the color of a button), write some tests, make the code change, and after lunch deploy that code into production totally on his or her own. The next day, or hopefully the same day, he or she could evaluate whether that deployment turned out to be successful for the end users or not. What we didn't know back in 2015 was that this was actually going to happen already the next year.


The key foundation, the technical enabler I would say, for us to move in the right direction, towards a more agile way of working, was the practices around continuous integration. Continuous integration has a lot of pillars, but the four key pillars for us were these. First, version control everything: code, tests, configurations. Schema changes in the database should also be in version control, so the DBA had to start using Git. Later, when we moved into the cloud, we also saw that we could version control the infrastructure. Second, automation. Of course we should have automation, I think everyone agrees on that. We basically used the release plan as a backlog for automating our deployment process. So the first step was to go to the network architect and ask: how can we automate this, so that instead of calling the network guy we have that step as a bash script or whatever?
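To give a feeling for what "release plan as a backlog" meant in practice, here is a toy sketch of the Saturday-morning steps turned into code. It is purely illustrative, not our actual tooling: the server names are made up, and the load balancer is simulated with an in-memory set where a real script would call the load balancer's API.

```python
# Toy model of the old release plan, one function per manual step.
# A real version would call the load balancer API and copy real files.
SERVERS = ["web1", "web2", "web3", "web4", "web5", "web6"]
ACTIVE_IN_LB = set(SERVERS)   # servers currently taking traffic
DEPLOYED_VERSION = {}         # version running on each server

def remove_from_load_balancer(server):
    """Step 1 of the release plan: what we used to call the network guy for."""
    ACTIVE_IN_LB.discard(server)

def copy_release_files(server, version):
    """Step 2: the files we used to copy by hand on Saturday mornings."""
    DEPLOYED_VERSION[server] = version

def health_check(server, version):
    """Step 3: verify the new version responds before taking traffic again."""
    return DEPLOYED_VERSION.get(server) == version

def add_to_load_balancer(server):
    """Step 4: put the server back into rotation."""
    ACTIVE_IN_LB.add(server)

def deploy(version):
    """Run the whole release plan, one server at a time."""
    for server in SERVERS:
        remove_from_load_balancer(server)
        copy_release_files(server, version)
        if health_check(server, version):
            add_to_load_balancer(server)

deploy("2015.11")
print(sorted(ACTIVE_IN_LB))  # all six servers back in rotation
```

Once each step is a function, the script replaces the paper plan and every run is identical, which is what eventually made the phone call to the network guy unnecessary.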


Luckily, some months later we got an API that could be integrated into our deployment tool. Third, trunk-based development, or as I think we should rather call it today, mainline-based development. The practice that many of us have probably missed is that each developer on the code base should merge their code into the mainline at least once a day, and that means no long-lived feature branches. Fourth, no blame. That brings us to this picture with the lamps in the ceiling. Green, obviously, is a good color. Yellow is shown when we have had a failure, but it is actually a good color, because something went wrong, the red light went on, and someone took responsibility to fix that issue. So there is only one bad color here, and that is red. If the lamp stays red for over 30 seconds, then we have a problem.


That means we have a failure that no one cares about. This way of visualizing it was very good, because it also brought management into the daily work: they could see that we had issues. It was probably a very stressful situation for the development teams when the red light stayed on for a long time, so reducing the time the red light was on became a key metric for us. One caveat to developers on "version control everything": everything except secrets. Secrets in version control can of course end up in the wrong hands, but needing to keep things out of version control also reveals that you probably have some work left to do when it comes to automation. Some metrics on our deployment frequency at Scania Connected Services: back in 2011 we ran software projects, two or three in parallel, and did about three deployments to production every year.


The very last manual deployment that I talked about was in 2015, when we had agile teams and one deployment per month. The key enabler for that step was continuous integration. But then something happened in 2016: the deployment metric went through the roof, to 30-ish deploys per day. The key enabler for this, I want to say, was a microservice architecture where each development team owned their own part of the system and could deploy it into production independently. We challenged and improved infrastructure-related processes, like for example the network configuration, but we also realized we cannot have a change management meeting every second Tuesday if we want to do 30 deployments per day. Trust and encouragement from management: previously we had go/no-go meetings before the Saturday morning, where our senior manager had to approve the deployment to production. Those meetings are gone, and hopefully they are gone forever. And let me say, before I continue: this was all on-prem. It was not the cloud that enabled us to do these frequent deployments. Shortly about the branching strategies and their evolution over this time: back in 2011 we had a mainline with release branches, where we did the release, and a bunch of feature branches. The first evolution was that we removed the feature branches but still kept the code waiting in a release branch before each release. The true improvement came when we also could remove the release branches and the separate release process.
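A practical way to keep feature branches from growing long-lived is to measure merges to mainline per developer per day. Here is a minimal sketch with hypothetical authors and dates; a real check would read `git log` on the mainline branch instead of a hard-coded list.

```python
from collections import defaultdict
from datetime import date

# Hypothetical mainline history: (author, date of merge to mainline).
mainline_commits = [
    ("alice", date(2016, 3, 1)),
    ("alice", date(2016, 3, 2)),
    ("alice", date(2016, 3, 3)),
    ("bob",   date(2016, 3, 1)),
    ("bob",   date(2016, 3, 3)),  # bob skipped March 2nd
]

def daily_merge_violations(commits, workdays):
    """Return, per author, the workdays with no merge to mainline."""
    merged = defaultdict(set)
    for author, day in commits:
        merged[author].add(day)
    return {author: [d for d in workdays if d not in days]
            for author, days in merged.items()}

workdays = [date(2016, 3, 1), date(2016, 3, 2), date(2016, 3, 3)]
print(daily_merge_violations(mainline_commits, workdays))
# bob has one workday without a mainline merge: a budding long-lived branch
```

A day without a mainline merge is an early warning that a branch is diverging, long before it becomes the "big bang" integration problem described later in the talk.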


We talked a lot about autonomous teams. We should have an architecture that enables teams to have their own code base, their own part of the system, and those parts should be independently deployable compared to the other teams' parts. Of course that is a beautiful thing, because the teams can choose the tooling, the frameworks and the languages for their particular challenges. But for me as a release engineer, the best thing was that if something goes wrong for one team, the other 29 teams can continue as usual. Of course we should take care of this guy when things go wrong, but in the past a failure like this two days before a Saturday morning would mean that we had to block the release, reschedule it, and everyone was affected. Zero downtime was one key technical enablement that we delivered in 2016.


And that means basically that we can do a production deployment while the end users continue to use the system. That sounds good in itself, but the best thing with it, I must say, is that it enabled us to leave the weekend releases behind and do deployments on a daily basis, and the engineers who write the code can be present and actually be the ones responsible for pushing the code into production. So when things go wrong, the engineers are there and can solve the problem directly. Immutable production: a very interesting topic, and I will of course not go deep on it in this short time. But when I say that developers should not log in to production servers, operations guys tend to like it. Then I continue and say: you, operations guy, you should not log in to those servers either. And then the discussion turns into another area, because with immutable production we need the practice of version controlling the infrastructure.


If we want to roll out a new configuration change to a server, we change that configuration in version control, and then we pop up a totally new server and kill the old one. Remove handovers: we should really avoid sub-optimizing. I think our release plan was one thing that showed us that we had many sub-optimizations, because the day we started to collaborate and share the release plan with the operations guys, we could start to make real improvements. So try to avoid sub-optimizing within organizational silos. We also realized that we could decouple deploy from release. Deploy, to us, means that we move binaries to the web servers, to the production servers, or move the JavaScript files to the bucket that serves the front end of our application. Release means that we enable some new features, or perhaps a totally new service, for the end users. A deploy is one hundred percent a development team concern that they take care of. A release might need to be synchronized with some marketing activities, or documentation for the end users has to be written, so that is a business decision.
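Immutable production can be sketched as a fleet where a configuration change never edits a server in place: new servers are launched from the version-controlled config, and the old ones are terminated. A toy model, with in-memory dicts standing in for real servers and a made-up TLS setting as the config:

```python
import itertools

_server_ids = itertools.count(1)

def launch(config):
    """Pop up a totally new server from version-controlled config."""
    return {"id": next(_server_ids), "config": dict(config)}

def apply_config_change(fleet, new_config):
    """Roll out a config change immutably: replace every server.
    Nobody, developer or operations guy, logs in and edits anything."""
    replacements = [launch(new_config) for _ in fleet]
    fleet.clear()                 # old servers terminated
    fleet.extend(replacements)

fleet = [launch({"tls": "1.1"}) for _ in range(3)]
old_ids = {s["id"] for s in fleet}
apply_config_change(fleet, {"tls": "1.2"})
new_ids = {s["id"] for s in fleet}
print(old_ids.isdisjoint(new_ids))  # True: every server was replaced
```

Because a server is only ever a product of the version-controlled definition, the running fleet can drift from Git only if someone breaks the rule and logs in, which is exactly what this practice forbids.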


This was late 2016, and this was a big release: we enabled a brand new Angular front end backed by microservices, and the product owner enabled this site and system for 10 markets. We made a graceful rollout of this system over a period of two to three weeks, enabling the new front end market by market. If things had gone wrong (luckily they didn't, but if we had had some struggle) it would have been very easy to do a rollback: just uncheck the market, and the old site would still be there. So we stopped going into the office on Saturday mornings. But there was one time we had a special demand from the UK market: for some reason they needed a deployment, or rather a release, on a Saturday morning, the 1st of September 2016. This little girl, Tilda, was four at the time, and she got the responsibility to do the release. Of course she didn't really know what she was doing; she clicked a button and didn't think it was that much fun. But you can imagine me standing behind her, my four-year-old daughter doing a release, when six months earlier I had gone into the office together with 10 engineers to achieve basically the same thing with hours of work. I was of course super excited about this.
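Enabling and unchecking markets like this is typically implemented with feature flags on top of the deploy/release split: the code is already deployed everywhere, and a release is just a configuration flip per market. An illustrative sketch (the flag store, feature name and market codes are all made up):

```python
# Deployed code can contain features that are not yet released;
# releasing (or rolling back) is a config change, not a deployment.
released_markets = {
    "new_frontend": {"SE", "DE"},  # feature live in 2 of 10 markets
}

def is_released(feature, market):
    return market in released_markets.get(feature, set())

def release(feature, market):
    """The business decision: turn the feature on for one market."""
    released_markets.setdefault(feature, set()).add(market)

def rollback(feature, market):
    """Uncheck the market; the old behaviour is still deployed."""
    released_markets.get(feature, set()).discard(market)

release("new_frontend", "UK")
print(is_released("new_frontend", "UK"))  # True
rollback("new_frontend", "UK")
print(is_released("new_frontend", "UK"))  # False
```

With this split, a graceful market-by-market rollout is a sequence of `release` calls, and a bad rollout is undone without touching the servers at all.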


I feel that there is something within the culture of DevOps that is extremely important. For us it meant that developers need to understand the complexity of operating software of all kinds, especially in production environments where you need 24/7 operation and have SLAs stating up to 99.99% availability. There are things like availability: as a developer you might need to deploy your code to at least two servers in case of a power outage. There are things like durability, scalability and security that the operations guys in a traditional on-prem system take care of. On the other hand, the operations guys needed to understand the complexity of dealing with different code states. What I mean by that is this: we have developers producing code that is live in production; they continuously get new demands, which means new code will be written; and hopefully we have deleted the code that didn't do any good for the end users. But the hardest state of code is code that is pending. That means code that has been written and checked into version control, but not yet deployed into production. The longer the time from when a developer checks code into version control until it goes live, the bigger the bang will be when the code goes live and eventually something doesn't work. It is also the case that with a long lead time, many developers have probably pushed code at the same time, which makes it even harder to troubleshoot. That understanding has shown to be very important for operations guys, to find affinity with the developers.


Scania made a cloud-first decision back in 2016, and that was a big change. Of course we got a state-of-the-art infrastructure platform to work with, but it was a very big challenge for an organization with an on-prem way of working. Remember that we used to call the network guy during a release to do some change in the load balancer. Now, suddenly, that load balancer goes from being hardware, and some clicks in a web UI, to something that is defined in code, in the way that a developer is used to seeing code, and that code can be kept in version control. This is a big shift. Our network guys and DBAs needed to understand that we have to define infrastructure as code, and to help the developers define it and version control it, to make it reliable.


I worked in the cloud security team for two years, and now I am working on helping developers define their own infrastructure in the cloud. Here are our core pillars for how we have been working as a cloud security team, a team of three people serving over a thousand developers. First of all, within our cloud-native strategy we also have a cloud security strategy: we should use the cloud-native security services. Even though we are a large enterprise, we as a security team cannot build better and more secure services than the cloud providers; the cloud providers are experts in security. We should share failures, within the organization of course, but also externally.


And embrace encryption: if we can encrypt things, we should do it, and in a cloud environment we can very often enable encryption with one single line of code. We also had some happenings around security: we appointed security champions, and we had a day when developers could compete around security. But the two things that have been really important to us are these. We as security professionals really need to get out of the cave, and we cannot only say no. We at the cloud security team had a compliance framework stating, for example, that you should not have port 22 open to the whole world. But we put the responsibility on ourselves that if we say no, we should also provide a remediation for the developers.


This was inspired by a true DevOps moment for me. Back in 2015 I started to collaborate with our network architect, and we saw that we needed to define the infrastructure, the networking, in our cloud environment, because there were some common rules that needed to be in place for the whole cloud setup. We saw that we could define the virtual private cloud in a way the network architect was very comfortable with: he knows about CIDRs, BGP and so forth. And me as a developer, I know how to write code, and I know that version controlling that code in Git is a very good practice to enable automation. So we built this together, and still today that work is used every day to create new virtual private clouds in our cloud environment.
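The flavour of that collaboration can be shown with Python's standard `ipaddress` module: the network architect's CIDR plan becomes deterministic code that carves a VPC range into subnets. This is a simplified sketch; the real work used our cloud provider's infrastructure-as-code tooling, and the CIDR ranges and subnet names here are made up.

```python
import ipaddress

def carve_vpc(vpc_cidr, subnet_prefix, names):
    """Split a VPC CIDR into equally sized subnets, one per name.
    Deterministic input gives a deterministic plan, so the definition
    can live in Git and feed an automated pipeline."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = vpc.subnets(new_prefix=subnet_prefix)
    return {name: str(net) for name, net in zip(names, subnets)}

plan = carve_vpc("10.20.0.0/16", 24,
                 ["public-a", "public-b", "private-a", "private-b"])
print(plan["public-a"])   # 10.20.0.0/24
print(plan["private-b"])  # 10.20.3.0/24
```

The network architect reviews the CIDR math, the developer reviews the code, and every new VPC request becomes a pull request instead of a phone call.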


Time is running out for me, so here is how you can help. Of course, continue to keep your distance, but hopefully this year you can take on this challenge: I challenge you to take the DevOps hug. I did this hug seven years ago, just to see if I would survive. The outcome was not only that I apparently stayed alive, but also that I started to find true friends in the operations department. I will end by saying this. You as a developer, if you have any concern about where you as one small developer in a big enterprise can start: try to enable zero-downtime deployments. That is my call to you. And you, operations guy: if you do not yet version control your scripts in Git, start to learn Git. The day a developer comes to you and asks for a change, and you can ask him or her to do a pull request in your Git repository for that change, you have started to enable a true DevOps moment. That was all I had. You can of course find me on Slack, you can always find me there, and connect with me on LinkedIn, for example. So I will again thank you for being here. Have a great day, and take care. Bye.