12 Factor Terraform: Next Generation Infrastructure As Code

Babylon’s mission is to put an accessible and affordable health service in the hands of every person on earth.


In order to do this, we need to be able to deploy quickly, effectively, and with total confidence in success.


The 12 Factor App is old news as far as software development is concerned, and countless firms have achieved great success in applying its principles.


At Babylon we’ve gone a step further and have applied 12-factor principles to our IaC pipeline, and extended the concept of a microservice down to the infrastructure layer.


This means that our delivery teams are empowered to deploy their own infrastructure and innovate around it, whilst maintaining our confidence that we can allocate cost correctly, that we remain compliant with our internal and external standards, and that we are delivering as fast as we possibly can.

Richard Vodden

Head of Cloud Engineering, Babylon Health

Transcript

00:00:07

Good morning, ladies and gentlemen. My name's Richard, and I run Cloud Engineering at Babylon Health, an organization whose mission is to put an affordable and accessible healthcare service into the hands of everyone on the planet. No small aim, as I'm sure you understand. As part of that, we've recognized that technology is a very important way to augment the knowledge and skill of the medical professionals who obviously provide that frontline care. So we are primarily a health organization, but technology is an extremely important part of what we do. One of the things we've been working on over the last twelve months or so is making sure we can deliver that technology as quickly as we possibly can, and that's the kind of thing we've all been working on and why we're here today.

00:00:57

A few things first. Firstly, I think you're all aware of Gene Kim's cloning machine. So whilst I am here upstairs in my loft conversion, talking to the camera, there is an evil clone of me downstairs in my office, on a laptop, able to answer your questions on the Track 1 Slack channel. If you jump on that, my evil twin will be able to answer any questions you have as we proceed. Secondly, this is a subject we could spend all day talking about, so this really is an introduction to the concepts we've come across on our Terraform journey at Babylon. There is a lot more detail behind it, so please do ask me about it. I'll be at all the networking sessions, and I really look forward to answering your questions, and similarly to quizzing you on how you've worked with Terraform and how you've made your infrastructure as code work as effectively as it hopefully can.

00:01:50

So, the key message I want to get across, and the key approach we took at Babylon, is that Terraform isn't special. It is just another programming language. That's the key tenet of everything I'm going to talk about for the next 25 minutes or so. Concepts such as dependency injection and separation of concerns are things that, if we were writing code in Java or Node or any other established language, we wouldn't think twice about applying. I've found that Terraform code seems to have forgotten some of this, because the facilities of Terraform have been quite limited over the years. Terraform 0.12 has obviously brought in some amazing features, which have very much helped enable some of the things we'll be talking about in the next few minutes, but not everything; some of this you could have done with 0.11. Some of it is just a way of thinking, in particular making sure that our code doesn't repeat itself. This DRY principle is something I'm going to keep coming back to. So let's have a look.

00:03:02

So let's have a look at what it is we're really trying to achieve, and what the challenges were. I joined Babylon in August last year and came into a team of amazing platform engineers. What was that platform team struggling with, and what were we trying to overcome with the infrastructure as code and the Terraform we were using? Firstly, and I think many of you will recognize many of these: all of the cloud expertise lived inside my team, and as such we were constantly being called upon by other teams. We have a hundred and something microservices at Babylon, all living in a Kubernetes cluster. Those microservices talk to infrastructure, but the concept of a microservice was very much the Kubernetes bits; the RDS instances, the ElastiCache clusters, the Lambdas and all the other things that make up the ecosystem were somewhat mysterious to the rest of the organization.

00:04:06

We were really the only people who understood how it worked. More than that, we were really the only people who could actually deliver it, which is a huge problem when you're trying to lean things out and make that work visible. We had bottlenecks everywhere, handoffs, and obviously a huge amount of delay. Linked to that is increased pace. By enabling the other teams to deliver, we can go faster, but similarly we ourselves wanted to go faster, and we were finding that the way we'd structured our code was making that very, very difficult. I think you've all come across the phrase "yak shaving": we were finding that what we wanted to do was make one simple change, but actually we had to do seven or eight things first, so there was no such thing as simple, which made everything very difficult.

00:04:51

We also had an awful lot of divergence in configuration, and the challenge of ensuring there was a small number of ways of setting things up really hadn't been considered in the evolution of Babylon over its five-year existence. So we had RDS instances set up in entirely different ways, people using different Postgres versions, people using different MySQL versions, people who hadn't considered upgrading at all, all of that kind of stuff. Secondly, linked to that, I think we all understand that the only way of really keeping track of your cloud costs is to make sure all of your cloud metadata is in really good shape. If you don't have consistency, particularly in your tagging, then you can't really track where your cloud costs are going and you can't attribute them back to those microservices.

00:05:43

Therefore the bill lands with the platform team, which is exactly the situation we were in and exactly one of the reasons we decided to go down this path. And finally, almost most importantly, but certainly equally importantly, we need to enforce compliance. We're a healthcare organization. We're ISO 13485 compliant, and we've just submitted for HITRUST, so hopefully we'll be HITRUST compliant very soon. We're HIPAA compliant, as we work in the US a lot. We need to be able to demonstrate to our annual auditors that we are compliant, and we were finding that we had to do a lot of work to gather evidence before each one of those audits. Whereas if we sat down in advance and said: right, this is how we're going to be compliant, this is how we're going to evidence it, this is how our infrastructure as code innately ensures compliance.

00:06:27

If we do that, we find our audits are actually a much more pleasant experience, and we're able to just say: here it is, this is the answer, this is how we do it, this is how it works. We can evidence, again using that metadata, that the resources that have been provisioned in the cloud were generated using this infrastructure as code, and therefore will be compliant, because we can show that the code is compliant. That's been hugely, hugely helpful for us. So let me wind back a little bit and talk about what I mean when I say let's look at how these problems would be solved in any other programming language. There was a chap who worked for, and I think still does today, Heroku, who a few years ago now, nearly eight years ago, maybe nine, I can't do the sums.

00:07:14

He came up with these twelve rules, the twelve factors, for writing a cloud-native application. Now, I am up front not going to pretend that we're going to hit all twelve of these, mostly because we've only got 24 minutes left and that would mean two minutes on each of them, and that wouldn't be very interesting at all. Secondly, not all of these actually apply to infrastructure. For example, number seven, port binding: this is all about not using Tomcat application containers and that kind of technology, and really ensuring that your application is a single executable which accepts a port binding in its own right. That doesn't apply to infrastructure the way it applies to code. So there are a few of these we can cross out, and we get down to eight; I've crossed them off on the slide.

00:08:08

You can review them in the slides afterwards; I don't think those four that I've crossed out really apply directly to infrastructure. But one thing I will add on number eight: rather than doing concurrency through scaling processes, we should actually scale dynamically using elastic services, and that's a key tenet of any cloud infrastructure. We're not going to talk particularly about that today, because that's about which resource you choose. Similarly, I assure you we're not going to go through each and every one of the remaining nine. I'm going to show you what we found when we did our analysis; we'll skip to the end and look at what the answer was, so that we have time to discuss it properly with a little bit more context. But let's pick on one, the one I found I latched onto when we were doing this exercise: exactly one codebase.

00:08:56

I think all of you have had some experience with IaC, and there's a kind of journey organizations go through when they talk about infrastructure as code. Certainly every organization I've worked for has started off putting all of their code in one repository, and we all know that's incredibly painful. So what is this factor talking about? Why are we saying exactly one codebase when we all innately know that having one mono-repo is incredibly challenging? I think it's all because of this little line on page two of the twelve-factor definition: factor shared code into libraries which can be included through the dependency manager. That is something which I think very few organizations consider when they think about how to structure their infrastructure as code. So what we've tried to do at Babylon is take that tenet to heart: any reused code is extracted into a kind of library. We understand that Terraform has modules; I think most people have come across Terraform modules. So really the question we're going to be answering today is how we should structure those modules to best answer those four challenges we had at the start, to absolutely minimize refactoring, and to make sure we're doing the best we possibly can with our infrastructure as code.
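To make that concrete, here is a minimal sketch of what consuming a shared module as a versioned library can look like; the repository URL, module name, and inputs are invented for illustration and are not Babylon's actual code:

```hcl
# The consumer declares its dependency on a shared module, pinned to a
# released tag via the Git source's ?ref parameter. Bumping that ref is
# the Terraform equivalent of updating a library version in a lockfile.
module "db" {
  source = "git::https://github.com/example-org/terraform-module-rds.git?ref=v1.2.0"

  name        = "identity"
  team        = "platform"
  environment = "staging"
}
```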

00:10:27

So when we did our analysis of those remaining eight, maybe nine, factors, we came across four big problems, one of which I've talked about already: factoring shared code into libraries. There wasn't really a standard answer to that; you couldn't jump on Google and ask how to do this for Terraform and have the Stack Exchange answers come up and say, this is how you do it, this is the right answer. Firstly, factoring shared code into libraries needs a dependency manager, and Terraform doesn't really have one. There are standard modules out there, but when I was looking at them I found that, for example, the standard Vault module has an EC2 module inside it, and when I go to the standard Consul module, that has another EC2 module. They're very, very similar, but this means we're repeating ourselves.

00:11:18

So if I want to change how an EC2 instance is deployed, I have to update both of those modules. Really, I want one EC2 module which is pulled into both Vault and Consul if they both need EC2. Another core tenet of twelve factor is to separate build, release and run. How do we achieve that in infrastructure as code? What does it mean to build? One of the challenges we've all talked about before is how we organize our testing. What do we mean by build, release and run, and where does testing fit into that process? Obviously there isn't really an artifact like there might be with a programming language like Java or C, or even Go, but equally we're not in the same kind of world we're in with JavaScript, where the code just executes.

00:12:09

So what we did there is we wrapped our modules, our libraries, and gave them all very distinct version numbers, so whenever we talk about something, it has a version number associated with it. We'll talk about the structure of those modules and how we organize them in just a second. Explicitly declare and isolate dependencies: again, this comes back to the point I was making about Vault and Consul. Both of those modules relied on EC2, yet they contained their own copy of the code; it wasn't declared, it wasn't isolated, and the EC2 code was distributed across them, which made maintenance particularly difficult. And finally, storing environment-specific configuration in the environment. What does that mean? When we're working with Kubernetes, we've got things like config maps; when we're working with servers, we've got disks to keep things on; but when we're doing infrastructure as code, what does it mean to have an environment? Where can we store that environment-specific configuration, and how can we make best use of it, so that things are maintained and audited and we know what changed in the event that something breaks?

00:13:21

So we came up with a four-level hierarchy for the modules and the Terraform code that we've written, and I'm going to start from the bottom of the pile, which seems a little bit strange. The bottom level we called a Module, with a capital M to distinguish it from the generic Terraform module. These are very, very small; there's fundamentally a one-to-one relationship between a Module and an AWS resource. What the Module makes sure is that these resources are named correctly, that the metadata is correct, that the tagging is there, and that any opinions we have as an organization about how a particular resource should be deployed are encoded into that Module. For example, we want all our EBS volumes to be encrypted; if you use Babylon's EBS Module, you don't get an option to not encrypt an EBS volume.
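As a rough sketch of what such an opinionated Module might look like (hypothetical variable and tag names, not the real Babylon module), note there is deliberately no input that lets a consumer switch encryption off:

```hcl
variable "name" {
  description = "Logical name, used to build the resource name."
  type        = string
}

variable "team" {
  description = "Owning team, recorded as a tag so cost can be attributed."
  type        = string
}

variable "availability_zone" {
  type = string
}

variable "size_gb" {
  type    = number
  default = 20
}

resource "aws_ebs_volume" "this" {
  availability_zone = var.availability_zone
  size              = var.size_gb

  # Organizational opinion baked in: encryption cannot be disabled.
  encrypted = true

  # Naming convention and metadata enforced in one place.
  tags = {
    Name = "ebs-${var.name}"
    Team = var.team
  }
}
```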

00:14:21

Similarly, when you deploy an EC2 instance, the EC2 Module uses the EBS Module, so we know the volume is encrypted and we've isolated that dependency. A component is the next level up of abstraction; this is a grouping together of Modules to form something useful. For example, it's very rare to deploy just an RDS instance. Most likely you're going to deploy the RDS, stick a security group in front of it to define which other resources are able to talk to it, probably create some IAM roles to decide who can log into it, and create some CloudWatch alerts. Each of those has a Module of its own, and it's the component that pulls in those dependencies and groups them together as a sensible, business-value-delivering thing, which itself can be version-numbered. So if, for example, the very first time we released RDS we didn't have the CloudWatch alerts

00:15:19

and just had the IAM role, we could say this is RDS 1.0. When we add those CloudWatch alerts to it, we can call it RDS 1.1 and say, right, we have a new feature, we have a new changelog, and we understand which of the databases that have been deployed were deployed using which version of that component. The next level up is the service. This is where the consumers consume those components: we have one service for each of our microservices that consumes any kind of infrastructure. This is the part where we can enable teams, because their services are their own repositories with their own state files. We can devolve control and say, here is your service. There's no longer a possibility that teams can accidentally run over somebody else's infrastructure, because the IAM role they use to deploy will only let them touch their own; it's all contained within the service. And finally, we have the concept of an environment. This is where we group the services together, and it's a repository, an actual GitHub repository, where we hold the environment-specific configuration. So, for example, the IP ranges of our VPCs are in that environment, and then there's a description of each of the services which needs to be deployed to that environment.
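Concretely, a service and an environment might look roughly like this; the names, sources, and versions are invented to show the shape rather than Babylon's actual repositories:

```hcl
# --- service repository (one per microservice that consumes infrastructure) ---

variable "environment" {
  type = string
}

# The service consumes components at pinned, released versions.
module "db" {
  source = "git::https://github.com/example-org/terraform-component-db.git?ref=v1.1.0"

  name        = "identity"
  team        = "identity"
  environment = var.environment
}

module "redis" {
  source = "git::https://github.com/example-org/terraform-component-redis.git?ref=v2.3.0"

  name        = "identity"
  team        = "identity"
  environment = var.environment
}

# --- environment repository (illustrative staging configuration) ---
#
#   environment = "staging"
#   vpc_cidr    = "10.20.0.0/16"
#   services = {
#     identity = "v3.0.0"
#     billing  = "v1.4.2"
#   }
```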

00:16:47

So we can draw a kind of picture here. This is slightly small, but you can see at the top we have those microservices. Some of them don't consume infrastructure at all; they're just standalone business logic that talk to other microservices to get their data and store their state, and don't talk directly to infrastructure in any way. But some of them, as you can see here, do. You can see that the dotted line all the way around the outside is how we define the environment, and that has three services in it: we have foo, we have bar, and we have doll. When we look at that doll service, we can see that it makes use of two components, the components being DB and Redis. And when we zoom into those components, you can see that each component uses the slightly smaller Modules: RDS

00:17:41

and IAM in this particular instance, and ElastiCache in the case of Redis. But notice that both DB and Redis are using the same IAM Module, so we've managed to reuse that code. It has a version number on it, so if we want to change the naming convention for our IAM roles, for example, we can release a new version of the Module. It won't get automatically rolled out to everything, because we need to release a new version of each component to consume that new IAM Module. Similarly, each service will need to release a new version in order to consume the new version of each component, and finally each version of those services will need to be rolled out into their respective environments. Now, this may sound unwieldy when I say it like that, but this is how every other language works. If you're writing Node, you type npm update and it goes and gets the latest version of all of your modules. That's very much the gap we found was missing when we were doing our work with Terraform.
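Zooming in, a component like DB is just a small Terraform configuration that pins each Module it uses and is released under one component version. A rough sketch, with made-up sources, versions, and inputs:

```hcl
variable "name" {
  type = string
}

variable "team" {
  type = string
}

variable "allowed_security_group_ids" {
  type    = list(string)
  default = []
}

# The database itself.
module "db" {
  source = "git::https://github.com/example-org/terraform-module-rds.git?ref=v2.0.1"

  name = var.name
  team = var.team
}

# Who may talk to it over the network.
module "db_security_group" {
  source = "git::https://github.com/example-org/terraform-module-security-group.git?ref=v1.4.0"

  name                       = var.name
  team                       = var.team
  allowed_security_group_ids = var.allowed_security_group_ids
}

# Who may log into it.
module "db_access_role" {
  source = "git::https://github.com/example-org/terraform-module-iam-role.git?ref=v1.2.0"

  name = "${var.name}-db-access"
  team = var.team
}

# Alerting; adding this block is the sort of change that takes the component
# from version 1.0 to 1.1.
module "db_alarms" {
  source = "git::https://github.com/example-org/terraform-module-cloudwatch-alarms.git?ref=v1.0.3"

  name = var.name
  team = var.team
}
```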

00:18:41

Excuse me a second.

00:18:44

Well, yes, absolutely, I've said it: dependencies. How do we get around that incredibly long, complicated chain? We've just released a new IAM Module, which means we need to update both the DB component and the ElastiCache component, which means we then need to update the service, and finally deploy it out to the environments. Well, we wrote a very, very small Go script called Baraki; any coincidental naming is, as I say, entirely coincidental. What this does is use GitHub releases, which are basically just rubber-stamped GitHub tags, to keep a record of, and be able to track down, which versions of each of these Modules, components, services and environments exist. It can then go into the Terraform code and update the connection string on the Git source.
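In practice the change it proposes is a one-line edit per consumer, something along these lines (illustrative values only, not the tool's actual output):

```hcl
module "db_access_role" {
  # The tool reads the current pin, checks GitHub releases for the latest tag,
  # and raises a pull request bumping it, here from v1.2.0 to v1.3.0.
  source = "git::https://github.com/example-org/terraform-module-iam-role.git?ref=v1.3.0"

  name = "identity-db-access"
  team = "identity"
}
```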

00:19:46

When you declare the module, the source obviously tells you what you're consuming, and the tool does no more magic than that. When you run an update, it goes and looks at every single module, then looks at GitHub and asks what the latest version of that module is, and offers to update them for you. You can then raise a pull request to apply it. It really takes the manual work out of that module dependency updating process. But even then, that itself sounds very manual. I'm here at a virtual conference talking about how to solve dependencies, and I'm saying you can raise a pull request? So how do we take this to the next level? How do we make this even more automated? The answer to that is always testing. If we want to automate things, we need to be able to find out when they're broken and be told automatically,

00:20:38

so the automation doesn't just crack on and produce something that's horribly broken because we didn't test it. So how does this structure help us deliver automated testing in the world of infrastructure as code? Well, interestingly, this hierarchy works very well indeed, which is one of the reasons I felt it was worthwhile coming here and talking to you about it. Firstly, those Modules are very small and very simple, and they line up nicely with the concept of unit testing. We have a small number of Terratest scripts written against each of the Modules, which check: does the Module deploy, does it even work, does it create the resource it's supposed to, does it have the naming convention we said it should have, is it tagged properly, do we have that metadata? When I say, please create me a database that belongs to this team,

00:21:28

when I look at the AWS console, does it have that team's name tagged against it? And finally, the opinions we talked about: I said we encrypt all our EBS volumes, as I think most people do, so when I create an EBS volume, is it encrypted? We have Terratest scripts which go around and test each of those things every time a Module changes and every time a pull request is raised against master, just like any other language. But it's simpler here, because the Module sits right at the top of the dependency hierarchy and at the bottom of the functionality tree. The next level up is components, and this is where we can start doing more end-to-end tests, the kind of thing we'd call integration tests if we were writing a larger application in Go or Java.

00:22:16

So now, if we take the database example, where I've created a database, some CloudWatch alerts, and an IAM role so that I can log into it, perhaps some logging, all of those things as part of my component using the various Modules that have already been unit tested, now I can start looking at what you might describe as user journeys. Again, Terratest comes in here. Can I get the credentials for my database into my secrets manager? I've said Vault here, but it could equally be a secrets manager or something else. Can I get those credentials, can I then log into the database, and can I put some data in there? Great tests, simple tests. Another one that's really important, and I'd suggest if you follow this approach you might want to put this one in:

00:23:04

if there's some data in my database already and I upgrade the module, is that data still there afterwards? That's quite an important catch-all, and that's the kind of thing we've done. Those user journeys are high level: does the component work, does it deliver the value it's supposed to deliver? When we get up to the service, so this is where, for example, we have an identity microservice which is responsible for authenticating all of the users coming to the front of the platform, what we do here is just run the microservice's own tests. We deploy the microservice, we deploy the infrastructure that's supporting it (probably the other way round; in hindsight we'd do the infrastructure first, then deploy the microservice after it), instantiate whatever dependencies that microservice needs to have its test suite run, and then just run the application's test suite, the test harness the application already has. Because we have extended the idea of the microservice down into the infrastructure layer at this point, that service code, the code that says this microservice needs these components at these versions, could be stored just as well in the application code repository as in the standalone repository where we're keeping it at the moment.

00:24:22

But we will eventually merge them together, so we can just run the application tests. Again, we can run some of the non-functional tests, such as: is the data still there? Many tests will be self-contained and will create the data they need. Finally, at the environment level, testing is a little bit more complex. How do you test an entire environment? We can get a little more end to end, we can run our internal tests, but it really depends what that environment is for. The interesting environment, obviously, is production, and at that point we're looking at the usual kind of Runscope API testing and monitoring that we already do. The other thing we do, for those environments which are not ephemeral, is regularly run a Terraform plan,

00:25:10

I think it's once an hour, against the entire environment. That tells us whether any of the resources deployed in that account have been changed in any way, whether somebody has manually gone in and manipulated one of those resources, and we get an alert if that plan shows anything other than "no changes". I mentioned that every entity has its own repository and version. The repositories have a very strict structure, and that structure is isolated: each component has its own repo, each Module has its own repo, each service has its own repo and each environment has its own repo. The structures are the same for all of them except the environments, which I won't go into here, purely for reasons of time. The first decision we made was that all of the boilerplate code has an underscore in front of it.

00:26:00

That just makes it easy: you can glance at the directory structure and see what really matters. You can just look at it and say, ah, I need to look at RDS, I need to look at IAM; the _locals and _outputs, the boilerplate code we all understand, you can sort of see through. The next important thing is that we include an example. That example code is real code; it really works. You have to pass in credentials, obviously, and the name of the AWS account and various VPCs, but the example actually instantiates that component or that Module or that service; you can just run it. The tests then execute against that example, which means we know that the example, which is fundamentally meant for documentation, actually works, because we test it every time we do a release.
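For a concrete feel of the layout (an assumed structure with illustrative file names, not Babylon's exact convention), a Module repository might look like this, with the example instantiating the Module from the repository root:

```hcl
# Repository layout (illustrative):
#
#   _locals.tf       # boilerplate, underscore-prefixed so it sorts together
#   _outputs.tf
#   _variables.tf
#   rds.tf           # the substantive resources
#   iam.tf
#   example/
#     main.tf        # a real, runnable instantiation, doubling as documentation
#   test/            # Terratest suite that deploys example/ and asserts on it
#
# example/main.tf:
module "db" {
  source = "../"

  name        = "example"
  team        = "platform"
  environment = "test"
}
```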

00:26:56

So let's have a look back at the four problems I said we were trying to solve right at the beginning, and see if we've knocked them on the head. We were trying to remove bottlenecks. Well, because we have those services, which are very strictly owned by the teams developing the microservices, we can give them their own IAM role and their own state file, we can keep everything nice and tidy, and we can make sure they can't impact other teams alongside them. So yes, we've got rid of the bottlenecks in that sense. Secondly, we have increased pace. One of the things slowing us down when we had a big mono-repo was that a Terraform plan would take 20 minutes. Now, because we've got a state file per service per environment, they're much smaller.

00:27:42

A Terraform plan takes 90 seconds, and most of that is downloading the Terraform modules, because we're on ephemeral build agents. We've ensured consistency: everything is deployed using a Module or a component or a service into an environment, so we know that it's right. We run regular Terraform plans, so we know when things get changed; they very, very rarely get changed, but we find out when it happens. We absolutely know that everything we're doing is either consistent or, just as importantly, we know when it's not consistent. And finally, we're enforcing compliance. From the ground up, those Modules enforce the rules we've set out when we say we're going to be compliant with a given framework: we encrypt all our EBS volumes, we ensure that everything is HTTPS, and it's baked into the Module. We completely understand that all of the resources that have those rules associated with them have been deployed by this code and are therefore compliant, and that makes our auditing journey an awful lot easier and an awful lot more pleasant. And with that, a minute under half an hour: thank you very much for your time, and thank you very much for coming to my presentation. I completely understand that that was a lightning tour through a topic we could spend hours and hours talking about. My evil twin is on the Slack channel, still tapping away answering your questions. Thank you very much.