Blackboard Learn - Keep Your Head in the Clouds
How do you implement DevOps in a software company that has 16 years of established culture and processes? What if this organization is the industry leader and has everything to lose by changing?
Over the last two years, Blackboard has gone through an enormous change, from a company delivering enterprise software once every 18 months to one on the verge of delivering cloud-enabled education software through continuous deployment.
My presentation will talk about the triumphs and challenges of taking a group entrenched in years of legacy to a new vision of faster delivery of high-quality software.
Vice President, Chief Architect, Blackboard
So my name is David Ashman. I'm the chief architect of cloud architecture at Blackboard, and I'm going to talk a little bit about the transition that Blackboard has taken from enterprise to cloud, and through DevOps. So what do I do at Blackboard? Well, I build things, I think a lot, and generally I wear clothes when I do that, especially at the office. I manage people sometimes, but really I build things for Blackboard, and I generally try to make things better for everybody. And this presentation is about just that: making things a little bit better at Blackboard through DevOps. So, a little bit about Blackboard. Blackboard is the industry leader in educational technology and services. We've been around for 17 years in this space. We are currently privately held, but we spent many years as a public company prior to going private, with revenues of about $450 million a year; all I can say now is we have more than that. And we are headquartered in Washington, DC.
We have an extensive portfolio of teaching and learning, communications, analytics, security, and transactional services and products. These products serve all stages of education, from K through 16 and on into adult education. But today I'm going to focus on our flagship Blackboard Learn product, our teaching and learning platform. So, some more about Blackboard Learn. It is 17 years old. It is our first product. It was started by a small team at Cornell University, and it has its roots in Perl. Over the years, it has evolved into millions of lines of Java code. Many of those years were actually spent as a hybrid Java and Perl application, and anybody who's dealt with hybrid applications knows that's not an easy thing to deal with. We service this application through development and operations in seven offices worldwide, several in the U.S. and going as far as Australia and China. We have 700 people working on this product in development, testing, and operations.
Now, that's across all of our products; we have a lot of people who filter in and out of Learn and other products, so 700 total. We service about a thousand clients in our hosting organization, and that's accomplished on about 3,000 virtual machines, with about eight petabytes of content and data storage across those thousand clients. It's a large system, and we are a horse. And like so many of you in this room, we have dealt with long lead times — six-plus-month lead times. In fact, we had one release that spanned 18 months before we could get it out the door. Like any product that's 17 years old, we carry a lot of technical debt. Now, we do whatever we can to try and pay that debt down with each release, but inevitably features win out over debt, and we end up carrying debt release after release after release.
We have experienced some high update failure rates in the past. We've had releases that resulted in hours of downtime for our clients, and outright failures that required them to roll back to older releases — obviously something no company wants to put any client through. And we've dealt with the same communication issues that many of you have had, in both directions. The story I like to tell about a lack of communication — obviously I don't like to tell this story — is about a time when we were releasing a new version of Learn that had a queuing mechanism in it to deal with synchronization in the cluster and cache invalidation, and this queuing technology depended on multicast networking. So we built it, we tested it, everything worked great. We pushed it out to our operations team, neglecting to tell them that we needed multicast, and their network infrastructure was not designed for it. And it brought down the whole system. Nobody knew what was going on. It created a lot of fires. Bad communication. And we have had bad feedback loops in the past, too: the operations team would spend a lot of time and energy building scripts, building tools, building various things to make the application behave well in a production environment, but never tell us. So we were never able to put those into our product and make it easier for the operations team to operate the product when we were done with it.
But through the art of DevOps, we are now a better horse.
So what changed? A lot has changed at Blackboard over the last two years, but I'll focus on three key things: automation, cloud infrastructure, and culture, of course. So first, automation. This chart here — as Gene likes to point out, this is the graph he really liked — is the lines of code in our mainline Learn product, the Java code, ever since we started tracking it around 2005. What's the problem here? Well, the problem is that over time this product has grown and grown and grown. Especially when you start looking around the middle here, it is growing at such a pace that it's becoming this enormous product with so much complexity, so much insurmountable debt, that we were running into problems in both development and operations: significant failures in releases, and developers waiting far too long for the product to get built out. We needed a way to improve this.
We tried using different tools. We introduced better code management tools, better build tools. But in the end, there was no way to look at it any other way than to say that we were building a monolith. And with that monolith came all kinds of problems: poor code quality making it out into the field; slower release times, with clients having to wait longer and longer and longer for releases with fixes or new features; and more instability. When you have products like this, you have more dependencies. You have the butterfly effect of a change on one side of the product causing an error on the other side of the product that you would never expect. And internally it was causing a lot slower developer productivity. The time a developer would have to wait for any kind of feedback on a change they made was getting longer and longer and longer.
So we built more and more complex machinery inside to try and make things leaner and get better and better feedback. But all we were doing was building bigger, more complex machinery to deal with a bigger and more complex application. And in the end, it was resulting in 24-to-36-hour feedback loops on integration testing. A developer would write code, commit it, and have to wait 24 to 36 hours to get any feedback as to whether their change broke against somebody else's change. The other problem was that this was the nightly build — a bundling of all the changes that had happened during that day. And that meant that when there was a failure, which happened fairly frequently, we had to rip it apart, figure out whose code change actually caused the problem, and get a ticket back to them to try and fix it.
Mike, sitting right there, was hired at Blackboard by Steve Feldman and other colleagues this year to kick-start DevOps at Blackboard. He came in with a whole fresh set of eyes. We had a large organization that had gone through a lot of change, but we needed new eyes, new ideas, to come in and really force us to think outside the box and think of new ways to approach build pipelines, release cycles, and how we were building our product. The first problem we took on was the one I talked about earlier: we had this monolith. How were we going to make it easier to manage, get a better release cadence, and get fixes out into the field faster?
And to do this, we took on modularizing the product. What's that in the top corner of the chart? How do we get that line to go down at the end? Well, we actually had something built into the Learn platform that we could leverage. We had a module technology — a component technology — built in that had traditionally been used by third parties to extend our platform. We leveraged that same technology; we ate our own dog food and started building our product using it. This allowed us to start breaking apart the monolith and start understanding the dependencies between these components much, much better. It also allowed us to start building these components independently and getting better feedback to our developers. We also introduced new technologies — better, more modern tools, like the ones you've heard about so far today: Jenkins, Chef, Vagrant for virtualization, and Gradle as a new build tool.
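Once components declare their dependencies explicitly, working out a safe build order is just a topological sort. A minimal sketch of that idea — the module names here are hypothetical, not our actual component names:

```python
from graphlib import TopologicalSorter

# Hypothetical component dependency map: module -> modules it depends on.
deps = {
    "platform-core": set(),
    "content": {"platform-core"},
    "gradebook": {"platform-core", "content"},
    "assessment": {"platform-core", "gradebook"},
}

# A valid build order: every module is built after its dependencies.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)
```

With the dependency graph explicit like this, each component can also be built and tested in isolation, which is what makes the faster per-component feedback possible.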
And all of these combined allowed us to take that 24-to-36-hour integration feedback time down to 15 to 30 minutes. Now a developer commits their code, and by the time they get back from getting a cup of coffee, they can know whether their change integrates well with everybody else's changes. Additionally, because we were now doing commit-level builds, if a build failed, we would know exactly who broke it and when they broke it, and be able to automatically route the issue back to them and tell them to go fix it.
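Commit-level builds make that routing trivial: with one build per commit, a red build points at exactly one change. A sketch of the idea, with made-up build data:

```python
# Hypothetical commit-level build results, oldest first.
builds = [
    {"commit": "a1f3", "author": "alice", "passed": True},
    {"commit": "b2c9", "author": "bob",   "passed": False},
    {"commit": "c7d1", "author": "carol", "passed": False},
]

def first_breaker(builds):
    """Return the first failing commit-level build, if any.

    Because each build covers exactly one commit, the first failure
    identifies the exact change (and author) to notify.
    """
    for b in builds:
        if not b["passed"]:
            return b
    return None

breaker = first_breaker(builds)
print(breaker["commit"], breaker["author"])
```

Contrast this with a nightly build, where the failing batch contains every commit from the day and someone has to unwind it by hand.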
But of course there's more — that was only the first phase, and that only deals with integration testing. On the other side of that is functional acceptance testing. And again, any of you who have a long history know that functional testing really starts with manual testing. We as an organization had, years before we started doing this, already taken on the effort of automating that manual functional testing. And we had done a good job: we had taken a manual functional testing process that would take anywhere from one to two weeks to run down to running in about 14 hours, which was a great improvement in feedback. But it wasn't enough. If you look at the actual pipeline of what was going on, we have the great work we were doing around integration testing at the beginning: a developer works on a feature and spends half a day working on a commit to that code.
And within 15 minutes, they're going to know that it integrates well with everybody else's code and is working fine. But after that, you have your acceptance tests. Each suite would run about 40 minutes to an hour — and in this case, I'm assuming that one suite covers the code that was changed. The main problem was that even though we had automated these tests, the feedback from them was not a simple green/red result. It required human intervention — human analysis of the failures — to determine whether they were real failures, or environment issues, or something else. And that took an additional hour. And then it has to loop back again: these are nightly builds, so it's a culmination of all the changes that happened during that day, and you have to unwind it and figure out who broke it.
Things get much, much worse when you start thinking about the wait times. Again, we're doing okay at the beginning — within five minutes, we're going to get some feedback to the developer — but then you throw in this problem in the middle: 36 hours waiting for that nightly build. We need to make sure it's all okay, we need to ship it and install it on all these servers, and then we need to get these test suites running. And then you have to wait for all the test suites to finish running, because there's more than just one — there are dozens of them — before you can get all the analysis done and feed it back into the cycle. This was resulting in 9% process efficiency. It's horrible. This loop was taking three to six days for a developer to know that, at the acceptance test level, their code was working.
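That 9% figure is a classic flow-efficiency calculation: hands-on time divided by total elapsed time. A sketch of the arithmetic — the hours below are illustrative assumptions, not the exact figures from the talk:

```python
# Illustrative value-stream steps: (name, touch hours, wait hours).
steps = [
    ("write and commit code",  4.0,  0.0),
    ("integration build",      0.25, 0.1),
    ("wait for nightly build", 0.0,  36.0),
    ("acceptance suites",      1.0,  8.0),
    ("manual failure triage",  1.0,  4.0),
]

touch = sum(t for _, t, _ in steps)            # time spent actually working
elapsed = sum(t + w for _, t, w in steps)      # total wall-clock time
print(f"flow efficiency: {touch / elapsed:.0%}")
```

The lesson of the exercise is that the big wins come from attacking the wait columns, not from speeding up the work itself.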
Additionally, because these tests were not red/green tests, we couldn't know for sure whether a failure was a real failure. We were generating hundreds of failures. 60% of the failures were actually attributable to scripting issues — either they were invalid tests or tests that had fallen out of sync with the code. 30% were data or environment issues — data left behind by a previous test was breaking a subsequent test. 7% were pre-existing issues; and because we had this butterfly effect of a change on one side breaking the other side of the application, the same root cause could cause multiple different tests to break. And though we might be able to avoid running a test we know is going to fail, some other one might fail down the line, and you need to figure out: is this an issue we already know about, or do we have to open a new ticket? In the end, we were seeing only 3% newly discovered issues — again, not a very efficient process. And it was all evidence that we really had an inverted testing triangle: we were far too dependent on end-to-end GUI tests — a lot of browser clicking — and not enough at the integration and unit testing levels.
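A triage report like the one those percentages came from is easy to sketch: tag each failure with a root-cause bucket and summarize. The tags and counts here are hypothetical, chosen to match the proportions above:

```python
from collections import Counter

# Hypothetical triage tags for one night's failing acceptance tests,
# matching the proportions above: script 60%, data/env 30%,
# known issue 7%, new defect 3%.
failures = (["script"] * 60 + ["data/env"] * 30
            + ["known issue"] * 7 + ["new defect"] * 3)

breakdown = Counter(failures)
total = sum(breakdown.values())
for cause, n in breakdown.most_common():
    print(f"{cause:12s} {n / total:.0%}")
```

Seeing the breakdown this way makes the conclusion obvious: only a sliver of the noise was real defects, so fixing the test scripts and environments paid off before fixing the product did.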
So, long story short, we had these elongated test cycles. We were running three-plus months of testing, so of those six-month release cycles, half was being spent testing. It resulted in longer time to market: we were struggling against competitors that could get features and fixes out much, much faster than us, so we were losing customers to them. And we had reduced visibility back to our development team. That was evident in a couple of different ways: longer delays between coding and fixing, of course, and just far too much noise — we had to wade through all this noise to understand what was going on. So what are we doing now to take this on? We're actively working on this right now. We're starting to adopt test-driven development internally at Blackboard, which is great. We're also working on a fully automated acceptance test pipeline.
All of this is with the goal of trying to get our six-plus-month lead time down to one to two weeks — a very aggressive push internally that we're in the middle of right now, but we're making great progress toward it. So next I want to talk about cloud infrastructure. Developers had their deployment environments — the environments running on their machines, really only tuned for them to run by themselves. We had our test environments, which could scale a little bit more, had more data in them, and were good for testing multiple clients and multiple configurations. And then we had our production environments, really built for scaling, because they needed to run for clients in the field. But none of these environments were the same. And what made it even worse was that the environments were owned by different teams: the production environment was owned by our production operations team, and product development owned the development and testing environments. This created the typical "well, it worked in dev" issue, where we would do the work, push it out there, everything looked great, our testing was wonderful — but everything would go up in flames in production.
And it really came back to the fact that our environments were snowflakes; none of them were the same. We had different deployment models: developers with their own build scripts deploying to their own workstations; automated installers running in our test environments, but those were individual environments with clean datasets for testing; and gold-master virtual machines that we would roll out for production environments. So, totally different ways of deploying our application. And on top of that, different architectures: developers typically work on Windows machines. We do support Windows for our self-hosted clients, but none of our clients in our production hosting environments are on Windows — they're all on Linux and Oracle. So we have our developers working on Windows and our production running on Linux. And right there in the middle, we had testing that would run both — but they weren't even running clusters. We had clusters running out in production, and anybody who does clustered development knows there are a lot of issues that creep up in clustered environments that we weren't seeing in testing.
We introduced Chef, which was a great tool to start standardizing how we did configuration management within our environments. But again, we had two different efforts going on: our operations team was doing Chef development independently from our development organization, both trying to get to that goal of configuration management of deployment environments, but doing it completely separately. Then came the cloud. I was given the opportunity to build a super team of development and operations — our first true DevOps team, if you want to call it that. It was focused on a shadow project that would take our teaching and learning platform, Blackboard Learn, and move it into the cloud. We had run it on traditional hosting for years and years and years; we wanted to see the benefits of cloud architecture. We wanted to see what cloud computing could do for our clients in terms of scalability and reliability of the platform.
And of course, automation was our goal. We started from the ground up; everything was automated. What this allowed us to do was truly implement infrastructure as code: everything from orchestrating the environment up through installing and running the application was automated. And it's the same automation used across all deployment environments — whether you're running development, testing, or production, it is the same deployment code. And because we're using AWS, which makes it super easy to get environments, we're able to provide those same environments as self-service to our developers. They don't have to go to an operations team and get systems provisioned for them. They can run those exact same scripts: all they need to do is give us an SSH key, we give them an AWS key back, and now they are able to spin up real, production-quality environments for their own testing. It's a great environment.
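The core of "same deployment code everywhere" is that tiers differ only in data, never in logic. A minimal sketch of that pattern — the tier names, sizes, and instance types are assumptions for illustration, not our real configuration:

```python
# One shared deployment definition; per-tier differences are data only.
BASE = {"app": "learn", "monitoring": True}

TIERS = {
    "dev":  {"cluster_size": 1, "instance_type": "m3.medium"},
    "test": {"cluster_size": 2, "instance_type": "m3.large"},
    "prod": {"cluster_size": 8, "instance_type": "m3.xlarge"},
}

def render(tier):
    """Produce the environment definition for a tier.

    The same code path runs for dev, test, and prod, which is what
    kills the "worked in dev" class of failure.
    """
    env = dict(BASE)
    env.update(TIERS[tier])
    env["tier"] = tier
    return env

for tier in TIERS:
    print(render(tier))
```

In practice this data would feed an orchestration tool rather than a print loop, but the design choice is the same: environments differ by parameters, not by scripts.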
Our next phase is what we're calling BB Cloud, which is going to take what we've learned in AWS and move it into our own data centers. As an international education company, we have regulations, and we have clients that have issues with public cloud deployments, so we want a deployment model that works for all of our clients worldwide. We're going to have data centers in some places in the world that will be running OpenStack, and we're also going to continue deploying to AWS where it makes logical sense. But we want to abstract away that cloud architecture, so we're building a layer on top that allows our application developers to focus on application development — an abstract definition of what their orchestration should look like — and we'll take care of orchestrating it for them.
We centralized and standardized our Chef automation so that we have one team stewarding Chef within the organization, making sure we have good, clean cookbooks that everybody can use, whether they're in development or production operations. And of course, the most important piece is increased visibility for development. It's a top-tier requirement in BB Cloud to have monitoring — specifically performance monitoring using New Relic, StatsD, and centralized logging — completely, transparently visible to any developer who wants to see what's going on at any moment, not only in their development environments but in our production environments, too.
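Part of what makes StatsD attractive for this kind of visibility is how cheap it is to emit a metric: it's a fire-and-forget UDP datagram in a simple text format. A sketch of a timing metric, using the standard StatsD wire format (`name:value|ms`); the metric name is hypothetical, not one of our real metrics:

```python
import socket

def statsd_timing(name, ms, host="127.0.0.1", port=8125):
    """Send a fire-and-forget StatsD timing metric over UDP.

    UDP means the application never blocks or fails because the
    metrics collector is down, which is why instrumenting everything
    is safe.
    """
    payload = f"{name}:{ms}|ms".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload  # returned only so the example is checkable

print(statsd_timing("learn.page.render_time", 42))
```

A real deployment would use an existing StatsD client library rather than raw sockets, but the protocol really is this small.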
Now I obviously want to talk about culture. Everybody up here is going to say culture is the biggest part of instilling DevOps in an organization. And as an organization, we were very siloed. Like everybody who's going to get up here and talk today, we had teams that worked like this: a development team would build their product, package it up, hand it over to QA to do testing, wash their hands, and move on to the next thing. QA would run all their tests on this thing, and once it got to GA quality, hand it over to operations, wash their hands, and move on to the next thing. Now operations was left with this box, and they had to figure out what to do with it. We had documentation — the same documentation we would give to our self-hosted clients — but it wasn't enough.
There wasn't enough feedback coming from development, and there wasn't a shared ownership of "this is our product; we want to help you get it into operations efficiently." We would just hand it off and give it to them. It's not a very healthy organizational structure. Here are some of the quotes you would hear even within the development organization, from developers and testers: "QA is responsible for defining test strategy, not development." "QA is responsible for checking the quality, not developers." "Unit testing is not enough to verify that a feature works." There was a lot of confusion around what testing is and what quality is, and it branched out beyond testing and development into operations. There was a lot of finger-pointing between the organizations: "This app isn't performing well." "Well, it's because your deployment architecture isn't right for it."
"No, it was working fine until we deployed the last version. The software is broken; you guys go fix it." A lot of finger-pointing — not very helpful. So, in the end, the key to the cultural change at Blackboard ended up being executive buy-in. Once the teams saw that the leadership at the top wanted DevOps, the tide shifted; people started to get on board and started to really believe that it was the way we wanted to do things. The single biggest change that was made was bringing all those groups together: operations and product development all became one organization. We had people in leadership positions who truly cared about both development and operations. At the same time, we started placing development groups into the traditional IT and infrastructure organizations, and we started moving all of the application operations teams back within the application development organizations, so they would be sitting side by side with the developers, working on the same problems together. And of course, we automated all the things. Actually, you cannot automate cultural change — that's just not possible. That takes human capital, and that's what it took from our executive staff to decide this is what we were going to do.
So what have we achieved? Well, development teams are now deploying their own code into production, and that's a huge step for an organization that was used to just handing off code to somebody else and saying, "go do it, I don't care." We have developers solving operational issues without ticket escalation — it doesn't require going through tiers 1, 2, 3, 4, all the way up; it's literally, "oh, there's something wrong, I'm going to hop on there and fix it right now." We have open feedback loops coming back from operations on operational issues. Now that we have shared leadership, those issues all work their way up to the same leaders, who say, "we need to work on this." It isn't one group over here trying to solve it in their own bubble without working with the other team. And we're making data-driven decisions. We now have the tool sets and the infrastructure in place to gather data — we're gathering an enormous amount of data we've never gathered before from our production environments — and we're using that data to drive decision-making within our development organizations. And most importantly, we're talking, finally.
So, I am honored to have been invited to speak alongside some of the leaders in the DevOps movement, and I hope I can offer something back to the community here. Some things I might be able to help with: understanding the impact cloud computing can have on DevOps at a traditionally enterprise company; cost models of traditional and enterprise hosting versus cloud — we've done a lot of cost modeling at Blackboard around this; and how to frame a pitch to an executive team, if you're really struggling to get buy-in from senior leadership, as to why a DevOps culture could help push an organization forward. I also would love to learn from some of you about effective testing strategies for traditionally manual testing.
We still have to do some UI-level testing. What's the best way to do that with minimal impact on a pipeline, to be able to achieve a high level of throughput and still do that type of UI testing? And also, how to apply DevOps strategies to shipped-to-premises software. We still have clients that won't even deal with a hosted cloud structure; they want to run it on their own campuses. So how do we do DevOps for a product that we don't even have in our own data centers? Anybody that has experience with that, I would love to hear about it. And that's it. Thank you very much, and thank you, Gene, for inviting me. If you have any questions later, go ahead and email me.

Thank you, David. We have a couple of minutes for Q&A, and I've been reminded to repeat the question so everyone can hear it. Yes?
[Audience question, partially inaudible: how do you create a culture where pointing out a problem isn't threatening — where people can say "I did that" instead of pointing fingers?]
So, how do we transition from a blame culture to a high-trust culture? Well, I'll be completely honest and say we're still trying to get through that. It's fairly recent that we brought the organizations together, and we're finally sitting at the same table talking instead of blaming. But after years and years — I mean, the hosting organization has been part of Blackboard for 15 years, I believe — after that many years of being a separate organization and really having no recourse other than the blame game, it takes a long time to retrain people to stop thinking about who's at fault and start thinking about how to solve the problem. So we're better now than we've ever been before, but we still have a long way to go.
What was the action that made the biggest difference? Again, moving these teams together — finally having these operational escalation issues, the fire drills, not be just the operations team sitting in a dark corner of the data center, but literally the main conference room in our headquarters, with the senior VP of product development and all the teams that worked on the code that was having problems, all sitting in that room saying, "What's going on? Why is this happening?" Awesome, thank you. Can you share how many lines of code were in that mainline repo? I don't know the exact number, but it's a big number — tens of millions. Tens of millions of lines of J2EE and embedded Perl — and that was actually just the Java; it didn't even count the Perl. Awesome. Any other questions? Sure.
Did you have to hit rock bottom?
Really good question. I would say yes — personally, I feel like we did hit rock bottom. We had enough failed releases, enough client backlash over problems, and honestly, enough clients looking at competitors, that it really made us think about what we were doing and how we could do things differently to remain competitive in the marketplace.
I'll walk over to the next question.
The question: why didn't the execs jump in and solve the problems themselves? Why did they enable and empower you to solve the problem?
Well, not being the executive, I can't really answer for him, but I think some of it was visibility — just maintaining a level of separation from the people who were seeing the problems and how they were being reported up. We got a new executive team at Blackboard, and they are much, much more engaged. They are much more interested, and much more technical, than some of the ones we've had in the past. They wanted to get in there, they wanted to learn more, they wanted to know what was going on — executives willing to get on calls in the middle of the night to deal with issues. It was a new experience.
Awesome. One more, and then we'll conclude.
So the question is: dev wants to deploy, but can't, because of compliance and security.
So, education has a lot of compliance in it, too. FERPA is the big one — a lot of privacy rules around student information and not being able to divulge any identifying information about a student. But FERPA has some fine print about an organization that is acting as a sort of technical consultant to an institution having access to that data. So we actually don't have to deal with that, at least on our traditional higher education and K-12 side of things. We do have a government side of the business, where we're starting to move more and more into the FedRAMP and DIACAP areas of compliance, and that certainly will start to change the landscape of who can access certain environments. But as of right now, most of the regulations that apply to us don't really impact our ability to get into production systems.
I do HIPAA compliance in healthcare, and one of the things we came up with was to build scripts that export data and automatically obfuscate the sensitive information, so that our logs have, like, the first few characters of each field. You can verify that you're looking at the data as it was inputted, but you can't really see exactly what it was — building these little tools in the background so that we're not able to expose that data.
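The "first few characters" masking described there can be sketched in a few lines — the field names and data below are made up for illustration:

```python
def mask(value, keep=2, fill="*"):
    """Keep the first `keep` characters of a field and mask the rest,
    preserving length so data shapes and volumes stay realistic."""
    return value[:keep] + fill * max(len(value) - keep, 0)

# Hypothetical student record; none of this is real data.
record = {"name": "Jane Student", "email": "jane@example.edu"}
print({k: mask(v) for k, v in record.items()})
```

Keeping the original length and a short recognizable prefix lets testers confirm they're looking at the right field without ever seeing the identifying value.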
All right, so the comment was about obfuscation of data as it was being brought into the development organization. We did some of that at Blackboard, too. We wanted to bring in more production data sets to be able to test on, but there were concerns from our clients: "there's information in there we don't want you to have access to." So we built a series of tools and scripts that run that data through an obfuscation process that removes any identifiable information, because what we were really interested in was the relationships in the data, the volumes of data, and what the data looked like — not necessarily who it was or what they were doing. Awesome. Thank you so much, David.