Your Data Nerd Friends Need You

How the world of data analytics, science and insights is failing and how the principles from Agile, DevOps, and Lean are the way forward. #DataOps

Christopher Bergh

CEO and Head Chef, DataKitchen

Transcript

00:00:07

Hello everyone. My name is Chris Bergh, and you're going to hear about a topic called Your Data Nerd Friends Need You: how the world of data analytics and science and engineering is failing, and how the principles from Agile and DevOps and Lean are the way forward.

00:00:29

So let's start off with a really broad separation. In most of this DevOps conference, we're talking about transactional systems that take user input or machine input and persist it in a database, and they're built and run according to DevOps principles. But there's a whole other class of systems, analytic systems, that take data from transactional systems and try to analyze it. And they're kind of stuck in the dark ages. And so that's what this talk is about. It's really about those analytic systems and the people that operate them. We're going to talk about that problem, and also talk about the people and how having empathy for them can help, and also talk about the suffering that they're going through, what can be done, and what you can do to help. So everyone's heard about big data or machine learning or AI.

00:01:16

It's everywhere. Some people call it the new oil, and it's a huge market, a $189 billion market. And companies are getting acquired for huge amounts: $15.7 billion for Tableau, which is a reporting tool, whereas GitHub got bought for $7.5 billion. There are tens of millions of people creating insight from data. I think there are even more of them than software developers. Some people argue that one in 25 workers is actually somewhat involved in doing data and analytics. And it's also a field that's rife with failure: most projects that deal with data and analytics fail. Gartner estimates 60 to 85% of projects fail. Most organizations want to be data-driven at the highest level. Kind of like when Gartner says companies want to be digital, they also say companies want to be data-driven, and by their own self-reporting that is going down.

00:02:14

And most data science projects, where the work is done algorithmically, also fail to get into production. So it's not going well. And so what is that about? Well, in a lot of ways, the data and analytic industry is like the US auto industry in the seventies. You know, they're producing Pacers instead of Corollas. And so the cars that they produce tend to be pretty buggy. There's a lot of errors. When they want to change a model or get a new model, it takes weeks or months to deploy. And actually a lot of data and analytic teams are suffering, and the rate at which people are leaving the field is growing. And so if we look at data and analytics today, and we take just four characteristics of the systems that people use to do data and analytics (the cycle time at which they can get something from the fingertips of their data scientists or data engineers into production, how many errors they have or how many error-free days they have, how well they collaborate, and how well they measure), it's all pretty low.

00:03:14

And we actually did a survey last year with Eckerson, and they found the same thing: companies are pretty slow to deploy, have way too many errors, and have a host of other challenges. And so let's go back to the beginning and talk about these people who do data and analytics. I've had the fortune of spending half my career building transactional systems in various forms at various organizations, as an individual contributor and a manager, and then another half building these analytic systems, managing people who did it, and doing the work myself. And the people who build and run these analytic systems, they just took a different door than you. They sort of got their BS in computer science and went right instead of left, and they have lots of different roles.

00:04:00

You know, some people are called data analysts. Some people are called data engineers, who build and transform data. Some people are called business analysts, or analysts themselves, who do visualizations. There's the role of a data scientist, who actually does algorithms. There's managers, there's statisticians, there's database administrators. There's a whole bunch of people who have different roles in trying to work with data. And so if we look at it, in some ways all of the systems that you hear about at the DevOps Enterprise Summit are, broadly speaking, transactional systems: CRM, ERP, supply chain, a website, a financial system, an HR system. And all those sources, and then some more, actually get taken by these people and built into an analytical system and pushed out to customers. And so that's what our topic is really about: these analytical systems. And to build them, the people who work on them work in teams, and they've got different roles.

00:04:59

And so there's a data engineer, like I said, who is involved in accessing and transforming data. There's a data science person who applies algorithms like segmentation to data. There are self-service or analyst teams who actually visualize data. And then there's a data governance team who tries to catalog the data and make sure that they reduce the risk of data. And so these teams themselves just have a huge, fragmented tool chain. Every task involved in data analytics has, broadly speaking, a tool: tools to help them do visualization, tools to help them do algorithms and data science, tools to store data in a database, tools to transform data (ETL, extract-transform-load, or ELT tools), and tools to keep track of data in human-readable form, a data catalog. And in every part there's 50 different tools. Some of them are more code driven. Some of them are more low-code driven, but there's a huge market.

00:05:54

That's, you know, billions and billions of dollars every year, and people love their tools and they have their favorite tools. Some of them may be familiar to people here, like Python, and some of them may not be. And so the teams themselves actually work together. Let's just go through a simple scenario. So there's a data engineer. He or she sources the data from somewhere and loads it into a table in a database. Say it's a sales table: it's got Joe and Kelly, and it's got a name and a sales column. And then there's a data science team who takes that and actually puts an algorithm on top of it. Maybe they cluster it and say, who are our high-value customers and low-value customers? And after that, there's a self-service team.

00:06:38

And what I mean by self-service: it turns out in data and analytics, sometimes the people who do the nice charts and graphs work for the same team, and we'll talk about this, and sometimes they don't. But they're basically putting charts and graphs on it, and they can also add more data into it, maybe an owner. And there's a whole set of tools for self-service data prep. And then finally, at the end, the poor data governance people are trying to keep track of: where did the name come from? Where did sales come from? Where did the segment come from? Who's the owner? What's the lineage? And so they work together. And they also source data, like I said, from all this diversity of systems at different frequencies. Some of it's big, some of it's small, some of it's fast, some of it's slow.
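
[Editor's sketch, not part of the talk] The scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Table` class, the function names, the sales figures, the owner "Pat", and the fixed threshold are all hypothetical, and a simple threshold split stands in for whatever clustering a real data science team would use. Each step records which team produced each column, which is the kind of lineage the governance team is described as chasing.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """A tiny in-memory table that also records which team produced each column."""
    rows: list
    lineage: dict = field(default_factory=dict)


def load_sales():
    # Data engineering step: source the data and load it into a sales table.
    # The names come from the talk; the sales figures are made up.
    rows = [{"name": "Joe", "sales": 1200.0}, {"name": "Kelly", "sales": 300.0}]
    return Table(rows, {"name": "data engineering", "sales": "data engineering"})


def add_segment(table, threshold=1000.0):
    # Data science step: a threshold split stands in for a real clustering model.
    rows = [
        dict(row, segment="high" if row["sales"] >= threshold else "low")
        for row in table.rows
    ]
    return Table(rows, {**table.lineage, "segment": "data science"})


def add_owner(table, owners):
    # Self-service step: an analyst enriches the table with an account owner.
    rows = [dict(row, owner=owners.get(row["name"], "unassigned")) for row in table.rows]
    return Table(rows, {**table.lineage, "owner": "self-service"})


def build_report():
    # Chain the three teams' steps, the way the "factory" hands work downstream.
    return add_owner(add_segment(load_sales()), {"Joe": "Pat"})


if __name__ == "__main__":
    report = build_report()
    for row in report.rows:
        print(row)
    print(report.lineage)
```

The `lineage` dictionary answers the governance questions from the talk (where did the name, the sales, the segment, and the owner come from) for this toy case only; real catalogs track far more.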

00:07:19

Some of it's structured, some of it's unstructured, but they take all that and put it into what I think of as a factory. They assemble artifacts of data, and they have different steps along that factory where they put data somewhere, they group it together, they model it, they visualize it, they catalog it. And step by step, data is transformed. Artifacts are created, like charts and graphs, from data. And finally it gets to the customer. And so it's a lot like a factory, but it's also a lot like a development system, because they've got developer boxes and developer systems that need to be deployed continuously from dev to production. And so they've got this really contradictory set of challenges. And I felt this when I led data and analytic teams. One is you just never like to hear about data problems from your customer, because they call you up.

00:08:12

They yell at you, they are unhappy, and then they end up not trusting the data. And the whole point is to have business people trust the data so they can make data-driven decisions. And the other part is you've got to be able to get ideas from your cool data scientists or your data engineers, from their brains into production, to get feedback. And for a lot of data analytic teams, these two things are very opposite. You know, they run in fear mode, where things are going to break and they've gotten yelled at too many times, or they run in hero mode. And so these teams themselves, they may actually work for the same boss. A lot of you here at a DevOps summit may work for the CIO, the chief information officer of your organization.

00:08:59

There's a role that's parallel to that, which sometimes works for the CIO and sometimes not, called the chief data officer or chief data and analytics officer. And sometimes all these people work for that person, and he or she may have the full stack, from accessing the data all the way to getting value, but sometimes they don't. Sometimes the data warehouse still works for the CIO and data governance still works for the CIO. There may be a data science team somewhere off in the organization reporting to the CEO. And then there may be lots of different self-service teams embedded in different lines of business, all providing insight. And so there's this complicated organizational structure. The relationship between development and ops is nicely one-to-one in software, but in data and analytics it's, to use a database term, a many-to-many dev and ops relationship.

00:09:55

And let me talk about what that means. So I'm going to go through a couple of slides, and I'm going to go through four columns here. In the first column, that little graphic has a D in the middle for the development team. Maybe there's someone doing data engineering on that team, and data science, and they all sort of work together. And maybe it's harder because they're remote employees, or maybe they sit next to each other, and they've got to work together without conflicts. You know, the software industry has solved this with version control, and yet version control really doesn't have a huge amount of penetration in most data science and engineering projects. Probably a third of them, or less than a third, actually use version control.

00:10:35

And the development team has got some way, in a bigger organization, to push their work to production, and they need to deploy their work to production so it's safe and they can get feedback and actually get value to the customer. That's fine. That's sort of a dev and ops relationship. What's interesting about data and analytics is that there's a lot of self-service teams, and note these sort of light green Ds here: they have decentralized development. This happens quite a bit, and actually most big organizations are like this. So you'll have a central development team, like I said, maybe doing a data lake, a data warehouse, a data enablement team. And then every group in the company will have someone with Tableau or Looker, or someone maybe writing some Python on top of that dataset or around that dataset.

00:11:21

And so there's this kind of meshwork of people doing development. And then the challenge is the end customer sees the combination of maybe that person using Tableau or Looker and the data warehouse team's data all together. It also gets challenging when the push to production is complicated. So you may be able to go in and push a button in Tableau to deploy to Tableau Online, but in a centralized development team, you may have to go through stages, a dev, a QA, a prod system, to get to production. And so you've got this collaboration complexity in data and analytics. It's actually much harder than in software development, because you've got to account for this local and global, centralized and decentralized aspect, and multiple ways to get to prod. And it makes it really hard when the CEO says, "Something's screwy in my dashboard." Which team did it? Where do they live? Which dashboard is by which person? And okay.

00:12:21

That was that dashboard within the line of business. And, oh yeah, the calculation was done by a data science team, but the data actually came from a third group. It came from the data warehouse team, and then that data came from IT somewhere. And so, at the end of the line, having to fix this, and fix it quickly, makes it really hard to be a part of these teams. You know, when I was a software engineer, when we were developing code, we'd sometimes talk about Conway's law and say, oh, Conway's law sucks. It's bad if you're developing something according to Conway's law: you're not following the natural way tech wants to go together to be the optimal solution. You're putting partitions in your technology that are based upon the way some organization decided, you know, some VPs in a meeting decided, people should work.

00:13:14

And I think data and analytics follows that quite a bit. How we partition the value chain that takes data and goes to insight very much follows Conway's law, and these pipelines, or more specifically meta-pipelines, you know, the pipeline for a data warehouse, the pipeline for a visualization, the pipeline for a model, are all sort of broken up according to Conway's law. And that actually just makes it really, really hard for teams to work together. And so it comes to this point that a lot of data and analytics teams are suffering. They're not having a great time of it. You know, Gene last year wrote this book called The Unicorn Project. In there, he talked a lot about transactional systems, but he had one chapter where he focused in on these analytical systems, and he talked a lot about the Five Ideals, but also about the sort of hero culture or fear culture that goes on in these data teams, because a lot of them in some ways wear the sort of hair shirt of pain. Okay.

00:14:17

I'm getting crappy data from someone, my customers don't trust it, I've got to work on Saturdays to fix it. And so you end up accepting the fact that you've got really high error rates, or that you've got to have a fear culture or a hero culture. And also a lot of these teams just don't have any automated tests. In fact, most organizations barely even monitor the data as it's flowing through their analytical systems to see if it's, you know, even ballpark correct. And there's a lot of technology review boards to get through to be able to get things out to production. And so average companies are taking weeks or months to deploy: not continuous deployment or continuous integration, just manual, step-by-step deployment with a spreadsheet. And so, you know, a lot of companies want to be data-driven. They're fearful of the big Silicon Valley guys, or they've gone to the Gartner conference and they've heard about it.

00:15:13

And the idea of being data-driven is really important. And the teams that help support a company being data-driven, the data science and data analytic teams, are not doing well. They're not being successful. I've experienced that personally, managing teams and having customers, like I said, challenge me when things are late, employees wanting to try out new tools, and just being desperate to try to get some insight into my hands and my customers' hands. And so about three or four years ago, this term DataOps, which is kind of intentionally like DevOps, started to have a bit of a moment. My company itself has been around for about six years, and being engineers, we weren't really good at naming. We started to talk about what we did as agile analytic operations or DevOps for data science or one guy.

00:16:06

We tried "agilytic ops" for a while, and that didn't work. And then we tried "analytic ops", and that didn't work either, because if you shorten it down, it's not the best word. So the idea of DataOps is kind of gathering momentum. For one, this company, Gartner, put it on its hype cycle, and the number of search requests is way up, 500% over two years. We wrote a manifesto that has over 10,000 signatures, and people are starting to talk about it and use that term more. And I think what it really is across all of them, and like any term in tech it's being used in different ways, is this sort of definition of how you adapt an Agile and Lean and DevOps mindset to the world of data and analytics. And so from a definitional standpoint, DataOps is a set of technical practices and cultural norms and architecture patterns that enable, and the first thing is, really rapid experimentation and innovation.

00:16:59

So to give new insights to your customers, you need to be able to do it rapidly and try things out, because data and analytics, whether it's a data science model or a visualization or just data, is a river of questions. And these teams are constantly trying to paddle down this river to find the place where they can give a unique piece of insight to their customer. And so: rapid experimentation, but with low error rates, where an error is the data being wrong or the data being late. And then collaboration across that complicated set of dev and ops, that many-to-many relationship, and the complicated tool chains that they have across cloud and on-prem and the different tools. And then finally just measuring: how do you measure a system? And so you may, in your organization, have teams who are starting to hear about DataOps.

00:17:53

And of course, if you're here, you've already heard of DevOps, and your CIO may be saying, we're going to be agile, and we're going to do DevOps. And when you walk across the hall to the data science or the data engineering or the warehouse team, they're like, ah, how do we do that? And so we've actually written quite a bit about the differences between DevOps and DataOps. At a high level, they're kind of the same, because they're all about: how do you have small batch size and a safe culture? How does everyone deal with this technically complicated thing where, you know, we're all sort of touching a piece of the elephant? And I think one of the main differences is this many-to-many relationship between people who are doing development and people who are doing operations, which is more complicated in data analytics. And then the sort of factory methods, statistical process control, the manufacturing of data sets, I think are just much more prevalent, because the manufacturing of insight from all those transactional systems is something that every data and analytics team does.
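
[Editor's sketch, not part of the talk] As a rough illustration of what statistical process control can look like when applied to data, the Python below computes control limits from a history of daily row counts and flags a new load that falls outside them. The function names, the three-sigma default, and the row counts are all assumptions made for illustration; the talk does not describe a specific method.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Return (lower, upper) control limits around the historical mean."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation of past measurements
    return center - sigmas * spread, center + sigmas * spread


def in_control(value, history, sigmas=3.0):
    """True if a new measurement falls inside the control limits."""
    lower, upper = control_limits(history, sigmas)
    return lower <= value <= upper


if __name__ == "__main__":
    # Hypothetical row counts from the last five daily loads of one table.
    daily_row_counts = [1000, 1010, 990, 1005, 995]
    for todays_count in (1002, 400):
        status = "ok" if in_control(todays_count, daily_row_counts) else "out of control"
        print(todays_count, status)
```

The same pattern could be applied to other per-run measurements, such as null rates or totals, with the limits recomputed as history accumulates.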

00:18:50

And so what we found is that people who adopt DataOps tend to focus on things similar to what you would think software teams focus on. And the main idea is that there's a system in which they can work in a better way. And the characteristics of that system are that it allows them to deploy quicker, and in the data and analytics world, that doesn't mean CI, or even CD. It means instead of three months, they can get down to three days. That's a miraculous change in a lot of organizations, let alone going from three days to three seconds. And then just trying to run with low errors. So many companies nowadays have no idea if the analytics they put out are right. They have no idea when they deploy something to production.

00:19:42

If it's going to actually work. And the teams themselves are just really frustrated and really beaten down. And I think that's one of my emotional motivations to talk about this to a DevOps audience: I think there's an opportunity for the skills that you have from DevOps to help these teams. And I think there are a lot of cases where best-in-class people can do it all with DataOps. They can have fast cycle time and deployment, and as a result have lower costs and happier customers. And so, what can you do to help them? I guess the first thing is that you already have, by being at this conference, a set of ideas around DevOps: what it means to work in an agile way, what it means to iterate, what infrastructure as code is, what having scripts is, what having a safe culture is, how to work from bottlenecks. And those principles themselves are a respected, powerful set of principles that can influence how a team works.

00:20:46

They need advocates in data science and analytic teams, because the mental model they have is all about: I gotta do the next chart, I gotta do the next model, I gotta do the next data set, the next table. It's all very focused on the features that they want to get out and not the system that they want to build. And if anything, the DevOps idea is that you can build a system in which to do work, where people end up being able to do more work that's better. And so that system-wide idea, I think, is important. And so, you know, when you go back to your company, if you're having lunch with someone on a data science team, or in data engineering or data warehousing, ask them some impertinent questions, like: are you using source control for your work?

00:21:33

Simple question. You know, is your model in Git? Is your visualization in Git? How many automated tests do you have in production? Are you observing the data as it goes through your production systems to see if it's right? Because they're in some ways your customer. These operational systems that you guys run in a DevOps world, your CRM system, your website: the tables and the logs that come out of them are actually the input into an analytic process. And so, you know, how do you know if they haven't gotten crappy data? They need to automate and test. And then ask other questions. Do they have regression tests, functional tests, or unit tests? Do they just have unit tests? You know, there's some companies who have a couple of unit tests on their ETL, and then they've plugged some stuff into Jenkins, and they think they're doing miraculously.
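
[Editor's sketch, not part of the talk] To make "automated tests on the data in production" concrete, here is a minimal Python sketch of checks that could run every time the sales table from the earlier scenario is loaded. The function name, the specific rules, and the message wording are hypothetical; real teams would choose checks that match their own data.

```python
def check_sales_table(rows, min_rows=1):
    """Return a list of human-readable failures; an empty list means the data passed."""
    failures = []

    # Volume check: an empty or truncated load is a common silent failure.
    if len(rows) < min_rows:
        failures.append(f"expected at least {min_rows} rows, got {len(rows)}")

    for index, row in enumerate(rows):
        # Completeness check: every row needs a customer name.
        if not row.get("name"):
            failures.append(f"row {index}: missing name")

        # Validity check: sales must be a non-negative number.
        sales = row.get("sales")
        if not isinstance(sales, (int, float)) or sales < 0:
            failures.append(f"row {index}: bad sales value {sales!r}")

    return failures


if __name__ == "__main__":
    good = [{"name": "Joe", "sales": 1200.0}, {"name": "Kelly", "sales": 300.0}]
    bad = [{"name": "", "sales": -1}]
    print(check_sales_table(good))
    print(check_sales_table(bad))
```

Unlike a unit test that runs once at build time, a check like this is meant to run against the live data on every pipeline run, so that bad input is caught before a customer sees it.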

00:22:25

And so how long does it take to deploy a model or some ETL code from dev to production? Is it automated or is it a manual process? How up-to-date is your development environment? How often are your business users finding errors in the data? And every one of you who listens to this, I think, knows what kind of answers you want, right? You want fast deployment, you want low errors. You want to be able to have a new person come into your group, get a development environment, and be able to fix a simple bug and deploy to production their first week. Most data and analytic organizations have answers to this that would shock you, that would almost make you want to tear your hair out. And so I think asking these important questions can give you some motivation to say, hey, these DevOps principles, or DataOps principles, or whatever you want to call them, are a way for you to help these teams.

00:23:24

And so, as DataKitchen, we've written a book about DataOps, which is free. We've written a manifesto. And actually, on our website, Gene Kim was kind enough last year to let us give out a one-chapter excerpt of The Unicorn Project that talks about Project Panther and the data warehouse and the challenges that they solved. And we've got some software that can help with it too. But I just want to thank you and say, there is an opportunity for anyone who does DevOps to help out their data and analytic team. I think it's a growing field. It's got some great trends behind it, and it could use your help. So thank you very much and have a great day.