Las Vegas 2019

Your Data Nerd Friends Need You

How the world of data analytics, science and insights is failing and how the principles from Agile, DevOps, and Lean are the way forward. #DataOps


Christopher Bergh is the CEO and Head Chef at DataKitchen. Chris has more than 25 years of research, engineering, analytics, and executive management experience. Chris has an M.S. from Columbia University and a B.S. from the University of Wisconsin-Madison.


Chris is a recognized expert on DataOps. He is the co-author of the "DataOps Cookbook" and the "DataOps Manifesto", and a speaker on DataOps at many industry conferences. Chris began his career at the Massachusetts Institute of Technology's (MIT) Lincoln Laboratory and NASA Ames Research Center. There he created software and algorithms that provided aircraft arrival optimization assistance to Air Traffic Controllers at several major airports in the United States. Chris served as a Peace Corps Volunteer Math Teacher in Botswana, Africa.


Chris Bergh, CEO and Head Chef, DataKitchen


Transcript

00:00:02

I met the next speaker earlier this year. And I'm so grateful for how he helped me get up to speed on the vast data problems that almost every large organization faces, trying to get data from where it resides to where it needs to go. I mean, because it's so often stuck in systems of record, data warehouses, fragile ETL processes, often requiring months to get data to where it needs to go, which is in the hands of the developers so that they can use it in their daily work. And during those conversations, it sort of reminded me of presentations that went back all the way to 2014, whether it was Telstra, uh, the Heather Mickman story at, uh, Target. We heard them here through Adidas, Optum, John Deere, National Bank of Canada, and so many others. Uh, after several conversations with this gentleman, I got so excited that this actually became one of the central elements, uh, of The Unicorn Project.

00:00:44

And he spent tens of hours with me on the phone, often on weekends, to review scenes with me so that I could hit my writing deadlines. Um, so I'm very pleased with how it came out, and I'm so pleased it's one of the central elements in the book. Um, in fact, by the way, just to share with you, one of my favorite new scenes is that, uh, in the middle of the catastrophic Phoenix release, uh, Brent gets pulled into a meeting, uh, like an urgent call, because all the prices have disappeared from the e-commerce site and the mobile app. And the reason is that someone in marketing uploaded a CSV file with a byte order mark, basically making sure that, like, none of the fields matched. And someone actually did that to me. That person was Dr. Nicole Forsgren, who passed me a CSV file with a byte order mark, and it took me, like, a half day to figure out how to fix it. Anyway. Um, I asked Chris Bergh, CEO and Head Chef at DataKitchen, to share with you what he taught me, all of the terrible, horrible problems that exist in the data space, and how this community can help. Please welcome Chris Bergh. Sweet.
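The byte order mark problem in that story can be reproduced in a few lines. What follows is only a minimal sketch: the file contents, the column names `sku` and `price`, and the helper `read_rows` are invented for illustration and do not come from the talk or the book. In this sketch only the first header is affected, which is typically enough to make a lookup by column name fail.

```python
import csv
import io

# Hypothetical price file as a spreadsheet tool might export it:
# UTF-8 text with a byte order mark (bytes EF BB BF) at the very start.
raw = b"\xef\xbb\xbfsku,price\nA100,19.99\nB200,5.49\n"

def read_rows(data: bytes, encoding: str) -> list:
    """Decode the bytes and parse them as CSV with a header row."""
    text = data.decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))

# Plain "utf-8" keeps the BOM as an invisible character glued to the first header.
broken = read_rows(raw, "utf-8")
print(list(broken[0].keys()))   # the first key carries a leading '\ufeff'
print("sku" in broken[0])       # the expected column name is not found

# "utf-8-sig" strips a leading BOM if one is present.
fixed = read_rows(raw, "utf-8-sig")
print(fixed[0]["sku"], fixed[0]["price"])
```

Decoding with `utf-8-sig` is harmless when no BOM is present, so it may be a reasonable default for files that people upload by hand.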

00:01:56

Hi everybody. I'm Chris Bergh. So where did that Baby Shark meme come from? Like, I remember my son, like 15 years ago, doing Baby Shark at preschool, and now it's in Lebanon, and now we're doing it here. So how would you understand where the Baby Shark meme came from? Well, you would do something called data analytics, and I'm sure all of you have heard of it. But the purpose of my talk is to actually say that what you know here, the ideas in DevOps, the ideas in Agile or Lean, actually apply to a whole different domain that's bigger, that has more potential, and has more people than the systems that you're working on. So the title of my talk is "Your Data Nerd Friends Need You." So what am I going to talk about? First of all, it's a big problem. And the people who are working in it, I'd like you to get some empathy for them.

00:02:47

Like, they are really, I think, suffering, and they're a lot like you and I. And I'm going to have an ask at the end: I'm going to ask you to help. So pay attention. So you can't walk through an airport nowadays and not hear something about data. You can't watch a sports presentation without hearing analytics. And some people call data the new oil, and there's a lot of buzz around data. And so whether it's big data or small data, whether it's streaming data or batch data, whether it's structured data or unstructured data, there's a whole bunch of techniques that people are applying to data. Some people call it machine learning or AI or data science or data lakes. There's just a lot of buzz around it, a lot of acronyms going on, and it's exciting, right? And it's a huge market. It's a $189 billion market for tools and technology.

00:03:43

I think everyone was sort of proud of GitHub. Maybe you weren't proud that GitHub was bought by Microsoft, depending upon your political persuasion, but it was seven and a half billion dollars. So just recently a company that does data visualization, in fact, one of the 50 companies that does data visualization, was bought for $15 billion. And then the next day another company that does data visualization was bought for $2 billion. And there are tens of millions of people whose job, either full-time or part-time, is taking data and trying to wrest insight out of it in some way. It's more than software developers. Some people estimate it's one in 25 workers. So if there's two or 3 billion workers in offices, you can do the math. It's a lot of people.

00:04:35

And so it's a big problem, um, because a lot of this stuff is just not working. It's failing. Most data science projects fail. Only one in five, one in 10 models goes from a data scientist's keyboard into production. You know, if you go to a Gartner conference that covers this space, they talk about wanting to be a digital company. Well, in the data space, Gartner talks about being data-driven as the high-order bit that you should focus on. And yet most companies are saying that they're less likely to be data-driven than they were even a few years ago. And there's this set of numbers, 60, 80. And I've had a long career, right? I've been a teacher, I was a Peace Corps volunteer, I was a researcher at MIT and NASA. I did enterprise software development and was a leader, a CTO; ran teams, wrote code. And then about 15 years ago, I started to focus on data and analytics full time. And in that, whether you call it a data lake or a data warehouse, or whether you call the application of it machine learning or BI, most of these projects always fail.

00:05:47

And so why is that? Why are they failing? That's the question. So I'd like you to walk down the hall when you get back to work, to the people who do this in your company. And I think you'll observe a few things: that they really have poor quality and just a high error rate on what they do. Minor changes to a model, to some SQL code, to a visualization can take months. And a lot of their work is kind of hijacked by unplanned work. And they're kind of beaten up, or, you know, they're oversubscribed resources. There's sort of a hair-shirt culture of, like, "Give it to me. I'll take it." Does that sound familiar to anyone?

00:06:33

So let's talk about these people and why I'd like you to have empathy for them. So they have lots of different roles. Some people are data scientists, some people are data engineers. Some people do data visualization. Some people are statisticians. Some people are architects. Some people are administrators, some people are managers. But what's interesting is they're just like us. It's as if you got your BS in computer science and upon graduation you took a different door: you went through the data science and analytics door. And the work that they do is on this incredibly complex toolchain, where every one of those roles has a tool that they use. And some of those tools, in our parlance, are kind of low-code development tools, which are configuration. Some of those tools actually are code, like Python or R. And people love their tools.

00:07:27

And if you go into your organization, you're going to find three or four different tools that do data work, three or four different types of tools that do visualization work. You're going to find the prevalence of Excel doing all of it. And so there's this complex world, and they're just like us. So let's talk about it a little bit. So they work in teams. There are teams of data engineers and data scientists and people who do self-serve analytics and people who do data governance. And they have different personalities. Your data engineers tend to be kind of like backend software engineers. They're a little grumpy; they, like, carry a lunch pail. Your people who do data science are probably just like the algorithm people on a team. They tend to be kind of mad scientists. The people who do self-service or visualization, they see themselves as data artisans. And then there's governance people, who are like governance people everywhere.

00:08:24

So I don't mean to go into that. So let's walk through what they do on a typical day. So, to start off, a data engineer sources data from somewhere, and there are different systems that you've created in different places in the organization. And it's a simple case here: there's some sales data. And then there's a data science team that takes that and applies some algorithms to it. Maybe they segment or cluster it, and they create high-value and low-value segments. So they're adding to that data. And then there's a self-service team who's actually visualizing that for an end customer. And the self-service team may not only do the visualization, but they may add more data into it. Perhaps they're going to add an owner, a west team and an east team, and they have these little small data files that they're adding in.

00:09:13

And so they're actually doing a lot of the data work themselves, in addition to the work that the data science team is doing. And then along comes the data governance team, just trying to catalog it, saying, well, where did this data come from? Where does it go? And so they work together, and that's actually something that's changed in the last 10 or 15 years. We've kind of moved on from the heroic age of data and analytics, where it was one person who could do everything, the full-stack data scientist. And I speak at data and analytics conferences all the time, and there still are the data scientists who are trying to be heroes, but it really is a team sport.

00:09:51

And there's this gigantic, massive, fragmented toolchain that people use. So, for instance, there are tools like Tableau to do data visualization. There are tools to do self-service visualization, where a business user can do it. There are tools like Cognos, which are more IT-focused. There are tools to actually do what are called augmented analytics. There are storytelling tools; there are tools to turn analytics into words. It's a huge amount of cool stuff. And there's been a blossoming of data science tools that help people do algorithms, to democratize that work. There is a whole segment called [inaudible], as well as stalwarts like SAS and open source tools like Python. There are data catalog tools. There are ETL tools (extract, transform, load), um, and they basically take data and massage it and transform it and change its form. And that's another 50 companies that do that. And then there's just a variety of databases that are tuned to analytics. They tend to be clustered and parallel. You've heard of Hadoop and Spark and Snowflake and Redshift, but they're actually more attuned to analytical needs because they can do joins really fast. And that's something that you really need to do in analytics.

00:11:07

And so they may work for the same boss. And what's happened in the last, I don't know, five or 10 years is that there's a role called a chief data officer or a chief analytics officer that's kind of in parallel to a CIO. Um, their job is to help the organization become data-driven, and they play both what's called offense and defense. So they're trying to guard the organization against data breaches. They're the one who gets hung if data leaves the organization. But they're also trying to get the organization to stop thinking about analytics as building a house, as doing reporting, and more as a continuous delivery of value, more as a process of delivering insight over and over again. And so there are conferences for CDOs, there are organizations, and you can look: the number of them has grown exponentially.

00:12:00

And so they all may work for the same boss, and that's cool, but they also may not work for the same boss. So probably in your organizations you have a data warehouse team and a business intelligence team, and they probably work for your CIO. But a lot of organizations have teams who are doing self-service analytics embedded in every line of business, and they use different tools and also are doing the full stack of work on data. Or your CEO may have read an article about data science or AI and has commissioned a data science team that works for him or her. And so the org structure is not clear. Um, and also what's interesting is that there's a many-to-many dev-to-ops relationship.

00:12:45

So the data engineering team, and maybe the data governance team, will perhaps be working in a traditional IT structure. They'll have a dev system, a QA system, a UAT system, and a production system. They'll have a documented, albeit slow, rollout from dev to production. But the self-service team has a button to push. They can actually take a set of reports in Tableau and push a button and put it into production. They also have tools like Alteryx to actually do what's called data prep. And there's a whole new market that's emerged over the last few years in self-service data prep tools. Uh, and so there's not one dev, one ops. There are these teams that are actually pushing things to production all the time. And so it's hard to know what's going on.

00:13:35

So let me give you an example. So let's say you've got two teams: a home office team who gets data from a bunch of internal sources and builds a warehouse using SQL Server, um, using SSIS, doing some Python code. You know, they live in Boston and they have sort of a weekly cadence of working. Now there's another team that's actually in New Jersey, and they're using Alteryx and Tableau, and they're making changes, but they're making changes every day and every hour. And the person who receives the value is that VP of marketing. So they don't know that; they only see a report. They don't know that that report comes from two different groups with two different cadences and two different dev and ops relationships. They just know that the report needs to be right.

00:14:23

And so this creates a bunch of problems. So one of the problems is typical: in software development, you have this with the backend/frontend relationship, right, where you may have a REST API or a GraphQL API, and then a JavaScript web app in the front. And so, uh, there's a similar thing in, uh, data and analytics, where you make a change in a schema in a database, and then the reports could break because you've added or subtracted a column. And there are lots of organizations now who cannot do that, because they have thousands of reports in many lines of business, and they have absolutely no idea if things are going to break or not. So they can't change a schema, or they change a schema through a procedure: let's have meetings, let's roll it out, let's send emails. And then when they actually do it, it breaks and people get yelled at. Or take your team in New Jersey, which has these self-service tools.

00:15:18

They may be preparing these small data sets that only they know about, and they're keeping track of them. And maybe those data sets should be put in a central repository and shared with people and governed in a certain way. And one of the challenges is with these kinds of low-code tools. You know, there are low-code tools in software development, like Mendix, right? You can think of Tableau as a low-code development tool for data and analytics. And in Tableau, you can build charts and graphs, but you can actually embed Python code. You can build an if-then-else. You can create a calculated field. And to me, an if-then-else is an if-then-else, whether it's in, you know, Gene's favorite tool, Clojure, or whether it's in Python, or whether it's in Tableau. It's an if-then-else.

00:16:05

And so that is business logic that needs to be kept in source control, that needs to be tested. And so what happens in a lot of these organizations is I'll make an if-then-else in Tableau and it'll work, and that'll be my market basket calculation. That report will get copied and copied and copied, and that if-then-else will end up in 10 different reports. So how do you know, if you have to change that if-then-else, which report it's in? How do you find it? And so calculation consistency across reports is not there. And just, how does this work when you've got a team and I want to put some new data in, I want to change some schema, I want to have a new report? It's hard and slow to do this.
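The "no idea if things are going to break" problem above can be made concrete with a small dependency check. This is a minimal sketch under stated assumptions: the registry `REPORT_COLUMNS`, the report names, and the column names are all hypothetical, and a real organization would have to extract such a mapping from its own reporting tools.

```python
# Hypothetical registry: which warehouse columns each report reads.
REPORT_COLUMNS = {
    "weekly_sales": {"order_id", "region", "amount"},
    "market_basket": {"order_id", "sku", "quantity"},
    "exec_dashboard": {"region", "amount", "segment"},
}

def reports_broken_by(old_schema: set, new_schema: set) -> dict:
    """Map each affected report to the columns it needs that the change removes."""
    removed = old_schema - new_schema
    broken = {}
    for report, needed in REPORT_COLUMNS.items():
        missing = needed & removed
        if missing:
            broken[report] = missing
    return broken

old = {"order_id", "region", "amount", "sku", "quantity", "segment"}
# Proposed change: drop "segment" and add "customer_tier".
new = (old - {"segment"}) | {"customer_tier"}

impact = reports_broken_by(old, new)
print(impact)
```

Run before a schema change ships, a check like this turns "hope nothing breaks" into a list of reports to fix first. The same registry idea could, in principle, record which reports embed a given calculation.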

00:16:51

So let's look at it another way. So all of you work on systems, and they have acronyms: ERP, CRM, supply chain, website, financial, HR. Data comes from open data or syndicated data. Data comes from databases or APIs. Um, maybe you're doing some cool stuff. Maybe there are, uh, immutable data stores in Kafka where it's streaming in; maybe it's batch. But there's a whole group of people who are your customers, who are living on the exhaust of your system. And if anything, just remember those people when you're developing. They are your customers. And they may be interested in the transactional data that's in a database. They may be interested in the clickstream data. There's lots of data that they may be interested in. And so, as an example, what type of analytics are people interested in? Well, one example is the customer journey. They come from a website; maybe they're cookied.

00:17:40

Maybe they came from a marketing campaign. Maybe you get a login with just their email address. Then they may do more actions on the website. Then they may actually become a customer. That customer may be in support; that customer may have some delivery issues. And that whole journey is not in one system. And someone in an organization is trying to understand that journey, trying to understand what marketing activities cause what results. Someone in finance may be looking at an aggregate and trying to do a P&L of the customer journey. And so if you're on the data side, you're, like, trying to piece together all these things. And if you're an organization that has lots of lines of business, what a customer means is incredibly varied. You may have a customer in banking, you may have a customer in investment, you may have a customer in another place.

00:18:30

And so just trying to get an idea of who your customer is, trying to get the entities that you're thinking about, is hard for these teams because there's no common ID. In fact, there's a whole set of tools in data and analytics called master data management tools. There's a whole set of techniques to do record linking. And so one thing to take away is you've got people who are just like you, who are living off your exhaust and trying to make actually really important and valuable decisions. Um, and so, uh, the metaphor I like to use is they run a factory of insight.
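To give a feel for the record-linking problem mentioned above, here is a deliberately simple sketch that links records from different systems on a normalized email address. The systems, IDs, and field names are hypothetical, and real master data management typically relies on fuzzier matching across several attributes; this only shows the deterministic, single-key case.

```python
from collections import defaultdict

def normalize_email(email: str) -> str:
    """Lowercase and trim so trivially different spellings compare equal."""
    return email.strip().lower()

# Hypothetical records from three systems, each with its own ID scheme.
records = [
    {"system": "web", "id": "cookie-81", "email": "Pat@Example.com "},
    {"system": "crm", "id": "C-1007", "email": "pat@example.com"},
    {"system": "support", "id": "T-553", "email": "PAT@EXAMPLE.COM"},
    {"system": "crm", "id": "C-2040", "email": "lee@example.com"},
]

def link_by_email(rows):
    """Group (system, id) pairs under one normalized email key."""
    linked = defaultdict(list)
    for row in rows:
        linked[normalize_email(row["email"])].append((row["system"], row["id"]))
    return dict(linked)

customers = link_by_email(records)
for email, ids in customers.items():
    print(email, ids)
```

Even this toy version shows why there is no free common ID: the link only exists once someone decides which attribute to trust and how to clean it.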

00:19:06

And so when I started in data and analytics, uh, you know, I was a CTO, I'd run development teams, and about 2005 I thought, oh, I'll go into data and analytics. It's kind of like, they copy and paste the data. This isn't too hard. You know, data is data. I didn't really pay much attention to it. You know, I thought data people were kind of lesser beings. And you know what? They're not. These are really complicated engineered systems. And think of it like a factory where data comes in. They have to access data; maybe they're using Python code. They have to transform data; maybe the data is sitting in a database and they're using some SQL code or ETL code. They want to segment it or model the data; maybe they're using R. Again, they want to visualize the data and report on the data.

00:19:49

So imagine these are stations on an assembly line, and data is being transformed. Artifacts are being added, and this is a pipeline. And there's not one of these pipelines in organizations. There are hundreds of these pipelines that are running. Some of them have to run every hour. Some of them run once a week. Some of them are event-driven. Some of them are batch. Some of them are scheduled. But these pipelines are complicated. And this is the world, and you have an operations team trying to look at these pipelines. So think of it like a factory. And so, like a factory, the principles of Lean apply. So this actually is a source of data that you can analyze to understand how your factory's working. You can have an Andon cord in this to stop the factory. In fact, everyone should have an Andon cord to stop it.
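The factory picture above, with stations, measurements, and an Andon cord, might be sketched roughly as follows. All names here (`AndonStop`, `run_pipeline`, the two stations, the single check) are hypothetical illustrations, not taken from any particular orchestration tool.

```python
import time

class AndonStop(Exception):
    """Raised by a station's check to halt the whole line."""

def access(_):
    # Stand-in for reading from a source system.
    return [{"region": "east", "amount": 120.0}, {"region": "west", "amount": 80.0}]

def transform(rows):
    # Stand-in for a SQL or ETL step: add a derived column.
    return [{**r, "amount_cents": int(round(r["amount"] * 100))} for r in rows]

def check_not_empty(rows):
    if not rows:
        raise AndonStop("station produced no rows")

def run_pipeline(stations, log):
    """Run stations in order, record timing and row counts, stop on the first failed check."""
    data = None
    for name, step, check in stations:
        start = time.perf_counter()
        data = step(data)
        check(data)  # pulling the cord: an AndonStop here aborts the run
        log.append({"station": name, "rows": len(data),
                    "seconds": time.perf_counter() - start})
    return data

log = []
stations = [("access", access, check_not_empty),
            ("transform", transform, check_not_empty)]
result = run_pipeline(stations, log)
print(log)
```

The `log` list is the "source of data you can analyze": per-station row counts and timings accumulate run after run, and a raised `AndonStop` keeps bad output from reaching the next station.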

00:20:41

And another way to think about this is that they need to take pieces of that factory, where each of the stations in the factory may be some code running in a VM, or maybe a tool running in the engine for that tool, like an ETL tool. They need to pick it up, move it into development, and create a dev environment, create a sandbox. And for every software engineer, when you join a new group, you have some scripts and you create your sandbox. And if you're a good person, you can do it that first morning. Um, but the creating of sandboxes in data science and engineering is surprisingly complicated, because you need to have test data, uh, different types of test data: test data that's small, test data that's big, test data that's clean and doesn't have identifiable information.

00:21:26

And you want to have servers and software that are sort of like what your production is. And so creating these kinds of development environments is hard for people. It's surprisingly hard. And they're actually surprisingly out of date. Some people have fixed dev environments that were last updated six months ago. So they're not like production at all. And I think everyone here can see the problem with that. And so you have these sort of diverse tools and diverse people and diverse customers. And they have a process to deploy changes, to deploy a change to that factory from dev into production. And it's not great. Actually, most companies take a long time to do this. We're not talking about continuous deployment or continuous integration. Most companies are taking months to do this.

00:22:22

And these teams need to do both simultaneously. They need to run a factory that has low errors and lots of pipelines, but they also need to change the pipelines, and they don't want to break production to change those pipelines. So there are people who are just like you. They took a different door; they have a different set of technologies. But my thesis is that the ideas that are in the DevOps movement, the Lean movement, the ideas that came from Deming, from the book Flow, all apply. And so this group is really, I think, suffering. And so I'd like to talk to you about that suffering, 'cause it's a suffering that I experienced. And so when I talked to Gene, I talked to him about the sort of hero culture and the fear culture, the insanely high error rates, the complete lack of automated testing. So, for instance, I was on a phone call this morning with a consulting firm that works with a big European bank. What was I talking about? You should do automated testing as your data flows through your system. Don't rely on your customers to find problems in the data. This is 2019, it's a multi-billion dollar bank, and I'm advising them to do automated testing in production.
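Automated testing of data as it flows through a system, as opposed to waiting for customers to find problems, can start very small. The sketch below is only illustrative: the function `check_batch`, the field names, and the 50% volume threshold are assumptions made up for this example, and real thresholds would come from the history of each feed.

```python
def check_batch(rows, previous_row_count):
    """Return a list of human-readable failures; an empty list means the batch passed."""
    failures = []
    # Volume check: a sudden drop often means an upstream feed was truncated.
    if len(rows) < 0.5 * previous_row_count:
        failures.append(
            f"row count {len(rows)} is under half of previous {previous_row_count}")
    for i, row in enumerate(rows):
        # Completeness check: every row needs an identifier.
        if row.get("customer_id") is None:
            failures.append(f"row {i}: missing customer_id")
        # Validity check: amounts must be present and non-negative.
        amount = row.get("amount")
        if amount is None or amount < 0:
            failures.append(f"row {i}: invalid amount {amount!r}")
    return failures

good = [{"customer_id": 1, "amount": 10.0}, {"customer_id": 2, "amount": 0.0}]
bad = [{"customer_id": None, "amount": -5.0}]

print(check_batch(good, previous_row_count=2))
for failure in check_batch(bad, previous_row_count=4):
    print(failure)
```

Checks like these run on every production batch, not just in development, which is the difference from ordinary unit tests: the code has not changed, but the data has.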

00:23:37

And so the groups have this sort of culture of heroism or fear. They have technology review boards. And there's a whole chapter in, uh, the Unicorn book, The Unicorn Project, that talks about it. And I'm so happy that Gene actually put it in. I think it's great. Um, but to be particularly honest, I want a whole book devoted to the trials in data and analytics, because I've been living it, and it's a big area and has lots of problems in it. But I'm so thankful to Gene for putting it in. So, uh, we did a survey with an independent consulting organization called Eckerson, which focuses on data and analytics, and asked a couple of questions. How many errors do you have per month? How often are you late, the data's wrong, you miss your SLA? And 80% of the companies have an unacceptable error rate.

00:24:26

And when I actually show this to people in the field, they think it's all bullshit. They think most people have no idea what their error rates are, 'cause they're not tracking them, because they're living in avoidance of error, because they don't want to talk about it. 'Cause most of the data that's flowing through systems now isn't tested. It comes in. Maybe they find out it works. They have reports that don't show up. The customers are testing it and telling them if it's right. And that's the same. Like, in 1999 I worked on one of the first sort of social websites. And I still remember having a million people on this website, and we were, like, changing code on the production system. And it was kind of cool that it was this living, breathing thing. And we were changing code because things were breaking left and right. But that's not a sustainable way to work, and certainly having errors all the time isn't. In another part of the survey, we asked a couple of questions. How long does it take you to deploy changes from dev to production? And it takes months. So 10 lines of SQL can take months to move from a dev system into production. And these are really good people, like people at major insurance companies, people like you and I. I like them, and it just pains me to see how long it takes them to do this.

00:25:43

And they're actually really slow creating development environments. So let me tell you my story. So about 2005, you know, like I said, I'd been a CTO, I'd run development teams. And so I thought, ah, this data and analytics stuff, no problem. And so I joined a company that did analytics for the healthcare industry full-time, and I was the COO. And I was kind of the guy who made the trains run on time. So I had data scientists and data engineers, people who did data visualization. And as the company grew, we ended up having thousands and thousands of people who used our analytics. And I worked for a guy who knew a lot about healthcare. He went to Harvard Medical School, but he wasn't a technical guy, and he would go off and talk to senior leaders in these healthcare companies.

00:26:26

And he'd come back with a great idea, and I'd go in a room with a data scientist and a data engineer and somebody who did data visualization, and we'd whiteboard it up. And I'd come back to him; his name was David. I'd say, "David, this is going to take us two weeks to do." And he'd look at me kind of over the top of his glasses, like I had killed a patient on the table, saying, "Two weeks, Chris? I thought that should take two hours." And I'd go back to my office. And you would think people in healthcare companies are nice, but, like, when you have the CEO get wrong data, or a whole bunch of salespeople have their incentive comp reports wrong, or even if you took a line chart and you moved it to a different place in a dashboard, people would call me up and yell at me.

00:27:07

And I just don't like being yelled at. Maybe you guys do, but, like, I'm an introvert. I don't like having extroverted, socially powerful people call me up and just ream me out. And then we had hired all these smart people, right? They had master's degrees and PhDs. They wanted to do data science and visualization. And there's been this blooming of tools, right? Open source tools. Great. And they wanted to try new stuff. They wanted to innovate. And so how can I do this? How could I have a life where I'd be able to go fast and not break things and let people try stuff out easily? It's actually fairly hard to do.

00:27:49

And so that's my story. So how can you help? Well, you know, my company and my co-founders have been trying to get this idea of DataOps going, and actually we didn't call it DataOps at first. We called it agile analytic operations. We called it analytic ops. We called it DevOps for data science, data engineering, and data visualization. We tried all these different names, and we settled on DataOps because it's short. And I know some of you, like, don't like the idea of DataOps because the "ops" term is getting overused. And I sympathize with that. Um, but, you know, in 2017 we actually wrote a manifesto. We actually stole words from the Agile Manifesto, from a DevOps manifesto, and a bunch of Lean ideas, and put it together into 18 points. And surprisingly enough, 6,000 people have signed it.

00:28:44

The idea of DataOps has gotten on a hype cycle, the Gartner hype cycle. And we're kind of seeing increased traffic, increased search on it. And so what is DataOps? Well, DataOps is kind of the stuff that you know, applied to the world of data science and engineering. And so Gene has got a definition of DevOps, and so I sort of, in my own way, mutated it. So DataOps is a set of technical practices and cultural norms and architecture that enable rapid experimentation and innovation for the fastest delivery of new insight to your customer. Because the coin of the realm in data and analytics is insight. What you want to deliver is insight. And it's kind of a random walk to get there. And in a lot of ways, people are doing a little bit more experimentation, a little bit more spike solutions, in data science than in software. But they're still experimenting a lot.

00:29:40

And they want to do that with low errors, because people who have data that's wrong don't trust it. And there's this collaboration part across teams and locations and environments, and clear measuring and monitoring. And so when you Google DataOps, we created this definition that acknowledges its intellectual heritage from Agile and DevOps and Lean. But when you look at it, it's the same stuff. If you guys know about DevOps, if you've come to this conference and heard the same ideas, take those same ideas and put them in a different frame of reference, put them in the data science and engineering world. And it's the same thing. And so teams now have this mindset. There are a lot of organizations that, because they've gotten yelled at by the head of sales, end up in a change-fear mentality. They build a wall of process around it.

00:30:33

And a lot of the operations, the deploy from dev to prod, checking for errors, alerting, is all manual. And there's actually a lot of hope. I mean, how can some of the biggest banks in the world bring data in from a lot of sources and not do automated testing to see if it's right? They sort of hope it works, or they take code from dev to production and kind of hope that there's not a regression. That's crazy. And as a result, there are people who are heroes: I want to do things, I'm going to work nights and weekends. And I was at a conference, and there was a guy who was talking about his life and fixing a bug while sitting in the bathroom during his kid's birthday. And so I only have a few minutes left, and there are a couple of differences.

00:31:22

And so between DevOps and DataOps, they're pretty much the same. But what's my ask here? So my ask is that you join the DataOps movement. If anything, the ideas in Lean and Agile and DevOps that applied to manufacturing, that applied to software development, need to be applied to the world of data science and engineering. It's a world that's broken. It's a world that needs your help, and you have a unique perspective and a set of ideas. And so when you go back, could you ask people in this role a bunch of impertinent questions? Are you using source control for your work? Do you have automated tests? Do you have regression, functional, unit tests? How long does it take to deploy from dev to production? Is it automated? How up-to-date is your development environment? How often are your business users finding errors? So ask these questions. You will be horrified at the answers. Just horrified. And so if you want to learn more, um, Gene was kind enough to take chapter 16 of the book and allow us to syndicate it. We wrote a book on the ideas of DataOps. You can get it for free. You can sign the manifesto or ask me for the slides. So thank you very much.