Las Vegas 2023

LLMOps: How we Develop, Operate and Secure LLMs in the Enterprise

With the advent of new technology comes new ways of working. That is exactly what we are experiencing as we enter the era of AI: a new deployment pipeline for a new technological paradigm.

Many aspects of LLMs make existing DevOps workflows difficult. Figuring out how to continuously deliver, monitor, secure, and go on-call for LLM-powered applications is the next frontier of DevSecOps. In this talk we will learn from recent product launches involving AI at Cisco and walk through an end-to-end model for LLMOps in the large enterprise.


John Rauser

Director of Engineering, Cisco





Good to see some friendly faces in the audience, some of the new friends I've made today, or this week, at the DevOps Enterprise Summit. Really great to be here. My name's John Rauser. I'm a director of engineering at Cisco Systems. I build security products, to be specific a zero trust access product. If you're interested in that, we can talk about it another time. But today I'm here to talk about AI in the enterprise. We're doing a lot of stuff with AI these days. Everybody's talking about AI; they wanna know a little bit more about how AI is gonna work in the enterprise. That's what we're gonna talk about. We're gonna go on an adventure today, talking about what it means to run LLMs in production. And it truly is an adventure, folks. It truly is, because what is an adventure?


It's excitement, it's the unknown. It's venturing into an area of the world where we don't have a map yet. And for me, and I know for a lot of people here, that's very exciting. One of my other passions is organizational behavior, how teams work together, and that's another area where we don't quite have a map yet. So I've been asking myself: why am I so interested in this area? I think that's it. It's the frontier, and frontiers are exciting. We're discovering things, new papers are getting written every day. We're changing our world every day, changing our understanding of how we can use these things, and that's just amazing. So what I'm gonna do is tell a little story. We're gonna build a castle together; the castle is the product. We're gonna talk about what it means to build the castle.


We're gonna talk about who lives in the castle; I call that the council. And then we're gonna talk about how to defend the castle, the moat around the castle. Those are the three main things. But I just wanna reinforce why we're all here. Why are we all here listening to me talk about this topic? We're at the peak of inflated expectations, folks. That's right, that's where we are. And that's fine, because the impacts are real. Real businesses are getting impacted by this change already, and the uptake is huge. I put this graphic in here, and I'm not sure we're allowed to use it anymore, because the Threads app came out and within about two weeks hit a hundred million users. So I don't know if we're allowed to use this.


I don't know if it's a good comparison anymore, but I'll still put it in there as a data point that this is real. But what's gonna make it real for you is figuring out how to run this stuff in production. This is a Y Combinator graph of the latest round, all the different startups working in this space. What are they building? They're building tools. They're building ops tools. And what are leaders saying? They're saying ops is the biggest problem they've got. Ops is the key to our success here. We can come up with all these ideas about what we're gonna do with AI, but how are we gonna make it real? And that, I think, is the frontier for us, for the builders of the world.


I come from Cisco. Cisco actually has, and I learned this recently, so I'm teaching you now, a very deep AI capability, because Cisco has many different businesses and it acquires a lot of businesses. Every one of those businesses has a data science team, and those data science teams are getting together and talking a lot about what we're gonna do, thinking about all the different ways we're gonna do this stuff. There is a difference, though. The AI we've been doing up to now, you could call it traditional, you could call it classic, you could call it predictive, but it's the ML kind of stuff. That's the competency my organization has and many organizations have. And we're transitioning to this generative AI competency. In Cisco, it looks a little something like this: a wide number of businesses across all these different areas.


And like I said, initiatives going on in each one of these areas. So what I set out to do is talk to all these people. I interviewed a couple dozen people and asked: what are you doing? How are you doing it? What are your challenges? And that's what I'd like to share with you today. So let's talk about the castle. How do we build these products? Who lives in the castle? This is the analogy, this is the metaphor. I'm sitting on the throne, and the LLM is doing my work; I need the LLM to do these different tasks. What I've seen across all these businesses is three different use case categories, or task categories, emerging. The first one I call the vizier. The vizier is the helpful one.


They're standing there answering questions, and they're available to give me access to this deep set of skills and knowledge and data that they're trained on. You can interact with the vizier; you can keep asking questions, and if you don't like the answer, you ask again. This is the primary use case that probably everybody here is thinking about in some way: a conversational agent, a chatbot, or some kind of co-pilot. There is a difference between a chatbot and a co-pilot. With a chatbot, you're using natural language, you're interacting with this thing; that's your OpenAI experience. But a co-pilot is more sort of guiding you through a decision tree. Would you like to do this? I see you're doing that; maybe you want to do these three things.


It looks like you're trying to accomplish this, so here are some ways to do that: not necessarily through conversation, but more guiding the person. So I call those co-pilots, versus chatbots. At Cisco, we've already launched our help desk chatbot. Cisco has a huge amount of information and knowledge that customers are trying to get access to. So we developed Sherlock, and there's a little tweet from somebody who was very impressed with Sherlock's ability to answer questions on any kind of material Cisco is working on. Huge value created for customers right there. So this is the first use case and the most obvious one, and if you're not working on this already, I would be surprised, because it's really important. But the next one is a little bit different. I'm just gonna go back one slide.


I want you to just look at one thing. With the vizier, there's an inflow: the user is going in and interacting directly with the LLM. The next pattern I call the judge. The judge is more of an outflow. We're presenting to the user, making available to the user, some skills or knowledge that the LLM has access to. The use cases here boil down to: give me your judgment, I need the reasoning on something. And you can think of different ways this would be used in an enterprise. There's all kinds of case analysis that has to go on. We have to summarize the sales call, summarize the support call, summarize an incident. We can have the LLM do that.


But the user isn't necessarily interacting with the LLM. They're just getting the output of it, and then maybe regenerating it if they don't like it. We're also cautioning the user, in a deep way, in every one of these use cases, that they are interacting with a model, they're interacting with AI. That's part of our commitment to responsible AI: making sure the user is well informed. The last use case, after the vizier and the judge, I call the general. The general is going out and executing orders, taking orders and getting the job done. But they have to get it right. And this is actually the paradoxical area I see in AI: we're dealing with probability-based models, and they might not get it right. A lot of the thinking, a lot of the neat ideas people are having, is: what if the LLM could just do this one task, and we could take the output of that task and use it in the next task, and the next one, and chain these things together?


The LLM can take over a piece of the workflow; we don't have to worry about that anymore. But the problem is that the LLM always has to get it right. So there is some really interesting stuff here. Again, I work in security, so a great example we talk about all the time is content categorization. What if we can just look at the content the user is getting access to and call it what it is: it's a gambling site, so we're gonna block it? In the operation it gets used in, there's a danger there. And that's where, especially as we're bringing these products to where it really matters, we have to be able to observe, monitor, and improve the accuracy of the product to a very high degree. We can't launch these things into production unless they can actually do their job very reliably, right?
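One common way to make a "general"-style automation safe is to gate the action on model confidence and keep a human in the loop otherwise. Here's a minimal sketch of that pattern; `classify` is a made-up stand-in for a real content-categorization model, and the threshold value is illustrative, not a recommendation:

```python
# Sketch: gating an automated action on classifier confidence.
# classify() is a hypothetical stand-in for a real categorization model.

BLOCK_THRESHOLD = 0.95  # only act autonomously when the model is very sure

def classify(page_text):
    """Hypothetical classifier: returns (category, confidence)."""
    if "casino" in page_text or "poker" in page_text:
        return "gambling", 0.97
    return "unknown", 0.40

def decide(page_text):
    category, confidence = classify(page_text)
    if category == "gambling" and confidence >= BLOCK_THRESHOLD:
        return "block"            # confident enough to act without a human
    return "flag_for_review"      # otherwise keep a human in the loop
```

The point of the structure is that the probabilistic step never directly drives the irreversible action; a deterministic policy sits between them.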


So this one I call the general, and I don't think we're quite there yet. The vizier is already launched; the general we're still figuring out. And I do think that's a very strong ops problem. So let's talk about the ops a little bit. What are we operating? We're gonna build the castle. Oh, by the way, I generated all these graphics in Midjourney, and it's pretty good. I was telling my wife the other day, I'm now paying more for my AI tools than we're paying for our entertainment stuff, Netflix, Disney Plus and all that. Midjourney and OpenAI and the rest, it's starting to add up. But they don't always get it right. I don't know if you noticed the judge: there's something going on with his hand, his finger there. I still like it.


So what are the towers, the keeps of the castle? There are three: the model, the data, and the interface. The models are what's kicking off all this fascination in the world, and they are doing incredible things. We don't exactly know why, but as we add more parameters, they start to do more things. So there are really two motions going on. There's a motion to add more parameters and get more capabilities, and there's a motion to see if we can squish more capabilities into models with fewer parameters. Those two motions are going on right now, and we're seeing the effects across the industry. There really is an explosion of models coming out. GPT-3 was almost the turning point; it wasn't quite good enough. GPT-3.5 became good enough.


And then we start getting these open source models, smaller models, models I can run on my computer. One of the questions I had for people when I was interviewing them: is it realistic for us to take a small model, like a 7 billion parameter model, and run it on a computer? Will it actually do things that are effective? And I think the answer is yes. Llama 2 is open source and available for commercial use; it's the only one. Here's the data sheet on it. It comes in these parameter sizes. The 7 billion one will run on, well, I have the M2 MacBook with the integrated chip and 64 gigabytes of memory, and I can run it on here. If you're looking for a reason to buy a new MacBook, you just found one.


It's got limitations. The context size is small, 4,000 tokens; it's not very big. Anthropic has a hundred thousand tokens; you can stick a whole book in it. So 4,000 tokens, you know, may be a chapter. And a few trillion tokens, all the data it was trained on: it's a lot of data. But there's a lot of data to go around. These things are effective, and if you go out and run some experiments with them, you'll see that. In fact, there's a really easy way to do that that I like a lot, this thing called Chatbot Arena. You go there, you ask the same question, and it'll load up two models side by side and give you the responses of those models. And it randomizes the models too, by the way, which is kind of exciting.


You don't know which one is giving you the answer, but it'll answer questions, deep questions that I have, like: what is the meaning of life? Not very well, but worth asking, worth a shot. And you can convince yourself that, hey, maybe these smaller models do have a role in the enterprise. Because make no mistake about it, companies like mine, and maybe companies like yours if you're working at a bank or an insurance company or something like that, you will need to run these models in-house. You know that already. You probably run your own GitHub; we do. You probably already run other tools in-house too. You're gonna be running these things in-house. You're not just gonna ship all your data out. And that's where we have to figure out: can we actually do this? Can we create smaller models that we can run on smaller infrastructure and still be effective?


How do we know if they're effective? We get this concept called evals. An eval is essentially running a test on the model. The really popular ones are MMLU and HumanEval. These are the ones that say: take the LSAT, take the sommelier exam, which I thought was interesting; it's actually a very hard exam. And then you get this collection of folklore tests that we use when a new model comes out: let's see if it can pass the Sally test. Somebody even wrote a cool little tool that runs the Sally test against every single model and tells you the latency and whether it got it right. Does everybody know it? Sally has three brothers, each brother has two sisters. How many sisters does Sally have? How many of us do you think are gonna get that right? Did you get it right? You did. Okay, <laugh> that's good.
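A folklore eval like the Sally test is really just a prompt, a pass/fail check, and a stopwatch. Here's a minimal sketch of such a harness; the `stub_model` is obviously made up so the example runs without a real LLM behind it, and the pass check is a deliberately crude keyword match:

```python
import re
import time

SALLY_PROMPT = ("Sally has three brothers. Each brother has two sisters. "
                "How many sisters does Sally have?")

def passes_sally(answer):
    # Correct answer is one: the brothers' two sisters are Sally plus one other girl.
    return bool(re.search(r"\b(one|1)\b", answer.lower()))

def run_sally(model_fn):
    """model_fn is any callable prompt -> answer. Returns (passed, latency_seconds)."""
    start = time.perf_counter()
    answer = model_fn(SALLY_PROMPT)
    latency = time.perf_counter() - start
    return passes_sally(answer), latency

# Stub model so the harness is self-contained:
def stub_model(prompt):
    return "Sally has one sister."
```

Recording the latency alongside the pass/fail result matters as much as the answer, as we're about to see.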


One of the things I thought was interesting here, and it highlights some of the problems we have with running these models: look at the latency on answering some of these questions. That's a lot of latency. Are you really gonna be able to put that in production? How are you gonna put that in production? Yes, you are, but how? These are some of the things we have to solve. So the future of models, again, it's a frontier. There are so many different areas we're exploring: bigger ones, smaller ones, cheaper ones, more expensive ones, multi-language. It turns out when you train a model on more languages, it gets better at doing everything in all languages. That's interesting. Why is that? We don't know. Multimodal: when they can do language and audio and video, they get better at all of it. So interesting. Why is that? We don't know. Again, it's the frontier. It's not like there's somebody out there in the world who knows <laugh>. It's not like you can go take the MIT course in this stuff and figure it out. Nobody knows yet. That's why it's exciting for me, and probably for you too. Let's talk about the next tower, the keep of data.


I wonder if data is even a good word for this, because one of the things I discovered is that you think we have the data, but we don't, because we've been collecting metadata. People are using models and collecting large amounts of data, but they're not storing everything. They don't store the entire documents, the full text; they throw that stuff away and keep the metadata. And now when we go back and look at our data sets and we want to train our language models on them, we're finding they're not complete. So I wonder if data is the right word. It might be knowledge, or information. But we want the whole thing; we wanna train these models on the whole thing. So there's a bit of a missing piece there today on the data.


How do we get the data into the model? How do we get the model to do what we want? There's a word emerging for this, grounding. The industry seems to be settling on this word: we're gonna ground the model in our own information. There are four ways to do that. The first one is not good, too expensive: we're not gonna train our own model from scratch. But the next three all work together hand in hand. You may have heard of fine-tuning. Fine-tuning is where you take the model and update the parameters using your own dataset. The next one is RAG. You don't change the model; you run a system just off to the side of it, and you try to retrieve just the documents, just the elements of your information set, that are relevant to the query. Then you give those to the model along with the question.


And the final one is the one that really captured the world's imagination for a second: prompt engineering. How can we stuff the prompt with more stuff, right? You can throw a whole book in the prompt. You can put all the examples of the kind of responses you want in there. You can almost train the model right in the prompt itself. When you do these three things together, you start approaching high degrees of accuracy with your responses. That's what people are finding. Another thing people are finding, and you can think about it this way: if you want to give the model a skill, you fine-tune it. If you want to give it information, knowledge, documents, you use RAG. And the interface is the prompt; that's how you're getting the stuff in there.
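The prompt-stuffing idea is just string assembly: instructions first, then few-shot examples, then grounding documents, then the actual question. Here's a minimal sketch; the function name and the section labels are made up for illustration, not any particular framework's API:

```python
def build_grounded_prompt(instructions, examples, context_docs, question):
    """Assemble a prompt that stuffs in instructions, few-shot examples,
    and retrieved documents ahead of the actual question."""
    parts = [instructions]
    for q, a in examples:                       # few-shot examples set the format
        parts.append("Q: %s\nA: %s" % (q, a))
    if context_docs:                            # grounding documents (e.g. from RAG)
        parts.append("Context:\n" + "\n".join(context_docs))
    parts.append("Q: %s\nA:" % question)        # the model completes after "A:"
    return "\n\n".join(parts)
```

Everything the model needs to imitate, and everything it's allowed to know, travels in that one string.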


That's the way I've started to think about it, and I notice people in the industry are starting to think about it that way too. RAG itself sounds kind of mysterious, but it's actually not that complicated. You take the data and turn it into embeddings; embeddings are just vectors. You put them in a database. You take the query, turn it into embeddings, compare it against the embeddings in the database, and retrieve the closest ones. That's the retrieval in retrieval-augmented generation. Then you put it all in the prompt and you get a better response back. That's the essence of it. It's not that complicated, actually. But it is cool, because this frontier is emerging of: what is the algorithm we should use to generate the embeddings? That affects performance, that affects correctness, that affects a number of things.
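The whole embed-store-compare-retrieve-stuff loop fits in a few lines. A minimal sketch, with one big caveat: the "embedding" here is a toy bag-of-words count vector so the example is self-contained; a real system would call a learned embedding model and a vector database instead.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector.
    Real systems use a learned embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    """The R in RAG: rank stored documents against the query embedding."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def rag_prompt(query, docs, k=2):
    """Stuff the top-k retrieved documents into the prompt with the question."""
    context = "\n".join(retrieve(query, docs, k))
    return "Use this context to answer.\n%s\n\nQuestion: %s" % (context, query)
```

Swap `embed` for a real model and the `sorted` call for a vector database query, and the shape of the system is the same.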


What is the database we should use to store these embeddings? That also affects accuracy and performance. And by the way, those things should match: the database and the embedding algorithm have a relationship where they can work well together or not. Experiments are how we're gonna figure that out. The last element, the last tower of the castle: the interface. How do we engineer the prompts we're gonna give the model? This is probably where people are most familiar, so I'm not gonna spend a lot of time on it. What I do think is interesting: you're setting up the query, you're feeding in whatever you're getting from the user, or the instructions you wanna give, and you're getting the response back. But you have to be really careful.


Again, we get back to this responsible AI initiative, the important aspect of this: we have to put guardrails on the system. There are all kinds of ways we can check and make sure that users, if they're using the vizier model, aren't doing malicious things with the model. Also, if we're using the judge model, that it's not producing really random output that doesn't fit, or something we just don't wanna show users. So we put guardrails on the system, both in front of the prompt and coming out of it, and that's how we make sure we're protecting ourselves. Again, this will have an impact on performance, and it will have an impact on accuracy, but it's absolutely necessary. Something, again, we're gonna have to tune and experiment with. Put it all together and you get a really complicated sequence diagram that I'm not gonna talk you through, but I just wanted to show you: the data, the interface, the model.
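Structurally, guardrails are just checks wrapped around the model call: one on the way in, one on the way out. Here's a deliberately tiny sketch; the two regex rules are illustrative placeholders (real guardrail products use much richer classifiers), and `model_fn` is any prompt-to-response callable:

```python
import re

# Input guardrail: a very partial check for one prompt-injection phrasing.
INJECTION = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)
# Output guardrail: an example PII pattern (US-SSN-shaped strings).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def guarded_call(prompt, model_fn):
    """Wrap any prompt -> response callable with checks on both sides."""
    if INJECTION.search(prompt):
        return "Blocked by input guardrail."
    response = model_fn(prompt)
    if SSN.search(response):
        return "Withheld by output guardrail."
    return response
```

Every check you add to this wrapper costs latency, which is exactly the performance trade-off being described.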


But one thing I wanna point out is it's not just a model, it's many models. That is one of the patterns that's emerging: we're probably gonna train smaller models for specific tasks, to take on parts of the process, and use larger models, which are maybe more expensive, difficult to run, difficult to train, together with them in an ensemble. You hear this word, ensemble. That's all it is: models working together. And then they're running on some kind of infrastructure. There's vLLM, which is emerging; it's a way to cache queries and improve performance for simultaneous queries. So the infra we're running it on is critical. And then the hardware itself: are you running it in the cloud? Are you buying your own DGXs? I heard they break down a lot. Cisco is trying to figure something out with the UCS product.
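One simple version of the ensemble idea is a router that sends cheap requests to a small model and everything else to a big one. A sketch under stated assumptions: the word-count heuristic is purely illustrative (real routers are usually trained classifiers), and the two model callables are stand-ins:

```python
def route(prompt, small_model, large_model, max_words=12):
    """Toy ensemble router: short prompts go to the cheap small model,
    everything else to the expensive large one. The heuristic is fake;
    the control flow is the point."""
    if len(prompt.split()) <= max_words:
        return "small", small_model(prompt)
    return "large", large_model(prompt)
```

The economics of the ensemble live entirely in that branch: every request you can answer with the small model is capacity you don't have to buy for the large one.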


Just checking the time here; I've got six minutes left. So, there's some really important stuff I wanna talk about, because we're just getting into the good stuff. The ops, the actual ops. We touched on this a little bit. There are gonna be a lot of elements to figuring the ops out. Accuracy is the first one I really wanna focus on, and then the last one, security, I want to tease a little bit. So accuracy. We have this accuracy problem, especially when we get into the general use cases. We can't prevent hallucinations. We don't know why these things work the way they do. We have this problem of explainability that you don't have in most computer systems. Why did it arrive at this decision? I don't know. And how can I change the system so it arrives at the correct one?


Well, there are so many things that affect accuracy. I just tried to list a few off the top of my head and color-code them onto that map we saw earlier. All these things are gonna affect the accuracy of the model, and you don't know how they're interacting; it's a network. So, data hygiene: did we clean up the data enough? Did we remove the duplicates? Dupes matter; they bog the engine down, they make it give non-unique responses. There are all these elements that come together that, again, we have to experiment with. We have to learn. This is the frontier. How do we make these things accurate? How do we enable the general use case that I don't think is enabled today? And then the second one is security.
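The dedup part of data hygiene is the most mechanical piece, and a first pass can be as simple as hashing normalized documents. A minimal sketch; real pipelines also do near-duplicate detection (e.g. MinHash), which this deliberately skips:

```python
import hashlib
import re

def normalize(doc):
    """Collapse whitespace and case so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", doc.strip().lower())

def dedupe(docs):
    """Drop exact duplicates (after normalization) from a corpus.
    Dupes bog training down and bias the model toward repeated responses."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```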


This is the Nvidia red team model. They came out with their own security model, and there are a few floating around now. There's also the OWASP Top 10 for LLMs, which is pretty cool. I find that looking at security models, threat models, that kind of thing, is a great way to understand a system. It's another layer of insight into how systems are designed, a slightly different take than maybe the one we're used to. So you can go look at that and think about some of the security issues, the SecOps problems, that we're gonna have running these things. Okay, the last section. We've got four minutes left; we're gonna move to the last section. I call it the moat. Oh, I forgot about this one: a whole stack of tools is emerging.


This is from the Sequoia article just published recently; you gotta check that out. They go through the whole stack, the ops stack, the infra stack, and the consumer stack as well. So definitely go read that article. So let's talk about the moat. What is the moat? I'll just give you my own opinion on this: what is the moat right now for LLMs? I don't think it's the data or the knowledge or the info. I don't think it's that, because a lot of it's just baked into the LLMs, and a lot of it's even public. You know, Cisco's public data set on how to run its products: anybody has access to it. All we're doing is stuffing it into an LLM through some RAG and making it available to you. So it's not necessarily the data that's the moat here in enterprise AI. And it's not the users, because I don't think there's a network effect or a critical mass.


I don't really know how much those 100 million users on OpenAI really matter to the stickiness of that product. But I do think it's the platform. And I'm talking in the context of a large enterprise that's trying to enable its teams to build LLM-powered products. The platform is gonna create a moat in the large enterprise: it enables teams to deliver features faster, faster than the competition, and to win quicker in the space. If you've been paying attention to some of the stuff we've been hearing the last few days, that's it, right? I think it was Steven Spear who said every business wakes up in the morning with a set of problems; the ones that can solve those problems are gonna win. So how do we enable our org to solve those problems faster?


So here are some of the things we're doing at Cisco to enable, like I said, a diverse set of teams across many different products to build AI features faster, so we can release quicker and get them into the hands of customers. Things like a common design system. Did you think about that, by the way? When you're using a vizier model, or even a judge model, there's this new design paradigm where we're gonna get feedback on whether we're wrong or not. That's not normally something products have to do; products are normally right, they should be right, right? So here we have to come up with ways and means to get feedback from the user. Midjourney does a cool thing where you can regenerate the graphics in different ways and kind of get what you want.


So a common design system is gonna be critical, especially in an enterprise. You want people to have a similar experience when they're using your products. They'll be slipping between products, and if the AI works one way here and another way there, that's not a good experience. And then there's a whole bunch of other things here too; let's talk about a couple of them. There's the idea of a model zoo. Hugging Face is the model zoo; it's the GitHub of where you go get models. But the large enterprise is gonna have to have its own model zoo, its own local Hugging Face, serving these things up. And even a place to run them: a collective infrastructure where anybody can go run their model and do it cheaply, do it quickly, and not have to worry about getting the GPUs and allocating the AWS hardware, something like that.


It's not easy right now, by the way, if you wanna grab GPUs in AWS. It's not like ECS; you can't just push a button. You gotta work with them to get it. My friend Patrick in the UCS group created this, built this, put it into one of our data centers, and we can run our models on it, serve them up through an API, and then use LangChain to work with them. Kind of cool. Maybe we should sell this, I dunno <laugh>. An internal corpus: we need these big piles of data, and how can we make them available to our teams with good hygiene? That's the critical thing here: not having a lot of dupes and things like that.


And then an enterprise AI API. How can we give our teams access to experiment with the Azure API, experiment with OpenAI, but do it safely and securely? We actually put this whole infrastructure in front of OpenAI, and our employees use it instead of going to OpenAI directly. They go to an internal site and get the same experience, but it's all filtered, logged, recorded; it's compliant infrastructure. How do we learn more about all this? So, I'm gonna wrap things up now. The best way is to go ask <laugh>, ask AI. And that's exactly what OpenAI does. This just blew my mind: they don't even know how these things work, so they're using their own large language models to figure out how their large language models work. How much do you love that? This is a great place to be right now <laugh>. That's their paper, by the way; you can go check it out, read the paper. So here's the ask. What help do I need? What help do we need? I think we need a community that's thinking about these things together. I think we need to come together and ideate and figure out what the edges of this frontier are. Draw the map, start drawing the map. And I think the people in this room, the people at this conference, are the right people to start that community. So I'd love to do that with you. Reach out, let's get that going. Patrick's into that too; we've been talking a lot about it. So.