As an infrastructure engineer, you know how frustrating it can be when data infrastructure gets in the way of data science projects. Business leaders expect results at the speed of business, but slow or inadequate data infrastructure can seriously hinder the progress of data science teams. It's a constant battle to balance the need for speed with the need for accuracy and quality.
Today I learn from Ville Tuulos, former leader of the machine learning infrastructure team at Netflix and the author of "Effective Data Science Infrastructure." Ville understands the challenges that data science teams face and has dedicated his career to helping them overcome these obstacles.
He shares his insights on how to deploy a data science infrastructure that is capable of leveraging data science and machine learning to solve large numbers of business problems quickly and accurately. He offers practical advice on how to overcome the common roadblocks that infrastructure engineers face, and he discusses the power of Metaflow, an open-source project that can help increase data scientists' productivity.
Special promo code from Manning
Use our exclusive promo code today and get access to top-quality tech content from Manning Publications: poddiscovdata23
Loris Marini: So if you are into DIY, you know that the right tool is half the work, and data science teams are no different. The trouble is that while business leaders expect insights at the speed of business, the data infrastructure really often gets in the way, effectively slowing down the data science project or workflow and in some cases completely blocking it. So this episode is for people who want to learn how to deploy a data science infrastructure that is capable of leveraging data science and machine learning to solve large numbers of business problems quickly and accurately. Today I'm speaking with Ville Tuulos. Ville is the author of a book that came out last summer, Effective Data Science Infrastructure, published by Manning. The book was motivated by his work at Netflix, where he led the machine learning infrastructure team. During his time at Netflix, he and his team started an open source project called Metaflow, which you might have heard of, which helps make data scientists and engineers more productive. The project now powers most data science projects at Netflix and hundreds of other companies. Today, Ville is a co-founder and CEO of Outerbounds, where he continues helping companies be more successful with machine learning, leveraging Metaflow. So Ville, absolutely stoked to have you on the podcast.
Ville Tuulos: Thanks for having me.
Loris Marini: Thanks for making the time.
Righty. So let's dive into the book. But before we do that, give me a bit of a story: what prompted you to sit down and write a book, which is a massive undertaking?
Ville Tuulos: Yeah, no, that's true. And of course I always feel afterwards that I shouldn't have made the promise. But look, the thing is that I had given many presentations at conferences about this topic, and you know how it goes: there's a 20-minute, 30-minute slot, and I always felt that we only scratched the surface.
There are so many things. These systems are so complex, and not only the systems, but also the organizational aspects. So I thought it would be great to have a chance to finally dive into the details. And I'm maybe a bit old school; I grew up reading books.
I think books are quite amazing. So I thought that actually there aren't too many books really covering not only the technical details, but also the mental models of how to even think about building these systems. So that was really my motivation.
Loris Marini: On behalf of the whole data community, thank you for taking the time to write it.
And speaking of the book, just a reminder for our listeners that obviously this conversation cannot possibly be a replacement for reading the book, because, as you just mentioned, 60 minutes is just scratching the surface. I definitely recommend checking out the book. We do actually have a giveaway; there should be an active link by the time this one comes out: discoverdata.com/mlsystems. It's one word, no hyphens, just mlsystems. That will give you the chance of winning a book.
So put your details there if you wanna be a part of it. We also have promo codes with a 35% discount, and you'll find all those details down in the description, in the show notes as usual. With that out of the way, let's dive into the problem. I was trying to brainstorm: what are the parties, the stakeholders that need to be aligned when we talk about infrastructure? Because it's always an investment, either time or money, right? Resources that we need to dedicate to this. So really, top of my head, business leaders should be part of the equation. Then infrastructure engineers, those who support the data scientists. And then the data scientists, who are kind of customer facing, if you want, internally, because they face the business leaders and try to serve them. Do you think they all get it, or are there asymmetries in the perception of why infrastructure matters?
Ville Tuulos: I think everybody agrees that it matters, but as usual in human life, they all see it from a different angle. And of course the fact is that this is all very new to everybody, although these days there's a lot of AI and ML in the news almost every other day.
It is a fact that we are in the very early days of this whole ML revolution, or whatever you want to call it. In that sense it's very understandable that when I talk to different companies, people are just trying to figure it out. I think everybody agrees that something needs to be done.
Data scientists feel the pain of not having the right tools. They feel immense business pressure to deliver models that actually work. It's complex stuff, a fast-moving field. Business leaders of course know that at the end of the day, ML is just a cost center unless it really somehow benefits the company at the bottom line.
And in the midst of all this, infrastructure engineers at the end of the day are the ones bearing the responsibility. They're the ones who wake up at night if something breaks. And then you have all these crazy new systems that behave very differently than systems of the past.
And they need to keep everything running. Everybody knows that something needs to be done, and everybody agrees that yes, something needs to change as well, but we haven't yet quite converged on what it needs to be.
Loris Marini: There's another piece, from the engineering perspective, that is incredibly frustrating. Perhaps a name that comes to mind is Chad Sanderson on LinkedIn, who has been leading the way in data products. We did an episode together, which I definitely recommend listeners go back and check out, because we talked about the semantic layer. But fundamentally, if you strip out all the buzzwords, what we talked about is the frustration of balancing a short-term win with a long-term goal.
As engineers, we wanna build stuff that can scale, that can survive the test of time. The business doesn't have that type of patience.
They have problems right now and need solutions right now. And that's the pressure. So if we visualize it as a water pipe, right, there is an inflow and an outflow, and there are different pressures in the system. So the only way for this to work is if we come up with some tapers, some sort of gentle reduction of pressure from high to low.
And those tapers are probably made of literacy and education, but also storytelling, to get people to understand: hey, this is why I'm taking the extra sprint or two weeks to get back to you, because if I don't do it... Do you think that works? At Netflix, obviously, you've experienced that firsthand.
How was it to manage that kind of pressure?
Ville Tuulos: Yeah, no, that's true. And I think a big problem here is that, again, as I said, we are in the early days. Some of your listeners might even remember what it meant to set up an e-commerce store in 1999, or maybe 2002, and so forth.
It's kind of the same feeling: look, there's this internet thing, this web thing coming, this is gonna be big, how on earth are we going to do it? And people duct-taped together all kinds of crazy solutions. And there was the business pressure that we could be selling so much more.
And at the same time, engineers knew that, oh my gosh, this thing can collapse any day. We don't have the load balancers, and this is just running on a bunch of Perl scripts. So it is the same kind of feeling today. I think it is a positive pressure in a sense. What's really great about the field is that everybody is so excited about this. And yes, it's a fact that maybe the excitement needs to be tempered, and what I always want to remind companies is: look, you are not gonna be missing out on anything. There's always the fear of missing out: if we don't do it today, our competitors are gonna do it. I think one of the biggest mistakes, to your point, that many companies make is that they try to rush to some end goal. I talked to one company who said, okay, can we build a similar kind of recommendation system as Netflix?
And I was like, look, do you understand that it has taken more than 10 years for Netflix to get to this point? It's not only a technical thing, it's also organizational muscle, and that takes time to build. But the best time to start building it is today. Kind of yesterday, but today at least.
But yeah, so it's a combination of both fast and slow. It's better to start today, don't wait, but at the same time, yes, understand that it's gonna take a while to get there, and it's a never-ending process anyways.
Loris Marini: I love it. It's an infinite loop for sure. And it's the piece on organizational muscle that is often overlooked: the type of roles and the mindsets and the behaviors that need to be in place to be able to leverage data to its full potential. That's still very much a work in progress.
But going back to the business stakeholders, because without their buy-in we can't really do anything meaningful. We can try and experiment on the weekend, but if we want the space to do things well, in a way that can support the business (cuz the focus is always on delivering business value), we need to get them on board.
So, you led ML teams at Netflix. Did you experience that kind of pain, or was the organization structured in a way that you had the space to operate and you didn't have to worry about engaging your business stakeholders?
Ville Tuulos: Yeah. I think really the key realization is that there are tons of different areas where ML can be utilized in different parts of the business. Even from the business point of view, the leadership point of view, oftentimes these flagship projects, whatever, like recommendation systems, get an oversized focus. And I think where we see the real value coming from is this long tail of all kinds of smaller things in the business. It might be small internal things that just enhance productivity here and there. A good example: there was one project that just helped to optimize the calendars of people involved in the making of movies at Netflix. It's not flashy, nobody hears about it, but just the fact that we can be smart about that kind of thing is immensely valuable for the business. And then understanding that, look, there are thousands of opportunities, but we don't know in advance which of those are going to work.
So the only way you can actually do that is to start having this experimentation culture. You understand that in order to find where the business value comes from, we have to try out a number of different things, and we have to accept the fact that not all of these projects are going to succeed.
But of course we want to minimize the cost of experimentation. That's the key thing, and this is the failure pattern at many organizations: they identify one big flagship project, they spend 10 million doing it, it takes way longer than anybody expected, the results are less than anybody expected, and everybody is disappointed in the end.
Rather, you could find those smaller targets, get them done quickly. Some of them fail, some don't, and you double down on the ones that work. That's how you start building the organizational muscle as well, and that's where the experimentation culture comes from.
And to your question, Netflix was actually pretty good, and they are really good at this: let's just try a bunch of different things and see what works, and do it in a systematic manner.
Loris Marini: So, on gaining that agility, cause I think it's critical: we spoke about "build a little, test a little, learn a lot" with Christopher Burke recently, the whole topic of DataOps and how to leverage automation. There's always gonna be tension when we start initiatives, especially from an infrastructure perspective.
Cause not everybody works in a data colossus of a company, right? Many data engineers and data scientists work in smaller organizations that don't necessarily have that maturity. They do face the other challenge of communicating: hey, it pays off to try a bunch of things. There is impatience on the other side of the business. Do you think there is a way to crack that nut and spin up something quickly,
show that there's some value, and then double down and plan it long term?
Ville Tuulos: Yeah, I totally think so, and I think it requires maybe a bit of a mindset shift, both on the business leadership side as well as on the data science side. One thing that data scientists and machine learning engineers oftentimes forget these days: look, you don't have to use the latest and greatest transformer models for some simple business problems.
There's a whole portfolio of different models available. Some problems are best solved with linear regression, and there's a bit of a problem that if you are a working data scientist, proud of your job and education, you'd be embarrassed to say, look, I'm using linear regression at work. And that's a bit of a silly cultural thing that we have; there's nothing embarrassing in using simple things, proving the value, and then moving on over time, if that seems sensible. And that's really the key to how you can get going, because usually the reason these things take so long is that there's a tendency of overcomplicating everything. That applies to the data side as well.
Data is complicated. We shouldn't overcomplicate it unnecessarily. The same thing on the modeling side, the same thing on the infrastructure side.
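Ville's point about simple baselines can be made concrete. As a hedged illustration (not an example from the book), a one-variable least-squares fit needs only a few lines of standard-library Python, no ML framework at all; the data here is made up:

```python
# Ordinary least squares for a single feature, using only the standard
# library: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
from statistics import mean

def fit_line(xs, ys):
    """Fit y ~ slope * x + intercept by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var = sum((x - x_bar) ** 2 for x in xs)
    slope = cov / var
    return slope, y_bar - slope * x_bar

# Toy example: perfectly linear data, y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # -> 2.0 1.0
```

A baseline like this is trivial to explain to stakeholders and gives you a number that any fancier model has to beat before it earns its extra complexity.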
Loris Marini: Yeah. And one thing that I found personally useful is to build a punch list, like a backlog of issues: not technical issues, but business issues. As you talk to different stakeholders, people will complain over coffee about that SAP integration (particularly in large companies, it's really a pain), or about that data set: if only I could push a button and magically get the data to get my job done, it would be amazing. And obviously it's a matter of sifting through these requests, or unspoken requests. Some are just a matter of them not enjoying that piece of work, without a massive impact on the business, but others are critical. And so if you can find the critical ones and build some stories, then you can have almost a deck, a slide with a pitch ready to go whenever it is funding time, so you can motivate it. But that agility always remains very much true. I'm thinking about the opening of your book: in the motivation piece, you speak about the need for infrastructure engineers to understand that data science needs are different from software engineering needs.
So data and software are not quite the same thing. They both involve code, but they're not the same. And I couldn't agree more, but I know it intuitively and I know it experientially, cause I've been on both sides. Can you help our audience understand a little bit why there's such a difference between these two worlds?
Ville Tuulos: No, I think that is absolutely the key question. And I think all of us should spend more time first asking the questions: is that even true, and why would it be different? We actually published an article some time ago, called "Data Makes It Different". Really the crucial point is that I believe in this new ML world you will need code, and obviously in the old software engineering world it was all about code. So the code is here to stay; that's not going anywhere. But the thing that we are adding to the mix is data.
And data by itself, if you just look at the zeros and ones, is not the problem. The problem is that it's an interface to the real world, and it exposes all these systems to a huge amount of entropy that we didn't have to deal with before, and a huge amount of change as well.
One way of thinking about it: if you are a traditional software engineer, you are constructing an artificial world of abstractions. You have your classes and you have your modules, and you don't have to worry about how the real world actually works.
You are totally isolated from that. But now with data, whatever the data might be, maybe clicks on banner ads, you are exposed to all the messiness of the real world. And that has huge implications for everything: how we develop, how we deploy, and how we monitor the software.
And one of the key questions there is that it becomes much, much harder to know whether a piece of software works correctly. That is one thing when people, for instance, think about CI/CD and whether MLOps is different from DevOps. That's the big question: to realize that the way we even assess the quality of software is so different when we have data in the mix, and that has many implications.
Loris Marini: Yeah, it's a very interesting system-level consideration. Are you in control or are you not? How many variables, how many balls in the air are you supposed to keep up? And hence the role of automation, and back to the episode with Christopher Burke.
But there's also another piece, which is really the machines, the piping, the infrastructure. That's what we mean, I believe, with infrastructure: systems that are made of software, made of computing nodes, machines that can compute, but also made of best practices and behaviors. At least that's my view of it, like any IT system: technology, people, and processes. So how we interact with it, but also the steps we follow. When you created Metaflow, did you have this picture clear in mind? Did you have a specific objective you were trying to optimize for, or did it evolve naturally?
And if so, what was the initial pain point that led you to say, okay, no tools out there are fit for purpose for what we wanna do, we need to create something?
Ville Tuulos: Yeah. I think first it's useful to always understand the context in which any tool was born. Really the motivating factor for Metaflow was the fact that we had to deal with a very diverse set of use cases. You see many other ML tools and systems that are built around a single use case: say, all we have to do is computer vision, or self-driving cars, or maybe all we have to do is recommendation systems, and then you purpose-build a system for that use case. And that's of course exactly the right way of doing it. If you need to build a moon rocket, you build a moon rocket and that's it.
Now, if you have to build more of a universal vehicle for going from place A to B, you approach the problem differently. And the way we thought about it is, we asked ourselves the question: what are some fundamental elements that are true for every data science project?
Not specifically for computer vision or recommendations, but everything. And that's why we started thinking about the stack like that. Okay, stating the obvious: we always need data. We have to think about this data access question: where does the data come from?
So data has to be a first-class citizen. And then the second thing we started thinking about is that it's not only data at rest; we have to do something with the data. So there's the compute, and being able to access compute easily is absolutely critical. And the next layer on top of that was the question that it's never only a single piece of compute, but multiple pieces of compute.
It's these workflows, and how do you orchestrate the workflows? That needs to be part of the stack. And then there was the question that we never get these systems right on the first go. It's always an iterative process, which is where versioning and tracking come in.
And when you think about how to put together data, compute, orchestration, and versioning, the fact is that we have good tools for each of these layers separately, but there wasn't anything that really nicely puts all these concerns together in one package.
And that's also, lastly, where our human-centric focus comes in. Although technically you could navigate this stack in all possible ways, humans need really easy-to-use, human-friendly user interfaces, APIs, SDKs, so you don't want to be writing Dockerfiles here, using this CI/CD system there, using Git here and Jenkins there.
So that was really the origin story of how we started thinking about Metaflow.
Loris Marini: Okay, I love it. So not just the integration of the different pieces that you need to operate, but also the user experience, the human aspect, how we interact. Is that what you mean? Cause I read in your book the phrase "human-centric infrastructure". Is that the idea?
Ville Tuulos: That's the idea. And there are interesting, different schools of thought on what's going to happen in the future. One school of thought is that, let's say, some kind of AutoML overlords will take over the world, and humans don't need to build any of these systems.
Or maybe there's some AGI and the systems just build themselves. Of course, in a world like that, human-centricity doesn't make any difference. Then there's another school of thought, thinking that at the end of the day it will be humans building these systems in the future.
In which case we are building tools for humans, in which case the user experience very much matters. I'm very much in the latter school of thought: yes, it is about building tools for humans. These machines are not gonna build themselves anytime soon. So that's why we really have to think about the infrastructure from a human-centric point of view.
Loris Marini: Yeah, I share that view with you. I don't know if it's a hope or a belief, or maybe a combination of the two.
Ville Tuulos: Otherwise there's not much need for a podcast, if it's the AGI just building itself.
Loris Marini: Yeah, we could have a podcast hosted by ChatGPT, a podcast version of it. Yeah, I don't think it's gonna happen. It could, because gen AI is incredibly powerful, but I guess the value that one can get out of it is something we still need to see. There's also an element of emotions.
As engineers, we always try to talk about facts, but the reality, as marketing people know very well, is that 90% of our decisions are made emotionally, and in an organization what often plays out is the interactions between people, as businesses are made of relationships and people working with people. So fundamentally, whether we like it or not, emotions drive 90% of the decisions: sometimes the dramas, but also the collaborations and the positive effects that can come up when you have a team. And there's a bit of, I hate that word, but synergy. I've seen that synergic force in action, and it's incredibly beneficial for the organization and for everyone involved. I wanna go back to those layers that you mentioned. In your chapter three there's an introduction to Metaflow, and you start with the branching and merging section. Why did you start from that?
Ville Tuulos: Okay, if it's okay, I'll take a step back. Really, one of the defining factors of any ML or data science work, compared especially to old-school business analytics and business intelligence, is that it is a very compute-hungry activity.
Training models takes many compute cycles, and transforming data takes compute cycles, not to mention all the new deep learning models, which are the most compute-intensive things ever created by humankind. So hence comes the question: how do you actually handle compute efficiently? How do you make compute easily accessible to people? And over time there have been many abstractions for this. Some of you may remember MapReduce, and of course now there are Spark and Ray and so forth. But our thinking is that these days it's actually amazing that we can go without any fancy new abstractions and just make it easy for people to execute a basic Python function, whatever they want to do, in the cloud. And the cloud gives you the abstraction that you can do even dumb things and it just works. And that's where the branching comes in. Basically, the idea with branching is that you can do multiple things at the same time.
So it's basically a way to express concurrency and parallelism, and I think that is very key to data science. It pains me so much when I see data scientists, let's say, operating in a notebook: they have five different models that they want to compare, and they do it sequentially, one by one. Each one takes, let's say, one hour, and you wait for five hours.
Why wouldn't you just do it in parallel and wait one hour instead? But the answer is that, as of today, just saying "do these five things in parallel" is not as easy as it should be in most cases. And I think that's a good starting point: let's start with the very basic things first.
That should be something you can do out of the box.
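The five-models scenario Ville describes can be sketched without any special infrastructure. This is a hedged illustration of the fan-out/fan-in principle using Python's standard library, not Metaflow's actual branching API; the `train` function, the config names, and the fake "scores" are all made up:

```python
# Compare several model configurations concurrently instead of one by one.
# With a pool of workers, total wall time approaches the slowest single job
# rather than the sum of all jobs.
from concurrent.futures import ThreadPoolExecutor

def train(config):
    """Stand-in for a real training job; returns a (config, score) pair."""
    # Imagine an hour of training here; we fake a deterministic 'score'.
    score = sum(ord(c) for c in config) % 100
    return config, score

configs = ["model_a", "model_b", "model_c", "model_d", "model_e"]

# Fan out: run all five "training jobs" in parallel, then join on the results.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(train, configs))

best = max(results, key=lambda pair: pair[1])
print(best)  # -> ('model_e', 25)
```

In Metaflow the same fan-out/fan-in shape is expressed as branches in a workflow graph (with a join step merging the results), which additionally gets you cloud execution, versioning, and resumability that a local thread pool cannot provide.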
Loris Marini: Yeah, and that ties in very well with your next point, which is all about scalability: the ability to just say, hey, I need more compute, and ideally just press a button and magically you've got your resources available.
Ville Tuulos: Exactly.
Loris Marini: I found that this is another interesting piece that's hard to communicate to non-data, non-software people, because of the intangible nature of what we do. Everything we touch at work is behind the same keyboard physically in front of us, the same screen, but really we are playing with logic, with things you can't kick with your toe. What is a bit, right? It's just some logical state somewhere in a machine, and those who are not used to it can struggle to understand scalability, horizontal scaling. Not that your business stakeholders necessarily should understand what it takes, but having a bit of an appreciation for what are these guys even doing, why are they on our P&L,
Ville Tuulos: Yeah. yeah.
Loris Marini: why are we spending millions? Maybe that piece. When you go to, say, the local hardware store (I've been doing a little bit of a project here), you buy your tools: a hammer, a bunch of nails, a bunch of timber. Once you use that timber, it's gone. If you need more timber, you just have to go and buy new timber. If you buy one drill, you can use it for only one activity at a time; you can't drill ten holes with the same drill. But we can do amazing things with data, with intangible stuff: we can spin up nodes, we can kill them, we can give birth to them.
Ville Tuulos: Yeah,
Loris Marini: It sounds like a bit of a creational type of thing. But it's true, we can do all those things. The problem is managing that mess.
Ville Tuulos: Yeah. And I think it's still amazing that as of today we don't have the mindset that one of the best deals any business leader can make is to convert CPU cycles into human productivity, because at the end of the day, CPU cycles are extremely cheap compared to human hours,
in pretty much any context. And still we live in this kind of scarcity world, where we think, oh my gosh, we have to guard access to compute. Of course, if you think about where computing started, in the 1940s or 1950s, there were like a hundred computers in the world and everybody was queuing up so that they could please use a couple of hours of computer time.
And that kept going, and let's say in the 1990s and early 2000s we had clusters of computers and fancy queuing systems: yes, I submit my job, and then the MapReduce cluster executes the job, and I wait, and so forth. But now, these days, we can live in a world where compute isn't scarce anymore. You can let your data scientists access it. And even today I talk to companies and they shake their heads: oh my gosh, that would be way too expensive. And then we see these same people, who cost a hundred thousand, two hundred thousand, three hundred thousand a year, just idling there.
And I'm asking, look, what is expensive here?
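Ville's point can be made concrete with back-of-envelope arithmetic. All the prices below are illustrative assumptions, not quotes from the episode or from any cloud provider:

```python
# Back-of-envelope: an hour of data scientist time vs. an hour of cloud compute.
# All figures are illustrative assumptions.
salary_per_year = 200_000          # fully loaded cost of one data scientist
working_hours_per_year = 2_000     # ~50 weeks * 40 hours
human_hour = salary_per_year / working_hours_per_year  # $100/hour

instance_hour = 4.00               # a sizable on-demand cloud instance

# If renting 10 instances for an hour saves that person 4 hours of waiting:
compute_cost = 10 * instance_hour      # $40 of compute
time_saved_value = 4 * human_hour      # $400 of reclaimed human time
print(human_hour, compute_cost, time_saved_value)
```

Under these assumptions the compute spend pays for itself ten times over, which is the "convert CPU cycles to human productivity" trade Ville describes.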
Loris Marini: It's a really fair point, and I understand the argument on both sides, right? Cloud bills can skyrocket without any management, without any control. But we're not saying that; you still should manage your cloud.
Ville Tuulos: yeah. And like you can be smart about it, so
Loris Marini: yeah.
Exactly. So what do you think about that piece? Because I think it connects with better framing, which is one of the how-to points you stressed in the preparation notes leading up to this episode: the idea of starting simple and
building end to end. That is somewhat connected with compute, because yes, we can throw as much compute as we want at a problem, but compute alone is not going to solve the buy-in piece, building confidence, building trust, and
fundamentally building a relationship with our stakeholders. If you were to go back ten years and give recommendations to the version of yourself that was doing data science back then, what would be some actionable tips for how to improve the framing of a project,
avoid scope creep, and keep it manageable?
Ville Tuulos: Yeah. It's not only ten years back; it's something I remind myself of every day, and I hope other people remind themselves every day as well: it's always a question of framing. Let's say you are a data scientist. What is your job at the end of the day?
Of course, if you ask many data scientists today, they say their job is, let's say, to build models, or in some cases maybe to build workflows, data pipelines, something of that sort. But hopefully, and I know this requires some enlightenment from the organization as well, the idea would be to deliver actual results, which means you really want to think of the whole system, from where the data originally comes from all the way to how we make the business impact. And I know this is easy to say, and it sounds lofty when you say it.
But it really is a fact that when you adopt that perspective, a couple of things happen. First, you realize, oh my gosh, there are so many problems to solve, and the model really shrinks; it becomes a small part in the middle. And secondly, it's so empowering, because you understand that many problems that are really quite hard to evaluate in isolation actually become quite tractable.
One big question, for instance, is about model quality and what the right quality metric is. If you look at that question in isolation, as a data scientist, you can fight infinitely over what the right loss function is, and you can have fun lunch discussions about that.
But if you take a few steps back and ask, okay, are we optimizing for the lifetime value of the customer, then you realize that whether the metric is RMSE or something else doesn't make a huge difference; it's everything around it that matters.
And I would claim that the reason this is not happening today is just that the tooling makes it so hard that in many cases the complexity of thinking the thing through end to end would be too much. That's really the mission that we have had, and that I have had: can we get the tooling to the level where people can actually start seeing the whole thing end to end,
without it being cognitively overwhelming?
Loris Marini: Yeah. And I think there are a bunch of pieces there, right? There's the amount of brain power available
to do that. Because often, as a data scientist, you're thrown at the problem, and only if you're lucky do you have access to all your data.
But let's say that you do. The next piece is that you only have your laptop; you don't have the infrastructure to crunch
those volumes. Let's say you're lucky again and you have that infrastructure. So you basically have everything: you've got the raw ingredients, your data, you have the systems to process that data, and you have a clearly formulated business question
you want to try to answer, with a recommendation to give.
So off you go: you study your dataset, you do some exploratory data analysis, you have an intuition, you start building a model, and you realize that, hey, you can already get to 80% accuracy for that particular subclass of problems. You go
back to your stakeholders and propose a solution. I've seen this lifecycle happen
so many times, and I think there's a bit of a bug there. What's missing is the piece where you actually understand the domain before you go off, because we are eager to
use data, use our systems,
Ville Tuulos: Yeah,
Loris Marini: Sometimes the answer is that you don't need to develop an algorithm at all.
And the problem I find is that a lot of data scientists don't see this as valuable work, because they believe, or at least perceive, that their function is measured by how many models they ship and what the accuracy is, all the stuff that is easy to measure,
as opposed to the stuff that is invisible: talking to the right people, building that shared domain knowledge so that we can try to find a solution before we even try to build this crazy thing.
And that's really about problem solving and a focus on delivery, as opposed to a focus on data. What do you think are some strategies data scientists can use? Because
we are under pressure; we need to get an answer out, and taking that time can feel extremely hard. We have to literally push as hard as we can against these invisible boundaries and make the space to have those conversations with our stakeholders. Is a general understanding enough, or are there strategies you can apply in the day-to-day to carve out five minutes here, ten minutes there, and build your own understanding of the problem before you dive in?
Ville Tuulos: Yeah, that's a great question. And I want to be very empathetic to all practicing data scientists out there, because so much depends on the organization. It is a fact that some organizations just don't give you that space, even if you understand that's what you need to do,
if you are surrounded by people who don't understand and don't give you the space. And it's not that those people wouldn't understand; they have people around them too, and it is an organizational thing. So that's one thing. I also understand that it's hard for a data scientist to tell their boss, look, I'm just going to need this many months to understand the problem, and I'm not going to build anything.
And I do believe that a good way to form that understanding is also to actually do something. But as you pointed out, the worst thing is that you rush into building something amazing and you don't even understand why.
Then you might say, okay, the next best alternative would be to spend a good amount of time understanding what we are doing without building anything. And I understand there's business pressure saying, you guys can't spend two months just sitting there.
So what has really led to good results many times is that you start with something simple and start exploring. It's like how in machine learning we often have these expectation-maximization or exploration-exploitation types of dual situations: we start doing something, and as we are doing it, we understand the problem domain better.
And hence we can tell the business owners and the business leadership, look, we are making progress, but we are keeping it quite small, and then we deliver something. That's the key point. And the point is not just to do the first deployment; the first deployment is barely the first milestone.
Then, once you have something, and this is such a key thing in all data projects: your work has very limited value before it's connected to real, live data that changes all the time. That's the ultimate test for projects. It's not how well the model performs in isolation in a notebook; you have to get it out there.
It has to run without human supervision, daily. And it doesn't matter what the results are initially; it gives you such a good perspective when you see how things start failing in practice. Then you come back, and with a much better understanding of the data and the business context you start creating the next version, and you start building that muscle: okay, how do we keep improving the system consistently over time?
And that seems to be working quite well in many places.
Loris Marini: Yeah. Awesome. So you split the workflow into two parts. First, everything you need to do to go from an idea to a minimum viable product that you
deliver. It doesn't necessarily have to be in production, but it can show that, hey, we've made progress, we have reasons to believe this thing can perform well, and these are the potential business gains, the value of the prize, if we deploy it. And now we need to deploy.
And that's where we jump to the second part. We've traveled a long way through your book.
Ville Tuulos: Yeah. And if I may, just quickly one comment on that. I think one really useful discussion to be had at every organization is: what is the definition of production? Especially in the context of machine learning systems, I think this production thing is a boogeyman.
People are just unnecessarily afraid of it: oh my gosh, it's production, and the engineers go bonkers. Look, here's a question for all the listeners: is the Netflix recommendation system a production system or not? It serves hundreds of millions of people.
It can never go down, all that good stuff. But at the same time, development has never stopped. There are hundreds of experiments ongoing at any time, and if you ask the people there, all of them suck in different ways. It's an experiment; it's a continuous experiment.
It's an ever-changing experiment, and it's an experiment with certain SLA requirements; some have stricter SLAs than others. But yeah, I think we have to think about production in different ways than we used to.
Loris Marini: You just changed my mind. It's incredible, you know the feeling when you feel your synapses stretching? A
new path has formed in my brain right now. I never thought about it that way. I always saw production as static, not necessarily a monolith, it can be distributed, but a static asset that you deploy. Like in Docker we have hashes that tell us an image is exactly that image and it's impossible to change it: if you change one bit, the
hash changes. So there's that feeling of control: I can take a snapshot, I can tell you exactly what it looks like. But you're right that production doesn't have to be like that.
And actually, probably the most valuable production environments are in data science in particular. This comes from the research of Bill Schmarzo as well: in one of his books he writes that only 5% of the business value results from dashboards and analytics pieces, including the fancier ones powered by machine learning models.
The real business value comes when you deploy these models in production, where you
have continuous recommendations, continuous updates.
That's when you make the money, the return on investment.
So it's critical to deploy, and it's untenable to deploy once and just call it production. Then what do you do if you want to run an experiment?
Are you going to deploy a clone system? You still need to get real data from the
real users. So
Ville Tuulos: Yeah. And that's exactly where the difference from traditional software engineering comes in: it's really the data that makes it different. Even if you pretended that we have perfect guardrails for production, and, to your point about Dockerfiles, that the hash never changes,
you know what? The data changes, your customers' behavior changes, the world changes around you. So there's no way to pretend that you somehow have a static production that never changes. And the companies that try to pretend that's the case start thinking, oh my gosh, now we have data drift and model drift, how do we deal with this?
Look, that's the nature of the world. It's not a bug, it's a feature. That's how these systems work.
Loris Marini: I love it. I love it. So all these assumptions have been baked into the design principles of Metaflow?
Ville Tuulos: Yeah. And
Loris Marini: what is it? Yeah.
Ville Tuulos: Yeah, no, I just wanted to say that I don't want to claim the tool does much by itself. It's more a mindset, and that's where the human-centric part comes in: it tries to encourage a certain type of culture, how people should behave.
The tool helps, and in a way, if you look at Metaflow, you might even say it doesn't do much by itself. That's exactly by design: it helps with exactly the things you shouldn't have to worry about, so that you can focus on all the things we discussed here.
Loris Marini: So give me a description, high level obviously, because we can't do a demo here, and I'm craving to install Metaflow and actually play with it, because I'm getting so curious about
this. Give me a descriptive overview of what the customer journey, or the user journey in this case, looks like.
So I install it, and
then what? What do I see? What do I do? What can I build?
Ville Tuulos: Yeah, that's right. People install Metaflow, as simple as that. That's how everything starts in data science.
Loris Marini: Yeah.
Ville Tuulos: And look, as I said, the point is that it's a very no-nonsense thing.
The first thing you do is structure your project as a workflow. And since it's a workflow, you can have branching, which means you automatically get parallel compute and many other things, and at that point you go, oh, I think that's nifty.
Yeah, I've seen this before, but that's nice. The next thing you can do is start thinking, okay, I can actually scale this compute out to the cloud, and at that point you go, oh, this is nifty: I didn't need to change anything in the code.
I can just run it in the cloud. You can say, okay, I need 64 gigabytes of RAM for this function, and I don't need to refactor my pandas code. Oh, that's nifty. Then you also see that it versions and tracks everything automatically, so you don't have to think about Git or a separate version control and experiment tracking system.
So that's nifty. Then it helps with dependencies, so you don't have to figure out how to write the Dockerfiles by hand. That's nifty. It helps you with all these small questions, none of which seems revolutionary, but when you put these hundred things together, you end up with a functioning ML project. And the best part is that you can ask your neighbor to do the same thing, and now you have this common language; everybody speaks the same language.
And the best part, and I really believe this is the case, is that no tool can solve your problems perfectly. The whole point is that any tool can help you with the foundational problems, which is what Metaflow does, and then you can start layering your own abstractions on top.
Let's say you're a real estate company: you surely have your own ways of handling geographic and spatial data. So you build that. Or if you're in computer vision, maybe you have your own special ways of doing labeling and data loading. You build that, and then it's the combination of your domain knowledge, baked into your own libraries, and the foundations taken care of by Metaflow that suddenly gives you these superpowers. Then you can actually get to that experimentation culture.
That's how we see the best organizations doing it.
Loris Marini: Yeah, this is amazing. There's a tutorial on metaflow.org that might be the first place
to start and actually get a feel for this, but
Ville Tuulos: Yeah. And if you go to outerbounds.com, there's plenty more, all kinds of tutorials. We very much believe in education, in helping people. And of course there's a super active Slack community that you're welcome to join.
Loris Marini: Oh, I will for sure. outerbounds.com, fantastic, that's your business. We're going to make sure we put the links in the show notes as well. I'm just looking at the website right now; there's plenty there.
The part where you said you need 64 gigs of RAM, that is a superpower: if I can just say, give me more resources.
I ran out of nails, I need more hammers, or I
need a bigger model.
Ville Tuulos: I'm sure many of your listeners who have been data scientists have had this feeling: you have a notebook on your laptop, you load the dataset in pandas, you do something, and then it crashes, goes out of memory, you get a memory error.
And you know you're under time pressure. You know you could refactor everything in Spark or something like it, and maybe that would be the right way of doing it, but you wish you just had a button that says, give me more memory on my laptop. You push that button, the cell executes, and you're happy.
And that's exactly the abstraction we want to keep: just give me the button that gives me a bigger laptop, or give me many laptops, and I don't have to change anything in my code.
Loris Marini: Worse than that, Ville, is when you do the infrastructure yourself, because maybe you don't have an infrastructure team. You rent a particular instance of a particular machine in your cloud of preference, with a specific amount of RAM. You have a job you wrote yourself, in Python or whatever language, splitting the work into small chunks,
then recombining it all together. The job lasts three hours, and then at the two-hour-and-35-minute mark, you just get this instant crash
and all your work is gone. Yeah.
Ville Tuulos: You should check out the
resume command in Metaflow; it helps with exactly that. Been there, done that. It's exactly like you said: it's so painful, and you don't want to start from the beginning.
Loris Marini: Wow, okay. So what does it do? It captures the current state, or
the execution map, and continues from there?
Ville Tuulos: That's right.
Loris Marini: It takes snapshots as it
goes. It's like an auto-save.
Ville Tuulos: yeah, exactly. Yeah.
Loris Marini: This is brilliant. Okay, I should definitely check it out. Okay, I'm just looking through our notes: is there anything we set out to cover that we didn't cover at this stage?
Ville Tuulos: Gosh, I said at the beginning that we only have 60 minutes and we're only scratching the surface, and I have that feeling now. Of course, a big topic overall is all this cool new stuff, generative AI; we could go deep into how that relates to all the infrastructure.
Loris Marini: Yeah, let's do that
Ville Tuulos: For a different day, maybe, but
Loris Marini: Yeah, we can open it up. Just to try and recap: let's do the recap exercise, the 60-second recap, and then if we have extra time we'll spend it on whatever funky AI
conversation we want. Just before we do that, a reminder to our audience again that there is a book page, the link ending in forward slash ML Systems, one word: mlsystems. We are also giving a 35% discount code to all our listeners, so if you're interested in getting that one, definitely reach out; if you're a listener, you know how to reach out. And just a note, I actually forgot to mention this, but the earnings from the book Ville wrote will be donated to charities that support women and underrepresented groups in data science, which I think is even cooler than the book itself, which is already fantastic: Effective Data Science Infrastructure, how to make data scientists productive. So I'm going to kick off the timer and see if I can set it to 60 seconds. That's the challenge for you, Ville: help me summarize the important messages we covered today in 60 seconds.
Whenever you're ready, I'm going to hit start.
Ville Tuulos: Okay, yeah, let's go for it.
The most important thing: understand the business context. There are many diverse use cases where you can apply ML; it doesn't have to be only flagship projects. Now, when it comes to infrastructure, you should really think about having quality infrastructure for all your ML projects.
It really makes sense to think from the ground up: what are the things you always need? You always need data, you always need compute, you always need orchestration, you always need versioning. You can have point solutions for each of these problems, but to actually make your data scientists productive, it's much better to have a single cohesive, human-friendly interface, which Metaflow provides.
So Metaflow just makes it easy to address all these concerns that you need to handle in every data science project anyway. And on the infrastructure side, it's something engineers like: it works on your own cloud account, it works with your security policies. It has made many companies and many people happy so far.
So try it out for yourself.
Loris Marini: That is the definition of perfect. How many times have you done this?
Ville Tuulos: Yeah.
Loris Marini: No, seriously: metaflow.org is the website to start with, and outerbounds.com for even more learning resources around Metaflow and tutorials. God, I wish I didn't have my next activity and had the whole day free to experiment with this. So, fantastic. I think we have a few more minutes. I wanted to maybe just plant some seeds for what's happening
with ChatGPT and generative AI. How is that going to fit within the data science workflow? We talked about data scientists of the future becoming the domain experts.
Now ChatGPT can write summaries, give you talking points, action items. Soon enough it will do the project management part: setting priorities, rearranging your tickets. Is this an ally? Can we find a way to leverage this stuff to make our jobs more effective?
Ville Tuulos: Yeah. This of course totally warrants a podcast episode of its own, but long story short: first, yes, we are seeing a qualitative improvement over what has been possible in the past, so nobody should underestimate that.
This has of course been a long time coming, so it's not some kind of overnight revolution, but we are reaching a new level now. One thing people underestimate today is how hard it is to actually build applications around these models that actually work and actually help with real-life problems.
It's one thing to go to ChatGPT, write funny questions, and get silly answers, and that's of course highly entertaining. But it's a really big next step to, let's say, genuinely improve healthcare, logistics, or real estate using these models.
My prediction is that the best way, and how we will see the best companies doing this in the future, is that you combine your own domain expertise, your own very detailed understanding of your customers and your domain, with the best parts of these models.
So it's not that these models are taking over the world and running you out of business; it's that you have a new kind of superpower in your toolbox, if you will. You will still need your own data scientists, you will still need your own models. But the best you can do is, let's say, take the embeddings, use these models, and mix them in with all the stuff you're doing internally.
And the best companies will be providing amazing experiences. It will take time, five to ten years, but I'm super excited to see how this will play out.
Loris Marini: I share that view, and I'll add that, hopefully, AI capability will create more space. That's what I'm hoping for data scientists, analysts, everybody doing the heavy lifting of connecting business requirements to data requirements and trying to serve them. Because they're going to have that space, that bandwidth, those emotional and cognitive resources
that they can spend building relationships with people. Because in the end, whether you are a remote-first company or an onsite company where you breathe the same air
Ville Tuulos: Yeah.
Loris Marini: and share the sweaty air with your colleagues, fundamentally we are humans, right? We are social animals.
So there's that piece on human skills, not soft skills but human skills, and the processes you need to develop to be across all the different business requirements and be an effective agent of change.
Effectively, that's the only way you can prioritize. Because everybody wants to get promoted in the end, and for me to get promoted and have fantastic professional development, I need to know where the business is going. That's typically not that hard: all it takes is having the time to attend those key meetings where
the business or the leadership team shares their targets and objectives, right?
So that's the easy part. The hard part is to map it: this is where the business wants to go, these are my constraints, and this is what I can actually do in the day-to-day. What's the 80/20 there?
What kind of projects do I have to say yes to? And
perhaps even lead with enthusiasm, trying to motivate people, maybe even working that extra hour a day for a couple of weeks just to prove there is enough appetite for that project,
because you know that project is going to lead to the business outcome. And you know it's going to give you domain knowledge: you grow as a data scientist, you grow as a business strategist or business support person,
and your next data science project is going to be even more impactful.
It's a kind of flywheel. So I really hope that AI will spare us from writing Dockerfiles and get us to focus on
the human and people skills.
Ville Tuulos: That's right. And just to riff on that idea: we all see how ChatGPT fails, maybe 20% of the time you get something silly, it hallucinates or something. You should adopt the same kind of mindset with the experimentation culture.
It's totally okay to fail. Failure is really a feature, not a bug. It just means you move on to the next idea. This stuff we do is experimental in nature, and that's how it should be treated. The thing is, you keep doing it often enough that eventually you hit that gold mine.
Loris Marini: Yeah, fantastic, awesome. Ville, thank you so much for your time. I think we've run out of the 60 minutes we allocated, and you're super busy, so we'll wrap it up. For our listeners: you'll find Ville on many channels. Check the show notes for the links if you want to get a copy of the book, and I definitely recommend checking out everything we mentioned in this episode. So I'll see you next time.
Ville Tuulos: Yeah. Thank you. Thanks for having me.