Episode: 52

Lauren Balik: Why the modern data stack is broken and how to fix it

Loris Marini - Podcast Host, Discovering Data



Join the list

Join hundreds of practitioners and leaders like you and get episode insights straight to your inbox.


Want to tell your data story?


Check out our brands or sponsors page to see if you are a match. We publish conversations with industry leaders to help data practitioners maximise the impact of their work.

Why this episode

This is a big-picture conversation that drills into some of the worst anti-patterns in the Modern Data Stack movement. How do we deliver data at scale without breaking the bank?

Today we think about “incrementalism” and why the technological landscape has never been more fragmented. We'll talk about why data consumers are also data producers, the balance between normalization and de-normalization, why what we do in dbt is not metadata-driven and therefore doesn't scale, why we need to measure the value-add of everything we do, and much more.

Lauren is an open heart surgeon of the Modern Data Stack and an expert in building performant data functions that just plain work. She runs Upright Analytics, a boutique advisory and technical implementation firm serving everyone from the Fortune 500 to startups.

You can follow Lauren on LinkedIn.

Join the Discovering Data community!

The mission of Discovering Data is to create opportunities for data professionals to connect and learn from one another. That's why we launched the Discovering Data Discord server! This community is for people who want to have an impact at work: a place where we can all gather and ask questions so we can help each other grow. Stay curious and keep learning by joining the community at https://bit.ly/discovering-data-discord

For Brands

Do you want to showcase your thought leadership with great content and build trust with a global audience of data leaders? We publish conversations with industry leaders to help practitioners drive more business outcomes. Explore all the ways to tell your data story at https://www.discoveringdata.com/brands.

For Sponsors

Want to help educate the next generation of data leaders? As a sponsor, you get to hang out with the very best in the industry. Want to see if you are a match? Apply now: https://www.discoveringdata.com/sponsors

For Guests

Do you enjoy educating an audience? Do you want to help data leaders build indispensable data products? That's awesome! Great episodes start with a clear transformation. Pitch your idea at https://www.discoveringdata.com/guest.

💬 Feedback, ideas, and reviews

Want to help me steer the direction of this show? Want to see this show grow? Get in touch privately or leave me a review with one of the forms at discoveringdata.com/review.


Want to see the show grow?

Your ideas help us create useful and relevant content. Send a private message or rate the show on Apple Podcasts or Spotify!

Episode transcripts

**Loris Marini:** The premise of this one is that a lot of people in the industry have noticed that the modern data stack went through the usual hype cycle. Now we are at the point where a lot of people are realizing there are some fundamental problems with how we approach the management of our data assets, and that the MDS is in itself a great idea, but it's also broken.

It doesn't really deliver on the promises that we all had in our minds when we jumped into combining and connecting these fully managed services. So today I want to go deeper. I want to try and understand why this is not working. And if we do our job, you'll be able to understand the bigger picture around the modern data stack, how it connects with your day to day, and what you can do about it without necessarily scrapping it

and chasing the next technological solution. Now, I'm not doing this alone. I'm here with Lauren Balik. Yes, Lauren runs Upright Analytics, a boutique advisory and technical implementation firm where she and her team are open heart surgeons of the modern data stack.

She works with organizations ranging from the Fortune 500 to startups to build performant data functions that just plain work. I'm super excited to have you on the show. Thanks for being with me.

**Lauren Balik:** Yeah, absolutely. Thank you so much for having me.

**Loris Marini:** So gimme a bit of an intro. How do you see what's happening with the MDS? What's going on? Why is it not working?

**Lauren Balik:** Yeah. The modern data stack is centered around the cloud data warehouse. So that means Snowflake, that means BigQuery, it can mean Redshift. Those are the three big players, and the modern data stack is the idea that this is the center of gravity of how data teams and functions should operate.

When you break it out a little bit, you have your warehouse; if you're using Databricks it can be a lakehouse. I don't really care about the difference between lakes and warehouses and such. You have ingest, how you get the data in and store it, and then how you compute it into metrics, reports, outcomes, whatever.

Typically that's a BI layer, so that's Looker, Mode, Tableau, whatever: some kind of outcome that comes out of the data warehouse. And functionally it's just not working at a lot of companies. If you break out the modern data stack and look at its history,

and I've worked around this for a while, the modern data stack in 2017 was Fivetran for ingesting and then Looker as the BI tool. And from there we've sprung out. And now BigQuery has matured as a competitor to Snowflake and is larger than Snowflake. That's another option for...

**Loris Marini:** And what was the problem that the MDS was trying to solve, though?

**Lauren Balik:** Centralization. The modern data stack was solving centralization. If you think about the history of this, a couple of years ago, five, six years ago, a lot of businesses were doing iPaaS, or integration platform as a service, and shooting data from Salesforce to Marketo

through this middleware of iPaaS, or shooting data from this app to that app. And what the modern data stack says is: ingest everything into a warehouse, all in raw, then sort it out in SQL, and then dump it out to your business intelligence, your other outcomes, et cetera.

And along the way, in the last few years, we now have a bunch of data observability. We have data cataloging for the modern data stack. We have reverse ETL: not only do we have the ingest, the EL, but we have reverse ETL now. We have metrics layers, we have everything else. And at a lot of companies, it's just not working.

**Loris Marini:** So we were trying to solve the problem of silos, of lack of connectivity. We said, why don't we dump it all into one place? One of the fundamental ideas was how do we do it with the least time to implementation, the deployment time. How do we drop that down to days instead of months, right?

Remember, before Fivetran was a thing, people had to write custom code and spend months just to ingest the data, to extract it from the different endpoints and put it into one place. What Fivetran did was like, okay, now you can subscribe as a service. A bunch of clicks, you point to your sources, and magic, now you've got your data in one place. So that is progress.

And that got a lot of people excited, and they said, if you don't have to worry about that anymore, you can focus on modeling the data. And then one of the big problems you have with data is scalability: you need to have enough compute, enough memory to process large jobs.

And you never know in analytics how large a job is. Sometimes it's just a simple query that touches a hundred or a thousand records; sometimes it's a complex query where you have to merge a lot of tables. One of my nightmares when I was doing this stuff was waking up in the middle of the night with the famous error: you ran out of memory.

Your Python job crashed because you don't have enough RAM. That's not nice. And what things like Snowflake and BigQuery did was abstract that away completely. You just write your query and the engine works out how to do it. It scales, it's efficient, it's got start and stop functionality, so you don't waste money on keeping the system idle.

So it feels like progress. I got excited back in 2018 when I was working on this stuff. But why are we still struggling with largely messy, inconsistent data sets, and why are we still struggling to prove to the business that there's value in doing this stuff?

**Lauren Balik:** You hit the nail on the head. There are three major ways you accrue costs in running a data operation. There are people costs, headcount costs, and there are cloud costs; this is all cloud, the modern data stack is cloud.

So that's your Snowflake bills, whatever you're using in AWS or GCP or whatever around it. And there's the product cost. Fivetran, for example, is great because Fivetran reduces the need for engineering for a set of data sources. It was a hundred, then they got to 110, then 120, 150, whatever they're at now:

data sources loaded into the warehouse with a point and click. You're basically paying them to offload the cost of a data engineer for those data sets, to integrate those. Now, here's where this goes wrong: Fivetran and a number of their competitors are priced to bill you on rows or on volume.

People talk about ETL and ELT, and the vendors that do this normalize everything as much as possible. What I mean by that is they're making as many tables as possible and landing them in your data warehouse.

So you may have an endpoint from some app, or a couple of endpoints, and if you make your primary keys and figure out how this endpoint works, that could end up being five tables. Fivetran bills on monthly active rows, so that's the number of rows that come in.

If you look at any of the Fivetran schemas, they're very normalized. Shopify, for example, is a pretty common use case for anyone in e-commerce. Usually with Shopify there are 60-plus tables that come in. So for every order that comes in, it's not just one row that they're billing you for, the order or the order line item.

It's that times dozens, and they're billing you on every row. And if you're saying to yourself, that seems bad: the idea is that this is cheaper than a data engineer. This is cheaper than paying what, in the US anyway, could be $200,000 or so a year for a data engineer. But once you normalize it, you have to do something with it. It's not report ready. You're probably gonna want to de-normalize it at some point. And if you're pulling in your Shopify, and you've got Stripe, and you've got several other sources, and you're pulling in your Postgres tables for your product, you're eventually going to de-normalize these tables and set them up for reporting.
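To make the billing model concrete with invented numbers: if Shopify lands as 60-plus normalized tables and a single order fans out into, say, 20 rows across orders, order lines, discounts, fulfillments, and so on, then 50,000 orders a month is not 50,000 monthly active rows but roughly 50,000 × 20 = 1,000,000 billable rows.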

And that's all done in SQL in the warehouse. A lot of people use dbt today to do this. dbt is a SQL compiler with a scheduler. You have the ability to write macros, you have the ability to do Jinja, and you're basically taking the transformation that was previously done before the ingest, in the ETL, and putting it all in SQL in the cloud warehouse, rolling it back up, doing all the modeling in SQL. It's a very manual process, and what you're doing is putting the data into a form that can be used for BI, for operational use cases, whatever you wanna do with your data at that point.

And I think this is mostly wrong, and so do many people around the space. What the modern data stack does is, in the cloud warehouse, you're normalizing the tables, pulling them down as much as possible, because that's what the vendors who automate the ingest want, and then you're rolling it all back up in dbt, with Airflow, whatever.

And because you're doing this on the cloud warehouse, and you're running it however many times a day, 10 times a day, 24 times if it's every hour, once a day, whatever, you're incurring a cost every time you run it. And what we have here is a SQL problem. We've been asking too much of SQL. We've been doing SQL on the cloud data warehouse, which is one of the most expensive CPU cycles you can buy, instead of pre-processing and handling things ahead of dumping them into a cloud data warehouse, an MPP cloud data store and compute center. What we're doing now is just throwing it all in there, SQL-ing it together, and coming up with some kind of outcome.
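As a rough sketch of the roll-up pattern being described here, this is what that warehouse-side de-normalization often looks like; every table and column name below is a hypothetical stand-in, not a real vendor schema:

```sql
-- Hypothetical roll-up of vendor-normalized tables back into one wide,
-- report-ready table. Every scheduled run re-scans and re-joins the
-- source tables, which is the recurring warehouse compute cost at issue.
CREATE OR REPLACE TABLE analytics.orders_wide AS
SELECT
    o.order_id,
    o.ordered_at,
    c.email                      AS customer_email,
    SUM(li.quantity)             AS total_units,
    SUM(li.quantity * li.price)  AS gross_revenue
FROM raw_shopify.orders      AS o
JOIN raw_shopify.customers   AS c  ON c.customer_id = o.customer_id
JOIN raw_shopify.order_lines AS li ON li.order_id   = o.order_id
GROUP BY o.order_id, o.ordered_at, c.email;
```

Schedule that hourly and the cost accrues again on every single refresh.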

**Loris Marini:** Yeah. And maybe this is a wrong assumption, but I feel like there is a cost, thermodynamically speaking. If you look at the process of creating order out of chaos, of reducing entropy, of ordering things, you've got to spend some energy to do that.

There's computational energy, and there is, as you said, the cost of people, right? The brains that need to take the time to type on a keyboard and think about the design of that query, how to name things, how to test for them. There's always going to be some overall effort you have to make. And of course it comes down to which one is better.

Do you prefer to pay for people's time, because maybe there's added value in them taking the time to think about what they're writing, what it means for the business, aligning everybody on the same conceptual plane? Or is it better to pay more for computation time because it's just nice and easy? And I understand why this is appealing for a lot of people, especially those that are starting out: you can go from having zero data platform to a thing that feels powerful, that scales, that you don't have to manage, where you can just mash some SQL together and do things, right? In a matter of days.

Perhaps even less. In a couple of days you can start ingesting some serious amounts of data and have the whole system running. It's cloud based, so you can do access control, you can get people to join. Long gone are the days where we say, it works on my machine, or, I'm waiting for my CloudFormation script or whatever to ensure that all the systems in that big cluster are up and running and we are ready to crunch some data.

So there is something good about this. But the other side, as you just explained, is that we are not thinking about things we should be thinking about, because we think that just by bringing the data together into one place we are done, and surely we'll have time later on to worry about how to stitch it together, how to de-normalize it when we need it,

how to add meaning to it, like a description of a field, for example. What does this column mean? How did you get there? So we talk about lineage and all that stuff. But before we dive into the architectural side, I wanna talk about a concept that you've been writing a fair amount about, which is product incrementalism and the product-led, vendor-led nature of the systems that we use to create data products.

Can you elaborate on that a little bit?

**Lauren Balik:** Yeah. So here's what's happening, and it's important for anyone who works in ops who may be listening, or has some kind of ops-data hybrid role. It's important because they'll notice this has happened there too. In the past couple of years, the way companies develop, especially new companies, is the typical

venture-led path: they'll raise a seed, put out a product, get some users, get an A, get a B, get a C. What happens is a lot of these are what's referred to as unbundled or point solutions: they're really good at one thing. If you look at the modern data stack, look at the category of reverse ETL. It has existed as a concept for years.

**Loris Marini:** A little bit of background: what is it about?

**Lauren Balik:** ETL, or the EL, is putting data from an application, usually off of an API, into the data warehouse. Reverse ETL is the opposite of that: it's taking data from a data warehouse and putting it back into an application. Reverse ETL is one of the concepts that's come out of the last few years of heavy venture capital investing everywhere in the data space. And the idea is that we have these companies that ingest data; let's have other companies take the data and then push it back out to the applications.

And if you look at Informatica and what some of the old school players have done, this is ground they've treaded for years. Really, you're just flipping around the request: you're flipping a source and a target around, then adjusting for any limitations you have in the source or the target. It's not that revolutionary of a thing, and a lot of people say this shouldn't be its own company. A lot of companies out there today, I could name off a bunch of them, are able to do both. It's not a big deal. Right now you have all these companies out there that will buy Fivetran and they'll buy Snowflake, they'll have the data moving in, and then, now we wanna get data out.

Oh, we've got specialist companies for that. We've got reverse ETL, that's Hightouch, or it's Census. Now there's a new company out here that's gonna take the data back out. So one company pulls it in, you dbt it around, and then you push it back out

**Loris Marini:** Yeah.

**Lauren Balik:** with a new company. So you have three companies, three vendors in there doing this one basic pipeline.

Now the funny part is, it's all the same investors that do all of this. Fivetran is Andreessen Horowitz, dbt is Andreessen Horowitz, and Census is Andreessen Horowitz and Sequoia. They've unbundled the data stack and then you send it all the way through. And if you say, oh, I don't wanna use Census, then I'll use Hightouch.

Hightouch is backed by another venture capital firm that also has money in dbt and also has money in other products. And so really it's just become a game of unbundling. And if you look at the rev ops world, or the sales ops, marketing ops world...

**Loris Marini:** Rev ops? What's rev?

**Lauren Balik:** Well, revenue operations.

So that's sales ops, marketing ops, CS ops, all under the same roof. So that's products like Gainsight, Salesforce, Marketo, Pardot, things like that. And then all the plugins that go around them: account-based marketing, Clearbit, lead sourcing, whatever. These are also unbundled.

And the funny part about all of this, truly, is that if you look under the hood of all these products, they're all databases of tables. They're APIs. It's all the same thing. So the data products aren't any different from the rev ops products; everything's just been unbundled now. And if you look around the market, this has led to a lot of tension.

I always call it the iron triangle. There's data; there's finance, like FP&A; and then there's business ops, rev ops, whatever you wanna call it, all the Salesforce ops and marketing ops people. That's the triangle, and all three of them are just in a battle with each other over land grabs, and over how this data is gonna move from system A to system B,

and how it moves from B to C, and then what the result is. And if you look at how many products you're using, anyone who works at a SaaS company or an e-commerce or digital company out there, I would encourage you to look at how many different products you're actually using to move stuff from A to B, from B to C, from C to D.

It's all just moving data around in a business. And it's funny, because the investors of these companies love it, and the companies love it; they get more revenue off of it. But how much of this is valuable, and who should be responsible for what and where? That is the core of the problem.

**Loris Marini:** Unpack that a little bit. The what and the where.

**Lauren Balik:** Yeah. I'm a huge fan of the idea of chargebacks and federated compute, or federated budget. A lot of companies don't adhere to this, especially smaller, growing tech companies, and smaller companies in general. And here's what I mean by this, the what and the where. Say you're a business and you have Salesforce, for example, and you have Snowflake. The sales ops team runs Salesforce, the data team runs Snowflake, and you ingest data in

from Salesforce to Snowflake. Then you tie that data with, I don't know, something in a Postgres table, product data or whatever. Now you're able to see product usage by account, lead, contact, whatever you're getting from Salesforce. But one of the big problems that exists now is that

a lot of the ingest is just allowed to pass through as is. And if there's data that's incorrect or wrong, or not adhering to how it's supposed to be structured per a contract, that comes out of Salesforce and gets into Snowflake.

It's not gonna get any better once it's in Snowflake. Some companies out there will try to SQL around it: oh, we'll clean this up; oh, this table is wrong, but it's wrong because of these reasons, so we'll net this out and we'll write some SQL and it'll fix that.

**Loris Marini:** Yeah.

**Lauren Balik:** You do this once, and you do this twice, and you do this three times, and then all of a sudden you're using Snowflake or BigQuery or Redshift or whatever compute in SQL to write exception logic to fix problems that should have been fixed upstream.
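A hypothetical illustration of the kind of exception logic that accumulates; the names, dates, and bugs here are all invented:

```sql
-- Invented example of warehouse-side "exception logic": patching data
-- that arrived broken from the source app instead of fixing it upstream.
SELECT
    account_id,
    -- a column rename upstream left NULLs behind; patch it here
    COALESCE(account_region, legacy_region, 'UNKNOWN') AS account_region,
    -- one quarter was loaded in cents instead of dollars; net it out
    CASE
        WHEN created_at >= '2021-01-01' AND created_at < '2021-04-01'
        THEN deal_amount / 100
        ELSE deal_amount
    END AS deal_amount_usd
FROM raw_salesforce.accounts;
```

Each patch like this re-runs on every refresh, so the warehouse quietly becomes the permanent home of upstream bugs.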

Ask anyone who's worked in data a long time: when the application developers put stuff into a MySQL or a Postgres and then they drop a column, or they add a column, or they change the meaning of something, that's going to affect things downstream in your data warehouse.

And this whole idea of charging back and fixing your entities before they come in, that's missing in the modern data stack. It's not any product vendor's fault; it's just missing in the paradigm, in how this actually operates in practice. If you look at any company out there that operates at scale, what they're doing is resolving their entities before they bring them into the warehouse.

And when I talk about entities: every business has its own set of entities, but there are pretty core ones that are universal, like the idea of a customer, the idea of an order and an order line, revenue, things like that. A lot of companies will be bringing these in raw right now.

So let's say a company has three order channels. They do some stuff that comes through PayPal, some stuff that comes through Stripe, and some stuff that comes through Shopify. They sell in three different channels. If you're bringing this in with a Fivetran connector to PayPal, a Fivetran connector to Stripe, and Fivetran through to Shopify, what you're doing is resolving these entities and fixing this up in the warehouse.

You're netting out orders, you're netting out different financial logic. I've called this shadow finance before, and it compares with shadow IT. Shadow IT is, of course, when people go and buy their own products and their own plugins to things, and all the IT people hate it. But shadow finance is the idea of a team that's not finance being responsible for, and executing on, financial metrics. And when you're resolving entities in Snowflake or in BigQuery, writing out a bunch of SQL and fixing stuff up in the warehouse because it's easy, or because you wanna do it, or your boss wants you to do it, you end up with shadow finance. I see this almost everywhere whenever I'm called in by a CFO or an FP&A team or a COO or somebody like that at these privately held companies. It is one of the biggest issues that I find, because what you should be doing is using your finance system of record, maybe it's NetSuite, maybe it's a number of other competitors out there.

You should be resolving your finance there, and then bringing that into the warehouse as your orders and order lines, transaction lines, et cetera. Because the point of the warehouse is to join that to marketing, and to join that to product, and then to get outcomes out of it: how did this marketing stuff affect our revenue, which is a function of orders and order lines?
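For illustration, the warehouse-side netting being warned about here often looks something like this sketch (all names invented); the argument is that this reconciliation belongs in the finance system of record, not in scheduled SQL:

```sql
-- Hypothetical "shadow finance": reconciling revenue across payment
-- channels in the warehouse instead of in the finance system of record.
CREATE OR REPLACE VIEW analytics.net_revenue AS
SELECT order_ref, amount, 'shopify' AS channel
FROM raw_shopify.orders
UNION ALL
SELECT charge_id, amount / 100.0, 'stripe'   -- Stripe amounts in cents
FROM raw_stripe.charges
WHERE status = 'succeeded'
UNION ALL
SELECT txn_id, gross - fee, 'paypal'         -- net out PayPal fees
FROM raw_paypal.transactions;
```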

**Loris Marini:** But that requires that the app or the service you use to keep track of and fully understand your financials is able to integrate with the many others that you use for payments. Is that always the case? I suppose this is the frustration of a lot of chief information officers.

If you look back at when the CIO role was hot, right, their job was to integrate systems and do all the background research necessary to ensure interoperability. So they would go through sometimes lengthy processes of evaluating vendors, and sometimes even large enterprises would talk to the vendors' teams to get changes made in the app, so that they could guarantee the system would work with the other systems.

For some reason we got excited about every team, every department choosing the app they like the most. Which is good: it led to a lot of independence. Now teams can choose whatever they're comfortable and more productive with, which is a good thing. It led to a lot of innovation, cuz now a lot of vendors solve the same problem in different ways.

And that's good, because more diversity arguably helps the industry evolve and find the most fit for purpose. But it also led to fragmentation. And I don't think we can unwind the clock. It'd be nice to go back, but that's where we are at the moment:

this fragmentation of tooling and applications. So what you are arguing here is not to get one tool to do it all. It's that instead of directly connecting this maze, this gigantic, enormous list of applications you use, bringing them all back into the warehouse, and then worrying about cleaning, defining your entities, and doing all that in SQL, you should do it before you do that.

Am I understanding this correctly?

**Lauren Balik:** Yeah, that's correct. And the canonical example of this is your finance team should not be closing their books, or paying out sales taxes, or doing other operational stuff they do, based on anything that comes out of a data warehouse. They should be doing this off what comes out of their system of record: NetSuite, whatever. There are a number of other systems out there that you can rattle off.

**Loris Marini:** So why is that not happening?

**Lauren Balik:** Here's why it's not happening at a lot of smaller companies. Amazon and Google are offering cloud credits, a hundred thousand dollars a year for two years, to customers who use their services, and startups love it. It's great: you can get a lot of credits to build your application, build up your system, get a data warehouse, get a Looker license, whatever. And it's actually cheaper now to do that, to buy the data software, buy this basic modern data stack, a warehouse, a BI reporting tool, and then extract and load it in with a Fivetran or whatever, than it is to do NetSuite correctly.

And it's cheaper because it's free. It's free for these companies for two years.

**Loris Marini:** Computationally speaking. Yeah.

**Lauren Balik:** It's $200,000 of free credits to be able to build it this way. It's great for companies who wanna save money, but what you're doing is creating tech debt in the data warehouse.

What you're doing is creating shadow finance. You're creating financial metrics and financial outcomes that are held together by SQL files in the data warehouse, written...

**Loris Marini:** by non-finance people often,

**Lauren Balik:** Written by somebody who's not on the finance team, written by somebody whose job title is analytics engineer or data scientist or whatever.

And at its worst, companies will have four or five different payment channels, or different channels that they sell in, and they put it all together in the warehouse and then make a Looker dashboard or a Tableau dashboard or whatever out of it.

And then as the company grows, they decide, okay, we need accounting software now, and they buy accounting software. I've seen cases where accountants will go and download a CSV out of the BI tool, because it's all the stuff that's been compiled together, and then put that into the accounting tool.

Or now we have the option of reverse ETL, so you can just reverse ETL it into the tool. And this should all be done before it gets into the warehouse. In the US, privately owned companies at seed, series A, series B just do not give a crap about finance. I don't mean things are wrong or fraudulent; they just don't give a crap about it. The whole thing's very funny.

**Loris Marini:** So the incentives are not really there to do things properly from the start. It's much easier, much more appealing, to just mash some SQL code together and then worry about things like integrity and meaning and semantics...

**Lauren Balik:** Worry about it a year later, or worry about it two years later when you're raising a new round and an investor needs to have this correct. It's all pushing off reality to the future. I think a lot of this is going away, to be honest, with the way markets are compressing and there's less money floating around.

Money's not free anymore. In the US I think a lot of this is going away.

**Loris Marini:** But I'm just wondering, because we can't unwind the clock, we still have to deal with application sprawl, the diversity of apps we use in the day to day. And there's always gonna be, especially in startups that need to be quick, the drive to get some quick wins. That's not gonna go away.

Even in large enterprises, because money is not cheap, we need to be able to demonstrate value before we invest in a four or five year migration project, or integration project, or data warehousing project. It's part of how we minimize risk.

Certainly the solution is not to go back to some sort of waterfall approach. We wanna be agile, we wanna test things, we want to prove that there's value before we invest more. So that means taking quick wins, cutting corners. But there is a good way and a bad way of doing it, right? Those that have done software development for a long time know this.

They know that tech debt is something that is always gonna be there, and they have to keep an eye on it, they have to be vigilant, because it can creep up on you: you might not notice it, and you just feel the effects of tech debt down the track. And it's always a difficult conversation, the one about refactoring. Taking the code and taking the time to say, do we need to improve it?

Have we reached the point where it makes sense to prioritize this problem right now, instead of developing the next feature? To really sit down and think about how we name things, set some sort of standards, or take code that is not performant and rewrite it in a much faster, more efficient way.

Those are problems of maintenance. Engineers are great at solving those problems, because their performance is measured against them: how effectively can you transform? How effectively can you compute? How effectively can you get some sort of behavior out of the app or the system by writing code?

So if SQL is the platform where we do things, what do we do? What is the next quick win we can try to get to, to stimulate long-term thinking and solve the fundamental problems at the right time, or sooner rather than later? Because if we keep waiting, that tech debt keeps increasing, and eventually your engineers will leave the company, because nobody wants to work in a workplace where all you do is extinguish fires with SQL.

It's not gratifying work.

**Lauren Balik:** The past couple of years, on the upswing, it's been grow at all costs. If you break down that term, there's "grow": every business wants to grow. And then there's "at all costs", and the "at all costs" piece is the cost at some of these companies out here in the last couple of years. And I'm not just talking about VC-backed companies that are on this

seed, A, B, C financing round route. I'm talking about publicly traded, large companies. It's grow, digital transformation, grow at all costs. And now it's, okay, we've tapped out how much we can grow. There's actually only so much growth that any company can do realistically.

There are only so many human beings in the world. There are only so many dollars that those human beings have. If you're selling B2C, it doesn't matter whether you're selling to women aged 18 to 25 who are college educated, or to people aged 50 to 65, blah, blah, blah, whatever your market is.

And it also goes for B2B: there are only so many businesses that are ever gonna buy your product. So with this grow mentality, a lot of companies on these digital transformations and growth pushes have burned $2 for every $1 that they've made, or three, or four, or in some of the crazier cases I've seen, some of these VC-backed companies are burning $10 for every dollar that they make. And getting that burn rate down at a fundamental level in the business, to something that's closer to profitability,

if not profitable itself, is where we've been trending. And a lot of this is a function, in the data world, of the cloud: on-prem is CapEx focused, cloud is OpEx. And a number of data teams and engineering teams will hire new people: a new software engineer, a new PM, a new data engineer.

And it's all grow, it's growth initiatives. And because it's grow at all costs, they will write all this non-optimal code and build these non-optimal SQL processes in the cloud data warehouse, the most expensive compute cycle you can buy. We're seeing a pullback already. What we're gonna do in the future here is think before we leap, and refactor.

And one of the funnier ones, and this is just me laughing because I have a good time and I laugh about things: dbt put out a blog post, an article, whatever you wanna call it, about how they got one of their own jobs fixed.

Now instead of taking three hours to run, it takes 90 minutes. What they were doing was writing window functions over a 5 billion row table, and doing this four times a day. Anyone who's an engineer and knows SQL out there: a window function is gonna rank or assign a number, so it's gonna look through every single row.

It's a loop. It's gonna say, this is the first instance, this is the second instance, and then assign that to it. And then it's gonna go into some other SQL thing and eventually get turned into a report. And it's funny, because they were doing this on Snowflake.
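As a rough illustration of the pattern described (this is not dbt's actual job; the names are invented), a window function like this has to partition and sort the entire table on every run:

```sql
-- Invented example: ranking every event per user forces a scan and sort
-- over the whole table. Scheduled four times a day over billions of
-- rows, the warehouse bill compounds on every run.
SELECT
    user_id,
    event_at,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_at) AS event_seq
FROM raw_events.clicks;   -- imagine roughly 5 billion rows here
```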

And they realized that this was costing them thousands and thousands of dollars a year, and they fixed it. I would argue that they didn't fully fix it, they just partially fixed it, if you read the article. But the fact is that you have one of the data infrastructure companies out there today saying, hey, here's some crazy thing we were doing on our cloud data warehouse, on Snowflake, and we fixed it.

And now instead of taking three hours to run, it takes 90 minutes. Yes, that's an improvement. That's good. It also doesn't fundamentally solve the problem. Instead of spending however many thousands of dollars a year you were on this, you're spending lower thousands. You haven't actually fixed the problem, but this is the...

**Loris Marini:** ...the order of magnitude. Yeah.

**Lauren Balik:** The fact of the matter is that there have been so many people thrown onto this modern data stack who know how to write some SQL, which is good.

The best benefit of the modern data stack is that a lot of people have learned SQL. But a lot of these people have also written a lot of poor-performing SQL that ends up getting processed and compiled and turned into all these crazy things.

And in the last couple of years, that hasn't mattered that much: that extra $2,000 a year here, that extra 5,000 there, that extra 10,000 here hasn't mattered that much. But the data volume grows. What was a hundred million records in a table a year ago might be 300 million now.

And what was a terabyte might be two terabytes now. When you're writing these looping functions over this, it doesn't scale and it doesn't work. I think one of the biggest misses I see in how the modern data stack is executed is that companies, when they're at a small scale, or a smaller scale than they'll be at a year from now, two years from now, wrote the first thing.

They just wrote stuff in SQL for whatever made sense at the time, and then when they scale, it doesn't work. That's the window functions, that's your cross joins, that's CTEs of a table that references another table that references another table.

**Loris Marini:** Mm-hmm.

**Lauren Balik:** You compile it all together, and it's not scalable, and it's not really data modeling either.

And that might be controversial for me to say, but the way a lot of this stuff works is not actually data modeling. A data model should be separate from the queries that go into a BI tool, or the queries that are done ad hoc to answer a simple question. Where this goes wrong is mostly on digital data. When I say it goes wrong on digital data: digital data is your clicks, your telemetry, how many times someone downloaded or clicked in the product or used the product, or your IoT stuff. These are large data sets, and these are the things that matter the most when you're SQL-ing and SQL-ing and SQL-ing off them in the cloud warehouse.

If you're just talking about gigabytes of order data, it honestly doesn't matter that much. If you're just talking gigabytes from one order system and another order system, and you're SQL-ing them all together and running it through the cloud, you know what, you might be spending an extra thousand, $2,000 a year.

Who cares, right? No business gives a crap about that.

**Loris Marini:** But it's the logs, the logs that are...

**Lauren Balik:** Right. It's the logs, it's the clicks, it's all the digital data that makes this the problem. People talk about the world of atoms and the world of bits and bytes. In the world of atoms,

you have your orders data, you have what a human being did: the ticket that they opened or the ticket that they closed. It's low volume. But in the world of bits and bytes, it's all this other stuff, and that's where this adds up. And pre-processing this before you put it into a data warehouse is critical.

A lot of this stuff is really just an event. An event of: Lauren clicked this thing. So it'll be my name or my user ID or whatever, a timestamp, and then all the attributes about me. And that is a wide table,

**Loris Marini:** Mm-hmm.

**Lauren Balik:** if you have a ton of attributes. But that's what it is:

**Loris Marini:** De-normalized.

**Lauren Balik:** Yeah, it's one line. It's what I did at that moment: an event stream. And that's where I see this space moving towards. You can do this out of databases, if you have some kind of event stream set up and you wanna put this into your BigQuery, your Snowflake, your whatever. Google Analytics is set up like this.

Google Analytics is a Google product, obviously, and BigQuery is a Google product, so you can set that up easily. But a lot of the APIs that exist out there, from your Shopify to your top 20 other apps, are not like this. You can't get data out of them as an event stream natively.

You can ETL them and de-normalize them in some kind of a data frame, or use some kind of a product that models the data: you go in there, model it out, and then shape it and format it as you see fit. But a lot of the stuff that we have out there today is not formatted like this out of the apps.

And that's one of the biggest, I don't wanna say problems, but it's one of the biggest reasons why we're in this boat.
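A minimal sketch of the de-normalized event stream being described, in Snowflake-flavored SQL with invented names: one immutable row per thing a user did, with the attributes carried on the row:

```sql
-- Hypothetical wide event stream: one row per action, attributes inline.
CREATE TABLE events.activity_stream (
    event_id    STRING,
    user_id     STRING,
    event_name  STRING,      -- e.g. 'clicked_banner'
    occurred_at TIMESTAMP,   -- the immutable fact: who did what, when
    user_email  STRING,      -- attributes de-normalized onto the event
    user_plan   STRING,
    device      STRING
);

-- Reporting reads the stream directly; no roll-up step is required.
SELECT event_name, COUNT(*) AS events
FROM events.activity_stream
WHERE occurred_at >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY event_name;
```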

**Loris Marini:** I wanna move the conversation to act number six in our notes: how to do it right. Of course, a lot of this is gonna be hypothetical, but let's imagine we're both staring at a whiteboard, we have a black and a red marker, and we can design the ideal system.

So I'm really looking for a vision for the future, something that is attainable based on where we are in terms of the current state and where we want to go. What are the fundamental architectural choices you would make? Okay, so here's the situation: we've got a bunch of apps, some used in sales, some in marketing, some in operations, some in procurement, some in production.

It's everywhere, right? Every line of business uses these apps. Apps may change at a moment's notice: maybe people realize an app is not suitable anymore and they switch. So we've got that problem as well. But for a moment, let's say that we don't switch apps. We just have a lot of them, and the data team is tasked with creating order in this house, modeling the data so that you can keep building solutions based on the information and the entities you have.

Ideally it's cleaned, it's been tagged properly, it has documentation that is clearly available and easy to interpret by anyone in the business. You've got visibility, lineage, and observability, so that if something fluctuates and goes down, at the very least you're notified and you can go and check it out and see how to fix it.

How do we achieve all this? First of all, in terms of vision, where do we wanna go? Are we missing something? And how do we start this Lego project, basically?

**Lauren Balik:** What you've described, all of the upfront work of lineage and correctness of these entities, that is any data person's dream. Because here's how I think about this: the data warehouse, whether it's cloud or whatever, the most value you get out of it is by joining data between business domains, or between LOBs, lines of business. If we spend more time getting the finance input correct, making sure it's correct to finance's definitions, then we can pull it into the data warehouse. And if we spend more time getting the sales systems correct, then we can pull in all the sales systems and get whatever we wanna get out of there: the customer, account, order line, whatever else is in there.

Correct that at the source. And then bring it in.

And then we can connect finance and sales by joining it in SQL if we want in the warehouse.

Or we can join it through a data frame, or through some proxy of a data frame, through some vendor or whatever, and then bring it into the data warehouse.

And marketing, let's add marketing in here. So all the marketing systems are correct to marketing's definitions, we know what tables and what columns we're going to bring in, and then we bring them in. And so now you have a funnel. You have your finance, your bottom of funnel: what actually were the transactions?

You have your sales: who talked to who, how long did the deal cycle take, what are the timestamps of the deal cycle, who are the people, who are the accounts, et cetera. And that's all accounted for in the sales system, and it's correct per sales ops, per sales leadership. And then we bring in the marketing, and it's...
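If each line of business has already resolved its own entities, the warehouse query that connects them can stay as simple as this sketch (all names invented):

```sql
-- Hypothetical cross-LOB join over entities resolved at their sources.
SELECT
    f.order_id,
    f.net_revenue,                -- resolved by finance in its system of record
    s.account_name,               -- resolved by sales ops in the CRM
    m.campaign_name               -- resolved by marketing in its platform
FROM finance.orders           AS f
JOIN sales.accounts           AS s ON s.account_id  = f.account_id
LEFT JOIN marketing.campaigns AS m ON m.campaign_id = s.source_campaign_id;
```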

**Loris Marini:** So you mentioned an interface, essentially, between...

**Lauren Balik:** Right, and this is entity resolution: is the entity correct per the team that is responsible for it? A finance team should be responsible for the definitions of orders, order lines, returns, shipments. A sales team will be responsible for what is an account, what is a contact, what is a lead.

A marketing team will be responsible for what is a campaign, which campaign turned into a lead, which campaign turned into a contact, et cetera. And then, by resolving that ahead of time, what you're doing with the warehouse, or the store, or whatever you wanna call it, the lakehouse if you're on Databricks, and then the BI on top of it, is joining the data together.

It's already been corrected per the line of business, and now you're bringing it

in, and what you're doing is just joining it between lines of business. And that is the way to scale this. That's what's going to solve, as well as anything I've ever seen, the fundamental problem of: hey, finance's numbers don't match the data team's numbers.

The finance numbers are the finance numbers; those are exactly the same as the data team's numbers. And, oh, the sales numbers don't match the data team's numbers: here's sales, here's this. That's the way you can approach perfection. And in the last couple of years, the idea of a data team has come to mean the data warehouse team, or the data-storage-lake-SQL-manipulation team.

And that's wrong. That's a function of the job, but really the job is ensuring the quality and the contracts of the data, and the more work you put into that, the less SQL you have to write.
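A minimal sketch of what such a contract check might look like, run against a staging extract before load; the tables, rules, and thresholds are all invented:

```sql
-- Hypothetical pre-load contract checks. Any returned row is a violation:
-- the load is rejected and charged back to the owning team, rather than
-- patched downstream in warehouse SQL.
SELECT 'missing_account_id' AS violation, COUNT(*) AS bad_rows
FROM staging.salesforce_accounts
WHERE account_id IS NULL
HAVING COUNT(*) > 0

UNION ALL

SELECT 'negative_order_amount' AS violation, COUNT(*) AS bad_rows
FROM staging.netsuite_orders
WHERE amount < 0 AND order_type <> 'refund'
HAVING COUNT(*) > 0;
```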

**Loris Marini:** And I'm gonna add to that. I love this, but notice that everything you just said in the last five minutes is valid regardless of where you actually do this, right? So in principle, if you don't care about dollars per CPU cycle, you could do this in Snowflake or BigQuery.

You could do it physically in the same place. It doesn't mean it's the best way, but you could do that. The thing is, I don't see teams doing that at all. There's a missing piece that comes before we worry about optimization and de-normalizing things and lowering the cost of computation.

If we had, and of course we don't, but if we had unlimited budgets for cloud spend, we could still add value in the sense that we could agree on those terms and clean up. And we don't do it. We don't do it. And I don't think the problem is cloud spending, because we are already paying for those window functions.

The bill is coming already. What we are not doing is getting into these rooms and sitting at the tables with the folks that have the domain knowledge, the tribal knowledge, those that know what matters and what doesn't. Take the shadow finance example: we should spend time with the finance folks that live inside the financial application and know what is what, and what the CFO ultimately cares about, to clean up things before they come into the warehouse.

That doesn't necessarily mean, though, that the cleanup job has to happen on another server, or with another cloud vendor, or in a physically separate project. It could, but it doesn't have to, as long as we take that time. I think that would be a massive first step.

And then of course we wanna do it right, and we should do it cost-effectively. That's what engineers do. And we need to worry about how we do it, where we put it. Is Snowflake the right place? Or should we spin up a cluster and do it in Airflow, with a whole bunch of Kubernetes action going, to crunch those numbers and clean them up before they come into the warehouse?

I wanna explore the how-to in terms of visibility and lineage. Have you seen a solution out there that solves the biggest problem of all, which is knowing what the hell is going on from the input to the output and back?

And the second thing I wanna explore in the last 10 minutes is the problem you mentioned in our prep call: that reverse ETL brings tech debt from the app layer back into the data warehouse, if I understood correctly. I wanted to dive into that as well.

**Lauren Balik:** So fundamentally, here's how I think about lineage: it runs across applications.

And when I say lineage is across applications, what I mean is: Salesforce, Snowflake, Looker is an example. What's the lineage across those three applications? A lot of tools out there are solving lineage in the warehouse, or solving lineage in one piece of the stack. They're not solving lineage through different systems.

And if you look at... I don't know, you asked, do I know anyone out there? Manta is one.

**Loris Marini:** beyond the tools I was thinking about the architectural choices. Do we need to change anything? Can we expect a plug and play kind of behavior?

Is it just a matter of shopping for the right tools, slamming them on top of our dbt, and problem solved? Or are there, as usual, gonna be nuances, people issues, challenges, and other things that we might not necessarily expect?

**Lauren Balik:** Yeah. dbt is not gonna solve the problem; they don't even have column-level lineage. But breaking it down at a fundamental level: you have attributes that come out of an endpoint through an API. These turn into columns, and then these are put, as metadata, into a BI application or whatever.

And going through this whole process is really just collecting the metadata, collecting it at the column level, slash attribute level. Anything that doesn't do that is not real lineage. Maybe that offends somebody, but it's not real lineage. And this gets into what I was talking about earlier: I'm a firm believer that the stronger the app, the less data warehouse and the less processing in the data warehouse is needed. It's about charging back these teams and holding them accountable for what comes out of their data dump, their API, however they're egressing or moving or transferring data out of their system.
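As a sketch of what that means in practice (a hypothetical shape, not any vendor's actual model), column-level lineage has to record the attribute-to-column-to-BI-field chain across systems:

```sql
-- Hypothetical metadata table for cross-application, column-level lineage.
CREATE TABLE metadata.column_lineage (
    source_system    STRING,     -- e.g. 'salesforce'
    source_attribute STRING,     -- e.g. 'Account.Industry' from the API
    warehouse_table  STRING,     -- e.g. 'raw_salesforce.accounts'
    warehouse_column STRING,     -- e.g. 'industry'
    bi_field         STRING,     -- e.g. 'dim_accounts.industry' in the BI tool
    last_verified_at TIMESTAMP
);
```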

Their application is what matters, and not adhering to this is where all the problems start. So as far as lineage goes, it starts with the application, and if you are just dumping stuff in raw and it's not defined, you've already lost. A lot of audited companies know this. Publicly traded, audited companies will have some semblance of this because they have to, especially if it hits their financial books and records.

Everywhere else in the market, I just see vendor after vendor saying, we have lineage. But when you look into it, it's just warehouse lineage, or it's a piece of the puzzle. It's not the whole thing.

**Loris Marini:** Yeah. I wonder if there's value in partial lineage. Probably there is, but I think the game changer really comes when you have full visibility,

end to end. Cause that's the only way you can manage the system as a whole, instead of just parts of it.

**Lauren Balik:** Yeah. Lineage is just a function of how data moves through different transformations, integrations, applications, et cetera. And the longer it takes for the data to reflect reality, the real-world meaning of what it is, the more it's just not gonna happen.

And what I mean by that is: if I click a banner on a website, and that data goes wherever it goes, and it turns into "this system has Lauren clicked this at this timestamp, did this, and this is what we know about her", that's an event record that should be correct in the system it's collected in.

And that represents what I did in the real world, which is that I clicked that banner at that time. And this is already de-normalized, right? It's already an event stream of what I did.

**Loris Marini:** Yep.

**Lauren Balik:** problem is when we add all this, all these other I like Ralph Kimball and I like dimensionalized stuff, but like some of this dimensionalized stuff that exists out there today doesn't make sense because everything is really.

if you break it down at the atomic level, is an event, or an activity, or a thing that a person or a thing did. So I click this at this time, and that's a row, and it should be a stream, and it should go through the data process in a way that is real time or close to it. If you wanna dimensionalize off of that, things like my name or my gender or whatever other attributes you know about me, you can do that.

But that's all secondary. The primary thing is the event that I did. And all of this lineage between systems, if you break it down at a fundamental level, is really based on dimensions, not on the facts: the thing that happened at some place, at some time,

**Loris Marini:** Yeah,

**Lauren Balik:** a thing that occurred and...

**Loris Marini:** Immutable, right? To get to the ground truth.

**Lauren Balik:** Facts. A fact is immutable, and as you move away from immutability, you increase the need for lineage. And there are use cases for it. But I'll just go back and dump on Fivetran here. Fivetran wants to go dump in a bunch of normalized tables.

They're maximizing the number of rows to bill you on, and then you have to roll it all back up. If you're doing lineage on that, you've already lost, because now you're just paying for a lineage vendor or a lineage system, and you're paying to roll all the data down and then roll it back up.

**Loris Marini:** Yeah. Yeah, exactly.

Yeah. So there are cases where it makes sense. Having said that, all the analytics you do on top of those events, and the other tables that are part of the star or snowflake schema or whatever, that's where having visibility of who is doing what with those features

comes in handy. Cause a lot of data science, for example, or even simple BI work, is about merging and combining. And sometimes you wanna transform the data because you're looking for something specific, or you wanna prepare a report or a dashboard, or you wanna create some training sets for a data science model.

And as part of that manipulation you bring in features, because it's part of the discovery and experimentation process, which is fundamental to the scientific process and to any data science function. So what should we do then? Should the data scientist take this clean, pristine source of truth from the data warehouse, extract it, put it into some sort of Jupyter environment, maybe cloud hosted, and write some models on top?

And then it becomes the problem of productionizing those models. So it feels like there are different competing needs. One is how much fidelity you can maintain between the real world and what's inside your storage, your platform. The second is how quickly you can mash together,

join, filter, pivot these data sets to answer an important business question that just came through Slack from a CFO. The other is, how can you then build on top of it and do science with it? So there are really different needs. One is short term, one is long term. One is about accuracy and source of truth.

The other one is about agility and sandboxing, experimenting, and survival of the fittest. I think that's what I love about data architectures, because one would be tempted to think that there is one system that can rule them all and give you all the functionality. But the more I get into this stuff, the more I think we can't have one system that does it all.

We have to think about interoperability and break it down so that every system, like in any engineering problem, is optimized for one thing: it does one thing, it does it well, it's clear how it does it, and there's a clearly defined interface so that someone else can consume that output and do something else in their own system.

And this is dangerously close to what data mesh is trying to say, which is: the reason this stuff doesn't scale is large organizations, big teams, lots of competing priorities, lots of moving parts. You can't have it all centralized, and fully decentralized has its cons as well.

Can we imagine something that is a hybrid, where some parts are centralized and some parts are decentralized?

**Lauren Balik:** So, centralized versus decentralized: I think that's the wrong way of thinking about it. I think it's business as usual versus exploration. Whatever is keeping the lights on for business intelligence, and for the decisions that need to be made on a regular basis in a business, is one path.

And those should be clean. They should be actively managed and governed and secure. That is exactly how you deliver on the basics. You should not have your metric of, say, daily active users, if that's something you use to run your business, live in a Jupyter notebook that a data scientist is running. That should be in the governed

path, if it's something that is used to run your business. Now, the exploratory piece is separate, and it's different: the ability to explore data, and not just get that canned report or those canonical metrics that are important to your business. That's where it's different.

And whether you're pulling that out of a warehouse or out of the source applications is gonna depend on the business. But functionally, it's not centralization versus decentralization. It's: what is needed to run this business? What reports, metrics, KPIs do we need to know every week, every day, every whatever?

And that should be served through some kind of reporting layer and made easily available to all the relevant stakeholders. The data science stuff is more exploratory, and it depends on where in the process they're picking data off. They can pick it off of the source apps, they can pick it off of the warehouse,

**Loris Marini:** I don't really know.

**Lauren Balik:** or they can do other stuff with it. That is a function of every environment being different. If you've done the warehouse correctly, they should be able to pick it off the warehouse and then go play around with it however they like. If you haven't done the warehouse correctly, what they're gonna do is they're gonna start there.

They're gonna see there's a bunch of silliness happening, and then they're gonna go back to the source applications and pull it off of there. And then you end up in this situation where the two roads diverge, and they get further and further apart the more complex you try to make it. It's not an issue of centralization versus decentralization.

It's an issue of: do you have the ability to run a weekly business review off of a set of canonical reports, for the executive team, and at a sub-level for the marketing team, the sales team, whatever team, breaking it down level by level? If those are all correct, then the data scientists should be able to pick off of those.

If not, then what they're gonna do is mash a bunch of stuff together, some from source apps, some from the warehouse, throw it all together in some data frame in a Jupyter notebook, and then their credibility is at risk at that point. It's not their fault, you know. I've seen this happen 150 times.
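
To make the "governed path" concrete: a canonical metric like daily active users would live as a managed, versioned warehouse model rather than in someone's notebook. Below is a minimal sketch; the `raw.app_events` table and its columns are assumptions for illustration, not anything from the episode.

```sql
-- Hypothetical governed model for a canonical metric: daily active users.
-- raw.app_events, user_id, and event_ts are illustrative names.
CREATE OR REPLACE VIEW analytics.daily_active_users AS
SELECT
    CAST(event_ts AS DATE)  AS activity_date,
    COUNT(DISTINCT user_id) AS daily_active_users
FROM raw.app_events
GROUP BY CAST(event_ts AS DATE);
```

The point is not the SQL itself but where it lives: governed and served to everyone, so a data scientist exploring in a sandbox and the weekly business review read the same number.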

**Loris Marini:** Yeah. So on that topology: are you for or against the idea of keeping a continuous closed loop that synchronizes what's happening in the apps? Because we can't ask our finance people or our marketing team to work inside the warehouse and contribute directly. It's not their domain.

That sits typically within engineering. So they use their app, Salesforce or whatever. You do your thing in your app, and then all of a sudden you say, oh, it would be nice if we segmented our customer base, so you go to the data scientist and say, hey, can you build a model that segments our customers?

And then we're like, okay, what do you wanna achieve? Okay, we want to increase the click rate for this particular offer. Okay, so we're gonna look at that demographic. And that's where you start pulling different tables, joining them, and bringing in features. Then you dump it into a model, and there's some exploration.

You realize, through decomposition, that five of those columns really matter; they are strong indicators of the final probability of actually clicking. And so the data scientists go, look, we've done testing with train, test, and validation splits. We're pretty confident we got an 85% probability of actually guessing right.

Are you happy with it? Yeah? Good, go. Now, how do I consume that? How can you bring that knowledge, the result of an analytics pipeline, back into Salesforce or whatever the marketer uses? Because that's what they want. They want a tag that allows them to know: should I include this person in this email automation or not?

So where I think reverse ETL is useful is that it solves the problem of getting that information flow up and running as quickly as possible, going from the warehouse back into the app. We need that, right? We need the ability to write back into the app. Otherwise we'll have to train all the marketers to learn SQL, come to us, consume it directly, and do some matching in Excel, which is not advisable. It's just a mess, because things change. Models change, models drift. So it seems to me that one would have to keep a loop that's constantly synchronized: whatever's happening here, whatever's happening there, knowing that between here and there I'll lose concepts, because someone else might be over there instead of here.

The data scientist can be consuming this stuff, putting it in a sandbox, and feeding it back into the app. So it becomes almost a multi-tier problem. We have the fundamentals: the truth, the canonical metrics as you call them, and the fundamental entities. Yes. But we also have the experimental stuff.

We have consumers and producers that sit at different tables, speak different languages, and are more or less comfortable with writing code or reading SQL. Some just love their app; they live inside the app and they don't wanna know about what's in the data science domain.

And so combining it all together into some sort of system that can be audited and doesn't cost a fortune in cloud bills, that's a tricky one.
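
One way to picture the write-back loop Loris describes: reverse-ETL jobs typically just read a warehouse table or view and push each row into the app through its API. Here is a minimal sketch of such a view, assuming hypothetical `analytics.customers` and `analytics.model_scores` tables and an illustrative 0.85 score threshold:

```sql
-- Hypothetical view feeding a reverse-ETL sync into Salesforce.
-- Table names, column names, and the threshold are illustrative.
CREATE OR REPLACE VIEW analytics.sfdc_contact_tags AS
SELECT
    c.salesforce_contact_id,
    s.propensity_score,
    CASE
        WHEN s.propensity_score >= 0.85 THEN 'include_in_automation'
        ELSE 'exclude'
    END AS email_automation_tag
FROM analytics.customers c
JOIN analytics.model_scores s
  ON s.customer_id = c.customer_id;
```

A sync tool, or an in-house job, would then map `salesforce_contact_id` to the Contact record and write the tag field, so the marketer sees the model's output inside the app they already live in.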

**Lauren Balik:** Yeah, no, you're correct. And, you know, I don't really care about reverse ETL positively or negatively. I don't think they should be their own companies; I don't think that's necessary, but that's a different conversation. There's plenty of organizations out there that already do the ingest and the reverse ETL piece, or whatever you wanna call it.

**Loris Marini:** So you would do it internally, as part of the engineering function?

**Lauren Balik:** The engineering function, or you get a Keboola, a Rivery, any number of other companies out there that do the ingest and

**Loris Marini:** Mm-hmm.

**Lauren Balik:** reverse ETL. But fundamentally, all of this is a question of integration. A lot of people don't know this, or don't think about it even if they work on it, but a SQL join is an integration between platforms.

So if you have Salesforce coming in, and blah-blah-blah system, whatever, coming in, you could integrate them together, drop the fields from this one in here, and enrich that. That's what we were doing a couple of years ago; iPaaS is the middleware category that was called this. Zapier still does it, Workato, Tray.

There's a number of solutions that do it. That's app to app. What we've decided is we're gonna go app to warehouse, and instead of integrating from app to app, we're gonna join it in SQL. There's also the other way of doing it, which I guess is ETL: if you have all of the metadata from one system and the metadata from another, you can join it in a data frame and then put that in the warehouse. There's a number of vendors out there.

There's Syncy, there's Matillion, there's a bunch out there that you can use to join the metadata and then put it in. So the whole ball game here is just where integrations are happening and

**Loris Marini:** Yeah, yeah.

**Lauren Balik:** where this all takes place. And right now we're calling this transformation, and we're calling this SQL joins, and that's valuable. But I don't know, maybe it's not correct in a lot of cases; I don't think it is. I think this puts too much pressure on the data team and makes the data team integrators in the warehouse, and this leads to the data team not building dashboards, not driving value.

They're just joining SQL together in the warehouse.
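
Lauren's point that "a SQL join is an integration between platforms" is easy to see in code. The sketch below does the same field mapping an iPaaS would do app-to-app, only in the warehouse; all table and column names are hypothetical.

```sql
-- App-to-app enrichment expressed as a warehouse join.
-- raw.salesforce_accounts and raw.product_usage are illustrative names.
SELECT
    a.account_id,
    a.account_name,          -- from the Salesforce extract
    u.monthly_active_seats,  -- from the product's event system
    u.last_login_at
FROM raw.salesforce_accounts a
LEFT JOIN raw.product_usage u
  ON u.crm_account_id = a.account_id;  -- the join key is the integration
```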

And this leads to BI, which can be good, it can be bad, I don't know. It depends on the org, depends on a number of factors. But data,

**Loris Marini:** If it's well managed, yeah. It has to be.

**Lauren Balik:** Yeah, the data team these days has become less and less about driving insights and driving value, and they've gotten further away from a P&L.

What they're doing is they're just integrators using SQL in the warehouse, and I think that's too expensive. Most companies have been able to offset how expensive this is just because of interest rates and the fact that money's been very cheap. And I think this is gonna start moving down again.

And I think the apps are gonna become more powerful.

**Loris Marini:** Yeah, I definitely share that hope with you, Lauren. We need to get back to a state where we can deliver some sort of value for the business, for our stakeholders, instead of just mashing SQL together to integrate stuff, because that's low-level work.

It's not strategic at all. And then a lot of folks wonder why it is that we don't incorporate design thinking or UX principles into our data products. We talk about data products, but really, we don't have data products. We have just a lot of SQL, and it's unclear who exactly it is serving.

We might sometimes get to the semblance of a data product temporarily. Someone

**Lauren Balik:** Right, right.

**Loris Marini:** And we are able to build it and ship it, right? So we have a data product right now, but it's unclear what's gonna happen to it, it's unclear how much it's gonna be used, it's unclear how much value it's driving, how useful it is.

So there's the whole idea of the full life cycle to manage products, which we know how to do. There are a lot of companies that just do production; they sell physical goods. The world is full of those. Otherwise we wouldn't be here; I wouldn't be staring at this camera, this laptop, right?

So production companies have factories that run around the clock. They build things, physical things that we buy. It's like an operating system: they know there are parts coming in, they know each part has a label, they know the specs of those parts, and they can do QA on them.

We should probably take a step back and look at how an actual factory works and how we can incorporate those principles into the data team, not just the data warehouse. Treat it as a factory that has inputs and outputs and relationships, internal and external. It's like a system; if you think about it, it's almost like a startup, a company within the company, as a function. It's got a budget, it's got requirements, it's got customers, internal and external.
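
Carrying the factory analogy into practice, "QA on incoming parts" can be as simple as a couple of scheduled assertions against each raw feed. A minimal sketch, assuming a hypothetical `raw.orders` feed with `loaded_at`, `order_id`, and `order_amount` columns:

```sql
-- Hypothetical incoming-parts inspection for a raw feed; a scheduler
-- or a dbt-style test would run these and alert when they fail.

-- Spec 1: the feed actually landed today.
SELECT COUNT(*) AS rows_landed_today
FROM raw.orders
WHERE loaded_at >= CURRENT_DATE;

-- Spec 2: no part is missing its label or its spec.
SELECT COUNT(*) AS defective_rows
FROM raw.orders
WHERE order_id IS NULL
   OR order_amount IS NULL;
```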

**Lauren Balik:** So good. A lot of people say, treat data as a product, treat your data team as a product team. And like I've said before, treat it like a business, because it is a business. Data has inherent value when you move it. You can read Doug Laney's Infonomics and all the other stuff about monetizing your data.

But data is a business. When you move data into a system, that is value, it's an asset. And when you turn it into something else, that has value too. And like any business, you want your inputs to be cheaper than your outputs.

**Loris Marini:** Yeah. Your margin.

**Lauren Balik:** Right? You wanna get margin. Your data team needs to have margin. If you're just sitting here making SQL files, you're getting negative margin out of that.

**Loris Marini:** Yeah. Getting margin means building things that people wanna use. Whether the transaction involves actual money, Bitcoin or crypto, or just time, right? We're demanding time. People say, hey, use my product because it's gonna solve your problem, and out of the billion options they could use, the many Excel spreadsheets lying on their hard drive, they decide to use your product.

They do it, and they come back because it adds value. They save time, they increase the click rate or whatever, right? There's some sort of value. So we should be able to track that and go super lean. Instead, what happens is there's a lot of pressure from the business, you cook something together, nobody cares about tech debt, and you spend time writing SQL that goes nowhere.

Or maybe it ends up in some dashboards where we don't even know who uses them, how often, or whether they actually mean anything. Things shift, people leave. It's just a big mess.

**Lauren Balik:** Do a layoff of your data team: lay off some dashboards, lay off some SQL files. You have a thousand dashboards? Get rid of half of them. You have this many SQL files? Just lay 'em off. Every data team out there should be laying off work that they, you know,

**Loris Marini:** To see if someone complains.

If you hear Slack going off, it's like, okay, maybe that was useful after all. We can recover it and bring it back.
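
One way to pick which dashboards to "lay off" before waiting for Slack is to query a usage log. The sketch below assumes the BI tool's access log has been landed in the warehouse as hypothetical `bi.dashboards` and `bi.dashboard_views` tables, using Snowflake-style date syntax; most BI tools and warehouses expose an equivalent log.

```sql
-- Hypothetical layoff list: dashboards nobody has opened in 90 days.
SELECT
    d.dashboard_id,
    d.dashboard_name,
    MAX(v.viewed_at) AS last_viewed_at
FROM bi.dashboards d
LEFT JOIN bi.dashboard_views v
  ON v.dashboard_id = d.dashboard_id
GROUP BY d.dashboard_id, d.dashboard_name
HAVING MAX(v.viewed_at) < DATEADD(day, -90, CURRENT_DATE)
    OR MAX(v.viewed_at) IS NULL;   -- never viewed at all
```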

Yeah. No. Cool. Lauren, I'm just conscious of your time. I think I could go on, but we covered all the important points that we brainstormed together. Is there anything else you want to add? I'm super open to it. Otherwise we can call it a day.

**Lauren Balik:** Yeah, no, thanks for having me. This was fun. It's good to noodle on things with some folks out there. I'm known for being ten-out-of-ten chili pepper levels of spicy; I think we kept it at a two or a three here today. Yeah, no, I had a good time, and thanks again.

**Loris Marini:** Yeah, likewise. And maybe just a word for the engineers that have listened to us: whether this conversation is gonna be useful for you tomorrow when you show up at work again or not, that remains to be seen. But I just hope that you're not gonna give up, because this is an important moment in the industry.

We are going through this transformation, this path towards maturity. We are gonna get there, and we are going to find the right way of doing this type of work: one that puts everybody on the same page, balances the pros and cons, and hopefully drives more and more value, so we can stop worrying about "am I gonna get fired" and actually start talking to people, enjoying solving problems, and advancing the business, because that feels good.

And I think that's it for me. Lauren, thank you very much. One last thing: what's the best place to follow you?

**Lauren Balik:** Oh, LinkedIn, Lauren Balik. I'm the only one in the world as far as I know, so you can find me there. There's my website, Upright Analytics, but I barely use it. Yeah, that's where I'm at.

**Loris Marini:** Cool. Thank you again, Lauren, and let's catch up soon. See you on LinkedIn.

**Lauren Balik:** Yeah. Take care. Bye.
