Episode:
27

The informed company with Matt David

Loris Marini - Podcast Host Discovering Data

As tools get better in the modern data stack, business impact is increasingly dependent on effective communication, collaboration and governance. What does it take to do this well? Join me with Matt David, Product Marketing Manager (Data) at Atlassian.


Join the list

Join hundreds of practitioners and leaders like you with episode insights straight in your inbox.


Want to tell your data story?


Check out our brands or sponsors page to see if you are a match. We publish conversations with industry leaders to help data practitioners maximise the impact of their work.

Why this episode

Today we look at how to maximise the business impact of the modern data stack with my guest Matt David, Product Marketing Manager (Data) at Atlassian. We talk about the process of creating data products, architectural and transformational patterns, the process of implementing and maintaining a source of truth, challenges in sharing clean data with the rest of the organisation, the role of data literacy, and other topics. If you want to dive deeper, check out Matt's new book co-authored with Dave Fowler: "The Informed Company: How to Build Modern Agile Data Stacks that Drive Winning Insights", available on Amazon. You can follow Matt on LinkedIn.

📣  Enrolment open: data lineage crash course with Irina Steenbeek

Do you want to understand the business, secure funding for your data lineage initiative, scope and plan the work well, and deliver business outcomes? Are you a data management professional, a business leader, or project manager? If the answer is yes, keep reading!

Discovering Data has partnered with data lineage black belt Irina Steenbeek to bring you an EXCLUSIVE 2h crash course titled "How to build a successful data lineage business case." In this course you will learn how to identify your company's needs in data lineage, scope your initiative, secure the support of key stakeholders, define the key roles and their accountability, prepare requirements, choose the right approach to documentation, select appropriate tooling, and much more. At the end of the course you will get templates and cheat sheets to apply this knowledge immediately at work. You will be able to confidently present the initiative to your key stakeholders, assess the readiness of your organization, scope the data lineage initiative, prepare requirements, and choose an appropriate execution approach and methods.

You can ENROL NOW HERE.

What I learned


Want to see the show grow?

Your ideas help us create useful and relevant content. Send a private message or rate the show on Apple Podcasts or Spotify!

Episode transcripts

Loris Marini: If there is one thing we all agree on in data, hopefully it's that we have a lot of terminology. Many of those terms are defined in different ways and mean different things depending on who you talk to. If you pair that with the rate of evolution of the technology itself, it almost becomes an explosive mix where it's hard to even reason, share ideas, or learn different points of view, and therefore it's hard to innovate. There are many books I've read recently on the topic of clarifying this data mess we're living in at the moment. One that really caught my attention is The Informed Company: How to Build Modern Agile Data Stacks that Drive Winning Insights.

We'll dive into the book in detail today with my guest. Before we do that, I want to give you a quick overview of what we're going to be talking about. The focus of this one is going to be on the evolution of the data stack and how better tools highlight the need for thinking rigorously. My guest today is Matt David. Matt has 10 years of industry experience using data and is currently the senior product marketing manager for data at Atlassian. Previously, he worked at Chartio as head of data, and at Udacity as product lead for the School of Data Science.

Data has become a prerequisite skill set for more and more non-data jobs as well. Matt is really passionate about making data concepts more easily understood to increase data literacy for everyone. We couldn't be more aligned in terms of our missions, especially with the mission of Discovering Data. Today, we will talk about the process of creating data products, the difference between data lakes, data warehouses, and data marts, and why they're all needed. It's not one versus the other. Why source data must be copied at least once; this is not something the engineers are particularly keen on doing. Transformational patterns, ETL and ELT, the problem of data security, access control, maintaining a source of truth. Some of the challenges in sharing clean data with the rest of the organization, and last but not least: data literacy and a bunch of other topics.

I'm really, really excited to have you on the show, Matt, thanks for taking the time. Thanks for being with me.

Matt David: Thanks for having me.

Loris Marini: Let's start with the term that I found in your own book. I found it really interesting: source hell. What’s source hell and why do we need to worry about it?

Matt David: Yeah. Initially, it was actually sort of a positive development, which is that almost every tool that you use nowadays spins off data. There are so many sources that you can potentially mine for insights, but when you're a startup, a new company, trying to wrangle that amount of data is super hard. Oftentimes, you have things that are out of date. You don't know how to model the data. You're evaluating the data in all these different places, and it's just a real challenge to try to build knowledge on top of all of this potential data that you have access to.

Loris Marini: Yeah. I was thinking about it this morning, how to tackle this. I want to play a game where I'm a skeptic. I'm going to keep asking you the tough questions, being a data illiterate who knows nothing about replication and access control. This might sound like a stupid question, but why can't we do the same as we do with our Google Docs and our SharePoint? Why can't data just live in one single place where we share a link with the people that need it, and we're done with it? Why do we need to integrate that data into a single place?

Matt David: Data inherently is generated from multiple places. It could be from Zendesk, your production database, Salesforce, Google Analytics. All these places have super important information that is very valuable to evaluate on its own. Google Analytics, for instance, has plenty of visualization capabilities built right in, but there's huge insight potential in combining those data sets together.

Knowing which pages you have on your website that actually drive marketing qualified leads that convert, and maybe do some important action in your app, that whole story is spread across multiple sources. It forces you to combine. It's also important to note that with your production database, if you're just running queries right on top of it, you're potentially impacting the performance of your application. Even that needs to be abstracted away so that your queries are only affecting your analytics, not your app performance.

Loris Marini: Right. We keep talking about the evolution of technology. This is still part of the game where I'm the outsider of the conversation. We keep talking about the technologies that evolve and are so powerful. Why do we need to worry about performance? Can’t we just keep crunching and sending queries? Do we really need to replicate?

Matt David: Yeah, the big developments in the space and the dawn of the modern data stack started at least 10 or 15 years ago at this point, with columnar databases. These are databases that are better at doing aggregations at the column level, which is a typical analytic function, whereas most applications run on a row or transactional model, which is very good at finding and updating individual rows. We unlocked a ton of analytic performance by making that architecture change. There are several other reasons as well, but that's the core concept. So that kicked things off.
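A rough sketch of the row-versus-column idea Matt describes, in plain Python (the table and numbers are invented for illustration):

```python
# Hypothetical "orders" data, once as rows and once as columns.
rows = [
    {"id": 1, "region": "EU", "amount": 120.0},
    {"id": 2, "region": "US", "amount": 80.0},
    {"id": 3, "region": "EU", "amount": 200.0},
]

# Row store: good at fetching or updating one whole record at a time,
# the typical application workload.
def get_order(order_id):
    return next(r for r in rows if r["id"] == order_id)

# Column store: each field lives in its own contiguous array, so an
# aggregation scans only the one column it needs.
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 80.0, 200.0],
}

def total_amount():
    # Touches only the "amount" column, never the other fields.
    return sum(columns["amount"])

print(get_order(2)["region"])  # US
print(total_amount())          # 400.0
```

Real columnar warehouses add compression and vectorized execution on top, but the layout difference is the core of the speedup.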

You then had these sets of ELT tools that made it very easy to grab data from the source hell we were talking about and bring it to that warehouse. More recently you have DBT, which is now dramatically lowering the bar on modeling that data and combining it together.

All those forces are making it that much easier to work with all the data together. In terms of why we can't just query our application database: you can. People do. Engineers do it all the time to check things. But when you're doing analytical queries, you are potentially crunching quite a large amount of data. I don't think it's appreciated how much computation is being done. Even on a fairly well-tuned data warehouse, certain queries that you would think are typical will take seconds, which is meaningful.

In an application database, anything that's consuming compute could potentially affect things. Not always, but when you're doing these analytical queries, there's the chance that the database could hang on them or they could just suck up too much compute.

Loris Marini: Yeah, we’re really talking about the scooter and the truck analogy in your book, which I loved. I've been looking for a simple way to explain the difference between transactional and analytical storage. The one that you guys picked is really, really effective. Everybody knows the agility of a motorbike or a scooter can’t compare to the sheer power of a large truck.

It's really about what you need to do. For an application, response time is really important. People get bored if they wait more than 200 milliseconds. You're going to get into trouble if you're an analyst querying a Postgres production database, and you're looking at the guy that did it: I can testify that you're going to make some enemies, because people rely on that thing being live and performant.

User experience depends on it. If you start querying, especially if you don't know what you're doing, like me when I started, and you put a star after a select, things get really, really tricky.

Matt David: Yeah. It's maybe one of those under-appreciated things where writing a very simple query could hit an enormous amount of your schema and require the database to process all that information. As you're saying, if you just do a select star from your transactions table, or some table that has an enormous number of rows, it takes real time. It's not a Google search. It's something more intense than that.

Loris Marini: Few people have the fortune of starting from scratch on a greenfield project, but imagine being tasked with this: the goal of moving all this data into a centralized place, because someone high up understood that there is an impact and it's imperative to move it and put it somewhere else. We're not just copying it because we don't want to deal with the engineers or the devs. We are copying it because we needed a truck and they have scooters.

Now that we did that, where do we start? I’m asking you this question because I have been one of the victims falling into the trap of, “this, not that.” House, lakehouse, lake, house on a lake, the lake goes to a house. It’s a bit all over the place. Give me some clarity, please. What's going on?

Matt David: Yes, this is the exact problem. It was largely the inspiration for the book: the vast majority of our customers at Chartio didn't know what to do in terms of taking those first steps towards better modeling their data. In the past, there was this tome in the space by Ralph Kimball, The Data Warehouse Toolkit. It's hundreds of pages long, super dense, and very specific about how to model data.

A lot of the reasons why the book was written the way it was were the constraints on the databases of the time. It was written pre-columnar data warehouses. Given all this new power, our two cents to add to the conversation was: all of his stuff still holds in a lot of ways, but you now have so much power. Now you have tools like DBT, which make modeling the data much, much easier and much more accessible.

Most of the issues that people have with data, at least initially, are just around usability. When engineers initially code field names and populate them with data, or when you're pulling data from all these applications using a tool like Fivetran, the field names aren't necessarily written for you to do analytics on. Sometimes you'll get these archaic-looking things or cryptic codings. A lot of times, fields like license type or user type will be in numbers. There'll be a one, two, three, or four. I don't know what those are as somebody doing the analytics.

There are some simple things you can do with modeling where you can just make the data readable. Honestly, that gets you pretty far just in terms of being able to unlock the data and make it more accessible to the organization. If you really want to constrain how the data is analyzed, there's some pretty typical stuff around star schemas and wide tables and that sort of thing.
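The kind of readability cleanup Matt describes can be sketched as a SQL view; here it is against an in-memory SQLite database (the table, field names, and type codes are all hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Raw table as an app engineer might have named it.
CREATE TABLE usr_acct (uid INTEGER, usr_tp INTEGER, crtd_ts TEXT);
INSERT INTO usr_acct VALUES (1, 1, '2021-01-05'), (2, 3, '2021-02-11');

-- "Layer one" cleanup: readable names, decoded type codes.
CREATE VIEW stg_users AS
SELECT
    uid AS user_id,
    CASE usr_tp
        WHEN 1 THEN 'free'
        WHEN 2 THEN 'trial'
        WHEN 3 THEN 'paid'
        ELSE 'unknown'
    END AS user_type,
    crtd_ts AS created_at
FROM usr_acct;
""")

for row in conn.execute(
    "SELECT user_id, user_type FROM stg_users ORDER BY user_id"
):
    print(row)
# (1, 'free')
# (2, 'paid')
```

Analysts query `stg_users` instead of the raw table, which is exactly the usability win with none of the heavier dimensional modeling.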

It's interesting that nowadays wide tables are seemingly just as performant. It's a controversial, debatable thing. In the past, the star schema was this gold standard, super performant. A couple of years ago, Fivetran showed that with these columnar databases, the same data turned into a wide table was essentially equally performant. The performance reasons to do a star schema are gone. If that's how you want to set it up for your company and that helps your company process information, great.

The point we were trying to make was you can just clean up the data, make it readable. It’s the 80-20 benefit initially. You can approach it with just design thinking and ask “what do my end users, my end analysts need?” What would be useful for them to have in here? What additional columns might be beneficial? We just wanted to show you that the first step is something that with just a little bit of intuition and a feedback loop, you can make progress.

Loris Marini: Yeah, we will perhaps dive into the star versus wide table debate later because I think it's really interesting. Just to stay on the surface for a little longer: the choice of a lake is not the end of the game, right? Once you collate all the data into a place, then you have to, as you said, refine it and add to it.

In your book, you make the distinction between the lake, the warehouse, and the mart. There's been a lot of confusion around the term data warehousing, because I believe historically it started out in the context of a database specifically designed to power a dashboard or some sort of analytical need. People have this feeling of stiffness about it: something that you can't change, that sits at the end of the value chain, almost at the end of the pipeline. Is that still true with the modern data stack? What's the difference? What's the big deal between lakes, warehouses, and marts?

Matt David: Yeah, those other terms came up in the middle of us writing the book. What didn't get added was the lakehouse, which Databricks is pushing pretty hard. Generally, there aren't super hard boundaries. It is more conceptual, and all of it is built on the same technology, or at least it can be.

If you have Redshift or BigQuery set up, you can replicate your production database there. You could use Fivetran or Stitch to pull in all of your other various data from across your sources there. All the data is now in Redshift or BigQuery – that is the data lake. It's just getting all of your sources in one place.

What makes DBT powerful, and why it's so exciting, is that not only does it let you clean those data sets up individually and even join that data together with SQL code, it lets you then do secondary modeling on top of the table you just made, and so on and so on. You can create layers of modeling, which is super powerful. When I was talking about just doing the cleanup: you've essentially built layer one of the data warehouse, which is any amount of modeling on the lake that makes it more usable. If you just do the cleanup, you've already started to build your data warehouse. From there, it's "Oh, well maybe I want to combine user profiles across different tools," and you can make a master user table. Whatever your business needs are, you can build up these layers; the term is directed acyclic graph, or DAG. It's what makes DBT so powerful. You can build your data warehouse in layers and make this new, cleaned-up, tuned-to-your-business version of the data. That's your data warehouse.
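The layering Matt describes, models built on top of models, can be sketched with plain SQL views over an in-memory SQLite database (dbt materializes essentially this kind of dependency chain; all names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_events (user_id INTEGER, evt TEXT);
INSERT INTO raw_events VALUES (1, 'login'), (1, 'purchase'), (2, 'login');

-- Layer 1: clean the raw source.
CREATE VIEW stg_events AS
SELECT user_id, evt AS event_name FROM raw_events;

-- Layer 2: a model built on top of layer 1, not on the raw table.
CREATE VIEW user_activity AS
SELECT user_id, COUNT(*) AS event_count
FROM stg_events
GROUP BY user_id;
""")

print(conn.execute(
    "SELECT * FROM user_activity ORDER BY user_id"
).fetchall())
# [(1, 2), (2, 1)]
```

Each view only references the layer beneath it, so a fix in `stg_events` flows through to everything downstream, which is the whole point of building the warehouse in layers.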

Data marts are a little bit of a blurry line, but it's essentially doing additional modeling and oftentimes scoping down the amount of data available for a specific use case. If you want to make a data mart for marketing, maybe you do some extra calculations about ROAS or some other metric that they care about, and you don't have all these other tables related to HR or whatever. You can give them a space to query from that is tailored to them.

Those are the traditional distinctions. The lakehouse is essentially trying to argue for some space in between the lake and the warehouse, obviously with the name. I would talk with somebody from Databricks to really give you the lowdown. It's similar in aspiration: we can deliver more by doing less on the modeling.

Loris Marini: Yeah. You mentioned directed acyclic graphs. For those that have never heard of it, I know it sounds like medicine. It might feel like a weird term, but there are three words. Directed means things flow in one direction, from left to right or, depending on the convention, from one side to the other. Acyclic means that you don't have cycles: you don't want to end up where the output feeds back into the input and you no longer know whether you're looking at an output or an input, because that's the realm of dynamical systems. For the folks that have a background in physics, that's how complicated it gets. You don't want cycles.

Directed, acyclic, and then graph, because it's a graph. Let's now dive a bit more into the choice of tooling, and therefore the architecture. Before the tooling, we talked about the necessity of bringing data together into a lake, or a place. We then need to clean it up, apply business logic, transform it, and serve different customers within the organization. Now, how do we go about choosing tooling? What do we need to look for?
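The directed acyclic graph idea above can be sketched with Python's standard library, which also rejects cycles, exactly the property Loris describes (the model names are hypothetical):

```python
from graphlib import TopologicalSorter, CycleError

# Each model lists the models it depends on (invented names).
dag = {
    "stg_users": {"raw_users"},
    "stg_orders": {"raw_orders"},
    "user_orders": {"stg_users", "stg_orders"},
    "marketing_mart": {"user_orders"},
}

# A valid build order: every model runs after its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)

# Adding a cycle (an output feeding back into an input) is rejected.
dag["raw_users"] = {"marketing_mart"}
try:
    list(TopologicalSorter(dag).static_order())
except CycleError:
    print("cycle detected")
```

This is the same ordering problem a tool like DBT solves when it decides which models to build first.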

Matt David: Yeah. The tooling has just evolved so fast in the past couple of years, honestly. It's tough to give a single tool recommendation right now just because they've changed a lot and they've all been upgrading a lot. DBT is probably the one you're going to end up using. Depending on how much you like DBT, several of these tools have recently added a bunch of different integrations to it that might weigh on your decision.

In terms of the warehouse, the biggest distinctions are how the pricing works. Some are modeled on storage and some are modeled more on compute. BigQuery – every time you run a query, that's when they charge you. Whereas Redshift charges you monthly. Some people care more about predictability than others. I think people have done the math and on average they end up being about the same a month so that might be a wash.

There's also the notion of how much you want to be able to tweak your data warehouse. This is why Snowflake is such a big deal as they manage a lot of the stuff for you. If that's of concern, if you don't have that many resources, you might want to pay attention to which data warehouse provides more optimization stuff automatically for you.

On the ELT tools, Fivetran has really been a rocket ship for a while. They seem a clear winner. You have Airbyte, which has gotten an insane amount of funding; they're an open-source approach. That's the thing on each part of the stack, other than DBT's part: there have been so many new entrants and so many different visions for how it could play out.

It's tough to give advice like "Here's what you should get." It keeps changing. Generally, the performance is similar. The biggest differences are the pricing model and the amount of automatic management by the tool: if you're a small team, probably skew to more managed stuff. If you have a big team and you need really custom stuff, skew the other way.

It'd be great if one of those were way more performant, but they're all pretty close these days. Maybe the level of integration, again: what other tools are you already using? Does that matter to you? Do you want to switch to other tools?

Loris Marini: Is Airbyte the company using Singer, the open-source protocol, to create the "taps" that interface any source with any sink? I should have checked before asking you.

Matt David: I'm pretty sure they're not, but I don't know a ton about them other than they've raised some ungodly amount of money and are open source first.

Loris Marini: One thing we didn't mention is the separation of storage and compute. At the architectural level it may or may not impact costs, but it's certainly something that has changed a lot compared to technologies like Postgres or conventional transactional databases.

Along with the columnar approach, as opposed to the row-centric approach, these new technologies are architected in a way that you pay for storage and compute separately. If you're not touching the data, you just pay for storage, and storage has gotten extremely cheap. Compute, obviously, is the stuff that matters.

We talked about the Physics of Information: Part 1 on this podcast. If you haven't listened to that, I'm talking to you, the listener, if you want to brush up on physics now. I'm kidding. It is a lot about physics. It's interesting to see why we fundamentally need to spend money to process information. Is that because we're not good enough at building cool machines? Could we imagine a laptop that consumes zero power? It turns out there's a lot of physics behind it. The answer is that there are quite fundamental laws of physics that prevent us from having a free lunch, or from keeping our food in the fridge cold without paying the bill, and all sorts of things that relate to information. Expect different bills; that's the conclusion of this short rant on the physics of storage and compute.

Matt David: I really don't know a lot about physics, but it is interesting to me that you can calculate theoretical latency by measuring the distance to data centers and seeing how fast light could travel there and back. What's weird is that, yeah, all this technology stuff does boil down to just hardcore physics in the end.

Loris Marini: Yeah. I don't know if this is the audience, but maybe I'll share this short story. I remember studying fiber optics, cables, and different types of cables. One of the first things you learn in physics, Light 101, is that light travels at different speeds depending on the medium. In a vacuum, in the deep universe, the speed of light is constant, and we believe it's the highest speed possible. Nothing can go faster than that. You get that speed no matter what; it's guaranteed. As light travels through glass, plastic, water, or other materials, it slows down because it has to interact with the material itself.

You can think about it almost like resistance to light propagation. Fiber optics are made of glass. The internet relies on glass: essentially 95% of the traffic goes through fiber optics. At some point, people started thinking, "Can we make a fiber that guides light without glass?" It turns out it's possible: they're called hollow fibers, and there's a whole bunch of cool physics around them.

How did they design the symmetry of it? Most of the light travels inside air, essentially, or gas, which is even better because with gas you can engineer the chemical properties. It all started because of people doing financial transactions, where a millisecond costs millions of dollars; they really want to get there faster. At some point, someone decided to deploy an entire network of fiber optic cables designed to be extremely fast, as fast as you can possibly get. A bit of random physics, but I love this stuff.

Let's talk about the organizational side of all this technology. We've deployed systems, we've signed partnerships, we've got a cool modern data stack, which means one person can do the job of five, and now they can write SQL and use it to ingest, clean, enrich, and serve new data models. Let's say for a second that the technology part of the problem is solved, and it's a big assumption. What stands in the way of using these assets to do valuable stuff for the business and the organization?

Matt David: Yeah. There are a few things going on. I'm super positive about this movement: it makes it more accessible for analyst types to contribute to the data model. You can, again, build the layers and iterate your way to a better data model. But some things start to get highlighted as new challenges, like testing.

DBT has testing, but you have to write the tests all yourself. Initially this may be manageable; at scale, it falls apart fairly quickly. There are interesting new tools out now, though. I'm not sure if you're familiar with Datafold, but they embed directly in your pull request and show you not only how your code is going to impact the table that you're directly updating, but also how it's going to impact your whole DAG, all the way to your dashboards. I'm excited that the tool is really making it easier for analytics engineers to not accidentally break things, which is great. The other thing that's also being worked on a lot right now is data observability.

Loris Marini: What is it about? Give me a little pitch.

Matt David: For various reasons, data can become incomplete or incorrect, and you need to figure out not only that it's happening but where it's happening. The biggest players in the space are Monte Carlo, who I believe coined the category, and Bigeye. They monitor all your data, and if something looks wrong, they tell you, "Hey, this looks wrong." You can then go fix it.

What I think is interesting about the previous company, Datafold, is that it prevents you from making mistakes. But there are other reasons why there might be mistakes, where you need something watching the pipeline. One potential reason is that an engineer upstream of you updates some event, column name, or something. It wasn't anything you did, but it cascades and breaks a bunch of stuff. You need ways to find out that that happened. Maybe there was a bug or an outage or something.

You need to be able to figure out, "Hey, something's broken." The combo of those two tools gives you visibility if anything happens, and also gives you foresight: on every pull request, you know what you're doing, and if other people are doing stuff, I'm going to get alerted. Those two things solve the two biggest issues that analytics engineering was having, which was, "Hey, I'm an analytics engineer. I'm new. This is a new field. I start making stuff. I start breaking stuff."
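A toy version of the pipeline monitoring described above: compare today's row count against recent history and flag outliers (the thresholds and counts are made up):

```python
def looks_anomalous(history, today, tolerance=0.5):
    """Flag today's row count if it deviates from the recent
    average by more than the given fraction."""
    baseline = sum(history) / len(history)
    return abs(today - baseline) > tolerance * baseline

daily_row_counts = [1000, 1040, 980, 1010]  # hypothetical history

print(looks_anomalous(daily_row_counts, 1020))  # False: a normal day
print(looks_anomalous(daily_row_counts, 120))   # True: pipeline likely broke
```

Observability products run far more sophisticated checks (freshness, schema drift, distribution shifts) across every table, but the underlying idea is this kind of baseline comparison.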

The other exciting thing to me is that now that we have these tools like DBT to improve data quality, more stuff is getting built on top of the data. We're hearing more talk about data products instead of just dashboards. We're hearing about more ML and AI models built directly on top of your data warehouse. We're hearing about what's commonly called reverse ETL, where tools like Census take data from your data warehouse and pump it all the way back into the original sources that we were talking about.

Loris Marini: So we broke the acyclical part of DAGs, right?

Matt David: Well, as more things rely on not only the data but on the DAG, the demand for data quality is going way, way up. That's why we're seeing data observability. Now we're also seeing proactive data quality companies such as Data Fold come into the mix.

Loris Marini: Yeah. It must be a nightmare for anyone accountable for the quality of the data. Now we've democratized access to it and lowered the technical barriers, but we're struggling to think through what we're doing, because the impact of what we're doing falls on everyone that consumes it.

It could be people, an application, bots, or models, and some of those models might be fairly complicated and not easy to understand. There are a lot of assumptions when we build a model. There's a lot that we as data scientists bake into the model, assumptions and domain knowledge, and someone in the analytics engineering role might not be aware of how even a small change in the data, like a WHERE clause in a SQL statement, could exclude some of the data points that should go into the algorithm because they are essential to compensate for things like bias.

It's very easy to not see that, because as an analyst you just think, "Oh, I'm going to just make the data set smaller." Yes, but if you are inadvertently and selectively removing things from the data, you're essentially changing the distribution of the data, which is going to impact the performance of the algorithm.
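A tiny illustration of the point: an innocent-looking filter can silently shift the distribution a model sees (the data is made up):

```python
# Hypothetical training rows: (age, defaulted) pairs.
rows = [(22, 1), (25, 1), (31, 0), (45, 0),
        (52, 0), (24, 1), (38, 0), (29, 1)]

def positive_rate(data):
    # Fraction of rows with a positive label.
    return sum(label for _, label in data) / len(data)

full_rate = positive_rate(rows)

# The analyst "just makes the data set smaller": WHERE age > 30.
filtered = [(age, label) for age, label in rows if age > 30]
filtered_rate = positive_rate(filtered)

print(full_rate)      # 0.5: half the original rows are positives
print(filtered_rate)  # 0.0: the filter silently removed every positive
```

Any model trained on the filtered set would never see a positive example, even though the SQL change looks completely harmless in review.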

I don't know how you see this, but I almost believe there's no way out here. There's only so much we can abstract away, only so many boxes we can put around people. We need to be more aware of the many ways we can screw things up and affect people's lives in the end. Because in principle, that data set, that model, could be used to predict whether someone should have access to a rifle or not in countries where weapons are regulated. There was a story I shared at the beginning of one of the recent episodes with Jan Ulrych from Manta. They do data lineage, and they talk about observability and lineage.

The folks in Australia are familiar with the story where essentially someone that had a history of felony violence got access to a weapon. There was a tragedy. The firearm place should not have sold a weapon to that person, but because nobody knew how the data flowed in the system and the flag was green, that person was able to buy a weapon.

It could be a weapon. It could be a plane that is taking off and gets a green signal instead of a red one. Now maybe you could say that's production stuff, that it's typically in the realm of the transactional, that the throat to choke is an engineer and there's going to be a dev responsible for it. Yes, that's true, but the principle still holds. We need to be careful.

Matt David: Yeah. A less extreme example, but think of any online marketplace, where they're constantly using machine learning to surface relevant products, or for remarketing, that sort of thing. A failure in the machine learning model could literally mean you're missing out on a ton of revenue.

There's going to be less and less tolerance for data quality incidents as more things get built on this stack. Full disclosure, I do advise Datafold some, but if you haven't checked out the data diff feature that they have, it does exactly what you're talking about, which is see all the way through the DAG.

Before you ever commit the code, they see exactly what you're impacting all the way through, and then you can just tag people in the PR and say, "Hey, is this okay that it's affecting your ML model?" It's a conversation that currently isn't even possible, so the fact that it's helping make that happen is, I think, a huge innovation for the data quality demands that are going to keep going up as more stuff gets built on the stack.

Loris Marini: I'm a bit worried about the pressure that we're putting onto analytics engineers, because data powers more and more strategic decisions and is responsible for a chunk of the revenue of the company. Those folks become a point of failure or, to use a negative term, a vulnerability for the organization.

They are not necessarily trained to evaluate the impact that data can have on the org. They might not have the domain knowledge to appreciate it, or they might simply be cut off from a level of visibility, and perhaps this is more true for vertical organizations than for startups and scale-ups. History teaches us that the horizontal model doesn't scale beyond a certain size; at some point you need to report to someone. Our brains are limited, and we can't keep track of everything at the same time. It's to be expected that people have limited visibility, and when they hit its edge they have to escalate up, so it becomes a problem. How do you tackle communication within the organization?

If your systems don't know what your people know, then we have disagreements and things break, even when the tech is right.

Matt David: Yeah. Generally, people outside of the data world don't realize how messy data can be. When there's any sort of issue, unfortunately, the person looking at the dashboard that was off or broken loses an inordinate amount of trust in the whole data enterprise.

It's super frustrating because a lot of the issues are just systemic to working with data. That being said, now that more is being built on it, quality demands are going up, and that's why you're seeing data observability become a thing. You'll see proactive data quality interventions become major themes of the stack over the next two years because this is a huge problem. These people need tooling similar to what software engineers have with observability. The whole practice of analytics engineering is still young; it needs to fully mature, and some of these tools just need to get built. Luckily they are getting built, or in some cases already have been. People just don't know to look for them yet because it's still such a new practice.

Having talked with a lot of data tooling companies, I have a lot of faith that the tools either already exist or are coming. The big issue is more that we need to know to look for them.

Loris Marini: Yeah. Do you see data literacy being taken seriously in organizations? Do leaders realize the importance of it?

Matt David: Yeah. I don't want to be cynical about it. I've taught data a lot, both at Udacity and in in-person classes. Here in San Francisco, I taught a data analytics course through General Assembly. Even the one time when everybody in the class worked at a tech company (Google, Crunchbase, all companies you've heard of), everyone struggled with learning the basics of data.

A lot of the basics are not as intuitive as we'd like them to be. At pretty much every company I've ever been at, there's been a moment where somebody from the data team says, “Hey, I'm going to put on a two-week thing about SQL and how to look at our data.” Everyone feels this pain, and maybe people over-assume that it's going to be easy to get everybody tuned up, but you're teaching people things that are not intuitive and in some cases quite counterintuitive, especially in the face of the various cognitive biases that make interpreting data difficult.

Everybody wants to do it. Everybody realizes it needs to be done. I'm not convinced everybody realizes how hard it really is. Set expectations perhaps a little lower: we are people, and every single person has cognitive biases built in that make interpreting data difficult. You're not going to eliminate cognitive bias. You can bring awareness, and people can get more adept at accounting for it, but you're not getting rid of it because it's pretty hardcoded in your brain. You need to think about ways to accommodate it or work around it.

Different fields have come up with different approaches. In academia you have peer review, someone else checking your work. That's a common way, even though it's broken.

Loris Marini: That's broken. It doesn’t really work well.

Matt David: I don't want to get too philosophical, but I think that the core problem is around epistemology, which basically means how you learn something, or how you create knowledge. What is your method for creating knowledge? You have the scientific method as a way to create knowledge.

Inside of companies you need to develop this skill. You have all of these cognitive biases fighting you at every turn. You have peer review methods, somebody checking you. In the business world or in government, this often looks like an audit, a very separate third party doing the evaluation. Some people try to avoid typical mistakes in business with things like insurance, which basically prevents you from taking certain types of actions.

The way you ultimately win this is with culture: establishing norms about not being absolutely certain, thinking in probabilities, being skeptical. The scientific method has held up because it's one of the best ways we've come up with to build knowledge. More companies should reflect on what their epistemology is and how their biases are potentially distorting it.

Loris Marini: There's a story I shared on the podcast with Alexander Schacht, The Effective Statistician. Speaking of scientific thinking and being open to uncertainty, I once reported to a CTO when I was tasked to build an algorithm. I made some predictions, went back, and presented my results. I said, “Well, that's what the algorithm predicted, and there's an error bar around it: plus or minus 15%.” He looked at me and said, “Why is there uncertainty?” How do I go about explaining uncertainty here?

It's not that I'm not sure. There is an intrinsic thing in any estimation called uncertainty, and you can never be 100% sure of what you're saying. “Yeah, I just need an answer. Is it yes or no?” “Okay, 70% yes, with this confidence interval.”

Matt David: Yeah. It's not that everybody needs to become a statistician, but just get comfortable with the fact that, as you just said, there are no certainties. You are making judgment calls, whether or not you develop explicit frameworks for them. Ultimately, in all of this decision-making stuff, you're making judgments. Even the best ML and AI models are making judgments, because you can't process all of the data. Granted, we have access to a ton more data now than we used to.

I don't know if you're familiar with Bill Inmon. He created the term “data warehouse”; he's the OG data dude. I talked to him the other day and he's totally focused on unstructured, textual data. His point was that we're only analyzing about 10% of the data inside of organizations right now, so the total amount of data you could be factoring into a decision is 10 times or so what we actually use.

You’re always making decisions with constrained amounts of data. At some point, you have to accept that and make judgments and then take responsibility and learn from those decisions and build your own guts up better.
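The confidence interval Loris mentions can be made concrete. A minimal illustration with made-up numbers (not from the episode), using only the Python standard library and a normal approximation:

```python
import statistics

# Hypothetical daily predictions from a model (illustrative data only)
predictions = [102, 98, 110, 95, 105, 99, 108, 101]

mean = statistics.mean(predictions)
# Standard error of the mean: sample stdev divided by sqrt(n)
sem = statistics.stdev(predictions) / len(predictions) ** 0.5
# 95% confidence interval half-width under a normal approximation
ci = 1.96 * sem

print(f"estimate: {mean:.1f} ± {ci:.1f}")
```

The "plus or minus" is not indecision; it is a measured property of the estimate, which is exactly the distinction Loris was trying to explain to the CTO.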

Loris Marini: Yeah. The Informed Company: How to Build Modern Agile Data Stacks that Drive Winning Insights is available on Amazon. I got my copy before the break and I really, really enjoyed it. I recommend you check it out. We are approaching the end of the allocated slot, and I don't want to take too much of your afternoon.

Perhaps the last question is if you had a completely blank canvas in front of you and you could design the top three actions you would take to really crank up data culture in an organization, where would you invest?

Matt David: Yeah. In BI right now, if you pull up everyone's website, almost everyone has some version of “modern data teams” or “collaborative data,” and everybody's asking, “Where's the Figma of data?” Figma has been an incredible growth story, but also a tool that changed how people interact with design inside of orgs.

They brought in more people by facilitating collaboration. How do we do this in data? Honestly, I don't think we've seen a great answer yet. We've seen some versions of multiplayer mode, where multiple people can type in the same SQL IDE or notebook-style interface, and we've seen comments being added.

I don't think enough thought has gone into how to bridge the gap between data people and business people. For instance, in Figma, as a non-designer I can still draw a little square and vaguely represent what I'm hoping to create. In data there's no equivalent. Why don't any of the tools support me making a crappy-looking graph of essentially what I'd like to see?

There needs to be more thought about how to help business people speak a little bit of data language. I'd love to see somebody use one of the tools and try to cross that gap. It would really unlock a lot of data literacy stuff, a lot of bringing more people into the data process. I'm excited about that.

The other big trend that is just getting started, in my mind, is Stripe Sigma. They're providing a SQL interface to your data within Stripe. What they're essentially doing is creating their own miniature modern data stack and exposing the analytics part to you. This is a huge evolution in what modern data stacks can do. To me, it signals that you're going to start getting higher quality and much, much more data from the sources themselves. It could really increase data quality, more stuff will get built on it, and quality demands will go up.

I'm excited to see where that goes. Lastly, we've already talked about this a decent amount, but analytics engineering is leveling up, getting more of the support tooling that software engineers have: stuff around observability, around proactive data quality. That's a really exciting space to me. Those are the three things I'm looking forward to seeing manifest.

Loris Marini: Yeah, I'm going to add a bit of curiosity to that. We've all been kids; obviously, there's no escape from that. One of the things you do when you're a kid is explore. You ask questions: why, why, why, why? You take nothing for granted, and even when someone gives you an explanation, your answer is often “yes, but why?”

Asking “yes, but why?” with kindness, not as “I don't trust what you're saying” but as “tell me more, because I really want to know.” It's something powerful and underestimated. There are a lot of people putting on masks not to show that vulnerability. It'd be really nice to work in a culture where not knowing is welcome, where nobody will ever think less of you if you say, “Folks, I have no idea what you're talking about. Can you give me a high-level explanation?” because you're open to learning.

Matt David: Let me extend that for a second. There's a phrase Dave, the founder of Chartio, says all the time: “What if only 10% of your company could use Google?” That's the situation with data and BI inside of companies, where there are a few barriers. Currently there's the SQL or BI interface, but beyond that there's still a huge challenge to get over: the data quality issue. Can I understand it? Is it correct? How complex is it? Those two forces are really preventing people from being able to be curious.

You can also add that there's maybe a lack of social collaboration floating around in there as well. All those things are preventing the curious from getting their answers, and those are all the things I hope get solved over the next few years so that more people can ask why.

Loris Marini: There you go. I would leave it at that. That's definitely a hope of mine too, and it's embodied in the new logo of Discovering Data. The pink smile on the side of the brain is there, I don't know if it was intuitive enough, to signify the emotional brain and all of these cognitive-emotional biases that we have, and to stimulate curiosity.

You don't have to know all the answers, especially if you're a senior, a director, a VP, where you're expected to already know everything. Well, it's okay not to know, because guess what? Nobody really knows the whole story. We all have areas of expertise, and the higher up you are, the more you can benefit from a wider understanding.

Yes to the generalists. We need specialists too, but let's not forget about the generalists. With that, I want to thank you for sharing your insights and your wisdom with me.

A reminder of The Informed Company. It's on Amazon. Get your book and join the conversation on LinkedIn. I believe you guys also have a Slack channel.

Matt David: Unfortunately, not anymore. When Chartio got acquired by Atlassian, it went away.

Loris Marini: Right. What's the best way to follow you and get in touch with you?

Matt David: LinkedIn or Twitter.

Loris Marini: Right. You and me both.

Matt David: That'd be great. Yeah.

Loris Marini: Awesome. Thanks, Matt. Thank you so much. Enjoy the rest of the evening there and that's it for now.
