Data lineage is one of the most complicated and expensive data management initiatives. What does it mean for the business? How can we set it up for success? Hear it from Irina Steenbeek.
We need data lineage because we MUST know what happens to the data as it moves through systems. However, it turns out that "data lineage" can mean many different things depending on who you talk to.
What does it mean for the business? How do you scope it well? How do you maximise the impact that lineage can have on the business?
Today I am joined by Irina Steenbeek, an absolute expert on this topic 🚀. Irina has a lot of experience managing ERP implementations. She is the founder of Data Crossroads, and her background spans civil engineering, management, consultancy, and finance. She has authored more than 60 blog posts on the topic and published 4 books: The Orange Model of Data Management, The Data Management Toolkit, The Data Management Cookbook, and her latest book, Data Lineage from a Business Perspective.
This conversation should give you a solid understanding of what data lineage is and what it isn't, the value, costs, and benefits for your organization and some ideas to talk about it with your business stakeholders.
Join me as I learn from Irina Steenbeek 😃
[00:00:00] Loris Marini: So data lineage is a complex concept and has no aligned definition in the data management community. Various stakeholders have totally different views and expectations when it comes to it. Many companies are currently faced with the necessity of implementing data lineage, but doing that takes time and a lot of money as well as knowledge on what can go wrong and the techniques and strategies needed to overcome these challenges.
I ran a poll recently on LinkedIn asking, "what would you like to know about data lineage?" and someone replied, "Out of the many aspects in data management, data quality, data governance, and data architecture, for some reason, data lineage is the hardest to communicate to the business." And it made me think about why this is and what makes data lineage so tricky to get right.
So today I am here with Irina Steenbeek, a data professional with a complex background in many areas, including civil engineering, management, consultancy, and finance. Irina has a lot of experience managing implementations of ERP (enterprise resource planning) solutions. She is the founder of Data Crossroads. Data management for Irina is her profession and her hobby. She has authored more than 60 blog posts on these topics. She published four books: The Orange Model of Data Management, The Data Management Toolkit: A step-by-step implementation guide for the pioneers of data management, The Data Management Cookbook: a pocket guide for implementation of data management, and her latest book, Data Lineage from a Business Perspective, which is the book we will talk about today.
The hope here is to leave you with a solid understanding of what data lineage is, what it isn't, its value, costs, and benefits for the organization, some tips for its execution, and perhaps a framework to communicate why this is important to your business stakeholders. So let's get to it with Irina Steenbeek.
[00:02:24] Irina Steenbeek: Hi everybody. And first of all, I would like to thank you, Loris, for this opportunity to talk to you and to all your audience, because data lineage for me is one of the most challenging topics in data management where I would really like to share with you my knowledge, experience, and challenges.
[00:02:45] Loris Marini: It's my pleasure to have you here, Irina. I wanted to start with a bit of your story because we discussed it before the show and we exchanged some notes and you have quite an interesting story. What led you to develop an interest in the topic of data lineage?
[00:03:06] Irina Steenbeek: It happened unexpectedly, to be honest.
A lot of people from financial institutions will understand the immense pressure from the various regulations: having to comply with these regulations and needing to implement data lineage. One of my colleagues contacted me to ask, "What kind of requirements do you have for data lineage?" And then he stated that everybody needs data lineage. Nobody could explain what they mean by that. And that's when I decided, "Okay, I need to understand what data lineage means."
And then I took the initiative to start developing requirements and an understanding of data lineage. So it was, you know, curiosity. I have now been involved in data lineage implementation for five years. This is also one of the reasons why I wanted to share my experience in this book, Data Lineage from a Business Perspective.
We can talk about it. And of course, we won't be able to cover everything in 45 minutes, but let's try it.
[00:04:33] Loris Marini: Yeah, of course.
So for someone that just heard the term at a conference, in a corridor, perhaps you have a bad dream, and you wake up one day and you're like, "oh my God, we need data lineage", what is data lineage and why do we need to worry about it?
[00:04:53] Irina Steenbeek: When you try to explain it in simpler words, it's simply the description of the path along which data flows from its origin to its destination. Data lineage will give you answers to several questions: What is the origin of your data? What kind of transformations has the data undergone along the data chains? Where can you find your data, and where does it go?
The biggest challenge is documenting data lineage along the abstraction levels. One company may demonstrate how data sets flow between various applications and link these applications to business processes. Somebody else will describe data processing at the physical level: where the data comes from, which tables, which fields, and which ETL jobs have been built along the way. So this is one of the most challenging and trickiest points about data lineage: to define correctly what you're talking about.
Another challenge with data lineage is that there are various concepts that are very similar to or synonymous with data lineage. Imagine one person talking about a data chain, someone else about data lineage, and a third person about data flow. It may well be that they are all talking about the same thing. For example, in DAMA publications, I found five concepts that have a lot in common with data lineage.
And finally, data lineage is simply this documentation of the paths, of how our data flows through different applications. We usually speak only about digital data. But you can also describe data flow for non-digital data.
[00:07:12] Loris Marini: I was thinking about it and perhaps this is a silly example but we all do laundry at home, right?
We manage our clothes and they go through a pipeline. There are steps that we take to ensure that every morning when we wake up, we have fresh clothes. Knowing where they are, keeping track of where they are, and what the next step is, is something we do without even thinking; we have a model in our heads of what's happening. "Do I have to take care of the laundry? Do I have to send it to the washing machine?" And that's true for anything.
We have these abstractions in our heads, these models of how the world works. We don't call them lineage, but that's what they are. They allow us to navigate the world, anticipate what's going to happen next, and troubleshoot. And when something goes wrong, you know what you are looking at, what came before, and the causal chain of events that led to it.
[00:08:41] Irina Steenbeek: Sure sure. There is even a simpler explanation because we all speak about life cycles. We have a product life cycle, a business life cycle, a company life cycle, but we also have a data life cycle. And there is no standard description of the data life cycle, but it usually includes several standard steps.
First, you define the requirements for data. You describe data, transform it, use it and archive it. And this is what really has a lot of analogy with various areas of our lives.
But then, to realize that data life cycle, a company builds data chains: sets of applications and databases. In one application, the company describes data with the help of data models. In another set of applications, they transform data using ETL tools, databases, and ERP systems. And when you have this chain in place, you use data lineage to describe it, because data lineage by itself is the documentation of various data chains. So for me, it looks very simple. One company would have one data life cycle, which is not related to their business model. Then they implement various data chains.
For example, one data chain processes customer data. Another data chain processes financial reporting. And then data lineage simply describes it. That's all. It's three steps. You have a data life cycle, which is a general concept. You have the physical implementation of the data life cycle, which is simply a set of applications. And data lineage simply describes it.
[00:10:50] Loris Marini: Is data lineage a line?
[00:10:54] Irina Steenbeek: Well, yeah, you can consider it as a line. You can say it's a chain, but of course, you may have various loops there, if that's what you mean. But it's really a line, in order, step by step by step. You put data in one application, then you process the data, then data from one application goes to another application, you demonstrate how this transformation took place, and then it goes into another database, and then it gets changed. It's a line, it's a chain.
[00:11:28] Loris Marini: Yeah. And you can have multiple lines.
[00:11:31] Irina Steenbeek: You can have multiple lines. And you can describe these lines at various levels of abstraction. I started with a very simple definition of data lineage as a description of the paths the data goes through in various applications, but then the challenge starts: at which level of abstraction are you going to describe it?
Let's make it simple. Your customer data comes from a web application and goes into a data lake, from that data lake it goes to the data warehouse, and then it goes to the data mart. This is data lineage at the highest level of abstraction. Some companies call it data flow because you demonstrate it at a very high level, but you can also go down and describe it at the physical level.
You take a table and show how its data elements or data attributes go from one table to another, and to attributes of multiple other tables. That's when you speak about data lineage at the physical level.
You can also do it at a higher level of data models, for example, a logical one. That happens very rarely because only a few companies have logical data models in place. You can also do it at the conceptual level, but traditionally what happens is that the company describes it either at the high application level, which they call data flow, or at the physical level if they have it, because this is the most valuable for a lot of business users.
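To make the two most common levels concrete, here is a minimal Python sketch. It is an illustration only and does not come from the conversation or the book: the table and column names (`raw_customer`, `dim_customer`, `customer_report`) and the helpers `hops` and `roll_up` are hypothetical. The dataset-level flow mirrors the web application, data lake, data warehouse, data mart example above, and the column-level edges show what the same chain could look like at the physical level.

```python
from dataclasses import dataclass

# Business-level view: the order in which a data set moves through applications.
DATASET_FLOW = ["web_application", "data_lake", "data_warehouse", "data_mart"]


@dataclass(frozen=True)
class Column:
    """A physical column, identified by system, table, and column name."""
    system: str
    table: str
    name: str


# Physical-level view: hypothetical column-to-column edges (source, target).
PHYSICAL_EDGES = [
    (Column("data_lake", "raw_customer", "street"),
     Column("data_warehouse", "dim_customer", "address_line")),
    (Column("data_lake", "raw_customer", "city"),
     Column("data_warehouse", "dim_customer", "address_line")),
    (Column("data_warehouse", "dim_customer", "address_line"),
     Column("data_mart", "customer_report", "address")),
]


def hops(flow):
    """Return the consecutive (source, target) pairs of a dataset-level flow."""
    return list(zip(flow, flow[1:]))


def roll_up(edges):
    """Collapse column-level edges into system-to-system hops, keeping order."""
    result = []
    for source, target in edges:
        hop = (source.system, target.system)
        if source.system != target.system and hop not in result:
            result.append(hop)
    return result


if __name__ == "__main__":
    print(hops(DATASET_FLOW))
    print(roll_up(PHYSICAL_EDGES))
```

Rolling the physical edges up to system level yields hops that also appear in the dataset-level flow, which is one way to see that the two views describe the same chain at different levels of detail.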
[00:13:13] Loris Marini: The physical one or the logical one?
[00:13:15] Irina Steenbeek: The physical one. You provided the example of your poll that a lot of people speak about data management, about data modeling, but nobody really cares about data lineage. It's really funny because it's a contradiction. For every other data management capability like data quality, data modeling, you need data lineage. You need to know what happened to the data.
Data lineage describes what happened, and how detailed that description should be depends very much on the requirements or needs of the company. And then we can move to the question of why a company needs data lineage.
[00:14:10] Loris Marini: Yeah.
[00:14:11] Irina Steenbeek: And in my book, I described at least four different categories of these needs.
First is of course regulatory need because there are some regulations that demand data lineage. But here is something interesting: no regulation says that the company should have data lineage. They say companies should be transparent regarding data lineage. So companies then choose to find out how they make the data flow transparent and data lineage is the tool to do it.
The second very important group of reasons is business change. Assume, for example, that the company would like to change part of the application landscape. There are two sides here. On one side, the company needs, for example, new reports or a new insight into data, and they need to know what kind of data to get from the source. You can do it all with the help of data lineage: it's an analysis from the point of usage back to data origination. On the other side, if you're going to change some source systems, you need to know what impact it will have on your reports.
And for that, again, you need to have data lineage in place. There are other types of business change, like digital transformation. I think only lazy people don't speak about digital transformation, but in reality, digital transformation is simply a change in processes. And in digital processes, data is one of the key sources, because data is input and data is output. You need to know what happened to the data, and again you come across data lineage.
I think that financial and risk professionals will understand when I say that they very often need to explain the origin of some figures in their reports. This is again the challenge with data lineage. They can't do it without knowing what happened to the data and all the kinds of transformation the data has gone through along the chain. And this is, again, data lineage.
And then of course the audit requirements. And as I said, one of the important points is data management itself. Usually, a lot of companies start data management with data quality. Assume they would like to build some preventive data quality checks along the chain. They can't do it without data lineage, because they need to know the impact of all of the attributes. As simple as it sounds to implement data quality, you need to have a lot of other data management capabilities in place before you do.
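The two directions of analysis mentioned above, from a report back to its origins and from a source system forward to the reports it affects, can be sketched as graph traversals. This is a minimal sketch under stated assumptions: the edge list and every name in it (`crm.customers`, `report.risk_exposure`, and so on) are made up for illustration, and `impact_of` and `origins_of` are hypothetical helpers, not part of any lineage product.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream object, downstream object).
EDGES = [
    ("crm.customers", "lake.customers"),
    ("erp.contracts", "lake.contracts"),
    ("lake.customers", "dwh.dim_customer"),
    ("lake.contracts", "dwh.fact_contract"),
    ("dwh.dim_customer", "report.customer_overview"),
    ("dwh.dim_customer", "report.risk_exposure"),
    ("dwh.fact_contract", "report.risk_exposure"),
]


def _reachable(start, edges, reverse=False):
    """Breadth-first search over the lineage graph; loops are visited once."""
    graph = defaultdict(list)
    for upstream, downstream in edges:
        if reverse:
            graph[downstream].append(upstream)
        else:
            graph[upstream].append(downstream)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def impact_of(source, edges=EDGES):
    """Everything downstream of a source: what a change there could affect."""
    return _reachable(source, edges)


def origins_of(target, edges=EDGES):
    """Everything upstream of a report: where its data originates."""
    return _reachable(target, edges, reverse=True)


if __name__ == "__main__":
    print(sorted(impact_of("erp.contracts")))
    print(sorted(origins_of("report.customer_overview")))
```

In this made-up landscape, changing `erp.contracts` touches only `report.risk_exposure`, while `report.customer_overview` traces back to `crm.customers`. The same backward traversal is what a data quality team would need in order to decide where along the chain a preventive check belongs.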
[00:17:21] Loris Marini: Yeah.
The other day I hopped on a train and I was thinking about what a railway system is if not a big, massive line that crisscrosses the city and you've got trains going through it. You want to hope that when you catch a train at Station A and you have a journey of 30 stops, there is someone that has the big picture of where the train is going so that if something goes wrong five stops down the track, say a tragedy or an issue with the electrical system, whatever it is, you want to hope that people know and communicate to the rest of the trains on the line that there is an issue so that trains don't collide.
If you allow me this analogy, when you look at the data layer in companies, it often feels as if different aspects of the data life cycle were managed in complete isolation, under different people with different accountabilities and different responsibilities. No one is actually making sure to run a communication line, to open the channel, to allow people to know what's happening.
I remember this migration we did with a company here, I was leading the engineering team to move from a legacy system that the company used for three years to do their analytics, into a more modern, cool, scalable cloud-based deep metadata native solution. The idea was to get that elasticity and that gain, but we had to reverse engineer the black box. We had to go from a proprietary tool to a tool that we could control. But doing it in practice was terrible because we didn't know which metrics were used in which report.
And therefore we didn't know how to prioritize the migration. Which ones do we focus on? We can't do them all given the timeframe we have. And even a simple strategic question like that is really hard to answer if you don't know the full picture of your data and how it moves through the systems.
And there's an aspect in this conversation that makes me think: is what we build in an organization seen as a system? So if we take it for a second, the system lens, and look at a company as a bunch of parts that interoperate, communicate with each other and they try to achieve something, a purpose, a business outcome; some parts are machines, some parts are made of human beings, they obviously need to talk to each other. But if you take this and zoom out and go to the moon and take a big, massive telescope and look at the whole organization, it wouldn't look much different from an electronic circuit board with a bunch of wires. Of course, we don't have wires. We are wireless, but we do interact with each other. So there is almost an invisible wire, which is the communication that we establish. I know what your KPIs are. You know what I care about. We have meetings, we have meaningful conversations and a shared language.
With all these wires that crisscross, someone should have a map of how the system is wired if you want to optimize it. Otherwise, how do you go about fixing a problem? You open the box and there's just a big mess. You just hope for the best and move to the cloud. Surely this will fix all problems.
There are so many examples of lineage and even if we don't call it that way, we should have an intuition of why this is important. What led you to write the book? What was your intention? Which problems and pain points were you trying to address?
[00:21:42] Irina Steenbeek: You know, for some time data lineage was a topic only discussed among technical professionals. Then I encountered a situation in which business users started talking about data lineage. And here, I saw a very big gap between the expectations of business users, who have reasonably low-level knowledge about technical stuff, and what real data lineage is.
So I wanted to explain in plain business language what data lineage means. Especially because, when a company starts implementing data lineage, it often doesn't take into account the real needs of business people. And after a couple of years of implementation, you may face the situation that everything is done, but business users have no use for it because it's too complex.
The second reason is that when you start your data lineage journey, you can't yet see the full complexity of the concept itself. And the complexity of the concept leads to an underestimation of the resources the company should put into data lineage implementation.
So for me, before jumping into the data lineage adventure, a company should really clearly understand why they need it. They need to really pay attention to the scope, and it should match the company's resources, because data lineage takes not months, but years. It also may happen that various types of data lineage would be needed for various drivers, for example, for GDPR or PII. For compliance with personal data regulations, you may need to have data lineage at the physical level, but then only for a limited set of data. For financial data, on the other hand, you may need another type of data lineage.
So my first reason for the book was to explain a little bit more about data lineage to a business or project management audience, and what they should expect. And the second one is really to share my experience in scoping a data lineage initiative, because a lot of companies fail in implementing data lineage because they didn't foresee the complexity and the resources required.
These are two key reasons.
[00:24:30] Loris Marini: Yeah, let's dive into that complexity because I read your book and you make a really clear delineation or differentiation between physical data lineage and logical lineage, business data lineage. There's a fourth one, which I don't remember. Can we go through those?
[00:24:47] Irina Steenbeek: It's some sort of a standard approach to data models: conceptual, logical, and physical.
[00:24:56] Loris Marini: Yeah. Conceptual. So let's define one by one. Before reading the book, I had an intuition that there was a difference between the logical and the physical, but I didn't have in my mind, in my mental model, the definition of conceptual data lineage and a business data lineage.
I thought, well, if you have the physical one, you're done because it's the one with the highest degree of detail, you know exactly which data type, which column goes where in which database. So why do you need to worry about these other levels of abstraction? And perhaps this is the demonstration.
I am a case in point that a technical-minded person and a business-minded person fundamentally see the problem in different ways, and they should be able to see the same question through a different lens. But perhaps I'm jumping ahead again. I'll leave you to go through those definitions.
[00:25:56] Irina Steenbeek: When I speak about business data lineage, I usually mean the description at the high level of abstraction, how data sets go through different applications. And it's also required not only from some sort of general sense of understanding but sometimes by various regulations.
One example is BCBS standard number 239, which is very well known among financial institutions and asks for the documentation of business processes linked to applications. This is what I refer to as the business level: business processes related to applications.
[00:26:40] Loris Marini: So as an example, say that we're looking at my personal information, where I live, age, email address, let's say that this is the unit.
[00:26:52] Irina Steenbeek: This is a unit that's personal information. You fill in your personal information somewhere. So from a data lineage perspective: personal data set comes into a web application, from the application it goes, for example, to some data lake, which may happen.
And in the data lake, you have a data set called personal data. And then from the data lake, it will go somewhere to a data warehouse. These are the steps in the applications described at the level of data sets. This is what happens.
[00:27:32] Loris Marini: So at this level, we don't find details.
[00:27:35] Irina Steenbeek: No, we don't find details. Assume you need to understand what personal data is. Let's start with the natural person. Is it a customer or some sort of relationship or description of what a natural person is, the definition of it?
And then immediately you jump to the conceptual level. You have an individual, or person, or whatever, and you need to give a definition. And here you're at the conceptual level of data models. But then again, you, as a person may have address information or financial information and these are data elements or data entities at the lower level. Or it may be already at a logical level.
It's like a data model. One person would have personal address information, but address information can be split into several attributes: city, country, street. It becomes tricky because the same logical data model, like your address information, will be implemented in various applications in various databases. And that's one domain relationship. Assume you're in one database. It may happen that it will be two tables that will include your address information, or you need several tables to describe it. The same information will go to a data warehouse which is another physical structure, so you have another physical model.
And this is again, you describe it at the physical level. Data lineage has a lot in common with data modeling. And the challenge, of course, is that various companies can use totally different processes for data modeling, and in my book, I only used the classical one, but I also made comparisons with some other approaches to data models which can be used.
[00:29:49] Loris Marini: Let me try to work up and see if I understood it correctly. So at the business level, we have the highest level of abstraction. We don't worry about where the data is stored, what we're going to do with it.
[00:30:00] Irina Steenbeek: No, we do care about the level of applications.
[00:30:06] Loris Marini: Right. Where does a piece of information go?
[00:30:17] Irina Steenbeek: At the conceptual level, you describe business terms, for example, the definition of person and this will have various types of information, like address information, financial information. And then you start going deeper into lower abstraction levels.
[00:30:39] Loris Marini: Yeah. So once we agree on what the customer is.
[00:30:42] Irina Steenbeek: We define what kind of customers we have, because usually, you may have a retail customer, corporate customers, and they will have totally different attributes to describe this customer.
[00:30:56] Loris Marini: And in that example, the logical level underneath the conceptual one, the address example. The address has a collection of information that gives us that information of where that person is, but that address is made of unit number, a street number, a suburb, and a whole bunch of other details.
[00:31:18] Irina Steenbeek: Data lineage is movement between the various data models of different levels and in between, you have business rules that allow you to transform data from one model to another.
[00:31:35] Loris Marini: In the book, you go into detail about how one should approach these different levels and the strategies to use. If I understood correctly, there are things in lineage that can be automated.
[00:31:52] Irina Steenbeek: We need to know exactly what kind of data lineage is required, because it will influence the way you document it. For example, at the business, conceptual, and logical levels, you can do it manually. It's descriptive data lineage: you describe it.
But at the physical level, you should really implement data lineage using an automated method. And there are a lot of providers who do it. What that means is that their software scans metadata in applications, feeds it back into some metadata repository, and visually demonstrates the relationships between these various objects.
It's very important to understand from the beginning what kind of data lineage a company needs because it will influence the whole data lineage business case. What kind of software do we need? What kind of implementation methods are we going to use? How are you going to use it?
Because from the theoretical viewpoint, it's very nice to start from the business level and then go down to the physical level. What happens in reality? A lot of companies start from physical data lineage because this is where people know it's easier. And sometimes it's the only way to find out what really happened, especially in the case of legacy software. There's only one possibility to find out what really happened with the data within the company: you start with physical data lineage. Then you can of course go up to describe the various models.
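As a rough idea of what "scanning metadata" can mean at the physical level, the sketch below pulls table-to-table edges out of SQL statements with two regular expressions. This is a deliberately crude illustration and an assumption about how one might start, not how any particular lineage product works: commercial tools rely on proper SQL parsers and catalog metadata, and the statements, table names, and the functions `table_edges` and `scan` are all hypothetical.

```python
import re

# Hypothetical ETL statements, as might be collected from a job repository.
ETL_STATEMENTS = [
    "INSERT INTO dwh.dim_customer SELECT id, city FROM lake.customers",
    "INSERT INTO mart.customer_report SELECT c.id, f.amount "
    "FROM dwh.dim_customer c JOIN dwh.fact_contract f ON c.id = f.customer_id",
]

# Very simplified patterns: one target per statement, sources after FROM/JOIN.
TARGET = re.compile(r"\bINSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)
SOURCES = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def table_edges(statement):
    """Return (source_table, target_table) pairs found in one statement."""
    target = TARGET.search(statement)
    if target is None:
        return []  # not an INSERT ... SELECT statement, nothing to record
    return [(source, target.group(1)) for source in SOURCES.findall(statement)]


def scan(statements):
    """Collect table-level lineage edges from a list of statements."""
    edges = []
    for statement in statements:
        edges.extend(table_edges(statement))
    return edges


if __name__ == "__main__":
    for source, target in scan(ETL_STATEMENTS):
        print(f"{source} -> {target}")
```

A regex approach breaks down quickly on subqueries, views, stored procedures, and hard-coded logic, which is part of why automated physical lineage is usually bought rather than built.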
[00:33:42] Loris Marini: Say that I am a CTO and my CFO is really pressing me because we are running out of budget. I start talking about these many different levels: the physical, the logical, the conceptual, the business. And she comes to me and says, "Look, I'm sorry. We need to keep it to the physical. We have a limited budget just to deploy this service. We went through a whole bunch of reviews. We identified the vendor, we're ready to go. Isn't that enough? Why do you want this many layers?"
What would you say to a person like that? What are the consequences of not thinking about this?
[00:34:15] Irina Steenbeek: I would first ask: why do you need it? What will business users get from it? Here is the challenge. Assume you spent one year implementing physical data lineage, and you're going to demonstrate the result to business users. Even if it's a nice visualization tool, you've got hundreds and thousands of objects, and nobody understands what is there.
What are the business users' needs? That is one of the biggest challenges I experienced. We describe what happened to the data, and that's metadata lineage, but business users need to know what happened to their data values. If they see 1 million in their report, they need to understand how that value has been built from various contracts. This is one of the challenges, and sometimes simply building reconciliation reports at different data points may solve the issue.
Of course, data lineage, in that case, will enable them to build the reconciliation reports, because if the application landscape is complicated enough, you need to know what to reconcile this with. Then you need to have physical data lineage.
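A reconciliation at two data points can be as simple as comparing totals and record keys between adjacent stages of the chain. The sketch below assumes a made-up example in which a report shows 1 million built from three contracts in the warehouse, while the source system holds a fourth contract that never arrived. The contract IDs, amounts, and the helpers `total` and `reconcile` are hypothetical.

```python
from decimal import Decimal

# Hypothetical contract amounts at two data points along the chain.
SOURCE = {
    "C-001": Decimal("400000"),
    "C-002": Decimal("350000"),
    "C-003": Decimal("250000"),
    "C-004": Decimal("250000"),
}
WAREHOUSE = {
    "C-001": Decimal("400000"),
    "C-002": Decimal("350000"),
    "C-003": Decimal("250000"),
}
REPORT_TOTAL = Decimal("1000000")  # the value the business user sees


def total(records):
    """Sum the amounts held at one data point."""
    return sum(records.values(), Decimal("0"))


def reconcile(upstream, downstream):
    """Compare two adjacent data points by total and by record key."""
    return {
        "difference": total(upstream) - total(downstream),
        "missing_downstream": sorted(set(upstream) - set(downstream)),
    }


if __name__ == "__main__":
    print("report matches warehouse:", total(WAREHOUSE) == REPORT_TOTAL)
    print(reconcile(SOURCE, WAREHOUSE))
```

Here the report reconciles with the warehouse, and the warehouse differs from the source by 250000 because `C-004` is missing. Knowing which data points to compare in the first place is exactly what the physical lineage provides.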
I would also like to share that once physical data lineage has been implemented, two key questions immediately arise. The first one: who can guarantee the quality of the metadata? You need to prove that the data lineage is correct, because you've got hundreds and thousands of objects and millions of relationships.
Challenge number two: very often data lineage describes the movement of data between tables and columns, but the real adventure is to describe transformations, to go deep into ETL jobs. People say, "Okay, for one attribute in a report, I need 50 attributes from the source. But I need to know the critical ones." How are you going to define that with the business rules?
Very often, business rules are not stored in one place; they are spread across the chains. They've been put in various ETL tools. And sometimes they're hard-coded, and that is where the challenge comes up. So finally, the question is: okay, great, we do have data lineage, what then? That is why you need to be very precise at the beginning. Who are your stakeholders? What do they need?
One piece of advice in my book is that during scoping, you really need to understand what business users expect from data lineage.
I still keep in mind your message at the beginning of our conversation, that data management and data lineage are treated as quite different topics. When I wrote the book The Data Management Toolkit, I started my journey in data lineage at the same time. It took me three years to realize a very simple logic: the implementation of data management follows the logic of the documentation of data lineage.
Data lineage is so complex. When you start to document data lineage, you need to have data governance, data modeling, and data architecture in place. And if you don't have these capabilities, you need to build them. And when you build them, you really develop your data management capability or function, whatever you want to call it. Without data lineage, data management can't function.
Coming back to the reality of what happens in companies when they would like to start documenting data lineage: first, document at the business level and demonstrate which applications are involved in the data chains. Then scope which part of the data chain is most important to document at the other levels, because a company should be very careful with defining the scope. Big companies have very long data chains, and it's practically not feasible to document it all at once.
So they should choose something critical. And the criticality always starts at the report side, because if a company produces hundreds of thousands of reports, it may be that only two or three of them are really critical. So you need to define your critical reports.
[00:39:48] Loris Marini: And prioritize those.
[00:39:49] Irina Steenbeek: And find which chains deliver data to them. And then they need to find the critical piece of each data chain.
[00:39:59] Loris Marini: Who is responsible for data lineage in a company? We can think of engineers or technical people, IT. It has to do with writing queries or reading metadata, but, is my understanding correct that as you go up the abstraction ladder, that's when you start broadening the scope?
[00:40:23] Irina Steenbeek: I would really split it into accountable and responsible. It's not rocket science. Accountable means someone who is responsible for setting up the whole story. And in this case, it will be top management who has to be accountable for describing what happened to data.
Let me make it clear: data lineage implementation is a resource and time-consuming exercise. Data flows through various business lines, so it requires the involvement of various business departments. So somebody needs to coordinate. Top management is accountable, and responsibility depends on the level of documentation.
I would say end business users and technical people are responsible for documentation. But even if a company starts with documenting physical data lineage, they still need to ask business people what is critical for them, where to start, how to scope it.
It's very difficult to define, but from the business perspective, it's subject matter experts. From a technical point of view, of course, it's database administrators, data engineers. From a data modeling perspective, it's data modelers, data architects, et cetera.
So a lot of various professionals will be involved there.
[00:41:57] Loris Marini: How does this actually play out?
Let's run through a scenario. Imagine ACME, a fictitious company. Pretty big, more than a thousand employees. They operate in four or five different regions, and everybody complains to the CEO that the reports are inaccurate.
It takes forever every time. Just getting a simple metric, like how many items of a type were sold in Europe, is a question that paralyzes the entire analytics team. Everybody comes with a different answer. So it's obvious that they have to get their data together and build a whole bunch of capabilities. At some point, someone points out, "Hey guys, we need to have a map." Yeah. Sounds reasonable. So let's start modeling, building the metamodel. Let's start thinking about lineage.
What are the first steps? Who needs to be involved and what is the relationship or the communication lines between those who are responsible for implementing this capability and the folks like the data governance council? What do we need to do first and who are the right people to put in the room?
[00:43:20] Irina Steenbeek: The right people to put in the room are the key stakeholders that get the biggest pain. In this case, it's on the side of finance. People like the CFO, maybe the CEO, some sponsors who have to make decisions to start this initiative.
Somebody needs to take responsibility. You don't need to hire somebody, somebody within the company can start doing it, and then you have to follow very simple steps from a reporting perspective.
I can give you an example. In a mid-sized company, I counted around 300 reports that they produced of which 60% were in Excel. It was finally the analysis that brought the company the idea to set up data management and to start the implementation of the data warehouse.
[00:44:40] Loris Marini: Yeah, it's pretty messy.
[00:44:45] Irina Steenbeek: The risk report, the finance report, and a report from the sales department about customer sales all delivered a totally different revenue picture. And the worst-case scenario was that these three types of reports were delivered to the same authorities, and they had to ask which one was correct.
And then I found out that we had various reporting platforms for finance, risk, and sales. At that point, the company's management decided to start the implementation of the central data warehouse, where we could bring all data together.
Going back to your question, you should first discuss what kind of reports are the most important for decision-making, because when you speak about the criticality of reports, you speak about real strategic decision-making. Then again you come to the challenge that even one report may have one hundred metrics. So somebody needs to think about the kind of metrics they're using to manage the company.
It's usually up to the finance side or risk side of the business to decide where to get these metrics, and we also need to understand how they have been built. And then the data architects come into play because they need to explain how the data for this report has been created through the whole company.
What are the chains? What application has been used? You need to decide at what level of data lineage you need to document, or what kind of data management capabilities you need to develop to document it.
And maybe you need to set up data governance, or data modeling, or data architecture. Maybe it's already there but you're simply not at the point. You need to make some sort of gap analysis of what you have and what you need, and the difference between the two will be your plan for the future.
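The gap analysis mentioned above is, at its simplest, a set difference between the capabilities you need and the ones you already have. The sketch below is only an illustration: the required capabilities are the three named in the conversation, while the `existing` set and the `capability_gap` helper are hypothetical.

```python
# Capabilities named in the conversation as prerequisites for documenting lineage.
REQUIRED = {"data governance", "data modeling", "data architecture"}


def capability_gap(required, existing):
    """Return the capabilities that are needed but not yet in place."""
    return sorted(set(required) - set(existing))


# Hypothetical starting point: only data architecture exists today.
existing = {"data architecture"}
plan = capability_gap(REQUIRED, existing)
print(plan)
```

A real assessment would also grade how mature each existing capability is, which a simple present-or-absent set cannot express.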
[00:47:00] Loris Marini: And in all of this, what is the role of the consultant? Which types of customers approach you at Data Crossroads, and what is your role? What's your work?
[00:47:15] Irina Steenbeek: I prefer the word coaching to consulting, because I have a couple of beliefs. One, companies already have data management in place. No company in the world exists without managing data, but some companies do it in an informal way. And some companies still haven't set up a formal function.
The types of companies that approach me are the large, international ones. And of course, we need to understand that there are companies that have had data management programs for years. And there are others that don't. So my role, in that case, is very simple. I need to have dedicated people from the company who might not have the proper experience or proper knowledge, but who are willing to learn very quickly.
My role involves discussing with people what they want, advising them with what they need, and how to build these capabilities within their limited amount of time.
I had a project for nine months to set up a data management framework. But to be precise, a data management framework is simply a set of rules that companies should apply to start delivering some artifact, for example, data modeling. If the company doesn't have data modeling, then you have to know how they should do it. They need to define what kind of data models and techniques they're going to use.
And these data modeling policies and standards, this is what I'm helping people to do. They need to agree on how they're going to do it. Are they going to use their internal staff, or are they going to hire somebody else? Coaching for me means training the people who already work in the company.
[00:49:15] Loris Marini: To be the agents of change and actually lead this initiative as an inside job.
You said that every organization, no matter the size, has some sort of data management and, if you think about it, it's true.
If we think that data management means a bunch of databases and bots and automation, that probably wouldn't be true, because some companies are still sticking to paper and pen, but that's also a form of data management. It might not be the latest, high-tech stuff, but surely there is a lot of tribal knowledge, a lot of assumptions.
A lot of that information is in the heads of people that are working day in, day out. They do follow some sort of standards, otherwise, they wouldn't be able to operate as a business. So one should use that organizational knowledge and guide the people that are already doing the stuff, day in, day out to implement things in a more scalable, properly formalized way.
[00:50:30] Irina Steenbeek: My role is also to assist companies to formalize their way to manage data. You start from a definition: what do they mean when they say they manage data? What kind of capabilities do they really need? Every company should choose the scope of data management that fits the company's needs and resources.
[00:50:57] Loris Marini: Regulatory compliance, as we've mentioned many times, is one of the biggest drivers, but sometimes we informally refer to that as the stick, as opposed to the carrot, which is sort of more thinking long-term and strategically, and trying to look at what a data management program can enable: types of outcomes for the business that will directly be possible as a result of that capability.
In your experience, what is the percentage between carrots and sticks, and does it happen that people that came in because of the stick realize that there are way more benefits than what they thought at the beginning?
[00:51:41] Irina Steenbeek: Yeah. It's exactly what happened because for many companies it's really the stick that pushes the company to start thinking about data management. But then on the way, you realize that compliance with regulations is not the biggest driver. The biggest driver is really their usage of data lineage for business users.
I can give you an example: the financial department. Controllers need to explain some figures in their reports, and to do that they need to spend days going through different departments to investigate how data has been processed. It takes days.
When you speak about properly implemented data management, it should only take a couple of hours. The process of development, even the process of implementation of data lineage, can still take a couple of days to investigate various business rules, but at least a person knows whom to contact to get this information. Otherwise, you pick up your report and you start going through different departments asking how they came to this figure. It's really a very inefficient way to work.
When I made my own investigation, I still recall the statistics. The company had almost 300 reports, of which 60% had been built in Excel, often like a pyramid with Excel above Excel above Excel. Only 40% of reports were really based on data taken from physical databases. Investigating the report flows, I found a sort of closed loop, where reports generate data that goes through various other reports and feeds back into itself.
[00:54:04] Loris Marini: Ah, fantastic.
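The closed loop Irina describes, where a report's output passes through other reports and eventually feeds itself, is a cycle in the graph of report dependencies. Once the flows are documented, such a loop can be found with a standard depth-first search. The following is a minimal sketch under assumptions: the `FEEDS` map and all report names are hypothetical, and a real lineage tool would read these dependencies from metadata rather than a hand-written dictionary.

```python
def find_cycle(feeds):
    """Return one closed loop as a list of report names, or None if there is none.

    `feeds` maps each report to the reports that consume its output.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GRAY
        path.append(node)
        for nxt in feeds.get(node, []):
            if state.get(nxt, WHITE) == GRAY:
                # nxt is already on the current path: the loop closes here.
                return path[path.index(nxt):] + [nxt]
            if state.get(nxt, WHITE) == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in list(feeds):
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical report flows, invented for illustration.
FEEDS = {
    "sales_extract": ["regional_summary"],
    "regional_summary": ["management_pack"],
    "management_pack": ["sales_extract"],  # feeds back into the first report
    "ledger_export": ["finance_report"],
}

cycle = find_cycle(FEEDS)
print(" -> ".join(cycle))
```

The recursive search is fine for a few hundred reports. A much larger landscape would call for an iterative version or a graph library.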
[00:54:05] Irina Steenbeek: What is the biggest advantage of data management? Of course, we also need to think about the number of applications. If you keep multiple applications in your application landscape holding similar data, even master data such as customer information, and it is not in sync, imagine how much time it requires to align this data manually. From what I've seen in my experience, you have several places where customer data is located, and it's not in sync. You have to compare it manually, maybe in Excel. And then you have to pay maintenance costs for multiple applications to keep it.
[00:55:04] Loris Marini: Why do that?
[00:55:05] Irina Steenbeek: Yeah. And then you start implementing data management. And of course, I wouldn't say you immediately get a sort of cost reduction, but in the future, of course, you'll get it.
[00:55:19] Loris Marini: Yeah, it's this long-term thinking that really is key, right?
[00:55:25] Irina Steenbeek: Data management. It's not a project, it's not a program.
It's really business as usual. It should become business as usual. The company should have professionals who do it in a professional way, but it never stops. You always get some challenges with data. And of course, it's not only the company itself: the changing environment will always require new data.
Tomorrow, a company would like to go to a new market or start selling a new product. They would immediately require some reports, some information about the profitability. For me, data lineage and data management capabilities are very close. You can't do data management without knowing data lineage and you can't document data lineage without having various capabilities of data management.
[00:56:18] Loris Marini: Yeah, absolutely. They're intertwined. They're two sides of the same coin.
I know that you write a lot of very interesting articles. You have a writing style that I love, and which is quite rare to come across in the data community, because you don't take anything for granted.
I absolutely admire that. Anything and everything is dissected and analyzed, you go down to the most atomic level, the lowest-level definition, and then you build upwards, so it's really cool.
Is your blog the best place to follow?
[00:57:02] Irina Steenbeek: Yeah. My blog is the place where I publish everything. I publish it later on LinkedIn. It's my site datacrossroads, or my LinkedIn profile, or my company's profile — these are the places where people can follow me.
[00:57:27] Loris Marini: Perfect. We'll make sure that we'll add those to the show notes for anyone interested. I definitely recommend reading more about the book, Data Lineage from a Business Perspective and to follow Irina and, the many, many good articles she puts out there. Definitely privileged to have the opportunity to have this conversation with you.
And I hope you'll enjoy the rest of your day.
[00:57:53] Irina Steenbeek: Yeah, thank you very much, Loris. It was a pleasure to talk to you. Thank you very much.
[00:57:58] Loris Marini: Thanks.