Why data lineage?

Loris Marini - Podcast Host Discovering Data

Data lineage is about gaining confidence in the data to save time, money, and even lives. How do you sell it to the business? Follow me as I speak to Jan Ulrych, VP of Research & Education at MANTA.

Episode transcripts

[00:00:00] Loris Marini: Some of you might remember John Edwards's story. And a bit of a disclaimer, this is one of the saddest and grimmest stories that I read today. Those of you that live in Australia might remember the case has been going on for a while. John Edwards had a history of family violence, and one day he was able to walk into a firearm place, buy a gun, and with that gun shoot two of his children. And obviously, everybody was shocked, but then an investigation found out that the New South Wales firearms registry actually failed to perform its key responsibility, which is to keep us safe and ensure that people who are not supposed to carry a gun, don't actually have a gun.

And if you think about it, this is an example of a very simple error in the data. It only takes one, a single bit of information, to change the state of a person from green flag to red flag. It's a single bit. And yet one bit that was very costly. It cost the lives of two young boys.

So why are we talking about this? Well, because the quality of the data, the integrity of the data, are not just a matter of business and profits and financial gains and bottom lines. Sometimes it has an impact on real people. And if you step out a little bit from the story and you look at what an organization is, every organization is a system and there are many moving parts. Some are made of computers and machines. Some are made of people, you know, flesh and bones. Information comes in. Information goes out.

If we want to keep the system safe, we need to have a map. We need to know how the information flows so that when something goes wrong, we know where to look. If we don't have a map, well, not having a map is not an option, really, because if we don't have one, we still have to build it when something goes wrong, when we need to search for a solution. And the result is that the process takes a lot more time: projects are delayed, migrations take forever, reports are inaccurate, and it's impossible to manage risk. So that's where data lineage comes in.

You know, the core idea is data lineage is the ability to fully understand how data flows from one place to another. Without it, data assets very quickly become liabilities. Now it sounds obvious, right? Who in their right mind would build a system and not keep track of how it works internally? Well, the problem is that it's not as easy to do in practice.

And so to understand this today, I wanted to speak with Jan Ulrych. Jan is a data consultant and an integration architect with more than 10 years of experience in data management and experience working for brands like DHL, Moody's (credit analysis and economic research firm), and Thomson Reuters.

And I thought Jan would be the best person to talk about this with for a couple of reasons. Firstly, because he has a mix of business and technical skills, and I think this is exactly what is required to understand the nuances of this challenge and plan accordingly. And also because Jan works for Manta, the data lineage automation company, where he makes sure that everything runs smoothly from proof of concept to actual deployments of data lineage capabilities.

So today we talk about many things, including the business drivers and benefits of data lineage, why it's so hard to implement, what are some of the most successful strategies when you are trying to develop this capability, and a lot more, so let's get to it.

So I'm here with Jan Ulrych. Jan, welcome to the show.

[00:04:14] Jan Ulrych: Thank you, Loris. Great to be here.

Tracing Jan's journey into data

[00:04:15] Loris Marini: So why don't we start with your background? How did you become interested in data lineage?

[00:04:25] Jan Ulrych: I've actually worked in data for basically all of my career. I started as an ETL (Extract, Transform, and Load) consultant. Over time, on a few projects, I spent hours and hours in meetings where we were talking about where the data was coming from and how it was being adjusted, transferred, and moved from one place to another. These meetings were endless because people were arguing about where the data was coming from: is it coming from this place or is it coming from that place? What does the existing process look like? And it was just hours and hours that felt like a waste of time. So this is exactly what lineage solves, and that's how I got interested, and why I'm so passionate about lineage.

Data lineage is about gaining visibility

[00:05:22] Loris Marini: Right. Yeah. And so when we think about lineage, let's start with the very, very basics. How do you think about data lineage? What is it?

[00:05:34] Jan Ulrych: So for me, lineage is basically a map.

Let's start with some real-life examples, a real-life analogy. I like to think of lineage as a map. So if you're driving, for example, to get to a location. You need a map to actually get from your home to the place where you chose. In the past, you used paper maps, right? So when you were driving, someone was helping you navigate using the map, giving you instructions. Today, we have it even better. We have GPS navigation like Google Maps. Put in the target address and it starts navigating, it gives you the instructions. So I'm really thinking about the lineages as a map, because this is exactly what we need for our data environments, right?

When we are about to make any change, for example, moving to the cloud, we need to understand what data we're using. Who is consuming that data? And if we take this one system and replace it or move it to the cloud, what flows will need to get adjusted? What rows will need to get adjusted? So that's my view on it.

[00:07:06] Loris Marini: Yeah, absolutely. While we were preparing the notes for this episode, we were trying to come up with ideas to explain it, because it's not a term that we hear a lot, like we don't wake up in the morning thinking, "ah, where's my data lineage?"

But it's actually everywhere when you think about it. Take this conversation we're having. It goes through the internet, and the internet is, as we know, a bunch of nodes, computers scattered all over the world, with fibre optics in between.

The information is chunked into small blocks, and each block follows a different road. It is routed in a different way because what matters is to get to the destination in the shortest amount of time possible. Engineers figured out really complicated systems to ensure that each bit of information arrives at the destination in the right order.

And one of the key elements to that is to know which router the packet is going through. It's possible with this type of information to trace the entire path of every single packet, of every piece of information that ever flowed through the internet, from the source to the destination. And that's how we made the internet possible, because we have that map, that visibility. When there is an outage, and things go wrong, machines fail, we know how to reroute packets, exactly like you said with the analogy of the actual map in the real world.

But we use lineage in other fields of science. In anthropology, we find lines of descent from an ancestor. Think about the tree of life. The concept of tracing a line is useful in genetics. Everybody's familiar with COVID now. And we know that scientists were able to trace mutations of the virus as it propagated through space and time because of the information contained in that genetic material.

It's useful in any production plant. The example that comes to mind is food. When you buy something and, for some reason, one of the ingredients used to produce it has an issue, a nonconformity, someone has to make the call and trace who bought the item, where they are, and how to go about managing a recall. That means tracing the pipeline through which this product has moved. And so that's a form of lineage.

Lineage in finance: knowing where the money goes

You know, when you start thinking about it, you find examples everywhere, like in finance. And I know you have a background working in the financial sector. And I was thinking about, do you think that the general ledger could be an example of data lineage in the financial sector?

[00:10:06] Jan Ulrych: Yeah, definitely. I mean, you need to understand where they got the data for a general ledger, for sure. And then, in the end, you need to understand who's actually using the general ledger and if all the systems that are responsible for using this centralized information are actually using it. So absolutely.

[00:10:27] Loris Marini: Yeah, because you got to record transactions and see how the money flows. You've got to manage the books and make sure what comes in, goes out, and you're not making up numbers.

[00:10:37] Jan Ulrych: Exactly. Exactly. There are many more examples in the financial sector that you could think of, like regulatory reporting, Basel reporting, anti-money laundering rules, or the Sarbanes-Oxley Act. If you remember the early 2000s, the issues with Enron and Tyco and others. So the finance sector is a great example, for sure.

[00:11:10] Loris Marini: It's interesting how we value currency a lot. It's easy to see that $1 is an asset. It's a small one, but nobody would argue that one single dollar is not an asset. But when it comes to data and information, we have a different perspective. And so one bit of information is not often perceived as an asset or a liability, and also the cost can vary. With the example of that story that I mentioned at the beginning, one bit can make the difference between a person having the right to buy a gun and committing a tragedy or not.

A dollar is always a dollar. It's kind of how we invented money as an abstraction that we all agree on. There are exchange rates, but a dollar is always a dollar plus or minus, whereas a bit could mean nothing, could be the extra bit on your hard drive, then it gets corrupted because of ionizing radiation from space, or it could be the difference between life and death.

The key questions that data lineage can help answer

So I find it interesting, but how do you think people understand data lineage? Do we have the same perception of what data lineage actually means?

[00:12:26] Jan Ulrych: That's a great question and I wish I actually knew the answer, but the truth is that I believe everyone has run into data lineage needs at some point in their work life, especially in the data space.

So the most common questions about data lineage are: where did this data come from? Can I trust it? Does it really come from the source that I was told it comes from? Or, "what kind of cleansing or transformation is being applied to the data before it gets to me?" Or, "If I'm about to change this particular data island, this particular table in a database, am I going to break anything? Is something going to happen? Should I be careful about it?"

All of these questions are really about data lineage. Where is the data coming from, who's using my data and if I make a change, who's going to be impacted? Do I need to ask someone before validating the results? Do I need to notify someone about this data not being available anymore so that the next process doesn't actually crash?

Our environments have become so much more complex over the past 10 years that there's basically not a single person in the organization who still has it all in their head. Being able to answer the question, "Is this used by anyone or is it not?" No one has that information anymore. So we need this map, this data lineage, that actually gives us information whenever we need it without having to go and ask individuals for instructions, for directions.
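The questions Jan lists ("where did this data come from?" and "if I make a change, who's going to be impacted?") can be read as two walks over the same map. What follows is only a minimal sketch of that idea, assuming lineage is stored as a plain list of source-to-target edges. The asset names and the helper functions `origins` and `impact` are hypothetical and are not taken from MANTA or any other tool.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source, target) means data flows source -> target.
EDGES = [
    ("crm.customers", "staging.customers"),
    ("erp.orders", "staging.orders"),
    ("staging.customers", "warehouse.sales"),
    ("staging.orders", "warehouse.sales"),
    ("warehouse.sales", "report.revenue"),
    ("warehouse.sales", "model.churn"),
]


def build_graph(edges):
    """Index the edges in both directions so we can walk up or down the map."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return upstream, downstream


def reachable(start, neighbours):
    """Breadth-first walk returning every asset reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


UPSTREAM, DOWNSTREAM = build_graph(EDGES)


def origins(asset):
    """Where did this data come from? Everything upstream of `asset`."""
    return sorted(reachable(asset, UPSTREAM))


def impact(asset):
    """If I change `asset`, who is impacted? Everything downstream of it."""
    return sorted(reachable(asset, DOWNSTREAM))


print(origins("report.revenue"))
print(impact("staging.orders"))
```

In this toy map, asking for the origins of the revenue report returns both source systems and everything in between, and asking for the impact of a change to the staged orders returns the warehouse table, the report, and the model. A real environment has far more edges than anyone can hold in their head, which is the reason the walk has to be automated.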

[00:14:44] Loris Marini: Absolutely. When an organization is small it's very easy to keep everything in our heads. Like you mentioned before, that's a brilliant observation. Lineage is always there. Whether you have a map or not, there is always a causal chain that links events that happen.

If we agree that causality is a thing, then A happens and B happens and C happens. And then the whole thing evolves and propagates through a network of interactions. Real systems are complicated, as you mentioned. So go figure exactly how each piece relates to each other. And so that's really the complexity of the problem — tracing that map and keeping the map up to date is not easy at all because the scale and need for change are really making the process challenging.

How data lineage gives data scientists visibility and makes models repeatable

You think about data science, everybody wants to do data science. But doing science really means, I think, being able to have theories and then test them and learn from the outcome of that process, whether the model was right or wrong, it doesn't really matter. What matters is that you can adapt and change and refine those estimates over time. So change is really the essence of the scientific process, along with having hard evidence AKA data.

So data and change without a map, you can't really trust what you see.

[00:16:25] Jan Ulrych: Now data science is a great example. From my perspective, there are two aspects wherein data lineage is critical for data science. If I could stop using data lineage for a moment, in data science, we are talking about transparency or observability of the data science models and repeatability.

So if we go one by one, first of all, in many cases, data science or machine learning models could be viewed as sort of a black box. We provide some inputs, so we train that model, and then it generates the results; we hope that it resembles what we want. But, in many cases, it's really a black box. We don't really know how it is actually working.

Which of the inputs or features are important and which are actually not that important? Which impact the result and which don't? So that's the observability and transparency in data science, where we need to understand what that model does and make sure it's not truly a black box.

But at the same time, if you think about how we're training these models, by using some training data sets for each line, you need to make sure that the data that you're using it on is basically the same, or has the same source, same patterns as the data that you used for the training.

If we apply the model to a different data set that does not have the same patterns and the same features, it may not work. Or it probably will run, but it will provide results that may not be expected or desired. So data lineage, the understanding of what data we are training on and what data we are using the model on in production, is extremely critical.

And even more, as you talked about change. Our environments go through a lot of changes every day. We are implementing new systems. We are making fixes, we are making improvements. And some of these may actually affect or impact those models that we trained, these algorithms that we trained.

So being able to understand what changes have been made to the source of our data, ideally being notified about those changes so that we can make sure that we actually review whether these changes have an impact on our models that we trained, are actually extremely important because otherwise even though you still might be using the same data set, it is actually changing under your hands as you're actually using it. So that's about the repeatability: to make sure that you're using still the same data set with the same patterns, or if it changed, you need to make sure that you go through that process again to validate the data science model.
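One narrow way to picture "being notified about those changes" is to record the shape of the data a model was trained on and compare it with what the source delivers today. The sketch below is an assumption-laden illustration, not a description of any product: the column mappings, the function names `schema_fingerprint` and `schema_changes`, and the example schemas are all hypothetical. It also only catches structural changes (columns added, removed, or retyped), not a shift in the values themselves.

```python
import hashlib
import json


def schema_fingerprint(columns):
    """Hash a {column_name: type_name} mapping into a stable fingerprint."""
    canonical = json.dumps(sorted(columns.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def schema_changes(trained_on, serving):
    """List readable differences between the training and serving schemas."""
    changes = []
    for name in sorted(set(trained_on) | set(serving)):
        if name not in serving:
            changes.append(f"removed: {name}")
        elif name not in trained_on:
            changes.append(f"added: {name}")
        elif trained_on[name] != serving[name]:
            changes.append(f"retyped: {name} {trained_on[name]} -> {serving[name]}")
    return changes


# Hypothetical schemas: what the model was trained on vs. what arrives today.
TRAINING_SCHEMA = {"customer_id": "int", "spend": "float", "region": "str"}
PRODUCTION_SCHEMA = {"customer_id": "int", "spend": "str", "channel": "str"}

needs_revalidation = (
    schema_fingerprint(TRAINING_SCHEMA) != schema_fingerprint(PRODUCTION_SCHEMA)
)

if needs_revalidation:
    for change in schema_changes(TRAINING_SCHEMA, PRODUCTION_SCHEMA):
        print(change)
```

Storing the fingerprint next to the trained model gives a cheap trigger for the revalidation step Jan describes: if the fingerprint of the source no longer matches, the model goes back through review before anyone trusts its output.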

So I believe that that's a great example of understanding your data flows and data pipelines and is extremely critical. And again, as you mentioned at the beginning, if we don't do it, then maybe it may have some critical consequences.

Two critical business drivers for data lineage: time to market and operational efficiency

[00:20:00] Loris Marini: Yeah. And there's this model drift, I believe it's called model drift when you have the situation you described, where you have a model that's based on the assumptions that are no longer true because the reality in the meantime changed.

[00:20:15] Jan Ulrych: Oh, yes, exactly.

[00:20:17] Loris Marini: And so the model obviously doesn't perform as well. I'm thinking, in terms of business drivers, when it comes to explaining to a board why data lineage is a must-have capability for any data science, data governance, and data management initiative, it can be tricky. Why is that? What's the real challenge here? Is it the technology? Is it the perception? Is it a mix of both?

[00:20:54] Jan Ulrych: I believe it's a mix of both. Plus, from my perspective, it's efficiency and time to market. So if I go back for a moment to that analogy with a map. If you're going on vacation, are you going without a map and basically stopping on every corner and asking strangers, "how do I get there?"

You don't do that. You have a map and you follow the instructions. You follow the map, which is much faster actually because you don't have to stop and wait and ask questions. You just keep driving.

In a data environment, this is exactly what we are doing. We're basically doing this lineage, this map building, we're doing it manually. You're asking people, you're stopping on every corner to ask, "what's the next step? Where do I go next? Who is the right person to ask?" So it's really about this huge inefficiency, which is getting even worse with the growth of the complexity of our environments.

What we've been doing for the past 30 years, it's not really working anymore. It's not scaling. So we need to change how we are doing things because otherwise, we are seeing a lack of flexibility, lack of agility. And that's exactly where small companies and startups are so successful.

They are small. They are focusing on one thing, so they can go actually really fast. They can change the direction extremely fast. They can introduce new products, new services. With a large company, they cannot really do that because then they take a look at their environment and it takes months to actually even understand what they need to change, which systems they need to touch on.

So this time to market and efficiency is, from my perspective, the most critical driver. Does it make sense?

Enabling agility and flexibility at scale

[00:23:00] Loris Marini: Yeah. Yeah, it does. I was just thinking about this concept that you just explained, how the size or scale actually has an impact on agility. And I'm wondering if, perhaps this is more of a philosophical tangent, the problem of scale is an intrinsic problem?

Like, as in every system that grows beyond a certain scale becomes incapable of reacting quickly, or there isn't such a thing. And even the biggest enterprise can, in principle, if they had the right data environment and the right platform, exhibit the same flexibility and agility that a startup has, in which case they would completely dominate the market. Because they already have that momentum and that mass. And if you add agility, then it's really hard to compete.

But that's not what we see. What we see are these big companies, huge, monstrous sizes that have worked out a formula to become that big. Now the terrain is shifting underneath their feet and they don't have the systems to react quickly enough. And then a new startup comes in and the right one, the lucky one will then disrupt the market and render many of these big giants obsolete, maybe if not obsolete, definitely less dominant in the market.

[00:24:33] Jan Ulrych: No, I completely agree. The fact that large organizations are typically struggling is caused by how they designed the systems in the past. And I believe that if you designed the systems right with the map sort of built-in, then even larger organizations can be very flexible because those issues that you are seeing nowadays with slow responses, lack of agility, that's really caused by the systems, how we designed them, how we are thinking about metadata. And we still have all these systems. It's basically impossible to get rid of them all at once and start on the green fields.

So I believe that the approach should be for new systems to build them with this map built-in and make it one of the key aspects of building these new systems. And for the old ones, we simply need to get the map, so that they can move fast. And even if in the future, we may want to replace these systems, this map can still help us to understand what the system is connected to and what it actually does.

So from my perspective, start thinking differently about metadata, how we are designing the systems so that even huge companies can keep their agility and flexibility.

[00:26:16] Loris Marini: So that's a message for you. If you're working on a startup and listening to this, you have an advantage, an unfair advantage over big, established companies in that you do have the freedom to choose the right systems. And it pays off in the long term to have the right ones.

Step by step into automation

I want to ask you, what is the right system? What can go wrong? So let's say we have an enlightened CEO that walks in and makes the pitch of a lifetime. They talk to the board and convince them, "Hey, this is a priority. We need to focus on the system level and have a map. Have systems that we can query. That you can ask what's going on and the system should be able to tell us exactly what's going on."

So the money, hand on the tap, is poured into it. We're ready to go. What can go wrong and what are the right ways to approach it?

[00:27:08] Jan Ulrych: Great question, and a tough one. The first thing is to go step by step. Take one system that you need to cover, that you're having challenges with, that you're planning to migrate to the cloud, or that is sort of the centerpiece of your environment, and map that one. Once you do that, you can immediately start using the lineage for operational needs, support teams, developers, issue resolution, and so on and so on. And then you add another system. And another one. And this way you're basically growing your coverage with the lineage. But at the same time, you're immediately getting value back.

The second one is really about automation. We already talked about it. Our systems change a lot. They are basically changing on a daily basis. With the democratization of data, we're having business users create their own reports and consume the data. We have data scientists going through the data that they can use. So there's much more use of data. It's much more distributed than it was before. So thinking about doing the lineage manually or creating this map manually or through any sort of process is very likely doomed to fail because with so many users, so many consumers, it's very likely to go out of sync at some point sooner or later.

So from my perspective, automation is really the key. That and the incremental, step-by-step approach are the two main aspects that I can think of.

An investment that keeps on giving

[00:29:08] Loris Marini: Yeah. I actually wanted to ask you, how much can we expect automation to fix the whole problem of data lineage? And is there a people element to it? Because I'm thinking maybe there are different stakeholders that are interested in different things.

For example, I was working on this project once and we were looking at building a recommendation algorithm to see which marketing campaign was performing well, and which one was not. And we had to get some data in, do some manipulation, some modeling, and then produce the outputs, but the data was coming from a number of different places. It wasn't just one data set, one database. And, as a data scientist involved in that project, I did not have the full picture of where the data was coming in. At some point, one of the engineers said, "there's an endpoint, hit this endpoint, HTTP request, get the data, and off you go", which is what happens often.

We have this limited visibility. Even if the systems work well in principle, that information should be accessible, but there are some barriers. Do you have the rights to access that piece of information? Is the culture of the organization working towards empowering people to ask these questions and actually see how data flows within the systems or are they actively blocking it?

Is there an element to this, based on your experience, or are there people that actually engage in developing this type of capability? Do they understand the importance and are they ready to open the doors of the map to anyone that is interested?

[00:30:58] Jan Ulrych: I believe that a cultural shift is required. What you just described happens very often. Some of the customers actually come to us, and I'm sorry to be picking on the data science side, but it's often the data science team that comes to us and basically says, "So we got the data set, but we really need to understand what happened to it. Are we working with the raw data or are we working with somehow cleansed data? I need to know that so that we can build the model properly."

And when we actually ask the source to give us this information, it's a different team. They need the budget for it to actually give us that. Maybe even find it out. Only doing it basically for us because they themselves may not actually know. So this is a great point. And I believe that it's really about a cultural shift and understanding how much time it actually costs us, not having this map.

It may be somewhat hard to measure but at least some guidance could come from, for example, Gartner reports. They actually published an interesting report recently that talks about data scientists spending 90% of their time preparing data, with only 10% doing data science.

So 90% is working with metadata, looking up the metadata, basically stopping on every turn and asking questions about where data comes from, what happened to it? Did you do any cleansing? So from this perspective, you're paying for not having data lineage anyway.

It's just buried in some budget somewhere in data science and the migration project, all of this is buried there. And when we build this data lineage for the project, it's sort of a one-time task, a one-time effort. It's just done so that we make the other person, the other team happy. And you forget about it.

When talking about lineage, what I'm thinking is to consider it as an investment. You're building something that's usable in the future. It's universally available, not just for the one team but for anyone, and it's automatically updated, so that it's not a one-time thing that is unusable after a month because our environment changed. So that's the biggest thing that I see: that mind shift to consider lineage as an investment for the future, and to make sure that when you're building it, it's something that you can reuse.

[00:33:56] Loris Marini: Yeah. And for financially driven leaders, people that really need to see the numbers before jumping in, they might listen to all the philosophy that I mentioned before and go, "ah, who cares about physics and information? In the end, we need to see the bottom line going up."

The ROI of data lineage

So for those folks, if we were to engage in this exercise. The whiteboard, big black pen, and we were to build a model for how you go about estimating the return on investment, well, obviously you need to know how much you are spending now to work out the savings and what the gain is. What is it that makes the process of measuring the current state so hard? Is it because we have too many departments and the inefficiencies are scattered across different levels of complexity?

It's almost a self-fulfilling prophecy. If you had visibility, you would be able to tell what's going on and where the inefficiencies are, but because you don't have a map, you don't have data lineage, you don't know what's going on. And so you're blind, and if you're blind, you can't really estimate the ROI. Is that kind of what's going on, or is there more to it?

[00:35:13] Jan Ulrych: That's a great way to put it. In many cases, these inefficiencies are, as you said, scattered across different teams. If you look at efficiency gain, when you have data lineage, it's definitely different, across different projects and different phases of the project.

So during the initial analysis phase, lineage will save you a lot. It can easily save you 50 to 70%, because that phase is basically about mapping what data we need and who consumes the data. That's lineage. So if you have it, you can speed up the analysis phase.

But even more importantly, what we see quite often, and again, one of the analysts' reports confirms that it's actually very hard to assess a project's scope at the very beginning. So if you think about how we typically approach the project, we say, well, we want to migrate our on-prem data warehouse to the cloud. It has twenty-thousand tables and it's going to take us, I don't know, a hundred people and six months, something like that.

But how do you actually arrive at this estimation? Well, it was very high level, a lot of guesses, maybe some previous experience and so on. Basically, you don't know what's happening, and it actually turns out that it's not until the initial phase of the project is about 70% done that you understand the true scope of the project. And then you actually start adjusting either the scope to fit into your budget, or the budget and the timeline. So this is actually very scary. You're basically going blind into these projects.

You don't understand how long it's going to take, and you just make your best effort to guess and hope that your guesses are not too far off. So again, with the lineage, the idea is that you get these estimates much more accurately. You get them at the beginning of the project so that you can actually calculate your return on investment much more accurately, on the whole project, not just on the lineage itself.

But you also get your timeline much more accurate. So you don't have to go back and explain to the stakeholders that this project is going to take twice as long, it's going to cost us three times more than we expected.

Estimating the ROI of data lineage

[00:38:06] Loris Marini: Yeah, and it's also having that capability of knowing what happens and how data flows, which benefits everyone. So that means that we should stop putting the budgets for things like data governance and data management under one domain. Perhaps it's time that we start to think about this stuff. This is essential for the organization as a whole.

[00:38:37] Jan Ulrych: So that's truly a great point. Sometimes people think about lineage in terms of "how much is it going to save me?" Will it save me one person on my team? Can I let that person go or use them for something else? It turns out that in many cases, you could save a whole FTE and let them do something else.

But in many cases it will save, I don't know, 5, 10% of the time that everyone in the team spends on various activities that are related to actually investigating the data origins or data use. So it's actually quite a small number. On the team level, it does not sort of help you that much in terms of reducing your budget, but that time can be used for something else that the team can do.

And if you actually aggregate it and sum it up for the whole organization, the savings can be really huge. You can simply do so much more, you can move so much faster. If you actually start distributing it onto the specific teams, it can actually turn out that the team will be able to save that much in terms of people that could be reallocated to something else.
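
As a rough illustration of the arithmetic described above, here is a minimal sketch. Only the 5 to 10% range comes from the conversation; the headcounts and the helper name `fte_saved` are made-up assumptions.

```python
# Hypothetical numbers for illustration only. The 5% figure is the low end of the
# range mentioned in the conversation; the headcounts are invented.

def fte_saved(headcount: int, fraction_of_time_saved: float) -> float:
    """Full-time equivalents freed when each person saves a fraction of their time."""
    return headcount * fraction_of_time_saved

# One team: the saving is real but too small to change the team's budget.
team = fte_saved(headcount=8, fraction_of_time_saved=0.05)

# The whole organization: the same 5% adds up to a sizeable number of people.
organization = fte_saved(headcount=400, fraction_of_time_saved=0.05)

print(f"one team of 8: {team:.1f} FTE")
print(f"organization of 400: {organization:.1f} FTE")
```

The same percentage that looks negligible for one team becomes a meaningful capacity figure once it is summed across every team that investigates data origins or data use.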

[00:40:01] Loris Marini: Exactly.

Especially if that team has dependencies on other teams that don't really talk to each other, maybe because of lack of visibility, perhaps because of internal politics or other types of barriers. Even if they had a hundred percent lineage as a team, they would still hit roadblocks outside of the team.

It's one of those cases where you really need to have every part of this chain strong and well-connected with every other part of the chain, that's really like optimizing the system.

[00:40:39] Jan Ulrych: Exactly. And actually, that's a great point, these dependencies. So in many cases, saving even millions on people may not be that interesting because you would still have to implement your system, which actually starts out with lineage and so on. From a C-level perspective, saving a few million may not be a huge priority, but what should be paid attention to is time to market.

Because if money is not a problem, time to market is: if someone does it faster than we do, if someone is able to grasp that part of the market faster than we can, that's the key aspect that I see as critical nowadays. So that may be a way to actually shift the conversation about ROI: not necessarily about saving on people, but saving on time.

Where Manta comes in

[00:41:41] Loris Marini: I don't have a background in biology, but I'm passionate about the field. Perhaps in part, because my wife studied biotechnology. I'm an engineer so I have a very curious mind. I ask questions. We engaged in conversations around how DNA works, the RNA.

And when you study these structures, you realize that the entire evolution that made us possible is based on the information exchanged between cells, between neurons, between membranes. And it's interesting because we look at companies, they really are information processing machines in a way; they're made of people.

So there's definitely a biological and an electronic substrate component to the system that is an organization. And nobody can automate a hundred percent of it and nobody would ever dream of doing that because that would mean that we, as humans, are redundant and we need to get out of the way.

I'm not that keen or eager to see what that future would actually look like. But with the current state we are in at this moment in time, in the end, it's about how quickly you can adapt as an organization. How quickly can you listen? How quickly can you implement the right changes? That's all there is to it. If you think about it from a system perspective at a really high level, it's all about listening, adapting, and proposing the right products at the right time.

So as you say, timing is everything. You have this amazing product but you ship it a year later and you've lost your advantage. You lost that competitive edge.

For those that are listening, that really want to understand how this thing actually works in practice, how does automated data lineage actually work? Is that a thing that you deploy, a piece of software to install? What should we expect?

[00:43:48] Jan Ulrych: There are definitely different approaches on how to automate lineage gathering. The one that we use at Manta is focused on your existing environment. So we do that by scanning the code, your stored procedures and databases, your reports, your ETL jobs, and by reverse engineering the code that is actually responsible for data movement, we derive the documented data lineage.

So that's one approach. It's extremely powerful. It works really well. And that's basically what we do. But as you said, there is not a single approach that would master everything. So in order to get the complete lineage, you often need to augment this lineage scanning with other approaches.
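
To make the idea of scanning code for lineage concrete, here is a toy sketch. It is not how Manta works internally: a real scanner needs a full parser for each SQL dialect, while this example uses two regular expressions that only handle a simple `INSERT INTO ... SELECT ... FROM ... JOIN` statement. The table names and the function `scan_statement` are hypothetical.

```python
import re

# Toy patterns: they only recognize "INSERT INTO <target>" and "FROM/JOIN <source>".
# Subqueries, CTEs, aliases and dialect quirks are deliberately ignored.
TARGET_RE = re.compile(r"insert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:from|join)\s+([\w.]+)", re.IGNORECASE)

def scan_statement(sql: str) -> list[tuple[str, str]]:
    """Return (source, target) lineage edges found in one INSERT ... SELECT statement."""
    target_match = TARGET_RE.search(sql)
    if target_match is None:
        return []  # the statement does not move data into a table
    target = target_match.group(1)
    return [(source, target) for source in SOURCE_RE.findall(sql)]

# Hypothetical ETL code, as it might be found in a stored procedure.
etl_code = """
INSERT INTO mart.daily_revenue (day, revenue)
SELECT o.order_date, SUM(p.amount)
FROM staging.orders o
JOIN staging.payments p ON p.order_id = o.id
GROUP BY o.order_date
"""

edges = scan_statement(etl_code)
for source, target in edges:
    print(f"{source} -> {target}")
```

The point of the sketch is that the lineage comes from reading the code that moves the data, not from looking at the data itself.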

So with new systems, ideally your newly designed systems have metadata and data lineage built into them, so they can simply provide it to you and you can just collect it and add it to the lineage that you have already gathered from the existing environment. So that's another step.

Definitely, nowadays, there are a lot of metadata-driven approaches. I love that personally because you can often actually derive the code instead of writing it manually. And these metadata-driven approaches with lineage built-in, basically give you the information.

But that may not be the complete story. In many cases, there are homegrown systems or very rarely used systems that do not have metadata built in and for which there is no automated scanner available. So in some cases, you may need to do manual lineage, to basically describe what the system does and ingest it, or, depending on the system, there may be some shortcuts to take, like building your own scanner, for example, to process that homegrown stuff.

So it depends on a case-by-case basis what the system looks like. But what we see is that 70 to 90%, depending on the organization and on the scope, can be done through automated scanning. There's typically something like 5 to 20% of systems that already have lineage built in or are metadata-driven.

And there should be something like up to 5% of manually provided lineage. Ideally, it should not be more, because otherwise it takes a lot of manual effort to actually do it and keep it updated. And the way we look at it, anything that is rather static, that is not changing, it's fine to do it manually because you will do it once.

But if there are new systems that are changing often or systems that you're continuously improving, expanding, that's the piece that should be covered by automation, or a metadata-driven approach when it's built-in so that you don't have to worry about it in the future. And it's not a recurring cost to you to actually manage the lineage.
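
The rule of thumb described here can be sketched as a small decision function. The field names, the order of the checks, and the returned labels are assumptions made for illustration; they are not a Manta feature or an official methodology.

```python
from dataclasses import dataclass

@dataclass
class SystemProfile:
    """Hypothetical description of one system in the data landscape."""
    name: str
    has_builtin_lineage: bool  # the system can export its own lineage metadata
    scanner_available: bool    # an automated code scanner exists for it
    changes_often: bool        # the code that moves the data is modified regularly

def choose_approach(system: SystemProfile) -> str:
    """Pick a lineage-gathering approach following the heuristic in the conversation."""
    if system.has_builtin_lineage:
        return "collect built-in lineage"
    if system.scanner_available:
        return "automated scanning"
    if system.changes_often:
        # Manual lineage would become a recurring cost, so automate it instead.
        return "build a custom scanner"
    # Static systems are fine to describe by hand because it is done once.
    return "document manually once"

systems = [
    SystemProfile("metadata-driven pipeline", True, False, True),
    SystemProfile("SQL data warehouse", False, True, True),
    SystemProfile("homegrown loader", False, False, True),
    SystemProfile("legacy archive job", False, False, False),
]

for system in systems:
    print(f"{system.name}: {choose_approach(system)}")
```

The only input that matters for the manual-versus-automated decision is how often the code changes, which is usually easier to estimate than who uses the data.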

A heuristic to decide which dataset to map first

[00:47:21] Loris Marini: That's actually super useful because it's a lot easier to estimate that aspect, how frequently data changes as opposed to estimating who is using it. What are they doing with it? And is it worth including or prioritizing that data, that part of the system?

[00:47:43] Jan Ulrych: I'm sorry to interrupt. That's a really great point. From a lineage perspective, if you think about it, in many cases, it's not even how often data changes, but how often the algorithms and code that moves the data change. Because the data may change on a daily basis but as long as the code that you wrote five years ago is still the same and you're not touching it, that's what constitutes data lineage: the code that is actually transferring the data. And that's also something quite new from the conceptual perspective.

Everyone is used to talking about data, how it changes and how much we have, but from the lineage perspective, we are discussing the algorithms and code that move the data.

[00:48:42] Loris Marini: What you do with that data.

[00:48:44] Jan Ulrych: Exactly. So in a way you can think of data lineage or Manta as this algorithm catalog. You hear a lot about data catalogs that categorize and catalog your data assets, but who catalogs the code, the jobs, the workflows, and transformations that are actually responsible for most of the complexity in our environment? Those dependencies that you mentioned earlier, these dependencies that are complex, that's what is actually coded in the real world.

[00:49:23] Loris Marini: Yeah. It's definitely something that could be, in principle completely automated because it's machine code. So as long as you have the right interfaces and systems to talk to each other, there should be no reason why we struggle with data lineage. But we do, and perhaps it's also because of the fragmentation of tooling and technologies.

Connecting the pieces together to build a map

My understanding is that's what Manta really is: Manta stitches together different systems that would not otherwise natively talk to each other. And it gives you the full map. Is that the right way to understand it?

[00:50:05] Jan Ulrych: Yeah, basically. Manta is showing you the data lineage across different systems and platforms. Not only showing but we also harvest it. So Manta has scanners that catch lineage from your existing environment. So if you have, I guess, a data warehouse, we can scan your SQL code and stored procedures. If you're using an ETL tool, we can scan the transformations and data movement from these ETL jobs. And by doing that, we actually harvest the lineage, but also connect it across these different systems so that you see it unfold.
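
One way to picture this stitching step is as merging edge lists from separate scanners into a single graph and then walking it. The sketch below is a generic graph traversal under that assumption, not Manta's implementation; all system and table names are invented.

```python
from collections import defaultdict, deque

# Hypothetical (source, target) edges, as if harvested by three separate scanners.
warehouse_edges = [("crm.customers", "dwh.dim_customer")]
etl_edges = [
    ("dwh.dim_customer", "mart.churn_features"),
    ("erp.invoices", "mart.churn_features"),
]
report_edges = [("mart.churn_features", "report.churn_dashboard")]

def build_upstream_index(*edge_lists):
    """Merge edges from every scanner into one map: target -> set of direct sources."""
    upstream = defaultdict(set)
    for edges in edge_lists:
        for source, target in edges:
            upstream[target].add(source)
    return upstream

def trace_upstream(upstream, node):
    """Breadth-first walk that collects everything the given node depends on."""
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

index = build_upstream_index(warehouse_edges, etl_edges, report_edges)
origins = trace_upstream(index, "report.churn_dashboard")
print(sorted(origins))
```

No single scanner sees the path from the CRM table to the dashboard; it only appears once the edges from all systems share one graph.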

[00:50:51] Loris Marini: I want to ask you a billion questions, but we're running out of time. I wanted to ask you, operationally, when you deploy the product, these crawlers, these bots that try to be in the system and build a map out of the response that they get, do you have to worry about clogging the system, or is there a kind of intelligence that understands that a database is a production database and you need to be careful in terms of how many requests you make per unit of time and things like that?

[00:51:21] Jan Ulrych: Since we're working with metadata and the code, not the actual data, the volumes are really small, so you don't need to worry about clogging the database. The data volumes that you're processing are really small.

[00:51:40] Loris Marini: Right. And there's no data flow from and to because again, it's all about the metadata. It's not about the actual data.
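
The difference between reading metadata and reading data can be shown with a small, self-contained example. It uses an in-memory SQLite database purely as a stand-in, which is an assumption for illustration: the scanner-like query reads object definitions from the catalog table `sqlite_master` and never selects a row from the business table.

```python
import sqlite3

# Stand-in database with one table and one view built on top of it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
CREATE VIEW big_orders AS SELECT id, amount FROM orders WHERE amount > 1000;
""")

# Read only the object definitions from the catalog; `orders` itself is never queried,
# so the cost does not depend on how many rows the table holds.
definitions = conn.execute(
    "SELECT type, name, sql FROM sqlite_master ORDER BY name"
).fetchall()

for object_type, name, ddl in definitions:
    print(object_type, name, len(ddl), "characters of DDL")

conn.close()
```

The view definition returned by the catalog already states that `big_orders` reads from `orders`, which is the kind of information a lineage scanner needs.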

The future of data lineage

Since we are running out of time, I want to wrap up by asking: who is your typical customer and ideal customer? And, what do you see is going to happen in the future in the field of data lineage?

[00:52:05] Jan Ulrych: Those are lots of questions.

[00:52:06] Loris Marini: Yeah.

[00:52:10] Jan Ulrych: So our typical customer is, I would say, someone who has actually realized that they are having an issue with efficiency and they want to solve it. From the perspective of industry, because of regulation, 50% of our customers are from the finance segment. But it seems there's more and more interest from basically everyone else, whether it's retail, manufacturing, or others who have this issue with lack of efficiency, lack of flexibility, being slow and getting slower as you grow. That's something that's driving these activities, because customers simply realize that they cannot continue managing their metadata the way they've been doing.

In terms of the future, I see the future of data lineage as very bright. So one of the things that has been very interesting recently is the discussion about ethics and ethical data use. A lot of organizations, actually the biggest businesses that are now on the market, do not provide any product but a platform to share thoughts, discuss, and so on. There is no real product that they offer. They really just manage shared data. Now, the question is, how do they do that? Do they use our data or personal data? A few years back no one cared about it, now we have the GDPR, CCPA, and similar regulations.

People will start asking more about how the data is actually being used. So ethical data use is something that I believe will become a huge topic, and you actually need to understand how you are using the data, where you are using it, and what data you're actually collecting to be able to answer these questions and say, "Yes, we are using data ethically and for reasons that make sense and are available to the customers."

[00:54:33] Loris Marini: It's interesting how this whole field of metadata management is kind of expanding the horizon. Instead of just focusing on what directly is standing in front of us, as in our team or our organization, it's really a matter of a much bigger picture of how we operate in an ecosystem, whether we like it or not.

And there are consequences to each action, and we need to be aware of what's happening. First, at least have visibility; then it's up to us whether we want to take action or not. Hopefully, we take the right kind of actions, but at a bare minimum, having visibility is a must. Otherwise, nothing else really matters.

I love this conversation. Yeah. And then are you active on Twitter, LinkedIn? I know we met on LinkedIn, but what's the best way to follow you?

[00:55:26] Jan Ulrych: I don't do Twitter. LinkedIn is the best one. And Manta's LinkedIn is an even better one because a lot of my colleagues are contributing as well. So that's a good one to follow.

[00:55:40] Loris Marini: Oh, fantastic. I'll definitely check that out. And, hopefully, we'll be able to see each other again in person because I am planning, fingers crossed, if everything goes well, I should be in Europe, around May. And, I haven't checked exactly what's going on when it comes to data conferences, but if there's something hot going on, I'll make sure to join.

And hopefully, we can catch up for a Czech beer. Or, an Italian coffee.

[00:56:10] Jan Ulrych: That's actually really nice. I'm looking forward to that.

[00:56:12] Loris Marini: Awesome. Thanks. Yeah. And then enjoy the rest of your day.

[00:56:15] Jan Ulrych: Loris, thanks a lot for the opportunity and for having me, it was a pleasure. Have a good day.

[00:56:21] Loris Marini: Pleasure's mine. Thanks.
