Episode 14

Why data science needs data management

Loris Marini - Podcast Host Discovering Data

Data science is about modelling the right business problems with reliable data and domain knowledge. Without a solid data management program, this simply won't work.


Episode transcripts

Introduction

Loris: [00:00:00] So today is yet another episode of The Data Project, and I want to try a bit of a different introduction. Imagine that it's your last day of uni. Over the past five years, you've been doing research in Statistics and Mathematics, and you absolutely loved it. You are in fact so passionate about data that you get to start a new job as a junior data scientist in an organization whose business model directly depends on data trust.

You're still solving problems using data, in a way, but in a very different way. And you learn this the hard way. But time flies, and it's now been more than seven years since your day one. Sometimes you wonder, "If I could time travel and leave a message for my past self, what would I say?"

Well, you're about to see this experiment unfold live with my guest and friend, Vlad Ardelean. Vlad is a lead data scientist at GFK — Growth From Knowledge — a company founded in 1934 in Nuremberg. Vlad has a background in Mathematics and Economics and has a PhD in Statistics.

Vlad is a father of three boys - and I don't know how you do that! He loves running and climbing and he's interested in algorithmic bias and the ethics of what we call “AI”. Today we talk about the lessons he learned on the job and the importance of having clean and standardized data to deliver timely insights that you can ultimately trust. So let's get to it, Vlad, welcome to The Data Project.

Vlad: [00:01:44] Thank you. And I must say, I don't know how I manage with three boys. You just roll on and hope for the best.

Loris: [00:01:52] You just go with the flow, I suppose. Yeah. Sometimes I complain, and I only have one girl. So imagine times three, times 1.5.

Vlad: [00:02:02] But age makes a difference. I think your girl is a bit younger than my youngest boy. I think that's a key difference. But anyway, we're not talking about being a dad. We could do that, but maybe next time.

About Vlad & GFK

Loris: [00:02:15] Heaps and heaps to say about that. Okay, so where do you want to start? Perhaps give me a bit of background. What is GFK? How did you end up there? What are you doing?

Vlad: [00:02:25] Yeah, let's do this. So before we start, I think the introduction was spot on. I was doing my PhD in Statistics, and at that time it was a bit harder to compute on multiple machines. Nowadays you just specify the number of cores that the algorithm should use and that's it.

But we did it with the message passing interface (MPI). You had to call each node individually, send the data, and wait for the results. So this was really, really painful. And then GFK asked me to join. As you said, GFK is Growth From Knowledge, and we try to give our clients insights into what they should do.
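
For readers who haven't met it, MPI is the message passing interface Vlad is referring to. Here is a minimal sketch of the contrast, with invented data; the first half only runs under an MPI launcher (e.g. `mpirun -n 4 python script.py`) with mpi4py installed, and the commented lines show the modern one-argument equivalent:

```python
# Then: explicit point-to-point message passing with mpi4py.
# Run under an MPI launcher, e.g.: mpirun -n 4 python script.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    chunks = [[1, 2], [3, 4], [5, 6]]                     # invented data split
    for dest, chunk in enumerate(chunks, start=1):
        comm.send(chunk, dest=dest)                       # call each node individually
    results = [comm.recv(source=s) for s in range(1, 4)]  # ...and wait for the results
    print(results)
else:
    comm.send(sum(comm.recv(source=0)), dest=0)           # stand-in for the real work

# Now: just specify the number of cores and let the library distribute the work.
# from sklearn.ensemble import RandomForestClassifier
# model = RandomForestClassifier(n_jobs=-1)               # -1 = use all available cores
```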

And GFK has three main pillars. The first: what are people doing online, which we call media measurement. Which websites are they using? That's one pillar. The second pillar is what fast-moving consumer goods are they buying? And I see you're puzzled. Fast-moving consumer goods are yogurt, frozen pizza, everything which basically goes into the fridge or that you eat directly.

So fruits, vegetables, yogurt, everything which is fast moving. And we have that at an individual level. We have a panel, and these people tell us, "We have bought this frozen pizza, and we prefer this pizza over that pizza because it tastes better, value, whatever". And then the third pillar, where I've worked for seven years now, is that we know what people are buying in an electronics superstore, so at a point of sale.

POS data in 80 countries

And we do not know this on a personal level. So we do not know that you, Loris, have bought an Apple phone, but we know how many iPhones, Samsung phones, whatever, have been sold. And then we can help the brands develop, see how they are doing in different markets. What is the key feature for a smartphone or a fridge?

Keep in mind that GFK is active in 80 countries, so we can track brand behavior in 80 markets of the world. And we are tracking roughly 150 product groups. A product group is, as I said, smartphones, fridges, washing machines, cooling devices, fans, but also very exclusive product groups such as kimchi coolers, which are obviously only relevant for some parts of the world.

Loris: [00:04:56] Wow, that's so interesting. So there is so much to track. I think the first question I'm going to ask you is: where the hell do you get all that data?

Vlad: [00:05:07] So let's focus really on the third pillar, the point-of-sale data, right? Because this is where my expertise lies. Basically, we have contracts with the retailers. The retailers send us the data, and they are incentivized by GFK because they get insights from another GFK product, where they can compare themselves with other stores or with a region.

So not on a store-by-store level, but on a store-versus-region level. So they can say, "Okay, how is my inventory different from other inventories?" And they can answer these kinds of questions. And for that-

Loris: [00:05:43] And so does the contract include the schema, how the data comes in? Or do they just give you whatever they have, so that you have to clean it up and structure it?

Vlad: [00:05:53] I would say they give whatever they have, and we put that into a contract.

Loris: [00:05:58] Wow, nice. So you've got a massive data management function within the company, I suppose.

Vlad: [00:06:04] Yes. So the first step, and one of the key steps actually, is to standardize the data from each retailer, because you cannot enforce everything. There are mandatory fields, yes. Obviously, we want to know what they have sold. But then there are other optional fields, and that makes it a bit difficult to track.
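
As an aside, the mandatory-versus-optional split Vlad describes is easy to picture in code. Here is a minimal sketch; the field names are hypothetical, since GFK's actual schema is not public:

```python
# Hypothetical field names; the real GFK schema is not public.
MANDATORY = ("item_text", "units_sold", "period")
OPTIONAL_DEFAULTS = {"price": None, "promotion": None, "store_region": None}

def standardize(record: dict) -> dict:
    """Standardize one retailer record: enforce mandatory fields, default optional ones."""
    missing = [field for field in MANDATORY if field not in record]
    if missing:
        raise ValueError(f"retailer feed missing mandatory fields: {missing}")
    row = {field: record[field] for field in MANDATORY}
    for field, default in OPTIONAL_DEFAULTS.items():
        row[field] = record.get(field, default)  # optional fields vary by retailer
    return row

print(standardize({"item_text": "SAMSUNG GALAXY S10 128GB", "units_sold": 12,
                   "period": "2021-03"}))
```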

When Data Management Means Business

Loris: [00:06:26] Yeah. So I suppose you consume this data downstream, because, and correct me if I'm wrong, data science comes downstream of data management. Is there someone in your team that looks after the data quality, or is it a mixed role, a mixed responsibility?

Vlad: [00:06:47] So, just to put everything into perspective: it's not a five-person shop adjusting the data. This function, which we call the data platform, is 100 to 200 people. That's a lot of people, right?

Four blocks in the Data Value Chain

We have four basic blocks. The first is data ingestion, which means standardization of the data.

Then the second step is the translation of the data. As you can imagine, if you just get a dump out of a warehouse system, you need to translate and standardize it, because we want to be talking about the same item, right? And an item can be a tricky thing. Think about a smartphone. It can be a Samsung Galaxy S10, right? But then the details: how many gigabytes of RAM, what is the storage size, and so on. And the same holds true for any kind of item. If you think about a panel television, there are a lot of different ones that deviate only in screen size or resolution. So you want to carry that with you, so that you can really distinguish one item from another.

The third step is to get a full market view. And the fourth step is to make it usable and actionable for the clients. And what you would typically expect is that data science comes in at the fourth part, right, to make it accessible to the users.
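
To make the four blocks concrete, here is a toy rendering of the chain as four composed functions. All names and logic are hypothetical placeholders, not GFK's implementation:

```python
from collections import Counter

def ingest(raw_feed):
    """1. Ingestion: standardize each retailer's schema into common field names."""
    return [dict(record) for record in raw_feed]          # placeholder standardization

def translate(records, catalog):
    """2. Translation: resolve free-form retailer text to one unique item ID."""
    return [{**r, "item_id": catalog.get(r["item_text"], "UNKNOWN")} for r in records]

def extrapolate(records):
    """3. Full market view: extrapolate from covered stores to the whole market."""
    return records                                        # placeholder for the statistics

def deliver(records):
    """4. Delivery: aggregate into something a client can act on."""
    return Counter(r["item_id"] for r in records)

def pipeline(raw_feed, catalog):
    return deliver(extrapolate(translate(ingest(raw_feed), catalog)))

catalog = {"SAMSUNG GALAXY S10 128GB": "GX-S10-128"}
print(pipeline([{"item_text": "SAMSUNG GALAXY S10 128GB"}], catalog))
```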

Domain Knowledge is Key

But this is, I think, where GFK excels. They said, "This is actually too late in the process, because there are so many steps that data science needs to understand". And now, coming back to the initial question, what I would tell my former self is: "Yes, you have studied statistics, you have the basics of machine learning, but you have no clue about domain knowledge. So whatever models you are applying, you don't understand if they are good or bad and what you need to correct for."

Do you need to correct for bias? What kind of seasonality do you have? That domain knowledge is key, and I think GFK figured it out and said, "Let's add data science to each and every part, so that we can have data science steps in between to support the process, to make the right changes at the right time, and not one change at the end of the process."

So it's about transparency and bringing domain knowledge to the individual subdomains.

Closing the Data Science Loop

Loris: [00:09:34] That's fantastic. So does that mean that you track the link between a number or a column in a database, at the very beginning when the data comes in and is structured and standardized, and the real world, what's actually happening inside the shop? Do you actually go physically into the shop and talk to people? How do you get that domain knowledge?

Vlad: [00:09:55] As a data scientist, you mean? No, I never visited a shop. I would love to, but we get the data in a database format, basically. What I meant by domain knowledge is that we are part of these specific subparts of the process. We talk to the people doing the day-to-day jobs. We are part of the software development teams.

This is what I meant by getting domain knowledge. I think it would be interesting to visit a shop and see that, but ultimately it's just data being sent from A to B. We give recommendations to larger brands; we can make inferences about prices, so we can infer how good or bad a promotion would be. So that is one step.

The second one is we can tell them: look at this feature. Say we're going for smart ovens, right? Up until, I would say, five to ten years ago, nobody really talked about smart ovens. An oven was there, it had to have power, and maybe self-cleaning. But now you can connect it to your smartphone, and you have these kinds of fancy functions where it can determine if the meat is ready or not.

So brands see, okay, well if this feature picks up the pace they can incorporate it into their design process.

Trusted Data in a Trust Network

Loris: [00:11:17] I see.

Vlad: [00:11:17] Does this make sense?

Loris: [00:11:19] Yeah, yeah it does. And so you're closing the loop basically for the brand that sells it.

Vlad: [00:11:27] You can basically compare two ovens and see, okay, why does the other one sell more? Because from a technical perspective they are the same, but there's this feature that is valued higher by the client, or by the customer, because it has four more functions.

Loris: [00:11:46] I see. So from the perspective of the retailer that doesn't really know the full dataset around each product, or maybe has access to that data but doesn't actually know which features matter, right? You come in and basically do a reduction job, where you're saying: out of those hundreds of columns, you've got to focus on these three, because these are the ones that drive revenue.

Vlad: [00:12:11] And we know the full market picture. So while larger retailers that are multinational might have a good picture of how they're standing in each country, they still only know their own footprint in that country. We can give them the full 360-degree view, where we say, "Okay, these are your competitors. This is going on. Be careful, there's this item that has this feature and it's a killer feature. So please be aware of that".

So we can give them a data story, basically, to give them insights. That's the job. But for that, you need this good data foundation, right?

Even the best data can be "bad" data

Loris: [00:12:49] Tell me a little bit about that. What do you mean by good data foundations here, from the perspective of the scientist? I'm asking this question because a lot of people who are not used to the data science process, or in general the scientific process, think or have the impression that as long as you have some data as an input, with some magic AI you can get insights at the output.

Part of what I'm trying to do here is explain why that is not the case. So I'm asking you, since you have way more experience than me and you're working with really cool datasets: what happens when the data is not structured? And what does it mean for data to be of good quality?

Vlad: [00:13:31] Let me start with an example from GFK. I looked into panel televisions, or flat screens. Does a brand follow a time series, or does an individual panel television follow a time series? Can we look for macroeconomic patterns in sales or price, is there any correlation? And then I saw this weird behavior.

In one particular month we got this weird outlier, a large deviation from the prediction of the model. The price was too low and the sales were too high. I said, "Okay, this is interesting. What happened here?" I called my colleague and he said, "Yeah, sure, I can explain it to you: Media Markt had a promotion and they gave back the VAT (value-added tax)." So suddenly it was correct that the price was too low, because the VAT was deducted.

Loris: [00:14:30] The tax component?

Vlad: [00:14:32] That was returned to the client, and people bought more. So what I mean by good data is that there was somebody in the company who could explain to me, "Okay, you found something which is interesting, but in fact it's not interesting, because there's a root cause for it." If I went to the client and said, "Look, guys, we found this out", they would laugh at me and say, "Okay, well, we know that. So what is so cool about data science?" Right?

Loris: [00:14:58] Here comes this smart alec with a PhD in anomaly detection and time series. We knew that!

Vlad: [00:15:09] So this was my first answer: good data means you understand why the data is changing. In this example with the value-added tax, if I had known whether this happened regularly or irregularly, then you would have a promotion flag. But how do you find promotions?

So this knowledge makes a model deeper and more structured, and it would help us answer more data management questions, because then we know that if we want to model the relationship between price and sales units, we need to correct for special events, such as soccer championships, because that is when people are buying things.
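
For illustration, here is a quick numerical sketch, with invented numbers, of how an event flag turns an "anomaly" into an explained effect. The VAT arithmetic first: at Germany's 19% rate, giving the VAT back on a 499.00 gross price means an effective price of 499.00 / 1.19 ≈ 419.33, so the low price and high sales were both real:

```python
import numpy as np

# Invented monthly unit sales; month 4 is the "VAT back" promotion month
# (gross price 499.00 at 19% German VAT -> effective 499.00 / 1.19 ≈ 419.33).
units = np.array([100., 104., 110., 180., 106., 112.])
promo = np.array([0., 0., 0., 1., 0., 0.])           # the promotion flag Vlad mentions

# Naive anomaly check: month 4 sits far outside the usual range...
baseline = units[promo == 0]
z = (units - baseline.mean()) / baseline.std()
print(np.round(z, 1))                                 # month 4 is a screaming outlier

# ...but regressing units on the flag explains it: the "anomaly" is
# just the promotion effect, not something to alert the client about.
X = np.column_stack([np.ones_like(units), promo])
beta, *_ = np.linalg.lstsq(X, units, rcond=None)
print(np.round(units - X @ beta, 1))                  # residuals shrink to noise
```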

We have a rich data management function that needs to track this as well. And then we'll find out something else. Honestly, even if I sat down for ten hours and thought about all the cases I'd want to cover, I would maybe come up with 20 or 25 of them, right? But that's the scientific part of the process.

What is a clean dataset? Think about Kaggle. I like Kaggle, but I think these Kaggle competitions oversimplify the data science process, because there you get a dataset that is super clean and more or less perfectly annotated.

And I admire people who have the time and really can do cool things with data science, engineer very cool features, bring together different kinds of kernels, different notebooks. This is all very, very cool. But this, I think, is the last 5 to 10% of the way, right? 90 to 95% happens before. How did the data get to this stage? Why has it been cleaned? What has been removed, and why? That's the core job: to understand if this makes sense or not.

Why Science needs Data Management

Loris: [00:17:23] Yeah, so gather that context around the number. In this sense, there are a lot of people who think that if you have a strong data governance team in place, and you've got policies and procedures, and there's clear ownership, and there is a centralized team that looks after all the data, then data scientists can save that 80% of the time they would normally spend cleaning up datasets.

Do you agree with that, or do you still feel like there is a big component of the job of a data scientist that has nothing to do with writing code, even if the data is well annotated?

Vlad: [00:18:10] I would go for the latter, but let me answer the first part of the question a bit provocatively. Do you remember the role of the data steward? How many do we see? How many do we know?

Loris: [00:18:27] Very few. I don't know many. It's seen as very boring and unsexy, somehow.

Vlad: [00:18:31] So that's my fast answer to that. It's an important job, but there's all this hype (and now it sounds like I'm very old, but I have been in the data science game for 10 years now). Five years ago, there was this hype about data lakes: you just plug everything in, have a data steward watch over it, and everything will be curated manually. And now you have the derivative term, the data swamp, which is basically a place where nobody wants to live. I've heard "data lakehouse" now as well.

Loris: [00:19:04] Yeah, there's a bit of a mess. We're not really good at coming up with terms, are we?

Vlad: [00:19:11] I think the biggest problem is that any data science or data management function looks at it in a retrospective way.

Loris: [00:19:20] Exactly. I 100% agree.

Vlad: [00:19:22] We know what we need now, but five years from now we will not only need to track promotions but other stuff as well. So either you change your database system, and then you have a structural break because some of the features will not be available from then on, or you change it retrospectively and do some imputation, and then a feature such as promotion is guesstimated. So I'm not sure what is good and what is bad.
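
A small pandas sketch of that dilemma, with invented data: a field added later either leaves a visible structural break or has to be guesstimated backwards:

```python
import pandas as pd

# Invented history: "promotion" only exists from 2021-03 onwards.
old = pd.DataFrame({"month": ["2021-01", "2021-02"], "units": [100, 104]})
new = pd.DataFrame({"month": ["2021-03"], "units": [180], "promotion": [True]})

combined = pd.concat([old, new], ignore_index=True)  # old rows get NaN: a structural break
print(combined)

# Retrospective imputation removes the break, but the flag is now a guesstimate.
combined["promotion"] = combined["promotion"].fillna(False)
print(combined)
```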

I think that data management is always reacting in a retrospective manner and you're never done.

How Much Chaos Can You Afford?

Loris: [00:20:00] You're never done. You just have to keep managing data, because if you leave it by itself, it's going to do what things do in the universe when you do nothing: increase entropy, increase chaos.

And eventually you have a very expensive database that is useless, because you don't even know how to query it. But another interesting aspect of that retrospective dynamic you mentioned is this pattern in data: we're all trying to hand the responsibility for the hard, boring, less rewarding job to someone else. Data warehouses have been around for a long time, you know. They were based on transactional databases and designed to serve dashboards: a very optimized, low-latency type of workload where you know exactly what the query format looks like.

Then obviously we realized, "Hey, we want to do more analytics", and conventional data warehouses don't do the job anymore. We want more elasticity, more flexibility. So we introduced this concept of data lakes. But a lake by itself has no structure, and that's what a lot of people got excited about: finally, we don't have to worry about structuring and standardizing the data. We can just dump it there, and then one day, if we need it, we'll have it.

The problem is that you're cutting a corner, and you're doing yourself a huge disfavor in the future, because knowing whether that data is going to be relevant, and understanding the relationships within the data and everything else, five years down the track, or even just a month later, is going to be really hard.

It feels a little bit like the concept of borrowing from the future in economics: how much can you borrow from your future self in terms of structure, and how much chaos can you afford? I think there's no free lunch there. What do you think?

Vlad: [00:22:04] GFK is a more mature company. We have been around for 85 years, so we basically know what is relevant, though that can also be a fallacy, right? But I think if you are a more mature company, it's easier for a data management program to come into play. At least the path you're on is a bit clearer. There's still uncertainty, but at least it's clear we're going north, not south.

As a young company, a startup, you do not know what you will need. You have no product. It's like, "Okay, do we go north, south, east, west?" So I think it's even vaguer what you will need. And then: are you a data company? Do you need to make data easily accessible to other companies? Or do you need to be super protective, and make sure that nobody gets access to the users' data, because they pay you for a service?

So in that way, I think what your data management function should do really depends on your business model. But I liked that you said it's a program: it's never really finished.

Loris: [00:23:16] No. Another thing that interests me is the topology of information flow within the company. I'm talking in slightly more abstract terms now, speaking about information as an intangible asset, probabilistically defined via the inverse of the probability of knowing what's going to happen next.

As you know, that is the idea behind the mathematical definition of entropy. If I take that definition to the level of the organization, and think about what new pieces of evidence (aka information) do as they propagate through it, I think about the difference between a centralized and a distributed data management style.
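
For reference, this is the standard definition Loris is gesturing at: the information content of an outcome grows with the inverse of its probability, and entropy is its expected value:

```latex
I(x) = \log_2 \frac{1}{p(x)} = -\log_2 p(x),
\qquad
H(X) = \mathbb{E}[I(X)] = -\sum_{x} p(x)\,\log_2 p(x)
```

The rarer the next event, the more information observing it carries; a system where anything could happen next has maximal entropy.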

Distributed or Centralised Data Management?

In an organization like GFK, do you think it's even possible, let alone more efficient, to have a central team (a core team) that does all the structuring and all the annotation and curation of the data, with the people who consume that information entirely reliant on that centralized team?

Or should we, and this is an open question because I don't know how GFK works, make everyone responsible for the data, no matter whether they are at the edges of the system or deeply involved in that data governance council or team? It's a bit of a long-winded question.

Vlad: [00:24:58] It's a long question. A good question. GFK has programs such as the GFK university, a kind of small Coursera, that teach people what is done and how to use this data.

But I think it's important that there's one single source of truth. This is GFK's business model. Imagine me as a data scientist: I have access to the data, I have done an analysis, and we found this weird anomaly in a particular month. If I went straight to the client with that, it would harm GFK's reputation. So I think for GFK it's good that there's a centralized business unit taking care of the data. And then, on the fringes, people can do things with the data. In a startup environment, this could be totally different, because what you want to do is not set in stone.

You have an idea, you collect data, and you create business value when you add it to a core dataset. And I might have a different purpose: I start collecting different kinds of data and try to merge it. I think it's okay that our two fringes do not need to connect to each other, but they should connect to the core business.

Especially if you follow the startup "fail fast, fail early" principle, I think it's cool that there is a small core set and everybody can add data to it. At GFK we also run these kinds of experiments, with a data partnership program where we ask, "Okay, how does this fit into our core business model and our core data asset?" But it's a bit more centralized; it's not free-floating.

Loris: [00:27:14] Yeah.

Vlad: [00:27:14] My answer was quite long.

Data Management in Startups

Loris: [00:27:17] Yeah, I don't think I agree 100% with the startup bit. And I know that economically it doesn't make sense to invest in a full-blown data management program. But if your data is the raw ingredient for extracting information, and information is your fourth intangible asset, then even if you're not sure what you want to do as a business, it still makes sense to look after that asset.

So in this sense, it might not be worthwhile to spend a huge percentage of the data team's resources on management, but there are some principles that are really useful, and it's better to incorporate them very early. I have a personal example I can share. You know, Data Foundations is the company that I started a bit over six months ago.

And I am a single person, right? There's no one else. And I do have a lot of data, like any small business: contacts, people, projects, tasks, all of the things that make up the operational side of the business. Even the podcast itself: The Data Project has a dedicated section on Notion, as you know if you've been a guest on the show, and it's linked to ideas and notes.

The ability to know for sure what is happening at any given moment and connect it to everything else means I don't have to remember the whole structure of my data and information. It's enough for me to remember one bit, and then, no matter where I enter the system, because of the power of relations I can find everything else.

Now, you could argue that because I'm not an established business, because I'm definitely in startup mode, I shouldn't worry about data management. But I realized I wasn't doing it, and in January 2021 I introduced this function, and it's now a core part of what I do. And I've found incredible benefits.

And I'm wondering: as a startup, can we incorporate the learnings from the field of data management? Not necessarily by establishing a council and appointing five people full-time as stewards of your dataset, but by learning the framework (the thinking) and applying it to the day-to-day job. I am going to argue that your future self will be thankful if you start on the right foot.

Vlad: [00:30:12] You're right. And maybe I was a bit too quick to speak for startup companies. Because basically, if you think about a startup, what you do first is create a culture, right? If you have this kind of function, you're not assigning five people full-time; you do not have the resources. But it is still part of your "DNA". And so you set up an early data culture. That's the point. It means people know a bit, or at least know where to look for information, to see where data comes from and how they can connect it. And I think that's maybe the key point in creating a data management function early.

Loris: [00:31:00] Absolutely, and choosing where to focus is a challenge. But that's a challenge in enterprises as well. It is the job of the data governance team to understand which datasets matter, and which ones should be particularly well curated. But that requires strong communication between the business and the data folks.

And this is an important point that we wanted to touch on in this episode, because there is a huge disconnect. Just a couple of weeks ago, a prospect came to me with a job description and asked me what I thought about it. It was the description of a unicorn: someone who really doesn't exist. They wanted this person to do everything. And most importantly, the culture was young, scrappy, hungry, and ready to do everything. Which I understand in a startup environment, but in this case it was actually a scale-up.

In a scale-up, you need that drive. You need people who are really motivated. If you're sitting on a chair waiting for someone to tell you what to do, that might not be the right environment for you. But in the data domain, being too scrappy can also backfire, and finding the sweet spot requires a lot of communication between the business and the data team.

What is the vision? What is it that we want to do? It's true that maybe our strategy is going to change over time, but people don't change over time, you know? The customers that are paying today are real people. They exist, they signed up, they put down their credit card. We already have a relationship with these people. And if you know that the dataset that represents those human beings can be trusted, surely you will find a way to do something in the future, even if you pivot. So yeah, there's no question there, it's just a thought that I wanted to share.

Scaling Data Science - The Human in The Loop

But diving a little bit deeper into the business-and-data divide: what are the lessons you learned there?

Vlad: [00:33:10] I came from academia and from research; I thought I would do research forever. So I think that's the first one. The second is: domain knowledge is key. We cannot say, "Look, guys, I have a PhD in whatever. Let me just do the job better, and you will be replaced." Rather, go in and be a bit more open, and talk to people. What are they doing?

What we do quite well at GFK is bring AI and HI together, artificial and human intelligence, which is a bit buzzwordy. But what we do is build proposal systems for a human, because there's a lot of human work in the loop to create the data. Starting with quality checks of the data: is the amount of data we get in line? Is the volume in line?

The translation part is also human, and then you can also use machine learning to help automatically translate retailer text to GFK items, and then for quality checks as well.

Loris: [00:34:09] Sorry, what was the text? What did you say you wanted to translate?

Vlad: [00:34:10] The retailer text.

Vlad: [00:34:12] We get the retailer text, and we want to translate it to a unique GFK item, so we're always talking about this Samsung Galaxy, whatever, with the same gigabytes and the same RAM. What does AI excel at? It does boring, repetitive tasks quite well. What it does not do very well is edge cases.
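
For a flavor of the problem, here is a stdlib-only sketch of such a proposal system. The catalog, threshold, and matching rule are all invented; a production system would use much richer features than plain string similarity:

```python
from difflib import SequenceMatcher

CATALOG = {                      # invented standardized items
    "GX-S10-128": "Samsung Galaxy S10 128GB 8GB RAM",
    "GX-S10-512": "Samsung Galaxy S10 512GB 8GB RAM",
}

def propose_item(retailer_text: str, threshold: float = 0.6):
    """Propose the best-matching catalog item, or None to route to a human."""
    score, item_id = max(
        (SequenceMatcher(None, retailer_text.lower(), desc.lower()).ratio(), item_id)
        for item_id, desc in CATALOG.items()
    )
    return item_id if score >= threshold else None

print(propose_item("SAMSUNG GALAXY S10 128 GB"))   # -> "GX-S10-128"
print(propose_item("LG OLED55 TV"))                # -> None (human review)
```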

There was a very interesting article about the return on investment of AI, and it is not as high as for software, because software is scalable. AI is not as scalable: the more you scale, the more computing power you are using, but also the more edge cases you are hitting. So you have two ways out of this dilemma. First: add more resources to cover all the edge cases. Or, what GFK is doing: add humans in the loop.

So, say we have machine learning that assigns probabilities to outliers. Whatever we say is at least 80% likely to be an outlier is handled automatically, but the edge cases are done by a human. And this has two benefits. First of all, humans feel more valued, because they are not replaced by machine learning, and the domain knowledge stays in the company. And you also get a feedback mechanism, so we can check if the algorithm runs wild and amok and suddenly picks up a pattern it should not pick up. So this is where GFK basically did a good job, I think, in the last years.
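
A minimal sketch of that routing rule, with invented scores; the 80% threshold is the one Vlad quotes, everything else is hypothetical:

```python
def route(outlier_probability: float, threshold: float = 0.80) -> str:
    """Auto-handle confident calls; queue edge cases for a human reviewer."""
    return "automatic" if outlier_probability >= threshold else "human_review"

scores = [0.97, 0.85, 0.61, 0.42]                  # invented model outputs
print([(p, route(p)) for p in scores])

# Human decisions on the queued cases double as labeled feedback,
# the mechanism that catches an algorithm "running amok".
```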

Loris: [00:35:48] That’s amazing. Closing the (data) loop is an outstanding challenge for many companies.

Vlad: [00:35:52] And as I said, at first I did not consider humans in the loop. "We can do everything better." And this is total crap.

Loris: [00:36:03] That was your just-graduated self?

Vlad: [00:36:05] Yes. I had never done their job, so why would I take it away? That was a horrible experience, but rightfully so, and I think I learned. Now the first thing I do is go to them and say, "Look, guys, what is your actual problem? What do you want to get rid of?" Because it's a perspective switch, right?

Data Science should Support not Replace

The key switch was that we are not saying, "Hey, we want to get rid of your job", but rather asking, "What part of your job do you want to get rid of?"

Loris: [00:36:40] (...) So I can automate it.

Vlad: [00:36:40] Right. What can we automate for you?

Loris: [00:36:45] And there are cases where you cannot automate it, I suppose. Not everything can be automated.

Vlad: [00:36:49] No, no. And that's good, I mean, that's acceptable. As I said, you can either throw a lot of resources into finding all the edge cases to automate everything, or just keep humans in the loop, with the added benefit of having a feedback mechanism for your algorithm, which I think is key to monitor. And this would be another topic for some episode: the rise of MLOps. How do you put things into production? How do you monitor them? I think it is paramount that we have these kinds of feedback loops.

Knowing how to run experiments takes basic statistical knowledge, and for that you do not need a PhD, right? It's good that you know the basics of how to run experiments. If you want to run run-of-the-mill algorithms, and I'm saying this in a very positive way: your XGBoost, your support vector machines (SVM), your logistic regressions, I think you do not need a PhD, because it's just applying the model. Then communication skills are way more important: how to talk to people and how to make this run.

If you are on the other side and have a completely new problem that has not been solved before (computer vision comes to mind, or some kind of NLP problem), then a more research-based approach would definitely be good. But in my experience, for 80% of the tasks you do not need a PhD. And I think the only thing the PhD was beneficial for, besides personal growth and enjoying that time, is that you get borrowed trust. Does that make sense? People automatically trust you more.

Loris: [00:38:55] Hmm, I see. Yeah, you have more credibility.

Vlad: [00:38:57] Yes, thanks. You have more credibility in a company, but you have to back that up. That's what I meant by "it's borrowed": the PhD doesn't hold forever. It will hold you for the first one or two projects, and then if you can deliver, that's fine. But if you cannot deliver, it goes away.

Loris: [00:39:19] Where does the scientific mindset come into this picture? When I was in a position, for a brief time, to hire for the data team I was leading, I was looking for people who could think in abstract terms, who could really use systems thinking to understand and solve a problem. To my surprise, it was not easy to find that skillset. And I've found that all of the people I interact with who come from a scientific background have that ability to go one step higher and think abstractly.

On Hiring Data Scientists

Do you think that this is important for the type of problems that you're solving at GFK? How would you go about hiring?

Vlad: [00:40:17] So we have hired a mix: I think 25% with a PhD and 75% without, I would say. In recent years we have hired fewer PhDs. But then again, all our heads and lead data scientists have a PhD.

Loris: [00:40:37] Hmm, I see.

Vlad: [00:40:38] Most of them, I would say. It also depends on how you do your PhD, but it gives you a certain kind of drive, and you learn how to structure yourself. I think that's one of the key learnings from doing a PhD: how to structure yourself, how to do experiments.

Loris: [00:41:00] So independence, you mean.

Vlad: [00:41:02] Independence, yes. And basically being able to be pointed in one direction and then survive on your own. Your PhD advisor says, "Okay, we'll do this", and you say, "Okay, I have to do this, this and this. I have to build a tent, I have to make a fire, I have to get food." So you know how to survive in the wild. But this skill, I think, is not unique to PhDs. As you said, it's about self-motivation, about structuring yourself and structuring others, and other people can do that as well.

Loris: [00:41:40] Absolutely. Yeah. And there are many different domains. I mean, if you've ever tried to start a business, I'm sure you have a lot of self-drive, even if you didn't spend four years being underpaid doing research at a university.

Data Science as a Business Function

I'm going to take a couple of steps back now, back to where we started: the business focus, ensuring that as we manipulate data we are fully aware of what the business wants. How does your team, and how does GFK, do this? What's the culture there in terms of facilitating the connection between data and business?

Vlad: [00:42:19] So this is a very interesting point. What we have is this data platform program, which is itself very, very cool. For each of the domains, you have a delivery lead, a tech lead, an architect lead, and a data science lead. So data science is part of each domain, and I think it's key that data science is part of the leadership team. And it's not the only time I've seen that: other projects and other domains, larger projects, almost programs that run over one or two years, have had data science in their leadership team.

And I think that is very interesting that you put this high emphasis on data science as well.

Loris: [00:43:08] Yeah. It connects to the principle of design thinking, where you try to bring as much awareness as possible early into the process, because things will inevitably drift. And if you don't know what the implications are on day one, you might end up making assumptions that are no longer valid when you actually get to the implementation stage.

Vlad: [00:43:32] Or even worse. Some years ago, data science/AI/ML was the silver bullet. Everybody thought, "Okay, we just need one data scientist and that will solve all our problems, miraculously." And that was basically a problem for many companies that hired just one or two data scientists, hoping they would generate value.

These two data scientists, as good as they might have been, did not find the right data management structure. What could they do? They had no domain knowledge. They were expected to solve business problems on a weekly basis. I had one interview where they said, "Okay, you have this backlog of 10 projects; we expect you to do them in four months." How do you expect me to do 10 projects in 4 months? If the data is there, I can apply basic algorithms and we can see if they make sense or not, but I would not consider them done.

So this is what I like about GFK: data science has this high value at the project management level too; it has a seat at the table. And we can manage expectations, because I think that's the second thing I would tell my past self: always manage expectations.

And it's good to say, "Well, we can give you a small solution in two weeks, and then we can see: does it make sense to improve, engage further, or finalize this?" But you have to manage the expectations, and not tell them we'll find the best possible model in six months, only to have to go back and ask, by the way, which metric are we optimizing now? So, managing expectations is key.

Managing Expectations in Data Science

Loris: [00:45:19] How do you actually do that? I've learned a couple of tricks. One is to keep it incremental and focus on short-term wins immediately, so that people feel, "Okay, there is a response from the data team. Things are moving." But the other big part is to strike a balance between using firm, reassuring words and being cautious, especially if it's a new project, you don't really know the dataset, and it's something experimental.

I've seen people assume that data scientists can figure it out just because they delivered really well on other projects that were based on more structured data. So I guess part of managing expectations is also making people understand that if you work with different ingredients, you'll have to adjust the timing.

Vlad: [00:46:16] So multiple things help. First of all, I do emphasize the science part of what we sell. We are not software engineers, so there's always a high risk of failure, or a high chance of success. What does science mean to me? There's an uncertain outcome, right? That's the first thing. The second thing is that we will not leave you waiting for the results. We will update you regularly, on whatever cadence you want, weekly or bi-weekly, to show you incremental improvement. And I always give the caveat: this incremental review process starts when we have the data.

Loris: [00:46:56] Yeah.

Vlad: [00:46:58] So these are the three things I would encourage for everybody. When you have the data: that's a key point, because I don't know how much time is wasted waiting for data. This is the single biggest waste of data science time: waiting on data, or waiting on database access, for that matter. I think where we have improved as a data science profession is that we're not as reliant on files anymore; one of the core skills for data scientists now is to be able to write SQL queries. We are more self-reliant in how we get the data, but the infrastructure still needs to be there and needs to provide that access.

Where things have also improved is AWS (Amazon Web Services), or any other kind of cloud provider. You don't have to wait as much; everything is faster. So I think these two problems have been mitigated to some extent, but for me the clock still starts when we have the data. And then provide incremental updates — honest incremental updates. And at the beginning, say, "Look, this is a scientific process. We can align on a metric."
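
On the SQL self-reliance point, here is a tiny self-contained example of the kind of query a data scientist can now run without waiting for a file export. The table and rows are invented, using an in-memory SQLite database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")                 # invented toy table
conn.execute("CREATE TABLE sales (item TEXT, month TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("GX-S10-128", "2021-02", 104), ("GX-S10-128", "2021-03", 180)],
)

rows = conn.execute(
    "SELECT month, SUM(units) FROM sales GROUP BY month ORDER BY month"
).fetchall()
print(rows)    # [('2021-02', 104), ('2021-03', 180)]
```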

Loris: [00:48:15] What's a KPI for a data science function? Is it added revenue at the end of the quarter, the reporting period, for your customers directly? Or is it more like how well you can track feature drift, or what's the net promoter score (NPS), or an equivalent of that?

Vlad: [00:48:36] I would say it's not that easy to have one single KPI for everything. For each subdomain we have different metrics, for example, how many of the proposals from the proposal systems are accepted automatically. So we have data science KPIs based on the performance of the model, not on revenue output, because sometimes that's very hard to measure. Quality is super hard to measure.

Automation and Change Management

Because what do you say if you have a proposal system and are using fewer people? If that is the metric, then I will just propose everything as good quality, because then we do not need any more people, but in the long run this will backfire.

Loris: [00:49:28] Well, you can restate that by saying that if you automate a lot of the tasks, people have time to do other things pending on the to-do list of their project management tool.

Vlad: [00:49:40] But sometimes it's their only job.

Loris: [00:49:41] Sometimes yes, it's their only job.

Loris: [00:49:46] So you're effectively taking work away, and you're getting these people to a point where their position is no longer justified because of the algorithm. But there are also other things the business can do. Maybe this will require a bit of adaptation and a bit of flexibility from the person involved.

Every company wants to grow, do better for its customers, and increase shareholder value. So if you have more people available to do work, that's a nice problem to have. Managing it is another challenge, but that doesn't mean it's bad for the business.

Communication and Data

In terms of people, communication, and the problems that arise when different personalities clash, and different expectations and fears clash, what would you tell your past self? What did you learn?

Vlad: [00:50:44] That I am not as introverted as I thought. My favorite joke is: what is the difference between an introverted and an extroverted mathematician?

Loris: [00:50:59] I think you mentioned this joke to me and I didn't understand it. What was it?

Vlad: [00:51:03] So, for the listeners: the extroverted mathematician looks at your shoes instead of his own shoes. Anyway, I think it's very funny.

Loris: [00:51:13] Oh, I see.

Loris: [00:51:16] Hang on. You gotta explain it to me. Is that because they tend to be so introverted that they typically just look at their own shoes? And so if they look at your shoes they must be really extroverted?. Is that it?

Vlad: [00:51:27] Yeah, that's the point. That's the joke. That's mathematician's humor. So you see how socially awkward I am, and you can imagine how I was, no? But the point is, I still would not say I'm extroverted, but I do enjoy talking to people more. I still feel exhausted and tired after being at a conference for one or two weeks, but I do enjoy the process.

So I would say it's not about changing yourself from introverted to extroverted or the other way around, but rather about enjoying it while still acknowledging that you will need recovery time from social stress.

Tips for Introverted Data Practitioners

Loris: [00:52:19] That's a very nice way to put it, because I know some folks who try to change the way you are: "Are you introverted? I'm going to teach you to be an extrovert." Sometimes you just are who you are, and you've got to acknowledge that. Life will throw situations and environments at you that are not perfectly ideal, that don't resonate with your nature. There's going to be an added expense in terms of energy to deal with those situations; afterwards, take some rest, take a day off.

Vlad: [00:52:52] You can still enjoy them. I still enjoy talking to people, or being at a conference and being asked lots of questions that challenge me; I like that. And even though I'm still not extroverted, it's something you can still enjoy.

I think those are the things I would tell myself. So: do not be cocky with your PhD, because you're nothing special, in a good way. I mean, you did your stuff and you learned a lot of things.

Loris: [00:53:17] A lot of people do.

Vlad: [00:53:18] It does not make you more special. It's people skills, basically. I read in a book: "be interested, not interesting." I think that's true. So try to find something in each person and take a bit from it, because every person can teach you one thing or another, if you just let them.

Loris: [00:53:39] This is going to be one of the best quotes. Everyone can teach you something if you just let them. Boom.

Vlad: [00:53:48] That's good. I think that's key, because this is something that I did wrong. As I said, at the beginning we said, "Okay, we can automate everything. I don't know why you're still here." And this is not the person I wanted to be. If I could go back in time, I would really wash my head and say, "Vlad, what the hell were you thinking? Come on, you're an arrogant bastard. What has happened?"

Loris: [00:54:16] Yeah, it's part of the learning. I totally feel that pain. Perhaps I never got to the point where I actually believed I could automate everything, but I did feel that sense of superpower, coming in and going, "Okay, I'll show you how to do it." Because I've done complex stuff.

If I solved those problems, surely I can solve yours, not a big deal. But then you stumble against an invisible layer of humans and tribal knowledge and all the invisible stuff that you don't see. Absolutely. Was there something else you wanted to add, or did I cut off a thought?

Vlad: [00:54:54] No, I was just recapping, but I think we have covered the key points.

Loris: [00:54:59] Absolutely.

Vlad: [00:54:59] What I wanted to teach my younger self. But I think it was a good intro, and now we can circle back to it, and hopefully we've answered the questions.

Loris: [00:55:11] Yeah, I hope so too. I really enjoyed this one, Vlad, so thank you for taking the time. I know that today is especially busy for you because you're also alone with the kids, so I don't know how you did it. The magic of Vlad Ardelean. Thank you very much for being on the show, and I'm looking forward to our next chat.

Vlad: [00:55:29] Bye, Loris. It was a pleasure to be here.
