What is semantic modeling? How and when do you need it? Can you automate it?
Join hundreds of practitioners and leaders like you with episode insights straight in your inbox.
What is semantic modeling? How and when do you need it? Can you automate it? Today I speak with Panos Alexopoulos, author of the book "Semantic Modeling for Data: Avoiding Pitfalls and Breaking Dilemmas", published by O'Reilly.
Panos is currently Head of Ontology at Textkernel, leading a team of data professionals in developing and delivering a large cross-lingual knowledge graph in the HR and recruitment domain.
Panos is also a regular speaker and trainer in both academia and industry and I encourage you to look at his website: http://www.panosalexopoulos.com
Join the Discovering Data community!
Do you want to turn data into business outcomes and get promoted? Discovering Data just launched a new Discord server to connect you with people like you. Discover new ideas, frameworks, jobs and strategies to maximise the impact of your work. Data can be a lonely and challenging career, don’t do it alone!
Request access now: https://bit.ly/discovering-data-discord
Do you want to showcase your thought leadership with great content and build trust with a global audience of data leaders? We publish conversations with industry leaders to help practitioners create more business outcomes. Explore all the ways to tell your data story here https://www.discoveringdata.com/brands.
Want to help educate the next generation of data leaders? As a sponsor, you get to hang out with the very best in the industry. Want to see if you are a match? Apply now: https://www.discoveringdata.com/sponsors
Do you enjoy educating an audience? Do you want to help data leaders build indispensable data products? That's awesome! Great episodes start with a clear transformation. Pitch your idea at https://www.discoveringdata.com/guest.
💬 Feedback, ideas, and reviews
Want to help me steer the direction of this show? Want to see this show grow? Get in touch privately or leave me a review with one of the forms at discoveringdata.com/review.
Your ideas help us create useful and relevant content. Send a private message or rate the show on Apple Podcast or Spotify!
Loris Marini: The last time someone shared a dataset with me, it took me hours to understand what I was looking at, and I'm sure you've had the same experience. So I was thinking: how cool would it be if we could share a piece of data, whether it's a spreadsheet or a view of a database, and be confident that the receiving end fully understands the context of that data, its meaning, and the relationships between those data points and the rest of the universe of data in the organization? Maybe. But I think we can do a lot more to make it happen.
We would have enormous benefits, not only in terms of team velocity, how quickly ideas flow, and how many fewer boring meetings we need to attend, but also because we could integrate automation and software with the data and start really exploring the human-machine interface. But before I steal the thunder from the episode: of course, I'm talking about semantic modeling. I wanted to know, or at least I was asking myself, what can go wrong when you do semantic modeling? First of all, what is it? How do you do it well, and what are some of the pitfalls? To explore those questions, today I speak with Panos Alexopoulos, an expert in semantic modeling who works at the intersection of data, semantics, and software. He is the author of the book Semantic Modeling for Data: Avoiding Pitfalls and Breaking Dilemmas, published in September 2020 by O'Reilly Media. Panos is currently Head of Ontology at Textkernel, leading a team of data professionals in developing and delivering a large cross-lingual knowledge graph in the HR and recruitment domain.

Panos is also a regular speaker and trainer in both academia and industry, and I definitely encourage you to look at his website, panosalexopoulos.com. Alrighty, so I'm here with Panos Alexopoulos. Panos, welcome to the podcast.
Panos Alexopoulos: Hi Loris. Thank you very much for hosting me.
Loris Marini: It's my pleasure. It's absolutely my pleasure. My introductory question, for those like me who aren't overly robust on semantics, except through our recent episode with Jessica Talisman and a follow-up on that: we hear a lot about data modeling, and that's something we're used to, but what is semantic modeling, and how did you become interested in the field in the first place?
Panos Alexopoulos: Yeah, let's start with data modeling, right? For as long as we have had data and worked in the information technology sector, we have been asking ourselves: what is the best way to organize it? What is the best way to represent it so that machines can use it in software applications?

Anyone in our audience who has ever built a simple database, a relational database, or has created an XML file or a JSON file, has already done a task of data modeling, right? You're writing data in a way that you expect your machine to understand. Now, semantic data modeling, adding the term semantic, highlights something that should already have been happening: the effort to make the meaning of the represented data as explicit, as shareable, and as machine-processable as possible. So practically, you create descriptions of data. These can be metadata or other types of representations which help both you, as a consumer of the data, and an application or a system, to understand what the data is about.
A very simple example of semantic modeling, very simple, not in the way that you might imagine, is, for example, in XML, the tags that you use to frame an element, right? So you say, for example, you have "Loris" and you put a tag "name". Already, "name" is a semantic description of the data.

Now, Loris, how good "name" is as a semantic description is another discussion, and this is where the actual art and craft comes in, right? But it's already a type of semantic modeling.
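A tiny sketch of that XML idea: the same value as a bare string and as a tagged element, and how a machine can pick the value out by its tag. This uses Python's standard-library XML parser purely for illustration:

```python
import xml.etree.ElementTree as ET

# A bare string carries no machine-readable hint about what it means.
raw = "Loris"

# Wrapping the value in tags makes part of its meaning explicit:
# this string is a *name* belonging to a *person*.
record = "<person><name>Loris</name></person>"

root = ET.fromstring(record)   # parse the tagged data
name = root.find("name").text  # a machine can now ask for "the name"

print(raw, "->", name)
```

How good "name" is as a description is, as Panos says, another discussion; the tag only makes the intended meaning explicit enough for a machine to act on it.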
Loris Marini: So, the three things you mentioned: it has to be machine-processable, it has to be shareable, it has to be explicit, the context and the meaning behind it. I'm using context and meaning interchangeably here. Is that a good idea, or should I talk about them separately?
Panos Alexopoulos: Context can help further define the meaning. The meaning is what is meant, and meaning can be context-dependent: something may have a slightly different meaning in a different context. But it's not the same thing. Some data can have the same meaning across different contexts, right? Some other piece of data may have a context-dependent interpretation and meaning. I will give you an example. Let's say that you have the attribute "tall"
Loris Marini: Yep.
Panos Alexopoulos: for the average population. We can agree more or less on a definition of tall, right? Putting some threshold over a particular height: you are tall if you surpass it. If you go to the context of basketball, the definitions change. So the meaning of tall in basketball is very different from the meaning of tall for the average population.
Loris Marini: That's right. So context, again, changes the meaning.

Panos Alexopoulos: It plays a role, right? But they're not the same concepts, meaning and context.
Loris Marini: You mentioned three attributes for a semantic description. We want it to be machine-interpretable, we want it to be explicit, and we want it to be...
Panos Alexopoulos: It has to be, sorry: explicit, machine-interpretable, machine-usable so that the machine can parse it, and
Loris Marini: Shareable
Panos Alexopoulos: And I need to insist on the shareability.

This is actually my pet peeve and the issue that my book is mostly about. It's about the shareability, and shareability means that when at least two parties, and this can be two applications or two individuals, read the particular description, consume the particular description, they understand almost the same thing.
Ideally, exactly the same thing, right? It's like everyday communication: if I tell you, okay, go and get me a cold beer, I expect that you understand what cold means and that you won't give me a warm beer and say, yeah, for me this is cold. It's an extreme example, but you can imagine this, right?
And the thing is that with humans, especially when we write data, when we write descriptions, we have the curse of knowledge. We all have a lot of implicit and common-sense knowledge that we do not make explicit in our data. We have expectations as to what the other party is going to understand. But if we don't handle this correctly, this is where things start going south.
Loris Marini: Mm. I was at a conference recently, an internal business conference. And for 50 minutes, the top person in the room presented stats, acronyms, and names which had a ton of meaning behind them. He assumed that everybody in the room understood what those terms meant.
Of course, I was perhaps the least knowledgeable person in the room, with the least experience, so maybe it felt particularly weird for me. But then I noticed that a lot of other people, especially the most senior folks, had similar doubts. This is something that we hear a lot at conferences:
please ask, there's no stupid question. If you have a doubt about something, chances are someone else is having the same thoughts as well. So speak up and ask those stupid questions. And that was in the context of business strategy and business leadership.
So the impact can be pretty serious. Imagine if line managers don't understand what they're trying to optimize for.
Panos Alexopoulos: If I may come back to your original question, what is semantic data modeling? Exactly this: semantic data modeling is an effort to represent, within the data and along with the data, explicit descriptions of their meaning. And in that area, there have been a lot of techniques and types of artifacts that we have been building, including things like taxonomies, ontologies, entity-relationship models, and others. All these are artifacts with different capabilities and different characteristics that computer scientists, information technology practitioners, and data practitioners build or consume every day as part of building systems. That's why in my book, and in my personal view, I define semantic data modeling as any such artifact that aims to achieve these things about data: shareability, explicitness, and machine interpretability. Which means, and that's important, that it's not only ontologies, it's not only the semantic web; they don't have the monopoly on semantics, as sometimes
it may seem. Because I've seen debates saying: your database is not semantic, but my RDF graph is. Let's not debate that now, but there are things that can make one representation a bit better than another in terms of semantic clarity. I just don't think any technology should have the monopoly of saying "we are semantic and you are not." Let's put this out there.
Loris Marini: We're talking about the concepts here. Absolutely. Panos, just one question to walk the talk: you mentioned artifacts. We're talking about artifacts as a product, not as a distortion of something.
Panos Alexopoulos: Yes, yeah. Artifacts as an object, as something, as a...
Loris Marini: An object.
Panos Alexopolus: Yeah. yeah, yeah.
Loris Marini: Fantastic.
Panos Alexopoulos: And that's why it's called a model. It's actually an approximation of reality.

In the same way that a machine learning model is also an approximation of rules and behavior, a semantic data model is an approximation of reality, right?
We're trying to put into an explicit representation what is in our heads, in terms of knowledge and how we see the world.

Loris Marini:
And let's go back to the concept of shareability. Is shareability just what happens when we physically give someone access to a database, to a view, to the schema? Or is there a bigger effect, a ripple effect? Does it also apply outside of engineering, between people who don't necessarily speak SQL? How do you see shareability? What is shareable?
Panos Alexopoulos: I'm not talking about the physical aspect. I'm talking about the meaning aspect. Shareable means that I give you a spreadsheet, for instance, and you, or an application reading the spreadsheet, understand exactly what, for example, every column means, every column name. And when I say understand exactly, I mean: understand the meaning that I intended as the creator.
Loris Marini: Oh, okay. That's deep.
Panos Alexopoulos: That's what I mean, the intended meaning. Let's make an example again, right? I'm in the finance department, and you, the CEO, ask me to give you a list of strategic clients. So I'm making a list with a column named "strategic clients", and I list in every row, I don't know, 10 clients, 10 customers of the company. Then the CEO asks the same thing from the R&D department. So you have two datasets, two tables. Both have the column name "strategic client", probably with different contents. What are you going to do with this? Are you going to join them? Is the strategic client in the interpretation of the creator of one table the same as the strategic client of the other?
Loris Marini: Right?
Panos Alexopoulos: And that's where the conceptual level comes in, and that's where the semantics come in. A string does not have meaning; it has the meaning that we give it as humans.

"Strategic client", the name of a column,
Loris Marini: Right.
Panos Alexopoulos: does not have inherent meaning. It's just a string, at least for the machine. So it needs to map to a concept, or to multiple concepts.
And you know very well that language is messy.
Loris Marini: Yeah.
Panos Alexopolus: we use the same terminology to refer to many things. We use different terminology to refer to the same thing. Even across communities,
Loris Marini: Yeah.
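The "strategic clients" scenario can be made concrete with a small sketch. The client names and the two departments' criteria below are invented for illustration; the point is only that the same column name hides two different intended meanings:

```python
# Finance's "strategic clients": say, clients above a revenue threshold.
finance_strategic = {"Acme", "Globex", "Initech"}

# R&D's "strategic clients": say, clients who drive the product roadmap.
rnd_strategic = {"Globex", "Umbrella", "Hooli"}

# A naive union treats the two columns as if they meant the same thing:
conflated = finance_strategic | rnd_strategic

# Only the intersection is "strategic" under *both* intended meanings:
agreed = finance_strategic & rnd_strategic

print(sorted(conflated))  # five clients, under a silently conflated meaning
print(sorted(agreed))     # ['Globex']
```

Neither set operation is "correct" until the two creators agree on what the column means; the code only makes the hidden disagreement visible.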
Panos Alexopoulos: So, very quick question: what is a class?
Loris Marini: what's a class? Yeah.
Panos Alexopoulos: If you ask a machine learning engineer, he will tell you something very different than a knowledge engineer,
Loris Marini: my software
Panos Alexopolus: Than a software developer.
Loris Marini: What's a class? For the developer, it's object-oriented modeling. For a machine learning engineer, it's about segmentation and clustering and
Panos Alexopoulos: And in machine learning, it's about classification. So categories are called classes.
Loris Marini: Yeah. If you ask my kid, the class is
Panos Alexopoulos: Yeah, yeah. I'm not going there yet. Of course, in the broader sense too. But even within the particular context of information technology, right, the same concept doesn't
Loris Marini: Yeah. Yeah.
Panos Alexopolus: work.
Loris Marini: So then what is the solution here? Do we have to work with some sort of vocabulary, or a set of questions always looping in our brain? Every time we say something, we have to double-check that everybody agrees on and understands the meaning, because it sounds like it's going to be a taxing activity. That curse of knowledge you mentioned before is there, I believe, for a reason: it saves us time. We don't have to go through the basics; we can just assume a whole bunch of stuff and keep the conversation going. Is there a way to do this properly and ensure that we can share information in context, without spending more time worrying about the context than we spend acquiring the information?
Panos Alexopoulos: Exactly. So this is where, you could say, it's a spectrum. Let's start with the two extremes, which I think are both harmful. One end of the spectrum is: I don't care about semantics, I don't care about shareability, this is my data, deal with it. Obviously this doesn't work, and I think everybody understands that it won't work. It's not practical; it costs a lot, as you said. The other extreme is to say: okay, let's model the world, all of the world, in a common, shareable way, with one model, one ontology to rule them all, like the one ring. You know this also won't work. It's very impractical, right? One effort of the semantic web has been exactly this. And, even though the semantic web has created a lot of nice developments,
Loris Marini: Semantic web.
Panos Alexopoulos: The semantic web, yes. There is no common ontology or common knowledge graph of everything that everybody uses and adheres to.
There are multiple views. So if you want to be pragmatic, and I think here, as practitioners, we want to be pragmatic.
Loris Marini: Mm-hmm.
Panos Alexopoulos: My suggested viewpoint, what I, let's say, evangelize in my book, is to start with the pain point. So you have data, okay? In the worst case, you don't make any metadata description; you just make something very simple. Do you have problems? How do you know? You have to actually be able to observe them. You start looking at patterns that indicate that, as you say, there are lots of meetings, lots of errors in the application, all sorts of things. So you start monitoring the situation, and you need to be able to see: is this a technical problem, or is it a problem of data definition?
Loris Marini: Mm-hmm.
Panos Alexopoulos: If you identify a problem of definition, you also need, because you might have a lot of data, to ask: okay, where exactly is the problem? In which parts, in which aspects? And then start making iterations on clarifying things, within an agreed scope. You have to negotiate the scope. I'll give you an example from Textkernel, right? In my company, we built, as I said, a knowledge graph about professions and skills, a very difficult domain. Why? Because imagine that you have professions: you can't necessarily have people agree across industries, or even across countries, on what a particular profession is. Take "data scientist": we cannot even agree on what a data scientist is, and whether it's equivalent to a data engineer.
Loris Marini: Exactly.
Panos Alexopoulos: So when we build the knowledge graph in our company, what is the scope of agreement, of shareability, that we want to ensure?
Is it the whole world? No. Very cynically, no. It's our clients and our users, the range of countries that we cover, the domains that we cover; we cannot care for the rest. So we start with that, and it's a daily struggle to work within it. So mentally and methodology-wise, that's where I would start. You cannot just model the world top-down; you have to define the scope and define the pain points. That's the most pragmatic way to do it, even before you choose a technology, even before you decide:
what I need is a taxonomy, what I need is an ontology, what I need is a knowledge graph. You might need none of those. This is not a technology problem; it's a mentality problem. I think it has been mentioned in many talks as the knowledge-first approach.
But this is what it means. It's not about how you represent the data; it's the mentality that you care about what the data means, rather than how you transform it and move it from pipeline to pipeline. That is a very important job, a very important task, and that's where the engineering comes in, to be able to build fast software and good software.
It's the plumbing.
Loris Marini: The focus. Yeah, but the focus...

Panos Alexopoulos: But the focus of semantic modeling is not the plumbing. It's what you put inside the pipes.
Loris Marini: Right. Okay. This is super interesting. Tell me, from your direct experience, I'd love to know a little bit more about what happens in practice. Imagine you find that a report or a dashboard is inconsistent, and you suspect that different stakeholders are using the same term in different ways. What is the process? How do you approach the communication, the people side of the challenge, to get their buy-in, to show that, hey, we might have a semantic problem, a problem of meaning?
Panos Alexopoulos: Right. It depends on many things. It depends on what your organization is like, or who your stakeholders are, as you say. And also on whether you have access to them, because in many cases you have data and no access to the creators of the data. But let's say that you do have access.
Loris Marini: be optimistic.
Panos Alexopoulos: Practically, you use something like the Socratic method: you have to elicit from them what they actually mean by asking probing questions.
Loris Marini: Uhhuh.
Panos Alexopoulos: It's not about telling them, oh, this is wrong or right. It's about: okay, here you say, for example, "strategic client". Can you elaborate on that? What criteria do you use to classify a client as strategic or not? They will tell you, hopefully. You do the same with the other party. And then you compare: either you find that you practically mean the same thing, so we can merge this, or what you mean really is very different.
So let's try to fix that. And how is that fixed? It's not fixed by making them do the same things. There are many ways to fix it. One way is even to change the names
Loris Marini: Uhhuh.
Panos Alexopoulos: Or to add more context with descriptions. A very simple example: when you have the column "strategic client", to have a definition behind it, a human definition, with a nice explanation of what that is. It's not machine-interpretable, because it's just pure text, unstructured text, but for the human consumer it's still important. And then, if you want to go more sophisticated, you may say: okay, this is the way to represent the distinction, by adding relations, by adding attributes. This is where the actual engineering comes in, the actual modeling if you want.
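A minimal sketch of that lightweight first option: a plain-text data dictionary that travels with the dataset. The column names and definitions below are hypothetical:

```python
# Human-readable definitions attached to column names. Pure text, so not
# machine-interpretable, but it stops consumers from guessing at meanings.
data_dictionary = {
    "strategic_client": (
        "Client classified as strategic by Finance: annual revenue above "
        "1M EUR for two consecutive years (hypothetical criterion)."
    ),
    "churn_risk": (
        "Estimated probability (0-1) that the client cancels within 12 "
        "months, recomputed monthly (hypothetical criterion)."
    ),
}

def describe(column: str) -> str:
    """Return the agreed definition of a column, or flag the gap loudly."""
    return data_dictionary.get(
        column, f"UNDEFINED: '{column}' has no agreed meaning yet"
    )

print(describe("strategic_client"))
print(describe("key_account"))  # surfaces a missing definition explicitly
```

The "more sophisticated" route Panos mentions, adding relations and attributes, would replace these free-text strings with machine-interpretable structure.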
Loris Marini: I'm jumping back from a very pragmatic question to a very philosophical one. It has a cost, right? The process of checking and fixing takes time and brain power, and that's associated with dollars that we spend doing it. So from a purely philosophical perspective, I would say: oh, it would be great if we could do it all the time, even though the scope is small. We could agree that the scope is a particular domain of a large enterprise, or a team of 250 people. So the domain could be relatively small, and it could be just internal, right? Even with that reasonably sized domain, you could wake up one day and say: hey, we should encourage people to define what they mean every single time they have a conversation, every single time they write an email, and every single time they put data into a spreadsheet or create a plot. Do you think that is something that pays off long term? Do we get some sort of exponential gain if we were to do it every time we have a chance? Or is it something that simply is not practical, and we should never try to do at all?
Panos Alexopoulos: It depends on how you use it. I don't know philosophically, but thinking about the pragmatic approach: the effort and the money that you put into making your data explicit, into improving the quality of the data and the shareability, indeed at some point might have diminishing returns.
Loris Marini: Mm-hmm.
Panos Alexopoulos: It really depends on where you use the data, what you use the data for, and whether you can achieve your goal with less clear semantics.
Loris Marini: right. Yep. It's a trade off, right?
Panos Alexopoulos: Yeah. That's why I said you start with the need, the pain point, and the application. You don't start with trying to make everything clear. It's impossible, and it's very taxing.
Imagine, and I have this in my book, where I say what attitudes to avoid: you can be pedantic about everything. You can find semantic distinctions for almost everything in the world. A nice example that I have is: do you think that a fiddle and a violin are the same thing or different things?
There is a difference, but in most cases it's not important; for most application contexts, it's not important.
But if you are a specialized violinist, who says no, a fiddle is not the same thing,

Loris Marini: It's different.

Panos Alexopoulos: Right. Yeah. Okay. Then you will be disappointed.
Loris Marini: Mm-hmm.
Panos Alexopoulos: But it's not so practical, and there is always a dilemma when it comes to semantic distinctions, what I call the granularity dilemma: what you consider as having different meanings and what you consider as having the same meaning.
There are some things that are clearly the same, and some things that are clearly different, but there is a lot of gray area, and there are a lot of entities, types of entities, that are abstract. And when you talk about abstract entities, they are problematic to define.
And if you notice, I don't know if you've seen, most of the knowledge graphs out there talk about concrete entities: locations, people, organizations. Concrete means that they have a physical existence, so it's easy to identify them, right? When you build a database of people, a knowledge graph of people, it's relatively easy to define, for 90% of the cases, what a person is.
Loris Marini: Mm-hmm.
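The concrete-entity knowledge graphs Panos mentions can be sketched as a handful of subject-predicate-object triples. All the names here are invented; the sketch just shows why concrete entities (people, organizations, locations) are comparatively easy to pin down and query:

```python
# A toy knowledge-graph fragment: (subject, predicate, object) triples.
triples = [
    ("jane_smith", "type", "Person"),
    ("jane_smith", "works_for", "acme_corp"),
    ("acme_corp", "type", "Organization"),
    ("acme_corp", "located_in", "amsterdam"),
    ("amsterdam", "type", "Location"),
]

def objects(subject: str, predicate: str) -> list:
    """All objects linked to `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("jane_smith", "works_for"))  # ['acme_corp']
```

Modeling abstract entities like ideologies, as discussed next, is hard precisely because there is far less agreement on what the nodes and edges should mean.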
Panos Alexopoulos: Now try to define your own taxonomy of, I don't know, religions or political affiliations or things like that, and then we start discussing: okay, tell us what "communist" means. Let's start debating what it means for something to be classified as communist and what does not. You can imagine the chaos that we would have, right?
Loris Marini: We would spend forever.
Panos Alexopoulos: Yeah, that's why. And I think you asked at some point how easy it is to build one. Maybe it's the next question, but I can get you ready: this is why, for building a knowledge graph in general, you cannot say it's easy or difficult.
It's easier than before, but there are two things that I think whoever starts such an initiative needs to think about. One is the scope of shareability: how widely you want to achieve agreement. The more scope you have, the more difficult it will be,
Loris Marini: Hm
Panos Alexopoulos: Right? It's easier for the two of us to create an ontology now and agree, and it gets more difficult as we add people. That's one thing. The second thing is the type of knowledge that you want to model. The more concrete the knowledge, the easier. So information about people: how old they are, where they live. This is concrete stuff,
Loris Marini: That's cool.
Panos Alexopolus: Right.
But when we start talking about abstract things, political parties, what their ideology is, what the relation between ideologies is, how something affects something else, so when we get into causality relations, things become much more difficult.
Loris Marini: Messy. So if you're on a project like this and you want to deliver impact, you want to try to be in one quadrant of that plane with two axes: small scope, if you can,
Panos Alexopoulos: Then it's a quick win. It's a quick win, usually.

Loris Marini: It's a quick win.
Panos Alexopoulos: But again, that shouldn't give you the illusion that because you have quick wins, you have conquered the problem of semantic data modeling. In the same way that building some really narrow machine learning applications hasn't conquered the problem of machine
Loris Marini: Intelligence. Interesting analogy there. So let me keep surfing this wave of artificial intelligence and ML. I'm trying to imagine I'm someone who is not a specialist, more of a business person, and I listen to this conversation and I think: I want the push button.
I just wanna buy it. I don't want my team to spend so much time trying to agree on things. Can we deploy an AI that does it for us?
Panos Alexopoulos: No, you cannot do it with a push button, at least to the best of my knowledge. You cannot do it with a simple button. Otherwise, of course, we wouldn't have a job.
You can see that there are lots of efforts in what we call automatic semantic information extraction, right?
And even automatic knowledge graph building, where, for example, you give the system a corpus, let's say, let's talk about extraction from text, right? A corpus, and you identify the entities in it and relate them to each other. So for example, let's say you want to build profiles of politicians, and instead of going and doing it top down, you start building the data from text. You say: I'm going to use newspaper articles from the New York Times, I don't know, from the last decade, and I'm going to identify mentions of politicians, their political party affiliations, and other information, right? Height, or whatever they have there.
Actually, this kind of task is well studied in the research community, and to some extent in the industry. There are some industrial frameworks that do this. But what happens is that there are many sub-tasks that building a knowledge graph model entails, and the current state of the art is not optimal; it hasn't solved the problem for all of the tasks and for some types of entities.
So let's take a basic example. One task is entity recognition. You have texts and you want to identify the entities that are there without having them already so let's say you want to identify the politicians that are mentioned in the text.
Loris Marini: Yep. The names
Panos Alexopolus: The names who are politicians,
Loris Marini: Yeah.
Panos Alexopolus: but not the others, not the athletes or the others, right?
Loris Marini: And you don't know anything else. You just have a bunch of texts
Panos Alexopoulos: Yeah, exactly. You don't know anything else. You don't have a list of names of politicians,
because this is what you want to build. This is your goal,
Loris Marini: That's the output.
Panos Alexopoulos: Right? There is a task well known in the machine learning and NLP community called entity recognition,
Loris Marini: Mm-hmm.
Panos Alexopoulos: which, given a text, tells you: this and this are entities. These systems have been trained using machine learning techniques, where you give examples of text, you tag them, and the system learns. Deep learning is used, and probably even GPT-3 may be able to do that.
Loris Marini: Yep.
Panos Alexopoulos: The thing is that currently, when it comes to commercial systems, the majority of them deal with concrete entity types. So all systems identify persons and locations and organizations; these are the most covered types of entities.
But if you want to be more granular and you say: I need a system that can take a text and tell me the difference between a politician and an athlete, for example, there is no such thing, at least to the best of my knowledge, that works out of the box. What you can do is take a system, give it your own examples, and train it. But there is nothing that magically works at the push of a button. Assuming you have the data, you can also use other techniques. There are frameworks like...
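One way to get the politician-versus-athlete granularity discussed here, without an out-of-the-box model, is the symbolic route: a hand-maintained list (a gazetteer) that refines the coarse "person" type. A toy sketch with invented names, not a production approach:

```python
# Gazetteers: curated lists mapping known mentions to fine-grained types.
POLITICIANS = {"Jane Smith", "John Doe"}
ATHLETES = {"Alex Runner"}

def refine_person_type(name: str) -> str:
    """Refine a recognized 'person' mention using gazetteer lookups."""
    if name in POLITICIANS:
        return "politician"
    if name in ATHLETES:
        return "athlete"
    return "person"  # fall back to the coarse, well-covered type

for mention in ["Jane Smith", "Alex Runner", "Maria Unknown"]:
    print(mention, "->", refine_person_type(mention))
```

In practice this is combined with a trained recognizer: the model finds person mentions in raw text, and rules or gazetteers like the above supply the finer-grained semantics.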
Loris Marini: I would be tempted to summarize the answer to my question. Can I do it automatically? Can I deploy a piece of technology that helps all my staff create a sort of common language and vocabulary, and detect when they're going out of sync? The answer is: you can do it at a very superficial level. Maybe you could do it at the level of physical entities, but as soon as you go past that...
Panos Alexopoulos: There is technology, currently, to help you with that, definitely. All the companies that provide software for managing and organizing your knowledge graph, or even for mining, all of these are helpful. But it's not something that works at the push of a button.
Many times, it requires a human in the loop. It requires proper preparation, proper quality control and quality monitoring, because all machine learning and NLP-based approaches to creating semantics are error prone. Humans are also error prone, but that's another discussion. If we talk about full automation, it's error prone; it doesn't work with, I don't know, 99% accuracy and coverage.
Loris Marini: Right, you would still need
the human in the loop, someone to facilitate that process of extraction. But it's valuable if you think of it in those terms. So you get someone to...
Panos Alexopoulos: Yeah.
Loris Marini: work with the algorithm
Panos Alexopoulos: Yeah. And there is always this fight between the symbolic community and the machine learning community, the top-down people and the bottom-up people, right? You have the machine learning people saying, give us the data, we'll get anything from there. And you have the modelers saying, no, you need to do it top down. And the challenge, both in the research community and also in the industry, is to make these two things work well together.
I have given talks and tutorials relating knowledge graphs with NLP, natural language processing, in both directions. You can use natural language processing and machine learning to help you build your knowledge graph, maintain it, find errors, things like that. And vice versa: in many cases you need to use ontologies, knowledge graphs and semantic models to boost and help your NLP or machine learning application.
Loris Marini: So it's not an either-or. It's a combination of the two,
Panos Alexopoulos: Yeah, it should be.
Loris Marini: taking the strengths of each.
Panos Alexopoulos: I think even the companies, there are many companies that claim to do it all with machine learning,
Loris Marini: Mm-hmm.
Panos Alexopoulos: If you ever go to them
and get access to their systems,
you will find gazetteers, you will find lists of terms, you will find informal taxonomies.
You will find heuristics, you will find rules, which are traditionally on the symbolic side. They don't always, as they claim, use machine learning.
Loris Marini: So let me try to wrap up everything we covered so far. We've got a few cool ideas here. First is, if you're developing a knowledge graph, or in this case a semantic model, which is part of a knowledge graph, remember that the smaller the domain, in terms of the pool of people that you're trying to build for, the easier your job will be.
Panos Alexopolus: the domain, the scope, The domain can be big, but
When I say domain, I am thinking of topics
like, I don't know, recruitment or politics. This is domain in my view, but it's also about who you are going to share it
with. Who is your user base?
Loris Marini: How broad is that user base...
Panos Alexopoulos: Yeah. Your...
Loris Marini: what they use day to day, right? So whether it's across the enterprise or just within sales, for example.
Panos Alexopoulos: Yeah. One thing, if you're talking about internal use only, is whether it's across the enterprise. But it also matters if you're talking about customers or users,
Loris Marini: Yeah.
Panos Alexopoulos: How broad, yes. So for example, say you're building Alexa, right? Or Siri.
Who are you targeting? If you want it in all the languages of the world and all the cultures, you need to adapt whatever you have to all of them, to satisfy all of them at the same time.
Not just North America, not just Greece, not just...
Loris Marini: It's gonna be very hard. So that was one concept. The second concept is your entities: are they physical entities, people, locations, organizations, or are they abstract?
Panos Alexopoulos: The more abstract the entities, typically the more difficult it is to rigorously define them.
Loris Marini: Yeah. And the third important point, which actually was the first we started from, is that you don't want to try and build a knowledge graph for the whole world or the whole organization. You want to start from the pain points and walk backwards to understand: do I need to add context to the data?
Do I have to clarify the meaning? Do I have to get people on board, in a room or in a virtual call, to try and debug why we're having those issues? And then, if the solution involves adding meaning and context,
Panos Alexopoulos: and getting them to, how to say, embrace this, making it as seamless as possible for them. But it's never at the
push of a button.
Loris Marini: Exactly. And then the last point we covered is this push-of-a-button idea. Can we completely automate the generation of a knowledge graph? Answer: probably no. For most use cases you could do it at a superficial, high level, but the second you really start drilling down and you need more specificity, you need humans in the loop.
And you might leverage artificial intelligence and natural language processing to help you with some of the more tedious tasks, but you still need the humans to contribute and help the machine learning algorithms do a better job. Have we covered them all? Is there anything missing that we should add?
Panos Alexopoulos: One more thing, because data scientists build models, analytic models. Let's say you have a knowledge graph and you want to do some analytics on top of it, or to feed parts of the semantic data into training a machine learning system.
If the semantics of the data that you use are not compatible with the semantics of the analytics that you want to have,
you are not going to get the desired result.
Let's say that you don't have a model, but you want to
create this relation, and you have two types of data. You have job vacancies,
Loris Marini: Yeah.
Panos Alexopoulos: which practically say: okay, I'm looking for a data engineer who needs to know SQL, who needs to know ontologies, things like that. But you also have CVs: this is Panos, he has held these professions and he has held these jobs.
Now it's very easy to apply a simple correlation analysis, right? A simple algorithm that will try to find the most typical skills for a profession based on the two pieces of data.
Loris Marini: Yeah,
Panos Alexopoulos: But if you are really careful, you will notice that the relation you will get from one data set and the relation you will get from the other do not have exactly the same semantics.
Why? Because when you mine this relation from vacancies, what it says is that these are the skills that are essential for a profession, that are desired for a profession, on the demand side. This is what the employers want.
Loris Marini: Yeah, not the need.
Panos Alexopoulos: Yeah. If you mine the relation from CVs, from people's profiles, what you are really mining is
what is available out
there, what professions possess particular skills,
Loris Marini: But
Panos Alexopoulos: not what you need.
This is a subtle distinction, which may or may not be important. But if you don't know it, because you don't know the provenance of the data the relation has been created from,
and you haven't made it explicit in the name, you just say this skill is important for that profession, without making it more specific,
Loris Marini: Right? Yeah, I see what you mean.
Panos Alexopoulos: then someone who just takes that relation and uses it somewhere else will lose that
meaning. We lose the assumptions and detail that are inherent in the way the relation has been created.
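The vacancy-versus-CV point can be sketched with invented toy data (all names, skills, and relation names here are made up for illustration):

```python
from collections import Counter

# Toy data: skills mentioned in job vacancies (demand side) versus
# skills listed on CVs (supply side) for the same profession.
vacancies = {"data engineer": [["sql", "python"], ["sql", "ontologies"]]}
cvs       = {"data engineer": [["sql", "excel"], ["python", "excel"]]}

def mine_skills(corpus, profession):
    """Count how often each skill co-occurs with a profession in one source."""
    return Counter(skill for doc in corpus[profession] for skill in doc)

# Same mining algorithm, but the relation names keep the provenance
# explicit: 'demanded_skill' comes from vacancies, 'possessed_skill'
# from CVs, so downstream users cannot silently conflate the two.
demanded_skill  = mine_skills(vacancies, "data engineer")
possessed_skill = mine_skills(cvs, "data engineer")

# Naming both relations just 'important_skill' would hide that they
# answer different questions: what employers want vs. what people have.
```

This is exactly the better-naming advice from the conversation: the semantics of the two relations differ even though the algorithm that produced them is identical.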
Loris Marini: Yeah, very interesting. Are assumptions part of the semantics, or are they on top of the semantics, helping us understand the context?
Panos Alexopoulos: It's part of the context,
and a bit of both: the context and the provenance. In many cases it's important so that you as a human can interpret it, because it's impractical in many cases to make everything explicit. Practically, as a knowledge engineer and as a data modeler, there is a trade-off between information that you want to be fully machine processable
and information that you want the human consumers to use, which are the other engineers and the end users.
Because you cannot model everything in an explicit way; you cannot make everything machine processable. So for this relation that I mentioned, there are many things you can do which are already easy. One thing is to use better names. I have a whole chapter in my book about how to define relations and entities with better naming.
Loris Marini: To minimize the
Panos Alexopoulos: To minimize the probability, yeah, that it is misunderstood.
Loris Marini: I definitely need to read that chapter straight away. I see what you mean, it gets impractical. It reminds me of the famous matrix where you decide when to automate a task based on how frequently you use that thing. There's a range.
If something takes a long time but you only use it once, it makes no sense automating it. There are these trade-offs. So I wonder if something similar can be built for this, like a rule of thumb or a framework that can tell you: try to estimate how often you're going to use this thing.
Try to estimate the impact that thing has on the organization, in terms of dollar value. Are we talking about something that impacts a few cents or a few dollars on the SKU? Or are we talking about something that potentially can have a catastrophic impact if we don't get that meaning right? We might end up producing the...
Panos Alexopoulos: Yeah, you can build something like that. I have no idea exactly how, at least in a generic way; I cannot give advice on how exactly
you can do it. It really changes depending
on the context. I don't have a framework.
Loris Marini: on the context,
Panos Alexopoulos: It's the same with all the others. Even with machine learning, right? Even with other types of AI, you do the same thing. You have to decide:
okay, how much data will I need before I deploy my first model? What tolerance do I want to have in my quality, in precision and recall? How often should I retrain my models and make a new version? And actually that's a problem that also happens in semantic modeling:
how often do I generate a new version of my model of reality?
Loris Marini: Right. Model drift
Panos Alexopoulos: Exactly, that's model evolution. Because
in some domains, if the knowledge is static, if we're mostly talking about historical events, this doesn't change. But if you have volatile knowledge that changes a lot, you probably need to do faster iterations.
Loris Marini: Panos, I have a few questions from some of my past guests that follow you and really appreciate your work, so I asked them: what should I ask Panos? The first one is from Jessica Talisman, and her question is around property graphs versus knowledge graphs.
What is the difference in your view? Are we talking about the same conceptual model or different ones?
Panos Alexopoulos: Great question. And actually it indicates the problem with terminology that we're facing.
A knowledge graph is an abstract concept, in my view, an abstract artifact that says we have entities and relations. If I understand correctly, Jessica implies that when she says knowledge graph she means RDF. RDF is a way to represent semantics, to represent knowledge, and the labeled property graph is another model that tells you, okay, these are the elements you have. Practically, these give you different tools to do your
Loris Marini: Mm-hmm.
Panos Alexopoulos: work.
RDF tells you: look, what you have in order to represent knowledge and semantics is relations, you have predicates, right?
You have properties, datatype properties. You have classes, you have relations, and some relations have a predefined meaning, a meaning that we hopefully all agree on. So when you say, for example, RDFS subclass, it has a particular meaning.
So you can use it in those situations. The labeled property graph model gives you slightly different Legos; I like to think of them as Legos, based on which you build things. It gives you nodes, it gives you edges, called relationships, it gives you attributes, and it gives you labels.
So practically you have a different toolkit in the two cases to build your semantic model. Obviously, when you have different tools, they have different capabilities. There are things you can represent more easily with one toolkit but not as easily with the other.
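As a rough sketch of the two "Lego sets" (plain Python literals standing in for a real triple store and a real graph database; the data is invented):

```python
# The same fact, "Panos works at Textkernel", in the two toolkits.

# RDF-style: everything is a triple (subject, predicate, object);
# attaching attributes to the relationship itself needs extra modeling.
rdf_triples = [
    ("ex:Panos", "rdf:type", "ex:Person"),
    ("ex:Textkernel", "rdf:type", "ex:Company"),
    ("ex:Panos", "ex:worksAt", "ex:Textkernel"),
]

# Labeled-property-graph style: nodes and relationships carry labels
# and attribute maps directly, e.g. an invented 'since' property on the edge.
nodes = {
    "panos": {"labels": ["Person"], "props": {"name": "Panos"}},
    "tk": {"labels": ["Company"], "props": {"name": "Textkernel"}},
}
edges = [("panos", "WORKS_AT", "tk", {"since": 2013})]
```

Same fact, different primitives: the triple model pushes everything into predicates with (hopefully) shared meaning, while the property-graph model makes attributes on nodes and edges first-class.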
Loris Marini: I see. So it's not that one is better than the other. They're just different.
Panos Alexopoulos: They are different, and we can discuss advantages and disadvantages; it's not that one has the monopoly on semantics and the other does not.
What RDF and the whole semantic stack have, that property graph models do not have, is a predefined set of elements that have associated meaning,
Loris Marini: this is for
Panos Alexopoulos: yeah, a meaning that whoever uses it agrees to use correctly.
So for instance, if you want to have a class hierarchy, you define classes. You can do it both with a property graph and
in RDF, but in RDF it's easier. Why? Because you can immediately say, this is a class: you can say it's of type rdfs:Class, and then you can use the subclass relation to
relate classes and subclasses. And then any other system that consumes the RDF knows what rdfs:subClassOf is, because it's in the specification of RDFS.
Loris Marini: Uhhuh.
Panos Alexopoulos: A property graph does not give you a subclass, so you have to define it yourself. You have to say: I'm going to create a relation and I'm going to name it, let's say, subclass.
What's the problem with that? The problem is that because this is not a standard name,
Loris Marini: Mm-hmm.
Panos Alexopoulos: it's not easy for any other application to use it unless it knows its meaning. So RDF gives you a predefined element that helps you with shareability.
However, it doesn't ensure that the quality of the subclass relation in one is better than in the other.
There are many ways that the subclass relation is misused in RDF
models; it doesn't prevent you from putting garbage in.
It's additional structure that helps with clarity of semantics,
because it gives you additional elements that have clearer
semantics than loose ones. It gives you less freedom to define your own things, but exactly because of that lesser freedom, you move towards better shareability.
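A minimal sketch of why an agreed subclass relation matters: with a toy triple set (pure Python, not a real RDF library, and invented class names), any consumer can compute the class hierarchy because the relation's transitive meaning is fixed by convention, the way rdfs:subClassOf is fixed by the RDFS specification:

```python
# Toy triple store. Because 'subClassOf' has an agreed transitive meaning
# (as rdfs:subClassOf does in RDFS), any consumer can expand the class
# hierarchy without asking the model's author what the name means.
triples = {
    ("Politician", "subClassOf", "Person"),
    ("Senator", "subClassOf", "Politician"),
    ("Athlete", "subClassOf", "Person"),
}

def superclasses(cls):
    """All classes reachable via the transitive 'subClassOf' relation."""
    found, frontier = set(), {cls}
    while frontier:
        nxt = {o for (s, p, o) in triples
               if p == "subClassOf" and s in frontier}
        frontier = nxt - found
        found |= nxt
    return found

# superclasses("Senator") == {"Politician", "Person"}
```

With a home-grown relation name in a property graph, this same traversal works inside your own system, but a third-party consumer has no way to know the name carries transitive is-a semantics.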
Loris Marini: So let's explore the other way around. Why should I choose a
property graph over RDF?
Panos Alexopoulos: Okay. First of all, when it comes to these models, there are two layers where we can have the discussion. One is the conceptual layer, where as a modeler I ask: which is the better tool to do my work? And the other is the implementation aspect, the engineering aspect, because all these have a physical manifestation.
If I have an RDF graph, it needs to live in an RDF repository, and I need to use SPARQL, and I need to use a particular technology stack to do my work. If I use a property graph, I use different APIs, different pipelines. And there, people more knowledgeable than me can tell you what the merits and problems are.
Loris Marini: So can we go down to...
Panos Alexopoulos: Yeah. Which makes faster queries, which is more indexable. So we go to the database level, the physical level,
Loris Marini: The physical level.
Panos Alexopoulos: which I cannot comment on. Actually, at Textkernel we use property
graphs for that, and it's mostly legacy, because this is what I found when I came in.
Loris Marini: right,
Panos Alexopoulos: When it comes to the modeling layer, from a modeling perspective, I would argue that RDF is slightly better, because it gives you a bit more structure.
And it already gives you predefined elements that you don't need to think about how to define. Especially if you want to interoperate and communicate with other models, in RDF it's slightly easier than if you want to merge, for example, property graphs.
One problem with property graphs is that there is no schema language, so the schema you have to define yourself, whereas RDF has a schema language and at least you have some expectations that you can rely on.
Loris Marini: Right. Similar to NoSQL databases versus tabular databases. With a piece of JSON you have total flexibility: you can create the schema on the fly, you can nest things.
Panos Alexopoulos: Personally, I always go for
a schema-based solution. Even if there is no
schema, I need to create one and have everybody use it.
Loris Marini: It enforces the structure, yeah. Speaking of interoperability, a question from Ashley Faith. She asks: many
companies can't even make their internal data sets interoperable, let alone externally. What advice would you give them to make interoperability more of a reality in the semantic modeling space?
Panos Alexopoulos: What I mentioned earlier, right? It's a mission within an organization,
and it's about
starting from the pain points. Interoperability is not something that you can sell to a customer in the majority of cases. It's a backend thing, an organizational thing.
You cannot, I don't know, make a marketing announcement and say, oh yeah, we improved interoperability within our organization.
Loris Marini: Like,
okay, what's in it for...
Panos Alexopoulos: The stock will not go higher.
Loris Marini: Yeah.
Panos Alexopoulos: So it's not something that is easily sold. It's hard, and you need to do it in a strategic way, to try to convince people, to talk to people. And again, start with the pain points and the minimum viable interoperability that you want to achieve. So instead of saying, I'm a company, I have ten departments, let's merge all ten departments:
okay, let's first merge, I don't know, two of them, and let's see what happens.
Loris Marini: The most critical ones, yeah, the ones from which we can get the highest leverage. Yeah, definitely. I see that clash a lot between purists and
pragmatists. The business,
at the end of the day, is there to make money. If you can produce at a lower cost and increase your margins, do it more efficiently, yeah,
then game on.
Panos Alexopoulos: Conceptually it's similar to what I think every IT
engineer knows as technical debt. It's also a type of debt, in the same way as technical
debt. Yes, it's semantic debt.
Loris Marini: You've got to manage it. Yeah, it's the same thing, I guess, a little bit like in finance. I see the same concepts. There's the famous book, Rich Dad, Poor Dad, and it's all about being able to manage debt as a way to build stuff. And I think it's very similar in business.
You've got to manage debt. You can't aim for a hundred percent perfect system. Hopefully, Ashley, that answers your question. I think it does. The last one is from Ole Olesen-Bagneux. I was just texting him this morning and I asked, what would you ask Panos?
And he said something really interesting, more on the philosophy side; I think it's a borderline philosophical question. He asks: to me, language is gestures, discreet intentions, persuading with humor more than reasoning, a convincing stare. How do you capture all this, and what is the effect
that all of this intangible part of communication has on objectivity,
truth and clarity?
Panos Alexopoulos: Yeah. I was talking with an ontology professor when I was doing my PhD research, and I told him I wanted to work on vagueness and the problem of defining things, and he said: why do you want to make your life difficult? Have you already solved the other problems?
It's actually a very good, very correct question. And it has to do not only with language, but with the limits of what we can explicitly represent and make processable by machines. Any modeling is an approximation, the same way that any language model is also an approximation, right?
We go from what people have in their minds, or what is expressed in a document, to a very reduced version. When you have just entities and relations, which is what the majority of semantic models have, there's not much room for anything else. And there's also another thing: when you have a semantic model, it's not independent.
In order to use it, you need to connect it somehow, in an operational environment, to whatever the task is. If it's a language task, text based, you have to be able to have input and output also expressed in text. If you have gestures, you need to have sensors, you need to have that information.
Okay, how do you define a gesture? What is its physical manifestation? How is it represented by those that work with gestures, and then how can you model it in the semantic model? Now, about objectivity. I would say there are very few things that are objective in reality, and I'm not going to the extreme that says everything is subjective, right?
There are some things that we can all agree on, but there are levels of agreement, levels of common acceptance, and whatever has common acceptance to a great extent we treat as objective. For example, we all say the Earth is a sphere. Yeah, there are some few that say the Earth is flat.
Okay, but I still consider this an objective fact; we cannot debate that. And there are things that are purely subjective: I like that movie, you didn't like it. Okay.
Loris Marini: Yeah. So again, it goes back to scope, essentially. Objectivity and scope are somewhat related.
Panos Alexopoulos: Objectivity and scope. Again, philosophically, if you think about it, so many of the problems in humanity, all the wars and all the conflicts, come from projecting what we call objective reality onto people that consider
something else their objective reality. Because we believe that this is objective, therefore if you think something different, you are
wrong.
This is where most conflicts come from.
Loris Marini: Yeah, so the problem is as deep and complex as humankind itself. I really enjoyed this chat. I believe we have approached the end of our time; I actually took a little bit over the time that you had allocated for the call, so I want to thank you for that. Panos Alexopoulos, author of the book Semantic Modeling for Data: Avoiding Pitfalls and Breaking Dilemmas, O'Reilly, 2020.
We're going to have links as usual to the book in the show notes for those that are interested. Definitely have a look. It's on my list, and Christmas on my side is coming very soon. Panos,
thank you very much again for being here.
Panos Alexopoulos: Thank you, Loris, for the invitation and the hosting.
Loris Marini: Absolutely my pleasure. And I'll catch you on LinkedIn.
Panos Alexopoulos: Thank you. Yeah.
Loris Marini: Thanks.