Today we talk about turning data into knowledge at scale, and about the arduous job of cleaning, defining, collapsing and expanding information and knowledge in the enterprise. My guest today is Jessica Talisman, Senior Taxonomist at Pluralsight.
If we do a good job, by the end of this episode you will understand the blind spots around the human workflow of creating controlled vocabularies, taxonomies and ontologies. You will also feel a bit more grounded and confident to join these hot topics and contribute your own unique perspective. You can follow Jessica on LinkedIn.
Clean, structured, and trusted data is the goal of most data engineering teams. But data per se isn't the end goal. If we want to work smarter and more effectively, we need to turn data into knowledge and create a stable knowledge base for the organization. So here are the trillion-dollar questions: how do we turn data into knowledge at scale? How do we align teams so that we understand things in the same way?
To learn more about this I had the pleasure of meeting Jessica Talisman in episode 040 🎉 of the Discovering Data podcast. We talked about the arduous job of cleaning, defining, collapsing, and expanding knowledge in the enterprise. This is a summary of what I learned in part 1 of this episode. Is this useful to you? Do you agree/disagree with any of this? Join the conversation so that we can keep learning from one another 🙌
Ok, let's go. I learned that a stable vocabulary gives us unique definitions. It also has clear relationships between entities at different levels of abstraction. And this is what we need to create systems that are readable by people and machines at the same time. If there's one thing I remember from this conversation, it's this:
Trusted data needs a stable vocabulary. We need a way to classify things (taxonomies) and a way to express their relationships (ontologies).
Let’s unpack that a little bit. A controlled vocabulary is a place where we define, disambiguate, de-duplicate, and clean the data. It's where we align synonyms and resolve ambiguities so that we can agree on what things mean.
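Here is a minimal sketch of that idea, assuming the vocabulary is just a small hand-written mapping. The terms, the variants, and the function names are made up for illustration; a real vocabulary would live in a governed system, not in a script.

```python
# Hypothetical controlled vocabulary: each preferred term lists the variants
# (synonyms, alternative spellings) that should collapse into it.
CONTROLLED_VOCABULARY = {
    "running shoes": ["sneakers", "trainers", "running shoe"],
    "sandals": ["flip flops", "flip-flops"],
}


def normalise(label):
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(label.lower().split())


def build_lookup(vocabulary):
    """Map every normalised label (preferred or variant) to its preferred term."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        for label in [preferred, *variants]:
            lookup[normalise(label)] = preferred
    return lookup


def to_preferred(label, lookup):
    """Return the preferred term, or None when the label is not in the vocabulary."""
    return lookup.get(normalise(label))


lookup = build_lookup(CONTROLLED_VOCABULARY)
raw_labels = ["Sneakers", "  trainers ", "Running  Shoes", "flip-flops", "boots"]
cleaned = [to_preferred(label, lookup) for label in raw_labels]
print(cleaned)
```

Labels that come back as `None` (here, "boots") are the interesting ones: they are the terms somebody still has to define and agree on.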
What's a taxonomy?
Taxonomies allow us to classify things and create order. One way to create structure in the data is to identify parent-child relationships between entities. This allows us to group them in what we called "containers" in the episode. An example is what we see at the checkout in a supermarket or in an online shop. Every item belongs to a category so we can navigate the menu easily. Another example is the file system in a computer, made of folders and subfolders. We can have many taxonomies and, as Jessica Talisman said, "the magic happens when you link taxonomies ontologically".
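The parent-child idea can be sketched in a few lines, assuming each term has exactly one parent. The category names below are hypothetical, picked to match the shoe example used later in these notes.

```python
# Hypothetical product taxonomy stored as child -> parent.
# The root term has no parent (None).
PARENT = {
    "footwear": None,
    "shoes": "footwear",
    "running shoes": "shoes",
    "trail running shoes": "running shoes",
    "sandals": "footwear",
}


def path_to_root(term, parent=PARENT):
    """Walk up the hierarchy and return the path from the root down to `term`."""
    path = [term]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return list(reversed(path))


def children(term, parent=PARENT):
    """Return the direct children of `term`, i.e. the content of its container."""
    return sorted(child for child, p in parent.items() if p == term)


print(" > ".join(path_to_root("trail running shoes")))
print(children("footwear"))
```

The path is the breadcrumb you would see in a menu, and `children` lists what sits inside one container.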
And this is the critical part.
Like a taxonomy, an ontology expresses a relationship between two entities. But it's way more powerful than a linear parent-child hierarchy. Take the terms "shoes" and "running shoes". Running shoes belong to the class of shoes, and as we traverse the tree, this parent-child relationship gives us useful information. But notice how we are still talking about the same type of object or entity. What if I want to define the relationship between my running shoes and my own body?
Let's see... I wear running shoes. So "I" is the subject, "shoes" is the object, and "wear" is the predicate that links the subject and the object together.
This triple (subject, predicate, object) is the basic building block of an ontology.
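To make the subject-predicate-object idea concrete, here is a minimal sketch that stores a handful of statements as plain tuples and filters them. Only the first statement comes from the conversation; the others, along with the `Triple` and `match` names, are assumptions added for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


TRIPLES = [
    Triple("I", "wear", "running shoes"),              # the example from the text
    Triple("running shoes", "is a kind of", "shoes"),  # the taxonomy link, as a triple
    Triple("running shoes", "protect", "feet"),        # hypothetical
    Triple("feet", "are part of", "body"),             # hypothetical
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that was given."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Everything we know about running shoes, across hierarchy and other relations.
for t in match(TRIPLES, subject="running shoes"):
    print(t.subject, "|", t.predicate, "|", t.obj)
```

Notice that the parent-child link is just one predicate among many, which is why statements like these can connect things that sit in different taxonomies.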
This is powerful because it allows us to express relationships between objects in different parts of the knowledge base. Another big lesson for me is that taxonomies and ontologies need governance! If they are not well designed and maintained, we lose consistency and we can no longer agree on what things mean. We also talked about the FRBR (Functional Requirements for Bibliographic Records) model, and the need to document how inputs lead to outputs so that we can reconstruct the lineage.
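Documenting how inputs lead to outputs can be as simple as recording "derived from" links and walking them backwards. The sketch below assumes a tiny, made-up set of assets and is not an implementation of FRBR; it only shows the lineage idea.

```python
from collections import defaultdict

# Hypothetical lineage records: (output, input) means "output was derived from input".
DERIVATIONS = [
    ("cleaned_orders", "raw_orders_export"),
    ("cleaned_orders", "customer_master"),
    ("monthly_revenue_report", "cleaned_orders"),
]


def build_inputs(derivations):
    """Group the recorded inputs by the output they produced."""
    inputs = defaultdict(list)
    for output, source in derivations:
        inputs[output].append(source)
    return inputs


def upstream(asset, inputs):
    """Return every asset that `asset` depends on, directly or indirectly."""
    seen = []
    stack = [asset]
    while stack:
        current = stack.pop()
        for source in inputs.get(current, []):
            if source not in seen:
                seen.append(source)
                stack.append(source)
    return seen


inputs = build_inputs(DERIVATIONS)
print(upstream("monthly_revenue_report", inputs))
```

If a link is never recorded, the walk simply stops there, which is exactly the gap that makes lineage impossible to reconstruct later.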
An example is content creation. From each conversation in this podcast, I extract clips for Twitter, TikTok, YouTube, and Instagram. I must keep the relationship with the original content if I want to reuse the same clips for different purposes. We talked about OpenRefine, an open-source tool that Jessica uses all the time to integrate schemas and reconcile data. If two names have the same spelling, are they the same thing? For example, how many different things can the word "board" mean?
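The "board" question can be illustrated with a toy disambiguation step. This is a sketch under loud assumptions: the sense identifiers, definitions, and context words are invented, and the word-overlap scoring is a deliberately naive stand-in. It is not how OpenRefine reconciles data.

```python
# Hypothetical senses of "board"; identifiers and context words are illustrative.
SENSES = {
    "board": [
        {"id": "org:board-of-directors",
         "context": {"meeting", "directors", "vote", "chair"}},
        {"id": "material:board",
         "context": {"wood", "plank", "cut", "timber"}},
        {"id": "ui:board",
         "context": {"kanban", "ticket", "sprint", "column"}},
    ]
}


def disambiguate(term, sentence, senses=SENSES):
    """Pick the sense whose context words overlap most with the sentence.

    Returns None when the term is unknown or no sense shares a word with
    the sentence, so the case can be sent to a human instead of guessed.
    """
    words = {w.strip(".,;:!?").lower() for w in sentence.split()}
    candidates = senses.get(term.lower(), [])
    scored = [(len(words & s["context"]), s["id"]) for s in candidates]
    best_score, best_id = max(scored, default=(0, None))
    return best_id if best_score > 0 else None


print(disambiguate("board", "The board will vote at the next meeting."))
print(disambiguate("board", "Move the ticket to the next column on the board."))
print(disambiguate("board", "I like this board."))
```

The last call returns `None` on purpose: same spelling, no evidence, so nothing gets merged automatically.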
These questions might seem trivial, but they are critical to reconciling reports and establishing the ground truth. We talked about the problem of "collisions" and how they cause "stemming" issues in search. We talked about the concept of a knowledge graph and how that differs from knowledge graph databases.
And we also touched on open-world vs closed-world systems. I learned that a well-implemented knowledge graph (not the database LOL) should be able to connect the inner world with the outer world. That's how we can find connections between terms that exist inside and outside the walls of the org. Perhaps this is what we need if we want to create the sort of "networks of trust" that Douglas Laney talked about in Episode 020 of this podcast. Next week we'll dive into part 2 of this episode to explore the people challenges in creating vocabularies. We'll also look at some strategies we can use to do our job better. Until then, stay curious and keep discovering data!