PA

Knowledge Graphs for AI: Ontologies, Semantic Layers & GraphRAG Explained

Knowledge is of no value unless you put it into practice. — Anton Chekhov

In college, I took a course called IF15: Knowledge Engineering. That's when I heard the word ontology for the first time in my life. It would pop up occasionally in papers and articles I read, but I never took the time to dig deeper—I never felt the need, the necessity.

This term has come back in force over the past few months with the GenAI boom, and especially with the realization that classic RAG is failing and the rise of an alternative—or rather, a helper: GraphRAG. GraphRAG relies on the notion of Knowledge Graphs, which is deeply connected to the concept of Ontologies.

Originally, this article was just going to be an introduction to ontologies. Then I realized how irrelevant it would be to stay narrowly scoped on that single concept. So I reoriented the article toward knowledge in general—the broader picture.

The thread running through this article: how do we structure what we know so that both machines and humans can understand it?


What is Knowledge?

Knowledge Definition
Knowledge Definition from my IF15 course

The Triptych: Data → Information → Knowledge

Before talking about how to model knowledge, let's define what it is.

We generally distinguish three levels:

  • Data: Raw facts, without context. 42, "Paris", 2024-01-15.
  • Information: Contextualized data. "The customer ordered 42 units in Paris on January 15, 2024."
  • Knowledge: Information usable for action or decision-making. "Paris orders increase in January—we need to anticipate stock levels."

Knowledge is therefore information used in a given context to solve a problem or make a decision (thanks to my UTT course).

Knowledge Engineering

My IF15 course defined knowledge engineering as:

An approach that collects and structures reasoning. Its objective is to formalize problem-solving—the approach followed by one or more experts to solve a problem.

In other words: externalize the knowledge produced "in" and "for" a domain, and make it exploitable.

At the time, I found it very theoretical, almost boring. Today, with agents that need to "understand" our data to generate SQL queries or answer business questions, this discipline makes complete sense.

The Two Faces of Metadata

When we talk about knowledge in enterprises, we're essentially talking about metadata—data about our data. This metadata divides into two fundamental categories:

Metadata Two Faces
Metadata Two Faces

Domain Knowledge (Business Knowledge)

This is what the business knows about its domain:

  • Business concepts and jargon: What is "churn"? "MRR"? A "qualified lead"?
  • Glossaries and definitions: How do we calculate revenue? Gross or net?
  • Acronyms and synonyms: WC = World Cup, ARR = Annual Recurring Revenue, CMR = Cameroon

Structural Knowledge (Technical Knowledge)

This is what the data knows about itself:

  • Relationships between elements: Which tables can be joined? On which keys?
  • Dependencies: If I modify this column, what breaks?
  • Lineage: Where does this data come from? What transformations has it undergone?

These two types of knowledge are complementary. Domain knowledge says "the business talks about revenue"; structural knowledge says "revenue is in fact_sales.amount". Without the mapping between the two, it's impossible to translate a business question into a technical query.
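As a minimal sketch of that mapping (all table and column names here are hypothetical examples, not taken from any real system), it can start as a plain dictionary from business terms to physical columns:

```python
# Minimal domain-to-structural mapping (hypothetical names).
BUSINESS_TO_TECHNICAL = {
    "revenue": "fact_sales.amount",
    "customer": "dim_customer.customer_id",
    "churn": "fact_subscriptions.cancelled_at",
}

def resolve(term: str) -> str:
    """Translate a business term into its physical column, or fail loudly."""
    try:
        return BUSINESS_TO_TECHNICAL[term.lower()]
    except KeyError:
        raise KeyError(f"No mapping for business term: {term!r}")

print(resolve("revenue"))  # fact_sales.amount
```

Failing loudly on an unknown term matters: a silent fallback is exactly how an agent ends up guessing column names.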

Garbage In, Garbage Out

We all know this principle in Machine Learning: if your training data is bad, your model will be bad.

This principle applies exactly to GenAI and Agents, but with an important nuance: for GenAI, the "garbage" we're talking about is primarily the metadata.

When you want to do a text-to-SQL project for example, the heart lies in the metadata—the description of the data you have in your possession.

Very often, companies rush directly into the AI layer, GenAI—either to follow the trend or because they think that's where the difficulty lies. But not at all. The difficulty is upstream: in the quality and completeness of metadata (and obviously data, but this is normally already well known...).

Investing heavily in sophisticated models without investing in metadata is building a house on sand.


How to Model Knowledge?

There are several ways to structure knowledge, with different levels of sophistication. These are called Knowledge Management Structures.

The choice depends on the use case, the scale, and especially who will consume this knowledge. They can obviously be combined depending on the use cases.

List (Controlled Vocabulary)

List Structure
List Structure: Simple enumeration without relationships

The most basic form of structuring.

A simple enumeration of possible values, with no relationships between them. The structure is flat and non-hierarchical, with no semantics beyond belonging to the list.

Examples:

  • List of countries: France, Germany, Spain...
  • List of genders: Male, Female, Non-binary
  • List of order statuses: Pending, Shipped, Delivered, Cancelled

Lists are useful for constraining values, but they capture no relationships or meaning.
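A controlled vocabulary maps naturally onto an enumeration type. A minimal sketch, using the order-status example above:

```python
from enum import Enum

# A controlled vocabulary: the only semantics is membership in the list.
class OrderStatus(Enum):
    PENDING = "Pending"
    SHIPPED = "Shipped"
    DELIVERED = "Delivered"
    CANCELLED = "Cancelled"

def is_valid_status(value: str) -> bool:
    """True only if the value belongs to the controlled vocabulary."""
    return value in {s.value for s in OrderStatus}

print(is_valid_status("Shipped"))   # True
print(is_valid_status("Refunded"))  # False: not in the vocabulary
```

Note that the enum can only answer "is this value allowed?"; it says nothing about how statuses relate to each other, which is exactly the limitation described above.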

Taxonomy

Taxonomy Structure
Taxonomy Structure: Hierarchical IS-A relationships

We step up by introducing hierarchy.

A taxonomy is a hierarchical classification based on parent-child relationships. There is a single relationship: IS-A. Taxonomies are tree-like, going from general to specific.

Examples: A car IS-A vehicle, an SUV IS-A car, an SUV IS-A vehicle (by transitivity).

What is great here is the introduction of conceptual hierarchy: you can navigate from the general to the particular and vice versa. However, only one relationship is possible. You can't say that a car belongs to someone or is manufactured by a brand...
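The transitivity of IS-A can be sketched with nothing more than child-to-parent links and an upward walk (the concepts are the hypothetical vehicle example from above):

```python
# IS-A hierarchy stored as child -> parent links.
IS_A = {
    "suv": "car",
    "car": "vehicle",
    "truck": "vehicle",
}

def ancestors(concept: str) -> list:
    """Walk IS-A links upward: transitivity comes for free."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain

print(ancestors("suv"))  # ['car', 'vehicle']: an SUV IS-A vehicle by transitivity
```

And indeed, that one dictionary is the whole structure: there is nowhere to put "belongs to" or "manufactured by".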

Thesaurus

Thesaurus Structure
Thesaurus Structure: IS-A + Synonyms + Related-To

The thesaurus enriches the taxonomy with synonymy and association.

Thesauri are taxonomies augmented with equivalence and association links.

Relationships:

  • IS-A (inherited from taxonomy)
  • SYNONYM-OF: Car ↔ Automobile ↔ Auto
  • RELATED-TO: Car ↔ Road, Car ↔ Driver

So, they help handle linguistic ambiguity: when a user searches for "auto", we also find "car".

Typical usage: Search engines, indexing systems, navigation aids.
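The search-expansion behavior can be sketched with synonym rings (the vocabulary below reuses the car/automobile/auto example; it is illustrative, not a real thesaurus):

```python
# Synonym rings (SYNONYM-OF) and associative links (RELATED-TO).
SYNONYMS = [{"car", "automobile", "auto"}]
RELATED = {"car": {"road", "driver"}}

def expand_query(term: str) -> set:
    """Expand a search term with its synonyms, so 'auto' also finds 'car'."""
    term = term.lower()
    for ring in SYNONYMS:
        if term in ring:
            return set(ring)
    return {term}

print(expand_query("auto"))  # contains 'car', 'automobile' and 'auto'
```

A search engine would then query the index with every term in the expanded set instead of the literal user input.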

Semantic Layer

The semantic layer is a related concept that has had a lot of influence in the data ecosystem, with tools like DataHub, dbt Semantic Layer, or Tableau/PowerBI data models.

They are pre-calculated logical views on data, defining business metrics and concepts. By design, they are hard-coded and static. They are often scoped to a single tool (Tableau, PowerBI, dbt) and are more like "semantic views" than true semantics.

Concrete example:

metrics:
  - name: revenue
    description: 'Total revenue from completed orders'
    type: sum
    sql: amount
    filters:
      - status = 'completed'

Despite their theoretical importance, semantic layers remain marginal with clients. Very few companies actually have a mature semantic layer. And when it exists, it's often limited to a specific tool.

The semantic layer references recurring information but doesn't allow generating new knowledge. It's static—you define "revenue," but you can't dynamically ask "which metrics are related to revenue?"
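To make the "static" point concrete, here is a minimal sketch of what a tool does with a metric definition like the YAML above: it compiles it into SQL, nothing more (the dictionary mirrors the YAML example; the compilation logic is illustrative, not any specific tool's implementation):

```python
# A metric definition mirroring the YAML example above.
metric = {
    "name": "revenue",
    "type": "sum",
    "sql": "amount",
    "filters": ["status = 'completed'"],
}

def compile_metric(metric: dict, table: str) -> str:
    """Render a static metric definition into a SQL query string."""
    agg = f"{metric['type'].upper()}({metric['sql']})"
    where = " AND ".join(metric["filters"])
    return f"SELECT {agg} AS {metric['name']} FROM {table} WHERE {where}"

print(compile_metric(metric, "orders"))
# SELECT SUM(amount) AS revenue FROM orders WHERE status = 'completed'
```

The definition can be rendered, but not traversed: there is no way to ask it which other metrics are related to revenue.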

Ontology

Ontology Structure
Ontology Structure: Rich relationships, classes, instances, axioms

Ontology is the major qualitative leap. We move from static to dynamic.

Ontologies are formal structures allowing rich, typed, and semantic relationships that are unlimited and explicit (MARRIED-TO, WORKS-FOR, MANUFACTURED-BY, LOCATED-IN, PURCHASED, ...).

Structure:

  • Classes: Abstract concepts (Person, Product, Company)
  • Subclasses: Specializations (Employee IS-A Person)
  • Instances: Concrete entities representing real facts (John Smith, iPhone 15)
  • Axioms: Rules and constraints ("An employee can only work for one company at a time")
  • Properties: Attributes of classes (Person has an age, a name...)

There are many standards out there: RDF, OWL, SPARQL (we used all three at UTT, lol), and we worked with a tool named [Protégé](https://protege.stanford.edu/) (read it in French, please).

The ontology is by-design traversable. You can query it to infer new information that wasn't explicitly declared.

Example: If John WORKS-FOR Acme, and Acme LOCATED-IN Paris, then we can infer that John works in Paris—even if this fact isn't directly stored.
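That inference can be sketched as one rule over a set of typed facts (a toy stand-in for what an OWL reasoner does; the facts are the John/Acme/Paris example above):

```python
# Typed facts as (subject, predicate, object) triples.
facts = {("John", "WORKS-FOR", "Acme"), ("Acme", "LOCATED-IN", "Paris")}

def infer_works_in(facts: set) -> set:
    """Rule: if X WORKS-FOR Y and Y LOCATED-IN Z, then X WORKS-IN Z."""
    inferred = set()
    for s, p, o in facts:
        if p == "WORKS-FOR":
            for s2, p2, o2 in facts:
                if p2 == "LOCATED-IN" and s2 == o:
                    inferred.add((s, "WORKS-IN", o2))
    return inferred

print(infer_works_in(facts))  # {('John', 'WORKS-IN', 'Paris')}
```

The key point: ("John", "WORKS-IN", "Paris") is never stored; it is derived on demand from the declared facts.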

Knowledge Graph

Knowledge Graph Structure
Knowledge Graph: Instantiated and traversable ontology

The Knowledge Graph is, in my current understanding, mainly the concrete implementation of an ontology.

A graph of structured data where entities (nodes) are connected by typed relationships (edges). Simple, basic.

Nodes are entities (people, products, concepts...) and edges are labeled, directional relationships.
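A minimal sketch of such a graph, as an adjacency list of typed, directed edges (the entities and relations are hypothetical examples):

```python
# Knowledge graph: node -> list of (relation, target) edges.
graph = {
    "John": [("WORKS-FOR", "Acme")],
    "Acme": [("LOCATED-IN", "Paris"), ("MANUFACTURES", "Widget")],
    "Paris": [],
    "Widget": [],
}

def neighbors(node, relation=None):
    """Follow outgoing edges, optionally filtered by relation type."""
    return [t for rel, t in graph.get(node, []) if relation in (None, rel)]

print(neighbors("Acme"))                # ['Paris', 'Widget']
print(neighbors("Acme", "LOCATED-IN"))  # ['Paris']
```

Production systems use graph databases (Neo4j, RDF triple stores) for this, but the data model is the same: typed nodes and typed edges you can traverse.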

Why and For Whom Should We Model Knowledge?

Structured knowledge has three types of consumers, each with specific needs.

For Humans

For humans, structured knowledge is invaluable across roles:

  • Data analysts, analytics engineers, and data scientists benefit first: they gain the context to interpret fields like status_cd, understand how tables can be joined, and correctly discern whether a negative amount signals a refund or an error. In the absence of clear documentation, newcomers are forced to relearn what was already known.
  • Stakeholders and business users rely on a common language to avoid ambiguity: a shared glossary ensures that everyone understands terms like "churn" and calculates KPIs, such as "revenue," using consistent logic, enabling cross-team communication so that Marketing and Finance speak the same language.
  • Operational and data engineers, along with new team members, need living documentation to grasp data processes: it accelerates onboarding so that the information system becomes navigable in days rather than months, and facilitates traceability and audit by making it clear where numbers come from and how calculations happen.

Agents

This is where it gets really interesting. Let's directly take the very common Text-to-insights agent use case.

The Text-to-insights Challenge

Everyone wants to chat with their data, but not everyone is ready to do what is necessary.

Whether the data is in a Data Lake, a Data Warehouse, or a simple relational database, the problem is the same: translating a business question into a technical query.

To achieve this, the agent must be able to:

  1. Map business concepts → "revenue" corresponds to which column?
  2. Understand values → "World Cup" is the code WC or WORLD_CUP?
  3. Know the joins → How do you link customers to orders?
  4. Respect business rules → Is revenue calculated before or after tax?

What Agents Need

Concretely, a performant Text-to-insights agent needs:

  • Glossary: concept → technical mapping. Example: "revenue" = SUM(orders.amount)
  • Enriched schema: tables + columns + descriptions. Example: status_cd: Status code (A=Active, I=Inactive)
  • Joins: relationships between tables. Example: orders.customer_id → customers.id
  • Validated examples: question/SQL pairs. Example: "Top 10 customers" → SELECT...
  • Business rules: constraints and calculations. Example: Revenue = amount before tax, excluding cancellations
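A minimal sketch of how these elements get assembled into the context given to the agent (the metadata content is hypothetical, and real systems would retrieve only the relevant subset per question):

```python
# Metadata elements for a Text-to-insights agent (hypothetical content).
metadata = {
    "glossary": {"revenue": "SUM(orders.amount)"},
    "schema": {"orders.status_cd": "Status code (A=Active, I=Inactive)"},
    "joins": ["orders.customer_id -> customers.id"],
    "rules": ["Revenue = amount before tax, excluding cancellations"],
}

def build_context(metadata: dict) -> str:
    """Flatten the metadata into a text block to inject into the LLM prompt."""
    lines = []
    for section, content in metadata.items():
        lines.append(f"## {section.upper()}")
        if isinstance(content, dict):
            lines += [f"- {k}: {v}" for k, v in content.items()]
        else:
            lines += [f"- {item}" for item in content]
    return "\n".join(lines)

print(build_context(metadata))
```

The agent's SQL quality is bounded by what this context contains: a join or business rule missing here simply cannot appear correctly in the generated query.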

The Measured Impact

This isn't theory. Research (notably from LinkedIn and Snowflake on Cortex) has quantified the impact of metadata on the quality of generated queries.

The difference between an agent that hallucinates non-existent columns and an agent that produces correct queries? The quality of metadata provided in context.


Why Does This Matter Now?

The Return of Knowledge Engineering

The term "ontology" has been experiencing a resurgence over the past year. This is no coincidence: it's directly correlated with the rise of GenAI.

The first peak of interest in ontologies was correlated with the big data boom, the second with the GenAI one.

It reminds me of my university courses, courses I found sometimes boring. Those courses are getting their revenge.

The Failure of Classic RAG

Classic RAG (Retrieval-Augmented Generation) works like this:

  1. Split documents into chunks
  2. Vectorize these chunks
  3. Retrieve chunks similar to the question
  4. Inject them into the LLM prompt

We inject raw context—pieces of text without structure. It's sufficient for simple factual questions ("What is the refund policy?"), but insufficient for complex reasoning ("Which customers are at risk of churning next month?").

Classic RAG is a Raw Context Retriever. It retrieves text, not knowledge.
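The four steps above can be sketched end to end. This toy version uses bag-of-words vectors and cosine similarity as stand-ins for a real embedding model and vector store, which is enough to show that retrieval ranks text by surface similarity, not by meaning:

```python
from collections import Counter
import math

# Step 1-2: "documents" already split into chunks, vectorized as word counts.
chunks = [
    "The refund policy allows returns within 30 days.",
    "Shipping is free for orders above 50 euros.",
]

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Step 3: retrieve the chunks most similar to the question.
def retrieve(question, chunks, k=1):
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]

# Step 4 would inject the retrieved chunks into the LLM prompt.
print(retrieve("What is the refund policy?", chunks))
```

This answers "What is the refund policy?" fine, but it has no way to chain facts across chunks, which is exactly where complex reasoning questions break down.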

From Retrieval to Reasoning

RAG Evolution
RAG Evolution: From Retrieval to Reasoning
RAG today vs. RAG tomorrow:

  • Full name: Retrieval Augmented Generation → Reasoning Augmented Generation
  • Input: raw text chunks → structured knowledge
  • Method: vector similarity → vector similarity + graph traversal
  • Capability: finding facts → inferring insights

Raw context is interesting for facts, but it's even more impactful to be able to reason over existing knowledge in a domain.

This is where Knowledge Graphs and ontologies come into play. They allow agents to navigate through knowledge (not just retrieve it), infer non-explicit facts, and reason about relationships between concepts.

The Evidence for Enterprises

It has become obvious that agents need a structured way to understand reasoning processes.

  • Investing in the AI layer without investing in metadata = predictable failure
  • Output quality is determined by input quality (garbage in, garbage out)
  • Knowledge Management is no longer a nice-to-have, it's a prerequisite

The good news: You don't need to do everything at once.

Where to Start?

  1. Start small: A CSV file with a glossary of business terms
  2. Document key tables: The most queried ones first
  3. Describe columns: Possible values, meaning, usage patterns
  4. Map joins: Relationships between main tables
  5. Collect examples: Question/SQL pairs validated by humans

Perfection is not required. Progress is.

This metadata can be AI-assisted: take samples from your tables, pass them to an LLM to generate descriptions, then manually validate and adjust. It's tedious work, but it's the work that makes the difference between a POC that impresses and an agent that delivers value in production.
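Step 1 from the list above really can be a single CSV file. A minimal sketch (the terms, definitions, and mappings are hypothetical placeholders for your own glossary):

```python
import csv
import io

# Step 1: a business glossary as a plain CSV (hypothetical content).
GLOSSARY_CSV = """term,definition,technical_mapping
revenue,Total revenue from completed orders,SUM(orders.amount)
churn,Customer who cancelled in the last 90 days,subscriptions.cancelled_at
"""

def load_glossary(text: str) -> dict:
    """Index glossary rows by business term for quick lookup."""
    return {row["term"]: row for row in csv.DictReader(io.StringIO(text))}

glossary = load_glossary(GLOSSARY_CSV)
print(glossary["revenue"]["technical_mapping"])  # SUM(orders.amount)
```

A file like this, versioned in git and reviewed like code, is already enough to start feeding an agent's context, and it grows naturally into the richer structures described earlier.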


Ontology is not (only) a dusty academic concept. It's the foundation on which tomorrow's agents will be able to reason—not just retrieve text.

Classic RAG has shown its limits. GraphRAG and Knowledge Graph-based approaches point toward the future: systems that understand the structure of knowledge, not just its textual content.

For enterprises, the message is clear: before investing in the latest trendy use cases/tools, invest in your metadata. Document your tables. Define your concepts. Map your relationships.

It's less sexy than a new tool, but it's what will make the difference between an agent that hallucinates and an agent that reasons.


PA,