The Semantic Layer doesn't make sense

How I stopped YAML'ing and learned to love prose.

Sep 26, 2025

The Semantic Layer has been a top-five hot topic in the last couple years of Analytics Engineering (the post-modern-data-stack years).

Earlier this week a newly formed consortium of data companies announced a plan to collaborate on an Open Semantic Interchange (OSI) standard.

HEADLINE: Snowflake Unites Industry Leaders to Unlock AI’s Potential with the Open Semantic Interchange Initiative (link)

The announcement is light on details. There’s literally just one: YAML. So who knows what it will look like, but I thought I’d take the opportunity to do some ranting.

Starting with some assumptions:

- You are on the data team at some business.
- The business has a data warehouse, like Snowflake, into which you copy data tables from many sources (your product, Salesforce, Zendesk etc. etc.)
- The ‘raw’ tables are kind of hard to work with, so you have a whole bunch of SQL dedicated to just cleaning, renaming, and generally remodeling your data into a cleaner more intuitive state.
- Many of the remodeled tables are so nice, that anyone with basic SQL skills could run with them, but none of the people who would benefit from access know SQL, nor do they plan to learn it.
- Many of the remodeled tables, especially the important ones, have a lot of sharp edges. Their transformations were written years ago, and fixing the sharp edges would mean rocking the boat, and tracking down myriad dependencies.
  - Some sharp edges are easy to fix, but very hard to explain, so it’s easiest to just leave them.

With that world building aside, let’s get into the complaining.

The Semantic Layer is a bad idea and we don’t need it, especially not now.

Semantic layers are tedious, bordering on intractable.

High-dimensional spaces are un-intuitively vast, and data warehouses are high dimensional spaces.

A table with 10 columns → a 10-dimensional space.

Most companies have 1,000s of tables, several with 100+ columns, and these tables are inter-related.

Bringing a Semantic Layer to this high dimensional manifold is like bringing a label printer to a jungle. Now start by labeling each thing in the jungle. Then label the relationships between the things, and finally label all the interactions you might have with a thing or a relationship.
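To put rough numbers on the jungle, here is a back-of-the-envelope sketch. The warehouse size (1,000 tables averaging 20 columns) is a made-up assumption, not a measurement, and it only counts pairwise relationships, which is the cheapest kind of relationship to label.

```python
from math import comb

# Hypothetical warehouse size; both numbers are assumptions for illustration.
TABLES = 1_000
AVG_COLUMNS_PER_TABLE = 20


def label_counts(tables: int, avg_columns: int) -> dict[str, int]:
    """Count the labels needed for things and for pairs of things."""
    columns = tables * avg_columns
    return {
        "tables": tables,
        "columns": columns,
        # Every unordered pair of tables is a potential relationship to label.
        "table_pairs": comb(tables, 2),
        # Every unordered pair of columns is a potential join key or interaction.
        "column_pairs": comb(columns, 2),
    }


counts = label_counts(TABLES, AVG_COLUMNS_PER_TABLE)
for name, value in counts.items():
    print(f"{name:>12}: {value:,}")
```

Even under these modest assumptions, the pairs outnumber the things by orders of magnitude, and that is before you get to labeling interactions.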

This is something you can do, but it’s a very silly thing to do. And when a data company tells you that they sell such a label printer, you should imagine yourself stumbling around labeling things until the Earth crashes into the Sun.

Tesseract | Interstellar Wiki | Fandom — defining metrics in the data warehouse

You can just name things

Tables and columns are already name-able. You don’t need a Semantic Layer. You just need to name them well.

If your data team cannot rename something, then you have a problem; you have lost the edit privileges needed to do your job. It doesn’t matter if the gatekeeper is IT or Finance. You need edit privileges. Your job is to produce semantic coherence, and well, take it from Larry, at 8:40.

Unless you name things the right way, you can’t have intelligent conversations about them. So once you rename [this table] as Timesheet, or Timecard, now you can ask some intelligent questions.

Naming things is more important than ever before, and it’s hard. It has always been hard. Here’s the more classic quote:

There are only two hard things in Computer Science: cache invalidation and naming things.

This is a good thing for Analytics Engineering as a craft. You need to have hard things to get good at.

Building an Analyst Agent is a great Trojan Horse to kick off your real crusade: renaming things. Name your reasoning as “AI Readiness”.

How many Software Engineers does it take to write a paragraph?

A big part of the LLM paradigm shift is the new primacy of prose. Software Engineers need to pick up Technical Writing.

The reason you don’t hear more Software Engineers talking about this is that we mostly suck at writing, and many SWEs have convinced themselves that programming languages are actually harder, and more important than language-languages.

And I get it. I’ve been writing SQL for 10 years. It’s super satisfying to get caffeinated and do some mental shape rotating. Plus you get paid a lot to do it, and very few people get paid to do writing.

But consider this…

Where did the name “Semantic Layer” come from?

I’ll tell you where.

Some marketer wordcel made it up as part of their ‘developer-led growth’ GTM plan.

They smelled your fear of writing paragraphs, and contrived some system where you’d only have to write one sentence at a time. Then they called it a “YAML-based framework” that “layers on to your data stack” and you ate it up.

Gemini 2.5 has 99.9th-percentile reading comprehension and a 1M-token context window.

You think you need a framework to “organize semantic metadata”?

Just. Write. Paragraphs.

Don’t get sold on some paint-by-numbers worksheet with 15 data company logos stamped on it.

Set aside some time. Make it fun for yourself. Maybe start a little Substack? Write about something technical. Switch between styles carelessly. Put punctuation wherever you want, and only share it with SWEs, like some kind of support group.

It’s an old idea that has never worked

I think the Semantic Layer traces back to the idea of the Semantic Web (wikipedia).

The goal of the Semantic Web is to make Internet data machine-readable

Tim Berners-Lee was a bit of a one-hit wonder with “the World Wide Web”, but he also had some ideas about what we should do with it, before it turned into… what it turned into.

It’s kind of a cool idea, definitely worth skimming. But it didn’t catch on.

Instead, we got 25 years of recipe blogs, and then finally LLMs.

Machine Readability is now solved. It is no longer an obstacle.

Standards? Graphs? Yeah, we’ve got those, they’re called written English and the ParaGraph.

Sometime in the next year a salesperson will advise you to convert your paragraphs into their proprietary YAML framework using ChatGPT. When that happens, you must immediately end the call.

SQL is the interface

The most successful semantic layer of the last decade is LookML, Looker’s semantic layer for defining dimensions and measures on each table, and the relationships between sets of tables. The purpose of this semantic layer was to enable a no-code UI where users could point-and-click their way to combinations of dimensions, measures and charts.

Today, the best UI for almost everything is Chat. You don’t need to set up a semantic layer to produce a UI, because nobody wants a UI. Even if you did produce this UI, your users would ask for a Chat sidebar so they wouldn’t have to touch it.

How should an LLM interface with data?

The answer is: whatever way has the deepest distribution in the training corpus.

So actually the answer is Python + Pandas. But SQL is a solid option, easy to implement, and easier to reason about for the rest of the data team.

It’s actually kind of tragic that LLMs aren’t better at SQL given its age; SQL is older than Bash. What matters for LLMs is the prevalence of the language in the corpus of ‘the public internet’, and here SQL has several issues: it comes in many dialects, the tables a query references (and their state) are usually missing from the snippet, and much of the semantics is hidden as well. There’s just not as much SQL as there is “Data Science Python”.

But you know what there’s even less of? Proprietary semantic layer APIs that have yet to be written. Just switching between SQL and one of these APIs probably nerfs your model’s IQ by a dozen points.

Creative alternatives

Beyond Analytics Engineering and common sense, there are deeper, darker games to play around LLM intelligence. These are twisted systems built off the loot from the largest theft of intellectual property in human history.

I don’t have benchmarks to back this up, just intuition from the fact that ML systems are literally manifestations of bias, and these systems were trained on the internet, which I’ve spent time on.

Try writing your agent prompt like a monologue from American Psycho, or Wolf of Wall Street, but to the tune of mainstream ideals of diligence and capability.

Set your agent up as former Jane Street, 23 year old Korean American male. Caffeinated, and freshly bumped on whatever nootropic drug the model most closely associates with intelligence (ask it). Your company is small and selective. Fast growing, with offices in Manhattan and San Francisco. Your context and prompts use American coastal elite idiomatic English with immaculate spelling and grammar. If you give the agent a name (as the Claude system prompts do) then you must pick something with minimal cultural-political valence.

And keep the whole prompt tight. You can’t have a Korean-American male named “Claude 2”. That’s semantic incoherence.

My point is really that these tools aren’t part of a stack the way the last decade of vendor products were. You gotta jam on this yourself.

And I’ll say one extra time that these biases are twisted and bad. This technology is already shitting where it eats (the internet), which is also where we used to be able to eat. As proof of my deep cynicism toward these models, here are the darkest most dispiriting 10 minutes of online video I have seen in my career.

What’s worth salvaging

We don’t need a semantic layer. But there are some parts of the semantic layer concept that I do like; things that would be nice to have.

Table Relationships

Data warehouses forgot to implement these. They have their reasons, and it’s really fine, but it would be nice to define the existence of a relationship between tables in an official way.

Like hey, the ORDERS table can join to the USERS table on the user_id column, and that’s the same user_id that is the primary key of the USERS table, so it’s a M:1 join.

That’s great information to surface for my LLM.
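Here is a rough sketch of what surfacing that could look like, assuming nothing fancier than a uniqueness check on the join key. SQLite stands in for the warehouse, the rows are invented, and `is_unique` and `describe_join` are hypothetical helper names, not part of any tool.

```python
import sqlite3

# In-memory stand-in for the warehouse; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO orders VALUES (10, 1, 9.99), (11, 1, 25.00), (12, 2, 5.00);
    """
)


def is_unique(conn, table, column):
    """True if no value of `column` repeats in `table` (NULLs count as repeats)."""
    # Identifiers cannot be bound as parameters, so the names here are trusted.
    row = conn.execute(
        f"SELECT COUNT(*) - COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    return row[0] == 0


def describe_join(conn, child, parent, key):
    """Return a prose sentence describing the join, including its cardinality."""
    child_side = "1" if is_unique(conn, child, key) else "M"
    parent_side = "1" if is_unique(conn, parent, key) else "M"
    return (
        f"{child.upper()} joins to {parent.upper()} on {key}; "
        f"this is a {child_side}:{parent_side} join."
    )


sentence = describe_join(conn, "orders", "users", "user_id")
print(sentence)
```

The output is a sentence rather than a config entry, so it can be pasted straight into whatever prose you hand the model. The cardinality is inferred from the current data, so it describes what is true today, not a guarantee.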

I’d even be willing to let “user_id” exist in a global namespace so that I can just assume every “user_id” I come across is the “user_id” of the USERS table. This way I could save a ton of typing, and it might be the case that ‘name spacing’ is a good idea for semantic coherence generally.
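A minimal sketch of that global-namespace idea, assuming every table has a single-column primary key and that no two tables share a primary key name. SQLite is again a stand-in for the warehouse, the TICKETS table is invented, and both function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE tickets (ticket_id INTEGER PRIMARY KEY, user_id INTEGER, order_id INTEGER);
    """
)


def columns_by_table(conn):
    """Map each table name to a list of (column_name, is_primary_key)."""
    tables = [
        r[0]
        for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return {
        t: [(c[1], c[5] > 0) for c in conn.execute(f"PRAGMA table_info({t})")]
        for t in tables
    }


def infer_relationships(conn):
    """Treat every primary-key column name as global: same name, same key."""
    catalog = columns_by_table(conn)
    owner = {col: t for t, cols in catalog.items() for col, is_pk in cols if is_pk}
    return sorted(
        (t, col, owner[col])
        for t, cols in catalog.items()
        for col, is_pk in cols
        if not is_pk and col in owner
    )


relationships = infer_relationships(conn)
for child, key, parent in relationships:
    print(f"{child}.{key} -> {parent}.{key}")
```

The typing you save is real, but the convention only holds if nobody reuses a key name for something else, which brings us back to naming things well.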

Higher-order semantic objects

Tables and columns have descriptions. What if I want to create a description of a star schema / data mart? Where do I put that?

How do I track a table’s membership in one mart or another? How do I know when a table renaming has broken this mapping?

This might be one I’m just supposed to implement myself using the database. I could just data model it! But using a framework like dbt, it seems like something I’d rather ‘configure’ than implement.
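For what it’s worth, the implement-it-myself version could be as small as the sketch below. This is not a dbt feature; the mart name, the table names, and the `broken_members` helper are all hypothetical, and SQLite stands in for the warehouse catalog.

```python
import sqlite3

# Hypothetical mart definition: a prose description plus its member tables.
MARTS = {
    "sales_mart": {
        "description": (
            "Star schema for order revenue. FCT_ORDERS is the fact table; "
            "DIM_USERS describes who placed each order."
        ),
        "tables": ["fct_orders", "dim_users"],
    }
}


def existing_tables(conn):
    """Names of all tables and views currently in the database, lowercased."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')"
    )
    return {r[0].lower() for r in rows}


def broken_members(conn, marts):
    """Return {mart: [missing tables]} for marts whose mapping has gone stale."""
    present = existing_tables(conn)
    stale = {}
    for mart, spec in marts.items():
        missing = [t for t in spec["tables"] if t.lower() not in present]
        if missing:
            stale[mart] = missing
    return stale


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fct_orders (order_id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE dim_users (user_id INTEGER PRIMARY KEY);
    """
)
before = broken_members(conn, MARTS)

# Someone renames a table without updating the mart definition.
conn.execute("ALTER TABLE dim_users RENAME TO dim_customers")
after = broken_members(conn, MARTS)
print(before, after)
```

Run as a scheduled check, something like this would at least tell me when a rename has orphaned a mart description, even if it can’t tell me what the table was renamed to.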

Maybe I’ll build an extension for it.

The concept of a metric

Total cop out here at the bottom of my screen, but I kind of like the idea of defining a metric.

I know I said it was tedious and stupid earlier, but some metrics can’t be materialized as columns, and you still want to “keep them” somewhere… but where?

I don’t have a great answer. It might be another thing that I’d want to implement as a table. If you hadn’t picked up on it, dbt + Snowflake is my hammer, and everything is a nail.

But why not start with a table called metrics, and every row is a metric, and there’s a column that describes how to calculate it in plain English.
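A minimal sketch of that table, assuming a two-column layout and two invented example metrics. SQLite stands in for Snowflake here, and `metrics_as_context` is a hypothetical helper that flattens the rows into text you could drop into a prompt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE metrics (
        name TEXT PRIMARY KEY,
        how_to_calculate TEXT NOT NULL  -- plain English, not SQL
    );
    """
)

# Example rows are invented; both are ratios, so neither fits in a column.
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [
        (
            "average_order_value",
            "Sum of ORDERS.amount divided by the count of distinct order_id, "
            "over the requested date range.",
        ),
        (
            "repeat_purchase_rate",
            "Share of users with at least two orders among users with at "
            "least one order.",
        ),
    ],
)


def metrics_as_context(conn):
    """Render every metric as one 'name: description' line, sorted by name."""
    rows = conn.execute(
        "SELECT name, how_to_calculate FROM metrics ORDER BY name"
    )
    return "\n".join(f"{name}: {text}" for name, text in rows)


context = metrics_as_context(conn)
print(context)
```

The nice part is that the definitions live next to the data, get versioned with everything else, and stay readable by people and models alike.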

Done? Call it Metric Context Protocol and ship it.

Sung Won Chung

Sep 30

Severely underrated: being an organized person and doing the hard, tedious work of keeping tables organized.


Matt Arderne

Feb 11

We agree entirely on this. Per my latest post, I would like to keep pushing on this as I think there is more mess in the way of that Jane Street analyst than the data industry would like to admit…


Jay’s Substack
