dbt: How to Create Custom ID Fields When Your Data Doesn’t Come With One

We’ve all been there. You pull a dataset into dbt, excited to start transforming it, only to realize there’s no ID field — no primary key, no unique identifier, nothing to reliably join or test on.

Without IDs, it’s hard to join tables, track changes, or even run some data tests properly. Luckily, dbt gives us a simple and consistent way to create custom ID fields, even if your source data doesn’t come with one.

Let’s go through how to do it cleanly using the generate_surrogate_key macro.

Why You Need an ID Field

A good ID field acts as a unique fingerprint for each record. It ensures you can:

Join tables safely without duplicates or mismatched rows
Run data quality tests that rely on unique identifiers
Keep track of changes or updates when records refresh
Avoid depending on fields that might change over time (like names or descriptions)

If your source tables don’t provide one, you’ll want to create your own.

Common Ways People Try to Create IDs

Before dbt, most people would create IDs in SQL using functions like CONCAT() or hashing functions like MD5().

For example: md5(concat(first_name, last_name, date_of_birth)) as customer_id

This approach works fine, but it gets messy quickly if you’re dealing with nulls, data type mismatches, or multiple joins. That’s where dbt and its built-in ecosystem really shine.

The dbt Way: Using generate_surrogate_key

Instead of writing your own hash logic from scratch, you can use a pre-built dbt macro that handles all the edge cases for you.

The generate_surrograte_key macro creates a unique key from one or more fields and takes care of things like nulls, concatenation, and hashing behind the scenes.

To use it, you’ll first need to install the dbt_utils package: Create a packages.yml file in your project’s root folder (if you don’t already have one) and add the following (OR the most up to date on the dbt utils page):

Then install it by running dbt deps

Once it’s installed, you can call the macro inside your SQL models like this:

{{ dbt_utils.generate_surrogate_key(['field_a', 'field_b', 'field_c']) }} as record_id

This will create a single hashed ID value that uniquely represents each combination of those fields.

Example: Creating a Custom Athlete ID

Let’s say you have a dataset of athlete stats with quite a few field but no unique identifier. You can create one using the macro like this:

Now every record gets a unique athlete_stats_id that stays consistent as long as the combination of year and athlete_name stays the same.

Why Use generate_surrogate_key Instead of Plain SQL

You could technically build your own IDs with concat() and md5(), but there are some clear advantages to using the dbt macro:

It automatically handles nulls and data type inconsistencies
It uses a consistent hashing pattern across your entire project
It’s easier to read, maintain, and reuse across multiple models
It avoids subtle join issues that can happen when doing it manually

In other words, it standardizes something that most teams end up needing anyway.

Author:

Rosh Khan

View Profile