## Introduction
Ever wondered what a customer’s status was last month, only to find that your source database shows just the current state? Losing historical data is a common challenge in analytics. This article introduces dbt snapshots, a powerful feature for capturing changes over time and effortlessly building Slowly Changing Dimension (SCD) Type 2 tables. We’ll use a practical example to show you how snapshots work and how they unlock crucial historical insights.
## Example scenario: Tracking Customer Subscription Status Over Time
Imagine you’re working for a streaming service that manages customer subscriptions. The subscription status changes regularly between states such as `trial`, `active`, `paused`, or `cancelled`. Your source database table, `subscriptions`, only holds the current subscription state, so historical states get overwritten and lost.
Let’s walk through this practical scenario step by step to demonstrate how to use dbt snapshots.
## 1. The Problem: Lost Historical Subscription Data
Suppose your subscription table initially looks like this:
subscription_id | user_id | status | plan | updated_at |
---|---|---|---|---|
101 | 9001 | trial | basic | 2024-01-10 08:00:00 |
102 | 9002 | active | premium | 2024-01-10 09:15:00 |
Later, a subscriber’s status updates from `trial` to `active`, and then later to `paused`. Each update overwrites the previous state:
subscription_id | user_id | status | plan | updated_at |
---|---|---|---|---|
101 | 9001 | paused | basic | 2024-02-01 12:00:00 |
102 | 9002 | active | premium | 2024-01-10 09:15:00 |
Here, we’ve lost valuable historical context—like when exactly the subscription transitioned between states, preventing us from calculating metrics like “average trial duration” or “time spent active.”
## 2. Solution: Using dbt Snapshots
We’ll configure dbt snapshots to capture every status change and keep historical records intact.
### Step 1: Configure the Snapshot YAML
Create this snapshot definition in your project (make sure you’ve defined your sources first):
`snapshots/subscription_snapshot.yml`:

```yaml
snapshots:
  - name: subscription_status_snapshot
    description: "Tracks historical subscription status changes over time."
    relation: source('streaming_service', 'subscriptions')
    config:
      schema: snapshots
      unique_key: subscription_id
      strategy: timestamp
      updated_at: updated_at
```
**Explanation:**

- `schema`: stores snapshots in a separate schema (`snapshots`)
- `unique_key`: the unique subscription identifier (`subscription_id`)
- `strategy`: we use `timestamp` because `updated_at` reliably tracks updates
- `updated_at`: the name of the field that contains the update timestamp (here `updated_at` too)
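For reference, the `relation: source('streaming_service', 'subscriptions')` line assumes the source is declared somewhere in your project. A minimal sketch, where the file path and schema value are assumptions:

```yaml
# models/sources.yml (hypothetical path)
version: 2

sources:
  - name: streaming_service
    schema: raw_streaming  # assumption: wherever the raw table lives
    tables:
      - name: subscriptions
```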
### Step 2: Run your Snapshot
Execute the snapshot command in your terminal:
```bash
dbt snapshot
```
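If your project contains several snapshots, you can run just this one using dbt’s standard node selection syntax:

```bash
dbt snapshot --select subscription_status_snapshot
```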
After the initial run, your snapshot table will look like this:
subscription_id | user_id | status | plan | updated_at | dbt_valid_from | dbt_valid_to |
---|---|---|---|---|---|---|
101 | 9001 | trial | basic | 2024-01-10 08:00:00 | 2024-01-10 08:00:00 | null |
102 | 9002 | active | premium | 2024-01-10 09:15:00 | 2024-01-10 09:15:00 | null |
### Step 3: Capture Updates Automatically
When subscription `101` changes from `trial` → `active` (Jan 17) → `paused` (Feb 1), re-running `dbt snapshot` will capture the history.

After running `dbt snapshot` again on Jan 17:
subscription_id | user_id | status | plan | updated_at | dbt_valid_from | dbt_valid_to |
---|---|---|---|---|---|---|
101 | 9001 | trial | basic | 2024-01-10 08:00:00 | 2024-01-10 08:00:00 | 2024-01-17 14:30:00 |
101 | 9001 | active | basic | 2024-01-17 14:30:00 | 2024-01-17 14:30:00 | null |
102 | 9002 | active | premium | 2024-01-10 09:15:00 | 2024-01-10 09:15:00 | null |
After running again on Feb 1:
subscription_id | user_id | status | plan | updated_at | dbt_valid_from | dbt_valid_to |
---|---|---|---|---|---|---|
101 | 9001 | trial | basic | 2024-01-10 08:00:00 | 2024-01-10 08:00:00 | 2024-01-17 14:30:00 |
101 | 9001 | active | basic | 2024-01-17 14:30:00 | 2024-01-17 14:30:00 | 2024-02-01 12:00:00 |
101 | 9001 | paused | basic | 2024-02-01 12:00:00 | 2024-02-01 12:00:00 | null |
102 | 9002 | active | premium | 2024-01-10 09:15:00 | 2024-01-10 09:15:00 | null |
Now, your historical subscription changes are safely captured.
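Two query patterns come up constantly with this table: fetching the current state (open rows have `dbt_valid_to = null`) and fetching the state as of a given date. A minimal ad-hoc sketch run directly against the warehouse (the `snapshots` schema comes from the config above):

```sql
-- Current state of every subscription
select *
from snapshots.subscription_status_snapshot
where dbt_valid_to is null;

-- State as of 2024-01-20: the row whose validity window covers that date
select *
from snapshots.subscription_status_snapshot
where dbt_valid_from <= '2024-01-20'
  and (dbt_valid_to > '2024-01-20' or dbt_valid_to is null);
```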
## 3. Using Snapshot Data in Analytical Models
You can easily query your snapshot data to calculate meaningful metrics.
For instance, create a model to analyze subscription durations:
`models/subscription_durations.sql`:

```sql
select
    subscription_id,
    user_id,
    status,
    plan,
    dbt_valid_from,
    dbt_valid_to,
    extract(day from dbt_valid_to - dbt_valid_from) as duration_days
from {{ ref('subscription_status_snapshot') }}
order by subscription_id, dbt_valid_from
```
> **Warning**: the `extract(day from ...)` expression above is PostgreSQL syntax. Other warehouses provide similar functions, such as `datediff`.
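For instance, on a warehouse that supports `datediff` (Snowflake-style syntax shown; treat this as a sketch, since the exact signature varies by dialect), the duration could be computed as:

```sql
select
    subscription_id,
    status,
    datediff('day', dbt_valid_from, dbt_valid_to) as duration_days
from {{ ref('subscription_status_snapshot') }}
```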
This generates a table that lets you analyze how long each subscription stayed in a specific status:
subscription_id | user_id | status | plan | dbt_valid_from | dbt_valid_to | duration_days |
---|---|---|---|---|---|---|
101 | 9001 | trial | basic | 2024-01-10 | 2024-01-17 | 7 days |
101 | 9001 | active | basic | 2024-01-17 | 2024-02-01 | 15 days |
101 | 9001 | paused | basic | 2024-02-01 | null | current |
102 | 9002 | active | premium | 2024-01-10 | null | current |
To be precise, `duration_days` in the model above yields an integer (or `null` for a still-open record); the “days” suffix and the “current” label shown in the table would typically be added in a dashboarding layer.
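With durations in place, the “average trial duration” metric mentioned at the start becomes a simple aggregate. A hypothetical follow-up query, in the same PostgreSQL syntax as the model above:

```sql
-- Average trial duration in days, over closed trial periods only
select avg(extract(day from dbt_valid_to - dbt_valid_from)) as avg_trial_duration_days
from {{ ref('subscription_status_snapshot') }}
where status = 'trial'
  and dbt_valid_to is not null
```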
## 4. Alternative Snapshot Strategy (Check Columns)

Sometimes, your source data doesn’t reliably track timestamps or simply lacks an `updated_at` column entirely. In these scenarios, using the `check` strategy makes sense.
### Example scenario: Tracking Product Catalog Changes
Imagine you’re maintaining a product catalog for an online store. Your source table, `product_catalog`, gets updated frequently. Columns like `price`, `availability`, and `description` often change, but the source table doesn’t record when these updates happen. It only stores the latest state, without timestamps:
product_id | product_name | price | availability | description |
---|---|---|---|---|
201 | Running Shoes | 99.99 | in stock | Comfortable running shoes |
202 | Yoga Mat | 45.00 | in stock | Eco-friendly yoga mat |
Later, the price and availability of the Running Shoes change, but you have no timestamp of this update—only the current state is available:
product_id | product_name | price | availability | description |
---|---|---|---|---|
201 | Running Shoes | 89.99 | out of stock | Comfortable running shoes |
202 | Yoga Mat | 45.00 | in stock | Eco-friendly yoga mat |
Without snapshots, the historical price (`99.99`) and availability (`in stock`) are lost forever.

### Using the `check` strategy to solve this
To capture changes explicitly on specific columns without relying on timestamps, configure your snapshot as follows:
`snapshots/product_catalog_snapshot.yml`:

```yaml
snapshots:
  - name: product_catalog_snapshot
    description: "Tracks historical changes in product details."
    relation: source('online_store', 'product_catalog')
    config:
      schema: snapshots
      unique_key: product_id
      strategy: check
      check_cols:
        - price
        - availability
        - description
```
**Explanation:**

- `strategy: check` explicitly tracks changes by comparing the listed columns between snapshot runs.
- `check_cols` lists the columns to monitor. Any change in these columns triggers the snapshot to record a new historical row.

If you want to check all columns, you can use `check_cols: 'all'` instead.
### Snapshot Results after Running

Initially, running `dbt snapshot` creates the first snapshot records:
product_id | product_name | price | availability | description | dbt_valid_from | dbt_valid_to |
---|---|---|---|---|---|---|
201 | Running Shoes | 99.99 | in stock | Comfortable running shoes | 2024-01-10 10:00:00 | null |
202 | Yoga Mat | 45.00 | in stock | Eco-friendly yoga mat | 2024-01-10 10:00:00 | null |
After changes occur (price drops, availability changes) and you run snapshots again, dbt captures this update explicitly:
product_id | product_name | price | availability | description | dbt_valid_from | dbt_valid_to |
---|---|---|---|---|---|---|
201 | Running Shoes | 99.99 | in stock | Comfortable running shoes | 2024-01-10 10:00:00 | 2024-01-15 15:00:00 |
201 | Running Shoes | 89.99 | out of stock | Comfortable running shoes | 2024-01-15 15:00:00 | null |
202 | Yoga Mat | 45.00 | in stock | Eco-friendly yoga mat | 2024-01-10 10:00:00 | null |
Here’s what happened:

- dbt detected that product 201’s `price` and `availability` changed.
- It automatically closed the previous snapshot row (set its `dbt_valid_to`) and created a new row capturing the latest state.
You might be wondering: “Wait, how are `dbt_valid_from` and `dbt_valid_to` defined if there is no timestamp?”

Excellent question! When using the `check` strategy without an explicit `updated_at` timestamp, dbt uses the time of the snapshot run (the current timestamp) to fill `dbt_valid_from` and `dbt_valid_to`. If you do have an `updated_at` column, you can optionally provide it, and dbt will use that timestamp instead.
### Why this works

The `check` strategy doesn’t rely on timestamps in your data. Instead, it compares column values between snapshot executions. This ensures historical tracking even when your source data doesn’t contain update timestamps.

Keep in mind, though, that the `check` strategy can be considerably less efficient: every declared column is compared between the last snapshot version and the current source data.
## 5. Advanced: Handling Subscription Deletions

If you want to track deletions explicitly, use `hard_deletes: new_record`:
```yaml
snapshots:
  - name: subscription_snapshot_with_deletes
    description: "Tracks historical subscription status changes over time."
    relation: source('streaming_service', 'subscriptions')
    config:
      schema: snapshots
      unique_key: subscription_id
      strategy: timestamp
      updated_at: updated_at
      hard_deletes: new_record
```
This configuration ensures deleted subscriptions create new rows marked explicitly as deleted, preserving history.
The `hard_deletes` config accepts three values:

- `ignore` (default): if a record is deleted, nothing happens.
- `invalidate`: equivalent to the deprecated `invalidate_hard_deletes=true` parameter; when a record is deleted, its last `dbt_valid_to` value is changed from `null` to the timestamp of the last run.
- `new_record`: adds a `dbt_is_deleted` boolean column and inserts a new row marking the record as deleted.
### Practical Example with `new_record`

Imagine the initial subscription snapshot:
subscription_id | user_id | status | plan | updated_at | dbt_valid_from | dbt_valid_to | dbt_is_deleted |
---|---|---|---|---|---|---|---|
101 | 9001 | active | basic | 2024-01-10 08:00:00 | 2024-01-10 08:00:00 | null | false |
102 | 9002 | paused | premium | 2024-01-12 09:15:00 | 2024-01-12 09:15:00 | null | false |
Then subscription `101` gets explicitly deleted from your source database on `2024-02-01 12:00:00`. After running `dbt snapshot` again, you’ll have:
subscription_id | user_id | status | plan | updated_at | dbt_valid_from | dbt_valid_to | dbt_is_deleted |
---|---|---|---|---|---|---|---|
101 | 9001 | active | basic | 2024-01-10 08:00:00 | 2024-01-10 08:00:00 | 2024-02-01 12:00:00 | false |
101 | 9001 | active | basic | 2024-02-01 12:00:00 | 2024-02-01 12:00:00 | null | true |
102 | 9002 | paused | premium | 2024-01-12 09:15:00 | 2024-01-12 09:15:00 | null | false |
What happened?

- dbt detected that subscription `101` no longer exists in your source.
- It closed the previous valid record (set `dbt_valid_to` to the snapshot run timestamp).
- It added a new record explicitly marked as deleted (`dbt_is_deleted = true`).
This configuration gives you clear visibility and control over deleted records, preserving historical context and enabling deeper analysis (e.g., retention analysis, churn detection).
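As an example of that deeper analysis, a hypothetical churn query could count deleted subscriptions per month. Note that the storage type of `dbt_is_deleted` can vary by adapter (boolean on some, `'True'`/`'False'` text on others), so adjust the predicate to your warehouse:

```sql
select
    date_trunc('month', dbt_valid_from) as month,
    count(*) as churned_subscriptions
from {{ ref('subscription_snapshot_with_deletes') }}
where dbt_is_deleted = true  -- boolean shown here; use = 'True' if your adapter stores text
group by 1
order by 1
```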
## 6. How to Apply Transformations in Snapshots?
The best practice is to avoid transforming data in snapshots. If, however, you need to filter or deduplicate the source data (deduplication in particular matters for reliable change capture), define an ephemeral model and reference it in the snapshot configuration.
To apply transformations, create a model in the `models` folder (here named `ephemeral_products_snapshots.sql`, to match the `ref` used below) with the ephemeral materialization:

```sql
{{ config(materialized='ephemeral') }}

-- Deduplicate: keep one row per product_id
-- (distinct on is PostgreSQL-specific syntax)
select distinct on (product_id) *
from {{ source('catalog', 'raw_products') }}
order by product_id
```
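Since `distinct on` only works on PostgreSQL, here is a portable sketch of the same deduplication using `row_number()`. The ordering inside the window is arbitrary here because this source has no timestamp; in practice, order by whatever column marks the preferred row:

```sql
{{ config(materialized='ephemeral') }}

with ranked as (
    select
        product_id,
        product_name,
        price,
        availability,
        description,
        row_number() over (partition by product_id order by product_id) as row_num
    from {{ source('catalog', 'raw_products') }}
)

-- Keep exactly one row per product_id
select
    product_id,
    product_name,
    price,
    availability,
    description
from ranked
where row_num = 1
```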
Then you can use this configuration in `snapshots/products_snapshots.yml`:

```yaml
snapshots:
  - name: products_snapshot
    description: "Tracks historical product changes over time."
    relation: ref('ephemeral_products_snapshots')
    config:
      schema: snapshots
      unique_key: product_id
      strategy: check
      check_cols: 'all'
```
## 7. Ensure uniqueness of the `unique_key`

To avoid bad surprises, it’s highly recommended to add a `unique` test on the `unique_key` column of the source. If the source doesn’t have exactly one row per `unique_key`, `dbt snapshot` will behave strangely: it can compromise the history and make it hard to retrieve the real data.
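A minimal sketch of such a test, extending the hypothetical source definition from earlier:

```yaml
version: 2

sources:
  - name: streaming_service
    tables:
      - name: subscriptions
        columns:
          - name: subscription_id
            tests:
              - unique
```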
## 8. Bonus: Snapshot meta-fields

We previously talked about `dbt_valid_from`, `dbt_valid_to`, and `dbt_is_deleted`.
There are also two other fields:

- `dbt_scd_id`: a unique key generated for each snapshot row
- `dbt_updated_at`: equal to the `updated_at` timestamp of the source row
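You can inspect these fields directly; `dbt_scd_id` also makes a convenient surrogate key in downstream models. An illustrative query against the earlier subscription snapshot:

```sql
select
    subscription_id,
    dbt_scd_id,      -- unique per snapshot row
    dbt_updated_at,  -- mirrors the source's updated_at under the timestamp strategy
    dbt_valid_from,
    dbt_valid_to
from {{ ref('subscription_status_snapshot') }}
order by subscription_id, dbt_valid_from
```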
## Key Takeaways

- **Implement SCD Type 2 Effortlessly**: dbt snapshots are the go-to method for building Slowly Changing Dimension Type 2 tables, automatically tracking historical changes in your source data.
- **Choose the Right Strategy**: Use the `timestamp` strategy if your source has a reliable update timestamp column (`updated_at`). Opt for the `check` strategy (monitoring specific columns or `all`) when timestamps are unreliable or absent.
- **Unlock Historical Insights**: Snapshot data, with its `dbt_valid_from` and `dbt_valid_to` columns, enables powerful time-based analysis, such as customer lifecycle tracking or product price history.
- **Configuration Matters**: Pay close attention to the `unique_key` (ensure it’s truly unique in the source!) and consider options like `hard_deletes` to manage record deletions appropriately.
- **Keep Snapshots Clean**: Apply necessary transformations (like deduplication or filtering) before the snapshot using ephemeral models, rather than within the snapshot definition itself.
## Conclusion

As we’ve seen, dbt snapshots provide a robust and declarative way to capture data history, transforming the challenge of building Slowly Changing Dimensions from a complex coding task into a simple configuration. By leveraging either the `timestamp` or `check` strategy, you can reliably track changes in your sources and unlock deeper historical analysis. You’re now equipped to start preserving valuable historical context in your own dbt projects!
👉 The full source code for the project is available on GitHub:
📂 p-munhoz/dbt-snapshots