Introduction

Ever wondered what a customer’s status was last month, only to find that your source database shows just the current state? Losing historical data is a common challenge in analytics. This article introduces dbt snapshots, a powerful feature for capturing changes over time and effortlessly building Slowly Changing Dimension (SCD) Type 2 tables. We’ll use a practical example to show you how snapshots work and how they unlock crucial historical insights.

Example scenario: Tracking Customer Subscription Status Over Time

Imagine you’re working for a streaming service that manages customer subscriptions. The subscription status changes regularly between states such as trial, active, paused, or cancelled. Your source database table, subscriptions, only holds the current subscription state—so historical states get overwritten and lost.

Let’s illustrate this clearly, step-by-step, and use this practical scenario to demonstrate how to use dbt snapshots.

1. The Problem: Lost Historical Subscription Data

Suppose your subscription table initially looks like this:

| subscription_id | user_id | status | plan | updated_at |
| --- | --- | --- | --- | --- |
| 101 | 9001 | trial | basic | 2024-01-10 08:00:00 |
| 102 | 9002 | active | premium | 2024-01-10 09:15:00 |

Later, a subscriber’s status updates from trial to active, and then later to paused. Each update overwrites the previous state:

| subscription_id | user_id | status | plan | updated_at |
| --- | --- | --- | --- | --- |
| 101 | 9001 | paused | basic | 2024-02-01 12:00:00 |
| 102 | 9002 | active | premium | 2024-01-10 09:15:00 |

Here, we’ve lost valuable historical context—like when exactly the subscription transitioned between states, preventing us from calculating metrics like “average trial duration” or “time spent active.”

2. Solution: Using dbt Snapshots

We’ll configure dbt snapshots to capture every status change and keep historical records intact.

Step 1: Configure the Snapshot YAML

Create this snapshot definition in your project at snapshots/subscription_snapshot.yml (make sure you’ve defined your sources first; a minimal example follows the explanation below):

snapshots:
  - name: subscription_status_snapshot
    description: "Tracks historical subscription status changes over time."
    relation: source('streaming_service', 'subscriptions')
    config:
      schema: snapshots
      unique_key: subscription_id
      strategy: timestamp
      updated_at: updated_at

Explanation:

  • schema: Separate snapshot schema (snapshots)
  • unique_key: Unique subscription identifier (subscription_id)
  • strategy: We use timestamp because updated_at reliably tracks updates
  • updated_at: the name of the column that contains the update timestamp (here also named updated_at)
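
Since the snapshot references source('streaming_service', 'subscriptions'), your project also needs a matching source definition. Here is a minimal sketch, e.g. in models/sources.yml (the file path and the raw schema name are assumptions for illustration):

sources:
  - name: streaming_service
    schema: raw  # assumed name of the raw schema
    tables:
      - name: subscriptions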

Step 2: Run your Snapshot

Execute the snapshot command in your terminal:

dbt snapshot
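
By default this runs every snapshot in your project. You can also target a single snapshot with a selector:

dbt snapshot --select subscription_status_snapshot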

After the initial run, your snapshot table will look like this:

| subscription_id | user_id | status | plan | updated_at | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- | --- | --- | --- |
| 101 | 9001 | trial | basic | 2024-01-10 08:00:00 | 2024-01-10 08:00:00 | null |
| 102 | 9002 | active | premium | 2024-01-10 09:15:00 | 2024-01-10 09:15:00 | null |

Step 3: Capture Updates Automatically

When subscription 101 changes from trial → active (Jan 17) and then → paused (Feb 1), re-running dbt snapshot captures the history:

After running dbt snapshot again on Jan 17:

| subscription_id | user_id | status | plan | updated_at | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- | --- | --- | --- |
| 101 | 9001 | trial | basic | 2024-01-10 08:00:00 | 2024-01-10 08:00:00 | 2024-01-17 14:30:00 |
| 101 | 9001 | active | basic | 2024-01-17 14:30:00 | 2024-01-17 14:30:00 | null |
| 102 | 9002 | active | premium | 2024-01-10 09:15:00 | 2024-01-10 09:15:00 | null |

After running again on Feb 1:

| subscription_id | user_id | status | plan | updated_at | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- | --- | --- | --- |
| 101 | 9001 | trial | basic | 2024-01-10 08:00:00 | 2024-01-10 08:00:00 | 2024-01-17 14:30:00 |
| 101 | 9001 | active | basic | 2024-01-17 14:30:00 | 2024-01-17 14:30:00 | 2024-02-01 12:00:00 |
| 101 | 9001 | paused | basic | 2024-02-01 12:00:00 | 2024-02-01 12:00:00 | null |
| 102 | 9002 | active | premium | 2024-01-10 09:15:00 | 2024-01-10 09:15:00 | null |

Now, your historical subscription changes are safely captured.

3. Using Snapshot Data in Analytical Models

You can easily query your snapshot data to calculate meaningful metrics.

For instance, create a model to analyze subscription durations: models/subscription_durations.sql:

select
  subscription_id,
  user_id,
  status,
  plan,
  dbt_valid_from,
  dbt_valid_to,
  -- PostgreSQL interval subtraction; returns null while the row is still current
  extract(day from dbt_valid_to - dbt_valid_from) as duration_days
from {{ ref('subscription_status_snapshot') }}
order by subscription_id, dbt_valid_from

Warning

The extract syntax above is PostgreSQL-specific; most other warehouses offer an equivalent date-difference function such as datediff.
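
As an illustration, on a warehouse like Snowflake the same duration could be computed with datediff (a sketch under that assumption):

select
  subscription_id,
  status,
  -- returns null while the row is still current (dbt_valid_to is null)
  datediff('day', dbt_valid_from, dbt_valid_to) as duration_days
from {{ ref('subscription_status_snapshot') }}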

This generates a table that lets you analyze how long each subscription stayed in a specific status:

| subscription_id | user_id | status | plan | dbt_valid_from | dbt_valid_to | duration_days |
| --- | --- | --- | --- | --- | --- | --- |
| 101 | 9001 | trial | basic | 2024-01-10 | 2024-01-17 | 7 days |
| 101 | 9001 | active | basic | 2024-01-17 | 2024-02-01 | 15 days |
| 101 | 9001 | paused | basic | 2024-02-01 | null | current |
| 102 | 9002 | active | premium | 2024-01-10 | null | current |

To be precise, the model above returns duration_days as an integer (or null for the row that is still open); in a dashboarding tool you can append “days” and display the null as “current”, as shown in the table above.
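
Another common pattern is selecting only the current version of each record, which is simply the row whose dbt_valid_to is still null:

select *
from {{ ref('subscription_status_snapshot') }}
where dbt_valid_to is null  -- open rows hold the current state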

4. Alternative Snapshot Strategy (Check Columns)

Sometimes, your source data doesn’t reliably track timestamps or simply lacks an updated_at column entirely. In these scenarios, using the check strategy makes sense.

Example scenario: Tracking Product Catalog Changes

Imagine you’re maintaining a product catalog for an online store. Your source table, product_catalog, gets updated frequently. Columns like price, availability, and description often change—but the source table doesn’t record when these updates happen. It only stores the latest state, without timestamps:

| product_id | product_name | price | availability | description |
| --- | --- | --- | --- | --- |
| 201 | Running Shoes | 99.99 | in stock | Comfortable running shoes |
| 202 | Yoga Mat | 45.00 | in stock | Eco-friendly yoga mat |

Later, the price and availability of the Running Shoes change, but you have no timestamp of this update—only the current state is available:

| product_id | product_name | price | availability | description |
| --- | --- | --- | --- | --- |
| 201 | Running Shoes | 89.99 | out of stock | Comfortable running shoes |
| 202 | Yoga Mat | 45.00 | in stock | Eco-friendly yoga mat |

Without snapshots, the historical price (99.99) and availability (in stock) are lost forever.

Using the check strategy to solve this

To capture changes explicitly on specific columns without relying on timestamps, configure your snapshot as follows:

snapshots/product_catalog_snapshot.yml:

snapshots:
  - name: product_catalog_snapshot
    description: "Tracks historical changes in product details."
    relation: source('online_store', 'product_catalog')
    config:
      schema: snapshots
      unique_key: product_id
      strategy: check
      check_cols:
        - price
        - availability
        - description

Explanation:

  • strategy: check explicitly tracks changes by comparing listed columns between snapshot runs.
  • check_cols: lists the columns to monitor. Any change in these columns triggers the snapshot to record a new historical row.

If you want to monitor all columns, you can set check_cols: 'all' instead.
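
For example, the config block would then become:

    config:
      schema: snapshots
      unique_key: product_id
      strategy: check
      check_cols: 'all'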

Snapshot Results after Running

Initially, running dbt snapshot creates the first snapshot records:

| product_id | product_name | price | availability | description | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- | --- | --- | --- |
| 201 | Running Shoes | 99.99 | in stock | Comfortable running shoes | 2024-01-10 10:00:00 | null |
| 202 | Yoga Mat | 45.00 | in stock | Eco-friendly yoga mat | 2024-01-10 10:00:00 | null |

After changes occur (price drops, availability changes) and you run snapshots again, dbt captures this update explicitly:

| product_id | product_name | price | availability | description | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- | --- | --- | --- |
| 201 | Running Shoes | 99.99 | in stock | Comfortable running shoes | 2024-01-10 10:00:00 | 2024-01-15 15:00:00 |
| 201 | Running Shoes | 89.99 | out of stock | Comfortable running shoes | 2024-01-15 15:00:00 | null |
| 202 | Yoga Mat | 45.00 | in stock | Eco-friendly yoga mat | 2024-01-10 10:00:00 | null |

Here’s what happened:

  • dbt detected that product 201’s price and availability changed.
  • It automatically closed the previous snapshot (dbt_valid_to set) and created a new row capturing the latest state.

You might be wondering: “Wait, how are dbt_valid_from and dbt_valid_to defined if there’s no timestamp?”

Excellent question! When using the check strategy without an explicit updated_at timestamp, dbt automatically uses the time of the snapshot run (current timestamp) to fill dbt_valid_from and dbt_valid_to. If you do have an updated_at column, you can optionally provide it, allowing dbt to use that timestamp instead.
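
A minimal sketch of that optional combination, assuming the source did carry an updated_at column:

snapshots:
  - name: product_catalog_snapshot
    relation: source('online_store', 'product_catalog')
    config:
      unique_key: product_id
      strategy: check
      check_cols:
        - price
        - availability
      updated_at: updated_at  # optional: used instead of the snapshot run time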

Why this works

The check strategy doesn’t rely on timestamps from your data. Instead, it compares column values between snapshot executions. This ensures historical tracking, even when your source data doesn’t contain update timestamps.

Keep in mind that the check strategy can be considerably less efficient: on every run, dbt compares each declared column between the latest snapshot version and the current source data.

5. Advanced: Handling Subscription Deletions

If you want to track deletions explicitly, use hard_deletes: new_record:

snapshots:
  - name: subscription_snapshot_with_deletes
    description: "Tracks historical subscription status changes over time."
    relation: source('streaming_service', 'subscriptions')
    config:
      schema: snapshots
      unique_key: subscription_id
      strategy: timestamp
      updated_at: updated_at
      hard_deletes: new_record

This configuration ensures deleted subscriptions create new rows marked explicitly as deleted, preserving history.

The hard_deletes can have three values:

  • ignore: the default; if a record is deleted from the source, nothing happens
  • invalidate: same as the deprecated invalidate_hard_deletes=true parameter; when a record is deleted, the null dbt_valid_to of its latest version is replaced with the timestamp of the snapshot run
  • new_record: adds a dbt_is_deleted boolean column and inserts a new row that explicitly marks the record as deleted

Practical Example with new_record:

Imagine the initial subscription snapshot:

| subscription_id | user_id | status | plan | updated_at | dbt_valid_from | dbt_valid_to | dbt_is_deleted |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 101 | 9001 | active | basic | 2024-01-10 08:00:00 | 2024-01-10 08:00:00 | null | false |
| 102 | 9002 | paused | premium | 2024-01-12 09:15:00 | 2024-01-12 09:15:00 | null | false |

Then subscription 101 gets explicitly deleted from your source database on 2024-02-01 12:00:00. After running dbt snapshot again, you’ll have:

| subscription_id | user_id | status | plan | updated_at | dbt_valid_from | dbt_valid_to | dbt_is_deleted |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 101 | 9001 | active | basic | 2024-01-10 08:00:00 | 2024-01-10 08:00:00 | 2024-02-01 12:00:00 | false |
| 101 | 9001 | active | basic | 2024-02-01 12:00:00 | 2024-02-01 12:00:00 | null | true |
| 102 | 9002 | paused | premium | 2024-01-12 09:15:00 | 2024-01-12 09:15:00 | null | false |

What happened?

  • dbt detected subscription 101 no longer exists in your source.
  • It closed the previous valid record (set dbt_valid_to to the snapshot run timestamp).
  • It added a new record explicitly marking it as deleted (dbt_is_deleted=true).

This configuration gives you clear visibility and control over deleted records, preserving historical context and enabling deeper analysis (e.g., retention analysis, churn detection).
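
For example, a simple churn-style query could count deletions per month (a sketch; how the dbt_is_deleted flag is materialized can vary by adapter, e.g. a boolean vs. the strings 'True'/'False'):

select
  date_trunc('month', dbt_valid_from) as deleted_month,
  count(*) as deleted_subscriptions
from {{ ref('subscription_snapshot_with_deletes') }}
where dbt_is_deleted = true  -- the deletion-marker rows added by new_record
group by 1
order by 1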

6. How to Apply Transformations in Snapshots?

The best practice is to avoid transforming data inside snapshots. If, however, you need to filter or deduplicate the source data (which is important for reliable change capture), you should define an ephemeral model and reference it in the snapshot configuration.

To apply transformations, create a model (i.e. your model needs to live in the models folder, e.g. models/ephemeral_products_snapshots.sql to match the ref below) with the ephemeral materialization:

{{ config(materialized='ephemeral') }}

-- keep a single row per product_id (DISTINCT ON is PostgreSQL-specific)
select distinct on (product_id) *
from {{ source('catalog', 'raw_products') }}
order by product_id

Then you can reference it from the snapshot configuration in snapshots/products_snapshots.yml:

snapshots:
  - name: products_snapshot
    description: "Tracks historical products changes over time."
    relation: ref('ephemeral_products_snapshots')
    config:
      schema: snapshots
      unique_key: product_id
      strategy: check
      check_cols: 'all'

7. Ensure uniqueness of the unique_key

To avoid bad surprises, it’s highly recommended to add a unique test on the unique_key in the source. If the source doesn’t have exactly one row per unique_key, dbt snapshot will behave unpredictably: it can compromise the history and make it hard to retrieve the real data.
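
A minimal sketch of such a test on the subscriptions source (recent dbt versions use the data_tests: key; older ones use tests:):

sources:
  - name: streaming_service
    tables:
      - name: subscriptions
        columns:
          - name: subscription_id
            data_tests:
              - unique
              - not_null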

8. Bonus: Snapshots meta-fields

We previously talked about dbt_valid_from, dbt_valid_to and dbt_is_deleted.

There are also two other fields:

  • dbt_scd_id: a unique key generated for each snapshot row
  • dbt_updated_at: the updated_at timestamp of the source row (for the timestamp strategy), or the snapshot run time when no updated_at is available (check strategy)
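
Since dbt_scd_id uniquely identifies each snapshot row, it makes a convenient surrogate key for downstream dimension models, for instance:

select
  dbt_scd_id as subscription_version_key,  -- one key per historical version
  subscription_id,
  status,
  dbt_valid_from,
  dbt_valid_to
from {{ ref('subscription_status_snapshot') }}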

Key Takeaways:

  • Implement SCD Type 2 Effortlessly: dbt snapshots are the go-to method for building Slowly Changing Dimension Type 2 tables, automatically tracking historical changes in your source data.
  • Choose the Right Strategy: Use the timestamp strategy if your source has a reliable update timestamp column (updated_at). Opt for the check strategy (monitoring specific columns or all) when timestamps are unreliable or absent.
  • Unlock Historical Insights: Snapshot data, with its dbt_valid_from and dbt_valid_to columns, enables powerful time-based analysis, such as customer lifecycle tracking or product price history.
  • Configuration Matters: Pay close attention to the unique_key (ensure it’s truly unique in the source!) and consider options like hard_deletes to manage record deletions appropriately.
  • Keep Snapshots Clean: Apply necessary transformations (like deduplication or filtering) before the snapshot using ephemeral models, rather than within the snapshot definition itself.

Conclusion

As we’ve seen, dbt snapshots provide a robust and declarative way to capture data history, transforming the challenge of building Slowly Changing Dimensions from a complex coding task into a simple configuration. By leveraging either the timestamp or check strategy, you can reliably track changes in your sources and unlock deeper historical analysis. You’re now equipped to start preserving valuable historical context in your own dbt projects!

👉 The full source code for the project is available on GitHub:
📂 p-munhoz/dbt-snapshots