Context
When working with dbt-core (data build tool), understanding the fundamental difference between sources and models is crucial for building effective data transformations. While both are essential components of the dbt ecosystem, they serve distinct purposes and appear at different stages of the analytics engineering workflow. This article will explain their differences and how they work together to create a robust data transformation pipeline.
What are sources in dbt?
Sources in dbt represent the raw data that comes into your data warehouse from external systems. They are the starting point of your dbt pipeline, the tables and views that exist in your database before dbt performs any transformations.
Key characteristics of sources:
- External origin: Sources are tables that exist in your database but are loaded by processes outside of dbt (ETL/ELT tools, data integration platforms, etc.)
- Declaration in YAML: Sources are defined in YAML files within your
models
directory, commonly in a file namedsources.yml
or containingsource
keyword - Referencing mechanism: Sources provide a way to reference raw data tables using the
source()
function in your SQL models - Freshness monitoring: dbt allows you to define and test the freshness of source data
- Documentation: Sources can be documented and visualized in dbt’s documentation site
Here’s how a source might be defined in a sources.yml
file:
version: 2
sources:
- name: shopify
description: Raw data from our Shopify store
database: raw
schema: shopify
tables:
- name: orders
description: Raw orders data
columns:
- name: id
description: Primary key of the orders table
loaded_at_field: created_at
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
In order to reference the orders
table in the shopify
schema, we need to use {{ source('shopify', 'orders') }}
Jinja function. Please note that the shopify first argument in the source
function refers to the name: shopify and not the schema
name.
For example with the sources.yml
below, we would need to use {{ source('my_incredible_source', 'orders') }}
version: 2
sources:
- name: my_incredible_source
description: Raw data from our Shopify store
database: raw
schema: shopify
tables:
- name: orders
description: Raw orders data
columns:
- name: id
description: Primary key of the orders table
loaded_at_field: created_at
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
What are models in dbt?
Models in dbt are SQL files that transform your source data into more useful structures. They represent the transformations you want to apply to your data, typically organized in a way that makes sense for analytics and business intelligence.
Key characteristics of models:
- SQL-based transformations: Models are defined in SQL files with a
.sql
extension - Materialization options: Models can be materialized as tables, views, incremental tables, or ephemeral models
- Referencing capabilities: Models can reference sources, other models, seeds, or snapshots using the
ref()
function - Organizational structure: Models are typically organized in a hierarchical structure (staging, intermediate, marts)
- Version controlled: Models are stored as code in your version control system
Here’s an example of a simple dbt model:
-- models/staging/shopify/stg_shopify__orders.sql
{{
config(
materialized='view'
)
}}
SELECT
id as order_id,
user_id,
status,
created_at,
updated_at,
total_price
FROM {{ source('shopify', 'orders') }}
Key differences between sources and models
Aspect | Sources | Models |
---|---|---|
Origin | External systems, loaded outside of dbt | Created and managed by dbt |
Format | YAML configuration files | SQL files |
Location in pipeline | Beginning of the pipeline | Middle to end of the pipeline |
Referencing syntax | source('name', 'table') | ref('model_name') |
Materialization | Pre-existing tables/views (no control) | Configurable (tables, views, incremental, ephemeral) |
Modification | Cannot be modified by dbt | Created and modified by dbt |
Testing | Limited to data quality tests | Can include both data quality and business logic tests |
How sources and models work together
Sources and models create a complete data transformation pipeline in dbt:
- Data ingestion: External tools load data into your warehouse, creating source tables
- Source declaration: You define these external tables as sources in dbt
- Staging models: Create staging models that clean and standardize source data
- Intermediate models: Build intermediate models that join and aggregate staging models
- Marts models: Create final data marts for business domains and use cases
This structure creates a clean separation of concerns:
- Sources represent the interface between your external data and dbt
- Models represent the transformations applied to that data
Best practices for managing sources and models
Sources best practices:
- Define all external data as sources, even if you only use a subset
- Set up freshness checks to monitor data quality
- Document your sources thoroughly
- Use naming conventions that make the origin of the data clear
Models best practices:
- Create staging models that are 1:1 with source tables
- Use a consistent folder structure (staging, intermediate, marts)
- Apply appropriate materializations based on model complexity and usage
- Reference sources with
source()
and other models withref()
- Build testing into your models
Conclusion
Understanding the difference between sources and models in dbt-core is fundamental to building effective data transformation pipelines. Sources form the foundation of your dbt project, representing the raw data that enters your warehouse from external systems. Models, on the other hand, represent the transformations you apply to that data to make it more useful for analytics and business intelligence.
By properly distinguishing between these two concepts and following best practices for each, you can create a more maintainable, testable, and understandable data transformation workflow. This clear separation of concerns makes it easier to adapt to changes in your data sources and evolve your transformation logic over time.
Remember that sources and models are not competing concepts but complementary parts of a well-designed dbt project. Together, they create a powerful system for transforming raw data into valuable business insights.