CI/CD Pipeline Implementation with GitHub Actions for dbt-Core Transformations

Project Overview

I built a dbt project for my company to manage data transformations in BigQuery. The project needed a CI/CD pipeline that would streamline development and enforce quality control on every change. To achieve this, I integrated GitHub Actions to automate tests, build only the modified models, and maintain a strict separation between the development and production environments.

Environment Setup

In BigQuery, I set up two distinct environments, each mapped to its own dbt target (a profile sketch follows the list):

  • Development: Used for testing and running the latest changes.
  • Production: Runs the stable and verified dbt models every day or on release.
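
At the dbt level, the split is just two targets in profiles.yml pointing at different datasets. The following is a minimal sketch, assuming a service-account keyfile and placeholder profile, project, and production dataset names rather than the real ones:

    my_dbt_project:                    # hypothetical profile name
      target: dev
      outputs:
        dev:
          type: bigquery
          method: service-account
          project: my-gcp-project      # placeholder project ID
          dataset: dev_staging         # development dataset
          keyfile: /tmp/google/google-service-account.json
          threads: 4
        prod:
          type: bigquery
          method: service-account
          project: my-gcp-project      # placeholder project ID
          dataset: prod_analytics      # placeholder production dataset
          keyfile: /tmp/google/google-service-account.json
          threads: 4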

Maintaining separate development and production environments in a data warehouse is crucial for the stability, accuracy, and reliability of data workflows. Here’s why:

1. Prevent Disruptions in Production

The production environment contains live, critical data that business stakeholders rely on for decision-making. Any errors or changes introduced directly into production could disrupt operations, lead to inaccurate reporting, and damage trust in the data.

By separating the environments, you can test new changes, transformations, and queries in the development environment without risking disruptions in production.

2. Safe Testing and Development

In the development environment, data engineers, analytics engineers, and data analysts can experiment with new models, transformations, and optimizations. This allows for iterative testing, debugging, and validation of changes without the fear of breaking live data workflows.

Having this space to test ensures that any issues are resolved before code is promoted to production.

3. Data Quality Assurance

A key aspect of managing a data warehouse is ensuring the quality and accuracy of the data. By using a development environment to run tests, validate queries, and check for anomalies, you ensure that only validated, high-quality data models are deployed to production.

In dbt, for example, you can run tests like unique checks, not-null constraints, and referential integrity checks in development before moving to production.
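
For example, a schema.yml like the sketch below (the model and column names are made up for illustration) declares all three kinds of tests and can be run with dbt test against the development target:

    version: 2

    models:
      - name: orders                      # hypothetical model
        columns:
          - name: order_id
            tests:
              - unique                    # unique check
              - not_null                  # not-null constraint
          - name: customer_id
            tests:
              - relationships:            # referential integrity check
                  to: ref('customers')
                  field: customer_id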

4. Controlled Deployment of Changes

In a CI/CD workflow, having two environments allows for incremental deployment. Changes can be tested and validated in development, and only those that pass are merged into production. This reduces the likelihood of bugs and errors affecting the production environment.

With version control and environment separation, you can deploy changes more confidently, ensuring minimal impact on production.

5. Auditability and Rollback

The two-environment setup provides a clear audit trail of changes. If an issue arises after a release to production, it is easy to trace the modifications made in the development environment and roll back to a stable state if needed.

Having a safe place for experimentation allows teams to better manage rollbacks and troubleshoot potential issues.

6. Efficient Resource Management

Production workloads are usually heavier and optimized for performance. Running experimental queries or transformations directly in production could cause resource bottlenecks, slow down critical processes, or increase costs.

By isolating development processes, teams can run tests and transformations in a separate, controlled environment without affecting the performance or cost efficiency of the production system.

GitHub Actions Workflow

The CI/CD pipeline uses GitHub Actions to automate the following tasks:

1. Pull Request (PR) Validation and Testing

This workflow, named dev-test-pr, is designed to ensure that pull requests (PRs) meet certain criteria before they can be merged. It includes checks for rebase status, commit count, commit messages, specific code patterns, and SQL file linting using SQLFluff.

Workflow: dev-test-pr
Trigger

The workflow is triggered by pull request events such as:

  • Opening a PR
  • Synchronizing (updating) a PR
  • Reopening a PR
  • Marking a PR as ready for review

on:
  pull_request:
    types: [opened, synchronize, reopened, ready_for_review]

Environment Variables

Several environment variables are set up for use throughout the workflow:

env:
  DBT_PROFILES_DIR: ./
  DBT_GOOGLE_PROJECT: ${{ vars.DBT_GOOGLE_PROJECT }}
  DBT_GOOGLE_DATASET: dev_staging
  DBT_GOOGLE_DATASET_QA: dev_qa
  DBT_GOOGLE_KEYFILE: /tmp/google/google-service-account.json
  DBT_MANIFEST: ./manifest.json
  DBT_STATE: ./
  ENV: dev
  TARGET: dev
  KEYFILE_CONTENTS: ${{ secrets.DBT_SVC_DEV }} # Contents of the keyfile from GitHub Secrets

Jobs
1. Rebase Check

This job checks if the PR branch is rebased on the main branch.

 
jobs:
  rebase-check:
    if: ${{ ! github.event.pull_request.draft }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout PR branch
        uses: actions/checkout@v2
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          fetch-depth: 0

      - name: Fetch main branch
        run: git fetch origin main

      - name: Check if PR is rebased on main
        run: |
          # Fail with a clear annotation if origin/main is not an ancestor of the PR head
          if ! git merge-base --is-ancestor origin/main ${{ github.event.pull_request.head.sha }}; then
            echo "::error::PR is not rebased on main branch. Please rebase your branch."
            exit 1
          fi
 
2. Commit Count Check

This job ensures the PR has only one commit.

 
  
 
  nb-of-commits-check:
    if: ${{ ! github.event.pull_request.draft }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout PR branch
        uses: actions/checkout@v2
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          fetch-depth: 0

      - name: Count commits in PR
        id: commit-count
        run: |
          COMMIT_COUNT=$(git rev-list --count origin/main..HEAD)
          echo "commit_count=$COMMIT_COUNT" >> $GITHUB_ENV

      - name: Check commit count
        run: |
          if [ "${{ env.commit_count }}" -gt 1 ]; then
            echo "::error::PR has more than 1 commit. Please squash your commits."
            exit 1
          fi
 
3. Commit Message Check

This job validates that commit messages follow specified conventions.

 
  commit-message-check:
    if: ${{ ! github.event.pull_request.draft }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout PR branch
        uses: actions/checkout@v2
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          fetch-depth: 0

      - name: Check commit messages
        run: |
          # "|| true" keeps the step from failing when grep finds no invalid messages
          INVALID_COMMIT_MSG=$(git log --format=%B origin/main..HEAD | grep -vE '^(feat|ci|fix|docs|refactor|creation)' || true)
          if [ -n "$INVALID_COMMIT_MSG" ]; then
            echo "::error::Some commits do not start with a valid keyword (feat, ci, fix, docs, refactor, creation): $INVALID_COMMIT_MSG"
            exit 1
          fi
 
4. SQLFluff Lint

This job lints modified SQL files using SQLFluff to ensure code quality.

 
  sqlfluff-lint:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Fetch base branch
        run: git fetch origin ${{ github.event.pull_request.base.ref }}

      - name: Set up Python and Install SQLFluff
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: |
          python -m pip install --upgrade pip
          pip install sqlfluff

      - name: Get Modified SQL Files
        id: sql_files
        run: |
          CHANGED_FILES=$(git diff --name-only --diff-filter=ACM origin/${{ github.event.pull_request.base.ref }})
          SQL_FILES=$(echo "$CHANGED_FILES" | grep -E '\.sql$' || true)
          SQL_FILES=$(echo "$SQL_FILES" | tr '\n' ' ' | xargs)
          echo "SQL_FILES=$SQL_FILES" >> $GITHUB_ENV

      - name: Run SQLFluff Lint
        if: env.SQL_FILES && env.SQL_FILES != ''
        run: |
          if [ -z "$SQL_FILES" ]; then
            echo "No SQL files to lint."
            exit 0
          fi
          sqlfluff lint $SQL_FILES
 

Workflow Benefits

  • Automated Quality Checks: The workflow ensures that PRs meet predefined standards, reducing the chance of errors in production.

  • Rebase Enforcement: Prevents issues that can arise from outdated branches.

  • Commit Message Standards: Maintains consistency in commit messages, improving project documentation and history.

  • Code Quality Assurance: Linting SQL files promotes best practices in SQL code writing.

2. Execute dbt build in the Development Environment on PR Merge

This GitHub Actions workflow automates running dbt build in the development environment whenever a pull request (PR) is merged. It ensures the project is built and deployed correctly, while also handling authentication and logging.

Workflow Details

  • Name: run dev on PR merge

  • Trigger: The workflow runs when a pull request is closed.

Environment Variables

The workflow defines several environment variables to facilitate its operations (a sketch of the trigger and environment block follows the list):

  • DBT_PROFILES_DIR: Specifies the directory for dbt profiles.

  • DBT_GOOGLE_PROJECT: The Google project ID.

  • DBT_GOOGLE_DATASET: The dataset to use for development.

  • DBT_GOOGLE_KEYFILE: Path for the Google service account keyfile.

  • DBT_MANIFEST: Path to the dbt manifest file.

  • DBT_STATE: Directory for dbt state.

  • ENV: Environment setting (set to dev).

  • TARGET: Target environment (set to dev).

  • KEYFILE_CONTENTS: Contents of the Google service account keyfile pulled from GitHub Secrets.
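
Assembled, the top of this workflow looks roughly like the sketch below. The merged guard on the job is how I would keep closed-but-unmerged PRs from triggering a build; the job name and the secret name reused from the PR-validation workflow are illustrative:

    name: run dev on PR merge

    on:
      pull_request:
        types: [closed]

    env:
      DBT_PROFILES_DIR: ./
      DBT_GOOGLE_PROJECT: ${{ vars.DBT_GOOGLE_PROJECT }}
      DBT_GOOGLE_DATASET: dev_staging
      DBT_GOOGLE_KEYFILE: /tmp/google/google-service-account.json
      DBT_MANIFEST: ./manifest.json
      DBT_STATE: ./
      ENV: dev
      TARGET: dev
      KEYFILE_CONTENTS: ${{ secrets.DBT_SVC_DEV }}   # secret name as in the PR-validation workflow

    jobs:
      dbt-build-dev:                                 # hypothetical job name
        # Only run when the PR was actually merged, not just closed
        if: github.event.pull_request.merged == true
        runs-on: ubuntu-latest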

Job

This job runs on an ubuntu-latest runner and consists of several key steps:

1. Checkout Code

Uses actions/checkout@v4 to check out the code from the repository.

2. Prepare Google Keyfile

Creates the necessary directory and writes the keyfile contents.

3. Authenticate to Google Cloud

Uses the google-github-actions/auth@v2 action to authenticate using the service account.

4. Download Manifest File

Downloads the manifest.json file from Google Cloud Storage using gsutil.
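
Steps 1 through 4 can be sketched as follows; the artifact bucket path is a placeholder, not the project's real bucket:

      - name: Checkout Code
        uses: actions/checkout@v4

      - name: Prepare Google keyfile
        run: |
          mkdir -p /tmp/google
          echo "$KEYFILE_CONTENTS" > $DBT_GOOGLE_KEYFILE

      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.DBT_SVC_DEV }}   # secret name assumed from the PR-validation workflow

      - name: Download manifest file
        run: |
          # Placeholder bucket: the real bucket name is project-specific
          gsutil cp gs://my-dbt-artifacts/manifest.json $DBT_MANIFEST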

5. Run a Multi-line Script

Placeholder for additional build, test, and deployment commands.

6. Set Up Python

Uses actions/setup-python@v2 to set up Python 3.11.

7. Install Dependencies

Installs necessary Python packages, including dbt-core and dbt-bigquery.

8. Run dbt Build

Executes dbt build, using the downloaded manifest for state comparison so that only modified models and their downstream dependencies are built.
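
Steps 6 through 8 boil down to a few commands; a minimal sketch, reusing the env variables defined at the top of the workflow:

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install dbt-core dbt-bigquery

      - name: Run dbt build
        run: |
          # Build only models changed relative to the downloaded manifest,
          # plus their downstream dependencies
          dbt build --select state:modified+ --state $DBT_STATE --target $TARGET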

9. Upload Logs on Failure

If the build fails, it uploads logs to Google Cloud Storage for troubleshooting.

10. Deploy via GCS

If successful, it deploys the manifest.json file to Google Cloud Storage.
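
Steps 9 and 10 hinge on step-level status conditions; a sketch with placeholder bucket paths:

      - name: Upload logs on failure
        if: failure()
        run: |
          # Placeholder bucket path for troubleshooting logs
          gsutil -m cp -r logs/ gs://my-dbt-artifacts/logs/${{ github.run_id }}/

      - name: Deploy manifest via GCS
        if: success()
        run: |
          # Publish the new manifest so the next run can compare state against it
          gsutil cp target/manifest.json gs://my-dbt-artifacts/manifest.json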

11. Report Status on Slack

Uses a Slack notification action to report the status of the workflow, including details about the run and the user who initiated the PR.

Workflow Benefits

This GitHub Actions workflow effectively automates the development process for merged pull requests, ensuring that all necessary tasks are performed consistently and efficiently, while also keeping the team informed about the workflow’s status.

3. Production Release (prod) - Incremental Runs

In this GitHub Actions workflow, I automated the process of running dbt whenever a new release is published (only for modified models). Here’s a detailed breakdown of the steps involved and their purposes:

1. Trigger on Release Publication

The workflow is triggered by the release event, specifically when a new release is published.

This ensures that dbt tasks are executed automatically whenever a new version of your project is released.
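
In workflow terms, the trigger amounts to the following (the workflow name is illustrative):

    name: run prod on release       # hypothetical workflow name

    on:
      release:
        types: [published]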

2. Setting Up Environment Variables

The workflow sets several environment variables, including:

  • DBT_PROFILES_DIR: Specifies where the dbt profiles are located.

  • DBT_GOOGLE_PROJECT and PROJECT_ID: Point to the Google Cloud project that will be used for dbt’s BigQuery integration.

  • DBT_GOOGLE_KEYFILE: A temporary path where the service account credentials will be stored.

  • KEYFILE_CONTENTS: Fetches the contents of the service account key from the GitHub Secrets.

3. Job: run-dbt

Runs on ubuntu-latest: The job executes on the latest Ubuntu environment.

Steps within the run-dbt job:

3.1. Checkout Repository

Uses actions/checkout@v3 to pull the code from the repository. The fetch-depth: 0 option ensures all tags are retrieved, which is important for later comparisons between releases.

3.2. Setup for Google Cloud Authentication

  • Creates a directory for storing the Google Cloud service account credentials, then writes the KEYFILE_CONTENTS into the DBT_GOOGLE_KEYFILE.

  • The step google-github-actions/auth@v2 is used to authenticate the GitHub Actions runner with Google Cloud, using the provided credentials.

3.3. Google Cloud Storage Interaction

  • The gcloud auth activate-service-account step authenticates with Google Cloud.

  • A gsutil command downloads the manifest.json file from Google Cloud Storage. This file is used for state-based dbt builds to compare between the current and previous dbt runs.

3.4. Python Setup

  • actions/setup-python@v2 installs Python 3.11 for running dbt; the workflow repeats this setup step, which is redundant but harmless.

3.5. Install dbt and Dependencies

  • Installs dbt-core and dbt-bigquery to allow running dbt with BigQuery as the target.

  • dbt deps --target prod installs the dbt project's package dependencies using the prod target environment.

3.6. Run dbt Build

  • The dbt build command rebuilds the dbt models that have changed, based on the state comparison (--select state:modified+). Only the modified models and their downstream dependencies are rebuilt, which keeps runtime low (see the combined sketch after step 3.7).

3.7. Generate dbt Documentation

  • The dbt docs generate step builds documentation from the dbt project, which will be used to display metadata and lineage for the models in your dbt project.
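
A condensed sketch of steps 3.5 through 3.7, reusing the prod target mentioned above:

      - name: Install dbt and dependencies
        run: |
          python -m pip install --upgrade pip
          pip install dbt-core dbt-bigquery
          dbt deps --target prod

      - name: Run dbt build
        run: |
          # Rebuild only modified models and their downstream dependencies;
          # ./ is assumed to hold the downloaded manifest.json
          dbt build --select state:modified+ --state ./ --target prod

      - name: Generate dbt docs
        run: dbt docs generate --target prod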

3.8. Custom Python Script

  • A custom Python script replace_dbt_docs.py is executed. It modifies the generated dbt documentation before it’s uploaded to the cloud.

3.9. Upload dbt Docs to Google Cloud

  • Uses gsutil to upload the dbt documentation (index.html, catalog.json, manifest.json) to a specified Google Cloud Storage bucket (gs://lucca-dbt-docs/static/).

4. Capture and Categorize File Changes

This step is responsible for comparing the current release with the previous one, capturing which files have been modified, added, deleted, or renamed.

A git diff command is used to list the differences between the two release tags.

The changes are categorized and stored in separate text files (modified_files.txt, added_files.txt, deleted_files.txt, renamed_files.txt), then compiled into a summary file (file_changes.txt).
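
Under the hood this is a handful of git diff calls with different --diff-filter values; a simplified sketch, assuming the previous and current release tags have already been resolved into PREV_TAG and CURR_TAG in an earlier step:

      - name: Capture and categorize file changes
        run: |
          # PREV_TAG and CURR_TAG are assumed to be resolved beforehand
          git diff --name-only --diff-filter=M "$PREV_TAG" "$CURR_TAG" > modified_files.txt
          git diff --name-only --diff-filter=A "$PREV_TAG" "$CURR_TAG" > added_files.txt
          git diff --name-only --diff-filter=D "$PREV_TAG" "$CURR_TAG" > deleted_files.txt
          git diff --name-only --diff-filter=R "$PREV_TAG" "$CURR_TAG" > renamed_files.txt

          # Compile a single summary for the Slack report
          {
            echo "Modified:"; cat modified_files.txt
            echo "Added:";    cat added_files.txt
            echo "Deleted:";  cat deleted_files.txt
            echo "Renamed:";  cat renamed_files.txt
          } > file_changes.txt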

5. Slack Notification

After capturing the file changes, the workflow sends the categorized changes to a Slack channel.

Two Slack notifications are triggered:

  • Status Report: This sends the overall status of the workflow (e.g., success or failure) to a Slack channel.

  • File Changes Report: This sends the categorized file changes between the two releases (added, modified, deleted, renamed) to a designated Slack channel, keeping the team informed about what has changed in the latest release.

Key Benefits

  • Automated dbt Runs on Release: Ensures that the latest version of your data models is built and documented whenever a new release is published.

  • Selective Rebuilds: By using dbt build --select state:modified+, only the modified models and their downstream dependencies are rebuilt, reducing build times and ensuring efficient processing.

  • Continuous Deployment of Documentation: dbt documentation is generated and pushed to a cloud storage bucket, allowing the latest documentation to be accessible immediately after each release.

  • Automated Notifications: Slack notifications keep the team updated on the workflow status and changes, improving communication and transparency in the release process.

  • File Change Tracking: The workflow tracks and categorizes file changes between releases, which helps with auditing and understanding the scope of each release.

4. Daily Production Runs

  • Every day, the production environment runs a complete refresh of the dbt models to ensure data consistency and accuracy (the schedule trigger is sketched below).
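
In GitHub Actions this is just a schedule trigger; the cron expression below is illustrative:

    on:
      schedule:
        - cron: '0 5 * * *'   # illustrative: every day at 05:00 UTC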

Key Benefits

  • Faster Development Cycles: By building only the modified models, we significantly reduced the time to test and deploy changes.

  • Automated Quality Checks: PR validation with sqlfluff and dbt testing ensures that only high-quality code is merged.

  • Separation of Environments: The development and production environments in BigQuery are completely isolated, minimizing the risk of impacting production data during development.

  • Continuous Production Runs: The daily production runs ensure that data is always up to date and accurate, providing business stakeholders with reliable insights.

Conclusion

This CI/CD pipeline has streamlined the development process for dbt transformations in BigQuery, reduced time spent on deployments, and improved overall code quality. By leveraging GitHub Actions and dbt’s manifest, I was able to create a robust system that scales well and maintains data integrity across environments.

Feel free to reach out if you’d like a similar solution implemented for your data workflows!