Introduction
Ever wanted to experiment with a modern data stack but didn’t want to spend money on cloud resources? In this article, I’ll walk through my recent project: a local data platform built entirely with Terraform and Docker that replicates cloud architecture patterns for free.
Why I Built This
As a data engineer, I regularly work with cloud-based data platforms that can cost hundreds or thousands of dollars per month. These platforms are great for production, but for learning, experimentation, and even some development work, they’re overkill.
What if we could replicate the core architecture patterns locally?
That’s exactly what this project does. I wanted a playground to experiment with infrastructure as code (IaC) patterns for data platforms. The goal was to create something that:
- Costs nothing to run (leveraging local resources)
- Mirrors real-world architecture found in cloud environments
- Integrates popular open-source tools for data engineering
- Demonstrates IaC principles with Terraform
- Provides hands-on experience with event-driven architectures
The beauty of this approach is that it creates a realistic environment for developing data pipelines without the cloud costs. It’s perfect for:
- Learning data engineering concepts in a practical setting
- Prototyping pipelines before deploying to production
- Teaching others about modern data stack components
- Testing infrastructure changes safely
Let’s dive into how this works!
🔍 Key Technologies Explained
Here’s a quick rundown of the core tools used in this project:
- Docker: A containerization platform that packages code and dependencies together so applications run consistently in any environment.
- Terraform: An infrastructure-as-code (IaC) tool that lets you define and provision infrastructure (like networks, containers, and volumes) using declarative configuration files.
- Airflow: A workflow orchestration tool that lets you define, schedule, and monitor data pipelines using Python code (called DAGs).
- Minio: An open-source object storage server that’s API-compatible with Amazon S3. It’s great for storing files like raw and processed data locally.
- LocalStack: A fully functional local AWS cloud emulator. In this project, it simulates SQS (Simple Queue Service) to support event-driven workflows without needing an AWS account.
- DuckDB: A lightweight analytical SQL database optimized for OLAP workloads. Think of it as SQLite for analytics — perfect for local data warehousing and fast queries.
- Pandas: A popular Python library for data manipulation and analysis. Used here in the ETL step to transform CSV data.
- boto3: The official AWS SDK for Python. It interacts with Minio and LocalStack just like it would with AWS S3 or SQS, using the same API.
The Architecture: Deep Dive
Here’s what the architecture looks like:
User / DAG Trigger
|
v
Airflow DAG
|
|--> Upload CSV to Minio (S3 alternative)
|--> Load raw CSV to DuckDB
|--> [Triggered via SQS] Run Docker ETL
|--> Load transformed data to DuckDB
The beauty of this setup is that each component mimics a cloud service, creating a fully functional data pipeline:
What I Needed | Cloud Service | Local Alternative | Why This Works |
---|---|---|---|
Object Storage | AWS S3 | Minio | API-compatible S3 alternative |
Event-Driven Compute | AWS Lambda | Docker Containers | Containerized functions with similar lifecycle |
Workflow Orchestration | MWAA / Airflow | Dockerized Airflow | Same tool, just running locally |
Message Queue | SQS/SNS | LocalStack | Emulates AWS services API locally |
Analytical Database | Redshift / BigQuery | DuckDB | Columnar storage with SQL interface |
Let’s break down each component:
1. Minio: S3-Compatible Object Storage
Minio is an open-source object storage server that implements the Amazon S3 API. In my setup, Minio stores:
- Raw CSV files uploaded through the Airflow DAG
- Transformed data processed by our ETL container
The beauty of using Minio is that it works with the exact same boto3 S3 client code you’d use with real AWS S3. This means our code remains cloud-compatible with minimal changes.
You can read the Minio documentation here.
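To make that concrete, here is a minimal sketch (not the project's exact code) of pointing boto3 at Minio instead of AWS. The endpoint, credentials, and bucket name below are the local defaults used elsewhere in this article and are assumptions for illustration:

import boto3

# Point the standard S3 client at the local Minio server instead of AWS
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # Minio API port
    aws_access_key_id="admin",              # MINIO_ROOT_USER
    aws_secret_access_key="password",       # MINIO_ROOT_PASSWORD
    region_name="us-east-1",
)

# Exactly the same calls you would make against real S3
s3.create_bucket(Bucket="demo-bucket")                       # no-op if it already exists under your user
s3.upload_file("sample.csv", "demo-bucket", "sample.csv")    # assumes a local sample.csv
print([obj["Key"] for obj in s3.list_objects_v2(Bucket="demo-bucket").get("Contents", [])])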
2. Airflow: Workflow Orchestration
Airflow serves as the brain of our operation, coordinating when and how data flows through the system. I’ve configured it with two primary DAGs:
- A data upload pipeline that moves CSVs to Minio and loads them into DuckDB
- An event-listener that polls SQS for messages and triggers ETL jobs
The Docker-based setup includes volume mounts for both DAGs and data, allowing for easy code changes without rebuilding containers.
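As a rough illustration of what the "load into DuckDB" step boils down to (a sketch, not the repo's exact task; the file paths and table name here are assumptions), DuckDB can ingest a CSV directly with read_csv_auto:

import duckdb

# Paths and table name are illustrative; inside the Airflow container they
# would point at the mounted data volume.
con = duckdb.connect("data/warehouse.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE raw_sample AS
    SELECT * FROM read_csv_auto('data/sample.csv')
""")
print(con.execute("SELECT count(*) FROM raw_sample").fetchone())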
You can read the Airflow documentation here.
3. LocalStack: Emulating AWS Services
LocalStack provides a local emulation of AWS services. In this project, I’m using it to create a functional SQS queue that:
- Accepts messages that trigger ETL processes
- Maintains the queue state between runs
- Supports standard AWS CLI and SDK interactions
This component is crucial for demonstrating event-driven architectures without actual AWS costs.
You can read the LocalStack documentation here.
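For example, here is a minimal sketch (assuming LocalStack's default edge port 4566 and the queue name used later in this article) of talking to the emulated SQS with boto3, the same way you would with real AWS:

import boto3

# Same SDK, different endpoint: LocalStack accepts any dummy credentials
sqs = boto3.client(
    "sqs",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="test",
    aws_secret_access_key="test",
    region_name="us-east-1",
)

queue_url = sqs.create_queue(QueueName="my-etl-queue")["QueueUrl"]
sqs.send_message(QueueUrl=queue_url, MessageBody="trigger")
print(sqs.get_queue_attributes(QueueUrl=queue_url, AttributeNames=["ApproximateNumberOfMessages"]))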
4. Docker Containers: Pseudo-Lambda Functions
Instead of using AWS Lambda for compute, I’m using Docker containers that:
- Run a specific task (our Python ETL code)
- Execute quickly and terminate when complete
- Can be triggered by events (SQS messages)
- Accept environment variables for configuration
This approach mimics serverless functions while giving us complete control over the runtime environment.
You can read the Docker documentation here.
5. DuckDB: Analytical Database
For our data warehouse, I chose DuckDB because:
- It’s lightweight yet powerful for analytical queries
- Stores data in columnar format similar to cloud data warehouses
- Requires no server setup (perfect for local development)
- Has native Python integration
You can read the DuckDB documentation here.
Infrastructure as Code with Terraform: Explained
The entire platform is defined using Terraform, making it reproducible and version-controlled. Let’s dive deeper into how this works.
But why is Terraform such a big deal in data engineering and DevOps?
In the past, infrastructure was manually configured — spinning up EC2s, setting up S3 buckets, clicking around in cloud consoles. That works for a quick test… but it quickly becomes unmanageable, especially when teams grow or systems evolve.
With Terraform, your infrastructure is:
- Declarative: You describe what you want, and Terraform figures out how to make it real.
- Versioned: Infra changes live in git, just like code. You can review them, roll them back, and collaborate safely.
- Reproducible: A single terraform apply can spin up your entire stack from scratch — dev, staging, or demo environments.
- Cloud-agnostic: You can use the same tool to provision AWS, GCP, Azure… or in this case, Docker containers on your machine.
For data engineers, this matters a lot. Our workflows depend on storage, compute, and orchestration tools being correctly wired together. With Terraform, we avoid config drift, manual mistakes, and “it works on my machine” syndrome.
Modular Project Structure
I’ve organized the Terraform configuration into logical modules:
terraform/
├── main.tf # Main configuration that ties everything together
├── modules/
│ ├── airflow/ # Airflow orchestrator setup
│ │ ├── main.tf # Container, image, and volume configuration
│ │ ├── terraform.tf # Provider requirements
│ │ └── variables.tf # Configurable inputs
│ ├── localstack/ # LocalStack for SQS
│ │ ├── main.tf # Container configuration
│ │ ├── terraform.tf # Provider requirements
│ │ └── variables.tf # Configurable inputs
│ └── minio/ # Object storage
│ ├── main.tf # Container configuration
│ ├── terraform.tf # Provider requirements
│ └── variables.tf # Configurable inputs
This modularity offers several advantages:
- Separation of concerns: Each component is self-contained
- Reusability: Modules can be shared across projects
- Maintainability: Easier to understand and modify specific parts
- Isolation: Changes to one component don’t affect others
The Docker Network: Connecting Components
First, we need to create a dedicated Docker network that allows all our containers to communicate with each other. This is crucial because it enables containers to reach each other using predictable hostnames instead of dynamic IP addresses:
# Create a custom Docker network for inter-container communication
# This allows containers to communicate using service names (e.g., "minio", "airflow")
# instead of having to discover and use dynamic IP addresses
resource "docker_network" "data_platform" {
name = "data_platform_net"
}
This network enables our containers to communicate with each other using container names as hostnames (like http://minio:9000), which simplifies configuration. Without this network, each service would need to know the others’ IP addresses, which change each time containers restart.
Building Custom Images
For Airflow, I use Terraform to build a custom Docker image:
resource "docker_image" "airflow" {
name = var.image_name
build {
context = var.context
dockerfile = var.dockerfile
}
}
This allows me to include necessary Python dependencies like boto3 and duckdb directly in the image.
Container Configuration
Each container is configured with appropriate environment variables, ports, and volume mounts.
The Minio configuration demonstrates several key container orchestration patterns:
resource "docker_container" "minio" {
image = docker_image.minio.name
name = var.container_name
# Connect to our custom network and set hostname alias
# This allows other containers to reach Minio at "http://minio:9000"
networks_advanced {
name = "data_platform_net"
aliases = ["minio"]
}
# Port 9000: S3-compatible API (used by boto3 and applications)
ports {
internal = 9000
external = 9000
}
# Port 9001: Web console UI (accessible at http://localhost:9001)
ports {
internal = 9001
external = 9001
}
# Set admin credentials for accessing Minio
env = [
"MINIO_ROOT_USER=${var.minio_user}",
"MINIO_ROOT_PASSWORD=${var.minio_password}"
]
# Start Minio server with data directory and console on port 9001
command = ["server", "/data", "--console-address", ":9001"]
# Mount host directory for data persistence
# Data survives container restarts and can be inspected from host
volumes {
container_path = "/data"
host_path = var.host_data_path
}
}
This is particularly powerful because:
- It exposes the same ports you’d use in a cloud environment
- It mounts volumes from the host for persistence
- It connects to our shared network for inter-container communication
Event-Driven ETL Workflow
The real magic happens in the Airflow DAGs. I’ve built two primary workflows:
- Data Upload & Processing Pipeline (dags/upload_to_minio.py): Upload CSV → DuckDB → ETL → Load transformed data
- Event-Driven ETL (dags/trigger_etl_from_sqs.py): Listen to the SQS queue and trigger ETL jobs based on messages
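Before looking at the SQS-triggered DAG, here is a rough sketch of what the first pipeline could look like as Airflow code. This is illustrative rather than the repo's exact file: the task callables are stubbed, and only the DAG id matches the one triggered later from the UI.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def upload_csv_to_minio():
    ...  # boto3 upload to Minio, as in the ETL script shown later

def load_raw_into_duckdb():
    ...  # duckdb.connect(...).execute("CREATE TABLE ... AS SELECT * FROM read_csv_auto(...)")

with DAG(
    dag_id="upload_and_duckdb",
    start_date=datetime(2024, 1, 1),
    schedule=None,     # triggered manually from the UI (Airflow 2.4+; older versions use schedule_interval)
    catchup=False,
) as dag:
    upload = PythonOperator(task_id="upload_csv", python_callable=upload_csv_to_minio)
    load = PythonOperator(task_id="load_raw", python_callable=load_raw_into_duckdb)
    upload >> load     # run the upload first, then the DuckDB load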
Here’s a snippet from the SQS-triggered DAG:
import boto3
import subprocess

def check_and_trigger_etl():
# Create SQS client pointing to LocalStack instead of real AWS
# Using the same boto3 interface ensures cloud compatibility
sqs = boto3.client(
"sqs",
endpoint_url="http://localstack:4566", # LocalStack SQS endpoint
aws_access_key_id="test", # Dummy credentials for LocalStack
aws_secret_access_key="test",
region_name="us-east-1"
)
# Get the queue URL - required for all SQS operations
queue_url = sqs.get_queue_url(QueueName="my-etl-queue")["QueueUrl"]
# Poll for messages (non-blocking check for new work)
# MaxNumberOfMessages=1 ensures we process one job at a time
messages = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
# If we found a message, trigger the ETL process
if "Messages" in messages:
print("Message received! Triggering ETL.")
# Launch ETL container (simulates AWS Lambda function)
# Key elements:
# --rm: Delete container when done (like Lambda's ephemeral nature)
# -e: Pass environment variables for Minio connection
# --network: Connect to our data platform network
subprocess.run([
"docker", "run", "--rm",
"-e", "MINIO_ENDPOINT=http://minio:9000", # Internal network address
"-e", "AWS_ACCESS_KEY_ID=admin", # Minio credentials
"-e", "AWS_SECRET_ACCESS_KEY=password",
"--network", "data_platform_net", # Same network as other services
"etl-transform" # Our custom ETL image
])
# Clean up: delete processed messages from queue
# This prevents reprocessing and is standard SQS pattern
for msg in messages["Messages"]:
sqs.delete_message(
QueueUrl=queue_url,
ReceiptHandle=msg["ReceiptHandle"] # Unique handle for this message
)
This code polls an SQS queue and runs a Docker container when it receives a message - just like how AWS Lambda functions can be triggered by SQS events.
The ETL Process: Under the Hood
Let’s examine the actual transformation logic running in our Docker container. This is where the data processing happens:
import os
import boto3
import pandas as pd
s3 = boto3.client(
's3',
endpoint_url=os.environ.get("MINIO_ENDPOINT", "http://minio:9000"),
aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID", "admin"),
aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY", "password"),
region_name="us-east-1"
)
# Download file from Minio
s3.download_file("demo-bucket", "sample.csv", "downloaded.csv")
# Transform with pandas
df = pd.read_csv("downloaded.csv")
df["age_plus_10"] = df["age"] + 10
# Save transformed file
df.to_csv("transformed.csv", index=False)
# Upload back to Minio
s3.upload_file("transformed.csv", "demo-bucket", "transformed.csv")
While this transformation is simple (adding 10 to the age column), the pattern demonstrates several important concepts:
- Environment Configuration: The script reads connection details from environment variables, allowing the same code to work in different environments
- S3-Compatible Operations: We’re using boto3 just like we would with AWS, making this code cloud-ready
- ETL Pattern:
  - Extract: Downloading the CSV from object storage
  - Transform: Manipulating the data with pandas
  - Load: Saving the result back to object storage
- Containerization: By packaging this in a Docker container, we ensure consistent execution regardless of the environment
The Dockerfile for this transformation is equally important:
FROM python:3.10-slim
WORKDIR /app
COPY scripts/transform.py .
RUN pip install pandas boto3
ENTRYPOINT ["python", "transform.py"]
This creates a lightweight image that:
- Uses a minimal Python base
- Installs only the required dependencies
- Sets the transform script as the entrypoint
The beauty of this approach is that our ETL code is:
- Portable: It can run anywhere Docker is available
- Consistent: The environment is identical every time
- Isolated: It doesn’t interfere with other processes
- Scalable: Multiple containers can run in parallel
- Cloud-Ready: The same container could be deployed to ECS, Kubernetes, or adapted for Lambda
Running the Platform: Detailed Steps
Let me walk through exactly how to get this platform up and running on your local machine. The beauty of infrastructure as code is that it should “just work” on any system with the prerequisites installed.
Prerequisites
- Docker: To run the containers
- Terraform: For deploying the infrastructure
- AWS CLI: For interacting with LocalStack
Step 1: Clone the Repository
git clone https://github.com/p-munhoz/iac-data-platform.git
cd iac-data-platform
Step 2: Apply Terraform Infrastructure
cd terraform
terraform init
terraform apply
During the terraform apply step, Terraform will:
- Create a Docker network
- Pull or build necessary Docker images
- Start containers for Minio, Airflow, and LocalStack
- Configure the containers with appropriate environment variables
- Mount volumes for data persistence
This is where the magic of IaC happens - with one command we’ve created our entire data platform!
Step 3: Access Airflow UI
Once the infrastructure is running, you can access the Airflow web interface:
- URL: http://localhost:8080
- Username: admin
- Password: admin
From here, you can view and trigger the DAGs we’ve created.
Step 4: Create the SQS Queue (One-Time Setup)
This command creates our message queue in LocalStack, demonstrating how we can use standard AWS CLI commands with local services:
# Set dummy AWS credentials (required by the AWS CLI but not validated by LocalStack)
# and point the CLI at the LocalStack endpoint instead of real AWS.
# The queue name matches what the Airflow DAG polls; --region is required by the CLI.
AWS_ACCESS_KEY_ID=test \
AWS_SECRET_ACCESS_KEY=test \
aws --endpoint-url=http://localhost:4566 \
  sqs create-queue \
  --queue-name my-etl-queue \
  --region us-east-1
Notice how we’re specifying the LocalStack endpoint while using standard AWS CLI commands - this is what makes LocalStack so powerful for development.
Step 5: Triggering Workflows
You have two options for triggering the data processing:
Option A: Direct DAG Execution
From the Airflow UI, simply click on the “upload_and_duckdb” DAG and trigger it manually. This will:
- Upload the sample CSV to Minio
- Load it into DuckDB
- Run the ETL transformation
- Load the transformed data into DuckDB
Option B: Send a Message to SQS
To demonstrate event-driven processing, send a message to the SQS queue:
AWS_ACCESS_KEY_ID=test \
AWS_SECRET_ACCESS_KEY=test \
aws --endpoint-url=http://localhost:4566 \
sqs send-message \
--queue-url http://localhost:4566/000000000000/my-etl-queue \
--message-body "trigger"
The Airflow DAG that polls SQS will detect this message and trigger the ETL process.
Step 6: Verify the Results
You can verify that everything worked by:
- Checking Minio: Navigate to http://localhost:9001 and login with admin/password
- Inspecting DuckDB: You can connect to the DuckDB file at data/warehouse.duckdb using the DuckDB CLI or a SQL client
- Viewing Airflow logs: Check the Airflow UI for execution logs
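For instance, one quick way to peek at the warehouse from the host is DuckDB's Python API (a sketch; the table names depend on what the DAG created, so we just list them here):

import duckdb

con = duckdb.connect("data/warehouse.duckdb", read_only=True)
print(con.execute("SHOW TABLES").fetchall())   # see what the DAG loaded
# Then query any table you find, e.g.:
# print(con.execute("SELECT * FROM <table_name> LIMIT 5").fetchdf())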
Step 7: Tear Down (When Finished)
When you’re done experimenting, you can destroy all resources:
cd terraform
terraform destroy
This will remove all containers and networks, though any data in the mounted volumes will remain on your local filesystem.
Key Takeaways
Building this project can teach you several valuable lessons about modern data architecture and IaC:
1. Infrastructure as Code is Transformative
With traditional infrastructure management, recreating a development environment across different machines is error-prone and time-consuming. With this Terraform setup:
- Reproducibility is guaranteed - the exact same environment every time
- Version control applies to infrastructure, not just application code
- Documentation is built into the code itself
- Collaboration becomes easier when infrastructure is defined as code
A module-based Terraform layout is particularly powerful, as it lets you encapsulate specific components and reuse them across different projects.
2. Container Networking is Crucial for Data Platforms
One of the subtle but important aspects of this project was configuring the Docker network properly:
resource "docker_network" "data_platform" {
name = "data_platform_net"
}
This dedicated network allows containers to communicate using predictable hostnames. Without this, each component would need to know the dynamic IP addresses of other services.
3. Event-Driven Architecture Requires Different Thinking
Working with the SQS-triggered workflow highlighted the differences between traditional batch processing and event-driven approaches:
- Loose coupling: Components communicate through events, not direct calls
- Scalability: Easy to add new consumers without modifying producers
- Resilience: Messages persist even if consumers are temporarily unavailable
- Complexity: Need to handle idempotency, ordering, and failure scenarios
This local setup provides a perfect sandbox for experimenting with these patterns before implementing them in production.
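As a small illustration of the "complexity" point above, here is one hedged way an idempotency guard could be added to the polling DAG (the project itself doesn't do this; the file path and scheme are assumptions): record each SQS MessageId after a successful run and skip redeliveries.

import json
from pathlib import Path

SEEN_FILE = Path("processed_messages.json")  # hypothetical local ledger of handled messages

def already_processed(message_id: str) -> bool:
    seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()
    return message_id in seen

def mark_processed(message_id: str) -> None:
    seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()
    seen.add(message_id)
    SEEN_FILE.write_text(json.dumps(sorted(seen)))

# In check_and_trigger_etl(), wrap the ETL launch:
#   if not already_processed(msg["MessageId"]):
#       run_etl_container()
#       mark_processed(msg["MessageId"])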
4. Emulating Cloud Services Locally Has Limitations
While the local stack works well for development, there are some limitations:
- Performance: Local resources are constrained compared to cloud services
- Features: Some advanced cloud features aren’t available in the local alternatives
- Scale: Can’t truly test large-scale data scenarios locally
However, these limitations are acceptable for most development and learning scenarios.
5. Docker Makes ETL Consistent
Using Docker for the ETL process ensures consistent execution environments, which is critical for reproducible data transformations. The container approach offers:
- Dependency isolation: Each transformation has its own environment
- Portability: Works the same on any machine with Docker
- Version control: Container images can be tagged and versioned
- Resource control: Can limit CPU/memory allocation
This pattern is particularly useful when scaling to multiple ETL processes with different dependencies.
Ideas for Extending the Project
There’s plenty of room to grow this platform:
- Add Prometheus + Grafana for monitoring
- Implement streaming with Kafka (via another container)
- Add proper CI/CD pipelines with GitHub Actions
- Build a simple UI to visualize the data flow
- Add dbt for transformation instead of raw Python
Conclusion
This project demonstrates how to build a complete data platform using infrastructure as code principles - all running locally and for free. It’s a great way to learn about modern data engineering patterns without worrying about cloud costs.
The complete code is available on my GitHub repository. Feel free to clone it, extend it, and use it as a starting point for your own data engineering experiments!