Modernizing Data Lakes and Data Warehouses with Google Cloud

Here is a summary of the following course: https://www.cloudskillsboost.google/course_templates/54

It’s a free and useful tutorial, so I still recommend following it and watching the videos!

Key Concepts

1. Role of a Data Engineer

  • Primary responsibility: Building data pipelines.
  • Purpose: Enable stakeholders to make faster, better decisions by leveraging data.
  • Cloud advantage: Separate compute from storage, no infrastructure/software management, more focus on insights from data.

2. Data Lakes vs. Data Warehouses

  • Data Lake: Stores raw, unprocessed data as it arrives from source systems.
  • Data Warehouse: Stores transformed data used for analytics, machine learning, and dashboards.
  • Key difference: Data lakes hold raw data, whether structured or unstructured; data warehouses hold processed, structured data ready for querying and analysis.

3. Cloud Storage for Data Lakes

  • Google Cloud Storage: Main solution for data lakes.
  • Other solutions: Google Cloud also offers storage options suited to low-latency access, transactional workloads, and structured data.

4. BigQuery for Data Warehouses

  • BigQuery: Google Cloud’s data warehouse solution.
  • Performance optimization: Use partitioning and clustering to improve query performance by reducing the amount of data each query scans.
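
To make the partitioning and clustering point more concrete, here is a minimal Python sketch that builds a BigQuery CREATE TABLE statement. The table and column names (`analytics.events`, `event_ts`, `customer_id`, `country`) are hypothetical and do not come from the course, and the statement is only printed, not sent to BigQuery.

```python
def build_partitioned_table_ddl(table, columns, partition_column, cluster_columns):
    """Return a BigQuery CREATE TABLE statement with partitioning and clustering.

    All names passed in are illustrative assumptions; nothing is sent to BigQuery.
    """
    if not 1 <= len(cluster_columns) <= 4:
        # BigQuery allows at most four clustering columns per table.
        raise ValueError("expected between 1 and 4 clustering columns")
    column_defs = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE `{table}` (\n  {column_defs}\n)\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


# Hypothetical events table: one partition per day, clustered by customer and country.
ddl = build_partitioned_table_ddl(
    table="analytics.events",
    columns=[("event_ts", "TIMESTAMP"), ("customer_id", "STRING"), ("country", "STRING")],
    partition_column="event_ts",
    cluster_columns=["customer_id", "country"],
)
print(ddl)
```

With a table defined this way, a query that filters on `event_ts` can skip the partitions outside the requested dates, and a filter on `customer_id` can benefit from clustering.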

5. Data Pipelines: EL, ETL, ELT

  • EL (Extract-Load): Load raw data into a data lake for later processing.
  • ELT (Extract-Load-Transform): Load data into the data warehouse first, then transform it within the warehouse.
  • ETL (Extract-Transform-Load): Transform data before loading it into the data warehouse.
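
The three patterns differ mainly in where the transform step runs. Here is a small, self-contained Python sketch of that idea. Plain lists stand in for the data lake and the data warehouse, and the function names and sample records are hypothetical, so read it as an illustration of the ordering rather than a real pipeline.

```python
def extract():
    """Return raw source records (hard-coded here to stay self-contained)."""
    return [
        {"order_id": "1", "amount": "19.90"},
        {"order_id": "2", "amount": "5.00"},
    ]


def transform(rows):
    """Cast string fields to typed values ready for analysis."""
    return [
        {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
        for row in rows
    ]


def run_el(lake):
    """EL: copy raw records into the lake unchanged."""
    lake.extend(extract())


def run_etl(warehouse):
    """ETL: transform in the pipeline, then load the cleaned rows."""
    warehouse.extend(transform(extract()))


def run_elt(warehouse):
    """ELT: load raw rows first, then transform them inside the warehouse."""
    warehouse.extend(extract())          # load step: raw rows land first
    warehouse[:] = transform(warehouse)  # transform step: done in place afterwards


lake, etl_warehouse, elt_warehouse = [], [], []
run_el(lake)
run_etl(etl_warehouse)
run_elt(elt_warehouse)

print(lake[0])           # still raw strings
print(etl_warehouse[0])  # typed values
print(etl_warehouse == elt_warehouse)
```

ETL and ELT end up with the same rows here. In practice, the choice depends on whether the transformation is easier to run in the pipeline or inside the warehouse itself.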

6. Reference Architectures for Data Pipelines

  • Google Cloud offers reference architectures for batch and streaming data pipelines.
  • These can be used as a starting point for building robust pipelines on Google Cloud.

7. Cloud-Based Data Engineering

  • Performing data engineering entirely in the cloud allows for flexibility, scalability, and reduced overhead.
  • You focus on data and analytics while Google Cloud handles infrastructure.

Next Steps

  • The next course in the series is “Building Batch Data Pipelines on Google Cloud.”