Modernizing Data Lakes and Data Warehouses with Google Cloud
Here is a summary of the following course: https://www.cloudskillsboost.google/course_templates/54
It's a free and useful tutorial, so I still recommend following it and watching the videos!
Key Concepts
1. Role of a Data Engineer
- Primary responsibility: Building data pipelines.
- Purpose: Enable stakeholders to make faster, better decisions by leveraging data.
- Cloud advantage: Separate compute from storage, no infrastructure/software management, more focus on insights from data.
2. Data Lakes vs. Data Warehouses
- Data Lake: Stores unprocessed data. Typically used for raw data storage.
- Data Warehouse: Stores transformed data used for analytics, machine learning, and dashboards.
- Key difference: Data lakes hold raw data in its original format, whether structured or unstructured; data warehouses hold processed, structured data ready for querying and analysis.
3. Cloud Storage for Data Lakes
- Google Cloud Storage: Main solution for data lakes.
- Other solutions: Google Cloud also offers storage options for low-latency access, transactional workloads, and structured data.
4. BigQuery for Data Warehouses
- BigQuery: Google Cloud’s data warehouse solution.
- Performance optimization: Partition and cluster tables so that queries scan less data and run faster.
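To make partitioning and clustering more concrete, here is a minimal sketch that builds a BigQuery `CREATE TABLE` statement using the `PARTITION BY` and `CLUSTER BY` clauses. The helper function, the table name `my_project.my_dataset.events`, and the column names are all hypothetical and chosen only for illustration; the course may use different examples. The script only prints the SQL, it does not connect to BigQuery.

```python
def build_partitioned_table_ddl(table, columns, partition_expr, cluster_columns):
    """Return a BigQuery CREATE TABLE statement with partitioning and clustering.

    Hypothetical helper for illustration; it only formats a SQL string.
    """
    column_lines = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE `{table}` (\n{column_lines}\n)\n"
        f"PARTITION BY {partition_expr}\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


# Assumed example schema: one row per event, partitioned by day,
# with rows inside each partition clustered by customer.
ddl = build_partitioned_table_ddl(
    table="my_project.my_dataset.events",
    columns=[
        ("event_ts", "TIMESTAMP"),
        ("customer_id", "STRING"),
        ("amount", "NUMERIC"),
    ],
    partition_expr="DATE(event_ts)",
    cluster_columns=["customer_id"],
)

if __name__ == "__main__":
    print(ddl)
```

With a table defined this way, a query that filters on the date of `event_ts` only needs to read the matching partitions, and a filter on `customer_id` can skip blocks within them.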
5. Data Pipelines: EL, ETL, ELT
- EL (Extract-Load): Load raw data into a data lake for later processing.
- ELT (Extract-Load-Transform): Load data into the data warehouse first, then transform it within the warehouse.
- ETL (Extract-Transform-Load): Transform data before loading it into the data warehouse.
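The difference between the three patterns is mostly about where and when the transform step runs. The sketch below illustrates this with toy data. It is not Google Cloud code: a Python list stands in for the data lake, an in-memory SQLite database stands in for the data warehouse, and the records and function names are made up for the example.

```python
import json
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_RECORDS = [
    '{"order_id": 1, "amount": "19.90", "country": "fr"}',
    '{"order_id": 2, "amount": "5.00", "country": "DE"}',
]


def extract():
    """Extract: read records from the source unchanged."""
    return list(RAW_RECORDS)


def el(lake):
    """EL: land raw records in the lake as-is; processing happens later."""
    lake.extend(extract())


def etl(warehouse):
    """ETL: clean the data in the pipeline, then load the cleaned rows."""
    rows = []
    for line in extract():
        record = json.loads(line)
        rows.append(
            (record["order_id"], float(record["amount"]), record["country"].upper())
        )
    warehouse.execute(
        "CREATE TABLE orders_etl (order_id INTEGER, amount REAL, country TEXT)"
    )
    warehouse.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", rows)


def elt(warehouse):
    """ELT: load uncleaned values first, then transform with SQL in the warehouse."""
    warehouse.execute(
        "CREATE TABLE orders_raw (order_id INTEGER, amount TEXT, country TEXT)"
    )
    raw_rows = []
    for line in extract():
        record = json.loads(line)  # structural parsing only, no cleaning
        raw_rows.append((record["order_id"], record["amount"], record["country"]))
    warehouse.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", raw_rows)
    # The transform step runs inside the warehouse, after loading.
    warehouse.execute(
        "CREATE TABLE orders_elt AS "
        "SELECT order_id, CAST(amount AS REAL) AS amount, UPPER(country) AS country "
        "FROM orders_raw"
    )


def run_demo():
    """Run all three patterns and return what each one produced."""
    lake = []
    warehouse = sqlite3.connect(":memory:")
    el(lake)
    etl(warehouse)
    elt(warehouse)
    etl_rows = warehouse.execute(
        "SELECT * FROM orders_etl ORDER BY order_id"
    ).fetchall()
    elt_rows = warehouse.execute(
        "SELECT * FROM orders_elt ORDER BY order_id"
    ).fetchall()
    warehouse.close()
    return lake, etl_rows, elt_rows


if __name__ == "__main__":
    lake, etl_rows, elt_rows = run_demo()
    print("lake:", lake)
    print("ETL rows:", etl_rows)
    print("ELT rows:", elt_rows)
```

In this toy setup ETL and ELT end with the same cleaned rows; what changes is whether the cleaning happens in pipeline code before loading or in SQL after loading. EL leaves the records untouched in the lake.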
6. Reference Architectures for Data Pipelines
- Google Cloud offers reference architectures for batch and streaming data pipelines.
- These can be used as a starting point for building robust pipelines on Google Cloud.
7. Cloud-Based Data Engineering
- Performing data engineering entirely in the cloud allows for flexibility, scalability, and reduced overhead.
- You focus on data and analytics while Google Cloud handles infrastructure.
Next Steps
- The next course in the series is “Building Batch Data Pipelines on Google Cloud.”