graph TD
    subgraph foundation [Foundational Knowledge]
        Py[Python]
        SQL
    end
    subgraph core [Core Data Engineering Concepts]
        ETL
        DWH
        Scheduler
    end
    BP[Batch Processing with Spark]
    SP[Stream Processing with Flink]
    foundation --> core
    core --> BP
    core --> SP
Foundational Knowledge
Computer Science Fundamentals
Data Structures (Arrays, Linked Lists, HashMaps)
Algorithms (Sorting, Searching, Recursion)
Big O Notation
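A tiny Python illustration of why Big O matters in practice (the container sizes and timing counts are arbitrary): membership tests against a list scan linearly, while a dict (Python's built-in hash map) answers in roughly constant time.

import timeit

n = 100_000
as_list = list(range(n))
as_dict = dict.fromkeys(as_list, True)

target = n - 1  # worst case for the linear scan through the list

# O(n): Python walks the list until it finds the value.
list_time = timeit.timeit(lambda: target in as_list, number=200)
# O(1) average: the dict hashes the key and jumps straight to it.
dict_time = timeit.timeit(lambda: target in as_dict, number=200)

print(f"list membership (O(n)):  {list_time:.4f}s")
print(f"dict membership (O(1)):  {dict_time:.4f}s")
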
Programming Languages
Python (primary language for data engineering)
SQL (essential for querying databases)
Optional: Java/Scala for working with big data tools like Apache Spark
Core Data Engineering Concepts
Databases
Relational Databases (PostgreSQL, MySQL)
NoSQL Databases (MongoDB, Cassandra)
Data Modeling (ER Diagrams, Normalization)
Data Warehousing
Concepts: ETL/ELT, Star Schema, Snowflake Schema
Tools: Amazon Redshift, Google BigQuery, Snowflake
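A minimal star schema sketch: one fact table referencing two dimension tables. sqlite3 stands in here for a real warehouse engine, and the table and column names are illustrative, not part of the roadmap.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name         TEXT,
    country      TEXT
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT,
    month     INTEGER,
    year      INTEGER
);
-- The fact table holds the measures plus foreign keys to each dimension.
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,
    amount       REAL
);
""")
print("star schema created:", [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")])
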
Project
Design a database schema for a simple application (e.g., an e-commerce store).
Implement the schema in PostgreSQL or MySQL and perform CRUD operations.
Build an ETL pipeline that loads data from CSV files into the database.
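A minimal sketch of the ETL step of this project using pandas and SQLAlchemy; the connection URL, file name, table name, and column names are placeholders to adapt.

import pandas as pd
from sqlalchemy import create_engine

def run_etl(csv_path: str, db_url: str) -> None:
    # Extract: read the raw CSV export into a DataFrame.
    orders = pd.read_csv(csv_path)

    # Transform: deduplicate, parse dates, derive a total column
    # (column names are illustrative).
    orders = orders.drop_duplicates(subset=["order_id"])
    orders["order_date"] = pd.to_datetime(orders["order_date"])
    orders["total_price"] = orders["quantity"] * orders["unit_price"]

    # Load: append the cleaned rows into the target table.
    engine = create_engine(db_url)
    orders.to_sql("orders", engine, if_exists="append", index=False)

run_etl("orders.csv", "postgresql+psycopg2://user:password@localhost:5432/shop")
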
Batch Processing
Batch Processing Concepts
Definition and Use Cases
Difference between Batch and Stream Processing
Batch Processing Frameworks
Apache Hadoop (HDFS, MapReduce, YARN)
Apache Spark (Core Concepts, DataFrames, RDDs)
Building Batch Data Pipelines
Data Extraction (from databases, APIs)
Data Transformation (using Spark, SQL)
Data Loading (to data warehouses, data lakes)
Scheduling the ETL pipeline with Apache Airflow (workflow orchestration)
Exposing the results for ad-hoc querying through Trino (distributed SQL query engine)
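A condensed PySpark sketch of the extract/transform/load steps above: read raw logs, aggregate them, and write the result out. It assumes pyspark is installed; the S3 paths and column names are placeholders.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch_logs").getOrCreate()

# Extract: read raw access logs exported as CSV.
logs = spark.read.csv("s3a://my-bucket/raw/access_logs/*.csv",
                      header=True, inferSchema=True)

# Transform: keep server errors and aggregate per day and status code.
errors_per_day = (
    logs.filter(F.col("status") >= 500)
        .withColumn("day", F.to_date("timestamp"))
        .groupBy("day", "status")
        .count()
)

# Load: write the result to a curated zone as Parquet.
errors_per_day.write.mode("overwrite").parquet(
    "s3a://my-bucket/curated/errors_per_day/")

spark.stop()
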
Project
Create a batch processing pipeline using Apache Spark that processes and transforms a large dataset (e.g., log files).
Use Apache Airflow to schedule and orchestrate the batch processing pipeline.
Load the processed data into a data warehouse like Google BigQuery or Amazon Redshift.
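A minimal Airflow 2.x DAG sketch for the orchestration step of this project; the schedule, script paths, and two-task split are illustrative assumptions rather than a prescribed layout.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_log_batch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,
) as dag:
    # Step 1: run the Spark transformation job.
    process_logs = BashOperator(
        task_id="process_logs",
        bash_command="spark-submit /opt/jobs/process_logs.py",
    )
    # Step 2: load the curated output into the warehouse.
    load_to_warehouse = BashOperator(
        task_id="load_to_warehouse",
        bash_command="python /opt/jobs/load_to_warehouse.py",
    )
    process_logs >> load_to_warehouse
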
Data Lakes
Concepts
Differences between Data Lakes and Data Warehouses
Schema-on-read vs. Schema-on-write
Implementation
Delta Lake (transactional storage layer on top of data lakes)
Apache Hudi, Apache Iceberg (alternatives)
Building a Data Lake
Using Delta Lake on AWS S3 or Azure Data Lake Storage
Managing large datasets, handling schema evolution
Project
Set up a data lake using Delta Lake on AWS S3.
Ingest batch data into the data lake and manage schema evolution.
Query the data lake using Spark SQL or Trino.
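A hedged Delta Lake sketch matching this project: append a batch with schema evolution enabled, then query it back with Spark SQL. It assumes the delta-spark package is on the Spark classpath; the S3 paths and field names are placeholders.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta_lake_demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

lake_path = "s3a://my-bucket/lake/events"

# Ingest a new batch; mergeSchema lets newly added columns evolve the table schema.
batch = spark.read.json("s3a://my-bucket/raw/events/2024-01-01/")
(batch.write.format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .save(lake_path))

# Query the lake with Spark SQL.
spark.read.format("delta").load(lake_path).createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type").show()
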
Stream Processing
Concepts
Real-time Data Processing vs. Batch Processing
Event-Driven Architectures
Stream Processing Frameworks
Apache Kafka (core concepts, message brokering): strictly a distributed log and message broker rather than a stream processor, but it is the de facto ingestion layer for streaming pipelines.
Apache Flink (streaming analytics)
Windowing (tumbling, sliding, session windows)
Checkpointing (fault tolerance and state recovery)
Building Streaming Data Pipelines
Real-Time Data Ingestion (using Kafka)
Stream Processing with Flink (windowing, state management)
Real-Time Analytics and Monitoring (using Kafka Streams, Flink)
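A minimal ingestion sketch for the Kafka step above using the kafka-python client: produce JSON events into a topic and read them back. The broker address, topic name, and event fields are placeholders.

import json
import time
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Produce a handful of click events.
for i in range(10):
    producer.send("clicks", {"user_id": i % 3, "ts": time.time()})
producer.flush()

# Consume them back from the beginning of the topic.
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating once the topic goes idle
)
for message in consumer:
    print(message.value)
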
Tools and Technologies
Kafka Connect (data integration tool)
ksqlDB (streaming SQL engine for Kafka, formerly KSQL)
Project
Implement a real-time data ingestion pipeline using Apache Kafka.
Use Apache Flink to process the stream in real time, performing operations like filtering and aggregation.
Build a real-time dashboard to visualize the streaming data.
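A sketch of the Flink side of this project using the PyFlink Table API: a one-minute tumbling-window aggregation over the Kafka topic. It assumes apache-flink (PyFlink) is installed and the Flink Kafka SQL connector jar is available; the topic, fields, and consumer group are illustrative, and the print sink stands in for a dashboard-facing store.

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: the Kafka topic fed by the ingestion pipeline, with an event-time
# watermark so windows can close.
t_env.execute_sql("""
CREATE TABLE clicks (
    user_id INT,
    ts TIMESTAMP(3),
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'clicks',
    'properties.bootstrap.servers' = 'localhost:9092',
    'properties.group.id' = 'flink-demo',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json'
)
""")

# Sink: print to stdout for the sketch.
t_env.execute_sql("""
CREATE TABLE clicks_per_minute (
    user_id INT,
    window_start TIMESTAMP(3),
    clicks BIGINT
) WITH ('connector' = 'print')
""")

# Tumbling one-minute window aggregation per user.
t_env.execute_sql("""
INSERT INTO clicks_per_minute
SELECT user_id,
       TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
       COUNT(*) AS clicks
FROM clicks
GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)
""").wait()
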
Data Governance
Data Quality
Ensuring data accuracy, completeness, and consistency
Tools: Great Expectations, Deequ
Data Lineage
Tracking data flow across pipelines
Tools: OpenLineage, DataHub
Data Privacy
Compliance with GDPR, CCPA, and other regulations
Techniques for anonymization and encryption
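A small pseudonymization sketch in Python: direct identifiers are replaced with keyed hashes (HMAC) so records stay joinable without exposing raw values. The secret key and field names are placeholders; full encryption of stored data would use a dedicated library and key management.

import hmac
import hashlib

SECRET_KEY = b"rotate-me-and-store-in-a-secrets-manager"

def pseudonymize(value: str) -> str:
    # Keyed hash: stable for joins, but not reproducible without the secret.
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "jane@example.com", "country": "DE", "amount": 42.0}
safe_record = {**record, "email": pseudonymize(record["email"])}
print(safe_record)
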
Data Catalogs
Centralized metadata management
Tools: Apache Atlas, Amundsen
Project
Implement data quality checks using Great Expectations in a data pipeline.
Set up data lineage tracking with OpenLineage in your ETL pipelines.
Create a data catalog for your data assets using Amundsen.
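A data quality sketch for the first project step, written against the legacy pandas-dataset API of Great Expectations (0.x releases; newer versions expose a different fluent API). The DataFrame, columns, and expectations are illustrative.

import pandas as pd
import great_expectations as ge

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [10.5, 99.0, 7.25, 42.0],
})

dataset = ge.from_pandas(orders)
checks = [
    dataset.expect_column_values_to_not_be_null("order_id"),
    dataset.expect_column_values_to_be_unique("order_id"),
    dataset.expect_column_values_to_be_between("amount", min_value=0),
]

# Fail the pipeline run if any expectation is not met.
if not all(check.success for check in checks):
    raise ValueError("data quality checks failed")
print("all data quality checks passed")
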
Advanced Topics
Integration of Batch and Stream Processing
Lambda Architecture (a batch layer and a real-time speed layer combined behind a single serving layer)