Data Engineering Bootcamp - Series 1

Get Started Today and Build Your Career in Data Engineering!

Overview

  • Understand the Fundamentals of Modern Data Engineering

  • Build and Manage Scalable Data Lakes on AWS S3

  • Design Star Schema Data Models with Fact & Dimension Tables

  • Implement Slowly Changing Dimensions (SCD1 & SCD2)

  • Develop ETL Pipelines Using PySpark with Data Quality Checks

  • Query and Explore Data Lakes with AWS Athena and Glue Catalog

  • Automate Workflows and Pipelines Using Apache Airflow

  • Create Custom Airflow Plugins to Manage EMR Spark Jobs

  • Apply the WAP (Write-Audit-Publish) Pattern for Production Pipelines

  • Implement Data Quality Frameworks and Data Contracts

  • Deploy and Monitor Data Pipelines on AWS EMR

  • Optimize Data Workflows for Cost, Performance, and Reliability

  • Gain Hands-On Experience with Real-World Use Cases

  • Prepare for Data Engineering Interviews with Confidence

Who This Course Is For

  • Aspiring data engineers looking to break into the field

  • Software engineers or analysts transitioning into data roles

  • Professionals seeking real-world, hands-on data engineering experience

  • Anyone interested in mastering the modern data engineering stack

Prerequisites

  • Basic knowledge of SQL and Python

  • Familiarity with Docker and Bash scripting is helpful

Take your first step into the world of data engineering and future-proof your career with this hands-on, project-based bootcamp built on the modern data stack. Taught by a seasoned data architect with over 11 years of industry experience, this course blends theory with practice and is designed for aspiring data engineers, software engineers, analysts, and anyone eager to learn how to build real-world data pipelines.

You’ll learn to design scalable data lakes, build dimensional data models, implement data quality frameworks, and orchestrate pipelines with Apache Airflow, all grounded in a real-life ride-hailing application use case that simulates enterprise-scale systems.

What You’ll Learn

Section 1: Context Setup

Build your foundation with the Modern Data Stack, understand OLTP systems, and explore real-world data platform architectures.

  • Gain clarity on how data flows in data-driven companies

  • Learn using a ride-hailing app scenario

  • Get properly onboarded into the bootcamp journey

Section 2: Data Lake Essentials

Learn how to build and manage scalable data lakes on AWS S3.

  • S3 architecture, partitioning, layers, and schema evolution

  • IAM, encryption, storage classes, event notifications

  • Lifecycle management, backup & recovery

  • Hands-on with Boto3 S3 APIs
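As a taste of the hands-on Boto3 work, the sketch below builds Hive-style partition keys for a data lake table and uploads one object into its partition. The table name, bucket, and partition columns are illustrative, not the course's exact layout:

```python
from datetime import date

def build_partition_key(table: str, dt: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key,
    e.g. rides/year=2024/month=05/day=01/part-000.parquet."""
    return (
        f"{table}/year={dt.year}/month={dt.month:02d}/"
        f"day={dt.day:02d}/{filename}"
    )

def upload_partition(bucket: str, table: str, dt: date,
                     filename: str, body: bytes) -> str:
    """Upload one object into its partition.
    Requires AWS credentials; the bucket name is a placeholder."""
    import boto3  # imported here so the key-building helper stays dependency-free

    key = build_partition_key(table, dt, filename)
    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=key, Body=body)  # standard boto3 S3 call
    return key
```

Partitioning keys this way is what lets engines like Athena and Spark prune whole S3 prefixes at query time instead of scanning the full table.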

Section 3: Data Modeling

Master star schema design and implement SCD Type 1 and Type 2 dimensions.

  • Dimensional & fact modeling

  • ETL development for analytical reporting

  • Build end-to-end models and data marts with hands-on labs
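As a rough illustration of the SCD Type 2 logic covered here (column names like `start_date` and `is_current` are common conventions, not necessarily the course's exact schema), a changed dimension row is handled by expiring the current version and appending a new one:

```python
from datetime import date

def apply_scd2(dimension: list[dict], key: str,
               incoming: dict, load_date: date) -> list[dict]:
    """Expire the current row for the business key and append
    the new version, preserving full history (SCD Type 2)."""
    updated = []
    for row in dimension:
        if row[key] == incoming[key] and row["is_current"]:
            if {k: row[k] for k in incoming} == incoming:
                return dimension  # no attribute changed: keep history as-is
            # Close out the old version as of this load.
            row = {**row, "end_date": load_date, "is_current": False}
        updated.append(row)
    # Append the new current version with open-ended validity.
    updated.append({**incoming, "start_date": load_date,
                    "end_date": None, "is_current": True})
    return updated
```

With SCD Type 1 you would instead overwrite the attributes in place, trading history for simplicity.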

Section 4: Data Quality

Ensure trust and integrity in your data pipelines.

  • Understand accuracy, completeness, and consistency

  • Implement DQ checks using industry best practices

  • Use data contracts for accountability
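The kinds of checks discussed here can be sketched as small, reusable validators. The threshold and column names below are illustrative, and real frameworks add far more, but the core idea is simple:

```python
def completeness(rows: list[dict], column: str) -> float:
    """Fraction of rows where the column is present and non-null."""
    if not rows:
        return 0.0
    non_null = sum(1 for r in rows if r.get(column) is not None)
    return non_null / len(rows)

def check_completeness(rows: list[dict], column: str,
                       threshold: float = 0.99) -> float:
    """Fail loudly when completeness falls below the agreed threshold,
    in the spirit of a data contract between producer and consumer."""
    score = completeness(rows, column)
    if score < threshold:
        raise ValueError(
            f"{column} completeness {score:.2%} below threshold {threshold:.2%}")
    return score
```

Failing the pipeline at this point, rather than publishing suspect data, is exactly what the WAP pattern in Section 6 builds on.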

Section 5: AWS Athena

Query massive datasets with serverless power using AWS Athena.

  • Learn DDL, Glue Catalog, and workgroup management

  • Automate queries using Boto3 APIs

  • Compare Athena vs Presto vs Trino

  • Optimize queries with best practices
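Automating Athena with Boto3 generally follows a start/poll/fetch loop. This sketch takes the Athena client as a parameter so it can be exercised without an AWS account; the output location is a placeholder:

```python
import time

def run_athena_query(client, sql: str, output_location: str,
                     poll_seconds: float = 1.0) -> str:
    """Start a query with start_query_execution, then poll
    get_query_execution until Athena reports a terminal state."""
    qid = client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = client.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)  # Athena queries are asynchronous
```

In real use you would pass `boto3.client("athena")` and an `s3://.../results/` output location; injecting the client also makes the loop easy to unit-test with a stub.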

Section 6: Apache Spark

Build production-grade data pipelines with PySpark on AWS EMR.

  • Learn Spark architecture and PySpark APIs

  • Build data pipelines using the WAP (Write-Audit-Publish) pattern

  • Run scalable jobs on AWS EMR

  • Apply UDFs and data quality within transformation logic
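The Write-Audit-Publish pattern named above separates landing data in a staging location, auditing it, and only then promoting it. A framework-agnostic skeleton of the idea (the function names are illustrative, not the course's code) might look like:

```python
def write_audit_publish(write, audit, publish, rollback):
    """Run the WAP steps: write to staging, audit the staged
    data, and publish only if the audit passes."""
    staged = write()           # 1. Write: land data in staging, never the final table
    if not audit(staged):      # 2. Audit: run data-quality checks on staged data
        rollback(staged)       #    Audit failed: clean up; nothing reaches consumers
        raise RuntimeError("audit failed; staged data was not published")
    return publish(staged)     # 3. Publish: promote staged data to the final table
```

In a PySpark pipeline the three callables would wrap the staging write, the DQ checks, and the atomic swap or rename into the published path.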

Section 7: Apache Airflow

Orchestrate workflows using Airflow and build custom plugins.

  • Design DAGs, schedule pipelines, manage dependencies

  • Automate Spark jobs using custom AWS EMR plugin

  • Hands-on labs for ingestion and transformation DAGs

  • Build reliable, reusable orchestration solutions
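Under the hood, the dependency management a DAG scheduler performs amounts to running tasks in topological order: every task waits for all of its upstream tasks. A minimal sketch of that idea, with illustrative task names (this is the concept, not Airflow's actual scheduler code):

```python
def topo_order(deps: dict[str, set[str]]) -> list[str]:
    """Return tasks ordered so each task appears after
    all of its upstream dependencies (a topological sort)."""
    ordered, done = [], set()

    def visit(task):
        for upstream in sorted(deps.get(task, ())):
            if upstream not in done:
                visit(upstream)
        if task not in done:
            done.add(task)
            ordered.append(task)

    for task in sorted(deps):
        visit(task)
    return ordered
```

In Airflow you declare the same graph with the `>>` operator (e.g. `extract >> transform >> publish`) and the scheduler resolves the ordering for you.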

What You’ll Build

A production-style data platform for a ride-hailing company, including:

  • Data lake on AWS S3

  • Dimensional data model with SCD logic

  • Spark-based transformation pipelines

  • Automated orchestration with Airflow

  • Query layer with Athena

  • Built-in data quality validations

Andalib Ansari

Hi, I’m Andalib Ansari, a Data Architect with over 11 years of hands-on experience building data platforms and pipelines across industries like online gaming, ride-hailing, SaaS, and telecom.

Throughout my career, I’ve designed large-scale Data Warehouses and Data Lakes, developed high-performance data pipelines, and worked extensively with tools like Python, Spark, SQL, Hive, Scala, Redshift, EMR, Athena, and Airflow. I’ve also built analytics platforms, crafted data solutions using React, and delivered production-grade architectures in cloud-native environments (AWS & GCP).

Previously, I launched a Big Data course that has reached more than 28,000 students across 145+ countries, empowering learners to break into the world of data. Now, with the evolution of the modern data stack, I’m excited to bring you cutting-edge content designed to make you job-ready and confident in real-world data engineering roles.

If you're passionate about data, eager to solve meaningful problems, and ready to future-proof your career, you're in the right place. Let’s learn, build, and grow together.

Enroll for Free