Data Engineering on Google Cloud

Get hands-on experience with designing and building data processing systems on Google Cloud. This course uses lectures, demos, and hands-on labs to show you how to design data processing systems, build end-to-end data pipelines, analyze data, and implement machine learning. This course covers structured, unstructured, and streaming data.


Duration: 4 days

Level: Intermediate

Format: Instructor-led, on-demand

What you'll learn

Design and build data processing systems on Google Cloud.
Process batch and streaming data by implementing autoscaling data pipelines on Dataflow.
Derive business insights from extremely large datasets using BigQuery.
Leverage unstructured data using Spark and ML APIs on Dataproc.
Enable instant insights from streaming data.
Understand ML APIs and BigQuery ML, and learn to use AutoML to create powerful models without coding.

About this course

Overview

18 Modules · 143 Videos · 24 Labs · 21 Classroom activities

Who this course is for

This class is intended for developers who are responsible for:

Extracting, loading, transforming, cleaning, and validating data.
Designing pipelines and architectures for data processing.
Integrating analytics and machine learning capabilities into data pipelines.
Querying datasets, visualizing query results, and creating reports.

Prerequisite

To benefit from this course, participants should have completed “Google Cloud Big Data and Machine Learning Fundamentals” or have equivalent experience. Participants should also have:

Basic proficiency with a common query language such as SQL.
Experience with data modeling and ETL (extract, transform, load) activities.
Experience with developing applications using a common programming language such as Python.
Familiarity with machine learning and/or statistics.

Products

BigQuery
Cloud Bigtable
Cloud Storage
Cloud SQL
Cloud Spanner
Dataproc
Dataflow
Cloud Data Fusion
Cloud Composer
Pub/Sub
Vertex AI
Cloud ML APIs

Module 1

Introduction to Data Engineering

Topics

Explore the role of a data engineer
Analyze data engineering challenges
Introduction to BigQuery
Data lakes and data warehouses
Transactional databases versus data warehouses
Partner effectively with other data teams
Manage data access and governance
Build production-ready pipelines
Review Google Cloud customer case study

Objectives

Understand the role of a data engineer
Discuss benefits of doing data engineering in the cloud
Discuss the challenges of data engineering practice and how building data pipelines in the cloud helps address them
Review and understand the purpose of a data lake versus a data warehouse, and when to use each

Activities

Lab: Using BigQuery to do Analysis

Module 2

Building a Data Lake

Topics

Introduction to data lakes
Data storage and ETL options on Google Cloud
Building a data lake using Cloud Storage
Securing Cloud Storage
Storing all sorts of data types
Cloud SQL as a relational data lake

Objectives

Understand why Cloud Storage is a great option for building a data lake on Google Cloud
Learn how to use Cloud SQL for a relational data lake

Activities

Lab: Loading Taxi Data into Cloud SQL

Module 3

Building a Data Warehouse

Topics

The modern data warehouse
Introduction to BigQuery
Getting started with BigQuery
Loading data
Exploring schemas
Schema design
Nested and repeated fields
Optimizing with partitioning and clustering

Objectives

Discuss requirements of a modern warehouse
Understand why BigQuery is the scalable data warehousing solution on Google Cloud
Understand core concepts of BigQuery and review options of loading data into BigQuery

Activities

Lab: Loading Data into BigQuery
Lab: Working with JSON and Array Data in BigQuery

Module 4

Introduction to Building Batch Data Pipelines

Topics

EL, ELT, ETL
Quality considerations
How to carry out operations in BigQuery
Shortcomings
ETL to solve data quality issues

Objectives

Review different methods of loading data into your data lakes and warehouses: EL, ELT, and ETL
Discuss data quality considerations and when to use ETL instead of EL and ELT

Module 5

Executing Spark on Dataproc

Topics

The Hadoop ecosystem
Run Hadoop on Dataproc
Cloud Storage instead of HDFS
Optimize Dataproc

Objectives

Review the parts of the Hadoop ecosystem
Learn how to lift and shift your existing Hadoop workloads to the cloud using Dataproc
Understand considerations around using Cloud Storage instead of HDFS for storage
Learn how to optimize Dataproc jobs

Activities

Lab: Running Apache Spark jobs on Dataproc

Module 6

Serverless Data Processing with Dataflow

Topics

Introduction to Dataflow
Why customers value Dataflow
Dataflow pipelines
Aggregating with GroupByKey and Combine
Side inputs and windows
Dataflow templates
Dataflow SQL

Objectives

Understand how to decide between Dataflow and Dataproc for processing data pipelines
Understand the features that customers value in Dataflow
Discuss core concepts in Dataflow
Review the use of Dataflow templates and SQL

Activities

Lab: A Simple Dataflow Pipeline (Python/Java)
Lab: MapReduce in Dataflow (Python/Java)
Lab: Side inputs (Python/Java)

Module 7

Manage Data Pipelines with Cloud Data Fusion and Cloud Composer

Topics

Building batch data pipelines visually with Cloud Data Fusion
Components
UI overview
Building a pipeline
Exploring data using Wrangler
Orchestrating work between Google Cloud services with Cloud Composer
Apache Airflow environment
DAGs and operators
Workflow scheduling
Monitoring and logging

Objectives

Discuss how to manage your data pipelines with Data Fusion and Cloud Composer
Understand Data Fusion’s visual design capabilities
Learn how Cloud Composer can help to orchestrate the work across multiple Google Cloud services

Activities

Lab: Building and Executing a Pipeline Graph in Data Fusion
Optional Lab: An introduction to Cloud Composer

Module 8

Introduction to Processing Streaming Data

Topics

Process Streaming Data

Objectives

Explain streaming data processing
Describe the challenges with streaming data
Identify the Google Cloud products and tools that can help address streaming data challenges

Module 9

Serverless Messaging with Pub/Sub

Topics

Introduction to Pub/Sub
Pub/Sub push versus pull
Publishing with Pub/Sub code

Objectives

Describe the Pub/Sub service
Understand how Pub/Sub works
Gain hands-on Pub/Sub experience with a lab that simulates real-time streaming sensor data

Activities

Lab: Publish Streaming Data into Pub/Sub

Module 10

Dataflow Streaming Features

Topics

Streaming data challenges
Dataflow windowing

Objectives

Understand the Dataflow service
Build a stream processing pipeline for live traffic data
Demonstrate how to handle late data using watermarks, triggers, and accumulation

Activities

Lab: Streaming Data Pipelines

Module 11

High-Throughput BigQuery and Bigtable Streaming Features

Topics

Streaming into BigQuery and visualizing results
High-throughput streaming with Cloud Bigtable
Optimizing Cloud Bigtable performance

Objectives

Learn how to perform ad hoc analysis on streaming data using BigQuery and dashboards
Understand how Cloud Bigtable is a low-latency solution
Describe how to architect for Bigtable and how to ingest data into Bigtable
Highlight performance considerations for the relevant services

Activities

Lab: Streaming Analytics and Dashboards
Lab: Streaming Data Pipelines into Bigtable

Module 12

Advanced BigQuery Functionality and Performance

Topics

Analytic window functions
Use WITH clauses
GIS functions
Performance considerations

Objectives

Review some of BigQuery’s advanced analysis capabilities
Discuss ways to improve query performance

Activities

Lab: Optimizing your BigQuery Queries for Performance
Optional Lab: Partitioned Tables in BigQuery

Module 13

Introduction to Analytics and AI

Topics

What is AI?
From ad hoc data analysis to data-driven decisions
Options for ML models on Google Cloud

Objectives

Understand the proposition that ML adds value to your data
Understand the relationship between ML, AI, and deep learning
Identify ML options on Google Cloud

Module 14

Prebuilt ML Model APIs for Unstructured Data

Topics

Unstructured data is hard
ML APIs for enriching data

Objectives

Discuss challenges when working with unstructured data
Learn the applications of ready-to-use ML APIs on unstructured data

Activities

Lab: Using the Natural Language API to Classify Unstructured Text

Module 15

Big Data Analytics with Notebooks

Topics

What’s a notebook?
BigQuery magic and ties to Pandas

Objectives

Introduce Notebooks as a tool for prototyping ML solutions
Learn to execute BigQuery commands from Notebooks

Activities

Lab: BigQuery in Jupyter Labs on AI Platform

Module 16

Production ML Pipelines

Topics

Ways to do ML on Google Cloud
Vertex AI Pipelines
AI Hub

Objectives

Describe options available for building custom ML models
Understand the use of tools like Vertex AI Pipelines

Activities

Lab: Running Pipelines on Vertex AI

Module 17

Custom Model Building with SQL in BigQuery ML

Topics

BigQuery ML for quick model building
Supported models

Objectives

Learn how to create ML models by using SQL syntax in BigQuery
Demonstrate building different kinds of ML models using BigQuery ML

Activities

Lab option 1: Predict Bike Trip Duration with a Regression Model in BigQuery ML
Lab option 2: Movie Recommendations in BigQuery ML

Module 18

Custom Model Building with AutoML

Topics

Why AutoML?
AutoML Vision
AutoML NLP
AutoML Tables

Objectives

Explore various AutoML products used in machine learning
Learn to use AutoML to create powerful models without coding

About Appsbroker Academy

Appsbroker Academy is an Authorised Training Partner for Google Cloud. Drawing on our own highly skilled engineers’ unique experiences and expertise, we provide dedicated, industry-specific training using real-life examples to help your people to thrive.

