Study Notes — Certification Prep

Dataflow
Study Guide

A managed Apache Beam service for unified stream and batch data processing, with automated infrastructure provisioning, cluster management, and horizontal autoscaling.

Updated: April 2026
Version: 1.0
Category: Data Processing
Reading Time: ~25 min
Author: Michaël Bettan

Definition & Use Cases

What is Dataflow?

Dataflow is a managed service for running Apache Beam pipelines, supporting unified stream and batch data processing. It provides automated infrastructure provisioning and cluster management with horizontal autoscaling.

Availability: Regional
SLA: 99.9%

Use Cases

  • Streaming analytics for business insights.
  • Real-time AI for predictive analytics or fraud detection.
  • Sensor and log data processing for system health monitoring.
  • General-purpose ETL pipelines.

Note: If you have an existing Hadoop/Spark ecosystem, use Dataproc instead.

Billing

Billed in per-second increments, on a per-job basis:

  • Dataflow compute resources: worker vCPU and memory, shuffle data processed, and Streaming Engine Compute Units.
  • Dataflow Prime compute resources: Data Compute Units (DCUs).

Infrastructure & SDK Model

Infrastructure Model

Beam SDK Model


Windowing & Watermarks

Windowing

Windowing divides an unbounded data stream into finite, logical chunks so it can be processed and aggregated.
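The three window types (tumbling, sliding, session) can be sketched in plain Python. This illustrates only the assignment semantics, not the Beam `WindowInto` API; all function names are hypothetical, and times are in seconds:

```python
def fixed_window(ts, size):
    """Tumbling window: each timestamp falls in exactly one window."""
    start = ts - (ts % size)
    return [start]

def sliding_windows(ts, size, period):
    """Sliding window: overlapping; every window of length `size`
    starting on a `period` boundary that covers ts."""
    starts = []
    start = ts - (ts % period)
    while start > ts - size:
        starts.append(start)
        start -= period
    return sorted(starts)

def session_windows(timestamps, gap):
    """Session window: a new window opens after `gap` seconds of inactivity."""
    ordered = sorted(timestamps)
    sessions, current = [], [ordered[0]]
    for ts in ordered[1:]:
        if ts - current[-1] >= gap:
            sessions.append(current)
            current = [ts]
        else:
            current.append(ts)
    sessions.append(current)
    return sessions

fixed_window(65, size=60)                 # [60]: the window [60, 120)
sliding_windows(65, size=60, period=30)   # [30, 60]: two overlapping windows
session_windows([1, 2, 10, 11], gap=5)    # [[1, 2], [10, 11]]
```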

Watermarks

A watermark is an estimate of the oldest unprocessed data. It represents the system's understanding of how complete the data stream is, and it determines when a window can close and which elements count as late.
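The idea can be sketched in plain Python. This is a hypothetical estimator, not the Beam API: it trails the largest observed event time by an assumed skew, and flags an element as late when its event time falls behind the watermark:

```python
class WatermarkEstimator:
    """Toy watermark: trails the max observed event time by an
    assumed maximum out-of-orderness (skew), in seconds."""

    def __init__(self, estimated_skew):
        self.estimated_skew = estimated_skew
        self.max_event_time = 0

    def observe(self, event_time):
        # Newer event times push the watermark forward.
        self.max_event_time = max(self.max_event_time, event_time)

    def watermark(self):
        # Estimate: data older than this has probably all arrived.
        return self.max_event_time - self.estimated_skew

    def is_late(self, event_time):
        return event_time < self.watermark()

wm = WatermarkEstimator(estimated_skew=10)
for t in (100, 105, 112):
    wm.observe(t)
wm.watermark()   # 102
wm.is_late(95)   # True: behind the watermark, handled as late data
wm.is_late(104)  # False
```

Real runners compute watermarks per source and propagate them through the pipeline, but the intuition is the same: the watermark is a moving estimate of stream completeness, not a guarantee.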


Dataflow Prime & Optimization

Dataflow Prime

Self-Assessment Questions

Q1. When should you choose Dataflow over Dataproc?

Choose Dataflow for modern pipeline development (Apache Beam). Choose Dataproc for existing Hadoop/Spark ecosystems.

Q2. What is a Watermark in Dataflow?

An estimate of the oldest unprocessed data, used to track pipeline progress and to handle late data.

Q3. What are the three types of windowing supported by Dataflow?

Tumbling (fixed, non-overlapping), Sliding (overlapping), and Session (based on activity gaps).