Study Notes — Certification Prep

Dataflow
Study Guide

A managed Apache Beam service for unified stream and batch data processing, with automated infrastructure provisioning, cluster management, and horizontal autoscaling.

Updated: April 2026
Version: 1.0
Category: Data Processing
Reading Time: ~25 min
Author: Michaël Bettan

Definition & Use Cases

What is Dataflow?

Dataflow is a managed service for running Apache Beam pipelines, supporting unified stream and batch data processing. It provides automated infrastructure provisioning and cluster management with horizontal autoscaling.

Availability: Regional
SLA: 99.9%

Use Cases

  • Streaming analytics for business insights.
  • Real-time AI for predictive analytics or fraud detection.
  • Sensor and log data processing for system health monitoring.
  • General-purpose ETL pipelines.

Note: If you have an existing Hadoop/Spark ecosystem, use Dataproc instead.

Billing

Billed in per-second increments, on a per-job basis:

  • Dataflow compute resources: worker vCPU and memory, shuffle data processed, and Streaming Engine Compute Units.
  • Dataflow Prime compute resources: Data Compute Units (DCUs).

Infrastructure & SDK Model

Infrastructure Model

Beam SDK Model


Windowing & Watermarks

Windowing

Windowing divides an unbounded data stream into finite, logical chunks so it can be processed and aggregated.
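The three window types (tumbling, sliding, session) can be sketched in plain Python. This illustrates only the assignment semantics, not the Beam `WindowInto` API; all function names are hypothetical, and times are in seconds:

```python
def fixed_window(ts, size):
    """Tumbling window: each timestamp falls in exactly one window."""
    start = ts - (ts % size)
    return [start]

def sliding_windows(ts, size, period):
    """Sliding window: overlapping; every window of length `size`
    starting on a `period` boundary that covers ts."""
    starts = []
    start = ts - (ts % period)
    while start > ts - size:
        starts.append(start)
        start -= period
    return sorted(starts)

def session_windows(timestamps, gap):
    """Session window: a new window opens after `gap` seconds of inactivity."""
    ordered = sorted(timestamps)
    sessions, current = [], [ordered[0]]
    for ts in ordered[1:]:
        if ts - current[-1] >= gap:
            sessions.append(current)
            current = [ts]
        else:
            current.append(ts)
    sessions.append(current)
    return sessions

fixed_window(65, size=60)                 # [60]: the window [60, 120)
sliding_windows(65, size=60, period=30)   # [30, 60]: two overlapping windows
session_windows([1, 2, 10, 11], gap=5)    # [[1, 2], [10, 11]]
```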

Watermarks

A watermark is an estimate of the oldest unprocessed data. It represents the system's understanding of how complete the data stream is, and it determines when a window can close and which elements count as late.
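The idea can be sketched in plain Python. This is a hypothetical estimator, not the Beam API: it trails the largest observed event time by an assumed skew, and flags an element as late when its event time falls behind the watermark:

```python
class WatermarkEstimator:
    """Toy watermark: trails the max observed event time by an
    assumed maximum out-of-orderness (skew), in seconds."""

    def __init__(self, estimated_skew):
        self.estimated_skew = estimated_skew
        self.max_event_time = 0

    def observe(self, event_time):
        # Newer event times push the watermark forward.
        self.max_event_time = max(self.max_event_time, event_time)

    def watermark(self):
        # Estimate: data older than this has probably all arrived.
        return self.max_event_time - self.estimated_skew

    def is_late(self, event_time):
        return event_time < self.watermark()

wm = WatermarkEstimator(estimated_skew=10)
for t in (100, 105, 112):
    wm.observe(t)
wm.watermark()   # 102
wm.is_late(95)   # True: behind the watermark, handled as late data
wm.is_late(104)  # False
```

Real runners compute watermarks per source and propagate them through the pipeline, but the intuition is the same: the watermark is a moving estimate of stream completeness, not a guarantee.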


Dataflow Prime & Optimization

Dataflow Prime

Self-Assessment Questions

Q1. When should you choose Dataflow over Dataproc?

Choose Dataflow for modern pipeline development (Apache Beam). Choose Dataproc for existing Hadoop/Spark ecosystems.

Q2. What is a Watermark in Dataflow?

An estimate of the oldest unprocessed data, used to track pipeline progress and to handle late data.

Q3. What are the three types of windowing supported by Dataflow?

Tumbling (fixed, non-overlapping), Sliding (overlapping), and Session (based on activity gaps).