Google Cloud Service Notes

Google Cloud
Datastream

A serverless change data capture (CDC) and replication service that lets you synchronize data across heterogeneous databases, storage systems, and applications reliably and with minimal latency.

Author: Michaël Bettan
SLA: >=99.9%
Billing: Per-byte (GBs processed)
01

Core Overview & Use Cases

Billing Breakdown

Billing is based on gigabytes (GB) processed. Usage is billed in per-byte increments on a per-stream basis and is stated in GB (500 MB is 0.5 GB, for example). Bytes are counted from the raw (uncompressed) data for both CDC and backfill.

In addition to Datastream costs, you're billed for resources used to transfer, store, or process data: Cloud Storage, Dataflow, and Networking.
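
The per-byte, per-stream billing model above can be sketched as simple arithmetic. This is an illustrative helper, not an official pricing calculator, and the per-GB price is a placeholder rather than a published rate:

```python
# Illustrative sketch of Datastream's per-byte billing, stated in GB.
# 1 GB = 10**9 bytes (decimal), so 500 MB is billed as 0.5 GB.

def billed_gb(raw_bytes: int) -> float:
    """Convert raw (uncompressed) bytes processed into billed GB."""
    return raw_bytes / 10**9

def stream_cost(raw_bytes: int, price_per_gb: float) -> float:
    """Cost for a single stream; billing is computed per stream."""
    return billed_gb(raw_bytes) * price_per_gb

print(billed_gb(500 * 10**6))  # 0.5
```

Remember that this covers only the Datastream portion; Cloud Storage, Dataflow, and networking charges are billed separately by those services.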

Primary Use Cases

  • Analytics: Keep analytical systems up to date.
  • Database Replication: Synchronize databases (e.g., Cloud SQL → Cloud Spanner).
  • Event-driven architectures: Trigger downstream processes efficiently.

Data Integration

Data streams from databases and Software-as-a-Service (SaaS) cloud services can feed a near-real-time data integration pipeline by loading data into BigQuery via Dataflow or Cloud Data Fusion.

Streaming Analytics

Changes in databases are ingested into streaming pipelines such as Dataflow for fraud detection, security event processing, and anomaly detection.

Near-Real-Time Availability

Near-real-time availability of data changes powers artificial intelligence and machine learning applications, for example to prevent churn or increase engagement through marketing efforts, or by feeding predictions back into production systems.

02

Change Data Capture (CDC) Definition

Each change in a source system is captured and recorded in a data store. This is helpful in cases where it is important to know all changes over time and not just the state of the database at the time of data extraction.

ETL vs CDC Approach

  • ETL/ELT Process: Sufficient if you need to know only the final value.
  • CDC Approach: Appropriate if you need to know all the changes in value over time.
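
The distinction can be made concrete with a small simulation (illustrative Python, not Datastream APIs): the same three updates to a row, seen through a snapshot-style ETL load versus a CDC change log.

```python
# Three successive DML changes to the same row.
changes = [
    {"op": "INSERT", "id": 1, "balance": 100},
    {"op": "UPDATE", "id": 1, "balance": 250},
    {"op": "UPDATE", "id": 1, "balance": 40},
]

# ETL/ELT: only the final state survives the extract.
etl_snapshot = {}
for change in changes:
    etl_snapshot[change["id"]] = change["balance"]

# CDC: every intermediate value is captured as an ordered event stream.
cdc_log = [c["balance"] for c in changes]

print(etl_snapshot)  # {1: 40} -- final value only
print(cdc_log)       # [100, 250, 40] -- full history over time
```

If the question is "what is the balance now?", the snapshot suffices; if it is "how did the balance change over time?", only the CDC log can answer it.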

Alternative: Federated Queries

  • Query external data directly from BigQuery.
  • No data replication required.
  • Best for occasional access and joining data.

03

Key Features

Serverless
Configure a stream and the data starts moving. There are no installation, resource allocation, or maintenance overheads. Auto-scaling capabilities allocate resources to keep data moving in real time, automatically.
Unified Avro-based Type Schema
Enables simple, source-independent processing by converting all source-specific data types into a unified Datastream type schema, based on Avro types.
Stream Historical and CDC Data
Datastream streams both historical and CDC source data in real time, simultaneously.
Oracle CDC Without Additional Licenses
Provides LogMiner-based CDC streaming from any Oracle source, version 11.2 (11g Release 2) and above, without the need to pay for additional licenses or install extra software.
Cloud Storage Destination
CDC data is written to self-describing Avro or JSON files in Cloud Storage continually. This information is easily consumable for additional processing either directly in place or by loading downstream.
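
Because the Avro/JSON files are self-describing, a downstream consumer can process them without out-of-band schema knowledge. The sketch below parses a simplified CDC record in the spirit of Datastream's JSON output; the field names are illustrative, not an exact schema:

```python
import json

# A simplified, self-describing CDC record (illustrative field names).
record = json.loads("""
{
  "object": "inventory.products",
  "source_metadata": {"change_type": "UPDATE", "table": "products"},
  "payload": {"product_id": 42, "price": 19.99}
}
""")

# Route the event based on metadata carried inside the record itself.
if record["source_metadata"]["change_type"] in ("INSERT", "UPDATE"):
    row = record["payload"]
    print(f"upsert into {record['object']}: {row}")
```
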
04

Key Concepts

Datastream ingests data in real time from a variety of sources and makes it available for consumption at the destination.

Event
The unit of data stored by Datastream. Each event represents a single data manipulation language (DML) change to a given object.
Stream
Represents continuous ingestion of events from a source and writing them to a destination. Consists of a source and destination connection profile pair, along with stream-specific settings.
Connection Profiles
Represent connectivity information to a specific source or destination database.
Objects
Represent a sub-portion of a stream. For instance, a database stream has a data object for every table being streamed.
Private Connectivity Configurations
Enable Datastream to communicate with data sources over a secure, private network connection. This communication happens through Virtual Private Cloud (VPC) peering.
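
The relationships between these concepts can be sketched with plain Python dataclasses. These are illustrative types only, not the classes of the google-cloud-datastream client library:

```python
# Minimal sketch of the concepts above: a Stream pairs a source and a
# destination ConnectionProfile and carries one object per table.
from dataclasses import dataclass, field

@dataclass
class ConnectionProfile:
    """Connectivity information for one source or destination."""
    name: str
    kind: str  # e.g. "mysql", "oracle", "gcs" (illustrative values)

@dataclass
class Stream:
    """A source/destination profile pair plus stream-specific settings."""
    source: ConnectionProfile
    destination: ConnectionProfile
    objects: list = field(default_factory=list)  # one per streamed table

stream = Stream(
    source=ConnectionProfile("orders-db", "mysql"),
    destination=ConnectionProfile("lake-bucket", "gcs"),
    objects=["orders", "customers"],
)
print(len(stream.objects))  # 2
```
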
05

Sources, Destinations & Connectivity

Source & Destination

  • Sources: Oracle and MySQL
  • Destination: Cloud Storage (change streams)
  • Limitation: Natively limited to Cloud Storage destination.

Connectivity Options

  • Private connectivity configurations: Define a peering connection that establishes a dedicated connection on the Datastream network (shared with no other customers), communicating over internal IP addresses.
  • VPC Peering
  • IP allowlisting
  • Port-forwarding via SSH tunnel
  • Note: Private connectivity is optional; Datastream also supports connectivity over public networks.

Custom Destination Workarounds

Because Datastream natively writes only to Cloud Storage, replicating to other destinations requires a Dataflow template workaround. Supported replication destinations include: Cloud SQL, Spanner, BigQuery, Bigtable, MongoDB, and Databricks.

06

Review & Self-Test

Key Takeaways

Self-Assessment Questions

Q1. What is the native destination for Datastream output?

Cloud Storage (as self-describing Avro or JSON files).

Q2. Which Oracle CDC method does Datastream use that avoids additional licenses?

LogMiner-based CDC streaming.

Q3. If you need to replicate data from Datastream to BigQuery, what service is required as an intermediary?

Dataflow (using a Dataflow template).

Q4. What is the alternative to CDC if you only need occasional access and joining of external data?

Federated Queries (e.g., querying external data directly from BigQuery).