Managed enterprise data integration service, based on CDAP, that provides a unified user experience for data wrangling and pipeline design.
Availability: Zonal, Regional
SLA: N/A
Billing: 3 Editions
Author: Michaël Bettan
01
Overview & Use Cases
Cloud Data Fusion is a fully managed, cloud-native enterprise data integration service based on the open-source CDAP project. It helps users quickly build and manage data pipelines without writing code.
Primary Use Cases
Unified Analytics Data Marts: Build reliable and fast data foundations.
Agile Data Warehouses: Accelerate modernization of data warehousing.
Secured and Integrated Data Lakes: Centralize data with enterprise-grade protection.
Operational Reporting: Enable streamlined insight delivery for business operations.
Important Consideration
If data wrangling and out-of-the-box connectors are strict requirements, Data Fusion is the recommended solution for enterprise-scale customers.
02
Pricing Editions
Data Fusion offers three distinct billing editions tailored to different deployment sizes, workload criticality, and real-time integration needs; the sketch after the edition list shows how an edition is selected when an instance is created.
Developer
Full-featured edition for product exploration and development environments.
Zonal availability
Limited execution environment
Basic
Comprehensive data integration capabilities. Recommended for non-critical environments.
Build batch data pipelines
Connect to any data source
Perform code-free transformations
Limitation on simultaneous pipeline runs
Enterprise
Designed for critical environments requiring scale and reliability.
Includes support for real-time data pipelines
Interactive data lineage exploration
No limitations on simultaneous pipeline runs
High availability
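A minimal sketch of choosing an edition at instance-creation time through the Data Fusion REST API. The project, region, and instance name are hypothetical placeholders, and Application Default Credentials are assumed.

```python
# Minimal sketch: create an instance and pick its edition via the Data Fusion
# REST API. Project, region, and instance name are hypothetical placeholders.
import google.auth
from google.auth.transport.requests import AuthorizedSession

PROJECT = "my-project"       # hypothetical
LOCATION = "us-central1"     # Basic/Enterprise instances are regional
INSTANCE_ID = "my-instance"  # hypothetical

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

url = (f"https://datafusion.googleapis.com/v1/projects/{PROJECT}"
       f"/locations/{LOCATION}/instances?instanceId={INSTANCE_ID}")
# The edition is the instance "type": DEVELOPER, BASIC, or ENTERPRISE.
resp = session.post(url, json={"type": "ENTERPRISE"})
resp.raise_for_status()
print(resp.json()["name"])  # name of the long-running create operation
```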
03
Key Capabilities
Data Integration
Develop and execute robust ETL/ELT pipelines; a pipeline-start sketch follows this list.
Code-free visual transformations: Drag and drop interface for ease of use.
200+ data sources and formats: Connect to mainframes, databases, SaaS, and enterprise applications.
1000+ transforms: Extensive library including data quality tools.
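A minimal sketch of starting a deployed batch pipeline through the CDAP REST API that backs every Data Fusion instance. The endpoint URL and pipeline name are hypothetical; batch pipelines run as the DataPipelineWorkflow program.

```python
# Minimal sketch: start a deployed batch pipeline. Every Data Fusion instance
# exposes the CDAP REST API at the instance's apiEndpoint (value below is
# hypothetical).
import google.auth
from google.auth.transport.requests import AuthorizedSession

CDAP_ENDPOINT = ("https://my-instance-my-project-dot-usc1"
                 ".datafusion.googleusercontent.com/api")  # hypothetical
PIPELINE = "gcs_to_bq_daily"                               # hypothetical

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

url = (f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/{PIPELINE}"
       "/workflows/DataPipelineWorkflow/start")
resp = session.post(url)
resp.raise_for_status()
```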
Multi-Cloud & Openness
Execute integration pipelines across multiple environments; a sketch of selecting an execution profile follows this list.
Multi-Cloud: Deploy pipelines to different clouds (CDAP/CDF). Built on the open-source, community-driven CDAP project (CDAP.io).
Open Execution: Dataproc clusters are provisioned automatically with a single click.
Ecosystem Support: Run on AWS EMR, Azure HDInsight, Cloudera SDX, Hortonworks Data Platform (HDP), and Snowflake.
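A short follow-up to the pipeline-start sketch above (reusing its session and url): the execution environment for a run can be chosen by naming a compute profile as a runtime argument. The profile name is hypothetical and assumes such a profile has been defined on the instance.

```python
# Follow-up to the pipeline-start sketch above (reuses its session and url):
# choose the execution environment by naming a compute profile as a runtime
# argument. The profile name is hypothetical.
runtime_args = {
    # "SYSTEM:" scopes the lookup to system-level profiles; user-scoped
    # profiles can be referenced by bare name.
    "system.profile.name": "SYSTEM:my-dataproc-profile",
}
resp = session.post(url, json=runtime_args)  # start the run with these args
resp.raise_for_status()
```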
Google Cloud Integrated
Deep integration with native Google Cloud sources and destinations; an illustrative source-to-sink config follows this list.
Cloud Storage (GCS)
Cloud SQL & Cloud Spanner
Pub/Sub
BigQuery
Cloud Bigtable
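An illustrative sketch (not a complete, deployable spec) of how a pipeline config wires a Cloud Storage source to a BigQuery sink. The plugin names, properties, and artifact fields are assumptions that may differ across plugin versions.

```python
# Illustrative only: the shape of a batch pipeline config connecting a Cloud
# Storage source to a BigQuery sink. Plugin names and properties are assumed
# and may differ by version; a deployable spec also needs artifact versions.
pipeline_config = {
    "artifact": {"name": "cdap-data-pipeline", "scope": "SYSTEM"},
    "config": {
        "stages": [
            {"name": "GCS", "plugin": {
                "name": "GCSFile", "type": "batchsource",
                "properties": {"path": "gs://my-bucket/input/*.csv",
                               "format": "csv"}}},
            {"name": "BigQuery", "plugin": {
                "name": "BigQueryTable", "type": "batchsink",
                "properties": {"dataset": "analytics", "table": "events"}}},
        ],
        # Edges of the DAG: data flows from the GCS stage to BigQuery.
        "connections": [{"from": "GCS", "to": "BigQuery"}],
    },
}
```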
Metadata & Protection
Ensure security, visibility, and unified wrangling across pipelines; a secure-key sketch follows this list.
Metadata & Modeling: End-to-end view of data and its flow, with insight into pipelines through end-to-end lineage.
Unified Wrangling & Pipelines: A cohesive experience across data preparation and pipeline design.
Data Protection (DLP): Mask, redact, and encrypt data natively.
Secrets Management: Store sensitive values such as passwords and URLs securely (Cloud KMS).
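A minimal sketch of storing a secret through the CDAP secure-keys API and then referencing it from a plugin property. The endpoint and key name are hypothetical.

```python
# Minimal sketch: store a secret in the instance's secure storage via the
# CDAP secure-keys API. Endpoint and key name are hypothetical.
import google.auth
from google.auth.transport.requests import AuthorizedSession

CDAP_ENDPOINT = ("https://my-instance-my-project-dot-usc1"
                 ".datafusion.googleusercontent.com/api")  # hypothetical

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

resp = session.put(
    f"{CDAP_ENDPOINT}/v3/namespaces/default/securekeys/db-password",
    json={"description": "Database password",
          "data": "s3cr3t-value",   # the sensitive value to protect
          "properties": {}})
resp.raise_for_status()
# Plugins can then reference the key with the macro: ${secure(db-password)}
```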
04
Architecture Components
Wrangler
Interactive data preparation tool for onboarding new sources and datasets. It performs data transformations and data quality checks and provides visual feedback; a sample directive recipe follows.
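An illustrative recipe: Wrangler applies an ordered list of directives to each record. The directives below are common Wrangler directives; the column names are hypothetical.

```python
# Illustrative sketch: a Wrangler recipe is an ordered list of directives
# applied to each record. Column names (:body, :city, :ssn) are hypothetical.
recipe = [
    "parse-as-csv :body ',' true",         # split the raw line into columns
    "drop :body",                          # remove the original raw column
    "fill-null-or-empty :city 'unknown'",  # simple data-quality fix
    "mask-number :ssn 'xxx-xx-####'",      # redact all but the last 4 digits
]
```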
Data Pipeline
A visual interface for building complex data workflows, including operations such as join, lookup, aggregate, and filter.
Rules Engine
A CDAP tool for business data transformations and checks, making codified complex rules accessible to business users.
Metadata Aggregator
A CDAP tool that aggregates business, technical, and operational metadata across the platform.
Replication
A capability of Data Fusion that allows users to continuously replicate data from their production databases directly into an analytics data warehouse.
Execution Environment
The compute resource (for example, an automatically provisioned Dataproc cluster) where pipelines run and execute their transformations.
05
Security, Logging & Monitoring
Security & Execution Isolation
Directed Acyclic Graphs (DAG): Workflows are represented as a series of stages arranged in a DAG, forming a strict one-way pipeline.
Stages: These act as the "nodes" in the pipeline graph and can represent different types of operations.
Namespaces: Used to logically partition a Data Fusion instance, isolating applications and data within your design and execution environments; a namespace-creation sketch follows.
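A minimal sketch of creating a namespace through the CDAP REST API; the endpoint and namespace name are hypothetical.

```python
# Minimal sketch: create a namespace to isolate applications and data within
# one Data Fusion instance. Endpoint and namespace name are hypothetical.
import google.auth
from google.auth.transport.requests import AuthorizedSession

CDAP_ENDPOINT = ("https://my-instance-my-project-dot-usc1"
                 ".datafusion.googleusercontent.com/api")  # hypothetical

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

# PUT /v3/namespaces/{namespace-id} creates the namespace if it doesn't exist.
resp = session.put(f"{CDAP_ENDPOINT}/v3/namespaces/marketing",
                   json={"description": "Isolated workspace for one team"})
resp.raise_for_status()
```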
Logging & Monitoring
Native integration with Google Cloud's operations suite for full pipeline visibility and auditing; a log-query sketch follows this list.
Cloud Monitoring: Integrated out-of-the-box for tracking metrics and overall pipeline health.
Cloud Logging: Integrated for centralized log management, security auditing, and troubleshooting pipeline execution.
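A minimal sketch of querying Data Fusion log entries with the Cloud Logging client library. The filter string is an assumption and may need adjusting for your project.

```python
# Minimal sketch: list recent log entries related to Data Fusion with the
# Cloud Logging client library (pip install google-cloud-logging).
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
# Assumed filter: audit log entries emitted by the Data Fusion service.
log_filter = 'protoPayload.serviceName="datafusion.googleapis.com"'
for i, entry in enumerate(client.list_entries(filter_=log_filter)):
    print(entry.timestamp, entry.log_name)
    if i >= 9:  # stop after ten entries
        break
```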