Managed enterprise data integration service, based on CDAP, that provides a unified user experience for data wrangling and pipeline design.
Availability: Zonal, Regional
SLA: N/A
Billing: 3 Editions
Author: Michaël Bettan
01
Overview & Use Cases
Cloud Data Fusion is a fully managed, cloud-native enterprise data integration service based on the open-source CDAP project. It helps users quickly build and manage data pipelines without writing code.
Primary Use Cases
Unified Analytics Data Marts: Build reliable and fast data foundations.
Agile Data Warehouses: Accelerate modernization of data warehousing.
Secured and Integrated Data Lakes: Centralize data with enterprise-grade protection.
Operational Reporting: Enable streamlined insight delivery for business operations.
Important Consideration
If data wrangling and out-of-the-box connectors are strict requirements, Data Fusion is the recommended solution for enterprise-scale customers.
02
Pricing Editions
Data Fusion offers three distinct billing editions tailored to different deployment sizes, workload criticality, and real-time integration needs; the sketch after the edition list shows how an edition is selected when an instance is created.
Developer
Full-featured edition for product exploration and development environments.
Zonal availability
Limited execution environment
Basic
Comprehensive data integration capabilities. Recommended for non-critical environments.
Build batch data pipelines
Connect to any data source
Perform code-free transformations
Limitation on simultaneous pipeline runs
Enterprise
Designed for critical environments requiring scale and reliability.
Includes support for real-time data pipelines
Interactive data lineage exploration
No limitations on simultaneous pipeline runs
High availability
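A minimal sketch of choosing an edition at instance-creation time through the Data Fusion REST API. The project, region, and instance name are hypothetical placeholders, and Application Default Credentials are assumed.

```python
# Minimal sketch: create an instance and pick its edition via the Data Fusion
# REST API. Project, region, and instance name are hypothetical placeholders.
import google.auth
from google.auth.transport.requests import AuthorizedSession

PROJECT = "my-project"       # hypothetical
LOCATION = "us-central1"     # Basic/Enterprise instances are regional
INSTANCE_ID = "my-instance"  # hypothetical

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

url = (f"https://datafusion.googleapis.com/v1/projects/{PROJECT}"
       f"/locations/{LOCATION}/instances?instanceId={INSTANCE_ID}")
# The edition is the instance "type": DEVELOPER, BASIC, or ENTERPRISE.
resp = session.post(url, json={"type": "ENTERPRISE"})
resp.raise_for_status()
print(resp.json()["name"])  # name of the long-running create operation
```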
03
Key Capabilities
Data Integration
Develop and execute robust ETL/ELT pipelines; a pipeline-start sketch follows this list.
Code-free visual transformations: Drag and drop interface for ease of use.
200+ data sources and formats: Connect to mainframes, databases, SaaS, and enterprise applications.
1000+ transforms: Extensive library including data quality tools.
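A minimal sketch of starting a deployed batch pipeline through the CDAP REST API that backs every Data Fusion instance. The endpoint URL and pipeline name are hypothetical; batch pipelines run as the DataPipelineWorkflow program.

```python
# Minimal sketch: start a deployed batch pipeline. Every Data Fusion instance
# exposes the CDAP REST API at the instance's apiEndpoint (value below is
# hypothetical).
import google.auth
from google.auth.transport.requests import AuthorizedSession

CDAP_ENDPOINT = ("https://my-instance-my-project-dot-usc1"
                 ".datafusion.googleusercontent.com/api")  # hypothetical
PIPELINE = "gcs_to_bq_daily"                               # hypothetical

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

url = (f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/{PIPELINE}"
       "/workflows/DataPipelineWorkflow/start")
resp = session.post(url)
resp.raise_for_status()
```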
Multi-Cloud & Openness
Execute integration pipelines across multiple environments; a sketch of selecting an execution profile follows this list.
Multi-Cloud: Deploy pipelines to different clouds (CDAP/CDF). Built on the open-source, community-driven CDAP project (CDAP.io).
Open Execution: Dataproc clusters are provisioned automatically with a single click.
Ecosystem Support: Run on AWS EMR, Azure HDInsight, Cloudera SDX, Hortonworks Data Platform (HDP), and Snowflake.
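A short follow-up to the pipeline-start sketch above (reusing its session and url): the execution environment for a run can be chosen by naming a compute profile as a runtime argument. The profile name is hypothetical and assumes such a profile has been defined on the instance.

```python
# Follow-up to the pipeline-start sketch above (reuses its session and url):
# choose the execution environment by naming a compute profile as a runtime
# argument. The profile name is hypothetical.
runtime_args = {
    # "SYSTEM:" scopes the lookup to system-level profiles; user-scoped
    # profiles can be referenced by bare name.
    "system.profile.name": "SYSTEM:my-dataproc-profile",
}
resp = session.post(url, json=runtime_args)  # start the run with these args
resp.raise_for_status()
```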
Google Cloud Integrated
Deep integration with native Google Cloud sources and destinations; an illustrative source-to-sink config follows this list.
Cloud Storage (GCS)
Cloud SQL & Cloud Spanner
Pub/Sub
BigQuery
Cloud Bigtable
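An illustrative sketch (not a complete, deployable spec) of how a pipeline config wires a Cloud Storage source to a BigQuery sink. The plugin names, properties, and artifact fields are assumptions that may differ across plugin versions.

```python
# Illustrative only: the shape of a batch pipeline config connecting a Cloud
# Storage source to a BigQuery sink. Plugin names and properties are assumed
# and may differ by version; a deployable spec also needs artifact versions.
pipeline_config = {
    "artifact": {"name": "cdap-data-pipeline", "scope": "SYSTEM"},
    "config": {
        "stages": [
            {"name": "GCS", "plugin": {
                "name": "GCSFile", "type": "batchsource",
                "properties": {"path": "gs://my-bucket/input/*.csv",
                               "format": "csv"}}},
            {"name": "BigQuery", "plugin": {
                "name": "BigQueryTable", "type": "batchsink",
                "properties": {"dataset": "analytics", "table": "events"}}},
        ],
        # Edges of the DAG: data flows from the GCS stage to BigQuery.
        "connections": [{"from": "GCS", "to": "BigQuery"}],
    },
}
```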
Metadata & Protection
Ensure security, visibility, and unified wrangling across pipelines; a secure-key sketch follows this list.
Metadata & Modeling: End-to-end view of data and its flow, with insight into pipelines through end-to-end lineage.
Unified Wrangling & Pipelines: A cohesive experience across data preparation and pipeline design.
Data Protection (DLP): Mask, redact, and encrypt data natively.
Secrets Management: Store sensitive values such as passwords and URLs securely (Cloud KMS).
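A minimal sketch of storing a secret through the CDAP secure-keys API and then referencing it from a plugin property. The endpoint and key name are hypothetical.

```python
# Minimal sketch: store a secret in the instance's secure storage via the
# CDAP secure-keys API. Endpoint and key name are hypothetical.
import google.auth
from google.auth.transport.requests import AuthorizedSession

CDAP_ENDPOINT = ("https://my-instance-my-project-dot-usc1"
                 ".datafusion.googleusercontent.com/api")  # hypothetical

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

resp = session.put(
    f"{CDAP_ENDPOINT}/v3/namespaces/default/securekeys/db-password",
    json={"description": "Database password",
          "data": "s3cr3t-value",   # the sensitive value to protect
          "properties": {}})
resp.raise_for_status()
# Plugins can then reference the key with the macro: ${secure(db-password)}
```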
04
Architecture Components
Wrangler
Interactive data preparation tool for onboarding new sources and datasets. It performs data transformations and data quality checks and provides visual feedback; a sample directive recipe follows.
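An illustrative recipe: Wrangler applies an ordered list of directives to each record. The directives below are common Wrangler directives; the column names are hypothetical.

```python
# Illustrative sketch: a Wrangler recipe is an ordered list of directives
# applied to each record. Column names (:body, :city, :ssn) are hypothetical.
recipe = [
    "parse-as-csv :body ',' true",         # split the raw line into columns
    "drop :body",                          # remove the original raw column
    "fill-null-or-empty :city 'unknown'",  # simple data-quality fix
    "mask-number :ssn 'xxx-xx-####'",      # redact all but the last 4 digits
]
```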
Data Pipeline
A visual interface for building complex data workflows, including operations such as join, lookup, aggregate, and filter.
Rules Engine
A CDAP tool for business data transformations and checks, making codified complex rules accessible to business users.
Metadata Aggregator
A CDAP tool that aggregates business, technical, and operational metadata across the platform.
Replication
A capability of Data Fusion that allows users to continuously replicate data from their production databases directly into an analytics data warehouse.
Execution Environment
The compute resource (for example, an automatically provisioned Dataproc cluster) where pipelines run and execute their transformations.
05
Security, Logging & Monitoring
Security & Execution Isolation
Directed Acyclic Graphs (DAG): Workflows are represented as a series of stages arranged in a DAG, forming a strict one-way pipeline.
Stages: These act as the "nodes" in the pipeline graph and can represent different types of operations.
Namespaces: Used to logically partition a Data Fusion instance, isolating applications and data within your design and execution environments; a namespace-creation sketch follows.
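A minimal sketch of creating a namespace through the CDAP REST API; the endpoint and namespace name are hypothetical.

```python
# Minimal sketch: create a namespace to isolate applications and data within
# one Data Fusion instance. Endpoint and namespace name are hypothetical.
import google.auth
from google.auth.transport.requests import AuthorizedSession

CDAP_ENDPOINT = ("https://my-instance-my-project-dot-usc1"
                 ".datafusion.googleusercontent.com/api")  # hypothetical

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

# PUT /v3/namespaces/{namespace-id} creates the namespace if it doesn't exist.
resp = session.put(f"{CDAP_ENDPOINT}/v3/namespaces/marketing",
                   json={"description": "Isolated workspace for one team"})
resp.raise_for_status()
```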
Logging & Monitoring
Native integration with Google Cloud's operations suite for full pipeline visibility and auditing; a log-query sketch follows this list.
Cloud Monitoring: Integrated out-of-the-box for tracking metrics and overall pipeline health.
Cloud Logging: Integrated for centralized log management, security auditing, and troubleshooting pipeline execution.
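A minimal sketch of querying Data Fusion log entries with the Cloud Logging client library. The filter string is an assumption and may need adjusting for your project.

```python
# Minimal sketch: list recent log entries related to Data Fusion with the
# Cloud Logging client library (pip install google-cloud-logging).
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
# Assumed filter: audit log entries emitted by the Data Fusion service.
log_filter = 'protoPayload.serviceName="datafusion.googleapis.com"'
for i, entry in enumerate(client.list_entries(filter_=log_filter)):
    print(entry.timestamp, entry.log_name)
    if i >= 9:  # stop after ten entries
        break
```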