Dataplex Overview

Google Cloud
Dataplex

Intelligent data fabric that unifies your distributed data to help automate data management and power analytics at scale.

Author: Michaël Bettan
01

Overview & Data Mesh

Dataplex is a data management platform that unifies data lakes and data warehouses to simplify data governance and accelerate analytics. It organizes disparate data sources across multiple clouds and formats into logical lakes and zones that are managed in one place, and it offers built-in security, governance, and integrations with other Google Cloud services for data analysis and insights.

Use Cases

  • Rapidly curate, secure, integrate, and analyze any type of data, at any scale
  • Organize data without moving it, using automatic data discovery, metadata harvesting, lifecycle management, and data quality checks backed by built-in AI-driven data intelligence.
  • Centrally manage, monitor, and audit policies for data authorization, retention, and classification.

Data Mesh

Data Mesh promotes distributed data ownership while centralizing governance and data discoverability. It shifts the responsibility for managing, producing, and consuming data to domain-specific teams that have the most context for that data.

  • Dataplex provides centralized data governance while enabling distributed ownership through "virtual lakes."

Billing

Billing is pay-as-you-go and based on usage of Dataplex processing (standard and premium), Dataplex shuffle storage, metadata storage, and Data Catalog API calls.

02

Key Capabilities

Centralized Data Management

  • Unifies data lakes and warehouses across multiple storage systems (Cloud Storage, BigQuery, etc.) for centralized visibility and control.
  • Organizes data into lakes and zones based on business logic for easier management.
  • Integrates with data governance and security tools for policy enforcement.

Enhanced Data Governance

  • Enforces data access and usage policies across all unified sources for compliance and privacy.
  • Provides data lineage tracking to understand data origins and transformations.
  • Enables data quality management to ensure accuracy and reliability.

Simplified Analytics

  • Provides a unified interface for data discovery and exploration.
  • Simplifies data access for analysts and data scientists using preferred tools (BigQuery, Spark, etc.).
  • Accelerates analytics and machine learning workflows with streamlined data access.

Security and Compliance

  • Enforces granular access controls and data masking for sensitive data protection.
  • Integrates with Google Cloud's security infrastructure for threat detection and prevention.
  • Supports compliance with industry regulations (GDPR, CCPA, etc.).

03

Architecture: Lakes & Zones

Zones, the logical groupings within a Dataplex lake, let you organize and manage your data assets by criteria such as data sensitivity, department, or business domain. Benefits include simplified data management (breaking large data lakes into manageable units), improved data governance (applying different policies and controls to different zones), and enhanced security (isolating sensitive data in dedicated zones).

Virtual Lakes

A lake is the highest-level abstraction representing a logical container for organizing and managing your data across multiple data storage systems (e.g., Cloud Storage, BigQuery). It allows you to apply governance, metadata management, and monitoring across your entire data ecosystem.

Use case: A lake can represent an overarching domain in your data mesh architecture, such as a specific business function (e.g., marketing, finance) or a data product.

Zones

A zone is a sub-component of a lake that organizes data into logical groupings, typically based on the lifecycle or stage of the data (e.g., raw, curated, or analytics data). Zones allow finer segmentation of data within a lake, reflecting different stages in the data processing pipeline.

Use case: You might create separate zones for raw, transformed, and curated data within a single lake, ensuring logical separation and governance at different stages of the data lifecycle.
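
The lake and zone hierarchy described above can be sketched as a small data model. This is only an illustration of the structure, not the Dataplex API: the class names, the `stage` field, and the example bucket and table paths are all assumptions made for the sketch. Assets hold a pointer to where the data already lives, which mirrors the idea of organizing data without moving it.

```python
from dataclasses import dataclass, field

# Hypothetical model of the lake -> zone -> asset hierarchy.
# It does not call Dataplex; every name below is illustrative.


@dataclass
class Asset:
    name: str
    resource: str  # pointer to existing storage; nothing is copied


@dataclass
class Zone:
    name: str
    stage: str  # lifecycle stage, e.g. "raw" or "curated"
    assets: list[Asset] = field(default_factory=list)


@dataclass
class Lake:
    name: str
    zones: dict[str, Zone] = field(default_factory=dict)

    def add_zone(self, zone: Zone) -> Zone:
        """Register a zone under this lake and return it."""
        self.zones[zone.name] = zone
        return zone

    def assets_in_stage(self, stage: str) -> list[str]:
        """List asset names across all zones at a given lifecycle stage."""
        return [
            asset.name
            for zone in self.zones.values()
            if zone.stage == stage
            for asset in zone.assets
        ]


# One lake per business domain, one zone per lifecycle stage.
marketing = Lake("marketing")
raw = marketing.add_zone(Zone("landing", stage="raw"))
curated = marketing.add_zone(Zone("reporting", stage="curated"))

raw.assets.append(Asset("clickstream", "gs://example-bucket/clickstream/"))
curated.assets.append(
    Asset("campaign_summary", "example-project.marketing.campaign_summary")
)
```

Calling `marketing.assets_in_stage("raw")` lists only the assets attached to raw-stage zones, which is the kind of stage-level separation the use case describes.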

Types of Zones

04

Data Profiles & Automated Tagging

Data Profile Insights

Data profiles give you insights into your data:

Automated Tagging

You can automatically tag tables in Dataplex based on the insights from your data profiles. For example, you could tag a table as "PII" if a profile detects sensitive information like names or addresses.
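
A minimal sketch of profile-driven tagging, under stated assumptions: the profile is represented as a plain dictionary with a made-up shape, and the "detection" is a naive column-name heuristic rather than whatever Dataplex does internally. The function name `suggest_tags` and the hint list are hypothetical.

```python
import re

# Hypothetical heuristic: column names that hint at personal data.
# A real profile would inspect values too; this only looks at names.
PII_HINTS = re.compile(r"(name|address|email|phone)", re.IGNORECASE)


def suggest_tags(profile: dict) -> set[str]:
    """Return suggested tags for a table from a simplified profile dict."""
    tags: set[str] = set()
    for column in profile["columns"]:
        if PII_HINTS.search(column["name"]):
            tags.add("PII")
    return tags


# Made-up profiles for two tables.
customers_profile = {
    "table": "customers",
    "columns": [
        {"name": "customer_id"},
        {"name": "full_name"},
        {"name": "home_address"},
    ],
}
orders_profile = {
    "table": "orders",
    "columns": [
        {"name": "order_id"},
        {"name": "total_amount"},
    ],
}

customer_tags = suggest_tags(customers_profile)
order_tags = suggest_tags(orders_profile)
```

Here the customers table is tagged "PII" because of `full_name` and `home_address`, while the orders table receives no tags. A name-based rule like this will produce false positives (for example a `filename` column), so it should be read as a starting point only.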

Improved Discovery & Governance

Improved Data Discovery: Tags make it easier to find and understand data within Dataplex.
Data Governance: Tagging helps you manage and control access to sensitive data.

05

Security Access Controls

Project Level

  • Dataplex Admin: Full control over all Dataplex resources in the project.
  • Dataplex Editor: Can manage Dataplex resources but cannot grant access to others.
  • Dataplex Viewer: Read-only access to all Dataplex resources in the project.

Lake Level

Grant permissions within a specific lake.

  • Lake Admin: Full control over a lake and its contents.
  • Lake Contributor: Can create, update, and delete resources within a lake.
  • Lake Reader: Read-only access to a lake and its contents.

Zone Level

Grant permissions within a specific zone.

  • Zone Admin: Full control over a zone and its contents.
  • Zone Contributor: Can create, update, and delete resources within a zone.
  • Zone Reader: Read-only access to a zone and its contents.

Data Roles

Control access to the data within Dataplex assets.

  • Data Reader: Read-only access to data.
  • Data Writer: Permission to write data.
  • Data Owner: Full control over data, including granting access to others.
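
The project, lake, and zone levels above can be pictured as nested scopes. The sketch below assumes that grants are additive and flow down from a parent scope to its children, as is generally the case with Google Cloud IAM. The role-to-action mapping, the principals, and the resource paths are hypothetical and are not Dataplex role identifiers.

```python
# Hypothetical mapping from role names to the actions they allow.
ROLE_ACTIONS = {
    "admin": {"read", "manage", "grant"},
    "editor": {"read", "manage"},       # project-level naming
    "contributor": {"read", "manage"},  # lake- and zone-level naming
    "viewer": {"read"},
    "reader": {"read"},
}

# (principal, scope path) -> role; paths are project/lake/zone.
bindings = {
    ("alice", "sales-project"): "viewer",
    ("alice", "sales-project/marketing-lake"): "contributor",
    ("bob", "sales-project/marketing-lake/raw-zone"): "reader",
}


def allowed_actions(principal: str, resource: str) -> set[str]:
    """Union the actions granted at the resource and at every ancestor scope."""
    parts = resource.split("/")
    actions: set[str] = set()
    for depth in range(1, len(parts) + 1):
        scope = "/".join(parts[:depth])
        role = bindings.get((principal, scope))
        if role is not None:
            actions |= ROLE_ACTIONS[role]
    return actions
```

With these bindings, `alice` can read and manage anything under `marketing-lake` but can only read other lakes in the project, while `bob` can read `raw-zone` and nothing above it.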

06

Policy Tags

Policy Tags enable you to define and apply business-relevant metadata to your data assets. These tags can represent classifications (e.g., "Confidential", "PII"), compliance requirements (e.g., "GDPR", "HIPAA"), or data sensitivity levels.
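
As a rough illustration of how such tags might be used, the sketch below attaches a classification to each column and masks values for readers who are not cleared for that classification. The column names, the tag assignments, and the `visible_row` helper are assumptions for this example; they do not represent the policy tag API.

```python
# Hypothetical column-to-tag assignments; None means untagged.
column_tags = {
    "customers.email": "PII",
    "customers.diagnosis": "HIPAA",
    "customers.signup_date": None,
}


def visible_row(row: dict, table: str, cleared_tags: set[str]) -> dict:
    """Return a copy of the row with values masked for tags the reader lacks."""
    result = {}
    for column, value in row.items():
        tag = column_tags.get(f"{table}.{column}")
        if tag is None or tag in cleared_tags:
            result[column] = value
        else:
            result[column] = "***"
    return result


row = {
    "email": "a@example.com",
    "diagnosis": "flu",
    "signup_date": "2024-01-01",
}
pii_only_view = visible_row(row, "customers", {"PII"})
```

A reader cleared only for "PII" sees the email and signup date, while the "HIPAA"-tagged diagnosis is masked.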

Benefits of Policy Tags