Managed Apache Spark and Hadoop service to leverage OSS Hadoop ecosystem tools (Pig, Hive, Spark, etc.).
Availability: Regional
SLA: >=99.5%
Billing: 1-min minimum
Author: Michaël Bettan
01
Core Components & Ecosystem
Data can no longer fit in memory on one machine (monolithic), so a new way of computing was devised using many computers to process the data (distributed). Such a group is called a cluster. Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems.
Use Cases
Big Data Processing: Process vast amounts of structured and unstructured data.
Data Warehousing & ETL: Scalable extraction, transformation, and loading (Hive, Presto).
Machine Learning: Train & deploy models with Spark MLlib.
Real-time Analytics: Streaming data with Flink & Spark Streaming.
Hadoop/Spark Migration: Migrate existing clusters from on-premises to cloud.
Interactive Analysis: Notebooks, Data Lakehouse, Data/Log analysis.
Billing Structure
Billed by the second (1-minute minimum).
Infrastructure costs: Compute Engine VMs, Persistent Disk storage, Cloud Storage, Cloud Monitoring, Global Networking.
Dataproc managed service cost: $0.01 per vCPU per hour, based on the aggregate number of vCPUs across the entire cluster.
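The billing rules above (per-second billing, 1-minute minimum, $0.01 per vCPU-hour service fee) can be sketched in a small estimator. This models only the Dataproc service fee; the underlying infrastructure costs (VMs, disks, networking) are billed separately and not included. The function name is illustrative, not an official API.

```python
# Sketch: estimating the Dataproc service fee for a cluster run.
# Models only the $0.01/vCPU-hour managed-service fee quoted above;
# Compute Engine, disk, and network costs are billed separately.

DATAPROC_RATE_PER_VCPU_HOUR = 0.01

def dataproc_service_fee(total_vcpus: int, runtime_seconds: float) -> float:
    """Fee across all vCPUs in the cluster, billed per second
    with a 1-minute minimum."""
    billed_seconds = max(runtime_seconds, 60)  # 1-minute minimum
    hours = billed_seconds / 3600
    return total_vcpus * DATAPROC_RATE_PER_VCPU_HOUR * hours

# Example: 1 master + 4 workers with 4 vCPUs each = 20 vCPUs, 30-minute job
fee = dataproc_service_fee(total_vcpus=20, runtime_seconds=30 * 60)
print(f"${fee:.2f}")  # 20 vCPU * $0.01/h * 0.5 h = $0.10
```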
HDFS
Hadoop Distributed File System. Provides high-throughput access by partitioning data across machines. A single NameNode (master) stores file-system metadata; the remaining DataNodes physically store the data blocks. Each block is replicated across 3 machines by default for fault tolerance.
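The block-splitting and 3x replication described above have a direct storage cost, which is worth quantifying (this also motivates the "Leverage GCS over HDFS" advice later in these notes). A minimal sketch, assuming the common defaults of a 128 MB block size and replication factor 3:

```python
# Sketch: HDFS storage footprint for a file, assuming the common
# defaults of a 128 MB block size and a replication factor of 3.
import math

BLOCK_MB = 128
REPLICATION = 3

def hdfs_footprint(file_mb: float) -> tuple:
    """Return (number of blocks, raw MB stored across the cluster)."""
    blocks = math.ceil(file_mb / BLOCK_MB)
    stored_mb = file_mb * REPLICATION
    return blocks, stored_mb

# A 1000 MB file is split into 8 blocks and occupies ~3000 MB raw.
print(hdfs_footprint(1000))  # (8, 3000)
```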
YARN
Yet Another Resource Negotiator. Job scheduling and cluster resource management. The Resource Manager schedules tasks across nodes, and Node Managers handle execution.
MapReduce
Parallel programming paradigm that processes huge amounts of data in two stages: map (transform each record into key-value pairs) and reduce (aggregate the values for each key).
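The two stages can be simulated in plain Python with the classic word-count example: the map phase emits key-value pairs, a shuffle groups them by key, and the reduce phase aggregates each group. In a real cluster these phases run in parallel across many nodes.

```python
# Sketch: the map and reduce stages of word count, in plain Python.
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group pairs by key; Reduce: sum the values per key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(map_phase(["big data", "big cluster"]))
print(counts)  # {'big': 2, 'data': 1, 'cluster': 1}
```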
Spark
Fast, general-purpose distributed framework. Core abstraction is the Resilient Distributed Dataset (RDD), which is immutable, lazily evaluated, and fault-tolerant through lineage tracking.
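The lazy evaluation of RDDs can be illustrated without Spark: Python generators behave analogously, in that chaining transformations builds a pipeline but computes nothing until a terminal step forces it. In Spark, map/filter record lineage in a DAG, and only an action such as collect() or count() triggers execution.

```python
# Sketch: lazy evaluation in the style of RDD transformations,
# illustrated with plain Python generators (no Spark required).

data = range(10)
mapped = (x * x for x in data)                # "transformation": nothing computed yet
filtered = (x for x in mapped if x % 2 == 0)  # still lazy

result = list(filtered)                       # "action": pipeline executes now
print(result)  # [0, 4, 16, 36, 64]
```

Lineage also explains fault tolerance: because each RDD remembers the transformations that produced it, a lost partition can be recomputed from its parents rather than restored from a replica.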
Hive
Data warehousing SQL-like query tool built on Hadoop. Translates HiveQL queries into MapReduce jobs, eliminating the need to write Java MapReduce code.
Pig
High-level scripting language (Pig Latin) for ETL. Supports checkpointing to resume processing in case of failure, and splitting pipelines to route data in parallel.
Data Ingestion & Storage
HBase: Non-relational, NoSQL, column-oriented database that runs on HDFS. Ideal for sparse datasets.
Sqoop: Bulk-transfer tool for moving large amounts of data between HDFS and relational databases (e.g., MySQL).
Flume: Distributed, reliable service for efficiently collecting, aggregating, and moving large amounts of log data.
Presto / Trino: Distributed SQL query engine for interactive data analysis.
Avro / Parquet: Avro for compact, binary row-data exchange; Parquet for columnar storage optimized for querying.
Streaming & Coordination
Kafka & Kafka Connect: Distributed streaming platform for real-time pipelines and connecting CDC systems (MySQL/Postgres) into topics.
Flink: Stream processing framework for stateful computations over bounded and unbounded data streams.
Debezium: Distributed OSS platform for real-time change data capture (CDC).
Zookeeper: Manages distributed environment and configuration across services.
Oozie: Workflow scheduler to manage Hadoop jobs with complex dependencies.
02
Architecture & Foundation
Master and Worker Nodes
The master node coordinates the cluster, running the HDFS NameNode and YARN ResourceManager. You can configure primary workers and secondary workers (low-cost Spot VMs that do not run HDFS).
Cluster Types
Standard (1 master, N workers), Single Node (1 master, 0 workers), High Availability (3 masters, N workers).
Dataproc Metastore
Fully managed, highly available service providing a centralized Hive metastore for your data lake. Allows Spark, Hive, and Presto to seamlessly share metadata.
Spark History Server
Web interface to view details about completed Spark jobs (stages, RDDs). Dataproc clusters can be set up to upload history automatically to GCS.
Foundation Strategies
Persistent Clusters: Long-lived, ideal for ongoing batch jobs, retaining state. Higher ongoing cost but efficient for frequent use.
Transient Clusters: Short-lived, created for specific jobs and terminated afterward. Cost-effective for sporadic processing.
Initialization actions: Custom scripts to install additional components (e.g., Kafka) at worker creation time.
Optional Components: Pre-packaged additions like Anaconda, Jupyter, Druid, Ranger, Solr.
Autoscaling & Workflows
Autoscaling: Dynamically adjusts cluster size based on YARN memory/CPU utilization. Set thresholds and configure cooldown periods to avoid rapid oscillation.
Workflows: A Directed Acyclic Graph (DAG) of jobs run on a cluster. Submitted via the Console, gcloud, or the REST API, or orchestrated with Dataproc Workflow Templates / Cloud Composer.
Component Gateway: Provides secure access to web endpoints for Dataproc default and optional components.
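The core autoscaling idea — grow when pending YARN memory exceeds what is available, shrink when memory sits idle, damped by a scale factor to avoid oscillation — can be sketched as below. This is a simplified illustration; the parameter names are not the actual policy fields, and real policies also enforce cooldown periods and graceful decommissioning.

```python
# Simplified sketch of the autoscaling decision: resize based on
# pending vs. available YARN memory, damped by scale factors.
# Illustrative only -- not the actual Dataproc policy schema.

def recommended_worker_delta(pending_mb: float, available_mb: float,
                             mb_per_worker: float,
                             scale_up_factor: float = 0.5,
                             scale_down_factor: float = 0.5) -> int:
    imbalance = pending_mb - available_mb
    if imbalance > 0:   # work is queued: add workers
        return round(scale_up_factor * imbalance / mb_per_worker)
    # memory is idle: remove workers (negative delta)
    return round(scale_down_factor * imbalance / mb_per_worker)

# 64 GB pending, 16 GB free, 16 GB per worker -> add workers
print(recommended_worker_delta(65536, 16384, 16384))  # 2
```

A scale factor below 1.0 deliberately under-reacts to each measurement, which is what prevents the rapid oscillation the cooldown period also guards against.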
03
GCP Integration & Security
Ecosystem Integration
Cloud Storage (GCS): Used for staging and storing data, providing scalable and durable storage while separating compute from storage.
BigQuery: Acts as a data warehouse for querying large datasets efficiently, often used in conjunction with Dataproc for analytics.
Dataflow: Manages data pipelines for ETL processes, integrating with Dataproc for seamless data processing.
Security Access Control
Access is controlled through project-level IAM roles; no Dataproc access is granted by default.
Dataproc Editor: Can create, delete, and edit clusters, jobs, and workflows.
Dataproc Viewer: Read-only access to view Dataproc configurations and jobs.
Dataproc Worker: Executes jobs; typically assigned to the service account executing on the worker nodes.
Dataproc vs Dataflow
Lead with Dataflow for streaming workloads and serverless ETL. Lead with Dataproc if you rely heavily on Spark/Hadoop ecosystems or are performing a "lift-and-shift" migration.
04
Best Practices & Optimization
Cost Optimization
Utilize Spot VMs: For non-critical, ephemeral clusters (no SLA, subject to preemption).
Autoscaling: Right-size clusters dynamically based on workload demands.
Larger Persistent Disks: Deliver higher disk throughput and scale cost more effectively than simply adding more VMs.
Performance Tuning
Co-locate clusters and buckets in the same region to minimize latency and egress costs.
Use SSDs when performing many shuffling operations or partitioned write workloads.
Increase boot disk size on spot workers for shuffle-heavy workloads.
spark.sql.shuffle.partitions: Set the number of partitions after a shuffle in Spark SQL.
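A common heuristic for choosing spark.sql.shuffle.partitions — an assumption here, not an official rule — is to size partitions so each handles roughly 128 MB of shuffle data. A minimal sketch:

```python
# Heuristic sketch: pick spark.sql.shuffle.partitions so each partition
# processes roughly a target size (128 MB is a common rule of thumb;
# tune for your workload). Applying it in PySpark would look like:
#   spark.conf.set("spark.sql.shuffle.partitions", str(n))
import math

def shuffle_partitions(shuffle_bytes: int, target_mb: int = 128) -> int:
    return max(1, math.ceil(shuffle_bytes / (target_mb * 1024**2)))

# A 100 GB shuffle at ~128 MB per partition -> 800 partitions
print(shuffle_partitions(100 * 1024**3))  # 800
```

The Spark SQL default of 200 partitions is often far too low for large shuffles and too high for small ones, which is why this setting is a frequent tuning target.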
Storage Optimization
Leverage GCS over HDFS: Avoids the 3x replication overhead and cleanly separates compute from storage.
Columnar Formats: Prefer Parquet or ORC for efficient querying.
I/O Optimization: Avoid small reads, use large block sizes, and avoid excessive nested directory traversal.
GCS Caching: Accelerates access for workloads accessing the same data repeatedly.
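The benefit of columnar formats like Parquet and ORC can be illustrated in plain Python: with one array per field, a query reads only the columns it needs instead of scanning every full row.

```python
# Sketch: row-oriented vs. column-oriented layout of the same records.
rows = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
    {"id": 3, "name": "c", "score": 30},
]

# Columnar layout: one contiguous array per field
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query like SELECT avg(score) touches only the "score" array in the
# columnar layout, instead of deserializing every full row.
avg_score = sum(columns["score"]) / len(columns["score"])
print(avg_score)  # 20.0
```

Columnar layouts also compress better, since values of one type and similar range sit next to each other on disk.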
Advanced Optimizations & Tools
Use distcp for efficient on-premises-to-cloud data transfers. Enable Adaptive Query Execution (AQE) in Spark 3 for runtime query re-optimization. Run jobs in cluster mode on Spark 3's improved execution engine for better resource utilization. Secure clusters with VPC Service Controls, CMEK, and least-privilege IAM.
05
Ephemeral vs Persistent Clusters
Google Cloud strongly recommends using ephemeral clusters tailored for single-job tasks. This approach minimizes operational overhead and avoids the common anti-patterns associated with shared, long-lived persistent clusters.
Issues with Persistent Clusters
Single Points of Failure: Errors in a shared cluster disrupt all jobs. Ephemeral clusters are recreated quickly.
State Management: Challenges migrating and maintaining cluster states across HDFS or MySQL.
Resource Contention: Multiple jobs competing for YARN resources impacts overall performance.
Unresponsive Daemons: Memory pressure causes critical service daemons to freeze.
Disk Capacity Issues: Accumulated logs and temp files fill up disk space.
Outdated Software: Managing outdated images is problematic with long-lived setups.
Migration Considerations
Migrate data to GCS (keep HDFS only where heavy local I/O is explicitly required).
Migrate existing Hadoop/Spark jobs directly to Dataproc clusters.
Migrate HBase to Bigtable for better managed performance.
Migrate Hive & Impala to BigQuery.
Shift from persistent architectures (running multiple jobs) to ephemeral single-purpose clusters.
Separate storage (GCS) and compute (Dataproc clusters).
Monitoring & ML Tools
Job Driver Output: Driver output is gathered automatically and viewable per job.
Logs: Default to INFO; the log level can be configured per job (e.g., DEBUG).
Cluster Details: Graphs for CPU, DISK, and NETWORK utilizations.
Machine Learning: Spark ML (the DataFrame-based spark.ml API) is the modern, higher-level library for ML pipelines. MLlib's older RDD-based API (spark.mllib) is the legacy, lower-level option.
06
Languages & Web Interfaces
Python (PySpark)
  Pros: Easy to learn, large community, rich libraries (Pandas, NumPy), good for data exploration and prototyping.
  Cons: Slower than Scala/Java for complex jobs; less type safety can lead to runtime errors.
Scala (Spark)
  Pros: Fast performance, statically typed (catches errors early), integrates natively with Spark, concise syntax.
  Cons: Steeper learning curve than Python, smaller developer community.
Java (Spark)
  Pros: Very fast, highly mature ecosystem, extensive tooling support.
  Cons: Verbose syntax; can be significantly less developer-friendly for data scientists.
SQL (Spark SQL)
  Pros: Familiar to many data analysts, declarative (express what you want), excellent for data warehousing/BI.
  Cons: Less flexible than general-purpose languages; can be difficult for complex programmatic transformations.
R (SparkR)
  Pros: Powerful statistical computing language, rich repository of statistical libraries.
  Cons: Slower than Scala/Java; less integrated with the broader Spark ecosystem than the other options.
Accessing Web Interfaces
Component Gateway: Provides one-click access to Hadoop, Spark, and other component web UIs directly through the GCP console. Must be explicitly enabled during cluster creation.
Google Cloud CLI: You can use gcloud compute ssh with dynamic port forwarding to create an SSH tunnel and set up a SOCKS proxy for secure, authenticated local access to UIs.