Managed Apache Spark and Hadoop service to leverage OSS Hadoop ecosystem tools (Pig, Hive, Spark, etc.).
Availability: Regional
SLA: >=99.5%
Billing: 1-min minimum
Author: Michaël Bettan
01
Core Components & Ecosystem
Data can no longer fit in memory on one machine (monolithic), so a new way of computing was devised using many computers to process the data (distributed). Such a group is called a cluster. Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems.
Use Cases
Big Data Processing: Process vast amounts of structured and unstructured data.
Data Warehousing & ETL: Scalable extraction, transformation, and loading (Hive, Presto).
Machine Learning: Train & deploy models with Spark MLlib.
Real-time Analytics: Streaming data with Flink & Spark Streaming.
Hadoop/Spark Migration: Migrate existing clusters from on-premises to cloud.
Interactive Analysis: Notebooks, Data Lakehouse, Data/Log analysis.
Billing Structure
Billed by the second (1-minute minimum).
Infrastructure costs: Compute Engine VMs, Persistent Disk storage, Cloud Storage, Cloud Monitoring, Global Networking.
Dataproc managed service cost: $0.01 per vCPU per hour, based on the aggregate number of vCPUs across the entire cluster.
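The billing rules above (per-second billing, 1-minute minimum, $0.01 per vCPU-hour service fee) can be sketched in a small estimator. This models only the Dataproc service fee; the underlying infrastructure costs (VMs, disks, networking) are billed separately and not included. The function name is illustrative, not an official API.

```python
# Sketch: estimating the Dataproc service fee for a cluster run.
# Models only the $0.01/vCPU-hour managed-service fee quoted above;
# Compute Engine, disk, and network costs are billed separately.

DATAPROC_RATE_PER_VCPU_HOUR = 0.01

def dataproc_service_fee(total_vcpus: int, runtime_seconds: float) -> float:
    """Fee across all vCPUs in the cluster, billed per second
    with a 1-minute minimum."""
    billed_seconds = max(runtime_seconds, 60)  # 1-minute minimum
    hours = billed_seconds / 3600
    return total_vcpus * DATAPROC_RATE_PER_VCPU_HOUR * hours

# Example: 1 master + 4 workers with 4 vCPUs each = 20 vCPUs, 30-minute job
fee = dataproc_service_fee(total_vcpus=20, runtime_seconds=30 * 60)
print(f"${fee:.2f}")  # 20 vCPU * $0.01/h * 0.5 h = $0.10
```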
HDFS
Hadoop Distributed File System. Provides high-throughput access by partitioning data across machines. A single NameNode (master) stores file-system metadata; the remaining DataNodes physically store the data blocks. Each block is replicated across 3 machines by default for fault tolerance.
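The block-splitting and 3x replication described above have a direct storage cost, which is worth quantifying (this also motivates the "Leverage GCS over HDFS" advice later in these notes). A minimal sketch, assuming the common defaults of a 128 MB block size and replication factor 3:

```python
# Sketch: HDFS storage footprint for a file, assuming the common
# defaults of a 128 MB block size and a replication factor of 3.
import math

BLOCK_MB = 128
REPLICATION = 3

def hdfs_footprint(file_mb: float) -> tuple:
    """Return (number of blocks, raw MB stored across the cluster)."""
    blocks = math.ceil(file_mb / BLOCK_MB)
    stored_mb = file_mb * REPLICATION
    return blocks, stored_mb

# A 1000 MB file is split into 8 blocks and occupies ~3000 MB raw.
print(hdfs_footprint(1000))  # (8, 3000)
```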
YARN
Yet Another Resource Negotiator. Job scheduling and cluster resource management. The Resource Manager schedules tasks across nodes, and Node Managers handle execution.
MapReduce
Parallel programming paradigm that processes huge amounts of data in two stages: map (transform each record into key-value pairs) and reduce (aggregate the values for each key).
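The two stages can be simulated in plain Python with the classic word-count example: the map phase emits key-value pairs, a shuffle groups them by key, and the reduce phase aggregates each group. In a real cluster these phases run in parallel across many nodes.

```python
# Sketch: the map and reduce stages of word count, in plain Python.
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group pairs by key; Reduce: sum the values per key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(map_phase(["big data", "big cluster"]))
print(counts)  # {'big': 2, 'data': 1, 'cluster': 1}
```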
Spark
Fast, general-purpose distributed framework. Core abstraction is the Resilient Distributed Dataset (RDD), which is immutable, lazily evaluated, and fault-tolerant through lineage tracking.
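The lazy evaluation of RDDs can be illustrated without Spark: Python generators behave analogously, in that chaining transformations builds a pipeline but computes nothing until a terminal step forces it. In Spark, map/filter record lineage in a DAG, and only an action such as collect() or count() triggers execution.

```python
# Sketch: lazy evaluation in the style of RDD transformations,
# illustrated with plain Python generators (no Spark required).

data = range(10)
mapped = (x * x for x in data)                # "transformation": nothing computed yet
filtered = (x for x in mapped if x % 2 == 0)  # still lazy

result = list(filtered)                       # "action": pipeline executes now
print(result)  # [0, 4, 16, 36, 64]
```

Lineage also explains fault tolerance: because each RDD remembers the transformations that produced it, a lost partition can be recomputed from its parents rather than restored from a replica.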
Hive
Data warehousing SQL-like query tool built on Hadoop. Translates HiveQL queries into MapReduce jobs, eliminating the need to write Java MapReduce code.
Pig
High-level scripting language (Pig Latin) for ETL. Supports checkpointing to resume processing in case of failure, and splitting pipelines to route data in parallel.
Data Ingestion & Storage
HBase: Non-relational, NoSQL, column-oriented database that runs on HDFS. Ideal for sparse datasets.
Sqoop: Bulk-transfer tool for moving large amounts of data between HDFS and relational databases (e.g., MySQL).
Flume: Distributed, reliable service for efficiently collecting, aggregating, and moving large amounts of log data.
Presto / Trino: Distributed SQL query engine for interactive data analysis.
Avro / Parquet: Avro for compact, binary row-data exchange; Parquet for columnar storage optimized for querying.
Streaming & Coordination
Kafka & Kafka Connect: Distributed streaming platform for real-time pipelines and connecting CDC systems (MySQL/Postgres) into topics.
Flink: Stream processing framework for stateful computations over bounded and unbounded data streams.
Debezium: Distributed OSS platform for real-time change data capture (CDC).
Zookeeper: Manages distributed environment and configuration across services.
Oozie: Workflow scheduler to manage Hadoop jobs with complex dependencies.
02
Architecture & Foundation
Master and Worker Nodes
The master node coordinates the cluster, running the HDFS NameNode and YARN ResourceManager. You can configure primary workers and secondary workers (low-cost Spot VMs that do not run HDFS).
Cluster Types
Standard (1 master, N workers), Single Node (1 master, 0 workers), High Availability (3 masters, N workers).
Dataproc Metastore
Fully managed, highly available service providing a centralized Hive metastore for your data lake. Allows Spark, Hive, and Presto to seamlessly share metadata.
Spark History Server
Web interface to view details about completed Spark jobs (stages, RDDs). Dataproc clusters can be set up to upload history automatically to GCS.
Foundation Strategies
Persistent Clusters: Long-lived, ideal for ongoing batch jobs, retaining state. Higher ongoing cost but efficient for frequent use.
Transient Clusters: Short-lived, created for specific jobs and terminated afterward. Cost-effective for sporadic processing.
Initialization actions: Custom scripts to install additional components (e.g., Kafka) at worker creation time.
Optional Components: Pre-packaged additions like Anaconda, Jupyter, Druid, Ranger, Solr.
Autoscaling & Workflows
Autoscaling: Dynamically adjusts cluster size based on YARN memory/CPU utilization. Set thresholds and configure cooldown periods to avoid rapid oscillation.
Workflows: A Directed Acyclic Graph (DAG) of jobs run on a cluster. Submitted via the Console, gcloud, or the REST API, or orchestrated with Dataproc Workflow Templates / Cloud Composer.
Component Gateway: Provides secure access to web endpoints for Dataproc default and optional components.
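The core autoscaling idea — grow when pending YARN memory exceeds what is available, shrink when memory sits idle, damped by a scale factor to avoid oscillation — can be sketched as below. This is a simplified illustration; the parameter names are not the actual policy fields, and real policies also enforce cooldown periods and graceful decommissioning.

```python
# Simplified sketch of the autoscaling decision: resize based on
# pending vs. available YARN memory, damped by scale factors.
# Illustrative only -- not the actual Dataproc policy schema.

def recommended_worker_delta(pending_mb: float, available_mb: float,
                             mb_per_worker: float,
                             scale_up_factor: float = 0.5,
                             scale_down_factor: float = 0.5) -> int:
    imbalance = pending_mb - available_mb
    if imbalance > 0:   # work is queued: add workers
        return round(scale_up_factor * imbalance / mb_per_worker)
    # memory is idle: remove workers (negative delta)
    return round(scale_down_factor * imbalance / mb_per_worker)

# 64 GB pending, 16 GB free, 16 GB per worker -> add workers
print(recommended_worker_delta(65536, 16384, 16384))  # 2
```

A scale factor below 1.0 deliberately under-reacts to each measurement, which is what prevents the rapid oscillation the cooldown period also guards against.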
03
GCP Integration & Security
Ecosystem Integration
Cloud Storage (GCS): Used for staging and storing data, providing scalable and durable storage while separating compute from storage.
BigQuery: Acts as a data warehouse for querying large datasets efficiently, often used in conjunction with Dataproc for analytics.
Dataflow: Manages data pipelines for ETL processes, integrating with Dataproc for seamless data processing.
Security Access Control
Access is controlled through project-level IAM roles; no Dataproc access is granted by default.
Dataproc Editor: Can create, delete, and edit clusters, jobs, and workflows.
Dataproc Viewer: Read-only access to view Dataproc configurations and jobs.
Dataproc Worker: Executes jobs; typically assigned to the service account executing on the worker nodes.
Dataproc vs Dataflow
Lead with Dataflow for streaming workloads and serverless ETL. Lead with Dataproc if you rely heavily on Spark/Hadoop ecosystems or are performing a "lift-and-shift" migration.
04
Best Practices & Optimization
Cost Optimization
Utilize Spot VMs: For non-critical, ephemeral clusters (no SLA, subject to preemption).
Autoscaling: Right-size clusters dynamically based on workload demands.
Larger Persistent Disks: Deliver higher disk throughput and scale cost more effectively than simply adding more VMs.
Performance Tuning
Co-locate clusters and buckets in the same region to minimize latency and egress costs.
Use SSDs when performing many shuffling operations or partitioned write workloads.
Increase boot disk size on spot workers for shuffle-heavy workloads.
spark.sql.shuffle.partitions: Set the number of partitions after a shuffle in Spark SQL.
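A common heuristic for choosing spark.sql.shuffle.partitions — an assumption here, not an official rule — is to size partitions so each handles roughly 128 MB of shuffle data. A minimal sketch:

```python
# Heuristic sketch: pick spark.sql.shuffle.partitions so each partition
# processes roughly a target size (128 MB is a common rule of thumb;
# tune for your workload). Applying it in PySpark would look like:
#   spark.conf.set("spark.sql.shuffle.partitions", str(n))
import math

def shuffle_partitions(shuffle_bytes: int, target_mb: int = 128) -> int:
    return max(1, math.ceil(shuffle_bytes / (target_mb * 1024**2)))

# A 100 GB shuffle at ~128 MB per partition -> 800 partitions
print(shuffle_partitions(100 * 1024**3))  # 800
```

The Spark SQL default of 200 partitions is often far too low for large shuffles and too high for small ones, which is why this setting is a frequent tuning target.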
Storage Optimization
Leverage GCS over HDFS: Avoids the 3x replication overhead and cleanly separates compute from storage.
Columnar Formats: Prefer Parquet or ORC for efficient querying.
I/O Optimization: Avoid small reads, use large block sizes, and avoid excessive nested directory traversal.
GCS Caching: Accelerates access for workloads accessing the same data repeatedly.
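The benefit of columnar formats like Parquet and ORC can be illustrated in plain Python: with one array per field, a query reads only the columns it needs instead of scanning every full row.

```python
# Sketch: row-oriented vs. column-oriented layout of the same records.
rows = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
    {"id": 3, "name": "c", "score": 30},
]

# Columnar layout: one contiguous array per field
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query like SELECT avg(score) touches only the "score" array in the
# columnar layout, instead of deserializing every full row.
avg_score = sum(columns["score"]) / len(columns["score"])
print(avg_score)  # 20.0
```

Columnar layouts also compress better, since values of one type and similar range sit next to each other on disk.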
Advanced Optimizations & Tools
Use distcp for efficient on-premises-to-cloud data transfers. Enable Adaptive Query Execution (AQE) in Spark 3 for runtime query re-optimization. Run jobs in cluster mode on Spark 3's improved execution engine for better resource utilization. Secure clusters with VPC Service Controls, CMEK, and least-privilege IAM.
05
Ephemeral vs Persistent Clusters
Google Cloud strongly recommends using ephemeral clusters tailored for single-job tasks. This approach minimizes operational overhead and avoids the common anti-patterns associated with shared, long-lived persistent clusters.
Issues with Persistent Clusters
Single Points of Failure: Errors in a shared cluster disrupt all jobs. Ephemeral clusters are recreated quickly.
State Management: Challenges migrating and maintaining cluster states across HDFS or MySQL.
Resource Contention: Multiple jobs competing for YARN resources impacts overall performance.
Unresponsive Daemons: Memory pressure causes critical service daemons to freeze.
Disk Capacity Issues: Accumulated logs and temp files fill up disk space.
Outdated Software: Managing outdated images is problematic with long-lived setups.
Migration Considerations
Migrate data to GCS (keep HDFS only where heavy local I/O is explicitly required).
Migrate existing Hadoop/Spark jobs directly to Dataproc clusters.
Migrate HBase to Bigtable for better managed performance.
Migrate Hive & Impala to BigQuery.
Shift from persistent architectures (running multiple jobs) to ephemeral single-purpose clusters.
Separate storage (GCS) and compute (Dataproc clusters).
Monitoring & ML Tools
Job Driver Output: Driver output is gathered automatically and viewable per job.
Logs: Default to INFO; the log level can be configured per job (e.g., DEBUG).
Cluster Details: Graphs for CPU, DISK, and NETWORK utilizations.
Machine Learning: Spark ML (the DataFrame-based spark.ml API) is the modern, higher-level library for ML pipelines. MLlib's older RDD-based API (spark.mllib) is the legacy, lower-level option.
06
Languages & Web Interfaces
Python (PySpark)
  Pros: Easy to learn, large community, rich libraries (Pandas, NumPy), good for data exploration and prototyping.
  Cons: Slower than Scala/Java for complex jobs; less type safety can lead to runtime errors.
Scala (Spark)
  Pros: Fast performance, statically typed (catches errors early), integrates natively with Spark, concise syntax.
  Cons: Steeper learning curve than Python, smaller developer community.
Java (Spark)
  Pros: Very fast, highly mature ecosystem, extensive tooling support.
  Cons: Verbose syntax; can be significantly less developer-friendly for data scientists.
SQL (Spark SQL)
  Pros: Familiar to many data analysts, declarative (express what you want), excellent for data warehousing/BI.
  Cons: Less flexible than general-purpose languages; can be difficult for complex programmatic transformations.
R (SparkR)
  Pros: Powerful statistical computing language, rich repository of statistical libraries.
  Cons: Slower than Scala/Java; less integrated with the broader Spark ecosystem than the other options.
Accessing Web Interfaces
Component Gateway: Provides one-click access to Hadoop, Spark, and other component web UIs directly through the GCP console. Must be explicitly enabled during cluster creation.
Google Cloud CLI: You can use gcloud compute ssh with dynamic port forwarding to create an SSH tunnel and set up a SOCKS proxy for secure, authenticated local access to UIs.