Architecture Notes

Data Lakehouse
Michaël Bettan

Comprehensive study notes covering Data Lakehouse concepts, comparing Data Lake vs Warehouse, Medallion & Lambda architectures, open table formats (Delta, Iceberg, UniForm), and Databricks platform fundamentals.

Topic: Data Architecture
Focus: Lakehouse & Delta Lake
Author: Michaël Bettan
At a glance: 3 architectures compared · 4 medallion zones · 3 open table formats · 2 platform planes
01

Overview: Lake vs Warehouse vs Lakehouse

The Lakehouse architecture leverages cloud object storage combined with open standards like Delta Lake to provide ACID transactions and robust governance. This approach supports record-level updates and eliminates the complexity of maintaining separate lake and warehouse systems.

Data Lake

Stores raw data in various formats at low cost but lacks structure, governance, and performance for complex queries.

Data Warehouse

Stores structured data optimized for fast querying and BI use cases but can be expensive and inflexible for diverse data types. Traditionally uses schema-on-write and offers ACID transactions.

Data Lakehouse

Combines the best of both worlds, offering low-cost storage, diverse data types, and excellent performance with robust governance.

  • Solves data lake challenges (data duplication, inconsistent data, lack of schema enforcement).
  • Solves data warehouse challenges (rigidity, difficulty in making changes).
  • Single processing path: Spark handles both batch and streaming incrementally.

Spark

Acts as the primary processing engine, unifying real-time and batch workloads under a single, incremental computational path.

Delta Lake

An open storage layer utilizing Parquet files and a transaction log (stored in _delta_log/) to provide ACID integrity, schema enforcement, and time travel across cloud-native object storage.

02

Medallion & Lambda Architectures

Medallion Architecture uses zones to manage data quality and usability, progressively improving data structure as it flows through the lakehouse.

ZONE 1

Source

Data originates directly from databases, files, or streams. Raw form, varying quality/governance.

ZONE 2

Bronze

Ingested as-is, usually Delta format. Contains history & duplicates. Primary goal is streamlined loading.

ZONE 3

Silver

Cleaned, quality rules applied, duplicates removed via merge. Contains data for business decisions.

ZONE 4

Gold

Curated and optimized (aggregations, joins) to answer business questions. Tuned for reading/dashboards.
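The zone progression above can be sketched as plain transformations. This is an illustrative pure-Python sketch (not Spark code); the record fields (order_id, region, amount) and the quality rule are hypothetical examples.

```python
# Illustrative sketch of the Bronze -> Silver -> Gold progression.
# Field names and the quality rule are hypothetical.
from collections import defaultdict

bronze = [  # ingested as-is: history and duplicates retained
    {"order_id": 1, "region": "EU", "amount": 100},
    {"order_id": 1, "region": "EU", "amount": 100},  # duplicate
    {"order_id": 2, "region": "US", "amount": 250},
    {"order_id": 3, "region": "EU", "amount": None},  # fails quality rule
]

def to_silver(records):
    """Apply quality rules and deduplicate on the business key."""
    seen, silver = set(), []
    for r in records:
        if r["amount"] is None:    # quality rule: amount is required
            continue
        if r["order_id"] in seen:  # dedupe (Delta would use MERGE here)
            continue
        seen.add(r["order_id"])
        silver.append(r)
    return silver

def to_gold(records):
    """Aggregate to answer a business question: revenue per region."""
    totals = defaultdict(int)
    for r in records:
        totals[r["region"]] += r["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
```

In a real lakehouse each step would be a Spark job writing a Delta table, but the shape of the work per zone is the same.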

Lambda Architecture attempts to balance low-latency and high-throughput processing using two paths. However, a lakehouse simplifies this by employing a single processing path (Spark) for both real-time and batch workflows.

Speed Layer (Hot Path)

Processes near real-time data using stream-processing engines (e.g., Storm, Flink, Kafka Streams). Results are quick but potentially inaccurate; accuracy is sacrificed for speed.

Batch Layer (Cold Path)

Processes data in batches (e.g., Hadoop, Spark), handling historical data for comprehensive, accurate computations that eventually correct the speed layer's results.

Serving Layer

Combines and serves results from both speed and batch layers for a unified view of the data.

Challenges with Lambda Architecture

Complexity: Maintaining two codebases is complex and expensive.
Data Duplication: Storing and processing the same data along both paths increases storage and compute costs.
Inconsistency: Reconciling paths can cause temporary data inconsistencies.
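The serving layer's reconciliation logic can be sketched in a few lines. This is an illustrative pure-Python sketch with hypothetical keys and values: batch results are authoritative, and speed-layer results fill in keys the batch view has not caught up to yet.

```python
# Illustrative sketch of a Lambda serving layer.
# Keys (dates) and values (daily counts) are hypothetical.
batch_view = {"2024-01-01": 1000, "2024-01-02": 980}  # accurate, delayed
speed_view = {"2024-01-02": 975, "2024-01-03": 40}    # fast, approximate

def serve(batch, speed):
    """Unified view: speed layer fills gaps, batch overrides on overlap."""
    merged = dict(speed)
    merged.update(batch)  # the batch layer eventually corrects the speed layer
    return merged

unified = serve(batch_view, speed_view)
```

Note the temporary inconsistency: until the batch view catches up, "2024-01-03" is served from the approximate speed layer. A lakehouse avoids this by computing one incremental answer per key.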

03

Core Principles & Architecture

Lakehouse Architecture Layers

  • Ingestion Layer: raw data
  • Storage Layer: Delta / Parquet
  • Metadata Layer: Unity Catalog
  • Processing Layer: Spark / Trino
  • Consumption Layer: BI / ML

04

Open Table Formats

Table Format (Delta Lake): Open-source storage layer that provides ACID transactions and scalable metadata handling for data lakes. It leverages Parquet files coupled with a JSON-based transaction log stored in the _delta_log/ directory.

ACID Transactions
Guarantees full transactional integrity on cloud object storage, ensuring partial writes are never visible.
Schema Enforcement & Evolution
Automatically rejects incompatible writes at ingestion, while supporting governed structural changes via mergeSchema or overwriteSchema.
Time Travel
Enables access to historical data snapshots using specific version numbers or timestamps.
MERGE (Upsert)
Performs atomic insert, update, and delete operations, providing a foundation for CDC pipelines.
File Immutability & Deletion Vectors
Data files are immutable: writes create new files rather than modifying existing ones. Deletion vectors add a soft-delete mechanism, enabling row-level changes and concurrency without immediately rewriting files.
OPTIMIZE & VACUUM
OPTIMIZE: Compacts small files into larger ones. VACUUM: Removes unreferenced data files (default 7 days retention).
Z-Ordering & Liquid Clustering
Co-locates related data within files to maximize data skipping. Liquid clustering provides a dynamic alternative that automatically reorganizes files based on clustering keys.
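The row-level semantics of MERGE can be sketched without Spark. This is an illustrative pure-Python sketch of upsert logic, using a hypothetical CDC-style `_deleted` flag on source rows; in Delta Lake the whole operation additionally commits atomically.

```python
# Illustrative sketch of MERGE (upsert) semantics: matched rows are
# updated or deleted, unmatched source rows are inserted. The `_deleted`
# flag is a hypothetical CDC convention; Delta's MERGE expresses the
# same branches as WHEN MATCHED / WHEN NOT MATCHED clauses.
def merge(target, source):
    """Upsert `source` rows into `target`; both are dicts keyed by id."""
    merged = dict(target)
    for key, row in source.items():
        if row.get("_deleted"):
            merged.pop(key, None)  # WHEN MATCHED AND _deleted THEN DELETE
        else:
            merged[key] = row      # WHEN MATCHED UPDATE / NOT MATCHED INSERT
    return merged

target = {1: {"status": "open"}, 2: {"status": "open"}}
source = {1: {"_deleted": True},
          2: {"status": "closed"},
          3: {"status": "open"}}
result = merge(target, source)
```

This insert-update-delete combination is why MERGE is described as the foundation for CDC pipelines: one atomic statement absorbs a full change feed.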

Table Format (Iceberg)

Engine-agnostic format providing warehouse-grade features on data lakes.

  • ACID Properties: Snapshot isolation
  • Schema Evolution: Add/remove/rename columns without rewrite
  • Hidden Partitioning: No manual directory structure management
  • Time Travel: Query via snapshot ID or timestamp
  • Metadata Layer: Tracks schema, file locations, snapshots
  • Multi-engine: Readable by multiple compute engines simultaneously
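Hidden partitioning is the easiest of these features to sketch. This illustrative pure-Python sketch mimics Iceberg's days() partition transform; the engine derives the partition value from the data, so neither writers nor readers manage directory layouts (the row shape here is hypothetical).

```python
# Illustrative sketch of Iceberg-style hidden partitioning: the table
# declares a transform (here, days(ts)) and the engine derives the
# partition value; users never build or query directory paths.
from datetime import datetime, timezone

def days_transform(ts: datetime) -> str:
    """Iceberg's days() transform: truncate a timestamp to its date."""
    return ts.date().isoformat()

def route(row):
    """Engine-side routing: the partition value is derived, not supplied."""
    return days_transform(row["ts"])

part = route({"ts": datetime(2024, 5, 1, 13, 30, tzinfo=timezone.utc)})
```

Because the transform lives in table metadata, it can later be changed (partition evolution) without rewriting existing data files.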

Table Format (UniForm)

Universal Format enabling interoperability across ecosystems.

  • Universal Format: Single Delta table read simultaneously as Delta, Iceberg, or Hudi without copying data.
  • Interoperability: Eliminates multi-format fragmentation, allowing different teams to access the same source data from their preferred engine.
  • Table Configuration: Enabled via TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg').
05

Databricks Platform & Metadata Management

ACID Transactions Details

Atomicity: Write operations either fully succeed or fully fail; no partial result is ever visible.
Consistency: Protects table integrity via schema enforcement.
Isolation: Concurrent readers and writers do not block each other (Optimistic Concurrency Control).
Durability: Permanently records every committed transaction in the Delta Log to survive failures.
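The isolation mechanism above, optimistic concurrency control, can be sketched as a version check at commit time. This is an illustrative pure-Python sketch (class and action strings are hypothetical), not Delta's actual implementation, which commits by atomically creating the next numbered log file.

```python
# Illustrative sketch of optimistic concurrency control: each writer
# reads the current log version, prepares its commit, and succeeds only
# if no other writer claimed that version first; the loser retries.
class DeltaLogSim:
    def __init__(self):
        self.version = 0
        self.commits = []

    def try_commit(self, read_version, actions):
        if read_version != self.version:
            return False            # conflict: someone committed first
        self.commits.append(actions)
        self.version += 1
        return True                 # durable once appended to the log

log = DeltaLogSim()
v = log.version                     # both writers read version 0
ok_a = log.try_commit(v, ["writer-A adds file"])   # A wins the race
ok_b = log.try_commit(v, ["writer-B adds file"])   # B conflicts
if not ok_b:                        # B retries against the new version
    ok_b = log.try_commit(log.version, ["writer-B adds file"])
```

Readers never block during any of this: they see whichever version was last committed when they started.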

Aspect     | Control Plane                                    | Data Plane
Hosting    | Managed within the Databricks cloud subscription | Resides within the customer's own cloud subscription
Components | Web UI, Cluster Management, Workflows, Notebooks | Compute clusters and storage (DBFS, Delta tables)
Management | Fully administered by Databricks                 | Administered by the customer via Databricks tooling

Metadata is managed at two distinct levels: table-level via the Delta Transaction Log and platform-level via Unity Catalog.

Delta Transaction Log (Table)

  • Log Directory: Maintains _delta_log/ directory with ordered JSON commit files.
  • Commit Metadata: Records schema, file locations, min/max stats, operation type.
  • Checkpoint Files: Created every 10 commits to optimize reads.
  • Read Flow: Engine reads last checkpoint + subsequent JSON logs.
  • Source of Truth: The log is the table, acting as single source of truth for read/write operations.
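The read flow above can be sketched as log replay. This is an illustrative pure-Python sketch: a plain set stands in for the Parquet checkpoint, JSON strings stand in for the commit files after it, and the add/remove action shapes are simplified versions of the real protocol.

```python
# Illustrative sketch of the Delta read flow: start from the latest
# checkpoint (here, a checkpoint at version 10) and replay the JSON
# commits written after it to derive the live set of data files.
import json

checkpoint_files = {"part-0.parquet", "part-1.parquet"}

tail_commits = [  # stand-ins for 00...011.json and 00...012.json
    json.dumps([{"add": "part-2.parquet"}]),
    json.dumps([{"remove": "part-0.parquet"}, {"add": "part-3.parquet"}]),
]

def table_state(checkpoint, commits):
    """Replay post-checkpoint commits to get the current file set."""
    files = set(checkpoint)
    for commit in commits:
        for action in json.loads(commit):
            if "add" in action:
                files.add(action["add"])
            if "remove" in action:
                files.discard(action["remove"])
    return files

live_files = table_state(checkpoint_files, tail_commits)
```

This is why "the log is the table": the file set, and therefore the query result, is fully determined by replaying the log, and removed files (like part-0 here) linger on storage only until VACUUM reclaims them.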

Unity Catalog (Platform)

  • Regional Metastore: Single metastore per region shared across workspaces.
  • Namespace Hierarchy: Catalog → Schema → Table / View / Volume / Function.
  • Unified Governance: Centralizes control for tables, models, dashboards, volumes, etc.
  • Automated Lineage: Captures column/table lineage automatically without manual configuration.
  • System Tables: access.audit, billing.usage, access.column_lineage.
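The three-level namespace is worth internalizing, since every Unity Catalog object is addressed this way. This illustrative pure-Python sketch resolves a fully qualified name; the catalog/schema/table names used are hypothetical.

```python
# Illustrative sketch of Unity Catalog's three-level namespace:
# a fully qualified name resolves as catalog.schema.object.
def resolve(fqn: str):
    parts = fqn.split(".")
    if len(parts) != 3:
        raise ValueError("expected catalog.schema.object")
    catalog, schema, obj = parts
    return {"catalog": catalog, "schema": schema, "object": obj}

ref = resolve("main.sales.orders")  # hypothetical names
```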

Self-Assessment Questions

Q1. What is the primary purpose of the Bronze zone in Medallion Architecture?

To ingest data as-is from source systems, often retaining history and duplicates, streamlining the loading process to the Silver layer.

Q2. Why does the Lakehouse architecture eliminate the need for Lambda Architecture?

It employs a single processing path (like Apache Spark) capable of handling both batch and streaming workloads incrementally, removing the complexity of maintaining separate codebases.

Q3. In the Databricks Platform Architecture, which plane hosts the compute clusters and the actual Delta tables?

The Data Plane, which resides within the customer's own cloud subscription.

Q4. What is the role of Unity Catalog in a Lakehouse architecture?

It manages metadata at the platform-level, providing a regional metastore, unified governance, a 3-level namespace hierarchy, and automated data lineage.

Q5. What feature of Delta Lake allows for co-locating related data within files to maximize data skipping during queries?

Z-Ordering (and Liquid Clustering as a dynamic alternative).