Comprehensive study notes covering Data Lakehouse concepts, comparing Data Lake vs Warehouse, Medallion & Lambda architectures, open table formats (Delta, Iceberg, UniForm), and Databricks platform fundamentals.
The Lakehouse architecture leverages cloud object storage combined with open standards like Delta Lake to provide ACID transactions and robust governance. This approach supports record-level updates and eliminates the complexity of separate lake and warehouse systems.
Data Lake: Stores raw data in various formats at low cost but lacks structure, governance, and performance for complex queries.
Data Warehouse: Stores structured data optimized for fast querying and BI use cases but can be expensive and inflexible for diverse data types. Traditionally uses schema-on-write and offers ACID transactions.
Lakehouse: Combines the best of both worlds, offering low-cost storage, diverse data types, and strong query performance with robust governance.
Apache Spark: Acts as the primary processing engine, unifying real-time and batch workloads under a single, incremental computational path.
Delta Lake: An open storage layer utilizing Parquet files and a transaction log (stored in _delta_log/) to provide ACID integrity, schema enforcement, and time travel across cloud-native object storage.
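The core idea of the transaction log can be illustrated with a minimal, stdlib-only sketch (not the actual Delta implementation): replaying ordered JSON commit files to derive the current set of live data files. The file names and `add`/`remove` action shapes below are simplified assumptions; real Delta commits also carry `metaData`, `protocol`, and checkpoint actions.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def replay_delta_log(log_dir: Path) -> set:
    """Replay ordered JSON commit files to derive the live data files.
    Illustrative only: handles just simplified 'add' and 'remove' actions."""
    live = set()
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

with TemporaryDirectory() as d:
    log = Path(d) / "_delta_log"
    log.mkdir()
    # Commit 0 adds two Parquet files; commit 1 compacts them into one.
    (log / "00000000000000000000.json").write_text(
        '{"add": {"path": "part-0.parquet"}}\n{"add": {"path": "part-1.parquet"}}'
    )
    (log / "00000000000000000001.json").write_text(
        '{"remove": {"path": "part-0.parquet"}}\n'
        '{"remove": {"path": "part-1.parquet"}}\n'
        '{"add": {"path": "part-2.parquet"}}'
    )
    print(replay_delta_log(log))  # {'part-2.parquet'}
```

Because old commits (and the removed files they reference) are retained rather than rewritten, replaying the log only up to an earlier commit is what makes time travel possible.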
Medallion Architecture uses zones to manage data quality and usability, progressively improving data structure as it flows through the lakehouse.
Source: Data originates directly from databases, files, or streams, in raw form with varying quality and governance.
Bronze: Ingested as-is, usually in Delta format. Contains history and duplicates. The primary goal is streamlined loading.
Silver: Cleaned, with quality rules applied and duplicates removed via merge. Contains data suitable for business decisions.
Gold: Curated and optimized (aggregations, joins) to answer business questions. Tuned for reading and dashboards.
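The progressive refinement across zones can be sketched in plain Python (no Spark) — the record layout and the `order_id` merge key below are hypothetical, chosen only to make the Bronze → Silver → Gold flow concrete:

```python
from collections import defaultdict

# Bronze: ingested as-is, history and duplicates included (hypothetical records).
bronze = [
    {"order_id": 1, "region": "EU", "amount": 10.0},
    {"order_id": 1, "region": "EU", "amount": 10.0},    # duplicate from a replay
    {"order_id": 2, "region": "US", "amount": "25.5"},  # amount arrived as a string
    {"order_id": 3, "region": "EU", "amount": 7.5},
]

# Silver: quality rules applied (type coercion) and duplicates removed by key,
# mimicking what a merge on order_id would achieve.
silver = {}
for rec in bronze:
    silver[rec["order_id"]] = {**rec, "amount": float(rec["amount"])}
silver = list(silver.values())

# Gold: aggregated to answer a business question (revenue per region).
gold = defaultdict(float)
for rec in silver:
    gold[rec["region"]] += rec["amount"]

print(dict(gold))  # {'EU': 17.5, 'US': 25.5}
```

In a real lakehouse each zone would be a Delta table, with Silver typically maintained by a MERGE statement rather than a dictionary.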
Lambda Architecture attempts to balance low-latency and high-throughput processing using two paths. However, a lakehouse simplifies this by employing a single processing path (Spark) for both real-time and batch workflows.
Speed Layer: Processes near real-time data using stream processing (Storm, Flink, Kafka). Results are quick but potentially inaccurate; accuracy is sacrificed for speed.
Batch Layer: Processes data in batches (Hadoop, Spark), handling historical data for comprehensive, accurate computations that eventually correct the speed layer's output.
Serving Layer: Combines and serves results from both the speed and batch layers for a unified view of the data.
Complexity: Maintaining two codebases is complex and expensive.
Data Duplication: Processing data twice increases storage costs.
Inconsistency: Reconciling paths can cause temporary data inconsistencies.
Table Format (Delta Lake): Open-source storage layer that provides ACID transactions and scalable metadata handling for data lakes. It leverages Parquet files coupled with a JSON-based transaction log stored in the _delta_log/ directory.
Schema evolution is supported via the mergeSchema or overwriteSchema options.
Table Format (Apache Iceberg): Engine-agnostic format providing warehouse-grade features on data lakes.
UniForm: Universal Format enabling interoperability across ecosystems.
UniForm is enabled via the table property 'delta.universalFormat.enabledFormats' = 'iceberg'.
Atomicity: Write operations fully succeed or fail.
Consistency: Protects table integrity via schema enforcement.
Isolation: Concurrent readers and writers do not block each other (Optimistic Concurrency Control).
Durability: Permanently records every committed transaction in the Delta Log to survive failures.
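Optimistic concurrency control can be illustrated with a toy sketch (not the actual Delta implementation, which performs finer-grained logical conflict checks): each writer commits against the version it read, and the commit is rejected if another writer committed first.

```python
class OptimisticLog:
    """Toy transaction log: a commit is accepted only if the writer's
    read version is still the latest (optimistic concurrency control)."""
    def __init__(self):
        self.commits = []  # ordered, append-only list of committed actions

    @property
    def version(self):
        return len(self.commits) - 1

    def try_commit(self, read_version, actions):
        if read_version != self.version:
            return False  # conflict: someone else committed since we read
        self.commits.append(actions)
        return True

log = OptimisticLog()
log.try_commit(-1, ["add part-0"])            # first commit -> version 0

v = log.version                               # two writers both read version 0
assert log.try_commit(v, ["add part-1"])      # writer A wins the race
assert not log.try_commit(v, ["add part-2"])  # writer B detects the conflict

# Writer B re-reads the latest version, retries, and succeeds.
assert log.try_commit(log.version, ["add part-2"])
print(log.version)  # 2
```

Readers never block on this: they simply replay the log up to whatever version existed when they started, which is why concurrent readers and writers do not interfere.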
| Aspect | Control Plane | Data Plane |
|---|---|---|
| Hosting | Managed within the Databricks cloud subscription | Resides within the customer's own cloud subscription |
| Components | Includes Web UI, Cluster Management, Workflows, and Notebooks | Consists of compute clusters and storage (DBFS, Delta tables) |
| Management | Fully administered by Databricks | Administered by the customer via Databricks tooling |
Metadata is managed at two distinct levels: table-level via the Delta Transaction Log and platform-level via Unity Catalog.
The transaction log is a _delta_log/ directory containing ordered JSON commit files.

Q1. What is the primary purpose of the Bronze zone in Medallion Architecture?
Q2. Why does the Lakehouse architecture eliminate the need for Lambda Architecture?
Q3. In the Databricks Platform Architecture, which plane hosts the compute clusters and the actual Delta tables?
Q4. What is the role of Unity Catalog in a Lakehouse architecture?
Q5. What feature of Delta Lake allows for co-locating related data within files to maximize data skipping during queries?