Comprehensive study notes covering data engineering fundamentals, data pipelines, architectures (Data Lake vs Data Mesh), databases, formats, and orchestration.
Author: Michaël Bettan
01
Definition & Core Concepts
Definition
Data engineering enables data-driven decision making by collecting, transforming, and visualizing data. A data engineer designs, builds, maintains, and troubleshoots data processing systems with a particular emphasis on the security, reliability, fault-tolerance, scalability, fidelity, and efficiency of such systems.
Big Data's 3 Vs
Three characteristics distinguish big data from traditional data:
Volume: the massive amount of data
Velocity: the speed at which data is generated and processed
Variety: the different formats data comes in (structured, semi-structured, unstructured)
Use Cases
Build/maintain data structures and databases
Design data processing systems
Analyze data and enable machine learning
Design for reliability
Visualize data and advocate policy
Model business processes for analysis
Design for security and compliance
Data Engineers' Role
Build data pipelines to enable data-driven decisions:
Get the data to where it can be useful
Get the data into a usable condition
Add new value to the data
Manage the data
Productionize data processes
Understand how to manipulate data formats, scale data systems, and enforce data quality and security.
Durability
Ensures that data remains intact and persistent, even in the event of system failures such as hardware crashes or power outages. It guarantees data integrity and the ability to recover information, safeguarding it from loss.
Availability
Guarantees that data is always accessible when needed. Systems are built with redundancy and fault tolerance to minimize downtime, ensuring continuous and reliable access to information.
Low Latency
Minimal delay between a data request and its response. Low latency is critical for real-time or near real-time data processing, especially in time-sensitive applications like online gaming and financial transactions.
Types of data structures
Structured data has a fixed schema (such as RDBMS)
Semi-structured data has a flexible schema (can vary between records)
Unstructured data does not have a structure used to determine how to store data → text, video, images, etc.
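The difference between structured and semi-structured data can be sketched with a few JSON records; the field names below are illustrative only:

```python
import json

# Structured: fixed schema, every record has the same fields (like an RDBMS row).
structured_rows = [
    {"id": 1, "name": "Ada", "country": "UK"},
    {"id": 2, "name": "Grace", "country": "US"},
]

# Semi-structured: flexible schema, fields can vary from record to record.
semi_structured = json.loads("""
[
  {"id": 1, "name": "Ada", "tags": ["math", "computing"]},
  {"id": 2, "name": "Grace", "rank": "Rear Admiral"}
]
""")

# Because keys can differ, readers must handle missing fields explicitly.
for rec in semi_structured:
    print(rec.get("tags", "no tags"))
```

Unstructured data (text, video, images) would carry no field structure at all, so it is typically stored as raw objects rather than parsed records.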
02
OLTP vs OLAP & Storage
OLTP (Online Transaction Processing)
Focuses on high-volume, short transactions, prioritizing data integrity and consistency (e.g., banking transactions, e-commerce orders). ACID properties are crucial. Normalized schemas are common.
OLAP (Online Analytical Processing)
Designed for complex queries, aggregations, and analysis over large datasets. Speed and flexibility for analytical queries are prioritized. Denormalized schemas (star/snowflake) are common.
Row-level vs Column-level
RDBMS use row-based storage: all the fields (columns) for a given row are stored together. In contrast, columnar storage stores the data for each column independently.
Efficient Querying: Since data in each column is stored separately, queries that access only specific columns can be executed much faster because BigQuery doesn't need to read entire rows into memory—just the relevant columns. For example, if you have a table with 10 columns but your query only references 2 of them, BigQuery only reads and processes those 2 columns, which reduces the amount of data read from storage.
Better Compression: Columns often have repeating or similar values (like IDs, dates, etc.), making them highly compressible. By storing each column separately, BigQuery can apply more efficient compression algorithms, further reducing storage costs.
Optimized for Analytics: Columnar storage is particularly effective for analytical workloads where large datasets are scanned for a few fields (columns) to produce reports or insights.
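The row vs. column trade-off can be illustrated with a toy in-memory layout (a dict of lists standing in for real columnar files); the table and values are made up:

```python
# Row storage: each record keeps all of its fields together.
row_store = [
    {"id": 1, "country": "FR", "amount": 10.0, "channel": "web"},
    {"id": 2, "country": "FR", "amount": 25.5, "channel": "app"},
    {"id": 3, "country": "US", "amount": 7.25, "channel": "web"},
]

# Columnar storage: each column is stored as its own contiguous list.
column_store = {
    "id": [1, 2, 3],
    "country": ["FR", "FR", "US"],
    "amount": [10.0, 25.5, 7.25],
    "channel": ["web", "app", "web"],
}

# An analytical query touching 2 of 4 columns only scans those two lists;
# the "id" and "channel" columns are never read.
total_fr = sum(
    amount
    for country, amount in zip(column_store["country"], column_store["amount"])
    if country == "FR"
)
print(total_fr)  # 35.5
```

This is the same reason the repeating values within a single column (all those "FR" strings) compress so well.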
RDBMS & ACID
A Relational Database Management System (RDBMS) allows you to create, update, and manage a relational database, and ensures data integrity through ACID properties:
Atomicity: All changes in a transaction are treated as a single unit.
Consistency: Data remains consistent before and after a transaction.
Isolation: Multiple transactions don't interfere with each other.
Durability: Once a transaction is committed, the changes are permanent.
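Atomicity can be demonstrated with Python's built-in sqlite3 module (the account names and amounts are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

# Atomicity: both updates succeed together, or neither is applied.
try:
    with conn:  # commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass  # the CHECK constraint fired, so the whole transfer was rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50} -- unchanged
```

The failed debit would have left alice with a negative balance, so the entire transaction is undone rather than half-applied.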
Strong consistency: every read reflects the most recent committed write, and all clients observe operations in the same order.
Eventual consistency: parallel processes can see changes in different orders or temporarily inconsistent states, but over time, the system guarantees that all processes will converge on the same final state.
03
Data Architectures & Formats
Data Lake
Suitable for organizations with a strong central IT team and a need for a cost-effective solution to store large volumes of raw data. Best when exploring data and use cases aren't clearly defined.
Brings together data from across the enterprise into a single and consolidated location → Cloud Storage
Solves for these pain points:
Getting insights across multiple datasets is difficult
Data is often siloed in many upstream source systems.
Data must be cleaned, formatted, and prepared before it can yield useful business insights in a data warehouse
Key considerations: Can it handle all types of data? Can it scale? High-throughput? Fine-grained access control? Easy tool connection?
Data Mesh
Ideal for large, complex organizations with diverse data domains and a need for greater agility and data ownership. Requires a significant cultural shift and investment in training and infrastructure.
Promotes distributed data ownership while centralizing governance and data discoverability. It shifts the responsibility for managing, producing, and consuming data to domain-specific teams.
Domain-Oriented Data Ownership: Each domain owns and manages its data.
Data as a Product: Useful, documented, and accessible via APIs.
Table Format: Apache Iceberg
Schema Evolution: Add/remove/modify columns without rewriting the dataset.
Hidden Partitioning: Manages partitioning efficiently behind the scenes.
Time Travel: Tracks table history, allowing queries of past snapshots.
Data is stored in Parquet/Avro/ORC files; metadata tracks schema, file locations, partitioning, and snapshots.
Table Formats: Hive & Delta Lake
Hive: Data warehouse system on Hadoop for querying large datasets (batch, SQL-like).
ACID: Limited support (Hive ACID tables/ORC).
Schema Evolution: Supported.
Partitioning: Directory-based.
Delta Lake: Open-source storage layer providing ACID transactions for data lakes (often with Spark).
ACID Transactions: Full support.
Schema Evolution: Enforcement & evolution.
Time Travel & Partitioning supported.
Data Serialization
The process of transforming a complex data structure into a format that can be easily stored or transmitted and then reconstructed later. (e.g., JSON, XML, CSV, Protocol Buffer).
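A serialization round trip looks like this with JSON (the record contents are illustrative):

```python
import json

record = {"user_id": 42, "events": [{"type": "click", "ts": 1700000000}]}

# Serialize: an in-memory structure becomes flat text that can be
# stored in a file or sent over the network.
payload = json.dumps(record)

# Deserialize: the receiving side reconstructs the original structure.
restored = json.loads(payload)
assert restored == record
```

Binary formats such as Protocol Buffers follow the same round-trip pattern but produce a compact byte representation instead of text.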
Protocol Buffer (protobuf)
Mechanism for serializing structured data in a language- and platform-neutral way. Compact, fast. Often used for microservices communication and defining API contracts.
04
Data Warehouses & Pipelines
Data Warehouse
Stores transformed data in a usable condition for business insights (e.g., BigQuery, Snowflake).
RDBMS are optimized for high-throughput WRITES to RECORDS; Cloud SQL databases use RECORD-based (row) storage.
BigQuery uses COLUMN-based storage, which allows for very wide reporting schemas.
Upstream processes: Data acquisition, integration, preparation.
Downstream processes: Data consumption, analysis, reporting.
Collect: raw data, such as streaming data from devices, on-premises batch data, application logs, or mobile-app user events and analytics
Store: the data in a format that is durable and accessible into a data lake or data warehouse
Process or enrich: to make the data useful for the downstream systems for analysis
Analyze: data exploration, ad-hoc querying and analysis
Empower: data visualization, sharing and activation to derive business value and insights from data
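The collect → store → process → analyze flow above can be sketched as a chain of stage functions; the function names and the log format are hypothetical, not a real API:

```python
def collect():
    # Raw data, e.g. application log lines as they arrive.
    return ["2024-01-01 login alice", "2024-01-01 login bob", "2024-01-02 login alice"]

def store(raw):
    # Persist raw data unchanged (here: simply keep a copy in memory).
    return list(raw)

def process(stored):
    # Enrich: parse raw lines into structured records for downstream systems.
    return [dict(zip(("date", "event", "user"), line.split())) for line in stored]

def analyze(records):
    # Ad-hoc aggregation: count logins per user.
    counts = {}
    for r in records:
        counts[r["user"]] = counts.get(r["user"], 0) + 1
    return counts

print(analyze(process(store(collect()))))  # {'alice': 2, 'bob': 1}
```

The "empower" stage would then visualize or share the aggregated result.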
Collect mechanisms
Application: push model (Cloud Logging, Pub/Sub, Cloud SQL, Spanner)
Streaming: continuous stream of small, asynchronous messages (IoT telemetry, user events → Pub/Sub)
Batching: large amounts transferred in bulk (scientific workloads, backups → Storage Transfer Service)
Data Pipeline patterns
Needs additional processing? → Data processing (Dataproc or Dataflow)
Arrives continuously? → Data streaming (Pub/Sub + Dataflow)
ETL (Extract, Transform, Load): Data is loaded into a database after transformation.
ELT (Extract, Load, Transform): Data is loaded into a database before transformation.
EL (Extract, Load): No transformation needed.
CDC (Change Data Capture): Captures changes over time in source system.
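A minimal ETL sketch using Python's built-in sqlite3 as the target; the source rows and table name are invented for illustration:

```python
import sqlite3

# Extract: pull raw rows from a source (a list stands in for an upstream system).
raw = [("alice", " 10.5 "), ("bob", "7"), ("alice", "3.25")]

# Transform: clean types BEFORE loading -- the "T" happens outside the target
# database (in ELT, this step would instead run inside the warehouse with SQL).
clean = [(name, float(amount)) for name, amount in raw]

# Load: write the transformed rows into the target database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (name TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)", clean)

total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 20.75
```

Swapping the order of the transform and load steps turns this ETL sketch into ELT; dropping the transform entirely gives EL.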
Batch vs Streaming
Batch: For bounded, finite datasets, delayed processing. Choose if high throughput > low latency, data is finite, complex transforms needed.
Streaming: For unbounded datasets, continuous updates, immediate processing. Uses message queue. Choose if low latency is crucial (fraud detection), data continuous, immediate feedback required. Main challenge: out of order data.
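The out-of-order challenge is commonly handled by buffering events and emitting only those older than a watermark that lags the newest event time seen. A minimal sketch, assuming a fixed allowed lateness (the threshold and event values are made up):

```python
import heapq

ALLOWED_LATENESS = 5  # how many seconds late an event may arrive

buffer, max_ts, emitted = [], 0, []

def on_event(ts, value):
    """Buffer an event, then flush everything at or before the watermark."""
    global max_ts
    heapq.heappush(buffer, (ts, value))
    max_ts = max(max_ts, ts)
    watermark = max_ts - ALLOWED_LATENESS
    while buffer and buffer[0][0] <= watermark:
        emitted.append(heapq.heappop(buffer))  # emitted in event-time order

# (2, "c") arrives after (3, "b") -- out of order in processing time.
for ts, v in [(1, "a"), (3, "b"), (2, "c"), (10, "d")]:
    on_event(ts, v)

print(emitted)  # [(1, 'a'), (2, 'c'), (3, 'b')] -- reordered by event time
```

Real engines (e.g., Dataflow) use the same watermark idea, trading latency (how long to wait) against completeness (how many late events are captured).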
05
Databases & Modeling
Feature        Relational                   Document                           Wide-Column
Data Model     Table                        JSON/BSON document                 Sparse multidimensional array
Schema         Fixed                        Flexible                           Flexible within column families
Scalability    Vertical                     Horizontal                         Horizontal
Querying       SQL                          Specific query languages           Specific query languages
Transactions   ACID                         Limited                            Limited
Use Cases      Transactional systems,       Content management, e-commerce,    Time-series data, analytics,
               structured data              user profiles                      large datasets
Star Schema
Enterprise data warehouse (EDW) schema that organizes data into fact and dimension tables.
Fact Table: Contains quantitative data (facts) like sales or revenue, plus foreign keys to dimension tables.
Dimension Table: Contains descriptive attributes (e.g., product, customer, date) referenced by the fact table.
Common pitfalls:
Lack of Foreign Keys: leads to data integrity issues.
Generic Primary Keys: auto-incrementing keys can obscure relationships.
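A tiny star schema and a typical fact-to-dimension join, using Python's built-in sqlite3 (the table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Dimension table: descriptive attributes.
db.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
# Fact table: quantitative measures plus a foreign key to the dimension.
db.execute("""CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    revenue REAL)""")

db.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "book"), (2, "toy")])
db.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 9.0), (1, 11.0), (2, 5.0)])

# Typical star-schema query: join facts to a dimension and aggregate.
rows = db.execute("""
    SELECT d.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
print(rows)  # [('book', 20.0), ('toy', 5.0)]
```

Without the foreign key linking fact to dimension, nothing would stop a fact row from referencing a product that does not exist, which is the integrity issue noted above.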
06
Advanced Concepts & Tooling
Data marts
Siloed copies of a subset of the data in a data warehouse, used to offload analytic activity and preserve performance. Serves the needs of a specific division.
Time-series data
Sequence of data points indexed in time order (CPU utilization, stock pricing).
Data orchestration tools
Tools such as Apache Airflow focus on orchestration only and structure the flow of data as a series of sequential, dependent tasks.
DataOps
Brings together data engineers, scientists, analysts to apply agile best practices to the data lifecycle.
OLAP cube
Data structure for fast analysis according to multiple dimensions.
Data Wrangling
Process of cleaning, transforming, and preparing data for analysis (handling missing values, etc.).
Data Transformation
Denormalization: Adding precomputed redundant data to improve read performance (can increase storage costs).
Normalization: Removing redundancy (better for integrity, slower queries).
Standardizing: converting data types.
Cleansing: resolving inaccuracies.
Mapping: combining elements.
Deduplication: eliminating duplicate copies.
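Two of the steps above, cleansing and deduplication, sketched in plain Python (the records and field names are invented for the example):

```python
raw = [
    {"email": " Ada@Example.com ", "age": "36"},
    {"email": "ada@example.com", "age": "36"},     # duplicate once cleansed
    {"email": "grace@example.com", "age": "n/a"},  # inaccurate value to resolve
]

def cleanse(rec):
    return {
        "email": rec["email"].strip().lower(),  # standardize formatting
        "age": int(rec["age"]) if rec["age"].isdigit() else None,  # resolve bad data
    }

# Deduplicate on the cleansed record, keeping first occurrence.
seen, deduped = set(), []
for rec in map(cleanse, raw):
    key = (rec["email"], rec["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(rec)

print(deduped)
# [{'email': 'ada@example.com', 'age': 36}, {'email': 'grace@example.com', 'age': None}]
```

Note that cleansing must run before deduplication here; the first two raw records only become identical after standardization.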
Streaming semantics
Ensuring processing is accurate/reliable despite failures.
Exactly-Once: Each record processed exactly one time (difficult to achieve).
At-Least-Once: Processed at least once (can lead to duplicates, requires idempotence).
At-Most-Once: Processed at most once (weakest, records may be lost).
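At-least-once delivery plus an idempotent consumer yields effectively-once processing. A minimal sketch, deduplicating on a message id (the message shape is invented):

```python
processed_ids, balance = set(), 0

def handle(msg):
    """Idempotent handler: reprocessing the same message changes nothing."""
    global balance
    if msg["id"] in processed_ids:
        return  # duplicate delivery -- already applied
    processed_ids.add(msg["id"])
    balance += msg["amount"]

stream = [
    {"id": "m1", "amount": 10},
    {"id": "m2", "amount": 5},
    {"id": "m1", "amount": 10},  # redelivered after a failure/timeout
]
for msg in stream:
    handle(msg)

print(balance)  # 15, not 25 -- the duplicate was ignored
```

Without the id check, the redelivered message would be double-counted, which is exactly the risk of at-least-once semantics.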
Distributed Systems Concepts
Hot Spotting: A specific partition receiving a disproportionately high volume of requests (access patterns).
Data Skew: Uneven distribution of data across partitions.
Shuffling: Data redistribution across nodes based on a key (network-intensive, e.g., Spark joins).
Data lineage: Tracing and visualizing the flow of data from origin to destination.
Data tiering: Classifying data (hot/cold) to optimize storage costs in data lakes/warehouses.
Idempotence: Operation can be applied multiple times without changing the result (vital for retries).
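Hot spotting and data skew can be seen in a toy hash-partitioning sketch: even though hashing spreads keys evenly, a skewed key distribution still overloads one partition (the key names and partition count are made up):

```python
from collections import Counter

NUM_PARTITIONS = 4

def partition_for(key):
    # Hash partitioning: route each key to a partition by its hash.
    return hash(key) % NUM_PARTITIONS

# 90% of events share one key (e.g. a single very popular user) -> data skew.
events = ["user_hot"] * 90 + [f"user_{i}" for i in range(10)]

load = Counter(partition_for(k) for k in events)
print(load.most_common(1))  # the partition holding "user_hot" dominates
```

No partitioning function fixes this on its own, since all 90 identical keys must land on the same partition; mitigations such as salting the hot key split it across partitions at the cost of extra aggregation.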
Data Build Tool (DBT)
SQL-first transformation workflow following software engineering best practices.
Pros: Strong community, excels at complex logic, modularity, wide support.
Cons: Steeper learning curve, less focus on orchestration, limited built-in testing.
Resources & Assets
Video Content
Watch the Data Engineering 101 architecture blueprint video on YouTube.