Comprehensive study notes covering data engineering fundamentals, data pipelines, architectures (Data Lake vs Data Mesh), databases, formats, and orchestration.
Author: Michaël Bettan
01
Definition & Core Concepts
Definition
Data engineering enables data-driven decision making by collecting, transforming, and visualizing data. A data engineer designs, builds, maintains, and troubleshoots data processing systems with a particular emphasis on the security, reliability, fault-tolerance, scalability, fidelity, and efficiency of such systems.
Big Data's 3 Vs
Three characteristics distinguish big data from traditional data:
Volume: the massive amount of data
Velocity: the speed at which data is generated and processed
Variety: the different formats data comes in (structured, semi-structured, unstructured)
Use Cases
Build/maintain data structures and databases
Design data processing systems
Analyze data and enable machine learning
Design for reliability
Visualize data and advocate policy
Model business processes for analysis
Design for security and compliance
Data Engineers' Role
Build data pipelines to enable data-driven decisions:
Get the data to where it can be useful
Get the data into a usable condition
Add new value to the data
Manage the data
Productionize data processes
Understand how to manipulate data formats, scale data systems, and enforce data quality and security.
Durability
Ensures that data remains intact and persistent, even in the event of system failures such as hardware crashes or power outages. It guarantees data integrity and the ability to recover information, safeguarding it from loss.
Availability
Guarantees that data is always accessible when needed. Systems are built with redundancy and fault tolerance to minimize downtime, ensuring continuous and reliable access to information.
Low Latency
Minimal delay between a data request and its response. Low latency is critical for real-time or near real-time data processing, especially in time-sensitive applications like online gaming and financial transactions.
Types of data structures
Structured data has a fixed schema (such as RDBMS)
Semi-structured data has a flexible schema (can vary between records)
Unstructured data does not have a structure used to determine how to store data → text, video, images, etc.
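The difference between structured and semi-structured data can be sketched with a few JSON records; the field names below are illustrative only:

```python
import json

# Structured: fixed schema, every record has the same fields (like an RDBMS row).
structured_rows = [
    {"id": 1, "name": "Ada", "country": "UK"},
    {"id": 2, "name": "Grace", "country": "US"},
]

# Semi-structured: flexible schema, fields can vary from record to record.
semi_structured = json.loads("""
[
  {"id": 1, "name": "Ada", "tags": ["math", "computing"]},
  {"id": 2, "name": "Grace", "rank": "Rear Admiral"}
]
""")

# Because keys can differ, readers must handle missing fields explicitly.
for rec in semi_structured:
    print(rec.get("tags", "no tags"))
```

Unstructured data (text, video, images) would carry no field structure at all, so it is typically stored as raw objects rather than parsed records.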
02
OLTP vs OLAP & Storage
OLTP (Online Transaction Processing)
Focuses on high-volume, short transactions, prioritizing data integrity and consistency (e.g., banking transactions, e-commerce orders). ACID properties are crucial. Normalized schemas are common.
OLAP (Online Analytical Processing)
Designed for complex queries, aggregations, and analysis over large datasets. Speed and flexibility for analytical queries are prioritized. Denormalized schemas (star/snowflake) are common.
Row-level vs Column-level
RDBMS use row-based storage: all the fields (columns) for a given row are stored together. In contrast, columnar storage stores the data for each column independently.
Efficient Querying: Since data in each column is stored separately, queries that access only specific columns can be executed much faster because BigQuery doesn't need to read entire rows into memory—just the relevant columns. For example, if you have a table with 10 columns but your query only references 2 of them, BigQuery only reads and processes those 2 columns, which reduces the amount of data read from storage.
Better Compression: Columns often have repeating or similar values (like IDs, dates, etc.), making them highly compressible. By storing each column separately, BigQuery can apply more efficient compression algorithms, further reducing storage costs.
Optimized for Analytics: Columnar storage is particularly effective for analytical workloads where large datasets are scanned for a few fields (columns) to produce reports or insights.
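The row vs. column trade-off can be illustrated with a toy in-memory layout (a dict of lists standing in for real columnar files); the table and values are made up:

```python
# Row storage: each record keeps all of its fields together.
row_store = [
    {"id": 1, "country": "FR", "amount": 10.0, "channel": "web"},
    {"id": 2, "country": "FR", "amount": 25.5, "channel": "app"},
    {"id": 3, "country": "US", "amount": 7.25, "channel": "web"},
]

# Columnar storage: each column is stored as its own contiguous list.
column_store = {
    "id": [1, 2, 3],
    "country": ["FR", "FR", "US"],
    "amount": [10.0, 25.5, 7.25],
    "channel": ["web", "app", "web"],
}

# An analytical query touching 2 of 4 columns only scans those two lists;
# the "id" and "channel" columns are never read.
total_fr = sum(
    amount
    for country, amount in zip(column_store["country"], column_store["amount"])
    if country == "FR"
)
print(total_fr)  # 35.5
```

This is the same reason the repeating values within a single column (all those "FR" strings) compress so well.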
RDBMS & ACID
A Relational Database Management System (RDBMS) allows you to create, update, and manage a relational database, and ensures data integrity through ACID properties:
Atomicity: All changes in a transaction are treated as a single unit.
Consistency: Data remains consistent before and after a transaction.
Isolation: Multiple transactions don't interfere with each other.
Durability: Once a transaction is committed, the changes are permanent.
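Atomicity can be demonstrated with Python's built-in sqlite3 module (the account names and amounts are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

# Atomicity: both updates succeed together, or neither is applied.
try:
    with conn:  # commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass  # the CHECK constraint fired, so the whole transfer was rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50} -- unchanged
```

The failed debit would have left alice with a negative balance, so the entire transaction is undone rather than half-applied.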
Strong consistency: every read reflects the most recent committed write, and all clients observe operations in the same order.
Eventual consistency: parallel processes can see changes in different orders or temporarily inconsistent states, but over time, the system guarantees that all processes will converge on the same final state.
03
Data Architectures & Formats
Data Lake
Suitable for organizations with a strong central IT team and a need for a cost-effective solution to store large volumes of raw data. Best when exploring data and use cases aren't clearly defined.
Brings together data from across the enterprise into a single and consolidated location → Cloud Storage
Solves for these pain points:
Getting insights across multiple datasets is difficult
Data is often siloed in many upstream source systems.
Data must be cleaned, formatted, and prepared before it can yield useful business insights in a data warehouse
Key considerations: Can it handle all types of data? Can it scale? High-throughput? Fine-grained access control? Easy tool connection?
Data Mesh
Ideal for large, complex organizations with diverse data domains and a need for greater agility and data ownership. Requires a significant cultural shift and investment in training and infrastructure.
Promotes distributed data ownership while centralizing governance and data discoverability. It shifts the responsibility for managing, producing, and consuming data to domain-specific teams.
Domain-Oriented Data Ownership: Each domain owns and manages its data.
Data as a Product: Useful, documented, and accessible via APIs.
Table Format: Apache Iceberg
Schema Evolution: Add/remove/modify columns without rewriting the dataset.
Hidden Partitioning: Manages partitioning efficiently behind the scenes.
Time Travel: Tracks table history, allowing queries of past snapshots.
Data is stored in Parquet/Avro/ORC files; metadata tracks schema, file locations, partitioning, and snapshots.
Table Formats: Hive & Delta Lake
Hive: Data warehouse system on Hadoop for querying large datasets (batch, SQL-like).
ACID: Limited support (Hive ACID tables/ORC).
Schema Evolution: Supported.
Partitioning: Directory-based.
Delta Lake: Open-source storage layer providing ACID transactions for data lakes (often with Spark).
ACID Transactions: Full support.
Schema Evolution: Enforcement & evolution.
Time Travel & Partitioning supported.
Data Serialization
The process of transforming a complex data structure into a format that can be easily stored or transmitted and then reconstructed later. (e.g., JSON, XML, CSV, Protocol Buffer).
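A serialization round trip looks like this with JSON (the record contents are illustrative):

```python
import json

record = {"user_id": 42, "events": [{"type": "click", "ts": 1700000000}]}

# Serialize: an in-memory structure becomes flat text that can be
# stored in a file or sent over the network.
payload = json.dumps(record)

# Deserialize: the receiving side reconstructs the original structure.
restored = json.loads(payload)
assert restored == record
```

Binary formats such as Protocol Buffers follow the same round-trip pattern but produce a compact byte representation instead of text.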
Protocol Buffer (protobuf)
Mechanism for serializing structured data in a language- and platform-neutral way. Compact, fast. Often used for microservices communication and defining API contracts.
04
Data Warehouses & Pipelines
Data Warehouse
Stores transformed data in a usable condition for business insights (e.g., BigQuery, Snowflake).
RDBMS are optimized for high-throughput WRITES to RECORDS; Cloud SQL databases use RECORD-based (row) storage.
BigQuery uses COLUMN-based storage, which allows for very wide reporting schemas.
Upstream processes: Data acquisition, integration, preparation.
Downstream processes: Data consumption, analysis, reporting.
Collect: raw data, such as streaming data from devices, on-premises batch data, application logs, or mobile-app user events and analytics
Store: the data in a format that is durable and accessible into a data lake or data warehouse
Process or enrich: to make the data useful for the downstream systems for analysis
Analyze: data exploration, ad-hoc querying and analysis
Empower: data visualization, sharing and activation to derive business value and insights from data
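The collect → store → process → analyze flow above can be sketched as a chain of stage functions; the function names and the log format are hypothetical, not a real API:

```python
def collect():
    # Raw data, e.g. application log lines as they arrive.
    return ["2024-01-01 login alice", "2024-01-01 login bob", "2024-01-02 login alice"]

def store(raw):
    # Persist raw data unchanged (here: simply keep a copy in memory).
    return list(raw)

def process(stored):
    # Enrich: parse raw lines into structured records for downstream systems.
    return [dict(zip(("date", "event", "user"), line.split())) for line in stored]

def analyze(records):
    # Ad-hoc aggregation: count logins per user.
    counts = {}
    for r in records:
        counts[r["user"]] = counts.get(r["user"], 0) + 1
    return counts

print(analyze(process(store(collect()))))  # {'alice': 2, 'bob': 1}
```

The "empower" stage would then visualize or share the aggregated result.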
Collect mechanisms
Application: push model (Cloud Logging, Pub/Sub, Cloud SQL, Spanner)
Streaming: continuous stream of small, asynchronous messages (IoT telemetry, user events → Pub/Sub)
Batching: large amounts transferred in bulk (scientific workloads, backups → Storage Transfer Service)
Data Pipeline patterns
Needs additional processing? → Data processing (Dataproc or Dataflow)
Arrives continuously? → Data streaming (Pub/Sub + Dataflow)
ETL (Extract, Transform, Load): Data is loaded into a database after transformation.
ELT (Extract, Load, Transform): Data is loaded into a database before transformation.
EL (Extract, Load): No transformation needed.
CDC (Change Data Capture): Captures changes over time in source system.
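A minimal ETL sketch using Python's built-in sqlite3 as the target; the source rows and table name are invented for illustration:

```python
import sqlite3

# Extract: pull raw rows from a source (a list stands in for an upstream system).
raw = [("alice", " 10.5 "), ("bob", "7"), ("alice", "3.25")]

# Transform: clean types BEFORE loading -- the "T" happens outside the target
# database (in ELT, this step would instead run inside the warehouse with SQL).
clean = [(name, float(amount)) for name, amount in raw]

# Load: write the transformed rows into the target database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (name TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)", clean)

total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 20.75
```

Swapping the order of the transform and load steps turns this ETL sketch into ELT; dropping the transform entirely gives EL.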
Batch vs Streaming
Batch: For bounded, finite datasets, delayed processing. Choose if high throughput > low latency, data is finite, complex transforms needed.
Streaming: For unbounded datasets, continuous updates, immediate processing. Uses message queue. Choose if low latency is crucial (fraud detection), data continuous, immediate feedback required. Main challenge: out of order data.
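The out-of-order challenge is commonly handled by buffering events and emitting only those older than a watermark that lags the newest event time seen. A minimal sketch, assuming a fixed allowed lateness (the threshold and event values are made up):

```python
import heapq

ALLOWED_LATENESS = 5  # how many seconds late an event may arrive

buffer, max_ts, emitted = [], 0, []

def on_event(ts, value):
    """Buffer an event, then flush everything at or before the watermark."""
    global max_ts
    heapq.heappush(buffer, (ts, value))
    max_ts = max(max_ts, ts)
    watermark = max_ts - ALLOWED_LATENESS
    while buffer and buffer[0][0] <= watermark:
        emitted.append(heapq.heappop(buffer))  # emitted in event-time order

# (2, "c") arrives after (3, "b") -- out of order in processing time.
for ts, v in [(1, "a"), (3, "b"), (2, "c"), (10, "d")]:
    on_event(ts, v)

print(emitted)  # [(1, 'a'), (2, 'c'), (3, 'b')] -- reordered by event time
```

Real engines (e.g., Dataflow) use the same watermark idea, trading latency (how long to wait) against completeness (how many late events are captured).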
05
Databases & Modeling
Feature        Relational                   Document                           Wide-Column
Data Model     Table                        JSON/BSON document                 Sparse multidimensional array
Schema         Fixed                        Flexible                           Flexible within column families
Scalability    Vertical                     Horizontal                         Horizontal
Querying       SQL                          Specific query languages           Specific query languages
Transactions   ACID                         Limited                            Limited
Use Cases      Transactional systems,       Content management, e-commerce,    Time-series data, analytics,
               structured data              user profiles                      large datasets
Star Schema
Enterprise data warehouse (EDW) schema that organizes data into fact and dimension tables.
Fact Table: Contains quantitative data (facts) like sales or revenue, plus foreign keys to dimension tables.
Dimension Table: Contains descriptive attributes (e.g., product, customer, date) referenced by the fact table.
Common pitfalls:
Lack of Foreign Keys: leads to data integrity issues.
Generic Primary Keys: auto-incrementing keys can obscure relationships.
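A tiny star schema and a typical fact-to-dimension join, using Python's built-in sqlite3 (the table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Dimension table: descriptive attributes.
db.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
# Fact table: quantitative measures plus a foreign key to the dimension.
db.execute("""CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    revenue REAL)""")

db.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "book"), (2, "toy")])
db.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 9.0), (1, 11.0), (2, 5.0)])

# Typical star-schema query: join facts to a dimension and aggregate.
rows = db.execute("""
    SELECT d.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
print(rows)  # [('book', 20.0), ('toy', 5.0)]
```

Without the foreign key linking fact to dimension, nothing would stop a fact row from referencing a product that does not exist, which is the integrity issue noted above.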
06
Advanced Concepts & Tooling
Data marts
Siloed copies of a subset of the data in a data warehouse, used to offload analytic activity and preserve performance. Serves the needs of a specific division.
Time-series data
Sequence of data points indexed in time order (CPU utilization, stock pricing).
Data orchestration tools
Tools such as Apache Airflow focus on orchestration only and structure the flow of data as a series of sequential, dependent tasks.
DataOps
Brings together data engineers, scientists, analysts to apply agile best practices to the data lifecycle.
OLAP cube
Data structure for fast analysis according to multiple dimensions.
Data Wrangling
Process of cleaning, transforming, and preparing data for analysis (handling missing values, etc.).
Data Transformation
Denormalization: Adding precomputed redundant data to improve read performance (can increase storage costs).
Normalization: Removing redundancy (better for integrity, slower queries).
Standardizing: converting data types.
Cleansing: resolving inaccuracies.
Mapping: combining elements.
Deduplication: eliminating duplicate copies.
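Two of the steps above, cleansing and deduplication, sketched in plain Python (the records and field names are invented for the example):

```python
raw = [
    {"email": " Ada@Example.com ", "age": "36"},
    {"email": "ada@example.com", "age": "36"},     # duplicate once cleansed
    {"email": "grace@example.com", "age": "n/a"},  # inaccurate value to resolve
]

def cleanse(rec):
    return {
        "email": rec["email"].strip().lower(),  # standardize formatting
        "age": int(rec["age"]) if rec["age"].isdigit() else None,  # resolve bad data
    }

# Deduplicate on the cleansed record, keeping first occurrence.
seen, deduped = set(), []
for rec in map(cleanse, raw):
    key = (rec["email"], rec["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(rec)

print(deduped)
# [{'email': 'ada@example.com', 'age': 36}, {'email': 'grace@example.com', 'age': None}]
```

Note that cleansing must run before deduplication here; the first two raw records only become identical after standardization.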
Streaming semantics
Ensuring processing is accurate/reliable despite failures.
Exactly-Once: Each record processed exactly one time (difficult to achieve).
At-Least-Once: Processed at least once (can lead to duplicates, requires idempotence).
At-Most-Once: Processed at most once (weakest, records may be lost).
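At-least-once delivery plus an idempotent consumer yields effectively-once processing. A minimal sketch, deduplicating on a message id (the message shape is invented):

```python
processed_ids, balance = set(), 0

def handle(msg):
    """Idempotent handler: reprocessing the same message changes nothing."""
    global balance
    if msg["id"] in processed_ids:
        return  # duplicate delivery -- already applied
    processed_ids.add(msg["id"])
    balance += msg["amount"]

stream = [
    {"id": "m1", "amount": 10},
    {"id": "m2", "amount": 5},
    {"id": "m1", "amount": 10},  # redelivered after a failure/timeout
]
for msg in stream:
    handle(msg)

print(balance)  # 15, not 25 -- the duplicate was ignored
```

Without the id check, the redelivered message would be double-counted, which is exactly the risk of at-least-once semantics.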
Distributed Systems Concepts
Hot Spotting: A specific partition receiving a disproportionately high volume of requests (access patterns).
Data Skew: Uneven distribution of data across partitions.
Shuffling: Data redistribution across nodes based on a key (network-intensive, e.g., Spark joins).
Data lineage: Tracing and visualizing the flow of data from origin to destination.
Data tiering: Classifying data (hot/cold) to optimize storage costs in data lakes/warehouses.
Idempotence: Operation can be applied multiple times without changing the result (vital for retries).
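Hot spotting and data skew can be seen in a toy hash-partitioning sketch: even though hashing spreads keys evenly, a skewed key distribution still overloads one partition (the key names and partition count are made up):

```python
from collections import Counter

NUM_PARTITIONS = 4

def partition_for(key):
    # Hash partitioning: route each key to a partition by its hash.
    return hash(key) % NUM_PARTITIONS

# 90% of events share one key (e.g. a single very popular user) -> data skew.
events = ["user_hot"] * 90 + [f"user_{i}" for i in range(10)]

load = Counter(partition_for(k) for k in events)
print(load.most_common(1))  # the partition holding "user_hot" dominates
```

No partitioning function fixes this on its own, since all 90 identical keys must land on the same partition; mitigations such as salting the hot key split it across partitions at the cost of extra aggregation.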
Data Build Tool (DBT)
SQL-first transformation workflow following software engineering best practices.
Pros: Strong community, excels at complex logic, modularity, wide support.
Cons: Steeper learning curve, less focus on orchestration, limited built-in testing.
Resources & Assets
Video Content
Watch the Data Engineering 101 architecture blueprint video on YouTube.