Study Notes — Certification Prep

Data Engineering 101
Michaël Bettan

Comprehensive study notes covering data engineering fundamentals, data pipelines, architectures (Data Lake vs Data Mesh), databases, formats, and orchestration.

01

Definition & Core Concepts

Definition

Data engineering enables data-driven decision making by collecting, transforming, and visualizing data. A data engineer designs, builds, maintains, and troubleshoots data processing systems with a particular emphasis on the security, reliability, fault-tolerance, scalability, fidelity, and efficiency of such systems.

Big Data's 3 Vs

The three Vs describe the core characteristics that distinguish big data from traditional data: Volume (the amount of data), Velocity (the speed at which it arrives and must be processed), and Variety (the range of formats and sources).

Use Cases

  • Build/maintain data structures and databases
  • Design data processing systems
  • Analyze data and enable machine learning
  • Design for reliability
  • Visualize data and advocate policy
  • Model business processes for analysis
  • Design for security and compliance

Data Engineers' Role

Build data pipelines to enable data-driven decisions:

  • Get the data to where it can be useful
  • Get the data into a usable condition
  • Add new value to the data
  • Manage the data
  • Productionize data processes
  • Understand how to manipulate data formats, scale data systems, and enforce data quality and security.
Durability
Ensures that data remains intact and persistent, even in the event of system failures such as hardware crashes or power outages. It guarantees data integrity and the ability to recover information, safeguarding it from loss.
Availability
Guarantees that data is always accessible when needed. Systems are built with redundancy and fault tolerance to minimize downtime, ensuring continuous and reliable access to information.
Low Latency
Minimal delay between a data request and its response. Low latency is critical for real-time or near real-time data processing, especially in time-sensitive applications like online gaming and financial transactions.

Types of data structures

02

OLTP vs OLAP & Storage

OLTP (Online Transaction Processing)

Focuses on high-volume, short transactions, prioritizing data integrity and consistency (e.g., banking transactions, e-commerce orders). ACID properties are crucial. Normalized schemas are common.

OLAP (Online Analytical Processing)

Designed for complex queries, aggregations, and analysis over large datasets. Speed and flexibility for analytical queries are prioritized. Denormalized schemas (star/snowflake) are common.

Row-level vs Column-level

In row-based storage (typical of RDBMS), all the fields (columns) for a given row are stored together, which makes fetching whole records fast. In columnar storage, the data for each column is stored independently, which makes scanning and aggregating a few columns across many rows fast.
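The difference can be sketched in pure Python; the records and field names below are invented for illustration:

```python
# Sketch: the same three records laid out row-wise vs column-wise.
rows = [
    {"id": 1, "name": "Ada", "amount": 10.0},
    {"id": 2, "name": "Bob", "amount": 20.0},
    {"id": 3, "name": "Cy",  "amount": 30.0},
]

# Row-based (RDBMS-style): all fields of one record are adjacent,
# so fetching a full record is a single lookup.
full_row = rows[1]

# Columnar: each column is stored independently, so an aggregate
# over one column touches only that column's data.
columns = {key: [r[key] for r in rows] for key in rows[0]}
total = sum(columns["amount"])  # scans only the "amount" column

print(full_row["name"], total)  # Bob 60.0
```

This is why OLTP systems favor row storage (whole-record reads and writes) while OLAP systems favor columnar storage (per-column scans and aggregates).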

RDBMS & ACID

A Relational Database Management System allows you to create, update, and manage a relational database, and ensures data integrity through ACID properties.


Strong consistency: every read reflects the most recent committed write, and all processes observe operations in the same order.

Eventual consistency: parallel processes can see changes in different orders or temporarily inconsistent states, but over time, the system guarantees that all processes will converge on the same final state.

03

Data Architectures & Formats

Data Lake

Suitable for organizations with a strong central IT team and a need for a cost-effective solution to store large volumes of raw data. Best when exploring data and use cases aren't clearly defined.

Table Format: Data Lake

  • Brings together data from across the enterprise into a single and consolidated location → Cloud Storage
  • Solves for these pain points:
    • Getting insights across multiple datasets is difficult
    • Data is often siloed in many upstream source systems.
    • Cleaning, formatting, and getting data ready for useful business insights in a data warehouse is costly and slow
  • Key considerations: Can it handle all types of data? Can it scale? High-throughput? Fine-grained access control? Easy tool connection?

Data Mesh

Ideal for large, complex organizations with diverse data domains and a need for greater agility and data ownership. Requires a significant cultural shift and investment in training and infrastructure.

Promotes distributed data ownership while centralizing governance and data discoverability. It shifts the responsibility for managing, producing, and consuming data to domain-specific teams.

  • Domain-Oriented Data Ownership: Each domain owns and manages its data.
  • Data as a Product: Useful, documented, and accessible via APIs.
  • Federated Governance: Centralized rules ensure compliance (e.g., Dataplex).
  • Self-Service Infrastructure: Teams are empowered to manage data.

Data Lake vs Data Mesh

Centralized vs Decentralized | Schema-on-read vs Schema-on-write | IT-centric vs Data products/Domain-centric

Choosing the right file format

Format | Description | Best For
CSV | Simple text format, rows separated by commas. Lacks data type enforcement. | Small datasets, simple data exchange, human readability.
Parquet | Columnar storage optimized for analytical queries. Better compression/performance than CSV. | Analytical workloads needing high performance, compression, schema evolution.
Avro | Row-oriented format based on schemas. Supports schema evolution, good compression. | Data serialization and streaming applications where schema evolution is paramount.
ORC | Optimized Row Columnar. Combines row/columnar aspects. Good compression/querying. | Hive workloads requiring ACID properties and optimized for columnar access.
JSON | Human-readable, key-value pairs. Flexible but challenging to query efficiently at scale. | Web APIs and document databases where flexibility is important.
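CSV's lack of type enforcement can be seen with a stdlib-only round trip; the record below is invented for illustration:

```python
import csv
import io
import json

record = {"id": 7, "price": 19.99, "in_stock": True}

# CSV: no type enforcement -- every value comes back as a string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record)
writer.writeheader()
writer.writerow(record)
buf.seek(0)
csv_row = next(csv.DictReader(buf))

# JSON: basic types (int, float, bool) survive the round trip.
json_row = json.loads(json.dumps(record))

print(type(csv_row["id"]).__name__, type(json_row["id"]).__name__)  # str int
```

With CSV, downstream consumers must re-infer or re-cast types, which is one reason schema-carrying formats (Parquet, Avro, ORC) are preferred for pipelines.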

Table Format: Iceberg

A high-level management layer for your data lake. Provides essential features:

  • ACID Properties: Atomicity, consistency, isolation, durability.
  • Schema Evolution: Add/remove/modify columns without rewriting dataset.
  • Hidden Partitioning: Manages partitioning efficiently behind the scenes.
  • Time Travel: Tracks history, allowing queries of past snapshots.
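The time-travel idea can be sketched conceptually: each commit records an immutable snapshot of table state, and old snapshots stay queryable. This is an illustration of the concept only, not Iceberg's actual metadata format:

```python
# Conceptual sketch of snapshot-based "time travel".
class SnapshotTable:
    def __init__(self):
        self.snapshots = [[]]  # snapshot 0: empty table

    def commit(self, new_rows):
        # A commit creates a new immutable snapshot; older ones remain readable.
        self.snapshots.append(self.snapshots[-1] + list(new_rows))

    def read(self, snapshot_id=-1):
        # Query the latest snapshot by default, or any historical one.
        return self.snapshots[snapshot_id]

t = SnapshotTable()
t.commit([{"id": 1}])
t.commit([{"id": 2}])
print(len(t.read()))               # 2 rows at the latest snapshot
print(len(t.read(snapshot_id=1)))  # 1 row "back in time"
```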

Data stored in Parquet/Avro/ORC. Metadata tracks schema, location, partitioning, snapshots.

Table Formats: Hive & Delta Lake

Hive: Data warehouse system on Hadoop for querying large datasets (batch, SQL-like).

  • ACID: Limited support (Hive ACID tables/ORC).
  • Schema Evolution: Supported.
  • Partitioning: Directory-based.

Delta Lake: Open-source storage layer providing ACID transactions for data lakes (often with Spark).

  • ACID Transactions: Full support.
  • Schema Evolution: Enforcement & evolution.
  • Time Travel & Partitioning supported.
Data Serialization
The process of transforming a complex data structure into a format that can be easily stored or transmitted and then reconstructed later. (e.g., JSON, XML, CSV, Protocol Buffer).
Protocol Buffer (protobuf)
Mechanism for serializing structured data in a language- and platform-neutral way. Compact, fast. Often used for microservices communication and defining API contracts.
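A serialization round trip can be shown with the stdlib (protobuf requires generated classes from a `.proto` schema, so JSON stands in here); the event structure is invented:

```python
import json

# Serialization: complex in-memory structure -> transportable bytes -> back.
event = {"user_id": 42, "tags": ["a", "b"], "meta": {"ok": True}}

payload = json.dumps(event).encode("utf-8")     # serialize to bytes
restored = json.loads(payload.decode("utf-8"))  # deserialize

print(restored == event)  # True
```

Protobuf follows the same serialize/deserialize pattern but produces a compact binary payload validated against a schema, which is why it suits microservice communication and API contracts.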
04

Data Warehouses & Pipelines

Data Warehouse

Stores transformed data in a usable condition for business insights (e.g., BigQuery, Snowflake).

Choosing cloud over on-premises: Cost reduction/Scalability, Reduced maintenance/Increased agility, Enhanced accessibility/Collaboration.

Standard workflow & lifecycle

  1. Collect: raw data, such as streaming data from devices, on-premises batch data, application logs, or mobile-app user events and analytics
  2. Store: the data in a format that is durable and accessible into a data lake or data warehouse
  3. Process or enrich: to make the data useful for the downstream systems for analysis
  4. Analyze: data exploration, ad-hoc querying and analysis
  5. Empower: data visualization, sharing and activation to derive business value and insights from data

Collect mechanisms

  • Application: push model (Cloud Logging, Pub/Sub, CloudSQL, Spanner)
  • Streaming: continuous stream of small, asynchronous messages (IoT telemetry, user events → Pub/Sub)
  • Batching: large amounts transferred in bulk (scientific workloads, backups → Storage Transfer Service)

Data Pipeline patterns

  • Needs additional processing? → Data processing (Dataproc or Dataflow)
  • Arrives continuously? → Data streaming (Pub/Sub + Dataflow)
  • ETL (Extract, Transform, Load): Data is transformed before being loaded into a database.
  • ELT (Extract, Load, Transform): Data is loaded into a database before transforming.
  • EL (Extract, Load): No transformation needed.
  • CDC (Change Data Capture): Captures changes over time in the source system.
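The ETL ordering (transform before load) can be sketched as follows; the source rows and the list-as-warehouse are invented stand-ins:

```python
# Minimal ETL sketch: the transform happens *before* the load step.
def extract():
    # Stand-in for reading from an upstream source system.
    return [{"name": " Ada ", "amount": "10"}, {"name": "bob", "amount": "20"}]

def transform(rows):
    # Clean and type-cast before loading (the "T" before the "L").
    return [{"name": r["name"].strip().title(), "amount": int(r["amount"])}
            for r in rows]

def load(rows, warehouse):
    # Stand-in for writing to a data warehouse.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)  # ETL ordering
print(warehouse[0])  # {'name': 'Ada', 'amount': 10}
```

In ELT the `load` call would come first and the transform would run inside the warehouse (e.g., as SQL), which is common when the warehouse engine is the cheapest place to do heavy transformation.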

Batch vs Streaming

Batch: For bounded, finite datasets, delayed processing. Choose if high throughput > low latency, data is finite, complex transforms needed.

Streaming: For unbounded datasets, continuous updates, immediate processing. Uses message queue. Choose if low latency is crucial (fraud detection), data continuous, immediate feedback required. Main challenge: out of order data.
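The out-of-order challenge is often handled by buffering events and emitting them only once a watermark has passed. A stdlib sketch, with invented event times and an assumed allowed lateness of 5 time units:

```python
import heapq

# Events arrive out of event-time order: (event_time, payload).
events = [(1, "a"), (4, "b"), (2, "c"), (9, "d"), (6, "e")]
ALLOWED_LATENESS = 5

buffer, emitted, max_seen = [], [], 0
for ts, payload in events:
    heapq.heappush(buffer, (ts, payload))
    max_seen = max(max_seen, ts)
    # Watermark: we assume no event older than this will still arrive.
    watermark = max_seen - ALLOWED_LATENESS
    while buffer and buffer[0][0] <= watermark:
        emitted.append(heapq.heappop(buffer))

print(emitted)  # [(1, 'a'), (2, 'c'), (4, 'b')] -- event-time order restored
```

Events newer than the watermark (here `(6, 'e')` and `(9, 'd')`) stay buffered until later arrivals advance it; real engines like Dataflow apply the same idea with far more sophisticated watermark estimation.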

05

Databases & Modeling

Feature | Relational | Document | Wide-Column
Data Model | Table | JSON/BSON document | Sparse multidimensional array
Schema | Fixed | Flexible | Flexible within column families
Scalability | Vertical | Horizontal | Horizontal
Querying | SQL | Specific query languages | Specific query languages
Transactions | ACID | Limited | Limited
Use Cases | Transactional systems, structured data | Content management, e-commerce, user profiles | Time-series data, analytics, large datasets

Star Schema

EDW schema that organizes data into fact and dimension tables.

  • Fact Table: Contains quantitative data (facts) like sales, revenue. Holds foreign keys to dimension tables.
  • Dimension Tables: Smaller tables describing attributes of facts (Date, Customer, Product).

Features: Denormalized dimensions (some redundancy), simple joins, faster query performance for analytics.
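A star-schema query can be sketched with an in-memory SQLite database; the table and column names below are invented for illustration:

```python
import sqlite3

# One fact table holding foreign keys and measures, joined to a
# small dimension table describing the facts.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (sale_id INTEGER, product_id INTEGER, revenue REAL);
    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO fact_sales VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")

# Typical analytic query: one simple join, then aggregate the measures.
rows = con.execute("""
    SELECT p.name, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)  # [('Gadget', 3.0), ('Widget', 12.5)]
```

In a snowflake schema the `dim_product` table would itself be split (e.g., into product and category tables), trading the single join here for several smaller, normalized ones.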

Snowflake Schema

Extension of Star Schema, dimension tables are further normalized.

  • Normalized Dimension Tables: Dimension tables are split into multiple related tables (e.g., Product → Product Category, Subcategory).

Features: Normalized dimensions (less redundancy), complex joins, storage efficiency over query speed.

Avoid SQL anti-patterns (e.g., SELECT * on wide tables, filtering after joining instead of before).

06

Advanced Concepts & Tooling

Data marts
Siloed copies of some of the data stored in a data warehouse to offload analytic activity and preserve performance. Fulfills the request of a specific division.
Time-series data
Sequence of data points indexed in time order (CPU utilization, stock pricing).
Data orchestration tools
e.g., Apache Airflow, focus on orchestration only and structure the flow of data as a series of sequential tasks.
DataOps
Brings together data engineers, scientists, analysts to apply agile best practices to the data lifecycle.
OLAP cube
Data structure for fast analysis according to multiple dimensions.
Data Wrangling
Process of cleaning, transforming, and preparing data for analysis (handling missing values, etc.).

Data Transformation

  • Denormalization: Adding precomputed redundant data to improve read performance (can increase storage costs).
  • Normalization: Removing redundancy (better for integrity, slower queries).
  • Standardizing: converting data types.
  • Cleansing: resolving inaccuracies.
  • Mapping: combining elements.
  • Deduplication: eliminating duplicate copies.
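Several of these transformations can be sketched together in a few lines; the records are invented for illustration:

```python
# Cleansing, standardizing, and deduplicating a batch of raw records.
raw = [
    {"email": "ADA@example.com ", "age": "36"},
    {"email": "ada@example.com",  "age": "36"},  # duplicate after cleanup
    {"email": "bob@example.com",  "age": "41"},
]

seen, clean = set(), []
for r in raw:
    email = r["email"].strip().lower()  # cleansing: fix casing/whitespace
    age = int(r["age"])                 # standardizing: convert the type
    if email not in seen:               # deduplication: drop repeat keys
        seen.add(email)
        clean.append({"email": email, "age": age})

print(len(clean))  # 2
```

Note that deduplication only works after cleansing here: the two "ada" records only match once casing and whitespace are normalized, which is why transformation steps are usually ordered deliberately.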

Streaming semantics

Ensuring processing is accurate/reliable despite failures.

  • Exactly-Once: Each record processed exactly one time (difficult to achieve).
  • At-Least-Once: Processed at least once (can lead to duplicates, requires idempotence).
  • At-Most-Once: Processed at most once (weakest, records may be lost).
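The idempotence requirement for at-least-once delivery can be sketched as a consumer that remembers processed message IDs; the message shape is invented:

```python
# Making at-least-once delivery safe via idempotence: redelivered
# messages (same id) are applied only once.
processed_ids, totals = set(), {"balance": 0}

def handle(message):
    if message["id"] in processed_ids:  # already applied: skip the duplicate
        return
    processed_ids.add(message["id"])
    totals["balance"] += message["amount"]

stream = [{"id": "m1", "amount": 10},
          {"id": "m2", "amount": 5},
          {"id": "m1", "amount": 10}]  # duplicate redelivery
for msg in stream:
    handle(msg)

print(totals["balance"])  # 15, not 25
```

Production systems persist the processed-ID set (or use naturally idempotent writes, such as keyed upserts) so the guarantee survives consumer restarts.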

Distributed Systems Concepts

dbt (data build tool)

SQL-first transformation workflow following software engineering best practices.

Pros: Strong community, excels at complex logic, modularity, wide support.

Cons: Steeper learning curve, less focus on orchestration, limited built-in testing.

Resources & Assets

Video Content

Watch the Data Engineering 101 architecture blueprint video on YouTube.


Self-Assessment Questions

Q1. What are the 3 Vs of Big Data?

Volume (amount), Velocity (speed), and Variety (formats).

Q2. What is the main difference between OLTP and OLAP?

OLTP focuses on high-volume, short transactions for operational use, while OLAP is designed for complex queries and analysis over large datasets.

Q3. Which file format is columnar and optimized for analytical queries with good compression?

Parquet (also ORC).

Q4. What is the core principle of a Data Mesh architecture?

Distributed data ownership by domain-specific teams with federated governance.

Q5. What does ACID stand for in database transactions?

Atomicity, Consistency, Isolation, and Durability.
Atomicity, Consistency, Isolation, and Durability.