
Dataform
SQL-Based Pipelines

A framework for developing, testing, versioning, and deploying SQL-based data pipelines in BigQuery. Apply software engineering best practices to your data transformation workflows with this serverless, fully integrated Google Cloud service.

Author: Michaël Bettan
Integration: BigQuery
Ecosystem: Google Cloud
Architecture: Serverless
01

Key Features & Concepts

Serverless & Intuitive

  • Serverless: Dataform is a serverless service, meaning you don't need to manage any infrastructure.
  • Intuitive Interface: Offers a user-friendly environment with familiar SQL syntax for easy adoption.

SQL-based Transformations

Define data transformations in SQL, a language already familiar to data professionals.

  • Supports ELT pipelines, executing SQL transformations directly within BigQuery.
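As a sketch, a Dataform transformation lives in a SQLX file: a small config block followed by plain SQL. The filename, dataset, and column names below are illustrative:

```sqlx
-- definitions/reporting/daily_sessions.sqlx (illustrative path)
config {
  type: "view",
  schema: "reporting"
}

-- Standard BigQuery SQL; Dataform wraps it in the DDL needed
-- to create or replace the view.
SELECT
  DATE(event_timestamp) AS session_date,
  COUNT(DISTINCT session_id) AS sessions
FROM raw_events.sessions
GROUP BY session_date
```

Changing `type` to `"table"` or `"incremental"` changes how the result is materialized, without touching the SELECT itself.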

Modularity & Dependencies

  • Modular Code: Promotes modularity by allowing you to break down your data pipeline into smaller, reusable components.
  • Dependency Management: Automatically manages dependencies between different parts of your data pipeline, ensuring that tasks are executed in the correct order.
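Dependencies are declared with the `ref()` function rather than hard-coded table names. A minimal sketch (file and column names are illustrative):

```sqlx
-- definitions/outputs/customer_totals.sqlx (illustrative)
config { type: "table" }

SELECT
  customer_id,
  SUM(amount) AS total_spend
-- ref() resolves to the fully qualified name of the table
-- built by stg_orders.sqlx, and registers it as a dependency
-- so it is always executed first.
FROM ${ref("stg_orders")}
GROUP BY customer_id
```

Because every reference goes through `ref()`, Dataform can derive the full dependency graph and execute the pipeline in the correct order automatically.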

Version Control

Integrate with Git repositories (GitHub, GitLab, etc.) to track changes, collaborate effectively, and revert to previous versions if needed (SQL as Code).

Testing & Validation

Implement data quality tests and assertions to ensure the accuracy and reliability of your data.

  • Check for uniqueness and null values in BigQuery tables.
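Built-in assertions can be attached directly to a table's config block; Dataform compiles them into queries that fail the run if any rows violate them. Column names here are illustrative:

```sqlx
-- definitions/outputs/orders_checked.sqlx (illustrative)
config {
  type: "table",
  assertions: {
    uniqueKey: ["order_id"],          -- fail if order_id is duplicated
    nonNull: ["order_id", "customer_id"]  -- fail if either column contains NULLs
  }
}

SELECT order_id, customer_id, amount
FROM ${ref("stg_orders")}
```

Custom checks can also be written as standalone files with `type: "assertion"`, whose SELECT should return zero rows when the data is healthy.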

Environments & Orchestration

  • Environments: Define separate environments (e.g., development, staging, production) with different configurations and datasets.
  • Orchestration: Orchestrates the execution of your data pipeline, handling dependencies and scheduling.
  • Automated Scheduling: Built-in scheduling simplifies pipeline management and execution.
  • Documentation: Automatically generates documentation for your data pipeline, making it easier to understand and maintain.
02

Repository Structure Best Practices

A well-structured repository improves collaboration, maintainability, and navigation. The recommended structure for your definitions directory is organized into distinct logical phases:

DIR 1

sources

Contains declarations of source data and basic transformations like filtering, casting, and renaming columns. Organize sources from different platforms (e.g., Google Ads, Google Analytics) into separate subdirectories.
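A source is typically registered with a declaration file, which makes an existing BigQuery table addressable via `ref()` without Dataform managing it. Project and dataset names below are illustrative:

```sqlx
-- definitions/sources/google_ads/campaign_stats.sqlx (illustrative)
config {
  type: "declaration",
  database: "my-gcp-project",     -- illustrative project ID
  schema: "raw_google_ads",       -- dataset holding the raw export
  name: "campaign_stats"
}
```

Downstream files can then reference it as `${ref("campaign_stats")}`, keeping raw table locations in one place.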

DIR 2

intermediate

Houses intermediate data transformations that combine data from multiple sources or perform complex calculations. Typically not used directly for analytics. Use a unique prefix (e.g., stg_) for table filenames.

DIR 3

outputs

Stores the definitions of your final output tables, which are ready for consumption by downstream applications or analytics tools. Use concise filenames for output tables.
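An output table can also carry a description and per-column documentation in its config block, which feeds Dataform's auto-generated docs. Everything below is an illustrative sketch:

```sqlx
-- definitions/outputs/orders.sqlx (concise filename; illustrative)
config {
  type: "table",
  schema: "outputs",
  description: "One row per completed order, ready for BI tools.",
  columns: {
    order_id: "Unique order identifier",
    total_amount: "Order total in USD"
  }
}

SELECT order_id, total_amount
FROM ${ref("stg_orders")}
WHERE status = 'completed'
```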

DIR 4

extras (Optional)

Contains any additional files, such as utility scripts or configuration files.

03

Best Practices: Naming & Size

File Naming Conventions

Adhere strictly to BigQuery table naming conventions. Reflect the subdirectory structure in filenames for clarity and ease of navigation.

Repository Size

Be mindful of repository size, as it can significantly impact collaboration, readability, development processes, compilation time, and overall execution time.

04

Environments and Project Overrides

Managing Environments

Dataform lets you compile the same repository against different Google Cloud projects and BigQuery datasets, so development, staging, and production runs stay isolated while sharing a single codebase. In Dataform on Google Cloud, these overrides are applied per environment through release configurations, without any changes to the SQL itself.
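In recent Dataform Core versions, repository-level defaults live in a `workflow_settings.yaml` file at the repository root; per-environment release configurations can then override these values at compile time. The values below are illustrative:

```yaml
# workflow_settings.yaml (values are illustrative)
defaultProject: my-dev-project        # default Google Cloud project
defaultLocation: US                   # BigQuery location
defaultDataset: dataform              # default dataset for created tables
defaultAssertionDataset: dataform_assertions  # where assertion views land
# A production release configuration can override defaultProject
# (e.g., to my-prod-project) without any code changes.
```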

05

Deployment Options

Dataform Core (Open Source)

Dataform Core is the open-source foundation of Dataform. It provides a meta-language that extends SQL with dependency management, testing, and documentation capabilities.

  • You can use Dataform Core with the Dataform CLI to run Dataform pipelines locally.
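A typical local workflow with the open-source CLI might look like the following transcript (exact commands and flags can vary by CLI version):

```
$ npm install -g @dataform/cli   # install the open-source CLI
$ dataform compile               # resolve refs and validate the project
$ dataform run                   # execute the compiled pipeline against BigQuery
```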

Dataform on Google Cloud

Dataform on Google Cloud offers a fully managed experience for building data pipelines directly in BigQuery.

  • Includes a cloud development environment.
  • Serverless orchestration.
  • Seamless integration with other Google Cloud services.