
Dataform
SQL-Based Pipelines

A framework for developing, testing, versioning, and deploying SQL-based data pipelines in BigQuery. Apply software engineering best practices to your data transformation workflows with this serverless, fully integrated Google Cloud service.

Author: Michaël Bettan
Integration: BigQuery
Ecosystem: Google Cloud
Architecture: Serverless
01

Key Features & Concepts

Serverless & Intuitive

  • Serverless: Dataform is a serverless service, meaning you don't need to manage any infrastructure.
  • Intuitive Interface: Offers a user-friendly environment with familiar SQL syntax for easy adoption.

SQL-based Transformations

Define data transformations in SQL, a language already familiar to data professionals.

  • Supports ELT pipelines, executing SQL transformations directly within BigQuery.
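As a sketch, a Dataform transformation lives in a SQLX file: a small config block followed by plain SQL. The filename, dataset, and column names below are illustrative:

```sqlx
-- definitions/reporting/daily_sessions.sqlx (illustrative path)
config {
  type: "view",
  schema: "reporting"
}

-- Standard BigQuery SQL; Dataform wraps it in the DDL needed
-- to create or replace the view.
SELECT
  DATE(event_timestamp) AS session_date,
  COUNT(DISTINCT session_id) AS sessions
FROM raw_events.sessions
GROUP BY session_date
```

Changing `type` to `"table"` or `"incremental"` changes how the result is materialized, without touching the SELECT itself.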

Modularity & Dependencies

  • Modular Code: Promotes modularity by allowing you to break down your data pipeline into smaller, reusable components.
  • Dependency Management: Automatically manages dependencies between different parts of your data pipeline, ensuring that tasks are executed in the correct order.
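Dependencies are declared with the `ref()` function rather than hard-coded table names. A minimal sketch (file and column names are illustrative):

```sqlx
-- definitions/outputs/customer_totals.sqlx (illustrative)
config { type: "table" }

SELECT
  customer_id,
  SUM(amount) AS total_spend
-- ref() resolves to the fully qualified name of the table
-- built by stg_orders.sqlx, and registers it as a dependency
-- so it is always executed first.
FROM ${ref("stg_orders")}
GROUP BY customer_id
```

Because every reference goes through `ref()`, Dataform can derive the full dependency graph and execute the pipeline in the correct order automatically.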

Version Control

Integrate with Git repositories (GitHub, GitLab, etc.) to track changes, collaborate effectively, and revert to previous versions if needed (SQL as Code).

Testing & Validation

Implement data quality tests and assertions to ensure the accuracy and reliability of your data.

  • Check for uniqueness and null values in BigQuery tables.
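Built-in assertions can be attached directly to a table's config block; Dataform compiles them into queries that fail the run if any rows violate them. Column names here are illustrative:

```sqlx
-- definitions/outputs/orders_checked.sqlx (illustrative)
config {
  type: "table",
  assertions: {
    uniqueKey: ["order_id"],          -- fail if order_id is duplicated
    nonNull: ["order_id", "customer_id"]  -- fail if either column contains NULLs
  }
}

SELECT order_id, customer_id, amount
FROM ${ref("stg_orders")}
```

Custom checks can also be written as standalone files with `type: "assertion"`, whose SELECT should return zero rows when the data is healthy.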

Environments & Orchestration

  • Environments: Define separate environments (e.g., development, staging, production) with different configurations and datasets.
  • Orchestration: Orchestrates the execution of your data pipeline, handling dependencies and scheduling.
  • Automated Scheduling: Built-in scheduling simplifies pipeline management and execution.
  • Documentation: Automatically generates documentation for your data pipeline, making it easier to understand and maintain.
02

Repository Structure Best Practices

A well-structured repository improves collaboration, maintainability, and navigation. The recommended structure for your definitions directory is organized into distinct logical phases:

DIR 1

sources

Contains declarations of source data and basic transformations like filtering, casting, and renaming columns. Organize sources from different platforms (e.g., Google Ads, Google Analytics) into separate subdirectories.
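A source is typically registered with a declaration file, which makes an existing BigQuery table addressable via `ref()` without Dataform managing it. Project and dataset names below are illustrative:

```sqlx
-- definitions/sources/google_ads/campaign_stats.sqlx (illustrative)
config {
  type: "declaration",
  database: "my-gcp-project",     -- illustrative project ID
  schema: "raw_google_ads",       -- dataset holding the raw export
  name: "campaign_stats"
}
```

Downstream files can then reference it as `${ref("campaign_stats")}`, keeping raw table locations in one place.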

DIR 2

intermediate

Houses intermediate data transformations that combine data from multiple sources or perform complex calculations. Typically not used directly for analytics. Use a unique prefix (e.g., stg_) for table filenames.

DIR 3

outputs

Stores the definitions of your final output tables, which are ready for consumption by downstream applications or analytics tools. Use concise filenames for output tables.
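An output table can also carry a description and per-column documentation in its config block, which feeds Dataform's auto-generated docs. Everything below is an illustrative sketch:

```sqlx
-- definitions/outputs/orders.sqlx (concise filename; illustrative)
config {
  type: "table",
  schema: "outputs",
  description: "One row per completed order, ready for BI tools.",
  columns: {
    order_id: "Unique order identifier",
    total_amount: "Order total in USD"
  }
}

SELECT order_id, total_amount
FROM ${ref("stg_orders")}
WHERE status = 'completed'
```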

DIR 4

extras (Optional)

Contains any additional files, such as utility scripts or configuration files.

03

Best Practices: Naming & Size

File Naming Conventions

Adhere strictly to BigQuery table naming conventions. Reflect the subdirectory structure in filenames for clarity and ease of navigation.

Repository Size

Be mindful of repository size, as it can significantly impact collaboration, readability, development processes, compilation time, and overall execution time.

04

Environments and Project Overrides

Managing Environments

Dataform lets you compile the same repository against different Google Cloud projects and BigQuery datasets, so development, staging, and production runs stay isolated while sharing a single codebase. In Dataform on Google Cloud, these overrides are applied per environment through release configurations, without any changes to the SQL itself.
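In recent Dataform Core versions, repository-level defaults live in a `workflow_settings.yaml` file at the repository root; per-environment release configurations can then override these values at compile time. The values below are illustrative:

```yaml
# workflow_settings.yaml (values are illustrative)
defaultProject: my-dev-project        # default Google Cloud project
defaultLocation: US                   # BigQuery location
defaultDataset: dataform              # default dataset for created tables
defaultAssertionDataset: dataform_assertions  # where assertion views land
# A production release configuration can override defaultProject
# (e.g., to my-prod-project) without any code changes.
```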

05

Deployment Options

Dataform Core (Open Source)

Dataform Core is the open-source foundation of Dataform. It provides a meta-language that extends SQL with dependency management, testing, and documentation capabilities.

  • You can use Dataform Core with the Dataform CLI to run Dataform pipelines locally.
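A typical local workflow with the open-source CLI might look like the following transcript (exact commands and flags can vary by CLI version):

```
$ npm install -g @dataform/cli   # install the open-source CLI
$ dataform compile               # resolve refs and validate the project
$ dataform run                   # execute the compiled pipeline against BigQuery
```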

Dataform on Google Cloud

Dataform on Google Cloud offers a fully managed experience for building data pipelines directly in BigQuery.

  • Includes a cloud development environment.
  • Serverless orchestration.
  • Seamless integration with other Google Cloud services.