Dataform is a framework for developing, testing, versioning, and deploying SQL-based data pipelines in BigQuery. It is a serverless, fully integrated Google Cloud service that lets you apply software engineering best practices to your data transformation workflows.
Use SQL to define data transformations, making it familiar to data professionals.
Integrate with Git repositories (GitHub, GitLab, etc.) to track changes, collaborate effectively, and revert to previous versions if needed (SQL as Code).
Implement data quality tests and assertions to ensure the accuracy and reliability of your data.
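As a minimal sketch of how a SQL transformation and its data quality checks can live in one file, the SQLX definition below builds a table and attaches built-in assertions in its config block. The table name stg_orders and the column names are hypothetical, chosen only for illustration.

```sqlx
config {
  type: "table",
  description: "One row per order, ready for reporting.",
  assertions: {
    uniqueKey: ["order_id"],
    nonNull: ["order_id", "customer_id"],
    rowConditions: ["total_amount >= 0"]
  }
}

-- The assertions above check the finished table:
--   uniqueKey:     no two rows share the same order_id
--   nonNull:       order_id and customer_id are never NULL
--   rowConditions: every row satisfies total_amount >= 0
-- ref() resolves the upstream table name and records the dependency.
SELECT
  order_id,
  customer_id,
  SUM(amount) AS total_amount
FROM ${ref("stg_orders")}
GROUP BY order_id, customer_id
```

Because the file is plain text, it can be committed, reviewed, and reverted through Git like any other source file.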
A well-structured repository improves collaboration, maintainability, and navigation. The recommended structure for your definitions directory is organized into distinct logical phases:
The sources phase contains declarations of source data and basic transformations such as filtering, casting, and renaming columns. Organize sources from different platforms (e.g., Google Ads, Google Analytics) into separate subdirectories.
The intermediate phase houses transformations that combine data from multiple sources or perform complex calculations. These tables are typically not used directly for analytics. Use a unique prefix (e.g., stg_) for table filenames.
The outputs phase stores the definitions of your final output tables, which are ready for consumption by downstream applications or analytics tools. Use concise filenames for output tables.
A separate directory for extras holds any additional files, such as utility scripts or configuration files.
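To make the first phase concrete, here is a hedged sketch of two files that could sit in a per-platform subdirectory for Google Ads sources. The project ID, dataset, table, and column names are all hypothetical. The first file only declares an existing BigQuery table so that other definitions can reference it:

```sqlx
config {
  type: "declaration",
  database: "my-project",
  schema: "google_ads_raw",
  name: "campaigns"
}
```

The second file applies only the basic cleanup described above (filtering, casting, and renaming) on top of that declaration:

```sqlx
config {
  type: "view",
  description: "Google Ads campaigns with cleaned-up types and names."
}

-- ref("campaigns") points at the declaration and registers the dependency.
SELECT
  CAST(campaign_id AS STRING) AS campaign_id,        -- cast
  campaign_name AS name,                             -- rename
  SAFE_CAST(cost_micros AS INT64) / 1000000 AS cost  -- cast and rescale
FROM ${ref("campaigns")}
WHERE campaign_id IS NOT NULL                        -- filter
```

Transformations that join several sources or carry heavier logic would then go into the intermediate phase rather than into files like these.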
Adhere strictly to BigQuery table naming conventions. Reflect the subdirectory structure in filenames for clarity and ease of navigation.
Be mindful of repository size, as it can significantly impact collaboration, readability, development processes, compilation time, and overall execution time.
Use dataform.json to define environment-specific settings, including project IDs.
Use the --env and --vars flags when deploying with the Dataform CLI to accurately specify the target environment and inject variables at runtime.
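Variables defined in the project configuration or injected at runtime can be read inside a definition through dataform.projectConfig.vars. The sketch below assumes a hypothetical variable named start_date and a hypothetical upstream table stg_orders. It illustrates the pattern and is not a prescribed setup.

```sqlx
config {
  type: "table"
}

-- start_date is an assumed variable: it would come from the "vars" section
-- of the project configuration or from a --vars override at run time.
SELECT
  order_id,
  order_date,
  amount
FROM ${ref("stg_orders")}
WHERE order_date >= DATE '${dataform.projectConfig.vars.start_date}'
```

Keeping environment-specific values in variables rather than hard-coding them lets the same definition compile unchanged for development and production targets.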
Dataform Core is the open-source foundation of Dataform. It provides a meta-language that extends SQL with dependency management, testing, and documentation capabilities.
Dataform on Google Cloud offers a fully managed experience for building data pipelines directly in BigQuery.