Study Notes — Certification Prep

BigQuery
Study Guide

Fully managed, Serverless (no-ops), powers fast SQL queries. Essential for analytics, cloud-based data warehouse.

Updated: April 2026
Version: 1.1
Category: Data Warehouse
Reading Time: ~30 min
Author: Michaël Bettan
01

Definition & Use Cases

What is BigQuery?

BigQuery is a fully managed, serverless (no-ops) data warehouse that powers fast SQL queries. It is essential for analytics and serving as a cloud-based data warehouse.

  • Availability: Regional (note: multi-region locations do not provide cross-region replication).
  • SLA: 99.99%

Use Cases

02

Architecture

Architecture

03

Billing & Jobs

Billing

Load, copy, export, re-cluster, and delete/metadata operations are free.

  1. Storage: Active storage (table or partition modified within the last 90 days) or Long-term storage (no modifications for 90 consecutive days; the lower rate is applied automatically).
  2. Querying: On-demand (pay-per-query) bills by bytes scanned with a shared capacity of up to 2,000 slots; Editions (pay per slot-hour) bills by slot usage, offers different feature sets, and supports optional slot commitments.
  3. Ingestion: Streaming Inserts and the Storage Write API are billed by the amount of data processed.

Jobs & Reservation

  • Slot: Unit of compute (CPU, RAM, network); more slots ≈ more query throughput.
  • Flex Slots: Available for bursting workloads.
  • Reservations (buckets of slots): Split a slot commitment and assign capacity at the organization, folder, or project level.
  • Idle slots: Shared and available for other reservations in your organization to use.
  • Fair scheduler: Keeps resource-heavy queries from monopolizing slot capacity.
  • Default pool: Load, export, and copy jobs run by default on a free shared pool of slots.
  • Assignments: QUERY, PIPELINE, ML_EXTERNAL.
04

Data Model & Schema Design

Data Model

  • Project: Logical grouping of datasets for billing purposes.
  • Datasets: Top-level container for tables and views.
  • Authorized dataset: Authorizes all views within a dataset to access data in another (streamlining view authorization).
    • Protect Sensitive data
    • Simplify access control management
  • Table: Stores individual records as rows; each row is composed of columns with an enforced data type. Native or external.
  • View: A virtual table defined by a SQL query.

Views & Data Operations

Views let you share query results without giving access to underlying source data.

  • Authorized Views: Views granted access to a source dataset, so users can query the results without permissions on the underlying tables and without seeing the view's SQL definition. Protects sensitive data.
  • Materialized Views: Precomputed views that periodically cache query results for performance (zero maintenance, fresh data; best effort refresh within 5 min of base table change).
  • Job: Asynchronous action (query, extract, load, copy).
  • DML (Data Manipulation): Update, insert, delete, MERGE. Has daily execution limits.
  • DDL (Data Definition): Create, alter, delete resources.
  • Schema Evolution: Supports schema changes without re-creating tables.
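As a sketch, a materialized view over a hypothetical `sales.orders` base table might look like this (dataset and column names are illustrative):

```sql
-- Precomputed daily revenue; BigQuery refreshes it best-effort
-- and transparently rewrites matching queries to use it.
CREATE MATERIALIZED VIEW sales.daily_revenue AS
SELECT
  order_date,
  SUM(amount) AS revenue
FROM sales.orders
GROUP BY order_date;
```

Queries against `sales.orders` that aggregate by `order_date` can then be served from the cached results instead of rescanning the base table.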

Schema Design

Data Types

NUMERIC / BIGNUMERIC
Financial analysis for precise currency calculations, stock prices, and reporting.
FLOAT64
Sensor data for capturing temperature readings, pressure, or other sensor values.
TIMESTAMP
Event logging for storing event timestamps with accurate time zone information.
STRUCT
Product catalog representing a product with attributes (name, description, price).
GEOGRAPHY
Location-based services for storing user locations, geofencing, and spatial analysis.
JSON
Exchange Log events representing structured data in a flexible format.
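A minimal sketch showing several of these types together, assuming a hypothetical `shop.products` table:

```sql
-- Illustrative table combining NUMERIC (exact currency), STRUCT, and JSON.
CREATE TABLE shop.products (
  id INT64,
  price NUMERIC,                                   -- exact decimal for currency
  attrs STRUCT<name STRING, description STRING>,   -- nested product attributes
  event_payload JSON                               -- flexible, schemaless data
);

-- Access nested and JSON fields directly:
SELECT
  id,
  attrs.name,
  JSON_VALUE(event_payload, '$.source') AS source
FROM shop.products;
```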

Primary and Foreign Keys

BigQuery doesn't validate primary and foreign key constraints the way an RDBMS does; they can only be declared NOT ENFORCED, but the query optimizer can still use them (e.g., for join elimination).
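As a sketch (table and column names are hypothetical), key constraints are declared NOT ENFORCED:

```sql
-- BigQuery accepts key constraints only as NOT ENFORCED;
-- they are metadata hints for the optimizer, not validated rules.
ALTER TABLE shop.orders
  ADD PRIMARY KEY (order_id) NOT ENFORCED;

ALTER TABLE shop.order_items
  ADD FOREIGN KEY (order_id)
  REFERENCES shop.orders (order_id) NOT ENFORCED;
```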

05

Partitioning & Clustering

Partitioning

Divides large tables into smaller, more manageable segments.

  • Purpose: Query performance (faster by scanning only relevant partitions) and Cost reduction (charges based only on processed data).
  • Limits: Up to 10,000 partitions (~27 years of data). Daily limit of 30,000 partition modifications per table. Rate limit: 50 updates every 10 seconds.
  • Types:
    • Ingestion Time: Based on ingestion date. Pseudo-column _PARTITIONTIME.
    • Time-Unit Column: Based on date or timestamp column.
    • Integer Range: Ranges of values in a specific column.
  • Frequencies: Daily (default), hourly, monthly, or yearly.
  • Require Partition Filter: A table-level option that forces every query to include a WHERE clause on the partitioning column.
  • Excessive partitioning: (thousands of partitions) can negatively impact performance.
  • TABLE_DATE_RANGE (Legacy SQL): Queries a date range of day-sharded tables instead of manually listing each one.
  • Partition Decorators: Reference a partition in a table to write data specifically to it (table_name$partition_id).
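Putting the pieces above together, a minimal sketch of a daily-partitioned table with a required partition filter (table and column names are illustrative):

```sql
-- Time-unit column partitioning (daily) with an enforced partition filter.
CREATE TABLE logs.events (
  event_ts TIMESTAMP,
  user_id  STRING,
  payload  JSON
)
PARTITION BY DATE(event_ts)
OPTIONS (require_partition_filter = TRUE);

-- Queries must then filter on the partitioning column,
-- so only the relevant partitions are scanned (and billed):
SELECT COUNT(*)
FROM logs.events
WHERE DATE(event_ts) = '2024-06-01';
```

A partition decorator such as `logs.events$20240601` could then be used (e.g., with a load job) to write data directly into that one partition.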

Clustering & Sharding

Sort data within partitions based on one or more columns.

  • Purpose: Further improves query performance, especially for queries with filtering and aggregation, by co-locating related data.
  • Automatic Reclustering: Maintained in the background for efficiency (free).
  • Hierarchical Clustering: Cluster a table by multiple columns to create a hierarchical organization (e.g., country, then state/region, then city).
  • Partitioning and clustering are powerful when used together (e.g., partition by date, cluster by heavily filtered column).
  • Sharding (Legacy): A similar concept (less preferred) with a single table per day. Uses UNION in query to scan multiple tables. Partitioning within a single table is easier to manage and leverages query optimizations more effectively than creating multiple sharded tables.
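The recommended combination of partitioning plus hierarchical clustering can be sketched as follows (names are illustrative):

```sql
-- Partition by date, then cluster by columns in decreasing granularity:
-- country → region → city. Filters on these columns prune co-located blocks.
CREATE TABLE logs.page_views (
  view_ts TIMESTAMP,
  country STRING,
  region  STRING,
  city    STRING,
  url     STRING
)
PARTITION BY DATE(view_ts)
CLUSTER BY country, region, city;
```

Column order in CLUSTER BY matters: queries filtering on `country` (or `country` and `region`) benefit most, since clustering sorts by the columns left to right.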
06

Data Ingestion & External Sources

Data Ingestion

  • Batch vs. Real-time: Factors include data volume, frequency, and latency requirements.
  • Batch Loading: For large datasets (free, shared resources).
  • Streaming Loading: For real-time use cases.
    • Streaming Inserts: Providing insertId enables best-effort deduplication; disable it for high-volume workloads (higher throughput, no dedup).
    • BQ Storage Write API: High throughput.
  • Streaming Buffer: Data first lands in a temporary staging area to ensure durability and is available for querying immediately after acknowledgment; recently streamed rows can remain unavailable for table copy for up to 90 minutes.
  • File Formats (Fastest to Slowest): Avro, Parquet, ORC, uncompressed CSV & JSON, compressed (gzip).
    • Avro: Row-based, faster write.
    • Parquet: Columnar, faster read.
  • Encoding & Schema: UTF-8 (default). Auto-detection for CSV/JSON, or define manually.
  • Data Sources: GCS (Firestore export), local file, Drive, BigTable, BigQuery Data Transfer Service (DTS).
  • Data Transformation: During ingestion via Dataflow, Dataproc, or Create a table from a query result (CTAS).

External Data Sources

  • Supported External Sources (5): Bigtable, Cloud Spanner, Cloud SQL, Cloud Storage, Google Drive.
  • External Tables: Act like a standard table. The table metadata (including the table schema) is stored in BigQuery storage, but the data itself resides in the external source.
  • Federated Query: A way to send a query statement to an external database (Cloud SQL or Cloud Spanner) and get the result back as a temporary table.
    • Use Cases: ETL operations. Query small and frequently changing data without the need to reload it. Access data being ingested periodically.
    • Limitations: No consistency guarantees, potential performance impact, no caching.
  • Partitioning & Schema: Supports Hive partitioning for data in GCS. Schema flexibility allows adding columns without rewriting the entire table.
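A federated query uses the EXTERNAL_QUERY function with a connection resource; the project, connection, and table names below are hypothetical:

```sql
-- Run a query inside Cloud SQL and get the result back as a temporary table.
-- 'my-project.us.my-cloudsql-connection' is an illustrative connection ID.
SELECT customer_id, updated_at
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT customer_id, updated_at FROM customers WHERE updated_at > NOW() - INTERVAL 1 DAY'
);
```

The inner statement runs in the external database's own SQL dialect, which is why frequently changing operational data can be read without reloading it into BigQuery.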
07

Querying, Optimization & BI

Querying Data

  • Language: GoogleSQL (ANSI SQL) or Legacy SQL.
  • Query Jobs: Asynchronous actions (load, export, query, copy).
  • Results: Permanent or temporary tables.
  • Query Types:
    • Interactive: Default, immediate execution, counts toward daily/concurrent usage limits.
    • Batch: Queued and started when idle resources are available; if not started within 24 hours, the job's priority switches to interactive.

BI Engine

An in-memory analysis service that accelerates BigQuery queries for business intelligence (BI) and data visualization tools like Looker Studio and Looker.

  • Optimizes performance by storing data in memory, providing fast, sub-second query responses.
  • Ideal for interactive reporting and dashboards.
  • Automatically integrates with BigQuery, delivering high performance while maintaining real-time insights and reducing costs.
  • BI Engine SQL: For any BI tooling, e.g., Looker or Tableau.

Query & Cost Optimization

Avoid SELECT *
Select only the columns you need, or use SELECT * EXCEPT (...), to reduce the data scanned.
Sample Data
Use Preview Table (free) and Preview queries (--dry-run) to estimate costs.
Limit Costs
Set a maximum bytes billed limit. A LIMIT clause does not reduce cost (the full table is still scanned).
Filtering
Use filters and WHERE clauses early in the query to limit data processed.
Join Efficiency
Join largest table, then smallest, then decreasing size. Filter on both tables.
Approx. Functions
Use APPROX_COUNT_DISTINCT and similar approximate functions for faster results when exact accuracy isn't required.
Window Functions
Eliminate full table scans and temp tables for intermediate calculations. Avoid self-joins. Includes Aggregations, Navigation, Ranking functions.
Denormalization
Speed up queries by reducing joins (read-heavy dashboards). Trade-offs: increased storage, redundancy.
Handle Data Skew
Uneven distribution slowing queries. Fix via Pre-Aggregation on skewed column or using SHUFFLE Clause to redistribute data.
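Two of the cheapest wins above can be sketched in a few lines (table and column names are illustrative):

```sql
-- Project only the columns you need; EXCEPT drops the expensive ones,
-- so fewer bytes are scanned and billed.
SELECT * EXCEPT (raw_payload)
FROM logs.events_2024
WHERE event_date >= '2024-01-01';

-- Approximate distinct count: much faster and cheaper than COUNT(DISTINCT ...)
-- when a small statistical error is acceptable.
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM logs.events_2024;
```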
08

Analytics Hub & Data Clean Room

Analytics Hub

Built-in data exchange platform for secure & scalable data sharing across organizations. Supports a Data Mesh architecture.

  • Publish and subscribe model: Share datasets without duplicating data.
  • Shared datasets and exchanges: Organize data for sharing through curated datasets and access control via private or public exchanges.
  • Cross-domain Data Consumption: Users can easily access data products across different domains.
  • Data Discoverability: Data consumers can search for datasets across the organization in a standardized way.
  • Governance: In conjunction with Dataplex, teams maintain high data quality and privacy through federated policies.
  • Cost-effective sharing: Publishers only pay for storage, subscribers only pay for querying shared data.

Data Clean Room

Analyzing sensitive data with multiple parties while ensuring privacy and security without directly sharing the underlying data itself.

  • Query controls: Restrictions on the types of queries that can be executed.
  • Output privacy mechanisms via analysis rule:
    • Aggregation threshold: Enforces the minimum number of distinct entities that must be present in a set of data to be included in results.
    • Differential privacy: Anonymization technique that limits personal information revealed to prevent learning about an entity.
    • List overlap: Enforces an INNER JOIN and returns the exact rows where data intersects between two or more tables.
09

Data Recovery & BigLake

Data Recovery & Backup

  • Table clones: Create a lightweight, writable copy of an existing table (only changed data is billed) for dev/testing/sandboxes.
  • Table Snapshots: Preserve a read-only copy of a base table at a particular time. Retained indefinitely (or via custom expiration). Ideal for long-term backup and disaster recovery beyond the 7-day time travel window.
  • Time travel: Retrieves full table data for the past 2-7 days. Quick rollbacks, but no selective restore.
  • Export to GCS: In various formats (Avro, CSV) for long-term archival or disaster recovery.
  • Fail-Safe: Retains deleted data for an additional 7 days after time travel window for emergency recovery.
  • Copy dataset: Complete copy of dataset. Costly, not suited for long-term.
  • ELT pipelines: Organize data in monthly tables, export to GCS to restore specific time periods cleanly.
  • Regional outage: BigQuery does not offer automatic failover. Export to GCS in a different region to define your RPO.
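The first three recovery options above can be sketched as follows (table names are illustrative):

```sql
-- Time travel: read the table as it was one hour ago (within the 2–7 day window).
SELECT *
FROM sales.orders
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);

-- Snapshot: read-only, point-in-time copy for long-term backup.
CREATE SNAPSHOT TABLE sales.orders_backup
CLONE sales.orders;

-- Clone: lightweight, writable copy for dev/testing; only changed data is billed.
CREATE TABLE sales.orders_dev
CLONE sales.orders;
```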

BigLake

Query structured & unstructured data across analytics & AI engines with built-in governance.

  • Open Formats: Built on open formats (Parquet, ORC, Iceberg) to avoid vendor lock-in.
  • Multi-Cloud Governance: Discover and manage tables through Data Catalog.
  • Fine-Grained Security: Enforce consistent security policies (row filtering, column masking) removing need for file-level permissions.
  • Simplified Analytics: Query data using SQL interfaces regardless of storage engine.
  • Metadata caching: From GCS/S3 speeds up queries, especially for many files or Hive partition filters.
  • Moving cold (archive) data: To GCS, BigLake will query data even in lower-cost storage tiers.
  • BigLake over Omni: Eliminate data movement pipelines by directly querying data residing in S3 and Azure Data Lake.
10

Security, Governance & Monitoring

IAM & Roles

Granting an IAM role to an entity at different levels: Organization, Folder, Project, Dataset, Table, View.

  • BigQuery Admin: Full access to all resources.
  • Data Owner: Manage data/metadata, create/delete datasets and tables.
  • Data Editor: Query data, create/modify tables, load data.
  • Data Viewer: Read-only access to dataset metadata and table data. Cannot run queries without the Job User role at the project level.
  • User: Can run jobs, including queries.
  • Job User: Can only run jobs (like queries), but cannot access data directly unless granted a separate data access role (like Data Viewer). Often used for automated processes.
  • Metadata Viewer: Allows viewing metadata about resources.

Tags: Key-value pairs to conditionally apply IAM to tables & datasets.

Auditing & Monitoring

  • Cloud Audit Logs: Identify who accessed specific tables, analyze query execution patterns, track IAM/schema changes, ensure compliance, and troubleshoot.
  • Security Command Center: For security events and alerts.
  • Cloud Monitoring: Dashboards and alerts for BigQuery usage and performance.
  • Slot Consumption Metrics: Monitored via Cloud Monitoring, INFORMATION_SCHEMA.JOBS_*, Operational health dashboard, or Audit logs (Jobs API).
  • Metrics tracked: Query execution time, bytes processed, slots used, job started/failed, storage bytes, streaming latency.
  • Custom Quota: Set query usage per day per user (applies to all users in a project).

Data Governance, Quality & Lineage

VPC Service Controls

Security feature that defines a security perimeter around resources to mitigate the risk of data exfiltration.

  • Data Protection: Restricts access and establishes a defined perimeter.
  • Access Levels: Determine which users or services can access resources, ensuring only authorized users can interact.
  • Service Perimeters: Encompass multiple services, ensuring data does not leave defined boundaries.
  • Audit Logs: Detailed logs for activities within the perimeter.
  • Deny Rules: Restrict access from specified IP ranges.
  • Data Exfiltration Protection: Reduces risk of accidental or intentional exfiltration by enforcing sharing policies.

Change Data Capture (CDC)

CDC tracks changes from transactional databases to BigQuery in near-real time.

  • Log-based CDC: Preferred for efficient replication of inserts, updates, and deletes.
  • Staging Tables: Use to capture changes before applying them to reporting tables.
  • Apply Changes: In batches using DML MERGE to optimize performance and reduce overhead.
  • Stream CDC events: In real time for low-latency data availability.
  • Transactional Integrity: Ensures data consistency.
  • BigQuery supports schema evolution for dynamic structure changes.
  • Implement error handling for recovery in the pipeline.
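Applying staged CDC changes in batches with DML MERGE might look like this sketch (schema is hypothetical; it assumes an `op` column carrying 'I'/'U'/'D' from the CDC source):

```sql
-- Fold a batch of captured changes from a staging table into the reporting table.
MERGE reporting.customers AS t
USING staging.customer_changes AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN
  DELETE                                    -- source row was deleted
WHEN MATCHED THEN
  UPDATE SET name = s.name, email = s.email -- source row was updated
WHEN NOT MATCHED AND s.op != 'D' THEN
  INSERT (customer_id, name, email)
  VALUES (s.customer_id, s.name, s.email);  -- new source row
```

Batching changes this way keeps within DML limits and is far cheaper than applying one DML statement per change event.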
11

Functions, Errors & Misc

SQL Functions & Procedures

  • User-Defined Functions (UDF): A SQL or JavaScript function that accepts input, performs actions, and returns a value. Temporary or persistent.
  • Stored Procedure: A persistent script invoked with CALL. Can take input arguments and return values.
  • Routines: A resource type containing Stored procedures, UDFs, remote functions, and Table functions to re-use logic.
  • Scripting: Send multiple statements to BigQuery in one request.
  • Specialized Functions:
    • Working with arrays and JSON.
    • Geospatial (GIS): Geography data types/functions for spatial analysis.
    • Security functions for data encryption and access control.
    • Approximate and HyperLogLog++: For advanced analytics and efficient distinct counting.
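A minimal sketch of both UDF flavors (function and dataset names are illustrative):

```sql
-- Temporary SQL UDF: exists only for the duration of the query/script.
CREATE TEMP FUNCTION cleaned(s STRING) AS (TRIM(LOWER(s)));

SELECT cleaned('  Hello ') AS v;  -- 'hello'

-- Persistent JavaScript UDF, stored in a dataset for reuse.
CREATE OR REPLACE FUNCTION util.multiply(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS r"""
  return x * y;
""";
```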

Error Handling & Data Quality

  • Common Errors:
    • Invalid Schema: Data type mismatches between source data and table schema.
    • Query Timeouts: Queries exceeding configured limits or resource contention.
    • Ingestion Issues: Problems loading data, often due to format errors or permissions.
  • Troubleshooting Tips:
    • Analyze using EXPLAIN statements to identify bottlenecks in Query Plans.
    • Examine Cloud Logging for error messages.
    • Check Job Details history in the console for errors and stats.
  • Data Validation:
    • Table-level rules: BigQuery does not support CHECK constraints; enforce basic integrity via column data types and REQUIRED (non-null) modes, and validate ranges/values in your pipelines.
    • ETL Validation: Incorporate data quality checks early in pipelines.

Miscellaneous Features

Self-Assessment Questions

Q1. What is the main difference between Partitioning and Clustering in BigQuery?

Partitioning divides large tables into segments (e.g., by date), while Clustering sorts data within those partitions based on one or more columns for better performance.

Q2. How long are cached query results maintained in BigQuery?

Cached query results are maintained per-user and per-project for 24 hours (with some exceptions).

Q3. How do you flatten nested or repeated fields in a BigQuery table for querying?

You use the UNNEST operator, which flattens the data for easier querying and treats each item in a repeated field as a separate row.
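A self-contained sketch of UNNEST in action (the `orders` data is illustrative):

```sql
-- Each element of the repeated `items` field becomes its own row.
WITH orders AS (
  SELECT 1 AS order_id, ['pen', 'paper'] AS items
)
SELECT order_id, item
FROM orders, UNNEST(items) AS item;
-- Returns two rows: (1, 'pen') and (1, 'paper').
```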

Q4. What is the best approach to handle data skew when querying in BigQuery?

You can perform Pre-Aggregation on the skewed column to reduce the number of rows with the same value, or use a SHUFFLE Clause to redistribute data evenly before joining.

Q5. Can you query data residing on AWS S3 or Azure directly using BigQuery?

Yes, using BigLake over Omni, which eliminates the need for data movement pipelines by directly querying data residing in S3 and Azure Data Lake.