Study Notes — Certification Prep

BigQuery
Study Guide

Fully managed, Serverless (no-ops), powers fast SQL queries. Essential for analytics, cloud-based data warehouse.

Updated: April 2026
Version: 1.1
Category: Data Warehouse
Reading Time: ~30 min
Author: Michaël Bettan
01

Definition & Use Cases

What is BigQuery?

BigQuery is a fully managed, serverless (no-ops) data warehouse that powers fast SQL queries. It is essential for analytics and serving as a cloud-based data warehouse.

  • Availability: Regional (note: multi-region locations do not provide cross-region replication).
  • SLA: 99.99%

Use Cases

02

Architecture

Architecture

03

Billing & Jobs

Billing

Load, copy, export, re-cluster, and delete/metadata operations are free.

  1. Storage: Active storage (table or partition modified within the last 90 days) or Long-term storage (no modifications for 90 consecutive days; the lower rate is applied automatically).
  2. Querying: On-demand (pay-per-query) bills by bytes scanned with a shared capacity of up to 2,000 slots; Editions (pay per slot-hour) bills by slot usage, offers different feature sets, and supports optional slot commitments.
  3. Ingestion: Streaming Inserts and the Storage Write API are billed by the amount of data processed.

Jobs & Reservation

  • Slot: Unit of compute (CPU, RAM, network); more slots ≈ more query throughput.
  • Flex Slots: Available for bursting workloads.
  • Reservations (buckets of slots): Split a slot commitment and assign capacity at the organization, folder, or project level.
  • Idle slots: Shared and available for other reservations in your organization to use.
  • Fair scheduler: Keeps resource-heavy queries from monopolizing slot capacity.
  • Default pool: Load, export, and copy jobs run by default on a free shared pool of slots.
  • Assignments: QUERY, PIPELINE, ML_EXTERNAL.
04

Data Model & Schema Design

Data Model

  • Project: Logical grouping of datasets for billing purposes.
  • Datasets: Top-level container for tables and views.
  • Authorized dataset: Authorizes all views within a dataset to access data in another (streamlining view authorization).
    • Protect Sensitive data
    • Simplify access control management
  • Table: Stores individual records as rows; each row is composed of columns with an enforced data type. Native or external.
  • View: A virtual table defined by a SQL query.

Views & Data Operations

Views let you share query results without giving access to underlying source data.

  • Authorized Views: Views granted access to a source dataset, so users can query the results without permissions on the underlying tables and without seeing the view's SQL definition. Protects sensitive data.
  • Materialized Views: Precomputed views that periodically cache query results for performance (zero maintenance, fresh data; best effort refresh within 5 min of base table change).
  • Job: Asynchronous action (query, extract, load, copy).
  • DML (Data Manipulation): Update, insert, delete, MERGE. Has daily execution limits.
  • DDL (Data Definition): Create, alter, delete resources.
  • Schema Evolution: Supports schema changes without re-creating tables.
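As a sketch, a materialized view over a hypothetical `sales.orders` base table might look like this (dataset and column names are illustrative):

```sql
-- Precomputed daily revenue; BigQuery refreshes it best-effort
-- and transparently rewrites matching queries to use it.
CREATE MATERIALIZED VIEW sales.daily_revenue AS
SELECT
  order_date,
  SUM(amount) AS revenue
FROM sales.orders
GROUP BY order_date;
```

Queries against `sales.orders` that aggregate by `order_date` can then be served from the cached results instead of rescanning the base table.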

Schema Design

Data Types

NUMERIC / BIGNUMERIC
Financial analysis for precise currency calculations, stock prices, and reporting.
FLOAT64
Sensor data for capturing temperature readings, pressure, or other sensor values.
TIMESTAMP
Event logging for storing event timestamps with accurate time zone information.
STRUCT
Product catalog representing a product with attributes (name, description, price).
GEOGRAPHY
Location-based services for storing user locations, geofencing, and spatial analysis.
JSON
Exchange Log events representing structured data in a flexible format.
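A minimal sketch showing several of these types together, assuming a hypothetical `shop.products` table:

```sql
-- Illustrative table combining NUMERIC (exact currency), STRUCT, and JSON.
CREATE TABLE shop.products (
  id INT64,
  price NUMERIC,                                   -- exact decimal for currency
  attrs STRUCT<name STRING, description STRING>,   -- nested product attributes
  event_payload JSON                               -- flexible, schemaless data
);

-- Access nested and JSON fields directly:
SELECT
  id,
  attrs.name,
  JSON_VALUE(event_payload, '$.source') AS source
FROM shop.products;
```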

Primary and Foreign Keys

BigQuery doesn't validate primary and foreign key constraints the way an RDBMS does; they can only be declared NOT ENFORCED, but the query optimizer can still use them (e.g., for join elimination).
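As a sketch (table and column names are hypothetical), key constraints are declared NOT ENFORCED:

```sql
-- BigQuery accepts key constraints only as NOT ENFORCED;
-- they are metadata hints for the optimizer, not validated rules.
ALTER TABLE shop.orders
  ADD PRIMARY KEY (order_id) NOT ENFORCED;

ALTER TABLE shop.order_items
  ADD FOREIGN KEY (order_id)
  REFERENCES shop.orders (order_id) NOT ENFORCED;
```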

05

Partitioning & Clustering

Partitioning

Divides large tables into smaller, more manageable segments.

  • Purpose: Query performance (faster by scanning only relevant partitions) and Cost reduction (charges based only on processed data).
  • Limits: Up to 10,000 partitions (~27 years of data). Daily limit of 30,000 partition modifications per table. Rate limit: 50 updates every 10 seconds.
  • Types:
    • Ingestion Time: Based on ingestion date. Pseudo-column _PARTITIONTIME.
    • Time-Unit Column: Based on date or timestamp column.
    • Integer Range: Ranges of values in a specific column.
  • Frequencies: Daily (default), hourly, monthly, or yearly.
  • Require Partition Filter: A table-level option that forces every query to include a WHERE clause on the partitioning column.
  • Excessive partitioning: (thousands of partitions) can negatively impact performance.
  • TABLE_DATE_RANGE (Legacy SQL): Queries a date range of day-sharded tables instead of manually listing each one.
  • Partition Decorators: Reference a partition in a table to write data specifically to it (table_name$partition_id).
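Putting the pieces above together, a minimal sketch of a daily-partitioned table with a required partition filter (table and column names are illustrative):

```sql
-- Time-unit column partitioning (daily) with an enforced partition filter.
CREATE TABLE logs.events (
  event_ts TIMESTAMP,
  user_id  STRING,
  payload  JSON
)
PARTITION BY DATE(event_ts)
OPTIONS (require_partition_filter = TRUE);

-- Queries must then filter on the partitioning column,
-- so only the relevant partitions are scanned (and billed):
SELECT COUNT(*)
FROM logs.events
WHERE DATE(event_ts) = '2024-06-01';
```

A partition decorator such as `logs.events$20240601` could then be used (e.g., with a load job) to write data directly into that one partition.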

Clustering & Sharding

Sort data within partitions based on one or more columns.

  • Purpose: Further improves query performance, especially for queries with filtering and aggregation, by co-locating related data.
  • Automatic Reclustering: Maintained in the background for efficiency (free).
  • Hierarchical Clustering: Cluster a table by multiple columns to create a hierarchical organization (e.g., country, then state/region, then city).
  • Partitioning and clustering are powerful when used together (e.g., partition by date, cluster by heavily filtered column).
  • Sharding (Legacy): A similar concept (less preferred) with a single table per day. Uses UNION in query to scan multiple tables. Partitioning within a single table is easier to manage and leverages query optimizations more effectively than creating multiple sharded tables.
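The recommended combination of partitioning plus hierarchical clustering can be sketched as follows (names are illustrative):

```sql
-- Partition by date, then cluster by columns in decreasing granularity:
-- country → region → city. Filters on these columns prune co-located blocks.
CREATE TABLE logs.page_views (
  view_ts TIMESTAMP,
  country STRING,
  region  STRING,
  city    STRING,
  url     STRING
)
PARTITION BY DATE(view_ts)
CLUSTER BY country, region, city;
```

Column order in CLUSTER BY matters: queries filtering on `country` (or `country` and `region`) benefit most, since clustering sorts by the columns left to right.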
06

Data Ingestion & External Sources

Data Ingestion

  • Batch vs. Real-time: Factors include data volume, frequency, and latency requirements.
  • Batch Loading: For large datasets (free, shared resources).
  • Streaming Loading: For real-time use cases.
    • Streaming Inserts: Providing insertId enables best-effort deduplication; disable it for high-volume workloads (higher throughput, no dedup).
    • BQ Storage Write API: High throughput.
  • Streaming Buffer: Data first lands in a temporary staging area to ensure durability and is available for querying immediately after acknowledgment; recently streamed rows can remain unavailable for table copy for up to 90 minutes.
  • File Formats (Fastest to Slowest): Avro, Parquet, ORC, uncompressed CSV & JSON, compressed (gzip).
    • Avro: Row-based, faster write.
    • Parquet: Columnar, faster read.
  • Encoding & Schema: UTF-8 (default). Auto-detection for CSV/JSON, or define manually.
  • Data Sources: GCS (Firestore export), local file, Drive, BigTable, BigQuery Data Transfer Service (DTS).
  • Data Transformation: During ingestion via Dataflow, Dataproc, or Create a table from a query result (CTAS).

External Data Sources

  • Supported External Sources (5): Bigtable, Cloud Spanner, Cloud SQL, Cloud Storage, Google Drive.
  • External Tables: Act like a standard table. The table metadata (including the table schema) is stored in BigQuery storage, but the data itself resides in the external source.
  • Federated Query: A way to send a query statement to an external database (Cloud SQL or Cloud Spanner) and get the result back as a temporary table.
    • Use Cases: ETL operations. Query small and frequently changing data without the need to reload it. Access data being ingested periodically.
    • Limitations: No consistency guarantees, potential performance impact, no caching.
  • Partitioning & Schema: Supports Hive partitioning for data in GCS. Schema flexibility allows adding columns without rewriting the entire table.
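A federated query uses the EXTERNAL_QUERY function with a connection resource; the project, connection, and table names below are hypothetical:

```sql
-- Run a query inside Cloud SQL and get the result back as a temporary table.
-- 'my-project.us.my-cloudsql-connection' is an illustrative connection ID.
SELECT customer_id, updated_at
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT customer_id, updated_at FROM customers WHERE updated_at > NOW() - INTERVAL 1 DAY'
);
```

The inner statement runs in the external database's own SQL dialect, which is why frequently changing operational data can be read without reloading it into BigQuery.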
07

Querying, Optimization & BI

Querying Data

  • Language: GoogleSQL (ANSI SQL) or Legacy SQL.
  • Query Jobs: Asynchronous actions (load, export, query, copy).
  • Results: Permanent or temporary tables.
  • Query Types:
    • Interactive: Default, immediate execution, counts toward daily/concurrent usage limits.
    • Batch: Queued and started when idle resources are available; if not started within 24 hours, the job's priority switches to interactive.

BI Engine

An in-memory analysis service that accelerates BigQuery queries for business intelligence (BI) and data visualization tools like Looker Studio and Looker.

  • Optimizes performance by storing data in memory, providing fast, sub-second query responses.
  • Ideal for interactive reporting and dashboards.
  • Automatically integrates with BigQuery, delivering high performance while maintaining real-time insights and reducing costs.
  • BI Engine SQL: For any BI tooling, e.g., Looker or Tableau.

Query & Cost Optimization

Avoid SELECT *
Select only the columns you need, or use SELECT * EXCEPT (...), to reduce the data scanned.
Sample Data
Use Preview Table (free) and Preview queries (--dry-run) to estimate costs.
Limit Costs
Set a maximum bytes billed limit. A LIMIT clause does not reduce cost (the full table is still scanned).
Filtering
Use filters and WHERE clauses early in the query to limit data processed.
Join Efficiency
Join largest table, then smallest, then decreasing size. Filter on both tables.
Approx. Functions
Use APPROX_COUNT_DISTINCT and similar approximate functions for faster results when exact accuracy isn't required.
Window Functions
Eliminate full table scans and temp tables for intermediate calculations. Avoid self-joins. Includes Aggregations, Navigation, Ranking functions.
Denormalization
Speed up queries by reducing joins (read-heavy dashboards). Trade-offs: increased storage, redundancy.
Handle Data Skew
Uneven distribution slowing queries. Fix via Pre-Aggregation on skewed column or using SHUFFLE Clause to redistribute data.
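Two of the cheapest wins above can be sketched in a few lines (table and column names are illustrative):

```sql
-- Project only the columns you need; EXCEPT drops the expensive ones,
-- so fewer bytes are scanned and billed.
SELECT * EXCEPT (raw_payload)
FROM logs.events_2024
WHERE event_date >= '2024-01-01';

-- Approximate distinct count: much faster and cheaper than COUNT(DISTINCT ...)
-- when a small statistical error is acceptable.
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM logs.events_2024;
```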
08

Analytics Hub & Data Clean Room

Analytics Hub

Built-in data exchange platform for secure & scalable data sharing across organizations. Supports a Data Mesh architecture.

  • Publish and subscribe model: Share datasets without duplicating data.
  • Shared datasets and exchanges: Organize data for sharing through curated datasets and access control via private or public exchanges.
  • Cross-domain Data Consumption: Users can easily access data products across different domains.
  • Data Discoverability: Data consumers can search for datasets across the organization in a standardized way.
  • Governance: In conjunction with Dataplex, teams maintain high data quality and privacy through federated policies.
  • Cost-effective sharing: Publishers only pay for storage, subscribers only pay for querying shared data.

Data Clean Room

Analyzing sensitive data with multiple parties while ensuring privacy and security without directly sharing the underlying data itself.

  • Query controls: Restrictions on the types of queries that can be executed.
  • Output privacy mechanisms via analysis rule:
    • Aggregation threshold: Enforces the minimum number of distinct entities that must be present in a set of data to be included in results.
    • Differential privacy: Anonymization technique that limits personal information revealed to prevent learning about an entity.
    • List overlap: Enforces an INNER JOIN and returns the exact rows where data intersects between two or more tables.
09

Data Recovery & BigLake

Data Recovery & Backup

  • Table clones: Create a lightweight, writable copy of an existing table (only changed data is billed) for dev/testing/sandboxes.
  • Table Snapshots: Preserve a read-only copy of a base table at a particular time. Retained indefinitely (or via custom expiration). Ideal for long-term backup and disaster recovery beyond the 7-day time travel window.
  • Time travel: Retrieves full table data for the past 2-7 days. Quick rollbacks, but no selective restore.
  • Export to GCS: In various formats (Avro, CSV) for long-term archival or disaster recovery.
  • Fail-Safe: Retains deleted data for an additional 7 days after time travel window for emergency recovery.
  • Copy dataset: Complete copy of dataset. Costly, not suited for long-term.
  • ELT pipelines: Organize data in monthly tables, export to GCS to restore specific time periods cleanly.
  • Regional outage: BigQuery does not offer automatic failover. Export to GCS in a different region to define your RPO.
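The first three recovery options above can be sketched as follows (table names are illustrative):

```sql
-- Time travel: read the table as it was one hour ago (within the 2–7 day window).
SELECT *
FROM sales.orders
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);

-- Snapshot: read-only, point-in-time copy for long-term backup.
CREATE SNAPSHOT TABLE sales.orders_backup
CLONE sales.orders;

-- Clone: lightweight, writable copy for dev/testing; only changed data is billed.
CREATE TABLE sales.orders_dev
CLONE sales.orders;
```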

BigLake

Query structured & unstructured data across analytics & AI engines with built-in governance.

  • Open Formats: Built on open formats (Parquet, ORC, Iceberg) to avoid vendor lock-in.
  • Multi-Cloud Governance: Discover and manage tables through Data Catalog.
  • Fine-Grained Security: Enforce consistent security policies (row filtering, column masking) removing need for file-level permissions.
  • Simplified Analytics: Query data using SQL interfaces regardless of storage engine.
  • Metadata caching: From GCS/S3 speeds up queries, especially for many files or Hive partition filters.
  • Moving cold (archive) data: To GCS, BigLake will query data even in lower-cost storage tiers.
  • BigLake over Omni: Eliminate data movement pipelines by directly querying data residing in S3 and Azure Data Lake.
10

Security, Governance & Monitoring

IAM & Roles

Granting an IAM role to an entity at different levels: Organization, Folder, Project, Dataset, Table, View.

  • BigQuery Admin: Full access to all resources.
  • Data Owner: Manage data/metadata, create/delete datasets and tables.
  • Data Editor: Query data, create/modify tables, load data.
  • Data Viewer: Read-only access to dataset metadata and table data. Cannot run queries without the Job User role at the project level.
  • User: Can run jobs, including queries.
  • Job User: Can only run jobs (like queries), but cannot access data directly unless granted a separate data access role (like Data Viewer). Often used for automated processes.
  • Metadata Viewer: Allows viewing metadata about resources.

Tags: Key-value pairs to conditionally apply IAM to tables & datasets.

Auditing & Monitoring

  • Cloud Audit Logs: Identify who accessed specific tables, analyze query execution patterns, track IAM/schema changes, ensure compliance, and troubleshoot.
  • Security Command Center: For security events and alerts.
  • Cloud Monitoring: Dashboards and alerts for BigQuery usage and performance.
  • Slot Consumption Metrics: Monitored via Cloud Monitoring, INFORMATION_SCHEMA.JOBS_*, Operational health dashboard, or Audit logs (Jobs API).
  • Metrics tracked: Query execution time, bytes processed, slots used, job started/failed, storage bytes, streaming latency.
  • Custom Quota: Set query usage per day per user (applies to all users in a project).

Data Governance, Quality & Lineage

VPC Service Controls

Security feature that defines a security perimeter around resources to mitigate the risk of data exfiltration.

  • Data Protection: Restricts access and establishes a defined perimeter.
  • Access Levels: Determine which users or services can access resources, ensuring only authorized users can interact.
  • Service Perimeters: Encompass multiple services, ensuring data does not leave defined boundaries.
  • Audit Logs: Detailed logs for activities within the perimeter.
  • Deny Rules: Restrict access from specified IP ranges.
  • Data Exfiltration Protection: Reduces risk of accidental or intentional exfiltration by enforcing sharing policies.

Change Data Capture (CDC)

CDC tracks changes from transactional databases to BigQuery in near-real time.

  • Log-based CDC: Preferred for efficient replication of inserts, updates, and deletes.
  • Staging Tables: Use to capture changes before applying them to reporting tables.
  • Apply Changes: In batches using DML MERGE to optimize performance and reduce overhead.
  • Stream CDC events: In real time for low-latency data availability.
  • Transactional Integrity: Ensures data consistency.
  • BigQuery supports schema evolution for dynamic structure changes.
  • Implement error handling for recovery in the pipeline.
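Applying staged CDC changes in batches with DML MERGE might look like this sketch (schema is hypothetical; it assumes an `op` column carrying 'I'/'U'/'D' from the CDC source):

```sql
-- Fold a batch of captured changes from a staging table into the reporting table.
MERGE reporting.customers AS t
USING staging.customer_changes AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN
  DELETE                                    -- source row was deleted
WHEN MATCHED THEN
  UPDATE SET name = s.name, email = s.email -- source row was updated
WHEN NOT MATCHED AND s.op != 'D' THEN
  INSERT (customer_id, name, email)
  VALUES (s.customer_id, s.name, s.email);  -- new source row
```

Batching changes this way keeps within DML limits and is far cheaper than applying one DML statement per change event.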
11

Functions, Errors & Misc

SQL Functions & Procedures

  • User-Defined Functions (UDF): A SQL or JavaScript function that accepts input, performs actions, and returns a value. Temporary or persistent.
  • Stored Procedure: A persistent script invoked with CALL. Can take input arguments and return values.
  • Routines: A resource type containing Stored procedures, UDFs, remote functions, and Table functions to re-use logic.
  • Scripting: Send multiple statements to BigQuery in one request.
  • Specialized Functions:
    • Working with arrays and JSON.
    • Geospatial (GIS): Geography data types/functions for spatial analysis.
    • Security functions for data encryption and access control.
    • Approximate and HyperLogLog++: For advanced analytics and efficient distinct counting.
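A minimal sketch of both UDF flavors (function and dataset names are illustrative):

```sql
-- Temporary SQL UDF: exists only for the duration of the query/script.
CREATE TEMP FUNCTION cleaned(s STRING) AS (TRIM(LOWER(s)));

SELECT cleaned('  Hello ') AS v;  -- 'hello'

-- Persistent JavaScript UDF, stored in a dataset for reuse.
CREATE OR REPLACE FUNCTION util.multiply(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS r"""
  return x * y;
""";
```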

Error Handling & Data Quality

  • Common Errors:
    • Invalid Schema: Data type mismatches between source data and table schema.
    • Query Timeouts: Queries exceeding configured limits or resource contention.
    • Ingestion Issues: Problems loading data, often due to format errors or permissions.
  • Troubleshooting Tips:
    • Analyze using EXPLAIN statements to identify bottlenecks in Query Plans.
    • Examine Cloud Logging for error messages.
    • Check Job Details history in the console for errors and stats.
  • Data Validation:
    • Table-level rules: BigQuery does not support CHECK constraints; enforce basic integrity via column data types and REQUIRED (non-null) modes, and validate ranges/values in your pipelines.
    • ETL Validation: Incorporate data quality checks early in pipelines.

Miscellaneous Features

Self-Assessment Questions

Q1. What is the main difference between Partitioning and Clustering in BigQuery?

Partitioning divides large tables into segments (e.g., by date), while Clustering sorts data within those partitions based on one or more columns for better performance.

Q2. How long are cached query results maintained in BigQuery?

Cached query results are maintained per-user and per-project for 24 hours (with some exceptions).

Q3. How do you flatten nested or repeated fields in a BigQuery table for querying?

You use the UNNEST operator, which flattens the data for easier querying and treats each item in a repeated field as a separate row.
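A self-contained sketch of UNNEST in action (the `orders` data is illustrative):

```sql
-- Each element of the repeated `items` field becomes its own row.
WITH orders AS (
  SELECT 1 AS order_id, ['pen', 'paper'] AS items
)
SELECT order_id, item
FROM orders, UNNEST(items) AS item;
-- Returns two rows: (1, 'pen') and (1, 'paper').
```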

Q4. What is the best approach to handle data skew when querying in BigQuery?

You can perform Pre-Aggregation on the skewed column to reduce the number of rows with the same value, or use a SHUFFLE Clause to redistribute data evenly before joining.

Q5. Can you query data residing on AWS S3 or Azure directly using BigQuery?

Yes, using BigLake over Omni, which eliminates the need for data movement pipelines by directly querying data residing in S3 and Azure Data Lake.