Study Notes — Certification Prep

Google Cloud Database
Bigtable

Comprehensive study notes covering Bigtable architecture, schema design, operational best practices, and configuration. Everything you need to pass with confidence.

Author: Michaël Bettan
Topic: Bigtable Operations
01. Core Concepts & Overview

Definition
A managed, scalable NoSQL database service designed for large analytical and operational workloads.
Availability
Available in multi-regional or regional deployments.
SLA
Offers up to a 99.999% availability SLA for replicated multi-region instances using multi-cluster routing.

Use Cases

  • Petabyte-scale analytical and operational workloads
  • High throughput analysis & Low latency writes
  • Key-based reads
  • Time-series workloads (Financial, IoT, etc.)
  • HBase legacy migration
  • Real-time serving (ad serving, mobile app recommendations, etc.)
  • Batch analytics (ML, Data Mining, etc)

Key Capabilities

  • Wide-column database (a sparse, multidimensional sorted map; each row can hold many columns)
  • Managed service (but node provisioning is still required)
  • Horizontal scaling
  • HBase API compatibility (interoperable with existing HBase clients and tooling)
  • Cluster configuration required (nodes, disk type, zone, etc.)
  • Add more clusters (replication) for high availability
  • Operations are atomic at the row level
02. Architecture & Storage Model

Cell
Each row/column intersection can contain multiple cells (versions of data).
Rows & Row Key
A single entity is indexed by a single row-key. The table has only one index based on this row key. Rows are sorted lexicographically by row key.
Columns & Families
Columns are grouped into column families (groups of related columns in a table).
Tablets
A Table is sharded into blocks of contiguous rows called Tablets. This is the smallest logical structure for grouping/storing data, associated with a specific node.
SSTable
Tablets are stored in Colossus in the SSTable format: persistent, ordered, immutable maps from keys to values (arbitrary byte strings).
Nodes & Colossus
Nodes themselves store only metadata (pointers to SSTable locations); the data lives in Colossus. Colossus's shared log records all writes before they are committed to SSTables. The cluster routes each request to the appropriate node.
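The SSTable described above is essentially a persistent, ordered, immutable map from byte-string keys to values. A minimal Python sketch of that idea (class and method names are invented for illustration; this is not Bigtable's actual implementation):

```python
from bisect import bisect_left

class SSTableSketch:
    """Toy model of an SSTable: an ordered, immutable map from
    byte-string keys to values. Illustrative only."""

    def __init__(self, items):
        # Keys sort lexicographically as byte strings, frozen at build time.
        self._items = sorted(items)                 # list of (key, value)
        self._keys = [k for k, _ in self._items]

    def get(self, key):
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._items[i][1]
        return None

    def scan(self, start, end):
        # Range scans are cheap because keys are stored in sorted order.
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, end)
        return self._items[lo:hi]

table = SSTableSketch([(b"user2#profile", b"v1"), (b"user10#profile", b"v2")])
# Note the lexicographic byte order: b"user10..." sorts BEFORE b"user2...",
# because the byte "1" < "2" -- a key fact for row-key design.
```

Because ordering is byte-wise rather than numeric, zero-padding numeric key components keeps them in the order you expect.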
03. Instance Configuration & Billing

Instance Configuration

  • Instance is the container for your clusters.
  • SSD: Lower latency and more rows read per second. Typically used for real-time serving.
  • HDD: Higher latency for random reads. Good performance on scans and used for batch analytics.
  • Important: Changing disk type requires a new instance (cannot flip HDD to SSD).
  • Scaling: Manual scaling (add/remove nodes without downtime) or built-in autoscaling.
  • Replication: Add clusters to automatically replicate (cross-regional supported) to isolate serving from batch, or improve availability.
  • App profile: Defines routing policy (single-cluster or multi-cluster routing) and controls single-row transactions.

Billing & Security

  • Compute Billing: Type of instance and number of nodes in your clusters.
  • Storage Billing: Amount of storage your tables use (GiB).
  • Network Billing: Amount of network bandwidth used (GiB).
  • Backup Billing: Billed only for the backup storage you use. A backup can never be larger than the original table.
  • Security Access Control: Managed via IAM at Project or Instance level. Roles include Read, Write, and Manage.
04. Schema Design & Row Keys

Schema Considerations

Designing a schema for Bigtable differs from a typical RDBMS. There is no support for joins (and no foreign-key concept). Each table has only one index: the row key (up to 4 KB). All operations are atomic (ACID) only at the row level, and both reads and writes should be distributed evenly across the key space.

Row Key Design Characteristics / Examples
  • Good row keys: distribute load evenly; design from a common value to a granular value; include a timestamp if needed, but never at the beginning.
  • Poor row keys: sequential numeric IDs; timestamps alone or at the start; hashed values; values expressed as raw bytes.
  • Hot-spotting avoidance: avoid auto-incrementing keys or time values alone, as sequential writes repeatedly hit the same node.

Designing Row Keys
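A minimal sketch of the good-row-key pattern described above: common value first, granular value next, timestamp last. The helper and field names here are invented for illustration:

```python
def make_row_key(metric: str, device_id: str, ts_ms: int) -> bytes:
    """Build a time-series row key from common to granular:
    metric -> device -> reversed timestamp.
    The timestamp goes LAST, never first, to avoid hot-spotting.
    Reversing it (max value minus ts) makes the newest data sort
    first within each device's key range."""
    rev_ts = (2**63 - 1) - ts_ms
    return f"{metric}#{device_id}#{rev_ts:020d}".encode()

older = make_row_key("temp", "device-42", 1_700_000_000_000)
newer = make_row_key("temp", "device-42", 1_700_000_001_000)
assert newer < older  # newest row sorts first under the shared prefix
```

With this layout, a prefix scan on `temp#device-42#` returns that device's readings newest-first, while writes for different devices land in different key ranges.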

05. Best Practices & Limitations

  • Tables per instance: 1,000
  • Column families per table: 100
  • Max cell size: 100 MB
  • Max row size: 256 MB

Best Practices

06. Garbage Collection & Misc

Garbage Collection

  • Asynchronous and Delayed: runs continuously in the background. Data marked for deletion can take up to a week to be fully removed.
  • Column Family Specific: GC rules are defined independently at the column family level.
  • Multiple Deletion Criteria: configure rules based on age (TTL), number of cell versions, or a combination.
  • Tombstoning: When all cells within a row/qualifier are deleted, it is marked as "tombstoned" and removed during GC.
  • Filtering for Accuracy: Apply timestamp range filters in queries to exclude expired cells that garbage collection has not yet physically removed.
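The age and version criteria above can be simulated in a few lines. This is a conceptual sketch with invented function names; applying both filters in sequence corresponds to a union of deletion rules (a cell is dropped if either condition matches):

```python
def apply_gc(cells, max_versions=None, max_age_ms=None, now_ms=0):
    """Simulate column-family GC rules.
    cells: list of (timestamp_ms, value) in any order.
    Returns the cells that survive garbage collection, newest first."""
    survivors = sorted(cells, key=lambda c: c[0], reverse=True)
    if max_age_ms is not None:
        # Age (TTL) rule: drop cells older than max_age_ms.
        survivors = [c for c in survivors if now_ms - c[0] <= max_age_ms]
    if max_versions is not None:
        # Version rule: keep only the N most recent cells.
        survivors = survivors[:max_versions]
    return survivors

cells = [(100, b"a"), (200, b"b"), (300, b"c")]
# Keep at most 2 versions AND drop anything older than 150 ms at now=300:
assert apply_gc(cells, max_versions=2, max_age_ms=150, now_ms=300) == \
    [(300, b"c"), (200, b"b")]
```

Remember that real GC is delayed: cells matching these rules may still be returned by reads for up to a week unless you filter them out.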

Miscellaneous

  • Related columns should be placed in a column family. A single table can have multiple column families.
  • Related data should be in one table, not multiple tables.
  • Bigtable does not support secondary indexes.
  • Row keys are specified for each row, not for each column family.
  • Command line usage: use cbt tool (preferred) or HBase shell.

Self-Assessment Questions

Q1. What happens if you use a sequential numeric ID or a timestamp as the beginning of your row key?

It leads to hot-spotting when writing, as all sequential writes will hit the exact same node instead of being distributed across the cluster.

Q2. You initially provisioned an HDD Bigtable instance for batch analytics, but now want to serve real-time app traffic requiring lower latency. How do you switch to SSD?

Changing disk type requires creating an entirely new instance and migrating the data. You cannot toggle a setting from HDD to SSD.

Q3. At what level are operations guaranteed to be atomic (ACID) in Bigtable?

Operations are atomic only at the row level.

Q4. Because Garbage Collection is asynchronous and can take up to a week, how do you ensure deleted data isn't returned in your queries?

By applying timestamp range filters within your queries, which excludes expired data that garbage collection has not yet physically removed.

Q5. What is a common way to avoid hot-spotting for row keys when writing time-series data?

Use field promotion (shifting a column to be part of the row key before the timestamp) or salting (hashing the timestamp plus row key).
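The salting technique from Q5 can be sketched as follows; the bucket count and key format are assumptions chosen for illustration:

```python
import hashlib

NUM_BUCKETS = 8  # assumption: pick roughly in line with cluster size

def salted_key(device_id: str, ts_ms: int) -> bytes:
    """Prefix the row key with a deterministic hash bucket so that
    concurrent time-series writes spread across key ranges (and
    therefore across nodes) instead of hot-spotting one range."""
    bucket = hashlib.sha256(device_id.encode()).digest()[0] % NUM_BUCKETS
    return f"{bucket}#{device_id}#{ts_ms:013d}".encode()

# Writes for many devices now land under different bucket prefixes.
keys = [salted_key(f"dev-{i}", 1_700_000_000_000) for i in range(100)]
```

The trade-off: reading all data back requires fanning out one scan per bucket prefix and merging the results, which is why field promotion is usually preferred when a natural high-cardinality field exists.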