Study Notes — Certification Prep

Google Cloud Database
Bigtable

Comprehensive study notes covering Bigtable architecture, schema design, operational best practices, and configuration. Everything you need to pass with confidence.

Author: Michaël Bettan
Topic: Bigtable Operations
01. Core Concepts & Overview

Definition
A managed, scalable NoSQL database service designed for large analytical and operational workloads.
Availability
Available in multi-regional or regional deployments.
SLA
Offers up to a 99.999% availability SLA for replicated multi-region instances using multi-cluster routing.

Use Cases

  • Petabyte-scale analytical and operational workloads
  • High throughput analysis & Low latency writes
  • Key-based reads
  • Time-series workloads (Financial, IoT, etc.)
  • HBase legacy migration
  • Real-time serving (ad serving, mobile app recommendations, etc.)
  • Batch analytics (ML, Data Mining, etc)

Key Capabilities

  • Wide-column database (a sparse, multidimensional sorted map; each row can hold many columns)
  • Managed service (but node provisioning is still required)
  • Horizontal scaling
  • HBase API compatibility (interoperable with existing HBase clients and tooling)
  • Cluster configuration required (nodes, disk type, zone, etc.)
  • Add more clusters (replication) for high availability
  • Operations are atomic at the row level
02. Architecture & Storage Model

Cell
Each row/column intersection can contain multiple cells (versions of data).
Rows & Row Key
A single entity is indexed by a single row-key. The table has only one index based on this row key. Rows are sorted lexicographically by row key.
Columns & Families
Columns are grouped into column families (groups of related columns in a table).
Tablets
A Table is sharded into blocks of contiguous rows called Tablets. This is the smallest logical structure for grouping/storing data, associated with a specific node.
SSTable
Tablets are stored in Colossus in the SSTable format: persistent, ordered, immutable maps from keys to values (arbitrary byte strings).
Nodes & Colossus
Nodes themselves store only metadata (pointers to SSTable locations); the data lives in Colossus. Colossus's shared log records all writes before they are committed to SSTables. The cluster routes each request to the appropriate node.
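The SSTable described above is essentially a persistent, ordered, immutable map from byte-string keys to values. A minimal Python sketch of that idea (class and method names are invented for illustration; this is not Bigtable's actual implementation):

```python
from bisect import bisect_left

class SSTableSketch:
    """Toy model of an SSTable: an ordered, immutable map from
    byte-string keys to values. Illustrative only."""

    def __init__(self, items):
        # Keys sort lexicographically as byte strings, frozen at build time.
        self._items = sorted(items)                 # list of (key, value)
        self._keys = [k for k, _ in self._items]

    def get(self, key):
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._items[i][1]
        return None

    def scan(self, start, end):
        # Range scans are cheap because keys are stored in sorted order.
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, end)
        return self._items[lo:hi]

table = SSTableSketch([(b"user2#profile", b"v1"), (b"user10#profile", b"v2")])
# Note the lexicographic byte order: b"user10..." sorts BEFORE b"user2...",
# because the byte "1" < "2" -- a key fact for row-key design.
```

Because ordering is byte-wise rather than numeric, zero-padding numeric key components keeps them in the order you expect.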
03. Instance Configuration & Billing

Instance Configuration

  • Instance is the container for your clusters.
  • SSD: Lower latency and more rows read per second. Typically used for real-time serving.
  • HDD: Higher latency for random reads. Good performance on scans and used for batch analytics.
  • Important: Changing disk type requires a new instance (cannot flip HDD to SSD).
  • Scaling: Manual scaling (add/remove nodes without downtime) or built-in autoscaling.
  • Replication: Add clusters to automatically replicate (cross-regional supported) to isolate serving from batch, or improve availability.
  • App profile: Defines routing policy (single-cluster or multi-cluster routing) and controls single-row transactions.

Billing & Security

  • Compute Billing: Type of instance and number of nodes in your clusters.
  • Storage Billing: Amount of storage your tables use (GiB).
  • Network Billing: Amount of network bandwidth used (GiB).
  • Backup Billing: Billed only for the backup storage you use. A backup can never be larger than the original table.
  • Security Access Control: Managed via IAM at Project or Instance level. Roles include Read, Write, and Manage.
04. Schema Design & Row Keys

Schema Considerations

Designing a schema for Bigtable differs from a typical RDBMS. There is no support for joins (and no foreign-key concept). Each table has only one index: the row key (up to 4 KB). All operations are atomic (ACID) only at the row level, and both reads and writes should be distributed evenly across the key space.

Row Key Design Characteristics / Examples
  • Good row keys: distribute load evenly; design from a common value to a granular value; include a timestamp if needed, but never at the beginning.
  • Poor row keys: sequential numeric IDs; timestamps alone or at the start; hashed values; values expressed as raw bytes.
  • Hot-spotting avoidance: avoid auto-incrementing keys or time values alone, as sequential writes repeatedly hit the same node.

Designing Row Keys
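A minimal sketch of the good-row-key pattern described above: common value first, granular value next, timestamp last. The helper and field names here are invented for illustration:

```python
def make_row_key(metric: str, device_id: str, ts_ms: int) -> bytes:
    """Build a time-series row key from common to granular:
    metric -> device -> reversed timestamp.
    The timestamp goes LAST, never first, to avoid hot-spotting.
    Reversing it (max value minus ts) makes the newest data sort
    first within each device's key range."""
    rev_ts = (2**63 - 1) - ts_ms
    return f"{metric}#{device_id}#{rev_ts:020d}".encode()

older = make_row_key("temp", "device-42", 1_700_000_000_000)
newer = make_row_key("temp", "device-42", 1_700_000_001_000)
assert newer < older  # newest row sorts first under the shared prefix
```

With this layout, a prefix scan on `temp#device-42#` returns that device's readings newest-first, while writes for different devices land in different key ranges.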

05. Best Practices & Limitations

  • Tables per instance: 1,000
  • Column families per table: 100
  • Max cell size: 100 MB
  • Max row size: 256 MB

Best Practices

06. Garbage Collection & Misc

Garbage Collection

  • Asynchronous and Delayed: runs continuously in the background. Data marked for deletion can take up to a week to be fully removed.
  • Column Family Specific: GC rules are defined independently at the column family level.
  • Multiple Deletion Criteria: configure rules based on age (TTL), number of cell versions, or a combination.
  • Tombstoning: When all cells within a row/qualifier are deleted, it is marked as "tombstoned" and removed during GC.
  • Filtering for Accuracy: Apply timestamp range filters in queries to exclude expired cells that garbage collection has not yet physically removed.
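The age and version criteria above can be simulated in a few lines. This is a conceptual sketch with invented function names; applying both filters in sequence corresponds to a union of deletion rules (a cell is dropped if either condition matches):

```python
def apply_gc(cells, max_versions=None, max_age_ms=None, now_ms=0):
    """Simulate column-family GC rules.
    cells: list of (timestamp_ms, value) in any order.
    Returns the cells that survive garbage collection, newest first."""
    survivors = sorted(cells, key=lambda c: c[0], reverse=True)
    if max_age_ms is not None:
        # Age (TTL) rule: drop cells older than max_age_ms.
        survivors = [c for c in survivors if now_ms - c[0] <= max_age_ms]
    if max_versions is not None:
        # Version rule: keep only the N most recent cells.
        survivors = survivors[:max_versions]
    return survivors

cells = [(100, b"a"), (200, b"b"), (300, b"c")]
# Keep at most 2 versions AND drop anything older than 150 ms at now=300:
assert apply_gc(cells, max_versions=2, max_age_ms=150, now_ms=300) == \
    [(300, b"c"), (200, b"b")]
```

Remember that real GC is delayed: cells matching these rules may still be returned by reads for up to a week unless you filter them out.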

Miscellaneous

  • Related columns should be placed in a column family. A single table can have multiple column families.
  • Related data should be in one table, not multiple tables.
  • Bigtable does not support secondary indexes.
  • Row keys are specified for each row, not for each column family.
  • Command line usage: use cbt tool (preferred) or HBase shell.

Self-Assessment Questions

Q1. What happens if you use a sequential numeric ID or a timestamp as the beginning of your row key?

It leads to hot-spotting when writing, as all sequential writes will hit the exact same node instead of being distributed across the cluster.

Q2. You initially provisioned an HDD Bigtable instance for batch analytics, but now want to serve real-time app traffic requiring lower latency. How do you switch to SSD?

Changing disk type requires creating an entirely new instance and migrating the data. You cannot toggle a setting from HDD to SSD.

Q3. At what level are operations guaranteed to be atomic (ACID) in Bigtable?

Operations are atomic only at the row level.

Q4. Because Garbage Collection is asynchronous and can take up to a week, how do you ensure deleted data isn't returned in your queries?

By applying timestamp range filters within your queries, which excludes expired data that garbage collection has not yet physically removed.

Q5. What is a common way to avoid hot-spotting for row keys when writing time-series data?

Use field promotion (shifting a column to be part of the row key before the timestamp) or salting (hashing the timestamp plus row key).
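The salting technique from Q5 can be sketched as follows; the bucket count and key format are assumptions chosen for illustration:

```python
import hashlib

NUM_BUCKETS = 8  # assumption: pick roughly in line with cluster size

def salted_key(device_id: str, ts_ms: int) -> bytes:
    """Prefix the row key with a deterministic hash bucket so that
    concurrent time-series writes spread across key ranges (and
    therefore across nodes) instead of hot-spotting one range."""
    bucket = hashlib.sha256(device_id.encode()).digest()[0] % NUM_BUCKETS
    return f"{bucket}#{device_id}#{ts_ms:013d}".encode()

# Writes for many devices now land under different bucket prefixes.
keys = [salted_key(f"dev-{i}", 1_700_000_000_000) for i in range(100)]
```

The trade-off: reading all data back requires fanning out one scan per bucket prefix and merging the results, which is why field promotion is usually preferred when a natural high-cardinality field exists.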