Study Notes — Certification Prep

Google Foundational Models
Study Guide

Comprehensive overview of Google's foundational AI architecture, generation models, audio pipelines, safety frameworks, and specialized scientific models.

Updated: April 2026
Version: 1.0
Category: Google AI
Reading Time: ~9 min
Author: Michaël Bettan
01. Core Definition & Architecture

Definition: Native Multimodality

Google's foundational AI models are built on native multimodality (early fusion). Rather than bolting separate vision or audio encoders onto a text model, which causes lag, lost nuance, and hallucinations, all data types (text, image, audio, video, code) are projected into a unified, shared token space from the very beginning. These tokens are then processed by a single Sparse Mixture-of-Experts (MoE) transformer backbone, enabling fluid, real-time agentic interaction and deep cross-modal reasoning.
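As an illustrative sketch only (dimensions, projection matrices, and expert counts are all invented for the example, not Google's architecture), early fusion means each modality is linearly projected into one shared token space, after which a sparse MoE layer routes every token to just its top-k experts:

```python
import math
import random

random.seed(0)
D = 8           # shared token dimension (invented for this sketch)
N_EXPERTS = 4   # experts in the sparse MoE layer
TOP_K = 2       # each token activates only its top-k experts

def make_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def project(vec, W):
    # Linear projection of a modality-native vector into the shared D-dim token space.
    return [sum(v * W[i][j] for i, v in enumerate(vec)) for j in range(D)]

# Different native dimensionalities per modality; one learned projection each.
projections = {"text": make_matrix(5, D),
               "image": make_matrix(12, D),
               "audio": make_matrix(7, D)}
router = make_matrix(D, N_EXPERTS)  # gating weights for the MoE layer

def route(token):
    # Score every expert, keep only the top-k (sparse activation),
    # and renormalize the kept gate weights with a softmax.
    scores = [sum(t * router[i][e] for i, t in enumerate(token)) for e in range(N_EXPERTS)]
    top = sorted(range(N_EXPERTS), key=lambda e: -scores[e])[:TOP_K]
    weights = [math.exp(scores[e]) for e in top]
    z = sum(weights)
    return [(e, w / z) for e, w in zip(top, weights)]

# Early fusion: every modality becomes tokens in the SAME space before the backbone.
tokens = [project([0.1] * 5, projections["text"]),
          project([0.2] * 12, projections["image"]),
          project([0.3] * 7, projections["audio"])]
```

The point of the sketch is the ordering: fusion happens before the backbone, so the router never knows (or needs to know) which modality a token came from.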

Architecture

02. Ecosystem Tiers

Gemini Nano
On-device edge computing under strict thermal/memory constraints (e.g., Pixel 9) → Zero-latency, offline processing.
Gemini Flash & Flash-Lite
Optimized for extreme inference speed and high-volume serving. Uses knowledge distillation (compressing a larger teacher model's capabilities into a smaller student) → Ideal for fast TTFT (time-to-first-token).
Gemini Pro
The versatile workhorse balancing massive parameter capacity with efficient serving for complex multi-step reasoning.
Gemini Ultra
Maximum-capacity frontier model running on extensive TPU clusters for the most complex scientific and algorithmic challenges.
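The distillation step mentioned for the Flash tier can be sketched as minimizing the KL divergence between a larger teacher's temperature-softened output distribution and the student's. The logits, class count, and temperature below are toy values, not Google's training recipe:

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    # KL(p || q): how far the student distribution q is from the teacher p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [3.0, 1.0, 0.2]   # toy teacher outputs over 3 classes
student_logits = [2.5, 1.2, 0.1]   # toy student outputs

T = 2.0  # higher temperature softens both distributions, exposing "dark knowledge"
loss = kl(softmax(teacher_logits, T), softmax(student_logits, T))
# Training would backpropagate this loss into the student's weights only.
```

Because the teacher's soft probabilities carry more signal than hard labels, the student can match much of the teacher's behavior at a fraction of the parameter count.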
03. Gemini 3.1 & Context Mechanics

Gemini 3.1 Pro preview & "Deep think" paradigm

Google's apex reasoning model optimized for agentic workflows and complex problem-solving.

Context mechanics & tokenization

04. Generation Models

Lyria 3 (audio & music generation)

Nano Banana 2 (Gemini 3.1 Flash image)

Veo 3.1 (video generation)

05. Gemini Audio, Live API & Embeddings

Gemini Audio

A foundational model designed specifically for live voice-agent interactions, bypassing intermediate text generation for pure acoustic understanding.

Gemini multimodal live API

  • Shifts from stateless RESTful prompts to stateful, bidirectional WebSocket (WSS) connections.
  • Streams raw 16-bit PCM audio, video frames, and text simultaneously.
  • Bypasses STT/TTS: Understands raw acoustic waveforms (tone, hesitation) → and outputs raw 24kHz audio.
  • Supports "barge-in" (interrupting the model mid-sentence) → Affective dialog (mirroring emotional state).
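The raw 16-bit PCM framing described above can be illustrated with plain stdlib code; the sine tone and chunk length here are arbitrary stand-ins for microphone input, not part of any Google API:

```python
import math
import struct

SAMPLE_RATE = 24_000  # the 24 kHz output rate cited above

def float_to_pcm16(samples):
    # Clamp floats to [-1, 1] and pack as 16-bit little-endian PCM,
    # the wire format a bidirectional audio stream would carry.
    clamped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack("<%dh" % len(clamped), *(int(s * 32767) for s in clamped))

# 10 ms of a 440 Hz tone as a stand-in for captured audio.
n = SAMPLE_RATE // 100
tone = [math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(n)]
chunk = float_to_pcm16(tone)
assert len(chunk) == n * 2  # 2 bytes per 16-bit sample
```

Streaming small fixed-duration chunks like this is what makes barge-in possible: the client keeps sending audio while the model is still speaking, and the server can cut its own output the moment new speech arrives.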

Gemini embedding 2

  • Natively multimodal unified vector space: maps text, images, video, and audio into a single shared embedding space.
  • Use case: Upload a picture of a broken part to query a vector store and instantly retrieve the exact timestamp of a relevant video tutorial and text manual.
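That broken-part use case can be mocked with cosine similarity over a toy unified store; every vector and item name below is invented for illustration:

```python
import math

def cosine(a, b):
    # Cosine similarity: angle-based closeness, independent of vector length.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy unified store: items from ANY modality land in the same 3-d space.
store = {
    "video_tutorial_t=02:13": [0.9, 0.1, 0.2],   # video-segment embedding
    "text_manual_section_4":  [0.7, 0.3, 0.1],   # text embedding
    "unrelated_cat_photo":    [0.0, 0.9, 0.4],   # image embedding
}

query = [0.85, 0.15, 0.15]  # pretend embedding of the broken-part photo
best = max(store, key=lambda k: cosine(query, store[k]))
```

Because all modalities share one space, a single nearest-neighbor search returns the video timestamp and the manual section without any modality-specific index.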

06. Gemma Architecture

Brings Gemini's core architecture to local, compute-constrained environments.

Gemma 3 architecture

RecurrentGemma (Griffin architecture)
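Griffin blocks mix local attention with gated linear recurrences. A stripped-down scalar recurrence (not the actual RG-LRU equations; the decay value is invented) conveys the key property, a fixed-size state regardless of sequence length:

```python
# Toy gated linear recurrence: the state h is O(1) in sequence length,
# unlike attention's KV cache, which grows with every token.
def linear_recurrence(xs, decay=0.8):
    h = 0.0
    history = []
    for x in xs:
        # h_t = a * h_{t-1} + (1 - a) * x_t : an exponential moving summary.
        h = decay * h + (1.0 - decay) * x
        history.append(h)
    return history

out = linear_recurrence([1.0, 0.0, 0.0, 0.0])
# The first token's influence decays geometrically instead of being stored.
```

This constant-memory state is why recurrence-based variants suit long sequences on compute-constrained local hardware.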

07. Safety & Security

Safety & security: SynthID framework

Traditional metadata watermarks (C2PA) are easily stripped. SynthID instead embeds imperceptible statistical watermarks directly into the generated content itself at inference time.

08. Scientific & Specialized Models

AlphaFold 3 (generative molecular simulation)

AlphaGeometry 2 (neuro-symbolic reasoning)

Solves IMO geometry problems (84% success rate) using a dual-engine architecture: a neural language model proposes auxiliary constructions, and a symbolic deduction engine rigorously verifies each step.
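The dual-engine loop can be sketched abstractly: a (mocked) neural proposer suggests constructions and a symbolic engine exhaustively forward-chains deductions. Both engines below are invented stand-ins, not DeepMind's components:

```python
# Toy neuro-symbolic loop: facts are strings, rules are (premises, conclusion).
RULES = [
    ({"A", "B"}, "C"),        # if A and B are proven, C follows
    ({"C", "AUX"}, "GOAL"),   # the goal also needs an auxiliary construction
]

def symbolic_closure(facts):
    # Deterministic forward chaining: the exhaustive "verification" engine.
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

def neural_proposer(_facts):
    # Stand-in for the language model: proposes a creative construction
    # that pure deduction could never introduce on its own.
    return "AUX"

facts = symbolic_closure({"A", "B"})         # derives C, but stalls before GOAL
if "GOAL" not in facts:
    facts.add(neural_proposer(facts))        # neural step adds what symbols can't
    facts = symbolic_closure(facts)          # symbolic engine finishes the proof
```

The division of labor is the point: the symbolic side guarantees soundness, while the neural side supplies the leap (here, `AUX`) that deduction alone cannot generate.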

Model references

SIMA 2 (generalist 3D agent)
An agent that plays, reasons, and learns with users in diverse virtual 3D worlds, functioning as a collaborative generalist agent rather than a purely autonomous one.
Genie 2 / 3 (world models)
Generates interactive, playable 2D and 3D environments from single-frame or textual prompts.
AlphaFold 3 (biology)
Predicts joint molecular structures and interactions with atomic precision across all life’s molecules.
AlphaProteo (biology)
Designs novel, high-affinity protein binders for diverse target proteins to accelerate drug discovery.
AlphaMissense (biology)
Classifies missense mutations to identify pathogenic variants in the human genome.
DolphinGemma (biology)
Specialized model for decoding and modeling complex animal communication patterns.
AlphaGeometry 2 (mathematics)
Reasons through IMO-level geometry problems using a neuro-symbolic dual-engine approach.
AlphaTensor (mathematics)
Discovers computationally efficient algorithms for fundamental matrix multiplication tasks.
AlphaCode (programming)
Generates competitive-level code solutions through massive-scale sampling and filtering.
AlphaDev (algorithms)
Utilizes reinforcement learning to discover faster sorting and hashing algorithms in assembly.
AlphaGo / AlphaZero / MuZero
Frontier RL agents achieving superhuman mastery in games, progressing from human data (AlphaGo) to pure self-play (AlphaZero) to learning without being given the rules (MuZero).
AlphaStar (games)
First AI to defeat Grandmaster-level players in the complex real-time strategy game StarCraft II.
GraphCast / GenCast (climate)
Provides high-resolution global weather forecasting and probabilistic extreme weather prediction.
AlphaChip (hardware)
Uses RL to produce chip layouts that rival or exceed human designs, accelerating hardware design cycles.
AlphaQubit (quantum)
Applies AI to quantum error correction to improve the stability of quantum computations.
PaLM 2 / USM / WaveNet
Foundational text-first, speech, and audio synthesis models serving as previous-generation baselines.
Phenaki (video)
Early research model for generating long, temporally consistent video sequences from text prompts.
Aeneas (humanities)
Restores missing text in damaged ancient inscriptions and dates them using neural sequence modeling.
PaLM-SayCan / RT-2 / ALOHA
Advanced robotics models integrating language and vision for dexterous manipulation and planning.
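AlphaTensor searches for matrix-multiplication decompositions of the kind Strassen found by hand. His classic 7-multiplication scheme for 2x2 blocks, shown here as a concrete baseline, is exactly the sort of algorithm such searches rediscover and improve on:

```python
def strassen_2x2(A, B):
    # Strassen's algorithm: 7 multiplications instead of the naive 8.
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

def naive_2x2(A, B):
    # Standard definition: 8 multiplications, used here only as a check.
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
assert strassen_2x2(A, B) == naive_2x2(A, B)
```

Applied recursively to matrix blocks, saving one multiplication per 2x2 step is what drops the asymptotic cost below cubic, which is why shaving even a single multiplication from a decomposition is a meaningful discovery.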

Self-Assessment Questions

Q1. What is the core architectural shift in Gemini compared to previous generation models?

Native multimodality (early fusion) where all data types are projected into a unified token space from the beginning, rather than bolting on separate encoders.

Q2. What is the SynthID framework used for?

To embed imperceptible statistical watermarks directly into text, audio, image, and video content at inference time so that AI provenance can be verified.

Q3. What is the difference between Gemini Nano and Gemini Flash?

Gemini Nano is optimized for on-device edge computing with zero latency, while Gemini Flash is optimized for extreme inference speed and high-volume serving in the cloud.