Comprehensive overview of Anthropic's core philosophy, Constitutional AI, and model family.
Updated: April 2026
Version: 1.0
Category: Anthropic
Reading Time: ~8 min
Author: Michaël Bettan
01
Core Philosophy
Anthropic focuses on reason-based alignment, agentic autonomy, and mechanistic interpretability.
| Feature | Legacy Baseline (Claude 2) | Transitional (Claude 3.5 Family) | Frontier Architecture (Claude 4.6) |
|---|---|---|---|
| Core Design | Standard autoregressive LLM | Multi-modal text & vision processor | Hybrid reasoning engine |
| Alignment | High false-refusal rate | Nuanced understanding | Reason-based Constitutional resolution |
| Context Window | 100k static tokens | 200k tokens | 1M tokens with Context Compaction |
| Compute Routing | Uniform allocation | Nuanced allocation | Adaptive Thinking (dynamic scaling) |
| Interaction Model | Stateless text ping-pong | Interactive Artifacts | Agent Teams & Native Computer Use |
02
Constitutional AI
The Compounding Costs of Human-Only Alignment
Traditional RLHF (Reinforcement Learning from Human Feedback) relies entirely on human raters, creating systemic bottlenecks:
Epistemological Vulnerability (Sycophancy): Models trained strictly to maximize human approval learn to tell raters what they want to hear, sacrificing objective truth and rigorous deduction for polite agreement.
Fragmented Reward Signals: Crowdsourced human values are notoriously inconsistent. Contradictory feedback leads to uneven moral guidelines and catastrophic alignment failures in edge cases.
Compounded Latency: Human labeling simply cannot keep pace with the massive throughput required to train 2026 frontier models on multimodal datasets.
Projecting Values via Constitutional AI (CAI)
CAI replaces the human bottleneck with an explicit, reason-based "Constitution," applied in strict priority order:
Broadly Safe: The absolute highest priority. The model must never undermine human mechanisms designed to oversee or halt AI operations.
Broadly Ethical: Must maintain honesty and avoid dangerous actions.
Compliant: Must follow specific directives regarding medical or cybersecurity advice.
Genuinely Helpful: Only when the first three conditions are met does the model optimize for user helpfulness.
Phase 1
Supervised (Critique & Revision)
The base model generates a response, critiques its own output against the Constitution, and autonomously generates a revised response (zero human intervention).
Phase 2
RLAIF (Reinforcement Learning from AI Feedback)
An AI evaluator scores candidate responses using the Constitution. A preference model is trained from these AI scores, replacing human raters entirely and allowing alignment to scale with compute.
03
System 2 Reasoning & Deployment
Context Mechanics
Flat-Rate Pricing: Claude 4.6 generation maintains the same per-token rate regardless of context length (e.g., 9,000 vs 900,000 tokens).
Context Compaction API (compact_20260112): Even a 1M limit gets exhausted by persistent agents running for days. This new server-side protocol automatically compresses early conversation history into a dense "compaction block," allowing agents to run indefinitely without context rot or systemic timeouts.
Claude Deployment Matrix
Claude Haiku 4.5: Optimized for inference speed. Target use cases: triage and real-time support. Cost: $1/MTok.
Claude Sonnet 4.6: Mid-tier model. Target use cases: autonomous software engineering and multi-agent workflow orchestration. Cost: $3/MTok. (1M-token context in beta)
Claude Opus 4.6: Maximum capacity reasoning model. Target use cases: theoretical mathematics, strategic planning, and complex multi-agent oversight. Cost: $5/MTok.
System 2 Reasoning: "Adaptive Thinking"
The legacy "Extended Thinking" mode (where developers statically allocated a token budget) has been replaced by native "Adaptive Thinking".
Autonomous Scaling: The model autonomously evaluates the complexity of an incoming request and dynamically scales its reasoning depth before outputting a response.
The /effort Parameter: Developers can override this via the API by assigning four distinct effort levels (low, medium, high, max). You can force Opus to use max compute for an algorithmic proof, or dial Sonnet down to low for simple classification to save latency.
04
Agentic Bifurcation: Code vs. Cowork
Claude Code (Developer Layer)
A terminal-based CLI. Operates deeply within the local system architecture, capable of spinning up parallel sub-agents to execute complex, repository-wide refactoring.
Claude Cowork (Knowledge Worker Layer)
A secure, desktop-based GUI operating in an isolated Virtual Machine. Requires no coding familiarity. Targeted for long-running desktop automation (bulk file org, app-to-app data extraction), though its reliance on visual/screenshot parsing consumes higher token volumes.
Zoom Action (Computer Use)
To handle high-resolution IDEs, the 2025 API update introduced a zoom action, allowing Claude to request a localized, full-resolution crop of a specific screen coordinate.
05
Interpretability & Steering
Opening the Black Box
Sparse Autoencoders (SAEs): Anthropic decompiles dense neural activations into readable features (e.g., "immunology" or "sycophancy").
The Stream Algorithm (Oct 2025): Addresses previous computational limitations of scaling SAEs to massive context windows. The algorithm hierarchically prunes 97% to 99% of irrelevant token interactions, achieving near-linear time complexity and allowing developers to trace why a model made a decision for up to 100k tokens.
Behavioral Vaccination
Because internal features are mapped, they can be manually manipulated to secure the model.
Persona Vectors: By extracting the specific neural vectors responsible for character traits, engineers actively monitor the model's "mood" during deployment. If the activation state drifts toward "hallucination" or "evil," the system detects the trajectory before the output is generated.
Behavioral Vaccination: During fine-tuning, researchers intentionally steer the model toward undesirable personas (artificially dosing it with "toxicity"). By forcing these states, the model builds a natural resilience, designed to reduce the likelihood of "alignment faking" (a sleeper agent hiding its true intent) during real-world deployment.
06
AI Safety & MCP
Anthropic operationalizes safety via its Responsible Scaling Policy (RSP).
AI Safety Levels (ASL) Framework
ASL-2 (The Baseline): Models exhibit dangerous theoretical knowledge but lack practical autonomy. Secured via automated red-teaming.
ASL-3 (The Opus 4 Trigger): Triggered when models demonstrate advanced proficiency in workflows associated with CBRN (Chemical, Biological, Radiological, Nuclear) threats.
ASL-3 Defenses:
Constitutional Classifiers: Specialized, low-latency sentinel LLMs that monitor input/output streams to proactively block harmful CBRN workflows.
Weight Security: Mandates 2-Party Authorization (2PA) for infrastructure access and strict Egress Bandwidth Controls to throttle network traffic, ensuring security systems have time to detect and terminate illicit multi-gigabyte weight exfiltration by state-level threat actors.
Model Context Protocol (MCP)
An emerging open standard providing a universal interface between LLMs, tools, and data.
Universal Interface: Acting as the "USB-C" for AI, MCP uses a standardized JSON-RPC 2.0 interface. Developers write an integration once, and any MCP-aware model can instantly discover and use it.
Ecosystem State: Rapidly growing but early-stage ecosystem. One of the leading approaches for agent interoperability, though adoption is still fragmented.
Automatic Discovery & Chaining: Claude automatically fetches a live catalog of a server’s methods and schemas. Multiple MCP servers can be chained together in a single prompt to execute complex workflows.
Multimodal RAG & Vision Pipelines
Fusing document search, semantic understanding, and image analysis.
Contextual Retrieval: Combines standard vector embeddings (e.g., FAISS nearest-neighbor indexing) with BM25 keyword filtering to reduce retrieval failures in domain-specific corpora.
Multimodal Fusion: Claude can process text, code snippets, and UI screenshots simultaneously in a single prompt. It maps variables in the code directly to visual elements in the screenshot to pinpoint layout mismatches or errors.
Auditing & Securing MCP Workflows
MCP increases the attack surface and requires strict controls.
Core Risks: Introduces risks like prompt injection, tool poisoning, data exfiltration, and privilege escalation.
Threat Mitigations: Defends against Prompt Injection (via strict separation of system/user channels) and Tool Poisoning (via strict schema validation and method whitelisting).
MCPSafetyScanner: An automated auditing tool that acts as an AI penetration tester. It hammers exposed MCP endpoints with adversarial JSON-RPC requests to catch schema violations and path-traversal attempts before deployment.
07
Glossary
RLHF
Reinforcement Learning from Human Feedback — foundational but limited alignment technique.
CAI
Constitutional AI — Anthropic's reason-based alignment methodology using a defined Constitution.
RLAIF
Reinforcement Learning from AI Feedback — AI-scored preference model replacing human raters.
Constitutional Classifiers
Sentinel LLMs monitoring input/output streams for CBRN threat patterns.
2PA
Two-Party Authorization — dual human approval required for critical infrastructure access.
TTL
Time-to-Live — cache duration (5-minute or 1-hour options).
TTFT
Time-to-First-Token — primary latency metric.
Computer Use API
Tool enabling Claude to view and control desktop environments via screenshot analysis.
Claude Code
CLI-based developer tool for autonomous codebase management and agent team orchestration.
Claude Cowork
GUI-based desktop automation tool for non-technical knowledge workers.
Artifacts
Stateful interactive UI windows rendered alongside chat for code and visual output.
Self-Assessment Questions
Q1. What is "Constitutional AI" (CAI) and how does it differ from traditional RLHF?
CAI uses an explicit, reason-based "Constitution" to align the model, replacing human raters in the evaluation phase with AI evaluators (RLAIF) to scale alignment with compute.
Q2. What are the two layers in Anthropic's "Agentic Bifurcation"?
Claude Code (Developer Layer, CLI-based) and Claude Cowork (Knowledge Worker Layer, GUI-based).
Q3. What is the purpose of "Sparse Autoencoders" (SAEs) in Anthropic's research?
To decompile dense neural activations into readable features, opening the "black box" of AI to understand why a model made a decision.