Study Notes — Certification Prep

Anthropic Foundational Models
Study Guide

Comprehensive overview of Anthropic's core philosophy, Constitutional AI, and model family.

Updated: April 2026
Version: 1.0
Category: Anthropic
Reading Time: ~8 min
Author: Michaël Bettan
01

Core Philosophy

Anthropic focuses on reason-based alignment, agentic autonomy, and mechanistic interpretability.

| Feature | Legacy Baseline (Claude 2) | Transitional (Claude 3.5 Family) | Frontier Architecture (Claude 4.6) |
|---|---|---|---|
| Core Design | Standard autoregressive LLM | Multi-modal text & vision processor | Hybrid reasoning engine |
| Alignment | High false-refusal rate | Nuanced understanding | Reason-based Constitutional resolution |
| Context Window | 100k static tokens | 200k tokens | 1M tokens with Context Compaction |
| Compute Routing | Uniform allocation | Nuanced allocation | Adaptive Thinking (dynamic scaling) |
| Interaction Limit | Stateless text Ping-Pong | Interactive Artifacts | Agent Teams & Native Computer Use |
02

Constitutional AI

The Compounding Costs of Human-Only Alignment

Traditional RLHF (Reinforcement Learning from Human Feedback) relies entirely on human raters, creating systemic bottlenecks: labeling is slow and expensive, rater judgments are inconsistent, and throughput cannot scale with compute.

Projecting Values via Constitutional AI (CAI)

CAI replaces these human bottlenecks with an explicit, reason-based "Constitution."

Phase 1

Supervised (Critique & Revision)

The base model generates a response, critiques its own output against the Constitution, and autonomously generates a revised response (zero human intervention).
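
Phase 1 can be pictured as a generate, critique, revise loop. The sketch below is a toy illustration with stubbed model calls (the `generate`, `critique`, and `revise` functions stand in for LLM invocations; the principles and wording are invented for the example):

```python
# Toy sketch of CAI Phase 1: generate -> self-critique -> revise.
# Each function here is a stub; in practice each step is an LLM call.

CONSTITUTION = [
    "Do not provide instructions for causing harm.",
    "Be honest about uncertainty.",
]

def generate(prompt: str) -> str:
    # Stub base-model response (may violate a principle).
    return "Sure, here is how to pick a lock: ..."

def critique(response: str, principles: list[str]) -> list[str]:
    # Stub critic: flag principles the response appears to violate.
    return [p for p in principles if "harm" in p and "lock" in response]

def revise(response: str, violations: list[str]) -> str:
    # Stub reviser: rewrite the response if any principle was violated.
    if violations:
        return "I can't help with that, but I can explain how lock security works."
    return response

def critique_revision(prompt: str) -> str:
    draft = generate(prompt)
    violations = critique(draft, CONSTITUTION)
    return revise(draft, violations)
```

The key property is that no human appears anywhere in the loop: the model supplies both the critique and the revision.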

Phase 2

RLAIF (Reinforcement Learning from AI Feedback)

An AI evaluator scores candidate responses using the Constitution. A preference model is trained from these AI scores, replacing human raters entirely and allowing alignment to scale with compute.
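
A minimal sketch of the RLAIF step, with an invented scoring heuristic standing in for the AI evaluator (real systems would score with an LLM against the Constitution):

```python
# Toy RLAIF step: an AI evaluator scores candidate responses and emits
# (chosen, rejected) preference pairs for preference-model training.

def ai_score(response: str) -> float:
    # Stub evaluator: reward refusals of harmful content, penalize compliance.
    score = 0.0
    if "I can't" in response or "I cannot" in response:
        score += 1.0
    if "step-by-step" in response:
        score -= 1.0
    return score

def preference_pair(a: str, b: str) -> tuple[str, str]:
    # Ordering (chosen, rejected), as consumed by preference-model training.
    return (a, b) if ai_score(a) >= ai_score(b) else (b, a)
```

Because the scorer is itself a model, the pipeline scales with compute rather than with the size of a human rating pool.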

03

System 2 Reasoning & Deployment

Context Mechanics

  • Flat-Rate Pricing: The Claude 4.6 generation maintains the same per-token rate regardless of context length (e.g., 9,000 vs 900,000 tokens).
  • Context Compaction API (compact_20260112): Even a 1M limit gets exhausted by persistent agents running for days. This new server-side protocol automatically compresses early conversation history into a dense "compaction block," allowing agents to run indefinitely without context rot or systemic timeouts.
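
The compaction idea can be simulated locally. The real `compact_20260112` protocol is server-side; the budget logic and summary format below are purely illustrative:

```python
# Local simulation of context compaction: when the history exceeds a
# character budget, the oldest turns collapse into one dense summary block.

def compact(history: list[str], budget: int) -> list[str]:
    def total(msgs: list[str]) -> int:
        return sum(len(m) for m in msgs)

    if total(history) <= budget:
        return history  # nothing to do

    kept: list[str] = []
    # Keep the most recent turns that fit within half the budget.
    for msg in reversed(history):
        if total(kept) + len(msg) > budget // 2:
            break
        kept.insert(0, msg)

    dropped = len(history) - len(kept)
    summary = f"[compaction block: {dropped} earlier turns summarized]"
    return [summary] + kept
```

A production implementation would summarize semantically rather than truncate, but the shape is the same: recent turns survive verbatim, older turns survive only as a compaction block.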

Claude Deployment Matrix

  • Claude Haiku 4.5: Optimized for inference speed. Target use cases: triage and real-time support. Cost: $1/MTok.
  • Claude Sonnet 4.6: Mid-tier model. Target use cases: autonomous software engineering and multi-agent workflow orchestration. Cost: $3/MTok (1M-token context window in beta).
  • Claude Opus 4.6: Maximum capacity reasoning model. Target use cases: theoretical mathematics, strategic planning, and complex multi-agent oversight. Cost: $5/MTok.
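
The matrix above implies a simple routing policy: send each task to the cheapest tier that can handle it. A sketch, noting that the model ID strings and task labels here are assumptions, not official identifiers:

```python
# Illustrative model router over the three deployment tiers.
# Model IDs and task labels are invented for the example.

TIERS = [
    ("claude-haiku-4-5",  1.0, {"triage", "realtime-support"}),
    ("claude-sonnet-4-6", 3.0, {"swe", "orchestration"}),
    ("claude-opus-4-6",   5.0, {"math", "strategy", "oversight"}),
]

def route(task: str) -> str:
    # Tiers are listed cheapest-first, so the first match is the cheapest fit.
    for model, _cost_per_mtok, tasks in TIERS:
        if task in tasks:
            return model
    return "claude-sonnet-4-6"  # mid-tier default for unclassified work
```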

System 2 Reasoning: "Adaptive Thinking"

The legacy "Extended Thinking" mode (where developers statically allocated a token budget) has been replaced by native "Adaptive Thinking", in which the model dynamically scales its own reasoning depth to match query complexity.
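
The contrast between the two modes can be shown with a toy budget function. The difficulty heuristic below is invented; the point is only that the budget becomes a function of the query rather than a developer-set constant:

```python
# Toy contrast: Extended Thinking used one fixed budget for every query;
# Adaptive Thinking scales reasoning effort with estimated difficulty.

def extended_thinking_budget(fixed: int = 4096) -> int:
    return fixed  # same budget for every query, set by the developer

def adaptive_thinking_budget(query: str) -> int:
    # Crude difficulty proxy: longer queries get deeper reasoning,
    # clamped between a floor (512) and a ceiling (16384).
    difficulty = min(len(query.split()) / 50, 1.0)
    return int(512 + difficulty * (16384 - 512))
```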

04

Agentic Bifurcation: Code vs. Cowork

Claude Code (Developer Layer)
A terminal-based CLI. Operates deeply within the local system architecture, capable of spinning up parallel sub-agents to execute complex, repository-wide refactoring.
Claude Cowork (Knowledge Worker Layer)
A secure, desktop-based GUI operating in an isolated Virtual Machine. Requires no coding familiarity. Targeted for long-running desktop automation (bulk file org, app-to-app data extraction), though its reliance on visual/screenshot parsing consumes higher token volumes.
Zoom Action (Computer Use)
To handle high-resolution IDEs, the 2025 API update introduced a zoom action, allowing Claude to request a localized, full-resolution crop of a specific screen coordinate.
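
A hypothetical shape for such a zoom tool call; the field names below are illustrative only and do not reflect the official Computer Use API schema:

```python
# Hypothetical zoom tool-call payload: request a full-resolution crop
# centered on a screen coordinate. All field names are invented.

def zoom_action(x: int, y: int, width: int = 400, height: int = 300) -> dict:
    return {
        "action": "zoom",
        "coordinate": [x, y],          # center of the region of interest
        "region": {"width": width, "height": height},
    }
```
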
05

Interpretability & Steering

Opening the Black Box

  • Sparse Autoencoders (SAEs): Anthropic decompiles dense neural activations into readable features (e.g., "immunology" or "sycophancy").
  • The Stream Algorithm (Oct 2025): Addresses previous computational limitations of scaling SAEs to massive context windows. The algorithm hierarchically prunes 97% to 99% of irrelevant token interactions, achieving near-linear time complexity and allowing developers to trace why a model made a decision for up to 100k tokens.
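
The pruning idea behind near-linear attribution can be sketched in a few lines: score all token-pair interactions, then keep only the strongest 1-3%. This toy version uses flat top-k selection rather than the hierarchical scheme the source describes:

```python
# Toy interaction pruning: discard ~97-99% of token-pair interactions,
# keeping only those with the largest absolute attribution scores.
import heapq

def prune_interactions(scores: dict[tuple[int, int], float],
                       keep_frac: float = 0.02) -> dict[tuple[int, int], float]:
    k = max(1, int(len(scores) * keep_frac))
    top = heapq.nlargest(k, scores.items(), key=lambda kv: abs(kv[1]))
    return dict(top)
```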

Behavioral Vaccination

Because internal features are mapped, they can be directly manipulated to steer and secure the model.

  • Persona Vectors: By extracting the specific neural vectors responsible for character traits, engineers actively monitor the model's "mood" during deployment. If the activation state drifts toward "hallucination" or "evil," the system detects the trajectory before the output is generated.
  • Behavioral Vaccination: During fine-tuning, researchers intentionally steer the model toward undesirable personas (artificially dosing it with "toxicity"). By forcing these states, the model builds a natural resilience, designed to reduce the likelihood of "alignment faking" (a sleeper agent hiding its true intent) during real-world deployment.
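
The monitoring half of this can be illustrated with a projection check: compare the current activation state against a known persona direction and flag drift before generation. The vector and threshold below are invented for the example:

```python
# Toy persona-vector monitor: flag when the activation state drifts
# toward a known "hallucination" direction. Values are illustrative.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

HALLUCINATION_VECTOR = [0.9, 0.1, -0.3]  # extracted offline (invented here)

def drifting(activation: list[float], threshold: float = 0.8) -> bool:
    # High cosine similarity means the state is moving toward the persona.
    return cosine(activation, HALLUCINATION_VECTOR) > threshold
```
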
06

AI Safety & MCP

Anthropic operationalizes safety via its Responsible Scaling Policy (RSP).

AI Safety Levels (ASL) Framework

  • ASL-2 (The Baseline): Models exhibit dangerous theoretical knowledge but lack practical autonomy. Secured via automated red-teaming.
  • ASL-3 (The Opus 4 Trigger): Triggered when models demonstrate advanced proficiency in workflows associated with CBRN (Chemical, Biological, Radiological, Nuclear) threats.
  • ASL-3 Defenses:
    • Constitutional Classifiers: Specialized, low-latency sentinel LLMs that monitor input/output streams to proactively block harmful CBRN workflows.
    • Weight Security: Mandates 2-Party Authorization (2PA) for infrastructure access and strict Egress Bandwidth Controls to throttle network traffic, ensuring security systems have time to detect and terminate illicit multi-gigabyte weight exfiltration by state-level threat actors.
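
An egress bandwidth control can be modeled as a token bucket: sustained multi-gigabyte transfers exhaust the bucket and stall, buying detection systems time. The rates and class shape below are a generic sketch, not Anthropic's implementation:

```python
# Token-bucket sketch of an egress bandwidth control: bursts are allowed
# up to a cap, but sustained bulk transfers get throttled.

class EgressThrottle:
    def __init__(self, rate_mb_s: float, burst_mb: float):
        self.rate = rate_mb_s      # steady-state refill rate
        self.burst = burst_mb      # maximum burst allowance
        self.tokens = burst_mb

    def tick(self, seconds: float) -> None:
        # Refill the bucket as time passes, up to the burst cap.
        self.tokens = min(self.burst, self.tokens + self.rate * seconds)

    def allow(self, mb: float) -> bool:
        if mb <= self.tokens:
            self.tokens -= mb
            return True
        return False  # transfer blocked pending review
```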

Model Context Protocol (MCP)

An emerging open standard providing a universal interface between LLMs, tools, and data.

  • Universal Interface: Acting as the "USB-C" for AI, MCP uses a standardized JSON-RPC 2.0 interface. Developers write an integration once, and any MCP-aware model can instantly discover and use it.
  • Ecosystem State: Rapidly growing but early-stage ecosystem. One of the leading approaches for agent interoperability, though adoption is still fragmented.
  • Automatic Discovery & Chaining: Claude automatically fetches a live catalog of a server’s methods and schemas. Multiple MCP servers can be chained together in a single prompt to execute complex workflows.
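
The discovery and invocation steps above map onto plain JSON-RPC 2.0 messages. The sketch below builds the two message shapes (`tools/list` and `tools/call` are MCP method names; the tool name and arguments are invented):

```python
# Minimal JSON-RPC 2.0 messages in the shape MCP uses for tool
# discovery and invocation.
import json

def rpc(method: str, params: dict = None, id_: int = 1) -> str:
    msg = {"jsonrpc": "2.0", "id": id_, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)

# Step 1: fetch the server's live tool catalog.
discover = rpc("tools/list")
# Step 2: invoke a discovered tool (example tool name is invented).
invoke = rpc("tools/call",
             {"name": "get_weather", "arguments": {"city": "Paris"}},
             id_=2)
```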

Multimodal RAG & Vision Pipelines

Fusing document search, semantic understanding, and image analysis.

  • Contextual Retrieval: Combines standard vector embeddings (e.g., FAISS nearest-neighbor indexing) with BM25 keyword filtering to reduce retrieval failures in domain-specific corpora.
  • Multimodal Fusion: Claude can process text, code snippets, and UI screenshots simultaneously in a single prompt. It maps variables in the code directly to visual elements in the screenshot to pinpoint layout mismatches or errors.
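
The contextual-retrieval blend can be sketched as a weighted sum of a keyword score and a vector score. This toy uses set overlap instead of true BM25 and a dot product instead of FAISS, so treat it as the shape of the idea rather than the pipeline itself:

```python
# Toy hybrid retrieval: blend a keyword-overlap score (BM25 stand-in)
# with a vector similarity score (FAISS stand-in), then rank.

def keyword_score(query: str, doc: str) -> float:
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def vector_score(q_vec: list[float], d_vec: list[float]) -> float:
    # Dot product; assumes unit-norm embeddings.
    return sum(a * b for a, b in zip(q_vec, d_vec))

def hybrid_rank(query, docs, q_vec, d_vecs, alpha=0.5):
    scored = [
        (alpha * keyword_score(query, doc)
         + (1 - alpha) * vector_score(q_vec, v), doc)
        for doc, v in zip(docs, d_vecs)
    ]
    return [doc for _, doc in sorted(scored, reverse=True)]
```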

Auditing & Securing MCP Workflows

MCP increases the attack surface and requires strict controls.

  • Core Risks: Introduces risks like prompt injection, tool poisoning, data exfiltration, and privilege escalation.
  • Threat Mitigations: Defends against Prompt Injection (via strict separation of system/user channels) and Tool Poisoning (via strict schema validation and method whitelisting).
  • MCPSafetyScanner: An automated auditing tool that acts as an AI penetration tester. It hammers exposed MCP endpoints with adversarial JSON-RPC requests to catch schema violations and path-traversal attempts before deployment.
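
The schema-validation and whitelisting mitigations reduce to a small gate in front of every tool call. A minimal sketch (the tool names and allowed keys are invented):

```python
# Toy mitigation layer: reject any tool call whose method is not
# whitelisted, or whose arguments contain unexpected keys.

ALLOWED = {
    "get_weather": {"city"},
    "read_file": {"path"},
}

def validate_call(method: str, args: dict) -> bool:
    if method not in ALLOWED:
        return False  # method not whitelisted (possible tool poisoning)
    # Reject extra keys a prompt injection might smuggle in.
    return set(args) <= ALLOWED[method]
```
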
07

Glossary

RLHF
Reinforcement Learning from Human Feedback — foundational but limited alignment technique.
CAI
Constitutional AI — Anthropic's reason-based alignment methodology using a defined Constitution.
RLAIF
Reinforcement Learning from AI Feedback — AI-scored preference model replacing human raters.
Adaptive Thinking
Dynamic reasoning depth scaling; replaces fixed-budget Extended Thinking.
MCP
Model Context Protocol — an emerging open standard providing a universal interface between LLMs, tools, and data (early-stage ecosystem).
MCPSafetyScanner
An automated agentic tool that audits MCP servers using adversarial testing to identify security vulnerabilities.
Context Compaction
Lossy server-side summarization of conversation history to prevent context overflow.
FAISS
Facebook AI Similarity Search — a library for efficient similarity search and clustering of dense vectors, used in RAG pipelines.
Prompt Caching
Server-side storage of static context prefixes to reduce redundant computation (up to 90% cost savings and 2x faster latency).
RAG
Retrieval-Augmented Generation — a technique that retrieves relevant document fragments at query time to supply factual context to the model.
SAE
Sparse Autoencoder — decompiles superposition into human-readable feature vectors.
Feature Clamping
Manually elevating or suppressing specific neural feature activations to control behavior.
Persona Vectors
Low-rank vectors representing character traits; used for monitoring and behavioral vaccination.
Stream Algorithm
Hierarchical pruning technique enabling near-linear interpretability analysis.
ASL
AI Safety Level — capability-gated security framework under the RSP.
RSP
Responsible Scaling Policy — Anthropic's governance framework for safe capability deployment.
Constitutional Classifiers
Sentinel LLMs monitoring input/output streams for CBRN threat patterns.
2PA
Two-Party Authorization — dual human approval required for critical infrastructure access.
TTL
Time-to-Live — cache duration (5-minute or 1-hour options).
TTFT
Time-to-First-Token — primary latency metric.
Computer Use API
Tool enabling Claude to view and control desktop environments via screenshot analysis.
Claude Code
CLI-based developer tool for autonomous codebase management and agent team orchestration.
Claude Cowork
GUI-based desktop automation tool for non-technical knowledge workers.
Artifacts
Stateful interactive UI windows rendered alongside chat for code and visual output.

Self-Assessment Questions

Q1. What is "Constitutional AI" (CAI) and how does it differ from traditional RLHF?

CAI uses an explicit, reason-based "Constitution" to align the model, replacing human raters in the evaluation phase with AI evaluators (RLAIF) to scale alignment with compute.

Q2. What are the two layers in Anthropic's "Agentic Bifurcation"?

Claude Code (Developer Layer, CLI-based) and Claude Cowork (Knowledge Worker Layer, GUI-based).

Q3. What is the purpose of "Sparse Autoencoders" (SAEs) in Anthropic's research?

To decompile dense neural activations into readable features, opening the "black box" of AI to understand why a model made a decision.