EXECUTIVE BRIEF · EVIDENCE-BASED · JUNE 2026

Agent frameworks: what we're actually choosing

LangChain · LangGraph · Claude Agent SDK · OpenAI Agents SDK · Flue · Pydantic AI · Sandcastle · NeMo Agent Toolkit — a fundamentals-level analysis for building and deploying agents at NVIDIA, across self-hosted GPUs and cloud APIs.

Methodology note: claims in this deck are tagged MEASURED — we installed all the frameworks, built and ran the experiments ourselves (2026-06-11; scripts in experiments/) — or SOURCED — taken from vendor documentation. Nothing is asserted from vibes.

← → to navigate · F for fullscreen · companion technical deep-dive: deck.html

The selection problem, stated precisely

"Which agent framework should we use?" is a malformed question

Teams across the industry shortlist these eight names as if they were substitutes. They are not. Treating them as interchangeable is the root cause of most bad selections — and of most framework-war noise online.

An agent is a loop: a model proposes actions; a runtime executes tools, manages state and context, and feeds results back — until the task is done.

Every product on the shortlist is a different slice of that loop. So the real decision is:

How much of the loop do we build vs. buy?
At which layer are we willing to accept lock-in?
Where must it run — our GPUs, someone's cloud, or both?

Why the confusion is universal

All seven market themselves with the same word: "agents."
All seven have a near-identical hello-world. We wrote the same minimal agent in three of them: 31–32 lines of code each, functionally indistinguishable MEASURED. Demos cannot differentiate these tools.
The differences are operational — state, durability, failure recovery, portability, observability — precisely the properties that don't show up until production.

Fundamentals

Five kinds of artifact, one stack

Sorted by how much of the agent loop each one owns. Head-to-head comparison is only meaningful within a layer; across layers, these compose.

Integration layer

Uniform interfaces to models, tools, vector stores, data. Owns none of the loop — supplies its parts.

LangChain

→

Orchestration runtime

Owns control flow: state machines/graphs, handoffs, checkpointing, retries, human-in-the-loop. You define the loop's shape.

LangGraph · OpenAI Agents SDK · Pydantic AI

→

Agent harness

Owns the whole loop: planning, tool execution, sandboxing, context/memory management, subagents — pre-built and opinionated.

Claude Agent SDK · Flue

→

Fleet orchestration

Treats whole harnesses as schedulable units: sandboxes them, runs N in parallel, manages git lifecycle and sessions.

Sandcastle

+

Instrumentation plane

Owns no part of the loop; observes all of it. Profiling, evaluation, optimization across frameworks.

NeMo Agent Toolkit

The composition insight: a production system routinely uses one item from each column — e.g. LangChain connectors, inside a LangGraph state machine, instrumented by NeMo Agent Toolkit. The only genuine either/or decisions are within the orchestration column and within the harness column.

Finding 1 — The de facto standard is a wire protocol, not a framework MEASURED

We built a mock inference server speaking only the OpenAI Chat Completions protocol — the same protocol NIM, vLLM and TGI expose — then pointed the identical tool-calling agent at it in three frameworks.

# the only provider-specific line in each agent:
base_url = "http://127.0.0.1:8123/v1"   # ← any NIM / vLLM / TGI

LangGraph 1.2.4          → PASS  # full loop: tool call → answer
OpenAI Agents SDK 0.17.5 → PASS
Pydantic AI 1.107.0      → PASS

Pass criterion: the final answer contains data that exists only behind the tool — proving the full model → tool → result → answer loop executed. Scripts and raw logs in experiments/ and results/logs/.

Why this matters strategically

Orchestration-layer lock-in is weak. Swapping model vendors — including swapping to our own GPUs — is a config change, not a rewrite.
NIM inherits the entire ecosystem for free. Because NIM speaks the protocol, every protocol-compatible framework is already an NVIDIA framework.
Framework choice at this layer is low-regret. The expensive commitments live elsewhere (next slide).

Honest caveats: OpenAI's SDK needed two documented switches (Chat-Completions model class; tracing off). And protocol compatibility ≠ model quality — open models must still be good enough for the task.

Finding 2 — Lock-in concentrates at the harness layer MEASURED

We opened up the Claude Agent SDK package to see what it actually is:

pip install claude-agent-sdk        # v0.2.97
└─ claude_agent_sdk/
   ├─ *.py                          # thin wrapper (~1 MB)
   └─ _bundled/claude               # 248 MB compiled ELF binary
                                    # = the entire Claude Code harness

# provider routing baked into the binary (strings found):
ANTHROPIC_BASE_URL · CLAUDE_CODE_USE_BEDROCK · CLAUDE_CODE_USE_VERTEX

All three routes — Anthropic API, AWS Bedrock, Google Vertex — serve Claude. The endpoint override speaks Anthropic's own protocol, so it cannot target an OpenAI-compatible endpoint like NIM.

The fundamental trade

You don't adopt a harness — you embed one. The Claude Agent SDK is the most battle-tested agent loop in the industry (it is Claude Code), delivered as a sealed unit. Capability density and vendor coupling arrive in the same box; that's not a flaw, it's the product.

Corollary across the stack

Lock-in rises with the amount of loop you buy: integration layer ≈ none → orchestration ≈ weak (Finding 1) → harness ≈ strong. Flue is the counter-bet: a harness that is multi-provider by design SOURCED — but it's the youngest framework here (v0.9.2) MEASURED. And the market is already routing around harness lock-in from above: Sandcastle treats Claude Code, Codex and Cursor as swappable plug-in providers — its install is 7 packages / 15 MB because it owns no loop and no model SDK MEASURED. A commoditization layer forming on top is exactly what you expect where lock-in concentrates.

Finding 3 — The category is younger than the discourse MEASURED

Version numbers are the vendors' own statement of API stability. The hottest layer of this market has not declared itself stable.

0.17

OpenAI Agents SDK — OpenAI's flagship agent offering is pre-1.0

0.2.97

Claude Agent SDK — Anthropic's, likewise pre-1.0

0.9 / 0.8

Flue runtime / Sandcastle — the newest layers of the stack, both pre-1.0

1.3 / 1.2 / 1.1xx

LangChain / LangGraph / Pydantic AI — the orchestration layer has crossed into stability commitments

Supply-chain surface (fresh install) MEASURED

pydantic-ai-slim  30 pkgs / 87 MB  ·  langchain  34 / 95 MB
langgraph+openai  41 / 120 MB  ·  openai-agents  41 / 103 MB
claude-agent-sdk  31 / 308 MB (248 MB binary)
flue runtime  271 npm pkgs / 210 MB  ·  pydantic-ai full  147 / 406 MB
sandcastle  7 npm pkgs / 15 MB (owns no loop, no model SDK)

Comparable cores cluster at 30–41 packages; outliers are explainable (bundled harness binary; bundled provider SDKs; npm ecosystem granularity) — but each is real attack/maintenance surface to own.

The demo trap

Hello-world was 31–32 LOC in every framework we ran — selection by demo or by bake-off prototype will produce a coin flip.

Select on what only shows up in production: state semantics, durable execution, failure recovery, human-in-the-loop, portability, observability. And pin versions + wrap framework APIs behind thin internal interfaces — at 0.x velocity, breaking changes are a when, not an if.

The field, positioned

Eight names, one honest table

Tool	Layer	Stewardship	Version M	Runs vs OpenAI-compat endpoint (NIM)	Distinct strength	Primary risk
LangChain	Integration	LangChain Inc · MIT	1.3.7	Yes — verified M	Largest integration ecosystem; fastest prototyping	Abstraction churn history; depth of call stacks
LangGraph	Orchestration	LangChain Inc · MIT	1.2.4	Yes — verified M	Durable, checkpointed, auditable control flow S	Steepest learning curve in this set
OpenAI Agents SDK	Orchestration	OpenAI · MIT	0.17.5	Yes — verified, 2 switches M	Minimal primitives (agents/handoffs/guardrails); fast to ship	Pre-1.0; OpenAI-first defaults
Pydantic AI	Orchestration	Pydantic · MIT	1.107.0	Yes — verified M	Type-validated I/O; OTel-native observability (Logfire) S	Smaller ecosystem than LangChain's
Claude Agent SDK	Harness	Anthropic · proprietary models	0.2.97	No — Anthropic protocol only M	The Claude Code loop itself: most battle-tested harness available	Full vendor coupling; sealed 248 MB binary; pre-1.0
Flue	Harness	withastro (F. Schott) · Apache-2.0	0.9.2	By design (multi-provider) — not tested by us S	Portable harness: Node/edge/CI targets; TypeScript-native	Youngest tool here; ecosystem still forming
Sandcastle	Fleet orchestration	Matt Pocock (AI Hero) · MIT	0.8.0	N/A — drives harnesses (Claude Code, Codex, Cursor…); inherits their model paths	Sandboxed parallel coding-agent fleets with git worktree/branch lifecycle; 7-pkg / 15 MB install M	Coding agents only; pre-1.0; single-maintainer project
NeMo Agent Toolkit	Instrumentation	NVIDIA · Apache-2.0	—	Native to NVIDIA serving S	Framework-agnostic profiling/eval down to per-tool, per-token S	Complement, not a replacement — value needs NVIDIA infra

M = measured by us, 2026-06-11 · S = vendor docs · full data: results/benchmarks/measured_findings.md

Implications for NVIDIA

Our own product strategy already answers this

NVIDIA ships the serving layer (NIM — OpenAI-compatible) and the instrumentation plane (NeMo Agent Toolkit — deliberately framework-agnostic). Neither bets on a framework winner. The findings above explain why that's the right read of the market.

The trunk: protocol-portable orchestration on our GPUs

For sustained, sensitive, or cost-sensitive workloads: LangGraph or Pydantic AI against NIM. Verified portable M, post-1.0, durable, observable. Zero rewrite to move between our GPUs and any cloud API.

The fast lane: harnesses where autonomy density wins

For coding agents, deep research, computer-use — where loop quality dominates — Claude Agent SDK (accepting Claude coupling, cloud/Bedrock/Vertex) or Flue (TypeScript, portable, pre-1.0 risk priced in).

The constant: instrumentation

NeMo Agent Toolkit over everything — per-tool/per-token profiling and evals regardless of which framework each team chose S. This is also our home-field telemetry for GPU cost.

The deeper point: the durable asset isn't the framework choice — it's protocol compatibility + instrumentation discipline. Teams that keep those two invariants can change frameworks in weeks; teams that don't are rewriting for quarters.

Operating guidance

Five decision rules

If…	Then…
Data or models must stay on our infrastructure (or could need to later)	Require protocol-portable orchestration (LangGraph / Pydantic AI / OpenAI Agents SDK) on NIM. Harness SDKs that can't target it are out for that workload.
The workload is long-running, multi-step, or audit-sensitive	LangGraph — checkpointed, durable, resumable control flow is its core design S.
Outputs feed systems (APIs, DBs) rather than humans	Pydantic AI — schema-validated I/O turns a class of silent agent failures into loud type errors.
Loop quality is the product (coding/research/computer-use agents) and Claude coupling is acceptable	Claude Agent SDK — buying the best harness beats rebuilding it. TypeScript-native or edge/CI deployment instead → Flue, with 0.x risk priced in.
You need N coding agents running in parallel, safely, with results landing as branches/PRs	Sandcastle on top of your chosen harness — container sandboxing, git worktree/branch lifecycle, session fork/resume. Composes with the rule above; doesn't replace it.
Any of the above goes to production	Instrument with NeMo Agent Toolkit from day one; pin versions; wrap framework APIs behind thin internal interfaces (everything below 1.0 here will break).

And one anti-rule: don't run a bake-off of hello-world prototypes to decide — we measured them at 31–32 LOC and functionally identical. Bake off the operational properties or don't bake off at all.

Epistemic honesty

What we verified, and what we didn't

Verified by experiment (2026-06-11)

Full agent loop runs unmodified against an OpenAI-compatible endpoint in LangGraph, OpenAI Agents SDK, Pydantic AI — one base_url line each.
Claude Agent SDK = thin wrapper + 248 MB sealed harness binary; Anthropic-protocol routes only (API / Bedrock / Vertex).
Dependency & disk footprints; current shipped versions; hello-world LOC parity.
Sandcastle installed & inspected: v0.8.0, 7 npm packages / 15 MB — an orchestration shell over harness CLIs, with no agent loop or model SDK of its own.
Reproducible: experiments/setup.sh → experiments/run_all.sh.

Not tested — known unknowns

No live GPU/NIM run — we verified protocol compatibility (the claim at issue), not inference performance or open-model task quality.
Flue untested by us — its multi-provider support is per docs; a TypeScript port of our experiment is the obvious next step.
Sandcastle not exercised end-to-end — running a real fleet needs Docker plus a harness CLI and API credentials; its feature claims rest on repo docs.
No durability/failure-injection or long-horizon quality evals — durability claims rest on vendor docs.
Snapshot risk — this market moves monthly; numbers dated 2026-06-11.

Recommended follow-up if we operationalize this: run the same portability experiment against a real NIM on internal GPUs, add Flue to it, and failure-inject a LangGraph checkpointed workflow. Each is a day of work and would upgrade the remaining SOURCED claims to MEASURED.

If you remember three things

Takeaways

1

It's a category error, not a bake-off. Five kinds of artifact share one buzzword. Only within-layer comparisons are meaningful; across layers, compose.

2

The standard is the protocol; lock-in lives in the harness. We proved orchestration portability with one config line — and found the coupling sealed inside a 248 MB binary. Buy harnesses deliberately, not by default.

3

Protect two invariants and the choice becomes reversible: protocol-compatible serving (NIM) and framework-agnostic instrumentation (NeMo Agent Toolkit). Both are layers NVIDIA owns.

All data measured 2026-06-11 · reproduction scripts in experiments/ · full numbers in results/benchmarks/measured_findings.md · technical deep-dive in deck.html