LangChain · LangGraph · Claude Agent SDK · OpenAI Agents SDK · Flue · Pydantic AI · Sandcastle · NeMo Agent Toolkit — a fundamentals-level analysis for building and deploying agents at NVIDIA, across self-hosted GPUs and cloud APIs.
experiments/) — or
SOURCED — taken from vendor documentation. Nothing is asserted from vibes.
← → to navigate · F for fullscreen · companion technical deep-dive: deck.html
Teams across the industry shortlist these eight names as if they were substitutes. They are not. Treating them as interchangeable is the root cause of most bad selections — and of most framework-war noise online.
Every product on the shortlist is a different slice of that loop. So the real decision is:
Sorted by how much of the agent loop each one owns. Head-to-head comparison is only meaningful within a layer; across layers, these compose.
Uniform interfaces to models, tools, vector stores, data. Owns none of the loop — supplies its parts.
Owns control flow: state machines/graphs, handoffs, checkpointing, retries, human-in-the-loop. You define the loop's shape.
Owns the whole loop: planning, tool execution, sandboxing, context/memory management, subagents — pre-built and opinionated.
Treats whole harnesses as schedulable units: sandboxes them, runs N in parallel, manages git lifecycle and sessions.
Owns no part of the loop; observes all of it. Profiling, evaluation, optimization across frameworks.
We built a mock inference server speaking only the OpenAI Chat Completions protocol — the same protocol NIM, vLLM and TGI expose — then pointed the identical tool-calling agent at it in three frameworks.
# the only provider-specific line in each agent: base_url = "http://127.0.0.1:8123/v1" # ← any NIM / vLLM / TGI LangGraph 1.2.4 → PASS # full loop: tool call → answer OpenAI Agents SDK 0.17.5 → PASS Pydantic AI 1.107.0 → PASS
Pass criterion: the final answer contains data that exists only behind the tool — proving the full model → tool → result → answer loop executed. Scripts and raw logs in experiments/ and results/logs/.
We opened up the Claude Agent SDK package to see what it actually is:
pip install claude-agent-sdk # v0.2.97 └─ claude_agent_sdk/ ├─ *.py # thin wrapper (~1 MB) └─ _bundled/claude # 248 MB compiled ELF binary # = the entire Claude Code harness # provider routing baked into the binary (strings found): ANTHROPIC_BASE_URL · CLAUDE_CODE_USE_BEDROCK · CLAUDE_CODE_USE_VERTEX
All three routes — Anthropic API, AWS Bedrock, Google Vertex — serve Claude. The endpoint override speaks Anthropic's own protocol, so it cannot target an OpenAI-compatible endpoint like NIM.
You don't adopt a harness — you embed one. The Claude Agent SDK is the most battle-tested agent loop in the industry (it is Claude Code), delivered as a sealed unit. Capability density and vendor coupling arrive in the same box; that's not a flaw, it's the product.
Lock-in rises with the amount of loop you buy: integration layer ≈ none → orchestration ≈ weak (Finding 1) → harness ≈ strong. Flue is the counter-bet: a harness that is multi-provider by design SOURCED — but it's the youngest framework here (v0.9.2) MEASURED. And the market is already routing around harness lock-in from above: Sandcastle treats Claude Code, Codex and Cursor as swappable plug-in providers — its install is 7 packages / 15 MB because it owns no loop and no model SDK MEASURED. A commoditization layer forming on top is exactly what you expect where lock-in concentrates.
Version numbers are the vendors' own statement of API stability. The hottest layer of this market has not declared itself stable.
pydantic-ai-slim 30 pkgs / 87 MB · langchain 34 / 95 MB
langgraph+openai 41 / 120 MB · openai-agents 41 / 103 MB
claude-agent-sdk 31 / 308 MB (248 MB binary)
flue runtime 271 npm pkgs / 210 MB · pydantic-ai full 147 / 406 MB
sandcastle 7 npm pkgs / 15 MB (owns no loop, no model SDK)
Comparable cores cluster at 30–41 packages; outliers are explainable (bundled harness binary; bundled provider SDKs; npm ecosystem granularity) — but each is real attack/maintenance surface to own.
Hello-world was 31–32 LOC in every framework we ran — selection by demo or by bake-off prototype will produce a coin flip.
Select on what only shows up in production: state semantics, durable execution, failure recovery, human-in-the-loop, portability, observability. And pin versions + wrap framework APIs behind thin internal interfaces — at 0.x velocity, breaking changes are a when, not an if.
| Tool | Layer | Stewardship | Version M | Runs vs OpenAI-compat endpoint (NIM) | Distinct strength | Primary risk |
|---|---|---|---|---|---|---|
| LangChain | Integration | LangChain Inc · MIT | 1.3.7 | Yes — verified M | Largest integration ecosystem; fastest prototyping | Abstraction churn history; depth of call stacks |
| LangGraph | Orchestration | LangChain Inc · MIT | 1.2.4 | Yes — verified M | Durable, checkpointed, auditable control flow S | Steepest learning curve in this set |
| OpenAI Agents SDK | Orchestration | OpenAI · MIT | 0.17.5 | Yes — verified, 2 switches M | Minimal primitives (agents/handoffs/guardrails); fast to ship | Pre-1.0; OpenAI-first defaults |
| Pydantic AI | Orchestration | Pydantic · MIT | 1.107.0 | Yes — verified M | Type-validated I/O; OTel-native observability (Logfire) S | Smaller ecosystem than LangChain's |
| Claude Agent SDK | Harness | Anthropic · proprietary models | 0.2.97 | No — Anthropic protocol only M | The Claude Code loop itself: most battle-tested harness available | Full vendor coupling; sealed 248 MB binary; pre-1.0 |
| Flue | Harness | withastro (F. Schott) · Apache-2.0 | 0.9.2 | By design (multi-provider) — not tested by us S | Portable harness: Node/edge/CI targets; TypeScript-native | Youngest tool here; ecosystem still forming |
| Sandcastle | Fleet orchestration | Matt Pocock (AI Hero) · MIT | 0.8.0 | N/A — drives harnesses (Claude Code, Codex, Cursor…); inherits their model paths | Sandboxed parallel coding-agent fleets with git worktree/branch lifecycle; 7-pkg / 15 MB install M | Coding agents only; pre-1.0; single-maintainer project |
| NeMo Agent Toolkit | Instrumentation | NVIDIA · Apache-2.0 | — | Native to NVIDIA serving S | Framework-agnostic profiling/eval down to per-tool, per-token S | Complement, not a replacement — value needs NVIDIA infra |
M = measured by us, 2026-06-11 · S = vendor docs · full data: results/benchmarks/measured_findings.md
NVIDIA ships the serving layer (NIM — OpenAI-compatible) and the instrumentation plane (NeMo Agent Toolkit — deliberately framework-agnostic). Neither bets on a framework winner. The findings above explain why that's the right read of the market.
For sustained, sensitive, or cost-sensitive workloads: LangGraph or Pydantic AI against NIM. Verified portable M, post-1.0, durable, observable. Zero rewrite to move between our GPUs and any cloud API.
For coding agents, deep research, computer-use — where loop quality dominates — Claude Agent SDK (accepting Claude coupling, cloud/Bedrock/Vertex) or Flue (TypeScript, portable, pre-1.0 risk priced in).
NeMo Agent Toolkit over everything — per-tool/per-token profiling and evals regardless of which framework each team chose S. This is also our home-field telemetry for GPU cost.
| If… | Then… |
|---|---|
| Data or models must stay on our infrastructure (or could need to later) | Require protocol-portable orchestration (LangGraph / Pydantic AI / OpenAI Agents SDK) on NIM. Harness SDKs that can't target it are out for that workload. |
| The workload is long-running, multi-step, or audit-sensitive | LangGraph — checkpointed, durable, resumable control flow is its core design S. |
| Outputs feed systems (APIs, DBs) rather than humans | Pydantic AI — schema-validated I/O turns a class of silent agent failures into loud type errors. |
| Loop quality is the product (coding/research/computer-use agents) and Claude coupling is acceptable | Claude Agent SDK — buying the best harness beats rebuilding it. TypeScript-native or edge/CI deployment instead → Flue, with 0.x risk priced in. |
| You need N coding agents running in parallel, safely, with results landing as branches/PRs | Sandcastle on top of your chosen harness — container sandboxing, git worktree/branch lifecycle, session fork/resume. Composes with the rule above; doesn't replace it. |
| Any of the above goes to production | Instrument with NeMo Agent Toolkit from day one; pin versions; wrap framework APIs behind thin internal interfaces (everything below 1.0 here will break). |
base_url line each.experiments/setup.sh → experiments/run_all.sh.It's a category error, not a bake-off. Five kinds of artifact share one buzzword. Only within-layer comparisons are meaningful; across layers, compose.
The standard is the protocol; lock-in lives in the harness. We proved orchestration portability with one config line — and found the coupling sealed inside a 248 MB binary. Buy harnesses deliberately, not by default.
Protect two invariants and the choice becomes reversible: protocol-compatible serving (NIM) and framework-agnostic instrumentation (NeMo Agent Toolkit). Both are layers NVIDIA owns.
All data measured 2026-06-11 · reproduction scripts in experiments/ · full numbers in results/benchmarks/measured_findings.md · technical deep-dive in deck.html