Seven Frameworks, One Category Error

Ask “which agent framework should we use?” and you have already made the mistake this study exists to correct. The seven contenders are not the same kind of thing: some are integration glue, some are orchestration graphs, some are complete autonomous harnesses, and one — NVIDIA’s NeMo Agent Toolkit — is orthogonal to all of them. Comparing them on features is like comparing an engine, a chassis, and a race team.

The Premise

The study evaluates the field through a deliberately concrete lens: building and deploying an agent in a hybrid environment, where self-hosted GPU inference and cloud APIs coexist. Every claim is tagged as measured (its own experiments) or sourced (vendor documentation), with a stated validity window of June 2026 — a discipline most framework roundups skip.

‘Which framework?’ is a category error — four kinds of artifact share one buzzword.

The Machine

The methodology maps every contender onto layers — integration glue (LangChain), orchestration (LangGraph, OpenAI Agents SDK, Pydantic AI), full harness (Claude Agent SDK, Flue), and a new fleet-orchestration layer above harnesses (Sandcastle) — then scores twelve decision axes, from model portability and self-hosting fit to observability and supply-chain footprint. The decisive technical fact: a self-hosted NIM endpoint speaks the OpenAI-compatible protocol, so any framework that accepts a custom base_url runs your own models with a one-line change.

The Test Drive

The portability experiment built the identical ~32-line tool-calling agent in each Python framework against an OpenAI-compatible endpoint: LangGraph, OpenAI Agents SDK, and Pydantic AI all passed with a one-line base-URL swap. The Claude Agent SDK could not take the test — the study cracked it open and found a 248 MB compiled Claude Code binary inside the Python package, whose supported model paths are Anthropic, Bedrock, and Vertex. All Claude, all cloud.

The footprint audit was equally clarifying: Sandcastle installs 7 packages in 15 MB, LangChain 34 packages in 95 MB, Flue 271 packages in 210 MB, and pydantic-ai’s full bundle 147 packages in 406 MB. And the maturity check cut along layer lines — the orchestration layer has crossed 1.0 into stability commitments, while every harness and vendor SDK is still pre-1.0.

The OpenAI Chat Completions wire protocol — not any framework — is the de facto interoperability standard.

The Fine Print

No formal security scan was run — this is a comparison, not a single-repo audit — but the risk observations are pointed. Lock-in concentrates where the capability is densest: the Claude Agent SDK’s sealed harness is both the most battle-tested loop on the market and the least auditable. What was deliberately not tested: live GPU inference, durability under stress, and long-horizon agent quality — the study is honest that those claims rest on vendor documentation.

The Verdict

The durable invariants are the protocol and the instrumentation, not any single framework. Build on the orchestration layer that has reached stability, keep everything pointed at an OpenAI-compatible endpoint so self-hosting stays a config change, and treat “can’t self-host” as a claim to verify — it is rarely true, except where it is absolute.