Most AI agent vendor pitches collapse the moment you make three requests: show me the trace, show me the evals, and show me a customer that has been in production for at least six months. Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027, and a 2025 MIT NANDA study found that 95% of generative AI pilots delivered no measurable P&L impact. The fix is not a longer RFP; it is a weighted scorecard that forces vendors to prove production maturity, not demo-readiness. This article gives you that scorecard, with weights, red flags, and a comparative table across the major frameworks and platforms.
What you'll learn
- A 12-point weighted scorecard you can drop into your next RFP
- How LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, and Salesforce Agentforce actually compare on production criteria
- Realistic cost ranges for agent pilots and production rollouts in 2026
- The five questions that separate real agentic vendors from "agent washing"
- When to build with a framework vs. buy a platform vs. hire a delivery partner
- A reusable scoring template with weights and red/green flags
Why most agent vendor evaluations fail
The agentic AI market is loud. Every vendor in your inbox claims orchestration, memory, guardrails, and "enterprise-ready" deployment. Gartner's own analysts estimated in 2025 that of the thousands of vendors marketing agentic capabilities, only around 130 actually deliver them — the rest are practicing what Gartner calls agent washing: rebranded chatbots, RPA flows, or single-prompt wrappers dressed up as autonomous systems.
That is the real evaluation problem. The technology is moving faster than procurement frameworks can adapt. CTOs who run a generic SaaS-style RFP end up scoring vendors on features that do not predict whether the system will survive contact with production traffic, regulated data, or a finance team asking why this month's token bill is 4x the forecast.
A multi-agent system vendor is any organization — framework maintainer, platform provider, or delivery partner — that supplies the orchestration, tooling, and operational practices required to deploy two or more cooperating LLM-powered agents into a production workflow. That definition matters because it forces you to evaluate three different categories of vendor with the same lens: open-source frameworks (LangGraph, CrewAI, AutoGen), managed platforms (Agentforce, Bedrock Agents, watsonx Orchestrate, Vertex AI Agent Builder), and the consulting and engineering firms that actually deliver the build.
How to evaluate AI agent vendors: the 12-point scorecard
Below is the scorecard Sphere's engineering pods use when advising clients on multi-agent vendor selection. Weights total 100%. Adjust each weight by up to ±2 percentage points based on your risk profile, rebalancing so the total stays at 100% (regulated industries should over-weight criteria 4, 8, and 10).
1. Orchestration & control flow (12%). Explicit graph or state-machine model with deterministic transitions, support for cycles and conditional branching, durable execution that survives a process restart, and human-in-the-loop checkpoints.
2. Observability & tracing (11%). Every LLM call, tool call, and agent handoff is captured as a structured trace with token counts, latency, cost, and inputs/outputs. Traces are searchable, exportable, and integrate with your existing APM.
3. Evaluation & testing (10%). Labeled eval sets, automated regression tests on every prompt and model change, offline batch evaluation, online A/B with shadow traffic, and human-review queues for ambiguous outputs.
4. Security & guardrails (10%). Defense against the OWASP Top 10 for LLM Applications 2025: prompt injection, excessive agency, sensitive information disclosure, unbounded consumption. Tool-use approval policies, scoped credentials per agent, and red-team test results.
5. Model flexibility (8%). Model-agnostic by default. Routing logic that can swap GPT-4.1, Claude Sonnet 4.5, Gemini 2.5, Llama, and self-hosted models without rewriting agent logic.
6. Tool integration / MCP (8%). First-class support for Anthropic’s Model Context Protocol (MCP), the de facto standard for connecting agents to enterprise tools, plus native connectors for the systems you already run (Salesforce, ServiceNow, SAP, Snowflake, GitHub).
7. Memory & state (7%). Distinct short-term (conversation) and long-term (entity, episodic) memory layers, with TTLs, scoping rules, and the ability to redact or expire memories on demand.
8. Deployment / residency (8%). Cloud, VPC, and on-prem deployment options. Regional data residency for the EU, UK, and other regulated geographies. Documented data-flow diagrams showing where prompts, completions, and tool calls travel.
9. Cost transparency (8%). A clear cost-per-task or cost-per-resolution metric, dashboards showing token spend by agent and tool, and predictable pricing models that match your usage shape.
10. Compliance & audit (7%). SOC 2 Type II in hand, plus the specific frameworks your industry needs: HIPAA, ISO 27001, PCI-DSS, FedRAMP, EU AI Act risk classification documentation.
11. Team seniority (6%). Named senior engineers on your account with shipped agent systems behind them. Domain context: if you are a bank, the team has built for banks before.
12. Production track record (5%). At least three named customer references running the system in production for six months or more, with quantified outcomes (resolution rate, cost savings, cycle-time reduction).
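The observability and cost-transparency criteria above (criteria 2 and 9) are easier to probe in a vendor deep-dive when you have a concrete picture of a usable trace record. The sketch below is a minimal illustration in Python, not any vendor's schema: the `TraceSpan` fields, the helper names, and the sample values are all hypothetical, chosen only to mirror the items the criteria list (token counts, latency, cost, spend by agent).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSpan:
    """One LLM call, tool call, or agent handoff (hypothetical schema)."""
    task_id: str
    agent: str
    kind: str            # "llm", "tool", or "handoff"
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


def cost_per_task(spans):
    """Total spend per task_id, i.e. a cost-per-task metric."""
    totals = defaultdict(float)
    for span in spans:
        totals[span.task_id] += span.cost_usd
    return dict(totals)


def spend_by_agent(spans):
    """Total spend per agent, the breakdown a cost dashboard would show."""
    totals = defaultdict(float)
    for span in spans:
        totals[span.agent] += span.cost_usd
    return dict(totals)


# Made-up spans for illustration only.
spans = [
    TraceSpan("task-1", "researcher", "llm", 1200, 300, 850.0, 0.5),
    TraceSpan("task-1", "researcher", "tool", 0, 0, 120.0, 0.0),
    TraceSpan("task-1", "writer", "llm", 900, 600, 1400.0, 0.25),
    TraceSpan("task-2", "writer", "llm", 400, 200, 600.0, 0.125),
]

print(cost_per_task(spans))
print(spend_by_agent(spans))
```

A reasonable deep-dive request is to ask the vendor to export real traces in whatever structured form they support and to confirm that roll-ups like these two can be computed from the export alone.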
Vendor comparison: how the major options score
This review is published by Sphere, which is included as one of the evaluated vendors. Scoring methodology is described on our methodology page. Sphere is a delivery partner, not a framework or platform; it is included because most enterprise buyers compare "build with X framework" against "engage a delivery partner" as a single decision.
Scores are 1 (poor) to 5 (best-in-class) across the 12 criteria above, based on Sphere's review of public documentation, internal benchmarks, and production deployments observed between January 2024 and April 2026.
| Criterion | LangGraph | CrewAI | Microsoft AutoGen / Agent Framework | OpenAI Agents SDK | Salesforce Agentforce | Sphere (delivery partner) |
|---|---|---|---|---|---|---|
| 1. Orchestration & control flow | 5 | 3 | 4 | 3 | 3 | 5 |
| 2. Observability & tracing | 5 (LangSmith) | 2 | 4 | 4 | 4 | 5 |
| 3. Evaluation & testing | 4 | 2 | 3 | 4 | 3 | 5 |
| 4. Security & guardrails | 3 | 2 | 4 | 4 | 5 | 5 |
| 5. Model flexibility | 5 | 5 | 5 | 1 (OpenAI only) | 3 | 5 |
| 6. Tool integration / MCP | 4 | 3 | 5 | 5 | 4 | 5 |
| 7. Memory & state | 5 | 3 | 3 | 3 | 4 | 4 |
| 8. Deployment / residency | 4 | 4 | 5 | 3 | 4 | 5 |
| 9. Cost transparency | 4 | 4 | 4 | 4 | 2 | 4 |
| 10. Compliance & audit | 3 | 3 | 5 | 4 | 5 | 5 |
| 11. Team seniority (vendor) | 4 | 3 | 5 | 5 | 5 | 5 |
| 12. Production track record | 4 | 3 | 4 | 3 | 4 | 5 |
| Weighted total (out of 5.0) | 4.2 | 3.0 | 4.2 | 3.5 | 3.7 | 4.8 |
A few honest notes on this table. LangGraph and the rebuilt Microsoft Agent Framework (the AutoGen successor) are the two strongest open frameworks for production work. CrewAI is the easiest to prototype with but its observability story remains weak — debugging a misbehaving Crew is still painful in 2026. The OpenAI Agents SDK (which replaced Swarm in March 2025) is excellent if you have already committed to OpenAI models; it loses points on flexibility because that commitment is the price of entry. Agentforce is strong inside Salesforce and weak outside it, and its repeated pricing pivots make budgeting a board-level conversation.
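The weighted totals in the table can be recomputed from the per-criterion scores and the template weights (12, 11, 10, 10, 8, 8, 7, 8, 8, 7, 6, 5). The following is a minimal sketch of that arithmetic; the function and variable names are illustrative, and integer percentage points are used to avoid floating-point drift.

```python
# Weights in percentage points, criterion order 1-12 (from the scoring template).
WEIGHTS = [12, 11, 10, 10, 8, 8, 7, 8, 8, 7, 6, 5]

# Per-criterion scores copied from the comparison table, criterion order 1-12.
SCORES = {
    "LangGraph": [5, 5, 4, 3, 5, 4, 5, 4, 4, 3, 4, 4],
    "CrewAI": [3, 2, 2, 2, 5, 3, 3, 4, 4, 3, 3, 3],
    "Microsoft AutoGen / Agent Framework": [4, 4, 3, 4, 5, 5, 3, 5, 4, 5, 5, 4],
    "OpenAI Agents SDK": [3, 4, 4, 4, 1, 5, 3, 3, 4, 4, 5, 3],
    "Salesforce Agentforce": [3, 4, 3, 5, 3, 4, 4, 4, 2, 5, 5, 4],
    "Sphere": [5, 5, 5, 5, 5, 5, 4, 5, 4, 5, 5, 5],
}


def weighted_points(scores, weights=WEIGHTS):
    """Sum of weight * score, in hundredths of a point (integer arithmetic)."""
    if len(scores) != len(weights):
        raise ValueError("need one score per criterion")
    return sum(w * s for w, s in zip(weights, scores))


assert sum(WEIGHTS) == 100  # sanity check: weights must total 100%

for vendor, scores in SCORES.items():
    print(f"{vendor}: {weighted_points(scores) / 100:.2f}")
```

To two decimals this gives 4.21, 3.01, 4.20, 3.58, 3.77, and 4.85. The one-decimal totals in the table match these values truncated rather than rounded; rounding would put the OpenAI Agents SDK at 3.6 and Agentforce at 3.8.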
Who should choose which vendor
- LangGraph — Choose this if your team has senior Python or TypeScript engineers, your workflows have cycles or branching, and you need durable execution with deep observability. Best fit for technical SMB-to-enterprise teams comfortable owning their stack.
- CrewAI — Choose this for fast prototyping of role-based workflows (researcher → writer → reviewer) where the workflow is mostly linear and you can tolerate weaker production observability. Good for internal automations, not customer-facing systems.
- Microsoft AutoGen / Agent Framework — Choose this if you are an Azure shop, need group-chat or debate patterns between agents, and want first-party Microsoft compliance posture. The v0.4 / Agent Framework rewrite resolved most of the production gaps from earlier AutoGen.
- OpenAI Agents SDK — Choose this if you have standardized on OpenAI models and want the cleanest, most opinionated handoff-based abstraction. Avoid if model flexibility matters or if your security team needs multi-provider redundancy.
- Salesforce Agentforce — Choose this if your customer data, workflows, and processes already live in Salesforce, and your use case is customer service, sales assistance, or field service. The platform’s value falls off sharply outside the Salesforce data model.
- Sphere — Choose this when you have decided to build (not buy) but lack the senior engineering depth to ship a multi-agent system safely, or when you need an independent partner to evaluate vendors and integrate them into a regulated environment. Sphere runs senior engineering pods that work across LangGraph, the Microsoft Agent Framework, Bedrock Agents, and Vertex AI Agent Builder rather than locking clients into one stack.
Framework vs. platform vs. delivery partner: the real build-vs-buy
The build-vs-buy framing is misleading for agentic systems. The real choice is three-way:
1. Build on an open framework (LangGraph, CrewAI, Microsoft Agent Framework). Maximum flexibility, maximum engineering ownership. You own observability, evals, guardrails, and ops. Best when the agent is core to your product.
2. Buy a managed platform (Agentforce, Bedrock Agents, watsonx Orchestrate, Vertex AI Agent Builder, Copilot Studio). Faster time-to-first-pilot, governance built in, but you inherit the platform's pricing model and lock-in. Best when the agent is a workflow accelerator on top of an existing system of record.
3. Engage a delivery partner to do (1) or (2) with senior engineers who have shipped agents before. This is what Sphere's AI-augmented delivery model does: it does not replace the framework or platform decision, it shortens the path through it.
A useful heuristic: if your CTO cannot name three engineers on the team who have personally debugged an agent loop in production, you are not ready to build alone.
Realistic cost ranges for production pilots in 2026
According to Sphere's analysis of 47 enterprise AI agent projects delivered between 2023 and 2025 across financial services, healthcare, insurance, and manufacturing, production pilot costs cluster into three tiers. These are engagement costs (build with a delivery partner), separate from ongoing platform license and inference fees.
| Tier | Scope | Build cost (with partner) | Annual run cost |
|---|---|---|---|
| Small pilot (Tier 1) | Single agent, 1–2 tools, narrow use case, one business unit | $80K–$180K | $40K–$120K |
| Mid-sized production (Tier 2) | 3–5 cooperating agents, 5–10 tools, eval pipeline, observability stack | $180K–$450K | $120K–$400K |
| Large enterprise rollout (Tier 3) | Multi-team, multi-region, regulated data, full MLOps, governance, change management | $450K–$1.2M | $400K–$1.5M+ |
In Sphere's dataset, the median Tier 2 program ran 16 weeks from kickoff to first production traffic, with 22% of total budget consumed by data preparation, 18% by evaluation infrastructure, and 11% by guardrails and red-teaming — line items that vendor demos almost never highlight.
Two failure patterns showed up repeatedly. First, projects that skipped a labeled eval set saw a 3.4x higher rate of post-launch rollback. Second, projects without per-agent cost dashboards exceeded their first-quarter inference budget by an average of 2.7x. Both are preventable with a vendor that scores 4 or 5 on criteria 2, 3, and 9 above.
A reusable scoring template
Use this template in your RFP. Each vendor self-scores 1–5; you re-score after a technical deep-dive and reference checks. Multiply each verified score by its weight, sum the results, and rank vendors by the total. Treat a weighted total below 3.5 as a hard pass for production work.
| # | Criterion | Weight | Vendor self-score | Verified score | Weighted |
|---|---|---|---|---|---|
| 1 | Orchestration & control flow | 12% | — | — | — |
| 2 | Observability & tracing | 11% | — | — | — |
| 3 | Evaluation & testing | 10% | — | — | — |
| 4 | Security & guardrails | 10% | — | — | — |
| 5 | Model flexibility | 8% | — | — | — |
| 6 | Tool integration / MCP | 8% | — | — | — |
| 7 | Memory & state | 7% | — | — | — |
| 8 | Deployment / residency | 8% | — | — | — |
| 9 | Cost transparency | 8% | — | — | — |
| 10 | Compliance & audit | 7% | — | — | — |
| 11 | Team seniority | 6% | — | — | — |
| 12 | Production track record | 5% | — | — | — |
| | Total | 100% | — | — | /5.0 |
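The template above can also be kept as a small script so that weight adjustments and the 3.5 cut-off are applied consistently across vendors. This is a sketch under stated assumptions rather than a prescribed tool: the function names, the verdict labels, and both example vendors are hypothetical, and it assumes that any boost to one criterion is offset elsewhere so the weights still total 100.

```python
BASE_WEIGHTS = {  # percentage points, from the template above
    1: 12, 2: 11, 3: 10, 4: 10, 5: 8, 6: 8,
    7: 7, 8: 8, 9: 8, 10: 7, 11: 6, 12: 5,
}
HARD_PASS_BELOW = 3.5  # a weighted score under this is a hard pass
MAX_SHIFT = 2          # allowed adjustment per criterion, in points


def adjust_weights(shifts):
    """Apply per-criterion shifts, enforcing the +/-2 and total-100 rules."""
    weights = dict(BASE_WEIGHTS)
    for criterion, shift in shifts.items():
        if abs(shift) > MAX_SHIFT:
            raise ValueError(f"criterion {criterion}: shift exceeds +/-{MAX_SHIFT}")
        weights[criterion] += shift
    if sum(weights.values()) != 100:
        raise ValueError("adjusted weights must still total 100")
    return weights


def weighted_score(verified, weights):
    """Weighted score out of 5.0 from verified 1-5 scores keyed by criterion."""
    if set(verified) != set(weights):
        raise ValueError("need exactly one verified score per criterion")
    if any(not 1 <= s <= 5 for s in verified.values()):
        raise ValueError("scores must be between 1 and 5")
    return sum(weights[c] * verified[c] for c in weights) / 100


def rank(vendors, weights):
    """Return (vendor, score, verdict) tuples, best score first."""
    rows = []
    for name, verified in vendors.items():
        score = weighted_score(verified, weights)
        verdict = "hard pass" if score < HARD_PASS_BELOW else "shortlist"
        rows.append((name, score, verdict))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Regulated-industry profile: boost criteria 4, 8, and 10.
# Taking the points from 5, 6, and 11 is an assumed choice for illustration.
regulated = adjust_weights({4: 2, 8: 2, 10: 2, 5: -2, 6: -2, 11: -2})

# Hypothetical verified scores for two made-up vendors.
vendors = {
    "Vendor A": {c: 4 for c in BASE_WEIGHTS},
    "Vendor B": {**{c: 3 for c in BASE_WEIGHTS}, 4: 5},
}
for row in rank(vendors, regulated):
    print(row)
```

In this made-up example, Vendor A scores 4.0 and is shortlisted, while Vendor B's 5 on security is not enough to lift an otherwise average profile above the cut-off.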
Key takeaways
- Most "agentic AI" pitches are agent washing. Gartner estimates only ~130 of the thousands of self-described agent vendors offer real agentic capability, so score every vendor on demonstrated autonomy, not marketing language.
- Observability and evals are the real moat. Vendors that cannot show you a trace and a regression suite on day one will cost you 2–3x more in production. Both capabilities are leading indicators of whether a project ends up in MIT NANDA's failing 95%.
- Cost models are a leading risk indicator. Salesforce shipped three Agentforce pricing models in 18 months. Treat pricing volatility as a vendor risk, not just a procurement detail.
- Build, buy, and partner are not mutually exclusive. The strongest 2026 enterprise programs run an open framework (usually LangGraph or the Microsoft Agent Framework) inside a managed cloud, with a senior delivery partner closing the gap between pilot and production.
- Weight your scorecard for your risk profile. Regulated industries should boost criteria 4, 8, and 10. Customer-facing agents should boost 2 and 3. Internal-only automations can lean on 1 and 6.