Specification

Zaira Standard

Version: 0.9
Date: March 2026
Status: Current
Purpose: The open specification for evaluating whether a developer tool is ready for AI agents to use. Defines the modular evaluation framework, criteria, scoring gradients, and tier thresholds by which a tool earns Agent Ready or Agent Native designation. Every criterion is grounded in academic research or empirical measurement.

1. Evaluation Logic

These are the design principles behind the Zaira Standard. They explain why the standard is structured the way it is and guide interpretation of the criteria.

The standard scales with tool complexity

The simplest tools face the simplest evaluation. A library installed via npm install and used locally is evaluated against the Base Standard alone: documentation, lifecycle health, supply chain integrity, installation simplicity. No authentication criteria. No API surface area management. No rate limit communication. Those criteria don't apply because the tool doesn't have those concerns.

As a tool's complexity increases (it exposes an API, runs as a hosted service, handles destructive operations, requires credentials) complexity modules activate and add criteria proportional to that complexity. A cloud database with auth, writes, and a REST API is evaluated against the Base Standard plus four complexity modules. Evaluation scope matches the tool's surface area.

Every criterion in a tool's evaluation applies to that tool. There are no N/A markings, no skipped criteria, no denominator adjustments. If a criterion is activated in a tool's evaluation, it is relevant to that evaluation.

Health over activity

The standard measures whether a tool functions, not whether it is being actively developed. A stable, feature-complete library with zero open CVEs, passing continuous integration, and accurate documentation satisfies health criteria even without recent commits; a project with weekly commits and unaddressed issues does not. Criteria that might otherwise penalize inactivity (B4, B11) are defined in terms of empirical health signals (documentation-behavior correspondence, vulnerability patching cadence, installation success on current runtimes) rather than recency signals. Completion is a state demonstrated through continued health, not declared through version increments or announcements.

Discoverability is outside scope

The Zaira Standard evaluates whether a tool is ready for agents to use, not whether agents can find it. Discoverability (registry presence, search engine indexing, structured metadata) is a separate concern. Conflating discoverability with agent-readiness would penalize tools whose distribution is weak and reward tools whose distribution is strong, independent of agent-readiness.

Agent capability floor

The Zaira Standard defines a minimum agent capability threshold: the capability floor. Agents scoring below the floor are not target consumers of Zaira Standard evaluation results.

The capability floor for Zaira Standard v0.9 is 35% on SWE-Bench Pro V9, administered by SEAL. This benchmark evaluates base model and sub-agent performance on software engineering tasks with tool usage. It strips wrapper scaffolding and vendor-specific enhancements.

The floor is defined by benchmark score, not by model name. The capability floor is a normative parameter of each standard version, reviewed with each minor revision (see §5) following the standard change process.


2. How the Standard Works

Module Activation

Every tool starts with the Base Standard: 15 criteria that apply universally. Then, based on what the tool does and how it's accessed, complexity modules activate:

Module | Trigger Question | Criteria Added
Programmatic Interface | Does the tool expose an API, SDK, or MCP server? | 16
Network Service | Is the tool a hosted/remote service? | 8
Write Operations | Can the tool create, modify, or delete data/resources? | 4
Authentication | Does the tool require credentials or tokens? | 4
CLI | Does the tool have a command-line interface? | 4

After complexity modules, one or more domain modules may apply based on the tool's functional domains (Payments, Databases, Communications, and others defined in §9).

Each trigger question is a simple yes/no. A tool may trigger zero, one, or many complexity modules. The triggers are independent: a tool can be a Network Service without having a CLI, or have Write Operations without requiring Authentication.
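The activation rule can be sketched in a few lines of Python. This is an illustrative sketch, not part of the standard: the function names, the dictionary keys, and the `domain_criteria` parameter are assumptions made for the example. Only the criteria counts come from the table above.

```python
BASE_CRITERIA = 15

# Criteria added by each complexity module, as listed in the table above.
COMPLEXITY_MODULES = {
    "programmatic_interface": 16,
    "network_service": 8,
    "write_operations": 4,
    "authentication": 4,
    "cli": 4,
}


def activated_modules(answers: dict[str, bool]) -> list[str]:
    """Return the complexity modules whose trigger question was answered yes."""
    return [name for name in COMPLEXITY_MODULES if answers.get(name, False)]


def total_criteria(answers: dict[str, bool], domain_criteria: int = 0) -> int:
    """Base Standard + activated complexity modules + domain module criteria."""
    complexity = sum(COMPLEXITY_MODULES[m] for m in activated_modules(answers))
    return BASE_CRITERIA + complexity + domain_criteria


# A local library answers "no" to every trigger question.
print(total_criteria({}))  # 15

# Hypothetical hosted API with writes and auth, plus a 4-criterion domain module.
hosted_api = {
    "programmatic_interface": True,
    "network_service": True,
    "write_operations": True,
    "authentication": True,
}
print(total_criteria(hosted_api, domain_criteria=4))  # 51
```

Because the triggers are independent, each answer contributes its module's criteria on its own. No combination of answers changes what another module adds.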

Example Evaluations

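Tool | Base | Interface | Network | Write | Auth | CLI | Domain(s) | Total Criteria
SQLite | | | | | | | | 15
lodash | | | | | | | | 15
React | | | | | | | Frameworks | 18
git CLI | | | | | | | | 23
Terraform CLI | | | | | | | Hosting | 30
Neon (DB) | | | | | | | Databases | 51
Stripe | | | | | | | Payments | 51
Supabase | | | | | | | DB + Auth + Hosting | 57
AWS Redshift | | | | | | | Databases | 55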

The simplest tools have the smallest evaluation scope. The most complex tools activate every applicable module.


3. How to Read This Document

Criterion Structure

Each criterion includes:

Field | Meaning
ID | Unique identifier (e.g., B1, AU3, NS5)
Name | Short descriptive name
Description | What this criterion evaluates
Requirement Level | MUST (binary gate), SHOULD (expected), or MAY (aspirational)
Weight | Critical (×2) or Standard (×1): determines point multiplier
Scoring Gradient | 0 (Failing), 1 (Basic), 2 (Good), 3 (Excellent)

Criterion IDs use module-based prefixes: B (Base Standard), PI (Programmatic Interface), NS (Network Service), WO (Write Operations), AU (Authentication), CLI (CLI), PM (Payments), CM (Communications), DB (Databases), HI (Hosting), AP (Auth Providers), FL (Frameworks). IDs are sequential within each module: a criterion's ID indicates which module it belongs to.

Requirement Levels (RFC 2119)

  • MUST: Binary gate. Score ≥1 required for Agent Ready or Agent Native. A tool that scores 0 on any MUST criterion in its activated modules cannot achieve either designation. MUST gates only appear in complexity modules: the Base Standard has no MUST gates.
  • SHOULD: Expected for meaningful agent readiness. Scored and weighted. Low scores reduce tier eligibility.
  • MAY: Aspirational. Demonstrates excellence. Primarily differentiates the highest tier.

Weight Categories

  • Critical (×2): Criteria with the strongest documented impact on agent success rates, including error handling, description quality, and authentication.
  • Standard (×1): All other criteria. Important but with less dramatic measured impact or less universal applicability.

Open-Source vs. Commercial Tool Splits

Where a criterion measures fundamentally different things depending on whether the tool is open-source or commercial, two scoring gradients are provided. An open-source library's sustainability risk is contributor concentration; a commercial SaaS tool's sustainability risk is corporate strategy and sunset policy. Both matter, but they require different evidence.

  • Open-source tools: Use the open-source rubric (listed first).
  • Commercial/closed-source tools: Use the commercial rubric instead.
  • Hybrid tools (open-source core with commercial hosted offering): Evaluate against the open-source rubric for the open-source artifact. If the primary product being evaluated is the hosted service, use the commercial rubric.

Scores are directly comparable: a score of 2 on either rubric indicates the same qualitative level (Good).


4. Scoring System

How Criteria Map to Points

Each criterion is scored on a 0-3 scale:

Score | Label | Meaning
0 | Failing | Does not meet minimum requirements; actively harms agent usability
1 | Basic | Minimum viable implementation; functional but limited
2 | Good | Solid implementation; meaningfully supports agent workflows
3 | Excellent | Exemplary implementation; designed with agents as a first-class consumer

Point Calculation

Criterion Score = Raw Score (0-3) × Weight (1 or 2)

Module Score = Sum of all criterion scores in module
Total Score = Sum of all module scores

Percentage = Total Score / Maximum Possible Score for activated modules

Because modules only activate when relevant, there are no N/A adjustments. The denominator is always the maximum possible score for the specific modules a tool activates.
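The point calculation above can be expressed as a short Python sketch. The function names and the `(raw score, weight)` tuple layout are assumptions chosen for illustration. The weights and the 0-3 scale are the ones defined in this section.

```python
WEIGHTS = {"critical": 2, "standard": 1}
MAX_RAW_SCORE = 3


def max_score(critical: int, standard: int) -> int:
    """Maximum possible score for a set of activated criteria."""
    return MAX_RAW_SCORE * (
        critical * WEIGHTS["critical"] + standard * WEIGHTS["standard"]
    )


def weighted_total(scores: list[tuple[int, str]]) -> int:
    """Sum of raw score x weight over (raw score, weight category) pairs."""
    total = 0
    for raw, weight in scores:
        if not 0 <= raw <= MAX_RAW_SCORE:
            raise ValueError(f"raw score out of range: {raw}")
        total += raw * WEIGHTS[weight]
    return total


def percentage(total: int, maximum: int) -> float:
    """Total score as a percentage of the maximum for the activated modules."""
    return 100 * total / maximum


# Hypothetical three-criterion evaluation: one Critical, two Standard.
sample = [(3, "critical"), (2, "standard"), (0, "standard")]
print(weighted_total(sample))  # 8
print(max_score(critical=1, standard=2))  # 12
print(round(percentage(8, 12), 1))  # 66.7
```

The denominator is computed only from the criteria that were activated, which is why no N/A adjustment appears anywhere in the sketch.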

What is a good score?

A Zaira Score indicates a tool's agent-readiness. Scores fall into three descriptive bands, each of which characterizes a range of agent-usability. These bands describe how a score is interpreted; they are not certification outcomes.

Agent-Ready Base

Describes the score of a tool whose entire surface is covered by the Base Standard: a utility library, a pure-function module, or any tool without an API, network calls, write operations, or credentials. A score in this band indicates that the fundamentals (documentation, lifecycle health, supply chain integrity, installation) meet the standard. No additional evaluation applies because the tool does not expose further surface area.

Indicator | What it looks like
Tool shape | Triggers zero complexity modules
Overall score | ≥60% of Base Standard
Zero scores | No more than 3 SHOULD criteria score 0

Agent Ready

Describes the score of a tool with substantial surface area (APIs, hosted services, write operations, authentication) that is substantively usable by agents for standard workflows across that surface. MUST gates pass, every applicable module is covered, and any remaining gaps are documented in the scorecard.

Indicator | What it looks like
Tool shape | Triggers one or more complexity modules
Overall score | ≥60%
MUST criteria | All score ≥1
Zero scores | No more than 5 MUST or SHOULD criteria score 0

Agent Native

The highest band. Describes the score of a tool that treats agents as first-class consumers across its full surface: no criterion at zero on any MUST or SHOULD, every MUST scoring at least 2, and an overall percentage of 80% or higher. A score in this band satisfies every Agent Ready indicator plus additional thresholds.

Indicator | What it looks like
Tool shape | Triggers one or more complexity modules
Overall score | ≥80%
MUST criteria | All score ≥2
Zero scores | No MUST or SHOULD criterion scores 0
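The three sets of indicators above can be combined into one check. The following Python sketch is illustrative only. The function name, its parameters, and the choice to return `None` when no band's indicators are met are assumptions, not definitions from the standard. MAY criteria are left out of the zero count because the indicators mention only MUST and SHOULD criteria.

```python
from typing import Optional


def classify_band(
    percent: float,
    complexity_modules: int,
    must_scores: list[int],
    should_scores: list[int],
) -> Optional[str]:
    """Return the band name for a score, or None when no band's indicators are met."""
    should_zeros = should_scores.count(0)
    zeros = must_scores.count(0) + should_zeros

    if complexity_modules == 0:
        # Base Standard only: no MUST gates, at most 3 SHOULD criteria at 0.
        if percent >= 60 and should_zeros <= 3:
            return "Agent-Ready Base"
        return None

    # Check the stricter band first, then fall back to Agent Ready.
    if percent >= 80 and zeros == 0 and all(s >= 2 for s in must_scores):
        return "Agent Native"
    if percent >= 60 and zeros <= 5 and all(s >= 1 for s in must_scores):
        return "Agent Ready"
    return None


# Hypothetical tool: 68%, five MUST criteria at 1 or above, two SHOULD zeros.
print(classify_band(68, 5, [1, 2, 1, 3, 2], [0, 0, 2, 3, 2]))  # Agent Ready
```

Checking Agent Native first reflects that it satisfies every Agent Ready indicator plus additional thresholds.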

Anti-Gaming Mechanisms

  1. MUST gates. MUST criteria in activated modules are binary gates: score 0 on any and the tool cannot achieve Agent Ready or Agent Native regardless of total score.
  2. Critical weighting. Criteria with the strongest empirical impact on agent success count double, preventing optimization toward lower-weighted criteria.
  3. Zero-score limits. Agent Ready allows no more than 5 SHOULD/MUST criteria at 0; Agent Native allows none. This prevents broad neglect across any part of the evaluation.
  4. Evidence requirements. Every score requires documented evidence (automated test output, URL, or evaluator rationale). No score without evidence.

Together, these mechanisms ensure that band placement reflects agent-readiness across a tool's full surface area, not selective optimization of lower-weighted criteria.

Scoring Examples

Example 1: Simple library (lodash)

Modules: Base only (15 criteria)

  • 2 Critical × 3 × 2 = 12
  • 13 Standard × 3 × 1 = 39
  • Max: 51

No complexity modules triggered. Scores under this profile are assessed against the Agent-Ready Base band: the Base Standard covers the tool's entire surface, so the fundamentals are the complete evaluation.

Example 2: CLI tool with write operations (Terraform CLI)

Modules: Base (15) + Write Operations (4) + Authentication (4) + CLI (4) = 27 criteria

  • Critical criteria: B1, B10, WO1, AU1 = 4 Critical
  • 4 Critical × 3 × 2 = 24
  • 23 Standard × 3 × 1 = 69
  • Max: 93

MUST gates: AU1. Eligible for Agent Ready or Agent Native band.

Example 3: Full SaaS payment platform (Stripe)

Modules: Base (15) + Programmatic Interface (16) + Network Service (8) + Write Operations (4) + Authentication (4) + Payments (4) = 51 criteria

  • Critical criteria: B1, B10, PI1, PI2, PI3, PI4, PI10, NS1, NS2, WO1, AU1 = 11 Critical
  • 11 Critical × 3 × 2 = 66
  • 40 Standard × 3 × 1 = 120
  • Max: 186

MUST gates: PI1, PI2, PI15, NS1, AU1. Eligible for Agent Ready or Agent Native band.

Example scores:

  • Base: 34/51 (67%)
  • Programmatic Interface: 42/63 (67%)
  • Network Service: 20/30 (67%)
  • Write Operations: 11/15 (73%)
  • Authentication: 11/15 (73%)
  • Payments: 8/12 (67%)

Total: 126/186 = 68%

Band check:

  • ≥60%? Yes → Agent Ready candidate
  • All MUSTs ≥1? (PI1 ✓, PI2 ✓, PI15 ✓, NS1 ✓, AU1 ✓) Yes ✓
  • ≤5 SHOULD/MUST criteria score 0? Yes (2 zeros) ✓
  • Result: Agent Ready

5. Versioning & Governance

The Zaira Standard is public and stable. Revisions are made only when durable shifts in agent-tool interaction warrant them, not in response to short-term trends. Version changes are announced at least 30 days before they take effect, giving implementers and downstream evaluators time to adapt.

Version Scheme

  • v0.9: Current version. Scoring thresholds and weights may be refined based on evaluation data before v1.0.
  • v1.0: Stable release with finalized thresholds.
  • v1.x (Minor): Additive criteria, refined scoring, weight adjustments. Backward compatible.
  • v2.0 (Major): Structural changes (module additions/removals, score band or threshold revisions).

6. Scope & Limitations

The Zaira Standard evaluates agent usability. It does not evaluate:

  • Tool quality or fitness for purpose. Whether the tool is good at what it does.
  • Security guarantees. A tool can meet all Safety criteria and still have undiscovered vulnerabilities. The standard evaluates evidence and practices, not absence of risk.
  • Performance benchmarks. Response time, throughput, and uptime are not evaluated (though health endpoints are).
  • Pricing fairness. Whether pricing is transparent and machine-readable is evaluated, not whether it's competitive.
  • Training data representation. How well current models know a tool is not under the tool's control.
  • Compliance certifications. SOC 2, ISO 27001, HIPAA, etc. are noted when present but not replicated.
  • Runtime enforcement. The standard publishes evaluation data; enforcement is the responsibility of agent runtimes and policy engines.
  • Human developer experience. Criteria are evaluated from the agent's perspective.
  • Discoverability. Whether agents can find a tool is outside the standard's scope.

7. Base Standard

15 criteria that apply to every tool, regardless of type.

The Base Standard evaluates the fundamentals: whether an agent can learn to use a tool, whether it is safe to depend on, and whether it will remain available in the future. Every tool (from a 50-line utility library to a cloud platform) is evaluated against these criteria.

The Base Standard has no MUST gates. Simple tools meet the Base Standard or do not, based on overall score. MUST gates appear in complexity modules, where failure has greater operational impact.


Documentation & Usability

Documentation and installation quality.

Documentation and installation are the entry point. An agent encountering a tool for the first time needs machine-readable, structured, self-contained information to understand capabilities and usage, and a friction-free path to get it installed and working. More documentation is not better. Irrelevant docs actively harm agent performance. Quality, structure, and machine-readability matter more than volume.


B1 Machine-Readable Documentation Formats

Requirement: SHOULD | Weight: Critical (×2)

Whether documentation is available in formats agents can directly consume, not just human-rendered HTML requiring JavaScript.

Score | Description
0. Failing | Interactive-only docs (Swagger UI without downloadable spec); video tutorials; CSS-styled HTML requiring JS rendering
1. Basic | Docs available as static HTML or Markdown; README with usage instructions
2. Good | Comprehensive Markdown docs; llms.txt present; docs available via content negotiation or direct download
3. Excellent | llms.txt + AGENTS.md + content negotiation (Accept: text/markdown) + version-matched bundled docs or MCP documentation server
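As an illustration of the content negotiation signal in the Excellent row, an evaluator could request a documentation page with an `Accept: text/markdown` header and inspect the response's `Content-Type`. The sketch below uses only the Python standard library and makes no network call. The helper names and the URL are placeholders invented for the example.

```python
from urllib.request import Request


def markdown_request(url: str) -> Request:
    """Build a GET request that asks the server for a Markdown representation."""
    return Request(url, headers={"Accept": "text/markdown"})


def served_markdown(content_type: str) -> bool:
    """True when a response Content-Type header indicates Markdown."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/markdown"


# Placeholder URL; nothing is sent until the request is passed to urlopen().
request = markdown_request("https://docs.example.com/quickstart")
print(request.get_header("Accept"))  # text/markdown
print(served_markdown("text/markdown; charset=utf-8"))  # True
print(served_markdown("text/html"))  # False
```

To run the check against a live site, pass the request to `urllib.request.urlopen` and feed the response's `Content-Type` header to `served_markdown`.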

B2 Code Example Coverage & Quality

Requirement: SHOULD | Weight: Standard (×1)

The density, quality, realism, and progressive complexity of code examples.

Score | Description
0. Failing | No code examples; or examples use placeholder data ("string", 123)
1. Basic | At least one example per major feature; examples use realistic data
2. Good | Examples for most features/methods; include both success and error scenarios; copy-pasteable
3. Excellent | Progressive complexity (minimal → common → advanced); 1–5 examples per feature focusing on ambiguous cases; error recovery examples included

B3 Documentation Structure & Self-Containment

Requirement: SHOULD | Weight: Standard (×1)

Whether documentation sections are self-contained (extractable independently), structured for agent consumption, and appropriately chunked.

Score | Description
0. Failing | Documentation is a single monolithic page; no logical sections; requires full-document context to understand any part
1. Basic | Logical section divisions with headings; most sections are readable independently
2. Good | Answer-first format; descriptive headings that function as queries; self-contained sections of 100–200 words; cross-references include inline context
3. Excellent | All sections independently extractable; tables for structured data; critical information in first 30% of each section; optimized for retrieval-augmented generation

B4 Documentation Accuracy & Synchronization

Requirement: SHOULD | Weight: Standard (×1)

Whether documentation accurately reflects actual tool behavior. Accuracy takes precedence over recency. Documentation that has been unchanged for an extended period but accurately describes current tool behavior scores higher than recently updated documentation that contains errors.

Score | Description
0. Failing | Documented behavior contradicts actual tool behavior; or documented methods/functions don't exist; or docs describe a different version than the current release
1. Basic | No known major inaccuracies; documented examples produce expected results
2. Good | Docs-as-code (versioned alongside product); documented examples tested; version-matched (docs specify which version they describe)
3. Excellent | Docs updated with every release; CI/CD blocks deployment without doc updates; automated drift detection

B5 Getting Started Completeness

Requirement: SHOULD | Weight: Standard (×1)

Whether an agent can go from zero to working usage using only the documentation.

Score | Description
0. Failing | No getting-started guide; or guide requires significant external knowledge
1. Basic | Getting-started guide exists; covers basic setup
2. Good | Guide is completable by following docs alone without external knowledge; includes installation, first usage, and expected output
3. Excellent | Agent-specific quickstart or integration guide; includes common pitfalls; minimal viable example under 20 lines of code; first successful usage achievable in <5 minutes

B6 Changelog & Migration Guidance

Requirement: MAY | Weight: Standard (×1)

Whether changes are communicated in structured, parseable formats.

Score | Description
0. Failing | No changelog; or changelog is unstructured prose buried in blog posts
1. Basic | Changelog exists with dated entries
2. Good | Structured and parseable changelog (consistent format); semantic versioning; breaking changes clearly marked
3. Excellent | Changelog available as structured data (JSON, RSS/Atom feed); migration guides for breaking changes; deprecated markers in code or specs

B7 Installation & Configuration Simplicity

Requirement: SHOULD | Weight: Standard (×1)

How easily an agent can install, set up, and start using a tool.

Score | Description
0. Failing | GUI installer required; complex multi-step build process; missing executables with no clear resolution
1. Basic | Package manager install (npm install, pip install); basic documentation
2. Good | Single command install; environment variable configuration; clear error messages on misconfiguration
3. Excellent | Single static binary or zero-dependency install; zero-config usage possible; scaffolding tools for project setup; JSON Schema for configuration validation

Safety Fundamentals

Supply chain integrity and vulnerability disclosure.

Two safety criteria apply to every tool regardless of type: supply chain integrity and vulnerability disclosure. Together they establish the minimum baseline: provenance verifiability and a defined path for reporting security issues.


B8 Supply Chain Integrity

Requirement: SHOULD | Weight: Standard (×1)

Whether publisher identity is verified, releases are signed, and the supply chain is tamper-evident.

Open-source tools: evaluate this section:

Score | Description
0. Failing | Anonymous or unclear maintainer identity; no integrity signals
1. Basic | Verified publisher on npm/PyPI; GitHub org verification; some identity signals present
2. Good | Signed tags or releases; dependency pinning (lock files); verified domain → repo → artifact chain
3. Excellent | Sigstore/cosign signing; SLSA provenance attestation; reproducible builds; SBOM published; complete identity chain (domain → repo → artifact maintainer match)

Trust signal hierarchy: Reproducible builds > SLSA attestation > Code signing > Verified domain > GitHub org verification > npm/PyPI verified publisher > SBOM > security.txt

Commercial/closed-source tools: evaluate this section instead:

Score | Description
0. Failing | No verifiable publisher identity; SDK or agent distributed through unofficial channels; no integrity signals
1. Basic | Verified company domain; SDK published under verified org on npm/PyPI; official distribution channels clearly identified
2. Good | SDKs signed or published with verified provenance; official distribution channels documented; dependency pinning in SDK; checksums for downloadable artifacts
3. Excellent | SDKs with code signing and provenance attestation; SOC 2 Type II or equivalent supply chain controls; SBOM for SDK dependencies; documented build and release security practices

B9 Vulnerability Disclosure & Security Contact

Requirement: SHOULD | Weight: Standard (×1)

Whether a clear, machine-readable path exists for reporting security vulnerabilities.

Score | Description
0. Failing | No clear vulnerability reporting path
1. Basic | Generic security contact exists (email address, contact form)
2. Good | Published vulnerability disclosure policy with clear process and timeline commitments
3. Excellent | security.txt present (RFC 9116) at /.well-known/security.txt; published vulnerability disclosure policy; bug bounty program; clear response commitments (acknowledgment, assessment, and fix timelines)
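A security.txt file is a plain-text list of `Field: value` lines, and RFC 9116 requires the `Contact` and `Expires` fields. The following is a minimal parsing sketch an evaluator might use to confirm those fields are present. The function names and the sample file are hypothetical, and the sketch does not validate signatures or check that the `Expires` date is still in the future.

```python
REQUIRED_FIELDS = {"contact", "expires"}  # fields required by RFC 9116


def parse_security_txt(text: str) -> dict[str, list[str]]:
    """Collect field values by lowercased field name, skipping comments and blanks."""
    fields: dict[str, list[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Field: value" line
        fields.setdefault(name.strip().lower(), []).append(value.strip())
    return fields


def missing_required(fields: dict[str, list[str]]) -> set[str]:
    """Return the required field names that the file does not define."""
    return REQUIRED_FIELDS - set(fields)


# Hypothetical file content; every value is a placeholder.
sample = """\
# Security contact for example.com
Contact: mailto:security@example.com
Expires: 2030-01-01T00:00:00.000Z
Policy: https://example.com/security-policy
"""
parsed = parse_security_txt(sample)
print(parsed["contact"])  # ['mailto:security@example.com']
print(missing_required(parsed))  # set()
```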

Lifecycle Health

Long-term reliability and continuity.

These criteria evaluate long-term viability: whether a tool will continue to work, be maintained, and remain safe to depend on. A tool that scores well at evaluation but is abandoned soon after offers limited long-term reliability for agent dependency.

Open-source and commercial tools have fundamentally different risk profiles. An open-source tool's risk is contributor abandonment; a commercial tool's risk is corporate sunset or acquisition. Where this distinction matters, criteria provide separate rubrics. For open-source tools, most criteria are fully automatable from public data (GitHub API, package registries, OpenSSF Scorecard).


B10 Project Sustainability

Requirement: SHOULD | Weight: Critical (×2)

The likelihood that this tool will continue to be maintained and supported over time.

Open-source tools: evaluate this section:

Score | Description
0. Failing | Bus factor of 1 (single contributor accounts for >50% of contributions); or no commits in 12+ months with open issues
1. Basic | Bus factor of 2–3; some contributor diversity
2. Good | Bus factor of 4–10; multiple active contributors; no single contributor >50% of recent commits
3. Excellent | Bus factor >10; organizational backing; contributor pipeline visible (new contributors joining)

Commercial/closed-source tools: evaluate this section instead:

Score | Description
0. Failing | No visible team or organization; single-person operation with no stated continuity plan
1. Basic | Established company; identifiable team; product actively marketed
2. Good | Company with public funding or revenue signals; dedicated product team; published product roadmap
3. Excellent | Publicly traded or well-funded company; product is a core revenue line (not a side project); published sunset/migration policy; data export API
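For the open-source rubric, the bus factor signal can be estimated from per-contributor commit counts. The sketch below assumes one common definition: the smallest number of contributors whose combined commits exceed 50% of the total. This matches the rubric's description of a bus factor of 1, but the generalization to larger values is an assumption of this example, as are the function names and sample data.

```python
def top_contributor_share(commits: dict[str, int]) -> float:
    """Fraction of all commits made by the single largest contributor."""
    total = sum(commits.values())
    if total == 0:
        raise ValueError("no commits to analyze")
    return max(commits.values()) / total


def bus_factor(commits: dict[str, int], threshold: float = 0.5) -> int:
    """Smallest number of contributors whose commits exceed `threshold` of the total."""
    total = sum(commits.values())
    if total == 0:
        raise ValueError("no commits to analyze")
    covered = 0
    ranked = sorted(commits.values(), reverse=True)
    for count, n in enumerate(ranked, start=1):
        covered += n
        if covered > threshold * total:
            return count
    return len(commits)


# Hypothetical commit counts per contributor.
concentrated = {"alice": 60, "bob": 25, "carol": 15}
balanced = {"a": 30, "b": 30, "c": 20, "d": 20}
print(top_contributor_share(concentrated), bus_factor(concentrated))  # 0.6 1
print(top_contributor_share(balanced), bus_factor(balanced))  # 0.3 2
```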

B11 Maintenance Health

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool demonstrates operational health. Active health is not synonymous with recent commit activity: it is evidence that the tool continues to function and that unresolved issues receive maintainer response. A project with no open issues and no recent commits satisfies this criterion; a project with numerous unanswered issues and no recent commits does not. The distinction is measurable.

Score | Description
0. Failing | Unanswered issues accumulating (>10 open issues with no maintainer response in 90+ days); or unpatched known vulnerabilities >90 days old; or fails to install on current LTS runtimes
1. Basic | Open issues receive some response; no unpatched critical vulnerabilities; installs and runs on current platforms
2. Good | <7 days median issue response when issues exist; dependencies up to date or pinned to non-vulnerable versions; CI passing on current runtimes
3. Excellent | <48 hours issue triage; proactive dependency updates; CI tests against multiple runtime versions; clear triage labels; published response time commitments. OR for mature stable projects: zero unpatched vulnerabilities; CI passing on current LTS runtimes; <7 day response on the last 5 issues filed (whenever they were filed); dependencies pinned to non-vulnerable versions; no open issues older than 180 days without maintainer response
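The median issue response signal in the Good row is straightforward to compute once issue timestamps are available. The sketch below assumes the evaluator has already collected pairs of (opened, first maintainer response) timestamps. The function name and the sample data are hypothetical. The function applies only when issues exist, since a median over an empty list is undefined.

```python
from datetime import datetime
from statistics import median

GOOD_MEDIAN_DAYS = 7  # "<7 days median issue response" from the Good row


def median_response_days(issues: list[tuple[datetime, datetime]]) -> float:
    """Median delay in days between issue creation and first maintainer response."""
    delays = [
        (responded - opened).total_seconds() / 86400
        for opened, responded in issues
    ]
    return median(delays)


# Hypothetical issue timestamps: (opened, first maintainer response).
sample_issues = [
    (datetime(2026, 1, 5), datetime(2026, 1, 6)),    # 1 day
    (datetime(2026, 1, 10), datetime(2026, 1, 13)),  # 3 days
    (datetime(2026, 2, 1), datetime(2026, 2, 11)),   # 10 days
]
days = median_response_days(sample_issues)
print(days, days < GOOD_MEDIAN_DAYS)  # 3.0 True
```

Using the median rather than the mean keeps a single slow response, such as the 10-day case here, from dominating the signal.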

B12 Semver Adherence & Version Stability

Requirement: SHOULD | Weight: Standard (×1)

Whether breaking changes are confined to major versions and the tool follows predictable versioning.

Score | Description
0. Failing | No versioning strategy; breaking changes in minor/patch releases; or perpetually pre-1.0 with breaking changes
1. Basic | Versioned releases exist; some semver adherence
2. Good | Semver-compliant; breaking changes in major versions only; pre-1.0 tools clearly labeled as unstable
3. Excellent | Strict semver; documented API stability guarantees; machine-readable compatibility matrices; LTS versions for production use
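The core check behind this criterion, whether a breaking change shipped outside a major version bump, can be sketched as follows. The helper names are hypothetical, and whether a release contains a breaking change is taken as an input, not detected. The sketch also ignores the special handling that pre-1.0 versions would need.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse MAJOR.MINOR.PATCH, ignoring a leading 'v' and pre-release/build suffixes."""
    core = version.lstrip("v").split("-", 1)[0].split("+", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def bump_type(old: str, new: str) -> str:
    """Classify the change between two versions as major, minor, or patch."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v[0] != old_v[0]:
        return "major"
    if new_v[1] != old_v[1]:
        return "minor"
    return "patch"


def violates_semver(old: str, new: str, has_breaking_change: bool) -> bool:
    """True when a breaking change ships in a minor or patch release."""
    return has_breaking_change and bump_type(old, new) != "major"


print(bump_type("1.4.2", "2.0.0"))  # major
print(violates_semver("1.4.2", "1.5.0", has_breaking_change=True))  # True
print(violates_semver("1.4.2", "2.0.0", has_breaking_change=True))  # False
```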

B13 Governance & Continuity

Requirement: MAY | Weight: Standard (×1)

Whether the tool has governance structures that reduce single-entity risk and provide continuity assurance.

Open-source tools: evaluate this section:

Score | Description
0. Failing | Single individual maintainer with no organizational backing; no succession plan
1. Basic | Multiple maintainers with informal governance; or backed by a single company
2. Good | Open governance model; contributor guidelines; decision-making process documented; multiple organizational contributors
3. Excellent | Foundation governance (CNCF, Apache, Linux Foundation); formal succession planning; multiple organizational contributors with commit rights

Commercial/closed-source tools: evaluate this section instead:

Score | Description
0. Failing | No public information about the company or team; no terms of service addressing continuity
1. Basic | Established company with identifiable leadership; standard terms of service
2. Good | Published data portability/export mechanisms; documented SLA; company financials or funding publicly known
3. Excellent | Publicly traded or independently audited financials; published sunset policy with migration timeline commitments; data escrow or open-source fallback clause; contractual SLA with uptime guarantees

B14 Security Track Record

Requirement: SHOULD | Weight: Standard (×1)

Vulnerability response speed and proactive security practices.

Open-source tools: evaluate this section:

Score | Description
0. Failing | Known unpatched vulnerabilities >90 days old; no security response history; OpenSSF Scorecard <3/10
1. Basic | Vulnerabilities patched within 90 days; some security practices visible; OpenSSF Scorecard 3–5/10
2. Good | Vulnerabilities patched within 30 days; code review enforced; branch protection enabled; OpenSSF Scorecard 5–7/10
3. Excellent | Vulnerabilities patched within 14 days; comprehensive security practices; CI security scanning; OpenSSF Scorecard >7/10; Code-Review check passing

Commercial/closed-source tools: evaluate this section instead:

Score | Description
0. Failing | No public security information; no evidence of security practices; known incidents with no public response
1. Basic | Security contact or security.txt exists; incidents acknowledged publicly; some security practices described on website
2. Good | Published security practices page; SOC 2 Type I or equivalent; vulnerability disclosure policy with timeline commitments; incident post-mortems published
3. Excellent | SOC 2 Type II or ISO 27001 certified; bug bounty program; incident post-mortems with root cause analysis; proactive security advisories; SDK dependencies regularly audited

B15 Terms & Licensing Stability

Requirement: SHOULD | Weight: Standard (×1)

Whether the terms under which the tool is available are stable and free from change risk signals.

Open-source tools: evaluate this section:

Score | Description
0. Failing | No license specified; or non-standard/proprietary license with no stability commitment
1. Basic | OSI-approved license
2. Good | Stable OSI license with no change risk signals (no single-company >80% commits + broad CLA + cloud competition pattern)
3. Excellent | Stable license + none of the known change risk indicators; or irrevocable license grant; foundation-held copyright

Commercial/closed-source tools: evaluate this section instead:

Score | Description
0. Failing | No published terms of service; or terms allow unilateral changes with no notice
1. Basic | Published terms of service; clear commercial licensing terms
2. Good | Pricing commitments of 12+ months; terms require 90+ days notice for material changes; grandfathering policy for existing customers
3. Excellent | Multi-year pricing commitments or published pricing history demonstrating stability; contractual protection against adverse term changes; machine-readable pricing API; published API deprecation policy with 12+ month windows

8. Complexity Modules

Complexity modules add criteria based on how the tool is accessed and what it does. Each module is activated by a yes/no trigger question. A tool may activate zero, one, or many complexity modules: the triggers are independent.

Activating at least one complexity module makes a tool's score eligible for the Agent Ready or Agent Native band (see §4). Tools evaluated on the Base Standard alone are assessed against the Agent-Ready Base band.


8.1 Module: Programmatic Interface

Trigger: Does the tool expose an API (REST, GraphQL, gRPC), SDK, or MCP server?

16 criteria. Evaluates the quality and safety of the agent-facing programmatic interface: descriptions, schemas, outputs, naming, protocol support, and interface-level security.

Tool description quality is identified across 13+ independent sources as the factor with the greatest single impact on agent success rates. Across description quality, schema design, output size, and naming, minimalism correlates with improved agent performance: fewer tools, tighter schemas, smaller outputs, and more precise names each improve measured outcomes. Interface-level security (input sanitization, prompt injection resistance) applies to any programmatic interface, whether read-only or read-write.


PI1 Interface Reference Completeness

Requirement: MUST | Weight: Critical (×2)

Whether the programmatic interface (API endpoints, SDK methods, MCP tools) is documented with sufficient detail for an agent to use without guessing.

Score | Description
0. Failing | No interface documentation; or docs exist but cover <50% of methods/endpoints
1. Basic | Methods/endpoints are documented; >50% have basic descriptions
2. Good | >80% of methods/endpoints have request/response examples; parameter types and constraints documented
3. Excellent | 100% coverage with examples, edge cases, and error scenarios documented per method/endpoint; parameter constraints include formats, ranges, and valid values

PI2 Tool/Endpoint Description Quality

Requirement: MUST | Weight: Critical (×2)

The completeness, specificity, and actionability of descriptions attached to tools, functions, or API endpoints.

Score | Description
0. Failing | Missing descriptions, name restatement ("Gets data"), no parameter descriptions
1. Basic | Basic description of what tool/endpoint does; some parameter documentation
2. Good | Specific descriptions with when-to-use guidance; all parameters described with types and examples; 1–2 usage examples per tool
3. Excellent | When-to-use AND when-NOT-to-use; inline examples with realistic data; enum values listed; return format documented; 1–5 examples focusing on ambiguous cases; edge cases noted

PI3 Tool Count & Surface Area Management

Requirement: SHOULD | Weight: Critical (×2)

The number of tools/endpoints exposed to an agent at once, and whether mechanisms exist to manage surface area.

Score | Description
0. Failing | 50+ undifferentiated tools; each tool = one REST endpoint (API-mirroring pattern); or 31–49 tools with no meaningful grouping or surface area management
1. Basic | ≤30 tools; some logical grouping
2. Good | 5–15 focused tools designed around user outcomes (not API operations); logical grouping
3. Excellent | 5–15 tools + dynamic discovery/deferred loading for larger catalogs; code-mode pattern for complex APIs; semantic search over tool catalog

PI4 Input Schema Design

Requirement: SHOULD | Weight: Critical (×2)

The degree to which input validation prevents common agent errors through strict schemas, constrained formats, and actionable feedback.

Score | Description
0. Failing | No schema validation; loose types; accepts malformed input silently
1. Basic | Basic JSON Schema with types; some required field marking
2. Good | Strict schemas with enums for constrained fields; format examples; ≤3 nesting levels; all properties documented
3. Excellent | Flat top-level primitives; comprehensive enum/default/description; ≤500 tokens per tool schema; strict: true compatible; additionalProperties: false
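As a rough illustration of the score-3 pattern, the sketch below shows a flat input schema and a small structural check. The tool fields and the `strictness_problems` helper are hypothetical and not part of the standard; the check covers only a few of the listed properties (it does not count tokens, for example).

```python
# Hypothetical input schema: flat primitives, enum/default/description,
# and additionalProperties set to false.
INVOICE_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["customer_id", "amount_minor", "currency"],
    "properties": {
        "customer_id": {"type": "string", "description": "Customer identifier"},
        "amount_minor": {"type": "integer", "description": "Amount in the smallest currency unit"},
        "currency": {"type": "string", "enum": ["usd", "eur", "jpy"],
                     "description": "Lowercase ISO 4217 code"},
        "send_email": {"type": "boolean", "default": False,
                       "description": "Email the invoice to the customer"},
    },
}

PRIMITIVES = {"string", "integer", "number", "boolean"}


def strictness_problems(schema: dict) -> list[str]:
    """List the ways a schema departs from the flat, strict pattern."""
    problems = []
    if schema.get("additionalProperties") is not False:
        problems.append("additionalProperties is not false")
    for name, prop in schema.get("properties", {}).items():
        if prop.get("type") not in PRIMITIVES:
            problems.append(f"{name}: not a top-level primitive")
        if not prop.get("description"):
            problems.append(f"{name}: missing description")
    return problems
```

An empty result means none of the checked problems were found; it does not by itself establish a score.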

PI5 Output Quality & Token Efficiency

Requirement: SHOULD | Weight: Standard (×1)

Whether responses contain high-signal, bounded-size data with pagination and format optimization.

Score | Description
0. Failing | Unbounded responses with no size constraints (100K+ tokens possible); opaque UUIDs without context; no pagination
1. Basic | Pagination exists; typical responses <10K tokens
2. Good | Paginated with cursor metadata (has_more, next_cursor); compact summaries; semantic identifiers; filtering parameters
3. Excellent | Token-budgeted responses; outputSchema defined; concise mode available; cursor-based pagination; CSV/TSV option for tabular data; response size bounded by default
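The cursor metadata named above (has_more, next_cursor) can be sketched as follows. This is a minimal illustration, assuming an in-memory list and a plain offset as the cursor; the `paginate` function, the `data` field name, and the 100-item cap are hypothetical choices, and production services usually issue opaque cursors.

```python
def paginate(items: list, cursor: int = 0, limit: int = 50) -> dict:
    """Return one bounded page plus cursor metadata."""
    # Clamp the page size so responses stay bounded even if the caller
    # asks for more (assumed cap of 100 items).
    limit = max(1, min(limit, 100))
    page = items[cursor:cursor + limit]
    next_pos = cursor + len(page)
    has_more = next_pos < len(items)
    return {
        "data": page,
        "has_more": has_more,
        # No cursor is returned once the collection is exhausted.
        "next_cursor": next_pos if has_more else None,
    }
```

An agent can loop on `has_more` and pass `next_cursor` back unchanged, without knowing the collection size in advance.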

PI6 Response Envelope Consistency

Requirement: SHOULD | Weight: Standard (×1)

Whether all API endpoints return responses in the same structural shape.

Score | Description
0. Failing | Variable response shapes across endpoints; inconsistent naming; fields omitted when null; type instability (field is sometimes string, sometimes array)
1. Basic | Mostly consistent; some endpoints deviate; same general pattern
2. Good | Consistent envelope structure; consistent naming convention (snake_case or camelCase, not mixed); null fields included as null (scalars) or [] (collections)
3. Excellent | Identical envelope everywhere; automated linting enforces consistency; type stability guaranteed; published response schema

PI7 Naming & Namespacing

Requirement: SHOULD | Weight: Standard (×1)

The predictability, distinctiveness, and collision-resistance of tool, function, or endpoint names.

Score | Description
0. Failing | Generic names like search, get_data, doThing; inconsistent casing; no namespace prefix
1. Basic | Descriptive names; consistent casing (snake_case preferred); no service prefix
2. Good | Service-prefixed snake_case (e.g., stripe_create_charge, github_list_issues)
3. Excellent | Service-prefixed + self-descriptive + unique within multi-server ecosystem + predictable pattern across all tools
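A service-prefixed snake_case convention is easy to check mechanically. The sketch below is one possible lint rule; the regular expression and the `is_service_prefixed` helper are assumptions for illustration, not a check defined by the standard.

```python
import re

# Lowercase snake_case with at least two segments, e.g. stripe_create_charge.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")


def is_service_prefixed(name: str, service: str) -> bool:
    """True if the name is snake_case and starts with '<service>_'."""
    return bool(SNAKE_CASE.match(name)) and name.startswith(service + "_")
```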

PI8 Behavioral Metadata & Annotations

Requirement: SHOULD | Weight: Standard (×1)

Machine-readable metadata declaring whether a tool is read-only, destructive, idempotent, and whether it interacts with external entities.

Score | Description
0. Failing | No behavioral annotations; all tools appear equivalent
1. Basic | readOnlyHint set on obvious read-only tools
2. Good | All four MCP annotations set accurately (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) on every tool
3. Excellent | Full annotations + output annotations (audience, priority) + risk ratings + HTTP method semantics matching behavior (GET = read-only, DELETE = destructive)
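Whether all four annotations are present on every tool can be verified with a few lines of code. The sketch below assumes tool definitions are available as dictionaries with an `annotations` mapping; the tool names and the `missing_hints` helper are hypothetical. It checks presence only, since accuracy of the declared values still requires inspecting the tool's behavior.

```python
REQUIRED_HINTS = ("readOnlyHint", "destructiveHint", "idempotentHint", "openWorldHint")

# Hypothetical tool definitions for illustration.
TOOLS = [
    {"name": "acme_list_invoices",
     "annotations": {"readOnlyHint": True, "destructiveHint": False,
                     "idempotentHint": True, "openWorldHint": False}},
    {"name": "acme_delete_invoice",
     "annotations": {"readOnlyHint": False, "destructiveHint": True}},
]


def missing_hints(tool: dict) -> list[str]:
    """Return the annotation names that are absent or not booleans."""
    annotations = tool.get("annotations", {})
    return [h for h in REQUIRED_HINTS if not isinstance(annotations.get(h), bool)]
```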

PI9 MCP Implementation Quality

Requirement: MAY | Weight: Standard (×1)

When an MCP server exists (official or community), the quality of that implementation.

Score | Description
0. Failing | No MCP server exists (official or community); OR MCP server exists but has critical quality issues: 50+ undifferentiated tools, no descriptions, command injection vulnerabilities, no error handling
1. Basic | Tools have descriptions; auth documented; read and write operations present
2. Good | 5–20 focused tools designed around outcomes (not API mirroring); all four MCP annotations set accurately; read-only and read-write tools clearly separated; documented auth
3. Excellent | All of above + deferred loading / dynamic discovery for large catalogs; safety-tiered tools (read/write/destructive separated); read-only mode available; output annotations; supported clients documented; maintained alongside product releases

Always evaluated: PI9 is always included in the evaluation for tools that trigger the Programmatic Interface module (it is never excluded from the denominator). Tools without an MCP server score 0 on PI9. Because PI9 is a MAY criterion, a score of 0 has limited impact on reaching the Agent Ready band (where MAY criteria primarily contribute to the overall percentage) but meaningfully affects the Agent Native band (which requires that no SHOULD or MUST criterion score 0, and where MAY criteria still contribute to the ≥80% overall threshold). The practical effect: a tool's score can reach Agent Ready without MCP, but Agent Native demands either an MCP server or enough excellence elsewhere to absorb the PI9 zero. This creates a directional incentive toward MCP adoption without penalizing non-adoption at the Agent Ready band.


PI10 Programmatic Setup / Time to First API Call

Requirement: SHOULD | Weight: Critical (×2)

The amount of tool-specific configuration required after an agent has valid credentials (or no credentials are needed) before it can make a successful API call. This criterion measures post-credential setup quality: how well the tool minimizes friction between credential acquisition and a successful first API call.

Separation of concerns with AU1: Credential acquisition (account creation, key generation, OAuth setup) is evaluated under AU1 (Non-Interactive Authentication Methods). PI10 measures everything after authentication is solved. A tool that requires browser-based account creation is already penalized on AU1; PI10 does not double-count that friction.

Score | Description
0. Failing | >10 minutes post-credential setup; multiple dashboard-only configuration steps required before first API call; tool-specific configuration requires human interaction
1. Basic | 5–10 minutes; 1–2 tool-specific configuration steps (project creation, API enablement, webhook setup)
2. Good | 2–5 minutes; single environment variable or config file; sandbox/test mode available immediately; clear error on misconfiguration
3. Excellent | <2 minutes; zero-config possible for basic usage; test/sandbox works immediately with credentials alone; config validation with actionable errors; programmatic project setup via API

PI11 API Workflow Coverage

Requirement: SHOULD | Weight: Standard (×1)

The percentage of common workflows completable entirely through the API without requiring web dashboard interaction.

Score | Description
0. Failing | Core functionality requires dashboard; API covers <50% of common workflows
1. Basic | Core CRUD operations available via API; some configuration requires dashboard
2. Good | >80% of common workflows completable via API; dashboard-only steps documented
3. Excellent | 100% of functionality available via API; no dashboard-only features for any common workflow

PI12 Versioning & API Stability

Requirement: SHOULD | Weight: Standard (×1)

Whether the API uses explicit versioning with adequate deprecation signals and managed breaking changes.

Score | Description
0. Failing | No versioning strategy; unannounced breaking changes
1. Basic | Version identifier exists (URL path, header, or parameter); some deprecation notices
2. Good | Explicit versioning with documented deprecation policy; deprecated: true in specs; 6+ month deprecation windows
3. Excellent | Semver-adherent; Sunset headers; machine-readable deprecation timeline; previous version maintained for 12+ months after deprecation

PI13 SDK Availability & Quality

Requirement: SHOULD | Weight: Standard (×1)

Whether official SDKs exist in languages agents commonly use, and whether they're well-maintained.

Score | Description
0. Failing | No SDK; raw HTTP only
1. Basic | Official SDK in 1 major language (Python or TypeScript/JavaScript)
2. Good | Official SDKs in 2+ major languages; idiomatic to each; typed interfaces
3. Excellent | SDKs in 4+ languages; type-safe with branded types; auto-generated from OpenAPI spec; maintained in sync with API releases

PI14 Agent Protocol Availability

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool provides high-quality programmatic interfaces for agent interaction. A well-designed REST API, an MCP server, or both are valid paths.

Score | Description
0. Failing | No programmatic interface; GUI/dashboard only; or API exists but is undocumented
1. Basic | REST API with basic documentation; OR community MCP server exists
2. Good | Well-designed REST API with OpenAPI spec and SDKs in 2+ languages; OR official MCP server with documented auth and core operation coverage
3. Excellent | Excellent REST API with comprehensive SDKs AND official MCP server; OR one interface executed at exceptional quality (e.g., Stripe-quality API without MCP, or best-in-class MCP without REST)

PI15 Input Sanitization & Injection Resistance

Requirement: MUST | Weight: Standard (×1)

Whether the tool demonstrates evidence of input sanitization and defense against injection attacks through schema design, security infrastructure, documentation, and architectural patterns. Because PI15 is a MUST gate, a tool that exposes a programmatic interface with no evidence of input sanitization cannot reach the Agent Ready or Agent Native band regardless of total score.

Score | Description
0. Failing | No evidence of input sanitization: no schema validation, no parameterized queries, no security infrastructure, no documentation of input handling practices
1. Basic | Basic input validation evidenced: strict input schemas with type checking (from PI4); parameterized queries documented; or WAF/CDN security infrastructure detected
2. Good | Comprehensive sanitization evidence: strict schemas with additionalProperties: false across all endpoints; parameterized operations documented throughout; security infrastructure present; input validation practices documented
3. Excellent | All of above + allowlist-based input validation documented where feasible; security testing in CI (detected via workflow analysis); defense-in-depth architecture documented; vendor-provided security assessment or third-party audit results available

PI16 Prompt Injection Resistance

Requirement: SHOULD | Weight: Standard (×1)

Tool-level defense-in-depth against prompt injection: strict schemas, output sanitization, injection-resistant designs.

Score | Description
0. Failing | Unsafe patterns present or encouraged; no awareness of injection risks; tool descriptions contain narrative or references to other tools
1. Basic | Strict input schema validation; parameterized operations; minimal description surface
2. Good | Output structured with clear field boundaries (JSON); response size limits; descriptions concise and self-contained; no cross-tool references in descriptions
3. Excellent | Explicit design mitigations documented; policy layer for untrusted content; output validation; structured action metadata; separation of untrusted content from control flow

8.2 Module: Network Service

Trigger: Is the tool a hosted or remote service (SaaS, PaaS, cloud API)?

8 criteria. Evaluates concerns specific to services that run remotely: error handling, rate limits, health endpoints, observability, sandboxing, environment separation, asynchronous operations, and data portability and pricing transparency.


NS1 Error Response Quality & Structure

Requirement: MUST | Weight: Critical (×2)

Whether error responses provide structured, machine-parseable information enabling agents to diagnose problems, determine retryability, and execute recovery actions.

Score | Description
0. Failing | HTML error pages; empty responses; generic "Something went wrong"; silent failures (no error flag set)
1. Basic | JSON errors with human-readable message and machine-readable error code; MCP errors set isError: true
2. Good | RFC 9457 compliant (type, title, status, detail); all validation errors reported simultaneously (not one-at-a-time); field-level identification; doc_url per error type
3. Excellent | All of above + is_retriable boolean + retry_after_seconds + suggested alternative actions + hierarchical error taxonomy (e.g., Stripe: type → code → decline_code) + numbered recovery steps
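To make the higher tiers concrete, the sketch below builds an RFC 9457-style problem body carrying the retry fields named above, and shows how an agent might read them. The `problem` and `retry_delay` helpers, the `errors` extension member, and the example URI are hypothetical; the standard does not prescribe this exact shape.

```python
import json


def problem(status, type_uri, title, detail, *, is_retriable=False,
            retry_after_seconds=None, errors=None) -> str:
    """Serialize a problem-details body (served as application/problem+json)."""
    body = {"type": type_uri, "title": title, "status": status,
            "detail": detail, "is_retriable": is_retriable}
    if retry_after_seconds is not None:
        body["retry_after_seconds"] = retry_after_seconds
    if errors:
        # All field-level validation errors are reported together.
        body["errors"] = errors
    return json.dumps(body)


def retry_delay(raw_body: str):
    """Agent side: seconds to wait before retrying, or None if not retriable."""
    body = json.loads(raw_body)
    if not body.get("is_retriable"):
        return None
    return body.get("retry_after_seconds", 0)
```

With explicit retry fields, the agent's recovery decision becomes a field lookup rather than an inference from message text.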

NS2 Rate Limit Communication

Requirement: SHOULD | Weight: Critical (×2)

Whether rate limits are communicated proactively and include machine-actionable timing signals.

Score | Description
0. Failing | No rate limit headers; no Retry-After on 429 responses; undocumented limits
1. Basic | Retry-After on 429 responses; limits documented somewhere
2. Good | Rate limit headers on every response (X-RateLimit-Remaining, X-RateLimit-Limit, X-RateLimit-Reset); per-key limits; scope declared (per-endpoint vs. global)
3. Excellent | Full header suite on all responses; batch endpoints to reduce call count; resource-aware cost metadata (e.g., operationCost: { credits: 5 }); per-agent rate limits
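The value of these headers is that an agent can compute a wait time instead of guessing. The following sketch shows one way to do that. It assumes Retry-After carries a number of seconds (the HTTP-date form is not handled) and that X-RateLimit-Reset is a Unix timestamp; services differ on the latter, so treat it as an assumption. The `seconds_to_wait` helper is hypothetical.

```python
def seconds_to_wait(status: int, headers: dict, now: float) -> float:
    """Derive a pause from rate limit signals; 0.0 means proceed."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    if status == 429 and "retry-after" in h:
        return float(h["retry-after"])
    # Proactive path: the budget is exhausted, so wait for the window to reset.
    if h.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in h:
        return max(0.0, float(h["x-ratelimit-reset"]) - now)
    return 0.0
```

The second branch only works when headers are sent on every response, which is what separates score 2 from score 1.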

NS3 Health & Status Communication

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool provides structured health endpoints that agents can query to assess availability.

Score | Description
0. Failing | No health endpoint; HTML status pages only
1. Basic | /health endpoint returns JSON with aggregate up/down status
2. Good | Component-level status; Retry-After on 503 responses; maintenance schedule available
3. Excellent | Per-dependency status; degradation warnings in response metadata; application/health+json format (IETF Internet-Draft); planned maintenance pre-signaled

NS4 Audit & Observability

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool logs agent interactions with sufficient detail for forensic analysis and compliance.

Score | Description
0. Failing | No meaningful logs or observability for API/tool interactions
1. Basic | Basic request/response logging; API key identified in logs
2. Good | Audit logs with correlation IDs; sensitive-data redaction; rate limits enforced with logged violations; agent identity distinguished from human in logs
3. Excellent | OpenTelemetry-compatible trace/span IDs; immutable append-only audit logs; delegation chain logging; anomaly detection or alerting; per-action risk tier logging

NS5 Test/Sandbox Environment Support

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool provides sandbox environments, test keys, and safe experimentation modes.

Score | Description
0. Failing | No test mode; no sandbox; all mutations occur in the production environment
1. Basic | Test mode exists but with limited simulation
2. Good | Separate sandbox environment + basic behavioral simulation + API-verifiable mode (test responses indicate test mode)
3. Excellent | Structurally distinct test/live keys (prefixed like sk_test_); separate sandbox URLs; full behavioral simulation; multiple sandboxes; time simulation (Stripe test clocks, Neon database branching)

NS6 Environment Separation

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool architecturally separates development, staging, and production environments.

Score | Description
0. Failing | No environment separation; single set of credentials for all environments; test and production data co-mingled
1. Basic | Separate environments exist but share credentials or configuration
2. Good | Distinct credentials per environment; environment clearly indicated in API responses; preview/staging deployments available
3. Excellent | Environment-specific URLs and credentials; database branching for isolated experimentation; deploy previews via API; environment promotion workflow (dev → staging → prod)

NS7 Asynchronous Operation Support

Requirement: MAY | Weight: Standard (×1)

Whether long-running operations return immediately with a durable handle and provide status mechanisms.

Score | Description
0. Failing | Long-running operations block until complete; timeouts cause retries with potential duplicates
1. Basic | HTTP 202 Accepted pattern with task/job ID on some operations
2. Good | Consistent async pattern across all long-running operations; polling endpoint with status; estimated_seconds in 202 response
3. Excellent | Full async with lifecycle states (working → completed/failed/cancelled); both polling and webhook notification; blocking result endpoint for simple cases; progress reporting
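The 202-plus-polling pattern and its lifecycle states can be sketched with an in-memory job store. The `JobStore` class, its method names, and the response field names other than estimated_seconds are hypothetical; webhooks and progress reporting are omitted.

```python
import uuid

TERMINAL_STATES = {"completed", "failed", "cancelled"}


class JobStore:
    """Minimal in-memory model of an async job lifecycle."""

    def __init__(self):
        self._jobs = {}

    def submit(self, estimated_seconds: int):
        """Accept work immediately and return a durable handle (HTTP 202)."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {"job_id": job_id, "state": "working", "result": None}
        return 202, {"job_id": job_id, "state": "working",
                     "estimated_seconds": estimated_seconds}

    def finish(self, job_id: str, state: str, result=None) -> None:
        """Move a working job into exactly one terminal state."""
        job = self._jobs[job_id]
        if state not in TERMINAL_STATES:
            raise ValueError(f"not a terminal state: {state}")
        if job["state"] in TERMINAL_STATES:
            raise ValueError("job already finished")
        job.update(state=state, result=result)

    def status(self, job_id: str):
        """Polling endpoint: current state, or 404 for an unknown handle."""
        job = self._jobs.get(job_id)
        if job is None:
            return 404, {"detail": "unknown job"}
        return 200, dict(job)
```

Because the handle outlives the request, a timed-out agent can poll instead of resubmitting, which avoids the duplicate-work failure described at score 0.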

NS8 Data Portability & Pricing Transparency

Requirement: MAY | Weight: Standard (×1)

Whether the service provides programmatic access to pricing, usage tracking, and data export. Agents operating autonomously cannot parse marketing pages, "Contact Sales" buttons, or dashboard-only usage tracking: they need structured, machine-readable access to costs, consumption, and data portability.

Score | Description
0. Failing | No data export capability; pricing only on marketing pages; no usage tracking API
1. Basic | Manual data export (dashboard); published pricing page; basic usage visible in dashboard
2. Good | Programmatic data export API; published pricing with clear unit costs; usage tracking API; billing alerts
3. Excellent | Bulk export API with standard formats (CSV, JSON, Parquet); machine-readable pricing API or structured pricing page; real-time usage tracking; spending limit API; cost estimation before provisioning

Relationship to HI1 (Cost Guardrails): HI1 evaluates mechanisms to prevent cost overruns (spending limits, auto-stop, cost caps). NS8 evaluates information availability: can agents determine what something costs, how much has been spent, and whether data can be extracted? A tool can score well on NS8 (transparent pricing, usage API) while scoring poorly on HI1 (no spending limits), or vice versa.


8.3 Module: Write Operations

Trigger: Can the tool create, modify, or delete data or resources?

4 criteria. Evaluates safeguards for irreversible actions: destructive operation safety, dry-run capability, idempotency, and multi-step error handling.


WO1 Destructive Operation Safety

Requirement: SHOULD | Weight: Critical (×2)

Mechanisms that prevent agents from executing irreversible destructive operations without appropriate safeguards.

Score | Description
0. Failing | No guardrails; agent gets full read/write/delete access by default; no confirmation patterns
1. Basic | Database-level or API-level permissions with agent-specific restricted roles; some operations require confirmation
2. Good | Layered defenses: read-only modes + lexical blocklists (DROP, DELETE, TRUNCATE) + human confirmation gates for high-risk operations; soft delete support
3. Excellent | Physical write prevention (read-only replicas); destructive ops excluded from agent-facing interfaces; structural prevention patterns (auth-capture for payments, plan-apply for infra); draft/preview/publish separation

WO2 Dry-Run / Validation Capability

Requirement: MAY | Weight: Standard (×1)

Whether the tool provides mechanisms to validate requests without executing them.

Score | Description
0. Failing | No dry-run or validation capability
1. Basic | Validation endpoint exists for some operations
2. Good | Dry-run parameter or validation endpoint for most mutating operations; returns what would happen without side effects
3. Excellent | Dry-run executes full validation chain (Terraform plan, Kubernetes server-side dry-run); standardized parameter (e.g., validate_only: true per Google AIP-163); diff output showing proposed changes
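A dry-run parameter with diff output can be illustrated on a toy in-memory store. The `update_record` function and its response fields are hypothetical; only the validate_only parameter name comes from the criterion above.

```python
def update_record(store: dict, record_id: str, changes: dict,
                  validate_only: bool = False) -> dict:
    """Apply changes to a record, or only report what would change."""
    if record_id not in store:
        return {"ok": False, "error": "not_found"}
    current = store[record_id]
    # The diff is computed the same way in both modes, so a dry run
    # exercises the same validation path as a real write.
    diff = {key: {"from": current.get(key), "to": value}
            for key, value in changes.items() if current.get(key) != value}
    if not validate_only:
        current.update(changes)
    return {"ok": True, "applied": not validate_only, "diff": diff}
```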

WO3 Idempotency & Safe Retry Support

Requirement: SHOULD | Weight: Standard (×1)

Whether mutating operations accept idempotency keys to prevent duplicate side effects when agents retry failed requests.

Score | Description
0. Failing | No idempotency support; retries cause duplicate side effects
1. Basic | Idempotency-Key accepted on critical mutating operations
2. Good | Idempotency enforced with 24h+ key persistence; concurrent request handling via locking; documented key behavior
3. Excellent | Comprehensive idempotency across all non-idempotent operations; conflict detection (same key, different params → 409); Stripe-model parameter validation
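The conflict-detection behavior (same key, different params → 409) can be sketched as below. This is a simplified in-memory model: the `IdempotencyStore` class is hypothetical, and key expiry and concurrent-request locking, both required at score 2, are deliberately left out.

```python
import hashlib
import json


class IdempotencyStore:
    """Replay cached responses for repeated keys; reject mismatched params."""

    def __init__(self):
        self._seen = {}  # key -> (parameter fingerprint, cached response)

    @staticmethod
    def _fingerprint(params: dict) -> str:
        canonical = json.dumps(params, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def execute(self, key: str, params: dict, operation):
        fingerprint = self._fingerprint(params)
        if key in self._seen:
            stored_fingerprint, response = self._seen[key]
            if stored_fingerprint != fingerprint:
                return 409, {"detail": "idempotency key reused with different parameters"}
            return 200, response  # replay: the operation is not run again
        response = operation(params)
        self._seen[key] = (fingerprint, response)
        return 200, response
```

A retried request with the same key returns the original response without repeating the side effect.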

WO4 Workflow Error Communication

Requirement: SHOULD | Weight: Standard (×1)

Whether multi-step operations communicate progress, partial success, and resumability.

Score | Description
0. Failing | No step-level feedback; atomic success-or-fail with no intermediate state visibility
1. Basic | Failed step identified in error response; no resume capability
2. Good | Completed/failed/pending step enumeration; resume tokens or checkpoint IDs; severity indication (reversible vs. irreversible failure)
3. Excellent | Full checkpoint-based recovery; draft/preview/publish separation; compensating transactions for partial failures; 202 Accepted + polling for multi-step workflows

8.4 Module: Authentication

Trigger: Does the tool require credentials, API keys, OAuth, or any form of authentication?

4 criteria. Authentication is widely cited as the most persistent unresolved problem in agent-tool interaction. It functions as a binary gate: a tool that satisfies every other criterion but cannot be authenticated by an agent provides no agent utility.


AU1 Non-Interactive Authentication Methods

Requirement: MUST | Weight: Critical (×2)

Whether the tool supports at least one authentication method that agents can complete without human interaction.

Score | Description
0. Failing | Only browser-based OAuth requiring human interaction; CAPTCHA-gated; 2FA with no bypass for service accounts
1. Basic | API keys available; basic documentation for key usage
2. Good | API keys + Client Credentials grant + M2M documentation + Device Flow for delegated access
3. Excellent | Multiple non-interactive methods + brokered credentials + programmatic key creation/rotation via API

AU1 is a MUST gate. If a tool requires authentication, support for at least one non-interactive authentication method is required. A score of 0 on AU1 disqualifies the tool from the Agent Ready or Agent Native band regardless of overall score.


AU2 Permission Granularity

Requirement: SHOULD | Weight: Standard (×1)

How finely the tool allows scoping what an agent can access and do.

Score | Description
0. Failing | Single admin key with full access; no scoping mechanism
1. Basic | Read/write separation available
2. Good | Per-resource scoped keys + fine-grained OAuth scopes + insufficient permissions error includes required scope
3. Excellent | Per-resource per-operation scoping + machine-readable permission manifests + deny-by-default for destructive operations

AU3 Credential Lifecycle Management

Requirement: SHOULD | Weight: Standard (×1)

Whether the tool supports automated credential rotation, refresh, expiry signaling, and per-agent revocation.

Score | Description
0. Failing | Manual rotation only; no programmatic credential management
1. Basic | API for key creation/rotation + refresh tokens
2. Good | Automatic rotation + zero-downtime overlap + per-key revocation + expiry metadata
3. Excellent | Brokered credentials + dual-secret rotation + proactive refresh guidance + per-key audit trail

AU4 Agent Identity Support

Requirement: MAY | Weight: Standard (×1)

Whether the tool treats AI agents as a distinct identity type.

Score | Description
0. Failing | Shared credentials only; no way to distinguish agent from human
1. Basic | Service accounts with some scoping
2. Good | M2M auth with client_credentials + agent-specific rate limits
3. Excellent | Agent as first-class identity type + Token Vault + CIBA + per-action audit trail

8.5 Module: CLI

Trigger: Does the tool have a command-line interface?

4 criteria. Evaluates agent-specific CLI concerns: non-interactive execution, structured output, cross-platform behavior, and configuration safety.


CLI1 Non-Interactive Execution

Requirement: SHOULD | Weight: Standard (×1)

The ability to run a tool without any human interaction: no confirmation prompts, no editor invocations, and no TTY-dependent output.

Score | Description
0. Failing | Tool hangs or crashes without TTY; interactive prompts with no bypass
1. Basic | Some non-interactive flags exist (--yes, --no-input); some prompts remain
2. Good | Non-interactive flags for most prompts; CI mode detection; --json output mode
3. Excellent | Auto-detects non-TTY environment; flags for all interactive points; JSON output implies non-interactive; NO_COLOR=1 support; separate stderr/stdout

CLI2 Structured Output Mode

Requirement: SHOULD | Weight: Standard (×1)

Whether the CLI provides machine-parseable output alongside human-readable output. Agents consuming CLI output require structured data that can be parsed reliably. Without structured output, agents fall back to parsing formatted text through regular expressions, which is brittle across tool versions and locales.

Score | Description
0. Failing | Text-only output; no --json or equivalent flag; ANSI colors/formatting in default output with no disable mechanism; exit code 0/non-zero only with no structured error information
1. Basic | --json or --format json flag available for primary commands; basic exit codes (0 = success, non-zero = failure); stderr and stdout may be mixed
2. Good | JSON output available on all major commands; meaningful exit codes with descriptive stderr; stderr and stdout cleanly separated; NO_COLOR=1 or --no-color supported
3. Excellent | Multiple structured formats (JSON + YAML + custom templates); structured output implies non-interactive mode (Terraform pattern: --json implies --input=false); --porcelain stability guarantee across versions (Git pattern); semantic exit codes (distinct codes for distinct failure modes); consistent JSON schema across CLI versions
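The rule that structured output implies non-interactive mode can be sketched with Python's argparse. The `acme` program name and the helper functions are hypothetical, and the TTY state is passed in as an argument (a real CLI would typically read it from `sys.stdin.isatty()`), which keeps the logic testable.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="acme")
    parser.add_argument("--json", action="store_true", help="emit machine-readable JSON")
    parser.add_argument("--yes", action="store_true", help="skip confirmation prompts")
    return parser


def is_interactive(args: argparse.Namespace, stdin_isatty: bool) -> bool:
    """Prompt only when a human is plausibly present.

    A non-TTY stdin, --json, or --yes each disable prompting.
    """
    return stdin_isatty and not args.json and not args.yes


def render(result: dict, as_json: bool) -> str:
    """Produce either stable JSON or a human-readable listing."""
    if as_json:
        return json.dumps(result, sort_keys=True)
    return "\n".join(f"{key}: {value}" for key, value in result.items())
```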

CLI3 Cross-Platform Consistency

Requirement: SHOULD | Weight: Standard (×1)

Whether the CLI behaves identically across Linux, macOS, and Windows. Agents trained primarily on Linux/macOS generate commands that fail silently on Windows: path separators, line endings, shell syntax, and temp directory locations all differ. A tool that works on one platform but behaves differently on another creates unpredictable agent failures.

Score | Description
0. Failing | Single-platform only (e.g., bash-only scripts); hard-coded platform-specific paths (/tmp/, C:\); no Windows support
1. Basic | Available on Linux, macOS, and Windows; but behavior or output may differ across platforms; platform-specific installation instructions
2. Good | Cross-platform binary distribution or container; consistent output format across platforms; path handling works with both / and \; no platform-specific shell syntax required
3. Excellent | CI tests on all three major platforms; byte-identical output across platforms; single static binary or zero-dependency install; devcontainer or Nix support for environment reproducibility; platform-specific differences documented

CLI4 Configuration Format Safety

Requirement: MAY | Weight: Standard (×1)

Whether the tool's configuration format is safe for agent generation. YAML's whitespace sensitivity and implicit type coercion produce subtle, silent failures in agent-generated configuration: single-space indentation errors change data structure without raising syntax errors, and implicit type coercion (the "Norway problem" in which NO becomes false) silently corrupts data. JSON Schema validation materially reduces these failures by enabling agents to validate configuration before applying it.

Score | Description
0. Failing | YAML-only config with no schema validation; no config validation command; implicit type coercion undocumented
1. Basic | Config format documented; basic structure validation (file parses without error); YAML accepted but JSON alternative available
2. Good | JSON or TOML as primary config format; JSON Schema exists for config files; standalone validation command available (validate, check, lint); actionable error messages on misconfiguration
3. Excellent | JSON or TOML primary with published JSON Schema; schema-driven IDE and agent autocompletion; validation runs automatically before any destructive action; error messages include specific fix suggestions; no implicit type coercion; config secrets isolated from main config file

9. Domain Modules

Domain modules add criteria based on a tool's functional domains. A tool may trigger one or more domain modules: for example, Supabase (Databases + Auth Providers + Hosting) or Firebase (Databases + Auth Providers + Communications). Each activated domain module adds its criteria to the tool's evaluation, expanding both the numerator and the denominator, as complexity modules do.


9.1 Module: Payments & Financial

Applies to: Payment processors, billing platforms, financial APIs

ID | Criterion | Req | Weight
PM1 | Idempotency Depth | SHOULD | Standard
PM2 | Test Simulation Fidelity | SHOULD | Standard
PM3 | Compliance Automation | MAY | Standard
PM4 | Currency & Amount Safety | SHOULD | Standard

PM1 Idempotency Depth

Requirement: SHOULD | Weight: Standard (×1)

Whether the payment API provides deep idempotency beyond basic key acceptance, including parameter validation, key persistence windows, and concurrent request serialization. Agents retry failed requests frequently; without robust idempotency, retries create duplicate charges.

Score | Description
0. Failing | No idempotency support; retried requests create duplicate charges
1. Basic | Idempotency key header accepted; duplicate requests return cached response
2. Good | Key acceptance + documented persistence window (e.g., 24 hours) + concurrent request serialization
3. Excellent | Parameter validation (same key + different params → error/409), documented key lifetime, concurrent locking, idempotency across all POST/write endpoints

PM2 Test Simulation Fidelity

Requirement: SHOULD | Weight: Standard (×1)

How comprehensively the platform simulates real payment scenarios in test mode, including decline codes, dispute flows, subscription lifecycle, and webhook events. Agents cannot safely learn payment integration on live data.

Score | Description
0. Failing | No test mode; or test mode limited to basic success/fail with no scenario simulation
1. Basic | Test/sandbox environment with key separation; basic test card numbers for success and generic decline
2. Good | Multiple test cards covering specific decline codes and card brands; webhook forwarding/simulation; isolated test data
3. Excellent | 30+ test cards with specific scenarios; Test Clocks API for time-dependent flows (subscriptions, trials); dispute/refund simulation; CLI event triggering and replay

PM3 Compliance Automation

Requirement: MAY | Weight: Standard (×1)

Whether the platform automates regulatory compliance burdens (tax calculation, PCI scope reduction, 3DS/SCA flows) so agents don't need jurisdiction-specific knowledge. An agent creating a payment flow should not need to understand VAT rules for 200 countries.

Score | Description
0. Failing | No compliance automation; agent must manually implement tax calculation, PCI handling, and 3DS flows
1. Basic | Hosted checkout or client-side tokenization reduces PCI scope; basic 3DS support via redirects
2. Good | Built-in tax engine (enable via API); automatic 3DS/SCA handling with machine-readable requires_action status; PCI scope fully eliminated via hosted flows
3. Excellent | Merchant of Record model (platform handles all tax, compliance, remittance); or built-in tax engine covering 200+ markets with threshold monitoring and VAT ID validation

PM4 Currency & Amount Safety

Requirement: SHOULD | Weight: Standard (×1)

Whether the API prevents currency-related agent errors through clear unit documentation, smallest-unit enforcement, zero-decimal currency handling, and validation of ambiguous amounts. Currency math errors are among the highest-impact agent errors in payment systems.

Score | Description
0. Failing | Ambiguous amount units (unclear if cents or dollars); no zero-decimal currency handling; no minimum amount enforcement
1. Basic | Documentation states amounts are in smallest currency unit; minimum charge amount enforced
2. Good | Explicit unit in API responses; zero-decimal currencies (JPY) and three-decimal currencies (BHD) documented; amount validation with clear error messages
3. Excellent | Currency-aware validation rejecting ambiguous amounts; explicit decimal count per currency in API metadata; auth-capture pattern support for human review before charge
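
As a sketch of what currency-aware validation might look like, the snippet below converts a decimal string to smallest-unit integers using a per-currency exponent and rejects amounts with more decimal places than the currency allows. The function name `to_minor_units` and the four-entry exponent table are assumptions for illustration; a real implementation would cover the full ISO 4217 list.

```python
from decimal import Decimal

# Illustrative subset of ISO 4217 minor-unit exponents (assumed table, not exhaustive).
MINOR_UNIT_EXPONENT = {"USD": 2, "EUR": 2, "JPY": 0, "BHD": 3}


def to_minor_units(amount: str, currency: str) -> int:
    """Convert a decimal string such as "10.50" into smallest-unit integers."""
    code = currency.upper()
    if code not in MINOR_UNIT_EXPONENT:
        raise ValueError(f"unknown currency: {currency!r}")
    exponent = MINOR_UNIT_EXPONENT[code]
    value = Decimal(amount)
    if not value.is_finite() or value <= 0:
        raise ValueError(f"amount must be a positive number, got {amount!r}")
    scaled = value.scaleb(exponent)  # shift the decimal point by the exponent
    if scaled != scaled.to_integral_value():
        # Ambiguous amount: more precision than the currency can represent.
        raise ValueError(
            f"{amount} has more decimal places than {code} allows ({exponent})"
        )
    return int(scaled)
```

Under these assumptions, `to_minor_units("10.50", "USD")` yields 1050, `to_minor_units("1000", "JPY")` yields 1000, and `to_minor_units("10.5", "JPY")` raises a `ValueError` rather than rounding silently.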

9.2 Module: Communications

Applies to: Email, SMS, messaging, notification platforms

ID | Criterion | Req | Weight
CM1 | Irreversibility Safeguards | SHOULD | Standard
CM2 | Delivery Verification | SHOULD | Standard
CM3 | Webhook/Event Infrastructure | SHOULD | Standard

CM1 Irreversibility Safeguards

Requirement: SHOULD | Weight: Standard (×1)

Whether the platform provides safety mechanisms to prevent agents from sending irreversible communications without review, including sandbox/test modes, draft-then-send patterns, batch limits, and scheduled send with cancellation. Sent messages cannot be recalled.

Score | Description
0. Failing | No sandbox mode; no batch limits; no draft/preview capability; agent can send unlimited messages immediately
1. Basic | Test/sandbox mode available (messages validated but not delivered); basic rate limiting on outbound sends
2. Good | Sandbox mode + batch send limits (≤1,000 per call) + rate limiting; draft/preview API or scheduled send with cancellation window
3. Excellent | Sandbox mode validating full request format; per-second rate limits as safety brakes; draft-then-send pattern with human approval gate; scheduled send with cancellation; loop prevention circuit breaker

CM2 Delivery Verification

Requirement: SHOULD | Weight: Standard (×1)

Whether the platform provides structured, machine-readable delivery status tracking, including bounce categorization (hard/soft), complaint tracking, and suppression list management. Agents need programmatic feedback to know whether messages were actually delivered.

Score | Description
0. Failing | No delivery status feedback; fire-and-forget sending with no bounce or complaint data
1. Basic | Basic delivery/bounce webhooks; suppression list exists but is not API-accessible
2. Good | Structured delivery receipts (delivered/bounced/complained); bounce categorization (hard/soft); API-accessible suppression lists; unsubscribe handling
3. Excellent | Full event lifecycle (processed → delivered → opened → clicked → unsubscribed → complained); automatic suppression management; per-recipient status tracking; bounce type classification with machine-readable codes

CM3 Webhook/Event Infrastructure

Requirement: SHOULD | Weight: Standard (×1)

Whether the platform supports programmatic webhook configuration, cryptographic signature verification, event replay, and structured event payloads. Agents managing communication workflows need reliable, verifiable event delivery, not dashboard-only webhook setup.

Score | Description
0. Failing | No webhook support; or webhooks require dashboard-only configuration with no signature verification
1. Basic | Webhook URLs configurable via API; events delivered as structured JSON payloads
2. Good | API-managed webhooks + cryptographic signature verification (HMAC or ECDSA); standard event types across the delivery lifecycle
3. Excellent | Full CRUD webhook management via API; signature verification; event replay capability; batched event delivery; per-stream webhook URLs; inbound message processing via webhooks
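
On the receiving side, HMAC signature verification could look roughly like the sketch below. The signing scheme (HMAC-SHA256 over `timestamp.body`, hex-encoded, with a five-minute tolerance) is an assumption chosen for illustration and does not describe any particular provider's format; `sign_payload` and `verify_webhook` are hypothetical names.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, timestamp: str, body: bytes) -> str:
    """Hex HMAC-SHA256 over "<timestamp>.<body>" (assumed scheme)."""
    message = timestamp.encode("utf-8") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, timestamp: str, body: bytes,
                   signature: str, now: int,
                   tolerance_seconds: int = 300) -> bool:
    """Return True only for a fresh event whose signature matches."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Reject stale or future-dated events to limit replay of captured requests.
    if abs(now - sent_at) > tolerance_seconds:
        return False
    expected = sign_payload(secret, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Including the timestamp in the signed message is one common way to bound replay; a platform offering event replay as a feature would re-sign replayed events with a fresh timestamp under this scheme.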

9.3 Module: Databases

Applies to: Databases, data platforms, ORMs

ID | Criterion | Req | Weight
DB1 | Safe Experimentation | SHOULD | Standard
DB2 | Schema Introspection Quality | SHOULD | Standard
DB3 | Query Interface Safety | SHOULD | Standard
DB4 | Connection Management | MAY | Standard

DB1 Safe Experimentation

Requirement: SHOULD | Weight: Standard (×1)

Whether the database provides mechanisms for agents to experiment without risking production data, including branching, read-only replicas, point-in-time recovery, and copy-on-write environments. Agents occasionally issue destructive operations in error. The database must render those operations reversible.

Score | Description
0. Failing | No branching, snapshots, or recovery mechanism; destructive operations are permanent
1. Basic | Point-in-time recovery (PITR) available; manual backup/restore process
2. Good | Read-only replicas available; PITR with reasonable granularity; snapshot/clone capability (minutes to create)
3. Excellent | Instant copy-on-write branching (<1s creation); branch reset to parent state; schema-only and full-data branch modes; PITR with fine granularity

DB2 Schema Introspection Quality

Requirement: SHOULD | Weight: Standard (×1)

Whether the database exposes machine-readable schema metadata, including table/column types, relationships, constraints, and semantic descriptions. Agents generating SQL require accurate schema context, but full schema exports are prohibitively large for direct LLM context injection.

Score | Description
0. Failing | No programmatic schema discovery; agent must guess table structure or rely on documentation alone
1. Basic | Standard schema discovery (e.g., information_schema, SHOW TABLES); table and column names with types exposed
2. Good | Full schema with foreign key/relationship metadata; constraint enumeration; schema accessible via HTTP API (not just SQL)
3. Excellent | Semantic catalog with natural-language column/table descriptions (e.g., COMMENT ON); token-efficient schema representation; schema caching with DDL-change invalidation
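
One way to picture a token-efficient schema representation is a one-line-per-table summary built from the database's own introspection. The sketch below uses SQLite's `PRAGMA table_info` and `PRAGMA foreign_key_list` purely as a convenient stand-in; the output format and the function name `compact_schema` are assumptions, not a format the standard prescribes.

```python
import sqlite3


def compact_schema(conn: sqlite3.Connection) -> str:
    """Summarize each user table as `table(col TYPE flags, ...)` on one line."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    lines = []
    for (table,) in tables:
        # foreign_key_list rows: (id, seq, table, from, to, ...)
        references = {
            row[3]: f"{row[2]}.{row[4]}"
            for row in conn.execute(f'PRAGMA foreign_key_list("{table}")')
        }
        columns = []
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        for _cid, name, col_type, notnull, _default, pk in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            parts = [name, col_type]
            if pk:
                parts.append("pk")
            if notnull:
                parts.append("not null")
            if name in references:
                parts.append(f"-> {references[name]}")
            columns.append(" ".join(parts))
        lines.append(f"{table}({', '.join(columns)})")
    return "\n".join(lines)
```

The summary keeps types, keys, and relationships while dropping DDL syntax, which is the kind of trade-off the Excellent tier rewards.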

DB3 Query Interface Safety

Requirement: SHOULD | Weight: Standard (×1)

Whether the database enforces safe query patterns, including parameterized queries, row-level security, and query validation before execution, to guard against the high error rate of agent-generated SQL. Agent-generated SQL fails at higher rates than human-written SQL in measured studies.

Score | Description
0. Failing | Raw SQL string concatenation accepted; no parameterized query enforcement; no row-level security
1. Basic | Parameterized queries supported; basic SQL injection prevention
2. Good | Parameterized queries enforced by default; row-level security (RLS) available; query explain/validation before execution
3. Excellent | RLS enabled by default on new tables; query validation with cost estimation; read-only query mode for exploration; guardrails against broad DELETE/UPDATE without WHERE clauses
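
A rough sketch of two of these protections, parameterized execution and a guardrail against unfiltered DELETE/UPDATE, is shown below against SQLite. The regex check is deliberately naive and is an assumption for illustration; a real guardrail would parse the statement rather than pattern-match it. `guarded_execute` and `UnsafeStatement` are hypothetical names.

```python
import re
import sqlite3

_WRITE_STATEMENT = re.compile(r"^\s*(delete|update)\b", re.IGNORECASE)
_HAS_WHERE = re.compile(r"\bwhere\b", re.IGNORECASE)


class UnsafeStatement(Exception):
    """Raised when a DELETE or UPDATE would touch every row."""


def guarded_execute(conn: sqlite3.Connection, sql: str, params: tuple = ()):
    """Execute with bound parameters; refuse DELETE/UPDATE lacking WHERE."""
    if _WRITE_STATEMENT.match(sql) and not _HAS_WHERE.search(sql):
        raise UnsafeStatement(f"refusing statement without WHERE: {sql!r}")
    # Values travel as bound parameters, never via string concatenation.
    return conn.execute(sql, params)
```

Because values are bound rather than interpolated, a hostile string such as `x' OR '1'='1` is compared as a literal name and matches nothing.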

DB4 Connection Management

Requirement: MAY | Weight: Standard (×1)

Whether the database provides HTTP/REST access, managed connection pooling, and edge-compatible drivers. Agents running in serverless and edge environments (Vercel Edge, Cloudflare Workers) often cannot open or hold long-lived TCP connections. HTTP-based access is frequently the only viable path.

Score | Description
0. Failing | TCP-only access; no connection pooling; no serverless-compatible drivers
1. Basic | Managed connection pooling available; standard database drivers with connection management
2. Good | HTTP/REST API available alongside TCP; serverless-compatible drivers; connection pooling with scale-to-zero
3. Excellent | Auto-generated HTTP/REST API (e.g., PostgREST); WebSocket support for multi-statement transactions; edge-compatible drivers; scale-to-zero with sub-second cold starts

9.4 Module: Hosting & Infrastructure

Applies to: Cloud platforms, PaaS, serverless, container services

ID | Criterion | Req | Weight
HI1 | Cost Guardrails | SHOULD | Standard
HI2 | Deployment Lifecycle Completeness | SHOULD | Standard
HI3 | Preview/Staging Deployments | MAY | Standard

HI1 Cost Guardrails

Requirement: SHOULD | Weight: Standard (×1)

Whether the platform provides spending limits, auto-stop/scale-down, cost estimation, and usage tracking that agents can use programmatically. Agents do not reliably model the cost impact of provisioning decisions. Without guardrails, they may allocate resources and fail to release them.

Score | Description
0. Failing | No spending limits or cost controls; no usage tracking API; API defaults more permissive than console defaults
1. Basic | Usage-based pricing with basic spending alerts; manual cost controls available via dashboard
2. Good | Spending limits configurable via API; auto-stop for idle resources; usage tracking API; cost alerts with configurable thresholds
3. Excellent | Cost estimation before deployment; per-project spending caps via API; auto-scale-down to zero when idle; real-time cost tracking; API defaults match or are more restrictive than console defaults

HI2 Deployment Lifecycle Completeness

Requirement: SHOULD | Weight: Standard (×1)

Whether the full deployment lifecycle (build, deploy, rollback, scale, log access, and environment management) is available via API, CLI, or MCP. An agent that can deploy but cannot roll back, or that can deploy but cannot access logs, has an operationally unsafe capability gap.

Score | Description
0. Failing | Dashboard-only deployment; no API or CLI for triggering deploys or reading logs
1. Basic | Deploy and status check available via API/CLI; log access available but limited
2. Good | Build, deploy, log access, and environment variable management via API; rollback available (redeploy previous version)
3. Excellent | Full lifecycle via API/CLI/MCP: deploy, rollback, scale, streaming logs, environment management; MCP server covering read and write operations across the lifecycle

HI3 Preview/Staging Deployments

Requirement: MAY | Weight: Standard (×1)

Whether the platform supports creating isolated preview/staging environments via API, including branch deployments, ephemeral environments, and automatic cleanup. Agents deploying directly to production without preview risk failures that are difficult to recover from.

Score | Description
0. Failing | No preview or staging environment support; all deployments go directly to production
1. Basic | Manual staging environment available; preview deployments require dashboard configuration
2. Good | Preview deployments creatable via API; branch-based deployments; rollback to previous deployment via API
3. Excellent | Automatic preview deployment per branch/PR via API; ephemeral environments with automatic cleanup; instant rollback by promoting previous deployment; progressive rollout support (canary/blue-green)

9.5 Module: Auth Providers

Applies to: Identity/authentication platforms (Auth0, Clerk, Firebase Auth, etc.)

ID | Criterion | Req | Weight
AP1 | Agent-as-End-User Support | SHOULD | Standard
AP2 | Social/External Connection API | SHOULD | Standard
AP3 | Token Architecture Transparency | MAY | Standard

AP1 Agent-as-End-User Support

Requirement: SHOULD | Weight: Standard (×1)

Whether the auth platform supports flows where the end user is an agent, not a human with a browser. Standard OAuth redirects, approval interfaces, and email-based verification do not function when the end user is a non-interactive agent. CIBA, Device Flow, Client Credentials, and dedicated agent identity types address this gap.

Score | Description
0. Failing | All auth flows require browser-based interaction (redirects, consent screens); no machine-to-machine support
1. Basic | Client Credentials grant supported for M2M authentication; basic service account support
2. Good | Client Credentials + Device Flow or CIBA for async human approval; token vault or credential delegation for agents acting on behalf of users
3. Excellent | Dedicated agent identity type (not retrofitted service accounts); credential vault with 35+ integrations; async authorization (CIBA) with push notification approval; scoped, time-bounded agent credentials with full audit trail

AP2 Social/External Connection API

Requirement: SHOULD | Weight: Standard (×1)

Whether social/OAuth provider connections, redirect URIs, email templates, and session settings can be configured entirely via API, without requiring dashboard interaction. An agent bootstrapping auth for a new project must be able to complete setup programmatically.

Score | Description
0. Failing | Social connections and auth configuration require dashboard-only setup; no Management API
1. Basic | Core auth settings configurable via API; some provider setup (e.g., social connections) still requires dashboard
2. Good | Social connections configurable via API for most providers; email templates accessible via API; redirect URI management via API
3. Excellent | All configuration API-driven (social providers, email templates, branding, custom domains); 60+ social providers configurable via API; dynamic client registration support

AP3 Token Architecture Transparency

Requirement: MAY | Weight: Standard (×1)

Whether the platform clearly documents delegation chains, token lifetimes, refresh semantics, and trust boundaries, and whether it supports emerging standards for agent-to-app authorization. Opaque token architectures prevent agents from reasoning about their own permissions and capabilities.

Score | Description
0. Failing | Opaque token architecture; no documentation of token lifetimes, refresh semantics, or delegation chains
1. Basic | Token lifetimes and refresh semantics documented; basic scope documentation
2. Good | Delegation chain documentation; Rich Authorization Requests (RAR) support; fine-grained authorization (FGA); credential rotation via API
3. Excellent | Support for agent-to-app protocols (XAA or equivalent); per-action authorization logging; dual-secret rotation without downtime; brokered credentials preventing LLM token exposure; EU AI Act-ready audit trail

9.6 Module: Frameworks & Libraries

Applies to: Web frameworks, ORMs, UI libraries, build tools

ID | Criterion | Req | Weight
FL1 | Type System Quality | SHOULD | Standard
FL2 | Scaffolding & Code Generation | MAY | Standard
FL3 | Configuration Validation | SHOULD | Standard

FL1 Type System Quality

Requirement: SHOULD | Weight: Standard (×1)

Whether the framework provides strong, expressive types (TypeScript types, Python type hints, or equivalent) that constrain agent-generated code at compile time. Type systems provide the tightest feedback loop for agents: milliseconds to detect errors versus seconds or minutes for runtime failures.

Score | Description
0. Failing | No type definitions; untyped JavaScript, untyped Python, or equivalent; agents get no compile-time feedback
1. Basic | Type definitions available (e.g., @types/ package, basic type hints); core API surface typed
2. Good | Comprehensive types across full API surface; generated types from schema (e.g., Prisma, GraphQL codegen); type inference support reducing annotation burden
3. Excellent | Branded/nominal types preventing ID confusion (e.g., BuildingID vs. CustomerID); generated types with schema-first design; types covering edge cases and error states; type-check performance stable as schema grows
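
The branded-type idea can be sketched with Python's `typing.NewType`, which creates distinct static types over the same runtime representation. `load_customer` is a hypothetical function added for illustration. Note that `NewType` is enforced only by static checkers such as mypy or pyright, not at runtime.

```python
from typing import NewType

# Two distinct static types that are both plain strings at runtime.
BuildingID = NewType("BuildingID", str)
CustomerID = NewType("CustomerID", str)


def load_customer(customer_id: CustomerID) -> dict:
    """Hypothetical lookup that only accepts customer identifiers."""
    return {"id": customer_id, "kind": "customer"}


building = BuildingID("b_123")
customer = CustomerID("c_456")

record = load_customer(customer)   # accepted by a static type checker
# load_customer(building)          # a static checker flags this argument as the wrong type
```

The commented-out call is the case the criterion targets: both values are strings, so untyped code would run and fetch the wrong record, while a type checker rejects it before execution.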

FL2 Scaffolding & Code Generation

Requirement: MAY | Weight: Standard (×1)

Whether the framework provides CLI generators, project templates, and code scaffolding that work non-interactively. Agents benefit from scaffolding over hand-constructing project structure, but scaffolding tools that depend on interactive prompts (arrow-key menus, confirmation dialogs) are inaccessible to agents.

Score | Description
0. Failing | No scaffolding tools; or scaffolding requires interactive prompts with no CLI flag bypass
1. Basic | Project scaffolding CLI available; can generate basic project structure with default options via flags
2. Good | Project + component/module generators; templates for common patterns; all prompts bypassable via CLI flags
3. Excellent | Full non-interactive scaffolding with --yes/--defaults flags; generates project-specific configuration (e.g., AGENTS.md, type definitions); template library covering common patterns; generator output is immediately buildable/runnable

FL3 Configuration Validation

Requirement: SHOULD | Weight: Standard (×1)

Whether the framework validates configuration files with actionable error messages and provides validation as a standalone command (not just at runtime). Agents generate configuration frequently and require immediate feedback on misconfiguration; deferred failures at application startup or runtime are operationally costly.

Score | Description
0. Failing | No configuration validation; silent misconfiguration; runtime crashes on bad config with unhelpful errors
1. Basic | Runtime validation with error messages on misconfiguration; configuration file format documented
2. Good | JSON Schema for configuration files enabling editor validation; actionable error messages with suggested fixes; validation runs at startup before executing
3. Excellent | Standalone validation command (lint, check, validate) runnable without starting the application; JSON Schema published for IDE/agent integration; error messages include specific fix suggestions; type-safe configuration with compile-time checking
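
A validation routine that can run without starting the application, and that returns fix suggestions rather than a bare failure, might be sketched as follows. The three-key `SCHEMA` and the function name `validate_config` are assumptions for illustration; a framework would typically derive the schema from its published JSON Schema and expose the routine behind a `validate` or `check` subcommand.

```python
import difflib

# Hypothetical configuration schema: key name -> required Python type.
SCHEMA = {"host": str, "port": int, "debug": bool}


def validate_config(config: dict) -> list:
    """Return a list of actionable error strings; an empty list means valid."""
    errors = []
    for key, value in config.items():
        if key not in SCHEMA:
            # Suggest the closest known key so the caller can self-correct.
            close = difflib.get_close_matches(key, list(SCHEMA), n=1)
            hint = f'; did you mean "{close[0]}"?' if close else ""
            errors.append(f'unknown key "{key}"{hint}')
            continue
        expected = SCHEMA[key]
        # Exact type match, so that True is not accepted where an int is required.
        if type(value) is not expected:
            errors.append(
                f'"{key}" must be {expected.__name__}, '
                f"got {type(value).__name__} ({value!r})"
            )
    for key in SCHEMA:
        if key not in config:
            errors.append(f'missing required key "{key}"')
    return errors
```

Returning every problem in one pass, instead of stopping at the first, lets an agent repair a generated configuration in a single iteration.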

Appendix A: Criteria Quick Reference

Base Standard (15 criteria)

ID | Criterion | Group | Req | Weight
B1 | Machine-Readable Documentation Formats | Documentation & Usability | SHOULD | Critical
B2 | Code Example Coverage & Quality | Documentation & Usability | SHOULD | Standard
B3 | Documentation Structure & Self-Containment | Documentation & Usability | SHOULD | Standard
B4 | Documentation Accuracy & Synchronization | Documentation & Usability | SHOULD | Standard
B5 | Getting Started Completeness | Documentation & Usability | SHOULD | Standard
B6 | Changelog & Migration Guidance | Documentation & Usability | MAY | Standard
B7 | Installation & Configuration Simplicity | Documentation & Usability | SHOULD | Standard
B8 | Supply Chain Integrity | Safety | SHOULD | Standard
B9 | Vulnerability Disclosure & Security Contact | Safety | SHOULD | Standard
B10 | Project Sustainability | Lifecycle | SHOULD | Critical
B11 | Maintenance Health | Lifecycle | SHOULD | Standard
B12 | Semver Adherence & Version Stability | Lifecycle | SHOULD | Standard
B13 | Governance & Continuity | Lifecycle | MAY | Standard
B14 | Security Track Record | Lifecycle | SHOULD | Standard
B15 | Terms & Licensing Stability | Lifecycle | SHOULD | Standard

Module: Programmatic Interface (16 criteria)

ID | Criterion | Req | Weight
PI1 | Interface Reference Completeness | MUST | Critical
PI2 | Tool/Endpoint Description Quality | MUST | Critical
PI3 | Tool Count & Surface Area Management | SHOULD | Critical
PI4 | Input Schema Design | SHOULD | Critical
PI5 | Output Quality & Token Efficiency | SHOULD | Standard
PI6 | Response Envelope Consistency | SHOULD | Standard
PI7 | Naming & Namespacing | SHOULD | Standard
PI8 | Behavioral Metadata & Annotations | SHOULD | Standard
PI9 | MCP Implementation Quality | MAY | Standard
PI10 | Programmatic Setup / TTFC | SHOULD | Critical
PI11 | API Workflow Coverage | SHOULD | Standard
PI12 | Versioning & API Stability | SHOULD | Standard
PI13 | SDK Availability & Quality | SHOULD | Standard
PI14 | Agent Protocol Availability | SHOULD | Standard
PI15 | Input Sanitization & Injection Resistance | MUST | Standard
PI16 | Prompt Injection Resistance | SHOULD | Standard

Module: Network Service (8 criteria)

ID | Criterion | Req | Weight
NS1 | Error Response Quality & Structure | MUST | Critical
NS2 | Rate Limit Communication | SHOULD | Critical
NS3 | Health & Status Communication | SHOULD | Standard
NS4 | Audit & Observability | SHOULD | Standard
NS5 | Test/Sandbox Environment Support | SHOULD | Standard
NS6 | Environment Separation | SHOULD | Standard
NS7 | Asynchronous Operation Support | MAY | Standard
NS8 | Data Portability & Pricing Transparency | MAY | Standard

Module: Write Operations (4 criteria)

ID | Criterion | Req | Weight
WO1 | Destructive Operation Safety | SHOULD | Critical
WO2 | Dry-Run / Validation Capability | MAY | Standard
WO3 | Idempotency & Safe Retry Support | SHOULD | Standard
WO4 | Workflow Error Communication | SHOULD | Standard

Module: Authentication (4 criteria)

ID | Criterion | Req | Weight
AU1 | Non-Interactive Authentication Methods | MUST | Critical
AU2 | Permission Granularity | SHOULD | Standard
AU3 | Credential Lifecycle Management | SHOULD | Standard
AU4 | Agent Identity Support | MAY | Standard

Module: CLI (4 criteria)

ID | Criterion | Req | Weight
CLI1 | Non-Interactive Execution | SHOULD | Standard
CLI2 | Structured Output Mode | SHOULD | Standard
CLI3 | Cross-Platform Consistency | SHOULD | Standard
CLI4 | Configuration Format Safety | MAY | Standard

⚑ = Open-source/commercial split rubric.

Domain Module: Payments & Financial (4 criteria)

ID | Criterion | Req | Weight
PM1 | Idempotency Depth | SHOULD | Standard
PM2 | Test Simulation Fidelity | SHOULD | Standard
PM3 | Compliance Automation | MAY | Standard
PM4 | Currency & Amount Safety | SHOULD | Standard

Domain Module: Communications (3 criteria)

ID | Criterion | Req | Weight
CM1 | Irreversibility Safeguards | SHOULD | Standard
CM2 | Delivery Verification | SHOULD | Standard
CM3 | Webhook/Event Infrastructure | SHOULD | Standard

Domain Module: Databases (4 criteria)

ID | Criterion | Req | Weight
DB1 | Safe Experimentation | SHOULD | Standard
DB2 | Schema Introspection Quality | SHOULD | Standard
DB3 | Query Interface Safety | SHOULD | Standard
DB4 | Connection Management | MAY | Standard

Domain Module: Hosting & Infrastructure (3 criteria)

ID | Criterion | Req | Weight
HI1 | Cost Guardrails | SHOULD | Standard
HI2 | Deployment Lifecycle Completeness | SHOULD | Standard
HI3 | Preview/Staging Deployments | MAY | Standard

Domain Module: Auth Providers (3 criteria)

ID | Criterion | Req | Weight
AP1 | Agent-as-End-User Support | SHOULD | Standard
AP2 | Social/External Connection API | SHOULD | Standard
AP3 | Token Architecture Transparency | MAY | Standard

Domain Module: Frameworks & Libraries (3 criteria)

ID | Criterion | Req | Weight
FL1 | Type System Quality | SHOULD | Standard
FL2 | Scaffolding & Code Generation | MAY | Standard
FL3 | Configuration Validation | SHOULD | Standard

Totals: 15 base + 16 interface + 8 network + 4 write + 4 auth + 4 CLI = 51 base + complexity criteria | 20 domain-specific across 6 modules (4+3+4+3+3+3) | 5 MUST gates (all in complexity modules) | A tool may trigger multiple domain modules
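
The totals can be cross-checked against the tables in this appendix with a few lines of arithmetic. The dictionary names below are illustrative; the counts and MUST criterion IDs are taken directly from the tables above.

```python
# Criterion counts per module, as listed in the Appendix A tables.
BASE_AND_COMPLEXITY = {
    "Base Standard": 15,
    "Programmatic Interface": 16,
    "Network Service": 8,
    "Write Operations": 4,
    "Authentication": 4,
    "CLI": 4,
}
DOMAIN_MODULES = {
    "Payments & Financial": 4,
    "Communications": 3,
    "Databases": 4,
    "Hosting & Infrastructure": 3,
    "Auth Providers": 3,
    "Frameworks & Libraries": 3,
}
# Criteria marked MUST in the tables; none belong to the Base Standard.
MUST_GATES = ["PI1", "PI2", "PI15", "NS1", "AU1"]

base_and_complexity_total = sum(BASE_AND_COMPLEXITY.values())
domain_total = sum(DOMAIN_MODULES.values())

assert base_and_complexity_total == 51
assert domain_total == 20
assert len(DOMAIN_MODULES) == 6
assert len(MUST_GATES) == 5
```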