
## Table of Contents

1. [Evaluation Logic](#1-evaluation-logic)
2. [How the Standard Works](#2-how-the-standard-works)
3. [How to Read This Document](#3-how-to-read-this-document)
4. [Scoring System](#4-scoring-system)
5. [Assessment Methodology](#5-assessment-methodology)
6. [Versioning & Governance](#6-versioning--governance)
7. [Scope & Limitations](#7-scope--limitations)
8. [Base Standard](#8-base-standard)
9. [Complexity Modules](#9-complexity-modules)
    - [9.1 Programmatic Interface](#91-module-programmatic-interface)
    - [9.2 Network Service](#92-module-network-service)
    - [9.3 Write Operations](#93-module-write-operations)
    - [9.4 Authentication](#94-module-authentication)
    - [9.5 CLI](#95-module-cli)
10. [Domain Modules](#10-domain-modules)

---

## 1. Evaluation Logic

The design principles behind the Zaira Standard. These explain why the standard is structured the way it is and guide interpretation of the criteria.

### The standard scales with tool complexity

The simplest tools face the simplest evaluation. A library installed via `npm install` and used locally gets evaluated on the Base Standard alone: documentation, lifecycle health, supply chain integrity, installation simplicity. No authentication criteria. No API surface area management. No rate limit communication. Those criteria don't apply because the tool doesn't have those concerns.

As a tool's complexity increases (it exposes an API, runs as a hosted service, handles destructive operations, requires credentials) complexity modules activate and add criteria proportional to that complexity. A cloud database with auth, writes, and a REST API gets evaluated on the Base Standard plus four complexity modules. The evaluation matches the tool's surface area.

This means every criterion in a tool's evaluation genuinely applies to that tool. There are no N/A markings, no skipped criteria, no denominator adjustments. If a criterion is activated in a tool's evaluation, it is relevant to that evaluation.

### Three tiers of evaluation output

Not all tools require the same depth of evaluation. The standard produces three levels of output based on complexity and score:

- **Agent-Ready Base** for simple tools that pass the Base Standard with no complexity modules triggered.
- **Scorecards** for tools that don't meet their applicable tier thresholds, providing detailed per-criterion and per-module scores.
- **Agent Ready / Agent Native** designations for complex tools that meet the defined score and gate requirements. See §4 for tier definitions and thresholds.

The bar for agent-readiness scales with a tool's actual complexity. A utility library with good docs and healthy maintenance signals is agent-ready at the Base level. A payment processing API with auth, write operations, and a network service has a much larger surface to evaluate.

### Health over activity

The standard measures whether a tool works, not whether it's being actively developed. A stable, feature-complete library with no commits in two years but zero open CVEs, passing CI, and accurate docs is healthier than a tool with weekly commits and a backlog of unanswered issues. Criteria that could penalize inactivity, such as documentation accuracy (B4) and maintenance health (B11), are written to measure empirical health signals (do the docs match reality? are vulnerabilities patched? does it install on current runtimes?) rather than recency signals (when was the last commit?). "Done" is not a state declared through version bumps or announcements; it is a state demonstrated through continued health despite inactivity.

### Discoverability is outside scope

The Zaira Standard evaluates whether a tool is ready for agents to *use*, not whether agents can *find* it. Discoverability (registry presence, search engine indexing, structured metadata) is a separate concern. Conflating "is this tool findable?" with "is this tool agent-ready?" would penalize excellent tools with poor marketing and reward mediocre tools with good SEO.

### Agent capability floor

The Zaira Standard defines a minimum agent capability threshold: the **capability floor**. Agents scoring below the floor are not target consumers of Zaira Standard evaluation results.

The capability floor for Zaira Standard v0.9 is **35% on SWE-Bench Pro V9**, administered by SEAL. This benchmark evaluates base model and sub-agent performance on software engineering tasks with tool usage. It strips wrapper scaffolding and vendor-specific enhancements.

The floor is defined by benchmark score, not by model name. The capability floor is a normative parameter of each standard version, reviewed with each minor revision (see §6) following the standard change process.

### The evaluation itself must be agent-executable

A standard that evaluates whether tools are ready for agents should itself be evaluable by agents. Every Zaira Standard criterion is designed to be scored without human intervention, through a combination of deterministic automated checks and structured rubric evaluation. This is not a convenience feature; it's a scaling requirement. Evaluating 500+ tools with humans in the loop creates a bottleneck that makes the standard impractical.

Humans appear only at defined escalation points: dispute resolution, where adversarial context requires human authority, and edge cases where automated checks flag ambiguity. Any criterion that cannot be evaluated without a human must present an extreme justification for its existence.

---

## 2. How the Standard Works

### Module Activation

Every tool starts with the **Base Standard**: 15 criteria that apply universally. Then, based on what the tool does and how it's accessed, **complexity modules** activate:

| Module | Trigger Question | Criteria Added |
|--------|-----------------|----------------|
| **Programmatic Interface** | Does the tool expose an API, SDK, or MCP server? | 16 |
| **Network Service** | Is the tool a hosted/remote service? | 8 |
| **Write Operations** | Can the tool create, modify, or delete data/resources? | 4 |
| **Authentication** | Does the tool require credentials or tokens? | 4 |
| **CLI** | Does the tool have a command-line interface? | 4 |

After complexity modules, **one or more domain modules** may apply based on the tool's functional domains (Payments, Databases, Communications, etc.). A multi-domain tool like Supabase triggers every domain module that applies to its feature set.

Each trigger question is a simple yes/no. A tool may trigger zero, one, or many complexity modules. The triggers are independent: a tool can be a Network Service without having a CLI, or have Write Operations without requiring Authentication.
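
The activation logic above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the standard: the function names and the `answers` dictionary are hypothetical, and only the criteria counts are taken from the Module Activation table. Domain modules are left out because their criteria counts are not listed in that table.

```python
# Criteria counts copied from the Module Activation table.
BASE_CRITERIA = 15
COMPLEXITY_MODULES = {
    "Programmatic Interface": 16,
    "Network Service": 8,
    "Write Operations": 4,
    "Authentication": 4,
    "CLI": 4,
}


def activated_modules(answers: dict[str, bool]) -> list[str]:
    """Return the complexity modules whose trigger question was answered yes."""
    return [name for name in COMPLEXITY_MODULES if answers.get(name, False)]


def criteria_before_domains(answers: dict[str, bool]) -> int:
    """Count Base plus complexity-module criteria (domain modules excluded)."""
    return BASE_CRITERIA + sum(
        COMPLEXITY_MODULES[name] for name in activated_modules(answers)
    )


# Hypothetical answers for a CLI tool that writes data but needs no credentials.
git_like = {"Write Operations": True, "CLI": True}
print(activated_modules(git_like))        # ['Write Operations', 'CLI']
print(criteria_before_domains(git_like))  # 23
```

Because each trigger is an independent yes/no answer, a missing key is treated as "no", and a tool with no answers at all falls back to the 15 Base criteria.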

### Example Evaluations

| Tool | Base | Interface | Network | Write | Auth | CLI | Domain(s) | Total Criteria |
|------|:----:|:---------:|:-------:|:-----:|:----:|:---:|:---------:|:--------------:|
| SQLite | ✓ | | | | | | | 15 |
| lodash | ✓ | | | | | | | 15 |
| React | ✓ | | | | | | Frameworks | 18 |
| git CLI | ✓ | | | ✓ | | ✓ | | 23 |
| Terraform CLI | ✓ | | | ✓ | ✓ | ✓ | Hosting | 30 |
| Neon (DB) | ✓ | ✓ | ✓ | ✓ | ✓ | | Databases | 51 |
| Stripe | ✓ | ✓ | ✓ | ✓ | ✓ | | Payments | 51 |
| Supabase | ✓ | ✓ | ✓ | ✓ | ✓ | | DB + Auth + Hosting | 57 |
| AWS Redshift | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Databases | 55 |

The simplest tools have the simplest evaluation. The most complex tools get everything.

---

## 3. How to Read This Document

### Criterion Structure

Each criterion includes:

| Field | Meaning |
|-------|---------|
| **ID** | Unique identifier (e.g., `B1`, `AU3`, `NS5`) |
| **Name** | Short descriptive name |
| **Description** | What this criterion evaluates |
| **Requirement Level** | `MUST` (binary gate), `SHOULD` (expected), or `MAY` (aspirational) |
| **Weight** | `Critical` (×2) or `Standard` (×1): determines point multiplier |
| **Scoring Gradient** | 0 (Failing), 1 (Basic), 2 (Good), 3 (Excellent) |

Criterion IDs use module-based prefixes: **B** (Base Standard), **PI** (Programmatic Interface), **NS** (Network Service), **WO** (Write Operations), **AU** (Authentication), **CLI** (CLI), **PM** (Payments), **CM** (Communications), **DB** (Databases), **HI** (Hosting), **AP** (Auth Providers), **FL** (Frameworks). IDs are sequential within each module, and the prefix indicates which module a criterion belongs to.

### Requirement Levels (RFC 2119)

- **MUST:** Binary gate. A score of ≥1 is required for Agent Ready, and ≥2 for Agent Native (see §4). A tool that scores 0 on any MUST criterion in its activated modules cannot achieve either designation. MUST gates appear only in complexity modules; the Base Standard has no MUST gates.
- **SHOULD:** Expected for meaningful agent readiness. Scored and weighted. Low scores reduce tier eligibility.
- **MAY:** Aspirational. Demonstrates excellence. Primarily differentiates the highest tier.

### Weight Categories

- **Critical (×2):** Criteria with the strongest documented impact on agent success rates, including error handling, description quality, and authentication.
- **Standard (×1):** All other criteria. Important but with less dramatic measured impact or less universal applicability.

### Open-Source vs. Commercial Tool Splits

Where a criterion measures fundamentally different things depending on whether the tool is open-source or commercial, two scoring gradients are provided. An open-source library's sustainability risk is contributor concentration; a commercial SaaS tool's sustainability risk is corporate strategy and sunset policy. Both matter, but they require different evidence.

- **Open-source tools:** Use the open-source rubric (listed first).
- **Commercial/closed-source tools:** Use the commercial rubric instead.
- **Hybrid tools** (open-source core with commercial hosted offering): Evaluate against the open-source rubric for the open-source artifact. If the primary product being evaluated is the hosted service, use the commercial rubric.

Scores are directly comparable: a "2" on either rubric means "Good."

---

## 4. Scoring System

### How Criteria Map to Points

Each criterion is scored on a 0-3 scale:

| Score | Label | Meaning |
|-------|-------|---------|
| 0 | Failing | Does not meet minimum requirements; actively harms agent usability |
| 1 | Basic | Minimum viable implementation; functional but limited |
| 2 | Good | Solid implementation; meaningfully supports agent workflows |
| 3 | Excellent | Best-in-class; designed with agents as a first-class user |

### Point Calculation

```
Criterion Score = Raw Score (0-3) × Weight (1 or 2)

Module Score = Sum of all criterion scores in module
Total Score = Sum of all module scores

Percentage = (Total Score / Maximum Possible Score for activated modules) × 100
```

Because modules only activate when relevant, there are no N/A adjustments. The denominator is always the maximum possible score for the specific modules a tool activates.
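
As a minimal sketch of the calculation above, the following Python computes earned and maximum points from a list of raw scores and weights. The function names and the sample scores are hypothetical; the multipliers and the 0-3 range come from this section.

```python
WEIGHTS = {"Critical": 2, "Standard": 1}
MAX_RAW = 3


def criterion_points(raw: int, weight: str) -> int:
    """Raw score (0-3) multiplied by the weight multiplier (1 or 2)."""
    if not 0 <= raw <= MAX_RAW:
        raise ValueError("raw score must be between 0 and 3")
    return raw * WEIGHTS[weight]


def module_score(scores: list[tuple[int, str]]) -> tuple[int, int]:
    """Return (points earned, maximum points) for a list of (raw, weight) pairs."""
    earned = sum(criterion_points(raw, weight) for raw, weight in scores)
    maximum = sum(MAX_RAW * WEIGHTS[weight] for _, weight in scores)
    return earned, maximum


def percentage(earned: int, maximum: int) -> float:
    """Share of the maximum possible score, expressed as a percentage."""
    return 100 * earned / maximum


# Hypothetical Base-only tool: every one of the 15 criteria scores 2.
scores = [(2, "Critical")] * 2 + [(2, "Standard")] * 13
earned, maximum = module_score(scores)
print(earned, maximum)                    # 34 51
print(round(percentage(earned, maximum)))  # 67
```

The denominator is built from the same list of criteria as the numerator, so it always reflects only the modules a tool actually activates.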

### Tier Definitions

The Zaira Standard produces three possible designations based on score and complexity.

#### Agent-Ready Base

For tools where the Base Standard is all that applies (no complexity modules triggered). Agent-Ready Base means the tool's fundamentals (documentation, lifecycle health, supply chain integrity, installation) are solid for agent consumption. The evaluation is complete because the Base Standard fully covers the tool's complexity surface.

| Requirement | Threshold |
|------------|-----------|
| **Applies to** | Tools that trigger zero complexity modules |
| **Overall Score** | ≥60% of Base Standard |
| **Zero Scores** | No more than 3 SHOULD criteria score 0 |

#### Agent Ready

For tools that trigger one or more complexity modules. Agent Ready means the tool is substantively usable by agents for standard workflows across its full complexity surface.

| Requirement | Threshold |
|------------|-----------|
| **Applies to** | Tools that trigger 1+ complexity modules |
| **Overall Score** | ≥60% |
| **MUST Criteria** | All score ≥1 |
| **Zero Scores** | No more than 5 MUST or SHOULD criteria score 0 |

#### Agent Native

The highest designation. Agent Native includes all Agent Ready requirements plus additional thresholds. A tool that meets Agent Native automatically satisfies every Agent Ready requirement.

| Requirement | Threshold |
|------------|-----------|
| **Applies to** | Tools that trigger 1+ complexity modules |
| **Overall Score** | ≥80% |
| **MUST Criteria** | All score ≥2 |
| **Zero Scores** | No MUST or SHOULD criterion scores 0 |
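
The three threshold tables can be read as a single decision procedure. The sketch below is one possible encoding, with a hypothetical function name and parameters; it returns "Scorecard" when no designation is met, following the output levels described in §1.

```python
def designation(earned: int, maximum: int, complexity_modules: int,
                must_scores: list[int], zero_count: int) -> str:
    """Map an evaluation summary to a designation, or "Scorecard" if none is met.

    must_scores: raw scores of the MUST criteria in the activated modules.
    zero_count: number of MUST or SHOULD criteria that scored 0 (MAY excluded).
    """
    def at_least(pct: int) -> bool:
        # Integer comparison avoids floating-point rounding at the threshold.
        return earned * 100 >= pct * maximum

    if complexity_modules == 0:
        # Base-only tools: the Base Standard has no MUST gates.
        if at_least(60) and zero_count <= 3:
            return "Agent-Ready Base"
        return "Scorecard"

    if at_least(80) and all(s >= 2 for s in must_scores) and zero_count == 0:
        return "Agent Native"
    if at_least(60) and all(s >= 1 for s in must_scores) and zero_count <= 5:
        return "Agent Ready"
    return "Scorecard"


# Hypothetical tool with four complexity modules, 126 of 186 points,
# five MUST criteria all scoring at least 1, and two zero scores.
print(designation(126, 186, 4, [1, 2, 2, 1, 3], 2))  # Agent Ready
```

Checking Agent Native before Agent Ready works because every Agent Native threshold is at least as strict as its Agent Ready counterpart.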

### Anti-Gaming Mechanisms

1. **MUST gates.** MUST criteria in activated modules are binary gates: score 0 on any and the tool cannot achieve Agent Ready or Agent Native regardless of total score.
2. **Critical weighting.** Criteria with the strongest empirical impact on agent success count double, preventing gaming through easy wins.
3. **Zero-score limits.** Agent Ready allows no more than 5 SHOULD/MUST criteria at 0; Agent Native allows none. This prevents broad neglect across any part of the evaluation.
4. **Evidence requirements.** Every score requires documented evidence (automated test output, URL, or evaluator rationale). No score without evidence.

These mechanisms work together to ensure that tier designations reflect genuine agent-readiness across a tool's full surface area, not selective optimization of easy criteria.

### Scoring Examples

**Example 1: Simple library (lodash)**

Modules: Base only (15 criteria)
- 2 Critical × 3 × 2 = 12
- 13 Standard × 3 × 1 = 39
- Max: 51

No complexity modules triggered. Evaluated for Agent-Ready Base.

Agent-Ready Base check:
- ≥60%? (need ≥31/51)
- ≤3 SHOULD criteria score 0?
- If both pass → **Agent-Ready Base** designation.

**Example 2: CLI tool with write operations (Terraform CLI)**

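Modules: Base (15) + Write Operations (4) + Authentication (4) + CLI (4) = 27 criteria (the Hosting domain module listed for Terraform CLI under Example Evaluations is not included in this calculation)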
- Critical criteria: B1, B10, WO1, AU1 = 4 Critical
- 4 Critical × 3 × 2 = 24
- 23 Standard × 3 × 1 = 69
- Max: 93

MUST gates: AU1

Tier-eligible.

**Example 3: Full SaaS payment platform (Stripe)**

Modules: Base (15) + Programmatic Interface (16) + Network Service (8) + Write Operations (4) + Authentication (4) + Payments (4) = 51 criteria
- Critical criteria: B1, B10, PI1, PI2, PI3, PI4, PI10, NS1, NS2, WO1, AU1 = 11 Critical
- 11 Critical × 3 × 2 = 66
- 40 Standard × 3 × 1 = 120
- Max: 186

MUST gates: PI1, PI2, PI15, NS1, AU1

Tier-eligible.

Example scores:
- Base: 34/51 (67%)
- Programmatic Interface: 42/63 (67%)
- Network Service: 20/30 (67%)
- Write Operations: 11/15 (73%)
- Authentication: 11/15 (73%)
- Payments: 8/12 (67%)

**Total: 126/186 = 68%**

Tier check:
- ≥60%? Yes (68%) → Agent Ready candidate; below the ≥80% required for Agent Native
- All MUSTs ≥1? (PI1 ✓, PI2 ✓, PI15 ✓, NS1 ✓, AU1 ✓) Yes ✓
- ≤5 SHOULD/MUST criteria score 0? Yes (2 zeros) ✓
- **Result: Agent Ready**

---

## 5. Assessment Methodology

### Evaluation Approach

Every Zaira Standard criterion is designed to be scored without human intervention, through a combination of deterministic automated checks and structured rubric evaluation.

The evaluation combines two layers:

- **Automated checks** verify objective criteria: documentation format presence, API response structure, rate limit headers, security metadata, lifecycle health signals, and similar machine-verifiable properties.
- **Structured rubric evaluation** assesses criteria requiring judgment: documentation quality, error message actionability, description effectiveness, and workflow coverage. Each criterion has a documented rubric specifying exactly what a 0, 1, 2, and 3 looks like.

For Agent Native candidates, a human escalation path exists for edge cases where automated evaluation produces ambiguous results. Human review covers only flagged criteria, not the entire evaluation.

### Evidence Requirements

Every score requires documented evidence:

| Evaluation Method | Evidence Type |
|-----------------|--------------|
| Automated | Test execution log (what was tested, expected result, actual result) |
| Rubric Evaluation | Evaluation output with rubric reference and scoring rationale |
| Human Escalation | Written assessment with specific observations and override rationale |

### Score Tagging

Each criterion's score is tagged with its evaluation method:

- `[A]`: Automated (machine-verified, reproducible)
- `[AI]`: Rubric evaluation (structured, reproducible)
- `[E]`: Human-escalated (reviewed by human, exception path only)
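
One way to make the "no score without evidence" rule mechanical is to reject a score record at construction time. The record shape below is a hypothetical sketch, not a format defined by the standard; only the three tags and the 0-3 range come from this document.

```python
from dataclasses import dataclass

# Tags from the Score Tagging list.
VALID_TAGS = {"A", "AI", "E"}


@dataclass(frozen=True)
class CriterionScore:
    criterion_id: str  # e.g. "B1"
    raw: int           # 0-3
    tag: str           # "A", "AI", or "E"
    evidence: str      # log excerpt, URL, or written rationale

    def __post_init__(self) -> None:
        if not 0 <= self.raw <= 3:
            raise ValueError("raw score must be between 0 and 3")
        if self.tag not in VALID_TAGS:
            raise ValueError(f"unknown evaluation tag: {self.tag!r}")
        if not self.evidence.strip():
            raise ValueError("no score without evidence")

    def label(self) -> str:
        """Render the score with its evaluation-method tag."""
        return f"{self.criterion_id}: {self.raw} [{self.tag}]"


# Hypothetical automated result for B1.
score = CriterionScore("B1", 2, "A", "llms.txt found at site root")
print(score.label())  # B1: 2 [A]
```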

### Evaluated Version Policy

Evaluation is conducted against the latest stable release at the time evaluation begins, unless the tool maintainer designates a specific supported release. Pre-release, beta, and release candidate versions are not eligible for evaluation. If a new stable release ships during evaluation, the evaluator may incorporate changes at their discretion but is not required to restart. Tools with multiple actively supported major versions (e.g., LTS and Current tracks) may be evaluated separately; each version receives its own evaluation record.

### Dispute Process

When a score is disputed:

1. **Submit dispute** with evidence (URL, screenshot, API response, or explanation)
2. **Re-evaluation** of the disputed criteria with the provided context included as additional input
3. **Human review**: a reviewer examines the original score, the re-evaluation result, and the submitted evidence, then makes a final determination
4. **Decision** documented publicly with rationale
5. **Score adjusted** if evidence warrants

---

## 6. Versioning & Governance

The Zaira Standard is public and stable. Revisions are made only when durable shifts in agent-tool interaction warrant them, not in response to short-term trends. Certified tools receive at least 30 days' advance notice before any version change takes effect.

### Version Scheme

- **v0.9:** Current version. Scoring thresholds and weights may be refined based on evaluation data before v1.0.
- **v1.0:** Stable release with finalized thresholds.
- **v1.x (Minor):** Additive criteria, refined scoring, weight adjustments. Backward compatible.
- **v2.0 (Major):** Structural changes (module additions/removals, tier restructuring).

---

## 7. Scope & Limitations

The Zaira Standard evaluates agent usability. It does not evaluate:

- **Tool quality or fitness for purpose.** Whether the tool is good at what it does.
- **Security guarantees.** A tool can meet all Safety criteria and still have undiscovered vulnerabilities. The standard evaluates evidence and practices, not absence of risk.
- **Performance benchmarks.** Response time, throughput, and uptime are not evaluated (though health endpoints are).
- **Pricing fairness.** Whether pricing is transparent and machine-readable is evaluated, not whether it's competitive.
- **Training data representation.** How well current models know a tool is not under the tool's control.
- **Compliance certifications.** SOC 2, ISO 27001, HIPAA, etc. are noted when present but not replicated.
- **Runtime enforcement.** The standard publishes evaluation data; enforcement is the responsibility of agent runtimes and policy engines.
- **Human developer experience.** Criteria are evaluated from the agent's perspective.
- **Discoverability.** Whether agents can find a tool is outside the standard's scope.

---

## 8. Base Standard

*15 criteria that apply to every tool, regardless of type.*

The Base Standard evaluates the fundamentals: whether an agent can learn to use a tool, whether it is safe to depend on, and whether it will remain available in the future. Every tool (from a 50-line utility library to a cloud platform) is evaluated against these criteria.

The Base Standard has no MUST gates. Simple tools either meet the standard or they don't, based on their overall score and the zero-score limit (see §4). MUST gates appear in complexity modules where the stakes of failure are higher.

---

### Documentation & Usability

*Can an agent learn to use this tool and get it running?*

Documentation and installation are the entry point. An agent encountering a tool for the first time needs machine-readable, structured, self-contained information to understand capabilities and usage, and a friction-free path to get it installed and working. More documentation is not better. Irrelevant docs actively harm agent performance. Quality, structure, and machine-readability matter more than volume.

---

#### B1. Machine-Readable Documentation Formats

**Requirement:** SHOULD | **Weight:** Critical (×2)

Whether documentation is available in formats agents can directly consume, not just human-rendered HTML requiring JavaScript.

| Score | Description |
|-------|-------------|
| 0. Failing | Interactive-only docs (Swagger UI without downloadable spec); video tutorials; CSS-styled HTML requiring JS rendering |
| 1. Basic | Docs available as static HTML or Markdown; README with usage instructions |
| 2. Good | Comprehensive Markdown docs; `llms.txt` present; docs available via content negotiation or direct download |
| 3. Excellent | `llms.txt` + `AGENTS.md` + content negotiation (`Accept: text/markdown`) + version-matched bundled docs or MCP documentation server |
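
An automated check for the content-negotiation signal might look like the sketch below. It is illustrative only: the function names and the example URL are hypothetical, and a real check would need to decide how to handle redirects, errors, and servers that ignore the `Accept` header.

```python
from urllib.request import Request, urlopen


def markdown_request(url: str) -> Request:
    """Build a GET request that asks the server for a Markdown representation."""
    return Request(url, headers={"Accept": "text/markdown"})


def serves_markdown(url: str, timeout: float = 10.0) -> bool:
    """Return True if the server answers the negotiated request with Markdown."""
    with urlopen(markdown_request(url), timeout=timeout) as response:
        content_type = response.headers.get("Content-Type", "")
    # Ignore parameters such as "; charset=utf-8" when comparing media types.
    return content_type.split(";")[0].strip().lower() == "text/markdown"


# Building the request does not touch the network; example.com is a placeholder.
request = markdown_request("https://example.com/docs")
print(request.get_header("Accept"))  # text/markdown
```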

---

#### B2. Code Example Coverage & Quality

**Requirement:** SHOULD | **Weight:** Standard (×1)

The density, quality, realism, and progressive complexity of code examples.

| Score | Description |
|-------|-------------|
| 0. Failing | No code examples; or examples use placeholder data (`"string"`, `123`) |
| 1. Basic | At least one example per major feature; examples use realistic data |
| 2. Good | Examples for most features/methods; include both success and error scenarios; copy-pasteable |
| 3. Excellent | Progressive complexity (minimal → common → advanced); 1–5 examples per feature focusing on ambiguous cases; error recovery examples included |

---

#### B3. Documentation Structure & Self-Containment

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether documentation sections are self-contained (extractable independently), structured for agent consumption, and appropriately chunked.

| Score | Description |
|-------|-------------|
| 0. Failing | Documentation is a single monolithic page; no logical sections; requires full-document context to understand any part |
| 1. Basic | Logical section divisions with headings; most sections are readable independently |
| 2. Good | Answer-first format; descriptive headings that function as queries; self-contained sections of 100–200 words; cross-references include inline context |
| 3. Excellent | All sections independently extractable; tables for structured data; critical information in first 30% of each section; optimized for retrieval-augmented generation |

---

#### B4. Documentation Accuracy & Synchronization

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether documentation accurately reflects actual tool behavior. Accuracy matters more than recency: docs that haven't changed in three years but perfectly match the tool's behavior score higher than docs updated last week with errors.

| Score | Description |
|-------|-------------|
| 0. Failing | Documented behavior contradicts actual tool behavior; or documented methods/functions don't exist; or docs describe a different version than the current release |
| 1. Basic | No known major inaccuracies; documented examples produce expected results |
| 2. Good | Docs-as-code (versioned alongside product); documented examples tested; version-matched (docs specify which version they describe) |
| 3. Excellent | Docs updated with every release; CI/CD blocks deployment without doc updates; automated drift detection |

---

#### B5. Getting Started Completeness

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether an agent can go from zero to working usage using only the documentation.

| Score | Description |
|-------|-------------|
| 0. Failing | No getting-started guide; or guide requires significant external knowledge |
| 1. Basic | Getting-started guide exists; covers basic setup |
| 2. Good | Guide is completable by following docs alone without external knowledge; includes installation, first usage, and expected output |
| 3. Excellent | Agent-specific quickstart or integration guide; includes common pitfalls; minimal viable example under 20 lines of code; first successful usage achievable in <5 minutes |

---

#### B6. Changelog & Migration Guidance

**Requirement:** MAY | **Weight:** Standard (×1)

Whether changes are communicated in structured, parseable formats.

| Score | Description |
|-------|-------------|
| 0. Failing | No changelog; or changelog is unstructured prose buried in blog posts |
| 1. Basic | Changelog exists with dated entries |
| 2. Good | Structured and parseable changelog (consistent format); semantic versioning; breaking changes clearly marked |
| 3. Excellent | Changelog available as structured data (JSON, RSS/Atom feed); migration guides for breaking changes; `deprecated` markers in code or specs |

---

#### B7. Installation & Configuration Simplicity

**Requirement:** SHOULD | **Weight:** Standard (×1)

How easily an agent can install, set up, and start using a tool.

| Score | Description |
|-------|-------------|
| 0. Failing | GUI installer required; complex multi-step build process; missing executables with no clear resolution |
| 1. Basic | Package manager install (`npm install`, `pip install`); basic documentation |
| 2. Good | Single command install; environment variable configuration; clear error messages on misconfiguration |
| 3. Excellent | Single static binary or zero-dependency install; zero-config usage possible; scaffolding tools for project setup; JSON Schema for configuration validation |

---

### Safety Fundamentals

*Is this tool safe to depend on?*

Two safety criteria apply to every tool regardless of type: supply chain integrity and vulnerability disclosure. These are the baseline: "can I trust where this came from, and is there a way to report problems?"

---

#### B8. Supply Chain Integrity

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether publisher identity is verified, releases are signed, and the supply chain is tamper-evident.

**Open-source tools: evaluate this section:**

| Score | Description |
|-------|-------------|
| 0. Failing | Anonymous or unclear maintainer identity; no integrity signals |
| 1. Basic | Verified publisher on npm/PyPI; GitHub org verification; some identity signals present |
| 2. Good | Signed tags or releases; dependency pinning (lock files); verified domain → repo → artifact chain |
| 3. Excellent | Sigstore/cosign signing; SLSA provenance attestation; reproducible builds; SBOM published; complete identity chain (domain → repo → artifact maintainer match) |

**Trust signal hierarchy:** Reproducible builds > SLSA attestation > Code signing > Verified domain > GitHub org verification > npm/PyPI verified publisher > SBOM > security.txt

**Commercial/closed-source tools: evaluate this section instead:**

| Score | Description |
|-------|-------------|
| 0. Failing | No verifiable publisher identity; SDK or agent distributed through unofficial channels; no integrity signals |
| 1. Basic | Verified company domain; SDK published under verified org on npm/PyPI; official distribution channels clearly identified |
| 2. Good | SDKs signed or published with verified provenance; official distribution channels documented; dependency pinning in SDK; checksums for downloadable artifacts |
| 3. Excellent | SDKs with code signing and provenance attestation; SOC 2 Type II or equivalent supply chain controls; SBOM for SDK dependencies; documented build and release security practices |

---

#### B9. Vulnerability Disclosure & Security Contact

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether a clear, machine-readable path exists for reporting security vulnerabilities.

| Score | Description |
|-------|-------------|
| 0. Failing | No clear vulnerability reporting path |
| 1. Basic | Generic security contact exists (email address, contact form) |
| 2. Good | Published vulnerability disclosure policy with clear process and timeline commitments |
| 3. Excellent | `security.txt` present (RFC 9116) at `/.well-known/security.txt`; published vulnerability disclosure policy; bug bounty program; clear response commitments (acknowledgment, assessment, and fix timelines) |
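
The `security.txt` part of this criterion lends itself to an automated check. The sketch below parses the file's `Name: value` lines and confirms that the two fields RFC 9116 requires, `Contact` and `Expires`, are present. The function names and the sample file are hypothetical, and the sketch does not verify signatures or check that the `Expires` date is still in the future.

```python
REQUIRED_FIELDS = {"contact", "expires"}  # required by RFC 9116


def parse_security_txt(text: str) -> dict[str, list[str]]:
    """Parse security.txt into lowercase field names mapped to their values."""
    fields: dict[str, list[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not "Name: value".
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, value = line.split(":", 1)
        fields.setdefault(name.strip().lower(), []).append(value.strip())
    return fields


def has_required_fields(text: str) -> bool:
    """Return True if every required field appears at least once."""
    return REQUIRED_FIELDS.issubset(parse_security_txt(text))


# Hypothetical file contents for illustration.
sample = """\
# Security contact for example.com
Contact: mailto:security@example.com
Expires: 2030-01-01T00:00:00.000Z
Policy: https://example.com/security-policy
"""
print(has_required_fields(sample))  # True
```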

---

### Lifecycle Health

*Will this tool be reliable over time?*

These criteria evaluate long-term viability: whether a tool will continue to work, be maintained, and remain safe to depend on. A tool that scores well today but is abandoned next year is a poor recommendation.

Open-source and commercial tools have fundamentally different risk profiles. An open-source tool's risk is contributor abandonment; a commercial tool's risk is corporate sunset or acquisition. Where this distinction matters, criteria provide separate rubrics. For open-source tools, most criteria are fully automatable from public data (GitHub API, package registries, OpenSSF Scorecard).

---

#### B10. Project Sustainability

**Requirement:** SHOULD | **Weight:** Critical (×2)

The likelihood that this tool will continue to be maintained and supported over time.

**Open-source tools: evaluate this section:**

| Score | Description |
|-------|-------------|
| 0. Failing | Bus factor of 1 (single contributor accounts for >50% of contributions); or no commits in 12+ months with open issues |
| 1. Basic | Bus factor of 2–3; some contributor diversity |
| 2. Good | Bus factor of 4–10; multiple active contributors; no single contributor >50% of recent commits |
| 3. Excellent | Bus factor >10; organizational backing; contributor pipeline visible (new contributors joining) |

**Commercial/closed-source tools: evaluate this section instead:**

| Score | Description |
|-------|-------------|
| 0. Failing | No visible team or organization; single-person operation with no stated continuity plan |
| 1. Basic | Established company; identifiable team; product actively marketed |
| 2. Good | Company with public funding or revenue signals; dedicated product team; published product roadmap |
| 3. Excellent | Publicly traded or well-funded company; product is a core revenue line (not a side project); published sunset/migration policy; data export API |
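
For the open-source rubric, the bus factor can be estimated from per-contributor contribution counts. The sketch below assumes a common definition, the smallest number of contributors who together account for more than 50% of contributions, which generalizes the "bus factor of 1" wording in the rubric; the standard itself does not spell out the general formula. The function names and sample data are hypothetical, and the band mapping covers only the bus-factor ranges, not the other conditions in each row.

```python
def bus_factor(contributions: dict[str, int]) -> int:
    """Smallest number of contributors who together exceed 50% of contributions."""
    total = sum(contributions.values())
    if total == 0:
        return 0
    running = 0
    ranked = sorted(contributions.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        if running * 2 > total:
            return rank
    return len(ranked)


def bus_factor_band(factor: int) -> int:
    """Map a bus factor to the score band named in the open-source rubric."""
    if factor <= 1:
        return 0
    if factor <= 3:
        return 1
    if factor <= 10:
        return 2
    return 3


# Hypothetical commit counts.
solo_heavy = {"alice": 60, "bob": 25, "carol": 15}
shared = {"a": 30, "b": 30, "c": 20, "d": 20}
print(bus_factor(solo_heavy), bus_factor_band(bus_factor(solo_heavy)))  # 1 0
print(bus_factor(shared), bus_factor_band(bus_factor(shared)))          # 2 1
```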

---

#### B11. Maintenance Health

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the tool shows signs of health: not activity for its own sake, but evidence that it still works and that someone is home if something breaks. A project with zero open issues and no recent commits is healthy. A project with 50 unanswered issues and no recent commits is abandoned. The distinction is measurable.

| Score | Description |
|-------|-------------|
| 0. Failing | Unanswered issues accumulating (>10 open issues with no maintainer response in 90+ days); or unpatched known vulnerabilities >90 days old; or fails to install on current LTS runtimes |
| 1. Basic | Open issues receive some response; no unpatched critical vulnerabilities; installs and runs on current platforms |
| 2. Good | <7 days median issue response when issues exist; dependencies up to date or pinned to non-vulnerable versions; CI passing on current runtimes |
| 3. Excellent | <48 hours issue triage; proactive dependency updates; CI tests against multiple runtime versions; clear triage labels; published response time commitments. OR for mature stable projects: zero unpatched vulnerabilities; CI passing on current LTS runtimes; <7 day response on the last 5 issues filed (whenever they were filed); dependencies pinned to non-vulnerable versions; no open issues older than 180 days without maintainer response |

---

#### B12. Semver Adherence & Version Stability

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether breaking changes are confined to major versions and the tool follows predictable versioning.

| Score | Description |
|-------|-------------|
| 0. Failing | No versioning strategy; breaking changes in minor/patch releases; or perpetually pre-1.0 with breaking changes |
| 1. Basic | Versioned releases exist; some semver adherence |
| 2. Good | Semver-compliant; breaking changes in major versions only; pre-1.0 tools clearly labeled as unstable |
| 3. Excellent | Strict semver; documented API stability guarantees; machine-readable compatibility matrices; LTS versions for production use |

---

#### B13. Governance & Continuity

**Requirement:** MAY | **Weight:** Standard (×1)

Whether the tool has governance structures that reduce single-entity risk and provide continuity assurance.

**Open-source tools: evaluate this section:**

| Score | Description |
|-------|-------------|
| 0. Failing | Single individual maintainer with no organizational backing; no succession plan |
| 1. Basic | Multiple maintainers with informal governance; or backed by a single company |
| 2. Good | Open governance model; contributor guidelines; decision-making process documented; multiple organizational contributors |
| 3. Excellent | Foundation governance (CNCF, Apache, Linux Foundation); formal succession planning; multiple organizational contributors with commit rights |

**Commercial/closed-source tools: evaluate this section instead:**

| Score | Description |
|-------|-------------|
| 0. Failing | No public information about the company or team; no terms of service addressing continuity |
| 1. Basic | Established company with identifiable leadership; standard terms of service |
| 2. Good | Published data portability/export mechanisms; documented SLA; company financials or funding publicly known |
| 3. Excellent | Publicly traded or independently audited financials; published sunset policy with migration timeline commitments; data escrow or open-source fallback clause; contractual SLA with uptime guarantees |

---

#### B14. Security Track Record

**Requirement:** SHOULD | **Weight:** Standard (×1)

Vulnerability response speed and proactive security practices.

**Open-source tools: evaluate this section:**

| Score | Description |
|-------|-------------|
| 0. Failing | Known unpatched vulnerabilities >90 days old; no security response history; OpenSSF Scorecard <3/10 |
| 1. Basic | Vulnerabilities patched within 90 days; some security practices visible; OpenSSF Scorecard 3–5/10 |
| 2. Good | Vulnerabilities patched within 30 days; code review enforced; branch protection enabled; OpenSSF Scorecard 5–7/10 |
| 3. Excellent | Vulnerabilities patched within 14 days; comprehensive security practices; CI security scanning; OpenSSF Scorecard >7/10; Code-Review check passing |

**Commercial/closed-source tools: evaluate this section instead:**

| Score | Description |
|-------|-------------|
| 0. Failing | No public security information; no evidence of security practices; known incidents with no public response |
| 1. Basic | Security contact or security.txt exists; incidents acknowledged publicly; some security practices described on website |
| 2. Good | Published security practices page; SOC 2 Type I or equivalent; vulnerability disclosure policy with timeline commitments; incident post-mortems published |
| 3. Excellent | SOC 2 Type II or ISO 27001 certified; bug bounty program; incident post-mortems with root cause analysis; proactive security advisories; SDK dependencies regularly audited |

---

#### B15. Terms & Licensing Stability

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the terms under which the tool is available are stable and free from change risk signals.

**Open-source tools: evaluate this section:**

| Score | Description |
|-------|-------------|
| 0. Failing | No license specified; or non-standard/proprietary license with no stability commitment |
| 1. Basic | OSI-approved license |
| 2. Good | Stable OSI license with no change risk signals (no single-company >80% commits + broad CLA + cloud competition pattern) |
| 3. Excellent | Stable license + none of the known change risk indicators; or irrevocable license grant; foundation-held copyright |

**Commercial/closed-source tools: evaluate this section instead:**

| Score | Description |
|-------|-------------|
| 0. Failing | No published terms of service; or terms allow unilateral changes with no notice |
| 1. Basic | Published terms of service; clear commercial licensing terms |
| 2. Good | Pricing commitments of 12+ months; terms require 90+ days notice for material changes; grandfathering policy for existing customers |
| 3. Excellent | Multi-year pricing commitments or published pricing history demonstrating stability; contractual protection against adverse term changes; machine-readable pricing API; published API deprecation policy with 12+ month windows |

---

## 9. Complexity Modules

Complexity modules add criteria based on how the tool is accessed and what it does. Each module is activated by a yes/no trigger question. A tool may activate zero, one, or many complexity modules: the triggers are independent.

Activating at least one complexity module makes the tool eligible for tiered certification (Agent Ready / Agent Native). Tools evaluated on the Base Standard alone receive an evaluation scorecard but not a tier.
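
The activation rule can be sketched as a small lookup. This is an illustrative sketch only: the `ToolProfile` class and its field names are hypothetical, the trigger wording is paraphrased from sections 9.1 to 9.4, and the CLI module (9.5) is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class ToolProfile:
    """Answers to the yes/no trigger questions (field names are hypothetical)."""
    exposes_programmatic_interface: bool = False  # 9.1: API, SDK, or MCP server
    is_network_service: bool = False              # 9.2: hosted or remote service
    has_write_operations: bool = False            # 9.3: creates, modifies, or deletes
    requires_authentication: bool = False         # 9.4: credentials of any kind


def active_modules(profile: ToolProfile) -> list[str]:
    """Return the complexity modules whose trigger was answered 'yes'."""
    triggers = [
        ("Programmatic Interface", profile.exposes_programmatic_interface),
        ("Network Service", profile.is_network_service),
        ("Write Operations", profile.has_write_operations),
        ("Authentication", profile.requires_authentication),
    ]
    # Triggers are independent, so each one is checked on its own.
    return [name for name, fired in triggers if fired]


def tier_eligible(profile: ToolProfile) -> bool:
    """At least one active module makes the tool eligible for a tier."""
    return bool(active_modules(profile))
```

A locally installed library with every trigger answered "no" yields an empty module list and a scorecard without a tier; a cloud database that answers "yes" to all four yields four modules.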

---

### 9.1 Module: Programmatic Interface

**Trigger:** Does the tool expose an API (REST, GraphQL, gRPC), SDK, or MCP server?

*16 criteria. Evaluates the quality and safety of the agent-facing programmatic interface: descriptions, schemas, outputs, naming, protocol support, and interface-level security.*

How the tool presents itself to agents through its programmatic interface. Tool description quality alone was called "the single most critical factor" by 13+ independent sources. Less is consistently more: fewer tools, tighter schemas, smaller outputs, and more precise names all improve agent performance. Interface-level security (input sanitization, prompt injection resistance) applies to any programmatic interface, whether read-only or read-write.

---

#### PI1. Interface Reference Completeness

**Requirement:** MUST | **Weight:** Critical (×2)

Whether the programmatic interface (API endpoints, SDK methods, MCP tools) is documented with sufficient detail for an agent to use without guessing.

> **Dual-interface tools:** For tools with multiple programmatic interfaces (e.g., REST API and MCP server), the MUST gate evaluates the primary interface's documentation coverage. The primary interface is determined automatically during classification based on the tool's documented recommended integration path (typically REST API for tools with both REST and MCP). Poor documentation coverage on a secondary interface is captured in the criterion's overall score and in PI9 (MCP Implementation Quality), but does not independently trigger the MUST gate failure. This prevents penalizing tools for offering an additional interface: a tool should not be disincentivized from publishing an MCP server by the risk that its MCP documentation triggers a MUST gate failure when its REST API documentation is comprehensive.

| Score | Description |
|-------|-------------|
| 0. Failing | No interface documentation; or docs exist but cover <50% of methods/endpoints |
| 1. Basic | Methods/endpoints are documented; ≥50% have basic descriptions |
| 2. Good | >80% of methods/endpoints have request/response examples; parameter types and constraints documented |
| 3. Excellent | 100% coverage with examples, edge cases, and error scenarios documented per method/endpoint; parameter constraints include formats, ranges, and valid values |

---

#### PI2. Tool/Endpoint Description Quality

**Requirement:** MUST | **Weight:** Critical (×2)

The completeness, specificity, and actionability of descriptions attached to tools, functions, or API endpoints.

| Score | Description |
|-------|-------------|
| 0. Failing | Missing descriptions, name restatement ("Gets data"), no parameter descriptions |
| 1. Basic | Basic description of what tool/endpoint does; some parameter documentation |
| 2. Good | Specific descriptions with when-to-use guidance; all parameters described with types and examples; 1–2 usage examples per tool |
| 3. Excellent | When-to-use AND when-NOT-to-use; inline examples with realistic data; enum values listed; return format documented; 1–5 examples focusing on ambiguous cases; edge cases noted |

---

#### PI3. Tool Count & Surface Area Management

**Requirement:** SHOULD | **Weight:** Critical (×2)

The number of tools/endpoints exposed to an agent at once, and whether mechanisms exist to manage surface area.

| Score | Description |
|-------|-------------|
| 0. Failing | 50+ undifferentiated tools; each tool = one REST endpoint (API-mirroring pattern); or 31–49 tools with no meaningful grouping or surface area management |
| 1. Basic | ≤30 tools; some logical grouping |
| 2. Good | 5–15 focused tools designed around user outcomes (not API operations); logical grouping |
| 3. Excellent | 5–15 tools + dynamic discovery/deferred loading for larger catalogs; code-mode pattern for complex APIs; semantic search over tool catalog |

---

#### PI4. Input Schema Design

**Requirement:** SHOULD | **Weight:** Critical (×2)

The degree to which input validation prevents common agent errors through strict schemas, constrained formats, and actionable feedback.

| Score | Description |
|-------|-------------|
| 0. Failing | No schema validation; loose types; accepts malformed input silently |
| 1. Basic | Basic JSON Schema with types; some required field marking |
| 2. Good | Strict schemas with enums for constrained fields; format examples; ≤3 nesting levels; all properties documented |
| 3. Excellent | Flat top-level primitives; comprehensive enum/default/description; ≤500 tokens per tool schema; `strict: true` compatible; `additionalProperties: false` |
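
As an illustration of the higher score levels, the sketch below shows a flat JSON Schema written as a Python dict, plus a small checker for three of the properties named in the table. The schema, the tool it describes, and the helper names are hypothetical; the checker ignores array items and does not measure token counts.

```python
# Hypothetical input schema for an invoice-creation tool: flat primitives,
# enum and default on the constrained field, every property described.
create_invoice_schema = {
    "type": "object",
    "additionalProperties": False,
    "required": ["customer_id", "currency", "amount_cents"],
    "properties": {
        "customer_id": {"type": "string", "description": "Customer ID, e.g. 'cus_123'."},
        "currency": {
            "type": "string",
            "enum": ["usd", "eur", "gbp"],
            "default": "usd",
            "description": "Lowercase ISO 4217 currency code.",
        },
        "amount_cents": {"type": "integer", "minimum": 1, "description": "Amount in cents."},
    },
}


def nesting_depth(schema: dict) -> int:
    """Count levels of nested object schemas (a flat object has depth 1)."""
    if schema.get("type") != "object":
        return 0
    children = schema.get("properties", {}).values()
    return 1 + max((nesting_depth(child) for child in children), default=0)


def schema_issues(schema: dict) -> list[str]:
    """List deviations from the strictness properties checked here."""
    issues = []
    if schema.get("additionalProperties") is not False:
        issues.append("additionalProperties is not false")
    for name, prop in schema.get("properties", {}).items():
        if "description" not in prop:
            issues.append(f"{name}: missing description")
    if nesting_depth(schema) > 3:
        issues.append("more than 3 nesting levels")
    return issues
```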

---

#### PI5. Output Quality & Token Efficiency

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether responses contain high-signal, bounded-size data with pagination and format optimization.

| Score | Description |
|-------|-------------|
| 0. Failing | Full data dumps; unbounded responses (100K+ tokens possible); opaque UUIDs without context; no pagination |
| 1. Basic | Pagination exists; typical responses <10K tokens |
| 2. Good | Paginated with cursor metadata (`has_more`, `next_cursor`); compact summaries; semantic identifiers; filtering parameters |
| 3. Excellent | Token-budgeted responses; `outputSchema` defined; concise mode available; cursor-based pagination; CSV/TSV option for tabular data; response size bounded by default |

---

#### PI6. Response Envelope Consistency

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether all API endpoints return responses in the same structural shape.

| Score | Description |
|-------|-------------|
| 0. Failing | Variable response shapes across endpoints; inconsistent naming; fields omitted when null; type instability (field is sometimes string, sometimes array) |
| 1. Basic | Mostly consistent; some endpoints deviate; same general pattern |
| 2. Good | Consistent envelope structure; consistent naming convention (snake_case or camelCase, not mixed); null fields included as `null` (scalars) or `[]` (collections) |
| 3. Excellent | Identical envelope everywhere; automated linting enforces consistency; type stability guaranteed; published response schema |

---

#### PI7. Naming & Namespacing

**Requirement:** SHOULD | **Weight:** Standard (×1)

The predictability, distinctiveness, and collision-resistance of tool, function, or endpoint names.

| Score | Description |
|-------|-------------|
| 0. Failing | Generic names like `search`, `get_data`, `doThing`; inconsistent casing; no namespace prefix |
| 1. Basic | Descriptive names; consistent casing (snake_case preferred); no service prefix |
| 2. Good | Service-prefixed snake_case (e.g., `stripe_create_charge`, `github_list_issues`) |
| 3. Excellent | Service-prefixed + self-descriptive + unique within multi-server ecosystem + predictable pattern across all tools |
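
The service-prefixed snake_case convention can be checked mechanically. The sketch below is one possible check, not part of the standard; the pattern and function name are assumptions.

```python
import re

# Lowercase snake_case with at least two segments, e.g. "stripe_create_charge".
SNAKE_CASE_MULTI = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")


def is_service_prefixed(name: str, service: str) -> bool:
    """True when the name is snake_case and starts with '<service>_'."""
    return bool(SNAKE_CASE_MULTI.match(name)) and name.startswith(service + "_")
```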

---

#### PI8. Behavioral Metadata & Annotations

**Requirement:** SHOULD | **Weight:** Standard (×1)

Machine-readable metadata declaring whether a tool is read-only, destructive, idempotent, and whether it interacts with external entities.

| Score | Description |
|-------|-------------|
| 0. Failing | No behavioral annotations; all tools appear equivalent |
| 1. Basic | `readOnlyHint` set on obvious read-only tools |
| 2. Good | All four MCP annotations set accurately (`readOnlyHint`, `destructiveHint`, `idempotentHint`, `openWorldHint`) on every tool |
| 3. Excellent | Full annotations + output annotations (audience, priority) + risk ratings + HTTP method semantics matching behavior (GET = read-only, DELETE = destructive) |
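
For reference, a tool definition carrying all four MCP annotations might look like the sketch below, written as a Python dict. The tool name and description are hypothetical, and `missing_hints` is an illustrative helper rather than part of any SDK.

```python
REQUIRED_HINTS = ("readOnlyHint", "destructiveHint", "idempotentHint", "openWorldHint")

# Hypothetical read-only tool that talks to an external service.
list_issues_tool = {
    "name": "github_list_issues",
    "description": "List issues in one repository. Not for full-text search.",
    "annotations": {
        "readOnlyHint": True,       # does not modify anything
        "destructiveHint": False,   # no destructive side effects
        "idempotentHint": True,     # repeating the call has no extra effect
        "openWorldHint": True,      # interacts with an external system
    },
}


def missing_hints(tool: dict) -> list[str]:
    """Return the annotation names that the tool does not set explicitly."""
    annotations = tool.get("annotations", {})
    return [hint for hint in REQUIRED_HINTS if hint not in annotations]
```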

---

#### PI9. MCP Implementation Quality

**Requirement:** MAY | **Weight:** Standard (×1)

When an MCP server exists (official or community), the quality of that implementation.

| Score | Description |
|-------|-------------|
| 0. Failing | No MCP server exists (official or community); OR MCP server exists but has critical quality issues: 50+ undifferentiated tools, no descriptions, command injection vulnerabilities, no error handling |
| 1. Basic | Tools have descriptions; auth documented; read and write operations present |
| 2. Good | 5–20 focused tools designed around outcomes (not API mirroring); all four MCP annotations set accurately; read-only and read-write tools clearly separated; documented auth |
| 3. Excellent | All of above + deferred loading / dynamic discovery for large catalogs; safety-tiered tools (read/write/destructive separated); read-only mode available; output annotations; supported clients documented; maintained alongside product releases |

**Always evaluated:** PI9 is always included in the evaluation for tools that trigger the Programmatic Interface module; it is never excluded from the denominator. Tools without an MCP server score 0 on PI9. Because PI9 is a MAY criterion, this score of 0 has limited impact on Agent Ready eligibility (where MAY criteria primarily contribute to the overall percentage) but meaningfully affects Agent Native eligibility: although Agent Native only requires that no SHOULD or MUST criterion scores 0, MAY criteria still contribute to the ≥80% overall threshold. The practical effect: tools can reach Agent Ready without MCP, but Agent Native demands either an MCP server or enough excellence elsewhere to absorb the PI9 zero. This creates a directional incentive toward MCP adoption without penalizing non-adoption at the Agent Ready certification tier.

---

#### PI10. Programmatic Setup / Time to First API Call

**Requirement:** SHOULD | **Weight:** Critical (×2)

The amount of tool-specific configuration required after an agent has valid credentials (or no credentials are needed) before it can make a successful API call. This criterion measures post-credential setup quality: how well the tool minimizes friction between "I have access" and "I made a successful call."

**Separation of concerns with AU1:** Credential acquisition (account creation, key generation, OAuth setup) is evaluated under AU1 (Non-Interactive Authentication Methods). PI10 measures everything *after* authentication is solved. A tool that requires browser-based account creation is already penalized on AU1; PI10 does not double-count that friction.

| Score | Description |
|-------|-------------|
| 0. Failing | >10 minutes post-credential setup; multiple dashboard-only configuration steps required before first API call; tool-specific configuration requires human interaction |
| 1. Basic | 5–10 minutes; 1–2 tool-specific configuration steps (project creation, API enablement, webhook setup) |
| 2. Good | 2–5 minutes; single environment variable or config file; sandbox/test mode available immediately; clear error on misconfiguration |
| 3. Excellent | <2 minutes; zero-config possible for basic usage; test/sandbox works immediately with credentials alone; config validation with actionable errors; programmatic project setup via API |

---

#### PI11. API Workflow Coverage

**Requirement:** SHOULD | **Weight:** Standard (×1)

The percentage of common workflows completable entirely through the API without requiring web dashboard interaction.

| Score | Description |
|-------|-------------|
| 0. Failing | Core functionality requires dashboard; API covers <50% of common workflows |
| 1. Basic | Core CRUD operations available via API; some configuration requires dashboard |
| 2. Good | >80% of common workflows completable via API; dashboard-only steps documented |
| 3. Excellent | 100% of functionality available via API; no dashboard-only features for any common workflow |

---

#### PI12. Versioning & API Stability

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the API uses explicit versioning with adequate deprecation signals and managed breaking changes.

| Score | Description |
|-------|-------------|
| 0. Failing | No versioning strategy; unannounced breaking changes |
| 1. Basic | Version identifier exists (URL path, header, or parameter); some deprecation notices |
| 2. Good | Explicit versioning with documented deprecation policy; `deprecated: true` in specs; 6+ month deprecation windows |
| 3. Excellent | Semver-adherent; `Sunset` headers; machine-readable deprecation timeline; previous version maintained for 12+ months after deprecation |
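
On the consuming side, a `Sunset` header lets an agent compute how much time remains before a version is retired. The sketch below assumes the header value is an HTTP-date; the function name and the example date are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def days_until_sunset(sunset_header: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and the Sunset timestamp."""
    sunset = parsedate_to_datetime(sunset_header)
    return (sunset - now).days


# Example with a made-up header value and a fixed reference time.
remaining = days_until_sunset(
    "Tue, 31 Dec 2030 23:59:59 GMT",
    now=datetime(2030, 12, 1, tzinfo=timezone.utc),
)
```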

---

#### PI13. SDK Availability & Quality

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether official SDKs exist in languages agents commonly use, and whether they're well-maintained.

| Score | Description |
|-------|-------------|
| 0. Failing | No SDK; raw HTTP only |
| 1. Basic | Official SDK in 1 major language (Python or TypeScript/JavaScript) |
| 2. Good | Official SDKs in 2+ major languages; idiomatic to each; typed interfaces |
| 3. Excellent | SDKs in 4+ languages; type-safe with branded types; auto-generated from OpenAPI spec; maintained in sync with API releases |

---

#### PI14. Agent Protocol Availability

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the tool provides high-quality programmatic interfaces for agent interaction. A well-designed REST API, an MCP server, or both are valid paths.

| Score | Description |
|-------|-------------|
| 0. Failing | No programmatic interface; GUI/dashboard only; or API exists but is undocumented |
| 1. Basic | REST API with basic documentation; OR community MCP server exists |
| 2. Good | Well-designed REST API with OpenAPI spec and SDKs in 2+ languages; OR official MCP server with documented auth and core operation coverage |
| 3. Excellent | Excellent REST API with comprehensive SDKs AND official MCP server; OR one interface executed at exceptional quality (e.g., Stripe-quality API without MCP, or best-in-class MCP without REST) |

---

#### PI15. Input Sanitization & Injection Resistance

**Requirement:** MUST | **Weight:** Standard (×1)

Whether the tool demonstrates evidence of input sanitization and defense against injection attacks through schema design, security infrastructure, documentation, and architectural patterns. A tool that exposes a programmatic interface with no evidence of input sanitization cannot be certified at any tier regardless of total score.

> **Evaluation approach:** PI15 is assessed through passive signals, not through active injection probing. The signals are schema strictness (from PI4), security infrastructure detection (WAF, CDN), documentation review, and OpenSSF Scorecard signals. Active adversarial testing (sending injection payloads) is not part of the standard evaluation pipeline. For Agent Native certification, vendors may optionally provide their own security assessment results or authorize penetration testing through a separate engagement. This approach avoids operational friction (WAF bans, security incident triggers, legal concerns) while still measuring whether the tool has the architectural properties that prevent injection.

| Score | Description |
|-------|-------------|
| 0. Failing | No evidence of input sanitization: no schema validation, no parameterized queries, no security infrastructure, no documentation of input handling practices |
| 1. Basic | Basic input validation evidenced: strict input schemas with type checking (from PI4); parameterized queries documented; or WAF/CDN security infrastructure detected |
| 2. Good | Comprehensive sanitization evidence: strict schemas with `additionalProperties: false` across all endpoints; parameterized operations documented throughout; security infrastructure present; input validation practices documented |
| 3. Excellent | All of above + allowlist-based input validation documented where feasible; security testing in CI (detected via workflow analysis); defense-in-depth architecture documented; vendor-provided security assessment or third-party audit results available |

---

#### PI16. Prompt Injection Resistance

**Requirement:** SHOULD | **Weight:** Standard (×1)

Tool-level defense-in-depth against prompt injection: strict schemas, output sanitization, injection-resistant designs.

| Score | Description |
|-------|-------------|
| 0. Failing | Unsafe patterns present or encouraged; no awareness of injection risks; tool descriptions contain narrative or references to other tools |
| 1. Basic | Strict input schema validation; parameterized operations; minimal description surface |
| 2. Good | Output structured with clear field boundaries (JSON); response size limits; descriptions concise and self-contained; no cross-tool references in descriptions |
| 3. Excellent | Explicit design mitigations documented; policy layer for untrusted content; output validation; structured action metadata; separation of untrusted content from control flow |

---

### 9.2 Module: Network Service

**Trigger:** Is the tool a hosted or remote service (SaaS, PaaS, cloud API)?

*8 criteria. Evaluates concerns specific to services that run remotely: error handling, rate limits, health endpoints, observability, sandboxing, environment separation, asynchronous operations, and data portability.*

---

#### NS1. Error Response Quality & Structure

**Requirement:** MUST | **Weight:** Critical (×2)

Whether error responses provide structured, machine-parseable information enabling agents to diagnose problems, determine retryability, and execute recovery actions.

| Score | Description |
|-------|-------------|
| 0. Failing | HTML error pages; empty responses; generic "Something went wrong"; silent failures (no error flag set) |
| 1. Basic | JSON errors with human-readable message and machine-readable error code; MCP errors set `isError: true` |
| 2. Good | RFC 9457 compliant (`type`, `title`, `status`, `detail`); all validation errors reported simultaneously (not one-at-a-time); field-level identification; `doc_url` per error type |
| 3. Excellent | All of above + `is_retriable` boolean + `retry_after_seconds` + suggested alternative actions + hierarchical error taxonomy (e.g., Stripe: type → code → decline_code) + numbered recovery steps |
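
A sketch of an error body that combines the RFC 9457 members with the extension fields named in the table. The builder function, the `errors` list shape, and the `example.com` URLs are assumptions for illustration; RFC 9457 permits extension members, and such bodies are served as `application/problem+json`.

```python
def problem_response(status, type_uri, title, detail, *, field_errors=None,
                     is_retriable=False, retry_after_seconds=None, doc_url=None):
    """Build an RFC 9457 style problem body with agent-oriented extensions."""
    body = {
        "type": type_uri,
        "title": title,
        "status": status,
        "detail": detail,
        "is_retriable": is_retriable,
    }
    if field_errors:
        body["errors"] = field_errors  # all validation errors reported at once
    if retry_after_seconds is not None:
        body["retry_after_seconds"] = retry_after_seconds
    if doc_url:
        body["doc_url"] = doc_url
    return body


def should_retry(body: dict) -> bool:
    """Agent-side decision based only on the machine-readable flag."""
    return bool(body.get("is_retriable"))


validation_error = problem_response(
    422,
    "https://api.example.com/problems/validation",  # hypothetical URI
    "Validation failed",
    "2 fields are invalid.",
    field_errors=[
        {"field": "currency", "detail": "must be one of usd, eur, gbp"},
        {"field": "amount_cents", "detail": "must be >= 1"},
    ],
    doc_url="https://docs.example.com/errors/validation",  # hypothetical URL
)
```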

---

#### NS2. Rate Limit Communication

**Requirement:** SHOULD | **Weight:** Critical (×2)

Whether rate limits are communicated proactively and include machine-actionable timing signals.

| Score | Description |
|-------|-------------|
| 0. Failing | No rate limit headers; no `Retry-After` on 429 responses; undocumented limits |
| 1. Basic | `Retry-After` on 429 responses; limits documented somewhere |
| 2. Good | Rate limit headers on every response (`X-RateLimit-Remaining`, `X-RateLimit-Limit`, `X-RateLimit-Reset`); per-key limits; scope declared (per-endpoint vs. global) |
| 3. Excellent | Full header suite on all responses; batch endpoints to reduce call count; resource-aware cost metadata (e.g., `operationCost: { credits: 5 }`); per-agent rate limits |
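
The sketch below shows how an agent might turn these headers into a wait time. It rests on two assumptions that vary between vendors: `Retry-After` is given in seconds (it can also be an HTTP-date), and `X-RateLimit-Reset` is a Unix timestamp. The function name is hypothetical.

```python
def seconds_to_wait(status: int, headers: dict, now: float) -> float:
    """How long to pause before the next call, based on rate limit headers."""
    h = {key.lower(): value for key, value in headers.items()}

    # Reactive path: the server rejected the call and said when to retry.
    if status == 429 and "retry-after" in h:
        return float(h["retry-after"])  # assumes delta-seconds form

    # Proactive path: the budget is exhausted, so wait for the window to reset.
    remaining = h.get("x-ratelimit-remaining")
    reset = h.get("x-ratelimit-reset")
    if remaining is not None and reset is not None and int(remaining) == 0:
        return max(0.0, float(reset) - now)  # assumes reset is epoch seconds

    return 0.0
```

Headers on every response enable the proactive path, which lets an agent slow down before it ever receives a 429.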

---

#### NS3. Health & Status Communication

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the tool provides structured health endpoints that agents can query to assess availability.

| Score | Description |
|-------|-------------|
| 0. Failing | No health endpoint; HTML status pages only |
| 1. Basic | `/health` endpoint returns JSON with aggregate up/down status |
| 2. Good | Component-level status; `Retry-After` on 503 responses; maintenance schedule available |
| 3. Excellent | Per-dependency status; degradation warnings in response metadata; `application/health+json` format (IETF Internet-Draft); planned maintenance pre-signaled |

---

#### NS4. Audit & Observability

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the tool logs agent interactions with sufficient detail for forensic analysis and compliance.

| Score | Description |
|-------|-------------|
| 0. Failing | No meaningful logs or observability for API/tool interactions |
| 1. Basic | Basic request/response logging; API key identified in logs |
| 2. Good | Audit logs with correlation IDs; sensitive-data redaction; rate limits enforced with logged violations; agent identity distinguished from human in logs |
| 3. Excellent | OpenTelemetry-compatible trace/span IDs; immutable append-only audit logs; delegation chain logging; anomaly detection or alerting; per-action risk tier logging |

---

#### NS5. Test/Sandbox Environment Support

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the tool provides sandbox environments, test keys, and safe experimentation modes.

| Score | Description |
|-------|-------------|
| 0. Failing | No test mode; no sandbox; every mistake is a production incident |
| 1. Basic | Test mode exists but with limited simulation |
| 2. Good | Separate sandbox environment + basic behavioral simulation + API-verifiable mode (test responses indicate test mode) |
| 3. Excellent | Structurally distinct test/live keys (prefixed like `sk_test_`); separate sandbox URLs; full behavioral simulation; multiple sandboxes; time simulation (Stripe test clocks, Neon database branching) |

---

#### NS6. Environment Separation

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the tool architecturally separates development, staging, and production environments.

| Score | Description |
|-------|-------------|
| 0. Failing | No environment separation; single set of credentials for all environments; test and production data co-mingled |
| 1. Basic | Separate environments exist but share credentials or configuration |
| 2. Good | Distinct credentials per environment; environment clearly indicated in API responses; preview/staging deployments available |
| 3. Excellent | Environment-specific URLs and credentials; database branching for isolated experimentation; deploy previews via API; environment promotion workflow (dev → staging → prod) |

---

#### NS7. Asynchronous Operation Support

**Requirement:** MAY | **Weight:** Standard (×1)

Whether long-running operations return immediately with a durable handle and provide status mechanisms.

| Score | Description |
|-------|-------------|
| 0. Failing | Long-running operations block until complete; timeouts cause retries with potential duplicates |
| 1. Basic | HTTP 202 Accepted pattern with task/job ID on some operations |
| 2. Good | Consistent async pattern across all long-running operations; polling endpoint with status; `estimated_seconds` in 202 response |
| 3. Excellent | Full async with lifecycle states (working → completed/failed/cancelled); both polling and webhook notification; blocking result endpoint for simple cases; progress reporting |

> **Tools with no long-running operations:** If all operations genuinely complete in under 5 seconds (both documented and empirically verified), score 1. The tool demonstrates appropriate response design for its operation profile: fast responses are the ideal, not a gap. Score 0 applies only when long-running operations exist and block without async patterns. Score 2+ requires long-running operations with progressively better async handling.
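
The 202 Accepted pattern with lifecycle states can be sketched as an in-memory job store. Class and method names are hypothetical, and a real service would persist jobs and run the work in the background.

```python
import uuid
from enum import Enum


class JobState(Enum):
    WORKING = "working"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


TERMINAL = {JobState.COMPLETED, JobState.FAILED, JobState.CANCELLED}


class JobStore:
    def __init__(self):
        self._jobs = {}

    def submit(self, estimated_seconds: int):
        """Return immediately with a durable handle (HTTP 202 Accepted)."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = JobState.WORKING
        return 202, {"job_id": job_id, "state": JobState.WORKING.value,
                     "estimated_seconds": estimated_seconds}

    def finish(self, job_id: str, state: JobState):
        """Move a working job into exactly one terminal state."""
        if state not in TERMINAL:
            raise ValueError("finish() requires a terminal state")
        if self._jobs[job_id] in TERMINAL:
            raise ValueError("job is already in a terminal state")
        self._jobs[job_id] = state

    def status(self, job_id: str):
        """Polling endpoint: report the current lifecycle state."""
        return 200, {"job_id": job_id, "state": self._jobs[job_id].value}
```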

---

#### NS8. Data Portability & Pricing Transparency

**Requirement:** MAY | **Weight:** Standard (×1)

Whether the service provides programmatic access to pricing, usage tracking, and data export. Agents operating autonomously cannot parse marketing pages, "Contact Sales" buttons, or dashboard-only usage tracking: they need structured, machine-readable access to costs, consumption, and data portability.

| Score | Description |
|-------|-------------|
| 0. Failing | No data export capability; pricing only on marketing pages; no usage tracking API |
| 1. Basic | Manual data export (dashboard); published pricing page; basic usage visible in dashboard |
| 2. Good | Programmatic data export API; published pricing with clear unit costs; usage tracking API; billing alerts |
| 3. Excellent | Bulk export API with standard formats (CSV, JSON, Parquet); machine-readable pricing API or structured pricing page; real-time usage tracking; spending limit API; cost estimation before provisioning |

**Relationship to HI1 (Cost Guardrails):** HI1 evaluates mechanisms to *prevent* cost overruns (spending limits, auto-stop, cost caps). NS8 evaluates *information availability*: can agents determine what something costs, how much has been spent, and whether data can be extracted? A tool can score well on NS8 (transparent pricing, usage API) while scoring poorly on HI1 (no spending limits), or vice versa.

---

### 9.3 Module: Write Operations

**Trigger:** Can the tool create, modify, or delete data or resources?

*4 criteria. Evaluates safeguards for irreversible actions: destructive operation safety, dry-run capability, idempotency, and multi-step error handling.*

---

#### WO1. Destructive Operation Safety

**Requirement:** SHOULD | **Weight:** Critical (×2)

Mechanisms that prevent agents from executing irreversible destructive operations without appropriate safeguards.

| Score | Description |
|-------|-------------|
| 0. Failing | No guardrails; agent gets full read/write/delete access by default; no confirmation patterns |
| 1. Basic | Database-level or API-level permissions with agent-specific restricted roles; some operations require confirmation |
| 2. Good | Layered defenses: read-only modes + lexical blocklists (DROP, DELETE, TRUNCATE) + human confirmation gates for high-risk operations; soft delete support |
| 3. Excellent | Physical write prevention (read-only replicas); destructive ops excluded from agent-facing interfaces; structural prevention patterns (auth-capture for payments, plan-apply for infra); draft/preview/publish separation |

---

#### WO2. Dry-Run / Validation Capability

**Requirement:** MAY | **Weight:** Standard (×1)

Whether the tool provides mechanisms to validate requests without executing them.

| Score | Description |
|-------|-------------|
| 0. Failing | No dry-run or validation capability |
| 1. Basic | Validation endpoint exists for some operations |
| 2. Good | Dry-run parameter or validation endpoint for most mutating operations; returns what would happen without side effects |
| 3. Excellent | Dry-run executes full validation chain (Terraform plan, Kubernetes server-side dry-run); standardized parameter (e.g., `validate_only: true` per Google AIP-163); diff output showing proposed changes |
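
A minimal sketch of a dry-run parameter with diff output, using a plain dict as the data store. The `update_record` function is hypothetical; the `validate_only` name follows the convention cited in the table.

```python
def update_record(store: dict, key: str, changes: dict, validate_only: bool = False) -> dict:
    """Validate and diff an update; apply it only when validate_only is false."""
    if key not in store:
        raise KeyError(f"unknown record: {key}")
    current = store[key]

    # The same validation and diff run in both modes, so a dry run is faithful.
    diff = {
        field: {"from": current.get(field), "to": value}
        for field, value in changes.items()
        if current.get(field) != value
    }
    if not validate_only:
        current.update(changes)
    return {"applied": not validate_only, "diff": diff}
```

Calling it first with `validate_only=True` shows the proposed changes without side effects; repeating the call without the flag applies them.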

---

#### WO3. Idempotency & Safe Retry Support

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether mutating operations accept idempotency keys to prevent duplicate side effects when agents retry failed requests.

| Score | Description |
|-------|-------------|
| 0. Failing | No idempotency support; retries cause duplicate side effects |
| 1. Basic | Idempotency-Key accepted on critical mutating operations |
| 2. Good | Idempotency enforced with 24h+ key persistence; concurrent request handling via locking; documented key behavior |
| 3. Excellent | Comprehensive idempotency across all non-idempotent operations; conflict detection (same key, different params → 409); Stripe-model parameter validation |
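
The conflict-detection behavior (same key, different parameters → 409) can be sketched as follows. The class, method, and response shapes are hypothetical; key expiry and locking for concurrent requests are omitted.

```python
import hashlib
import json


class ChargeEndpoint:
    def __init__(self):
        self._seen = {}    # idempotency key -> (fingerprint, status, body)
        self.charges = []  # side effects actually performed

    def create_charge(self, idempotency_key: str, params: dict):
        """Create a charge at most once per idempotency key."""
        fingerprint = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()
        ).hexdigest()

        if idempotency_key in self._seen:
            stored_fingerprint, status, body = self._seen[idempotency_key]
            if stored_fingerprint != fingerprint:
                # Same key, different parameters: refuse rather than guess.
                return 409, {"error": "idempotency_key_reused"}
            return status, body  # replay the original response, no new side effect

        self.charges.append(params)
        body = {"charge_id": f"ch_{len(self.charges)}", **params}
        self._seen[idempotency_key] = (fingerprint, 201, body)
        return 201, body
```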

---

#### WO4. Workflow Error Communication

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether multi-step operations communicate progress, partial success, and resumability.

| Score | Description |
|-------|-------------|
| 0. Failing | No step-level feedback; atomic success-or-fail with no intermediate state visibility |
| 1. Basic | Failed step identified in error response; no resume capability |
| 2. Good | Completed/failed/pending step enumeration; resume tokens or checkpoint IDs; severity indication (reversible vs. irreversible failure) |
| 3. Excellent | Full checkpoint-based recovery; draft/preview/publish separation; compensating transactions for partial failures; 202 Accepted + polling for multi-step workflows |

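One way to picture the step enumeration and resume capability described at the Good level is the sketch below. The `run_workflow` function, the report fields, and the use of a step index as the checkpoint are all assumptions made for illustration; a real service would more likely issue opaque resume tokens.

```python
def run_workflow(steps, resume_from: int = 0) -> dict:
    """Run (name, callable) steps in order, reporting per-step status and a resume point."""
    report = {
        "completed": [name for name, _ in steps[:resume_from]],
        "failed": None,
        "pending": [],
    }
    for index in range(resume_from, len(steps)):
        name, action = steps[index]
        try:
            action()
        except Exception as exc:
            report["failed"] = {"step": name, "error": str(exc)}
            report["pending"] = [later for later, _ in steps[index + 1:]]
            report["resume_from"] = index  # checkpoint: a retry starts at the failed step
            return report
        report["completed"].append(name)
    return report
```
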
---

### 9.4 Module: Authentication

**Trigger:** Does the tool require credentials, API keys, OAuth, or any form of authentication?

*4 criteria. Authentication is "the hardest unsolved problem in agent-tool interaction." It functions as a binary gate: a tool with perfect everything else is agent-useless if the agent can't authenticate.*

---

#### AU1. Non-Interactive Authentication Methods

**Requirement:** MUST | **Weight:** Critical (×2)

Whether the tool supports at least one authentication method that agents can complete without human interaction.

| Score | Description |
|-------|-------------|
| 0. Failing | Only browser-based OAuth requiring human interaction; CAPTCHA-gated; 2FA with no bypass for service accounts |
| 1. Basic | API keys available; basic documentation for key usage |
| 2. Good | API keys + Client Credentials grant + M2M documentation + Device Flow for delegated access |
| 3. Excellent | Multiple non-interactive methods + brokered credentials + programmatic key creation/rotation via API |

*AU1 is a MUST gate. If a tool requires authentication, non-interactive auth is non-negotiable: score 0 on AU1 blocks certification regardless of total score.*

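For readers unfamiliar with the Client Credentials grant named at the Good level, the sketch below builds (but does not send) a token request in the shape defined by OAuth 2.0. The function name and any endpoint URL passed to it are hypothetical, and the sketch assumes the client ID and secret contain only characters that need no extra encoding.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(token_url: str, client_id: str, client_secret: str, scope: str) -> Request:
    """Build (but do not send) an OAuth 2.0 Client Credentials token request."""
    # HTTP Basic client authentication: base64("client_id:client_secret").
    credentials = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urlencode({"grant_type": "client_credentials", "scope": scope}).encode()
    return Request(
        token_url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )
```
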
---

#### AU2. Permission Granularity

**Requirement:** SHOULD | **Weight:** Standard (×1)

How finely the tool allows scoping what an agent can access and do.

| Score | Description |
|-------|-------------|
| 0. Failing | Single admin key with full access; no scoping mechanism |
| 1. Basic | Read/write separation available |
| 2. Good | Per-resource scoped keys + fine-grained OAuth scopes + insufficient permissions error includes required scope |
| 3. Excellent | Per-resource per-operation scoping + machine-readable permission manifests + deny-by-default for destructive operations |

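The Good-level requirement that a permissions error name the required scope might look like the sketch below. The `authorize` function and the response fields are hypothetical; the point is only that a denied agent learns exactly which scope to request.

```python
def authorize(granted_scopes: set, required_scope: str) -> dict:
    """Return a response dict; on denial, name the scope the caller is missing."""
    if required_scope in granted_scopes:
        return {"status": 200}
    return {
        "status": 403,
        "error": "insufficient_scope",
        "required_scope": required_scope,
        "granted_scopes": sorted(granted_scopes),
    }
```
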
---

#### AU3. Credential Lifecycle Management

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the tool supports automated credential rotation, refresh, expiry signaling, and per-agent revocation.

| Score | Description |
|-------|-------------|
| 0. Failing | Manual rotation only; no programmatic credential management |
| 1. Basic | API for key creation/rotation + refresh tokens |
| 2. Good | Automatic rotation + zero-downtime overlap + per-key revocation + expiry metadata |
| 3. Excellent | Brokered credentials + dual-secret rotation + proactive refresh guidance + per-key audit trail |

---

#### AU4. Agent Identity Support

**Requirement:** MAY | **Weight:** Standard (×1)

Whether the tool treats AI agents as a distinct identity type.

| Score | Description |
|-------|-------------|
| 0. Failing | Shared credentials only; no way to distinguish agent from human |
| 1. Basic | Service accounts with some scoping |
| 2. Good | M2M auth with `client_credentials` + agent-specific rate limits |
| 3. Excellent | Agent as first-class identity type + Token Vault + CIBA + per-action audit trail |

---

### 9.5 Module: CLI

**Trigger:** Does the tool have a command-line interface?

*4 criteria. Evaluates agent-specific CLI concerns: non-interactive execution, structured output, cross-platform behavior, and configuration safety.*

---

#### CLI1. Non-Interactive Execution

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the tool can run without any human interaction: no confirmation prompts, no editor invocations, no TTY-dependent output.

| Score | Description |
|-------|-------------|
| 0. Failing | Tool hangs or crashes without TTY; interactive prompts with no bypass |
| 1. Basic | Some non-interactive flags exist (`--yes`, `--no-input`); some prompts remain |
| 2. Good | Non-interactive flags for most prompts; CI mode detection; `--json` output mode |
| 3. Excellent | Auto-detects non-TTY environment; flags for all interactive points; JSON output implies non-interactive; `NO_COLOR=1` support; separate stderr/stdout |

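A minimal sketch of non-TTY detection with a prompt bypass follows. The helper names are hypothetical, and treating a set `CI` environment variable as a non-interactive signal is a common convention assumed here rather than something this standard mandates.

```python
import os
import sys


def is_interactive(stream=None, env=None) -> bool:
    """Return True only when input is a terminal and no CI marker is set."""
    stream = sys.stdin if stream is None else stream
    env = os.environ if env is None else env
    if env.get("CI"):
        return False
    return stream.isatty()


def confirm(prompt: str, assume_yes: bool = False, stream=None, env=None) -> bool:
    """Ask for confirmation, but fail fast instead of hanging when no human is present."""
    if assume_yes:  # the equivalent of a --yes flag
        return True
    if not is_interactive(stream, env):
        raise RuntimeError("confirmation required; pass --yes to proceed non-interactively")
    return input(f"{prompt} [y/N] ").strip().lower() == "y"
```
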
---

#### CLI2. Structured Output Mode

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the CLI provides machine-parseable output alongside human-readable output. Agents consuming CLI output need structured data they can parse reliably: without it, they resort to regex-parsing tables and colored text, which breaks across versions and locales.

| Score | Description |
|-------|-------------|
| 0. Failing | Text-only output; no `--json` or equivalent flag; ANSI colors/formatting in default output with no disable mechanism; exit code 0/non-zero only with no structured error information |
| 1. Basic | `--json` or `--format json` flag available for primary commands; basic exit codes (0 = success, non-zero = failure); stderr and stdout may be mixed |
| 2. Good | JSON output available on all major commands; meaningful exit codes with descriptive stderr; stderr and stdout cleanly separated; `NO_COLOR=1` or `--no-color` supported |
| 3. Excellent | Multiple structured formats (JSON + YAML + custom templates); structured output implies non-interactive mode (Terraform pattern: `--json` implies `--input=false`); `--porcelain` stability guarantee across versions (Git pattern); semantic exit codes (distinct codes for distinct failure modes); consistent JSON schema across CLI versions |

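The combination of structured output, separated streams, and semantic exit codes could be sketched as below. The exit-code assignments, the `render` function, and the error payload shape are hypothetical; the rubric asks only that distinct failure modes map to distinct codes.

```python
import json

# Hypothetical exit-code assignments, chosen only for illustration.
EXIT_CODES = {"ok": 0, "usage_error": 2, "not_found": 3, "auth_failed": 4}


def render(outcome: str, payload: dict, as_json: bool) -> tuple:
    """Return (stdout_text, stderr_text, exit_code) for one command result."""
    code = EXIT_CODES[outcome]
    if code == 0:
        if as_json:
            stdout = json.dumps(payload, sort_keys=True)
        else:
            stdout = "\n".join(f"{key}: {value}" for key, value in payload.items())
        return stdout, "", code
    # Errors go to stderr only, so stdout stays empty and safe to parse on failure.
    error = {"error": outcome, **payload}
    stderr = json.dumps(error, sort_keys=True) if as_json else f"error: {outcome}"
    return "", stderr, code
```
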
---

#### CLI3. Cross-Platform Consistency

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the CLI behaves identically across Linux, macOS, and Windows. Agents trained primarily on Linux/macOS generate commands that fail silently on Windows: path separators, line endings, shell syntax, and temp directory locations all differ. A tool that works on one platform but behaves differently on another creates unpredictable agent failures.

| Score | Description |
|-------|-------------|
| 0. Failing | Single-platform only (e.g., bash-only scripts); hard-coded platform-specific paths (`/tmp/`, `C:\`); no Windows support |
| 1. Basic | Available on Linux, macOS, and Windows, but behavior or output may differ across platforms; platform-specific installation instructions |
| 2. Good | Cross-platform binary distribution or container; consistent output format across platforms; path handling works with both `/` and `\`; no platform-specific shell syntax required |
| 3. Excellent | CI tests on all three major platforms; byte-identical output across platforms; single static binary or zero-dependency install; devcontainer or Nix support for environment reproducibility; platform-specific differences documented |

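Two of the failure modes named above, hard-coded temp paths and separator handling, can be avoided with the standard library, as in this sketch. The helper names are hypothetical. Note one caveat: on POSIX systems a backslash is a legal filename character, so normalizing it to a separator is a design choice rather than a universal rule.

```python
import tempfile
from pathlib import Path, PureWindowsPath


def scratch_file(name: str) -> Path:
    """Place a scratch file in the platform's temp directory instead of a hard-coded /tmp/."""
    return Path(tempfile.gettempdir()) / name


def normalize_user_path(raw: str) -> str:
    """Accept either slash or backslash separators and return a forward-slash path."""
    # PureWindowsPath treats both separators as path separators on every platform.
    return PureWindowsPath(raw).as_posix()
```
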
---

#### CLI4. Configuration Format Safety

**Requirement:** MAY | **Weight:** Standard (×1)

Whether the tool's configuration format is safe for agent generation. YAML's whitespace sensitivity and implicit type coercion create subtle, silent failures when agents generate config files: a single-space indentation error changes data structure without a syntax error, and the "Norway problem" (`NO` → `false`) corrupts data silently. JSON Schema validation acts as a force multiplier, letting agents validate config before applying.

| Score | Description |
|-------|-------------|
| 0. Failing | YAML-only config with no schema validation; no config validation command; implicit type coercion undocumented |
| 1. Basic | Config format documented; basic structure validation (file parses without error); YAML accepted but JSON alternative available |
| 2. Good | JSON or TOML as primary config format; JSON Schema exists for config files; standalone validation command available (`validate`, `check`, `lint`); actionable error messages on misconfiguration |
| 3. Excellent | JSON or TOML primary with published JSON Schema; schema-driven IDE and agent autocompletion; validation runs automatically before any destructive action; error messages include specific fix suggestions; no implicit type coercion; config secrets isolated from main config file |

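To make the validation idea concrete, the sketch below checks a JSON config against a tiny hand-written schema and returns actionable messages. It stands in for real JSON Schema validation, which would normally use a dedicated library; the `CONFIG_SCHEMA` fields and the message wording are hypothetical. Because JSON has no implicit coercion, the string `"NO"` stays a string.

```python
import json

# Hypothetical schema: field name -> required Python type.
CONFIG_SCHEMA = {"country": str, "replicas": int, "debug": bool}


def validate_config(text: str) -> list:
    """Return a list of human-readable problems; an empty list means the config is valid."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top-level value must be an object"]
    problems = []
    for field, expected in CONFIG_SCHEMA.items():
        if field not in config:
            problems.append(f"missing field '{field}'")
        elif type(config[field]) is not expected:
            # Exact type match, so a boolean is never accepted where an integer is expected.
            actual = type(config[field]).__name__
            problems.append(f"field '{field}' must be {expected.__name__}, got {actual}")
    return problems
```
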
---

## 10. Domain Modules

Domain modules add criteria based on a tool's functional domains. A tool may trigger **one or more** domain modules: for example, Supabase (Databases + Auth Providers + Hosting) or Firebase (Databases + Auth Providers + Communications). Each activated domain module adds its criteria to the tool's evaluation, expanding both the numerator and the denominator of the score, just as complexity modules do.

---

### 10.1 Module: Payments & Financial

**Applies to:** Payment processors, billing platforms, financial APIs

| ID | Criterion | Req | Weight | Assessment |
|----|-----------|-----|--------|------------|
| PM1 | **Idempotency Depth** | SHOULD | Standard | Automated |
| PM2 | **Test Simulation Fidelity** | SHOULD | Standard | AI-Assisted |
| PM3 | **Compliance Automation** | MAY | Standard | AI-Assisted |
| PM4 | **Currency & Amount Safety** | SHOULD | Standard | Automated |

---

#### PM1. Idempotency Depth

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the payment API provides deep idempotency beyond basic key acceptance, including parameter validation, key persistence windows, and concurrent request serialization. Agents retry failed requests frequently; without robust idempotency, retries create duplicate charges.

| Score | Description |
|-------|-------------|
| 0. Failing | No idempotency support; retried requests create duplicate charges |
| 1. Basic | Idempotency key header accepted; duplicate requests return cached response |
| 2. Good | Key acceptance + documented persistence window (e.g., 24 hours) + concurrent request serialization |
| 3. Excellent | Parameter validation (same key + different params → error/409), documented key lifetime, concurrent locking, idempotency across all POST/write endpoints |

---

#### PM2. Test Simulation Fidelity

**Requirement:** SHOULD | **Weight:** Standard (×1)

How comprehensively the platform simulates real payment scenarios in test mode, including decline codes, dispute flows, subscription lifecycle, and webhook events. Agents cannot safely learn payment integration on live data.

| Score | Description |
|-------|-------------|
| 0. Failing | No test mode; or test mode limited to basic success/fail with no scenario simulation |
| 1. Basic | Test/sandbox environment with key separation; basic test card numbers for success and generic decline |
| 2. Good | Multiple test cards covering specific decline codes and card brands; webhook forwarding/simulation; isolated test data |
| 3. Excellent | 30+ test cards with specific scenarios; Test Clocks API for time-dependent flows (subscriptions, trials); dispute/refund simulation; CLI event triggering and replay |

---

#### PM3. Compliance Automation

**Requirement:** MAY | **Weight:** Standard (×1)

Whether the platform automates regulatory compliance burdens (tax calculation, PCI scope reduction, 3DS/SCA flows) so agents don't need jurisdiction-specific knowledge. An agent creating a payment flow should not need to understand VAT rules for 200 countries.

| Score | Description |
|-------|-------------|
| 0. Failing | No compliance automation; agent must manually implement tax calculation, PCI handling, and 3DS flows |
| 1. Basic | Hosted checkout or client-side tokenization reduces PCI scope; basic 3DS support via redirects |
| 2. Good | Built-in tax engine (enable via API); automatic 3DS/SCA handling with machine-readable `requires_action` status; PCI scope fully eliminated via hosted flows |
| 3. Excellent | Merchant of Record model (platform handles all tax, compliance, remittance); or built-in tax engine covering 200+ markets with threshold monitoring and VAT ID validation |

---

#### PM4. Currency & Amount Safety

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the API prevents currency-related agent errors through clear unit documentation, smallest-unit enforcement, zero-decimal currency handling, and validation of ambiguous amounts. Currency math errors are among the most costly agent mistakes.

| Score | Description |
|-------|-------------|
| 0. Failing | Ambiguous amount units (unclear if cents or dollars); no zero-decimal currency handling; no minimum amount enforcement |
| 1. Basic | Documentation states amounts are in smallest currency unit; minimum charge amount enforced |
| 2. Good | Explicit unit in API responses; zero-decimal currencies (JPY) and three-decimal currencies (BHD) documented; amount validation with clear error messages |
| 3. Excellent | Currency-aware validation rejecting ambiguous amounts; explicit decimal count per currency in API metadata; auth-capture pattern support for human review before charge |

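Smallest-unit enforcement with zero-decimal and three-decimal currencies can be sketched as follows. The exponent table covers only the three currencies mentioned in this rubric's examples, and the `to_minor_units` function and its error messages are hypothetical.

```python
from decimal import Decimal, InvalidOperation

# Minor-unit exponents for three example currencies (USD cents, JPY none, BHD fils).
CURRENCY_EXPONENTS = {"USD": 2, "JPY": 0, "BHD": 3}


def to_minor_units(amount: str, currency: str) -> int:
    """Convert a decimal string to an integer count of the currency's smallest unit."""
    if currency not in CURRENCY_EXPONENTS:
        raise ValueError(f"unsupported currency: {currency}")
    try:
        value = Decimal(amount)
    except InvalidOperation:
        raise ValueError(f"not a decimal amount: {amount!r}")
    if not value.is_finite():
        raise ValueError(f"not a finite amount: {amount!r}")
    scaled = value * (10 ** CURRENCY_EXPONENTS[currency])
    # Reject amounts with more precision than the currency allows, e.g. 500.5 JPY.
    if scaled <= 0 or scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} is not a valid positive {currency} amount")
    return int(scaled)
```
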
---

### 10.2 Module: Communications

**Applies to:** Email, SMS, messaging, notification platforms

| ID | Criterion | Req | Weight | Assessment |
|----|-----------|-----|--------|------------|
| CM1 | **Irreversibility Safeguards** | SHOULD | Standard | AI-Assisted |
| CM2 | **Delivery Verification** | SHOULD | Standard | Automated |
| CM3 | **Webhook/Event Infrastructure** | SHOULD | Standard | Automated |

---

#### CM1. Irreversibility Safeguards

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the platform provides safety mechanisms (sandbox/test modes, draft-then-send patterns, batch limits, and scheduled send with cancellation) that prevent agents from sending irreversible communications without review. Sent messages cannot be recalled.

| Score | Description |
|-------|-------------|
| 0. Failing | No sandbox mode; no batch limits; no draft/preview capability; agent can send unlimited messages immediately |
| 1. Basic | Test/sandbox mode available (messages validated but not delivered); basic rate limiting on outbound sends |
| 2. Good | Sandbox mode + batch send limits (≤1,000 per call) + rate limiting; draft/preview API or scheduled send with cancellation window |
| 3. Excellent | Sandbox mode validating full request format; per-second rate limits as safety brakes; draft-then-send pattern with human approval gate; scheduled send with cancellation; loop prevention circuit breaker |

---

#### CM2. Delivery Verification

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the platform provides structured, machine-readable delivery status tracking, including bounce categorization (hard/soft), complaint tracking, and suppression list management. Agents need programmatic feedback to know if messages were actually delivered.

| Score | Description |
|-------|-------------|
| 0. Failing | No delivery status feedback; fire-and-forget sending with no bounce or complaint data |
| 1. Basic | Basic delivery/bounce webhooks; suppression list exists but is not API-accessible |
| 2. Good | Structured delivery receipts (delivered/bounced/complained); bounce categorization (hard/soft); API-accessible suppression lists; unsubscribe handling |
| 3. Excellent | Full event lifecycle (processed → delivered → opened → clicked → unsubscribed → complained); automatic suppression management; per-recipient status tracking; bounce type classification with machine-readable codes |

---

#### CM3. Webhook/Event Infrastructure

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the platform supports programmatic webhook configuration, cryptographic signature verification, event replay, and structured event payloads. Agents managing communication workflows need reliable, verifiable event delivery, not dashboard-only webhook setup.

| Score | Description |
|-------|-------------|
| 0. Failing | No webhook support; or webhooks require dashboard-only configuration with no signature verification |
| 1. Basic | Webhook URLs configurable via API; events delivered as structured JSON payloads |
| 2. Good | API-managed webhooks + cryptographic signature verification (HMAC or ECDSA); standard event types across the delivery lifecycle |
| 3. Excellent | Full CRUD webhook management via API; signature verification; event replay capability; batched event delivery; per-stream webhook URLs; inbound message processing via webhooks |

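The HMAC variant of signature verification mentioned at the Good level might be implemented along these lines. This is a generic sketch: the choice of SHA-256, the hex encoding, and the function names are assumptions, and real providers differ in how they construct the signed payload and which header carries the signature.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Compute a hex-encoded HMAC-SHA256 signature over the raw request body."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, payload: bytes, received_signature: str) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign_payload(secret, payload)
    return hmac.compare_digest(expected, received_signature)
```
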
---

### 10.3 Module: Databases

**Applies to:** Databases, data platforms, ORMs

| ID | Criterion | Req | Weight | Assessment |
|----|-----------|-----|--------|------------|
| DB1 | **Safe Experimentation** | SHOULD | Standard | Automated |
| DB2 | **Schema Introspection Quality** | SHOULD | Standard | Automated |
| DB3 | **Query Interface Safety** | SHOULD | Standard | Automated |
| DB4 | **Connection Management** | MAY | Standard | Automated |

---

#### DB1. Safe Experimentation

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the database provides mechanisms (branching, read-only replicas, point-in-time recovery, and copy-on-write environments) that let agents experiment without risking production data. Agents make destructive mistakes; the database must make those mistakes reversible.

| Score | Description |
|-------|-------------|
| 0. Failing | No branching, snapshots, or recovery mechanism; destructive operations are permanent |
| 1. Basic | Point-in-time recovery (PITR) available; manual backup/restore process |
| 2. Good | Read-only replicas available; PITR with reasonable granularity; snapshot/clone capability (minutes to create) |
| 3. Excellent | Instant copy-on-write branching (<1s creation); branch reset to parent state; schema-only and full-data branch modes; PITR with fine granularity |

---

#### DB2. Schema Introspection Quality

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the database exposes machine-readable schema metadata, including table/column types, relationships, constraints, and semantic descriptions. Agents generating SQL need accurate schema context, but raw schema dumps are too token-expensive for direct LLM injection.

| Score | Description |
|-------|-------------|
| 0. Failing | No programmatic schema discovery; agent must guess table structure or rely on documentation alone |
| 1. Basic | Standard schema discovery (e.g., `information_schema`, `SHOW TABLES`); table and column names with types exposed |
| 2. Good | Full schema with foreign key/relationship metadata; constraint enumeration; schema accessible via HTTP API (not just SQL) |
| 3. Excellent | Semantic catalog with natural-language column/table descriptions (e.g., `COMMENT ON`); token-efficient schema representation; schema caching with DDL-change invalidation |

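As one possible reading of "token-efficient schema representation", the sketch below uses SQLite's built-in introspection pragmas to emit one compact line per table, including foreign keys. SQLite is used only because it ships with Python; the `compact_schema` function and its output format are hypothetical.

```python
import sqlite3


def compact_schema(conn: sqlite3.Connection) -> str:
    """Summarize each table as one line: name(columns) plus any foreign keys."""
    lines = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        cols = ", ".join(f"{c[1]} {c[2]}" + (" PK" if c[5] else "") for c in columns)
        line = f"{table}({cols})"
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall():
            line += f" FK {fk[3]}->{fk[2]}.{fk[4]}"
        lines.append(line)
    return "\n".join(lines)
```
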
---

#### DB3. Query Interface Safety

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the database enforces safe query patterns, including parameterized queries, row-level security, and query validation before execution, to protect against the high error rate of agent-generated SQL. Agents produce incorrect SQL far more often than humans.

| Score | Description |
|-------|-------------|
| 0. Failing | Raw SQL string concatenation accepted; no parameterized query enforcement; no row-level security |
| 1. Basic | Parameterized queries supported; basic SQL injection prevention |
| 2. Good | Parameterized queries enforced by default; row-level security (RLS) available; query explain/validation before execution |
| 3. Excellent | RLS enabled by default on new tables; query validation with cost estimation; read-only query mode for exploration; guardrails against broad `DELETE`/`UPDATE` without `WHERE` clauses |

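Parameterized queries and a guardrail against unscoped writes can be combined as in the sketch below, again using SQLite for convenience. The `guarded_execute` wrapper is hypothetical, and its regular expression is a lexical heuristic rather than a SQL parser, so it illustrates the idea without being a complete defense.

```python
import re
import sqlite3

# Lexical heuristic: a statement that starts with DELETE or UPDATE and never mentions WHERE.
_UNSCOPED_WRITE = re.compile(r"^\s*(DELETE|UPDATE)\b(?!.*\bWHERE\b)", re.IGNORECASE | re.DOTALL)


def guarded_execute(conn: sqlite3.Connection, sql: str, params: tuple = ()):
    """Execute a parameterized statement, refusing broad DELETE/UPDATE without WHERE."""
    if _UNSCOPED_WRITE.match(sql):
        raise ValueError("refusing DELETE/UPDATE without a WHERE clause")
    # Values travel separately from the SQL text, so they are never parsed as SQL.
    return conn.execute(sql, params)
```
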
---

#### DB4. Connection Management

**Requirement:** MAY | **Weight:** Standard (×1)

Whether the database provides HTTP/REST access, managed connection pooling, and edge-compatible drivers. Agents running in serverless and edge environments (Vercel Edge, Cloudflare Workers) often cannot open or hold raw TCP connections, which makes HTTP-based access the most reliable path.

| Score | Description |
|-------|-------------|
| 0. Failing | TCP-only access; no connection pooling; no serverless-compatible drivers |
| 1. Basic | Managed connection pooling available; standard database drivers with connection management |
| 2. Good | HTTP/REST API available alongside TCP; serverless-compatible drivers; connection pooling with scale-to-zero |
| 3. Excellent | Auto-generated HTTP/REST API (e.g., PostgREST); WebSocket support for multi-statement transactions; edge-compatible drivers; scale-to-zero with sub-second cold starts |

---

### 10.4 Module: Hosting & Infrastructure

**Applies to:** Cloud platforms, PaaS, serverless, container services

| ID | Criterion | Req | Weight | Assessment |
|----|-----------|-----|--------|------------|
| HI1 | **Cost Guardrails** | SHOULD | Standard | Automated |
| HI2 | **Deployment Lifecycle Completeness** | SHOULD | Standard | AI-Assisted |
| HI3 | **Preview/Staging Deployments** | MAY | Standard | Automated |

---

#### HI1. Cost Guardrails

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the platform provides spending limits, auto-stop/scale-down, cost estimation, and usage tracking that agents can use programmatically. Agents lack cost intuition: without guardrails, they provision expensive resources and forget to deprovision them.

| Score | Description |
|-------|-------------|
| 0. Failing | No spending limits or cost controls; no usage tracking API; API defaults more permissive than console defaults |
| 1. Basic | Usage-based pricing with basic spending alerts; manual cost controls available via dashboard |
| 2. Good | Spending limits configurable via API; auto-stop for idle resources; usage tracking API; cost alerts with configurable thresholds |
| 3. Excellent | Cost estimation before deployment; per-project spending caps via API; auto-scale-down to zero when idle; real-time cost tracking; API defaults match or are more restrictive than console defaults |

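Cost estimation against a spending cap, as described at the Excellent level, reduces to simple arithmetic that a platform could expose before provisioning. The sketch below is illustrative only: the `check_budget` function, its parameters, and the flat hourly pricing model are assumptions.

```python
from decimal import Decimal


def check_budget(hourly_rate: str, hours: int, spent_so_far: str, cap: str) -> dict:
    """Estimate a deployment's cost and report whether it fits under the spending cap."""
    estimate = Decimal(hourly_rate) * hours
    projected = Decimal(spent_so_far) + estimate
    return {
        "estimate": str(estimate),
        "projected_total": str(projected),
        "allowed": projected <= Decimal(cap),
    }
```
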
---

#### HI2. Deployment Lifecycle Completeness

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the full deployment lifecycle (build, deploy, rollback, scale, log access, and environment management) is available via API, CLI, or MCP. An agent that can deploy but not rollback, or deploy but not read logs, has a dangerous capability gap.

| Score | Description |
|-------|-------------|
| 0. Failing | Dashboard-only deployment; no API or CLI for triggering deploys or reading logs |
| 1. Basic | Deploy and status check available via API/CLI; log access available but limited |
| 2. Good | Build, deploy, log access, and environment variable management via API; rollback available (redeploy previous version) |
| 3. Excellent | Full lifecycle via API/CLI/MCP: deploy, rollback, scale, streaming logs, environment management; MCP server covering read and write operations across the lifecycle |

---

#### HI3. Preview/Staging Deployments

**Requirement:** MAY | **Weight:** Standard (×1)

Whether the platform supports creating isolated preview/staging environments via API, including branch deployments, ephemeral environments, and automatic cleanup. Agents deploying directly to production without preview can create unrecoverable failures.

| Score | Description |
|-------|-------------|
| 0. Failing | No preview or staging environment support; all deployments go directly to production |
| 1. Basic | Manual staging environment available; preview deployments require dashboard configuration |
| 2. Good | Preview deployments creatable via API; branch-based deployments; rollback to previous deployment via API |
| 3. Excellent | Automatic preview deployment per branch/PR via API; ephemeral environments with automatic cleanup; instant rollback by promoting previous deployment; progressive rollout support (canary/blue-green) |

---

### 10.5 Module: Auth Providers

**Applies to:** Identity/authentication platforms (Auth0, Clerk, Firebase Auth, etc.)

| ID | Criterion | Req | Weight | Assessment |
|----|-----------|-----|--------|------------|
| AP1 | **Agent-as-End-User Support** | SHOULD | Standard | AI-Assisted |
| AP2 | **Social/External Connection API** | SHOULD | Standard | Automated |
| AP3 | **Token Architecture Transparency** | MAY | Standard | AI-Assisted |

---

#### AP1. Agent-as-End-User Support

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the auth platform supports flows where the end user is an agent, not a human with a browser. Standard OAuth redirects, "Approve" buttons, and email-based verification fail when the user is software. CIBA, Device Flow, Client Credentials, and dedicated agent identity types address this gap.

| Score | Description |
|-------|-------------|
| 0. Failing | All auth flows require browser-based interaction (redirects, consent screens); no machine-to-machine support |
| 1. Basic | Client Credentials grant supported for M2M authentication; basic service account support |
| 2. Good | Client Credentials + Device Flow or CIBA for async human approval; token vault or credential delegation for agents acting on behalf of users |
| 3. Excellent | Dedicated agent identity type (not retrofitted service accounts); credential vault with 35+ integrations; async authorization (CIBA) with push notification approval; scoped, time-bounded agent credentials with full audit trail |

---

#### AP2. Social/External Connection API

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether social/OAuth provider connections, redirect URIs, email templates, and session settings can be configured entirely via API, without requiring dashboard interaction. An agent bootstrapping auth for a new project must be able to complete setup programmatically.

| Score | Description |
|-------|-------------|
| 0. Failing | Social connections and auth configuration require dashboard-only setup; no Management API |
| 1. Basic | Core auth settings configurable via API; some provider setup (e.g., social connections) still requires dashboard |
| 2. Good | Social connections configurable via API for most providers; email templates accessible via API; redirect URI management via API |
| 3. Excellent | All configuration API-driven (social providers, email templates, branding, custom domains); 60+ social providers configurable via API; dynamic client registration support |

---

#### AP3. Token Architecture Transparency

**Requirement:** MAY | **Weight:** Standard (×1)

Whether the platform clearly documents delegation chains, token lifetimes, refresh semantics, and trust boundaries, and whether it supports emerging standards for agent-to-app authorization. Opaque token architectures prevent agents from reasoning about their own permissions and capabilities.

| Score | Description |
|-------|-------------|
| 0. Failing | Opaque token architecture; no documentation of token lifetimes, refresh semantics, or delegation chains |
| 1. Basic | Token lifetimes and refresh semantics documented; basic scope documentation |
| 2. Good | Delegation chain documentation; Rich Authorization Requests (RAR) support; fine-grained authorization (FGA); credential rotation via API |
| 3. Excellent | Support for agent-to-app protocols (XAA or equivalent); per-action authorization logging; dual-secret rotation without downtime; brokered credentials preventing LLM token exposure; EU AI Act-ready audit trail |

---

### 10.6 Module: Frameworks & Libraries

**Applies to:** Web frameworks, ORMs, UI libraries, build tools

| ID | Criterion | Req | Weight | Assessment |
|----|-----------|-----|--------|------------|
| FL1 | **Type System Quality** | SHOULD | Standard | Automated |
| FL2 | **Scaffolding & Code Generation** | MAY | Standard | Automated |
| FL3 | **Configuration Validation** | SHOULD | Standard | Automated |

---

#### FL1. Type System Quality

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the framework provides strong, expressive types (TypeScript types, Python type hints, or equivalent) that constrain agent-generated code at compile time. Type systems provide the tightest feedback loop for agents: milliseconds to detect errors versus seconds or minutes for runtime failures.

| Score | Description |
|-------|-------------|
| 0. Failing | No type definitions; untyped JavaScript, untyped Python, or equivalent; agents get no compile-time feedback |
| 1. Basic | Type definitions available (e.g., `@types/` package, basic type hints); core API surface typed |
| 2. Good | Comprehensive types across full API surface; generated types from schema (e.g., Prisma, GraphQL codegen); type inference support reducing annotation burden |
| 3. Excellent | Branded/nominal types preventing ID confusion (e.g., `BuildingID` vs. `CustomerID`); generated types with schema-first design; types covering edge cases and error states; type-check performance stable as schema grows |

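The branded-type idea from the Excellent level has a rough Python analogue in `typing.NewType`, sketched below with the `BuildingID` and `CustomerID` names from the table. The `load_customer` function is hypothetical. `NewType` is enforced only by a static type checker; it adds no runtime check.

```python
from typing import NewType

# Distinct ID types that share the same runtime representation (int).
BuildingID = NewType("BuildingID", int)
CustomerID = NewType("CustomerID", int)


def load_customer(customer_id: CustomerID) -> str:
    """Hypothetical lookup that should only ever receive a CustomerID."""
    return f"customer-{customer_id}"


building = BuildingID(7)
customer = CustomerID(7)

load_customer(customer)    # accepted by a static type checker
# load_customer(building)  # would be flagged by a static type checker: wrong ID type
```
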
---

#### FL2. Scaffolding & Code Generation

**Requirement:** MAY | **Weight:** Standard (×1)

Whether the framework provides CLI generators, project templates, and code scaffolding that work non-interactively. Agents should use scaffolding rather than building from scratch, but scaffolding tools that require interactive prompts (arrow-key menus, confirmation dialogs) are unusable by agents.

| Score | Description |
|-------|-------------|
| 0. Failing | No scaffolding tools; or scaffolding requires interactive prompts with no CLI flag bypass |
| 1. Basic | Project scaffolding CLI available; can generate basic project structure with default options via flags |
| 2. Good | Project + component/module generators; templates for common patterns; all prompts bypassable via CLI flags |
| 3. Excellent | Full non-interactive scaffolding with `--yes`/`--defaults` flags; generates project-specific configuration (e.g., `AGENTS.md`, type definitions); template library covering common patterns; generator output is immediately buildable/runnable |

---

#### FL3. Configuration Validation

**Requirement:** SHOULD | **Weight:** Standard (×1)

Whether the framework validates configuration files with actionable error messages and provides validation as a standalone command (not just at runtime). Agents generate configuration frequently and need immediate feedback on misconfiguration, not a runtime crash minutes later.

| Score | Description |
|-------|-------------|
| 0. Failing | No configuration validation; silent misconfiguration; runtime crashes on bad config with unhelpful errors |
| 1. Basic | Runtime validation with error messages on misconfiguration; configuration file format documented |
| 2. Good | JSON Schema for configuration files enabling editor validation; actionable error messages with suggested fixes; validation runs at startup before executing |
| 3. Excellent | Standalone validation command (`lint`, `check`, `validate`) runnable without starting the application; JSON Schema published for IDE/agent integration; error messages include specific fix suggestions; type-safe configuration with compile-time checking |

---

## Appendix A: Criteria Quick Reference

### Base Standard (15 criteria)

| ID | Criterion | Group | Req | Weight |
|----|-----------|-------|-----|--------|
| B1 | Machine-Readable Documentation Formats | Documentation & Usability | SHOULD | Critical |
| B2 | Code Example Coverage & Quality | Documentation & Usability | SHOULD | Standard |
| B3 | Documentation Structure & Self-Containment | Documentation & Usability | SHOULD | Standard |
| B4 | Documentation Accuracy & Synchronization | Documentation & Usability | SHOULD | Standard |
| B5 | Getting Started Completeness | Documentation & Usability | SHOULD | Standard |
| B6 | Changelog & Migration Guidance | Documentation & Usability | MAY | Standard |
| B7 | Installation & Configuration Simplicity | Documentation & Usability | SHOULD | Standard |
| B8 | Supply Chain Integrity | Safety | SHOULD | Standard |
| B9 | Vulnerability Disclosure & Security Contact | Safety | SHOULD | Standard |
| B10 | Project Sustainability | Lifecycle | SHOULD | Critical |
| B11 | Maintenance Health | Lifecycle | SHOULD | Standard |
| B12 | Semver Adherence & Version Stability | Lifecycle | SHOULD | Standard |
| B13 | Governance & Continuity | Lifecycle | MAY | Standard |
| B14 | Security Track Record | Lifecycle | SHOULD | Standard |
| B15 | Terms & Licensing Stability | Lifecycle | SHOULD | Standard |


### Module: Programmatic Interface (16 criteria)

| ID | Criterion | Req | Weight |
|----|-----------|-----|--------|
| PI1 | Interface Reference Completeness | MUST | Critical |
| PI2 | Tool/Endpoint Description Quality | MUST | Critical |
| PI3 | Tool Count & Surface Area Management | SHOULD | Critical |
| PI4 | Input Schema Design | SHOULD | Critical |
| PI5 | Output Quality & Token Efficiency | SHOULD | Standard |
| PI6 | Response Envelope Consistency | SHOULD | Standard |
| PI7 | Naming & Namespacing | SHOULD | Standard |
| PI8 | Behavioral Metadata & Annotations | SHOULD | Standard |
| PI9 | MCP Implementation Quality | MAY | Standard |
| PI10 | Programmatic Setup / TTFC | SHOULD | Critical |
| PI11 | API Workflow Coverage | SHOULD | Standard |
| PI12 | Versioning & API Stability | SHOULD | Standard |
| PI13 | SDK Availability & Quality | SHOULD | Standard |
| PI14 | Agent Protocol Availability | SHOULD | Standard |
| PI15 | Input Sanitization & Injection Resistance | MUST | Standard |
| PI16 | Prompt Injection Resistance | SHOULD | Standard |

### Module: Network Service (8 criteria)

| ID | Criterion | Req | Weight |
|----|-----------|-----|--------|
| NS1 | Error Response Quality & Structure | MUST | Critical |
| NS2 | Rate Limit Communication | SHOULD | Critical |
| NS3 | Health & Status Communication | SHOULD | Standard |
| NS4 | Audit & Observability | SHOULD | Standard |
| NS5 | Test/Sandbox Environment Support | SHOULD | Standard |
| NS6 | Environment Separation | SHOULD | Standard |
| NS7 | Asynchronous Operation Support | MAY | Standard |
| NS8 | Data Portability & Pricing Transparency | MAY | Standard |

### Module: Write Operations (4 criteria)

| ID | Criterion | Req | Weight |
|----|-----------|-----|--------|
| WO1 | Destructive Operation Safety | SHOULD | Critical |
| WO2 | Dry-Run / Validation Capability | MAY | Standard |
| WO3 | Idempotency & Safe Retry Support | SHOULD | Standard |
| WO4 | Workflow Error Communication | SHOULD | Standard |

### Module: Authentication (4 criteria)

| ID | Criterion | Req | Weight |
|----|-----------|-----|--------|
| AU1 | Non-Interactive Authentication Methods | MUST | Critical |
| AU2 | Permission Granularity | SHOULD | Standard |
| AU3 | Credential Lifecycle Management | SHOULD | Standard |
| AU4 | Agent Identity Support | MAY | Standard |

### Module: CLI (4 criteria)

| ID | Criterion | Req | Weight |
|----|-----------|-----|--------|
| CLI1 | Non-Interactive Execution | SHOULD | Standard |
| CLI2 | Structured Output Mode | SHOULD | Standard |
| CLI3 | Cross-Platform Consistency | SHOULD | Standard |
| CLI4 | Configuration Format Safety | MAY | Standard |

⚑ = Open-source/commercial split rubric.

### Domain Module: Payments & Financial (4 criteria)

| ID | Criterion | Req | Weight |
|----|-----------|-----|--------|
| PM1 | Idempotency Depth | SHOULD | Standard |
| PM2 | Test Simulation Fidelity | SHOULD | Standard |
| PM3 | Compliance Automation | MAY | Standard |
| PM4 | Currency & Amount Safety | SHOULD | Standard |

### Domain Module: Communications (3 criteria)

| ID | Criterion | Req | Weight |
|----|-----------|-----|--------|
| CM1 | Irreversibility Safeguards | SHOULD | Standard |
| CM2 | Delivery Verification | SHOULD | Standard |
| CM3 | Webhook/Event Infrastructure | SHOULD | Standard |

### Domain Module: Databases (4 criteria)

| ID | Criterion | Req | Weight |
|----|-----------|-----|--------|
| DB1 | Safe Experimentation | SHOULD | Standard |
| DB2 | Schema Introspection Quality | SHOULD | Standard |
| DB3 | Query Interface Safety | SHOULD | Standard |
| DB4 | Connection Management | MAY | Standard |

### Domain Module: Hosting & Infrastructure (3 criteria)

| ID | Criterion | Req | Weight |
|----|-----------|-----|--------|
| HI1 | Cost Guardrails | SHOULD | Standard |
| HI2 | Deployment Lifecycle Completeness | SHOULD | Standard |
| HI3 | Preview/Staging Deployments | MAY | Standard |

### Domain Module: Auth Providers (3 criteria)

| ID | Criterion | Req | Weight |
|----|-----------|-----|--------|
| AP1 | Agent-as-End-User Support | SHOULD | Standard |
| AP2 | Social/External Connection API | SHOULD | Standard |
| AP3 | Token Architecture Transparency | MAY | Standard |

### Domain Module: Frameworks & Libraries (3 criteria)

| ID | Criterion | Req | Weight |
|----|-----------|-----|--------|
| FL1 | Type System Quality | SHOULD | Standard |
| FL2 | Scaffolding & Code Generation | MAY | Standard |
| FL3 | Configuration Validation | SHOULD | Standard |

**Totals:**

- 15 base + 16 interface + 8 network + 4 write + 4 auth + 4 CLI = **51 base + complexity criteria**
- 20 domain-specific criteria across 6 modules (4+3+4+3+3+3)
- 5 MUST gates (PI1, PI2, PI15, NS1, AU1), all in complexity modules
- A tool may trigger multiple domain modules
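
The totals can be cross-checked with a short tally. This is only a sketch: the counts are copied from the tables in this appendix, and the function name `criteria_count` is hypothetical. It shows how the number of applicable criteria grows as a tool activates more modules.

```python
# Criteria counts copied from the tables in this appendix.
BASE = 15
COMPLEXITY = {
    "Programmatic Interface": 16,
    "Network Service": 8,
    "Write Operations": 4,
    "Authentication": 4,
    "CLI": 4,
}
DOMAIN = {
    "Payments & Financial": 4,
    "Communications": 3,
    "Databases": 4,
    "Hosting & Infrastructure": 3,
    "Auth Providers": 3,
    "Frameworks & Libraries": 3,
}
# Criteria marked MUST in the tables above.
MUST_GATES = ["PI1", "PI2", "PI15", "NS1", "AU1"]


def criteria_count(complexity_modules, domain_modules):
    """Number of criteria applied to a tool that activates the given modules."""
    return (
        BASE
        + sum(COMPLEXITY[name] for name in complexity_modules)
        + sum(DOMAIN[name] for name in domain_modules)
    )


# A locally used library activates no modules: Base Standard only.
print(criteria_count([], []))  # 15
# A tool that activates every complexity module and no domain module.
print(criteria_count(list(COMPLEXITY), []))  # 51
# Domain-specific criteria and the number of domain modules.
print(sum(DOMAIN.values()), len(DOMAIN))  # 20 6
```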

---


