
Agent Outcomes — What It Means

TL;DR: Agent Outcomes is a pattern where you define what “done” looks like (a rubric) and a separate grader evaluates whether the agent achieved it. The agent iterates until the criteria are met or the maximum number of attempts is reached. This turns open-ended conversations into goal-oriented work.

Simple Explanation

Normally, an AI agent works until it thinks it’s done. But “thinks it’s done” is subjective — the agent might stop too early or miss requirements.

Outcomes add structure:

  1. You define success criteria (the rubric)
  2. A separate grader evaluates the work (not the agent itself)
  3. The agent iterates if criteria aren’t met
  4. The process stops when the criteria are met OR max iterations are reached

It’s like having a QA reviewer built into the agent loop.

The Problem It Solves

Without outcomes:

User: "Build a financial model"
Agent: [works for a while]
Agent: "Done! Here's your model."
User: "But it's missing the sensitivity analysis..."
Agent: "Oh, let me add that..."
User: "And the formatting is wrong..."

With outcomes:

User: "Build a financial model"
Rubric:
- Contains DCF with 5-year projections
- Includes sensitivity analysis on discount rate
- Formatted with headers and cell references
Agent: [works]
Grader: "Missing sensitivity analysis" → needs_revision
Agent: [adds sensitivity analysis]
Grader: "All criteria met" → satisfied

How It Works (Technical)

In Claude Managed Agents, you send an outcome definition:

client.beta.sessions.events.send(
    session_id=session.id,
    events=[{
        "type": "user.define_outcome",
        "description": "Build a DCF model for Costco in .xlsx",
        "rubric": {"type": "text", "content": RUBRIC},
        "max_iterations": 5,  # default 3, max 20
    }],
)

The grader is a separate evaluation process:

  • Runs in its own context
  • Not influenced by the agent’s decisions
  • Returns per-criterion evaluation
  • Feedback passes back to agent

Evaluation Results

| Result | What Happens |
|---|---|
| satisfied | Work complete, session goes idle |
| needs_revision | Agent gets feedback, starts new cycle |
| max_iterations_reached | Final attempt, then idle |
| failed | Rubric doesn’t fit task (bad criteria) |
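If you implement the loop yourself, the four results above map naturally onto a small enum plus a dispatch function. A minimal sketch, where the enum and the action strings are illustrative and not part of any Anthropic API:

```python
from enum import Enum

class OutcomeResult(Enum):
    SATISFIED = "satisfied"
    NEEDS_REVISION = "needs_revision"
    MAX_ITERATIONS_REACHED = "max_iterations_reached"
    FAILED = "failed"

def next_action(result: OutcomeResult) -> str:
    """Map a grader result to what the session does next (per the table above)."""
    if result is OutcomeResult.SATISFIED:
        return "go idle: work complete"
    if result is OutcomeResult.NEEDS_REVISION:
        return "start new cycle with grader feedback"
    if result is OutcomeResult.MAX_ITERATIONS_REACHED:
        return "deliver final attempt, then go idle"
    return "surface error: rubric does not fit the task"
```

Modeling the states explicitly makes it easy to log each cycle's verdict for the audit trail discussed below.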

Writing Good Rubrics

The rubric is the most important part. Good rubrics are specific and verifiable:

✅ Good Criteria

  • “CSV contains a ‘price’ column with numeric values”
  • “All functions have docstrings explaining parameters”
  • “Report includes at least 3 competitor comparisons”
  • “Total sums to within 0.01 of expected value”

❌ Bad Criteria

  • “Data looks good” (subjective)
  • “Code is clean” (vague)
  • “Report is comprehensive” (unmeasurable)
  • “User would be satisfied” (unknowable)
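What separates the good criteria from the bad ones is that the good ones are mechanically checkable. As an illustration, the first good criterion above (a CSV contains a ‘price’ column with numeric values) can be verified in a few lines; the sample CSV strings here are invented for the example:

```python
import csv
import io

def check_price_column(csv_text: str) -> bool:
    """Verifiable criterion: CSV has a 'price' column and every value parses as a number."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows or "price" not in rows[0]:
        return False
    try:
        for row in rows:
            float(row["price"])
        return True
    except ValueError:
        return False

good = "item,price\nwidget,9.99\ngadget,12.50\n"   # passes
bad = "item,price\nwidget,cheap\n"                 # fails: non-numeric value
```

A criterion like “data looks good” admits no such check, which is exactly why it makes a poor rubric entry.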

Rubric Template

## Required Outputs
- [ ] [Specific deliverable 1]
- [ ] [Specific deliverable 2]
## Quality Criteria
- [ ] [Measurable quality standard 1]
- [ ] [Measurable quality standard 2]
## Constraints
- [ ] [Specific constraint or limit]
- [ ] [Format requirement]
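Filled in for the DCF task from earlier, the template might look like the string below. The specific criteria are illustrative assumptions, but a string like this is what a RUBRIC constant passed to the API call shown earlier could hold:

```python
# Hypothetical rubric for the Costco DCF example; criteria are illustrative.
RUBRIC = """
## Required Outputs
- [ ] .xlsx workbook containing a DCF model for Costco
- [ ] Sensitivity analysis tab on the discount rate

## Quality Criteria
- [ ] Projections cover 5 years with explicit growth assumptions
- [ ] All computed cells use formulas, not hard-coded values

## Constraints
- [ ] Formatted with headers and cell references
- [ ] Single workbook, no external data links
""".strip()
```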

Why a Separate Grader?

The key insight is separation of concerns:

| Self-Evaluation | Separate Grader |
|---|---|
| Agent judges own work | Independent evaluation |
| Prone to “I think I’m done” | Objective criteria check |
| Single perspective | Fresh perspective |
| Can rationalize shortcuts | No context of effort spent |

This is similar to why code review works — the author is too close to see issues.

Business Applications

Quality Assurance at Scale

  • Content generation with style guidelines
  • Data processing with validation rules
  • Report generation with completeness checks

Reducing Back-and-Forth

  • Agent self-corrects before human review
  • Fewer revision cycles
  • Consistent quality standards

Audit Trail

  • Clear record of what was requested
  • Evidence of criteria being met
  • Useful for compliance

Implementing Without Managed Agents

You can implement this pattern in DIY agents:

def agent_with_outcomes(task, rubric, max_iterations=3):
    for i in range(max_iterations):
        # Agent does work
        result = agent.execute(task)
        # Separate grader evaluates
        evaluation = grader.evaluate(result, rubric)
        if evaluation.satisfied:
            return result
        # Feed back to agent
        task = f"{task}\n\nPrevious attempt feedback: {evaluation.feedback}"
    return result  # Best effort after max iterations

The key is that the grader should be a separate LLM call with its own context, not just the agent evaluating itself.
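One way to keep that separation is to sketch the grader around a caller-supplied prompt-to-completion function, so it stays provider-agnostic. The function names, the prompt wording, and the JSON verdict format below are all assumptions for illustration, not a real SDK:

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evaluation:
    satisfied: bool
    feedback: str

def grade(result: str, rubric: str, llm: Callable[[str], str]) -> Evaluation:
    """Run the grader as a separate LLM call with its own context.

    `llm` is any prompt -> completion function (plug in your provider's
    client). The grader sees only the rubric and the work product, never
    the agent's conversation, so it cannot rationalize shortcuts.
    """
    prompt = (
        "You are a strict QA grader. Evaluate the work against each criterion.\n"
        f"Rubric:\n{rubric}\n\nWork product:\n{result}\n\n"
        'Reply with JSON: {"satisfied": bool, "feedback": "per-criterion notes"}'
    )
    verdict = json.loads(llm(prompt))
    return Evaluation(bool(verdict["satisfied"]), str(verdict.get("feedback", "")))

# Usage with a stubbed model; a real call would go to your LLM provider.
def fake_llm(prompt: str) -> str:
    return '{"satisfied": false, "feedback": "Missing sensitivity analysis"}'

evaluation = grade("draft model", "- [ ] sensitivity analysis", fake_llm)
```

Injecting the model call also makes the grader trivially testable, since a stub can stand in for the provider.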

Relationship to Other Concepts

| Concept | Relationship |
|---|---|
| glossary/ai-agent | Outcomes make agents goal-oriented |
| tools/claude-managed-agents | Native Outcomes support |
| automation/ai-agent-organization | Outcomes = one technique for reliability |
| glossary/prompt-engineering | Rubric writing is a form of prompting |

Key Takeaways

  • Outcomes turn agent conversations into goal-oriented work
  • A separate grader provides objective evaluation
  • Specific, verifiable rubrics are essential
  • Agents iterate until criteria are met or max attempts are reached
  • This pattern dramatically reduces revision cycles

Sources

  • Claude Managed Agents documentation (Anthropic, April 2026)
  • The Outcomes feature is currently in research preview