
Agent Outcomes — What It Means

TL;DR: Agent Outcomes is a pattern where you define what “done” looks like (a rubric) and a separate grader evaluates whether the agent achieved it. The agent iterates until the criteria are met or the maximum number of attempts is reached. This turns open-ended conversations into goal-oriented work.

Simple Explanation

Normally, an AI agent works until it thinks it’s done. But “thinks it’s done” is subjective — the agent might stop too early or miss requirements.

Outcomes add structure:

  1. You define success criteria (the rubric)
  2. A separate grader evaluates the work (not the agent itself)
  3. The agent iterates if criteria aren’t met
  4. The process stops when the criteria are met OR max iterations are reached

It’s like having a QA reviewer built into the agent loop.

The Problem It Solves

Without outcomes:

User: "Build a financial model"
Agent: [works for a while]
Agent: "Done! Here's your model."
User: "But it's missing the sensitivity analysis..."
Agent: "Oh, let me add that..."
User: "And the formatting is wrong..."

With outcomes:

User: "Build a financial model"
Rubric:
- Contains DCF with 5-year projections
- Includes sensitivity analysis on discount rate
- Formatted with headers and cell references
Agent: [works]
Grader: "Missing sensitivity analysis" → needs_revision
Agent: [adds sensitivity analysis]
Grader: "All criteria met" → satisfied

How It Works (Technical)

In Claude Managed Agents, you send an outcome definition:

client.beta.sessions.events.send(
    session_id=session.id,
    events=[{
        "type": "user.define_outcome",
        "description": "Build a DCF model for Costco in .xlsx",
        "rubric": {"type": "text", "content": RUBRIC},
        "max_iterations": 5,  # default 3, max 20
    }],
)

The grader is a separate evaluation process:

  • Runs in its own context
  • Not influenced by the agent’s decisions
  • Returns per-criterion evaluation
  • Feedback passes back to agent

Evaluation Results

| Result | What Happens |
|---|---|
| satisfied | Work complete, session goes idle |
| needs_revision | Agent gets feedback, starts new cycle |
| max_iterations_reached | Final attempt, then idle |
| failed | Rubric doesn’t fit task (bad criteria) |
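If you implement the loop yourself, the four results above map naturally onto a small enum plus a dispatch function. A minimal sketch, where the enum and the action strings are illustrative and not part of any Anthropic API:

```python
from enum import Enum

class OutcomeResult(Enum):
    SATISFIED = "satisfied"
    NEEDS_REVISION = "needs_revision"
    MAX_ITERATIONS_REACHED = "max_iterations_reached"
    FAILED = "failed"

def next_action(result: OutcomeResult) -> str:
    """Map a grader result to what the session does next (per the table above)."""
    if result is OutcomeResult.SATISFIED:
        return "go idle: work complete"
    if result is OutcomeResult.NEEDS_REVISION:
        return "start new cycle with grader feedback"
    if result is OutcomeResult.MAX_ITERATIONS_REACHED:
        return "deliver final attempt, then go idle"
    return "surface error: rubric does not fit the task"
```

Modeling the states explicitly makes it easy to log each cycle's verdict for the audit trail discussed below.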

Writing Good Rubrics

The rubric is the most important part. Good rubrics are specific and verifiable:

✅ Good Criteria

  • “CSV contains a ‘price’ column with numeric values”
  • “All functions have docstrings explaining parameters”
  • “Report includes at least 3 competitor comparisons”
  • “Total sums to within 0.01 of expected value”

❌ Bad Criteria

  • “Data looks good” (subjective)
  • “Code is clean” (vague)
  • “Report is comprehensive” (unmeasurable)
  • “User would be satisfied” (unknowable)
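What separates the good criteria from the bad ones is that the good ones are mechanically checkable. As an illustration, the first good criterion above (a CSV contains a ‘price’ column with numeric values) can be verified in a few lines; the sample CSV strings here are invented for the example:

```python
import csv
import io

def check_price_column(csv_text: str) -> bool:
    """Verifiable criterion: CSV has a 'price' column and every value parses as a number."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows or "price" not in rows[0]:
        return False
    try:
        for row in rows:
            float(row["price"])
        return True
    except ValueError:
        return False

good = "item,price\nwidget,9.99\ngadget,12.50\n"   # passes
bad = "item,price\nwidget,cheap\n"                 # fails: non-numeric value
```

A criterion like “data looks good” admits no such check, which is exactly why it makes a poor rubric entry.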

Rubric Template

## Required Outputs
- [ ] [Specific deliverable 1]
- [ ] [Specific deliverable 2]
## Quality Criteria
- [ ] [Measurable quality standard 1]
- [ ] [Measurable quality standard 2]
## Constraints
- [ ] [Specific constraint or limit]
- [ ] [Format requirement]
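Filled in for the DCF task from earlier, the template might look like the string below. The specific criteria are illustrative assumptions, but a string like this is what a RUBRIC constant passed to the API call shown earlier could hold:

```python
# Hypothetical rubric for the Costco DCF example; criteria are illustrative.
RUBRIC = """
## Required Outputs
- [ ] .xlsx workbook containing a DCF model for Costco
- [ ] Sensitivity analysis tab on the discount rate

## Quality Criteria
- [ ] Projections cover 5 years with explicit growth assumptions
- [ ] All computed cells use formulas, not hard-coded values

## Constraints
- [ ] Formatted with headers and cell references
- [ ] Single workbook, no external data links
""".strip()
```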

Why a Separate Grader?

The key insight is separation of concerns:

| Self-Evaluation | Separate Grader |
|---|---|
| Agent judges own work | Independent evaluation |
| Prone to “I think I’m done” | Objective criteria check |
| Single perspective | Fresh perspective |
| Can rationalize shortcuts | No context of effort spent |

This is similar to why code review works — the author is too close to see issues.

Business Applications

Quality Assurance at Scale

  • Content generation with style guidelines
  • Data processing with validation rules
  • Report generation with completeness checks

Reducing Back-and-Forth

  • Agent self-corrects before human review
  • Fewer revision cycles
  • Consistent quality standards

Audit Trail

  • Clear record of what was requested
  • Evidence of criteria being met
  • Useful for compliance

Implementing Without Managed Agents

You can implement this pattern in DIY agents:

def agent_with_outcomes(task, rubric, max_iterations=3):
    for i in range(max_iterations):
        # Agent does work
        result = agent.execute(task)
        # Separate grader evaluates
        evaluation = grader.evaluate(result, rubric)
        if evaluation.satisfied:
            return result
        # Feed back to agent
        task = f"{task}\n\nPrevious attempt feedback: {evaluation.feedback}"
    return result  # Best effort after max iterations

The key is that the grader should be a separate LLM call with its own context, not just the agent evaluating itself.
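One way to keep that separation is to sketch the grader around a caller-supplied prompt-to-completion function, so it stays provider-agnostic. The function names, the prompt wording, and the JSON verdict format below are all assumptions for illustration, not a real SDK:

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evaluation:
    satisfied: bool
    feedback: str

def grade(result: str, rubric: str, llm: Callable[[str], str]) -> Evaluation:
    """Run the grader as a separate LLM call with its own context.

    `llm` is any prompt -> completion function (plug in your provider's
    client). The grader sees only the rubric and the work product, never
    the agent's conversation, so it cannot rationalize shortcuts.
    """
    prompt = (
        "You are a strict QA grader. Evaluate the work against each criterion.\n"
        f"Rubric:\n{rubric}\n\nWork product:\n{result}\n\n"
        'Reply with JSON: {"satisfied": bool, "feedback": "per-criterion notes"}'
    )
    verdict = json.loads(llm(prompt))
    return Evaluation(bool(verdict["satisfied"]), str(verdict.get("feedback", "")))

# Usage with a stubbed model; a real call would go to your LLM provider.
def fake_llm(prompt: str) -> str:
    return '{"satisfied": false, "feedback": "Missing sensitivity analysis"}'

evaluation = grade("draft model", "- [ ] sensitivity analysis", fake_llm)
```

Injecting the model call also makes the grader trivially testable, since a stub can stand in for the provider.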

Relationship to Other Concepts

| Concept | Relationship |
|---|---|
| glossary/ai-agent | Outcomes make agents goal-oriented |
| tools/claude-managed-agents | Native Outcomes support |
| automation/ai-agent-organization | Outcomes = one technique for reliability |
| glossary/prompt-engineering | Rubric writing is a form of prompting |

Key Takeaways

  • Outcomes turn agent conversations into goal-oriented work
  • A separate grader provides objective evaluation
  • Specific, verifiable rubrics are essential
  • Agents iterate until criteria are met or max attempts are reached
  • This pattern dramatically reduces revision cycles

Sources

  • Claude Managed Agents documentation (Anthropic, April 2026)
  • The Outcomes feature is currently in research preview