Problem
Unit tests, linters, and typecheckers validate individual components but don't test agent workflows end-to-end. It's easy to create prompts that don't work well despite all underlying pieces being correct.
You need to validate that prompts and tools work together effectively as a system.
Solution
Implement workflow evals (simulations) that test complete agent workflows with mocked tools.
Core components (Sierra-inspired approach):
- Dual tool implementations: Every tool has both true and mock versions:

# True implementation - calls real APIs
def search_knowledge_base_true(query: str) -> str:
    return kb_api.search(query)

# Mock implementation - returns static/test data
def search_knowledge_base_mock(query: str) -> str:
    return TEST_KB_RESULTS.get(query, DEFAULT_RESULT)
- Simulation configuration: Each eval defines:
- Initial prompt: What the agent receives
- Metadata: Situation context available to harness
- Evaluation criteria: Success/failure determination
evals:
  - name: slack_reaction_jira_workflow
    initial_prompt: "Add a smiley reaction to the JIRA ticket in this Slack message"
    metadata:
      situation: "slack_message_with_jira_link"
    expected_tools:
      - slack_get_message
      - jira_get_ticket
      - slack_add_reaction
    evaluation_criteria:
      objective:
        - tools_called: ["slack_get_message", "jira_get_ticket", "slack_add_reaction"]
        - tools_not_called: ["slack_send_message"]
      subjective:
        - agent_judge: "Response was helpful and accurate"
- Dual evaluation criteria:
Objective criteria:
- Which tools were called
- Which tools were NOT called
- Tags/states added to conversation (if applicable)
Subjective criteria:
- Agent-as-judge assessments (e.g., "Was response friendly?")
- LLM evaluations of qualitative outcomes
- CI/CD integration: Run evals automatically on every PR:
# GitHub Actions workflow
on: pull_request
steps:
  - run: python scripts/run_agent_evals.py  # Posts results as PR comment
Eval execution flow:
1. Load eval configuration
2. Swap in mock implementations for all tools
3. Run agent with initial prompt + metadata
4. Track which tools agent calls
5. Evaluate against objective criteria (tool usage)
6. Run agent-as-judge for subjective criteria
7. Report pass/fail with details
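Step 4 (tracking which tools the agent calls) can be handled by wrapping the tool registry. Below is a minimal sketch under the same agent.run(prompt=..., tools=...) interface used later in this section; ToolCallRecorder is an illustrative name, not from the source:

from typing import Callable

class ToolCallRecorder:
    """Wraps a tool registry and records every tool the agent invokes."""

    def __init__(self, registry):
        self.registry = registry
        self.calls: list[str] = []

    def get_tool(self, tool_name: str) -> Callable:
        tool = self.registry.get_tool(tool_name)

        def tracked(*args, **kwargs):
            self.calls.append(tool_name)  # step 4: record the tool call
            return tool(*args, **kwargs)

        return tracked

After a run, recorder.calls holds the ordered list of tool names, which feeds the objective checks in steps 5-7.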
How to use it
Best for:
- Agent workflows where tools have side effects (APIs, databases)
- CI/CD pipelines requiring workflow validation
- Prompt engineering and optimization
- Regression testing for agent behavior changes
Implementation approach:
1. Create mock layer for tools:
class MockToolRegistry:
    def __init__(self, real_tools: dict, mocks: dict, mode: str = "mock"):
        self.real_tools = real_tools
        self.mocks = mocks
        self.mode = mode

    def get_tool(self, tool_name: str):
        if self.mode == "mock":
            return self.mocks[tool_name]
        return self.real_tools[tool_name]

# Register mock implementations
mocks = {
    "slack_send_message": mock_slack_send_message,
    "jira_create_ticket": mock_jira_create_ticket,
    # ...
}
2. Define eval cases:
evals = [
{
"name": "login_support_flow",
"prompt": "User can't log in, help them",
"expected_tools": ["user_lookup", "password_reset"],
"forbidden_tools": ["account_delete"],
"subjective_criteria": "Response was empathetic and helpful"
},
# ... more evals
]
3. Run and evaluate:
def run_eval(eval_config):
# Run agent with mocked tools
result = agent.run(
prompt=eval_config["prompt"],
tools=mock_registry
)
# Check objective criteria
tools_called = result.tools_used
passed = all(t in tools_called for t in eval_config["expected_tools"])
passed &= all(t not in tools_called for t in eval_config["forbidden_tools"])
# Check subjective criteria
if passed:
judge_prompt = f"""
Evaluate this agent response: {result.response}
Criteria: {eval_config['subjective_criteria']}
Pass/fail?
"""
passed = llm_evaluator(judge_prompt) == "PASS"
return {"passed": passed, "details": result}
4. Integrate with CI/CD:
# .github/workflows/agent_evals.yml
name: Agent Evals
on: pull_request
jobs:
evals:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- run: python scripts/run_evals.py --format github
- uses: actions/github-script@v6
with:
script: |
const results = require('./eval_results.json');
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: formatResults(results)
});
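The workflow expects scripts/run_evals.py to produce eval_results.json. A minimal sketch of such a script, assuming the evals list and run_eval function from steps 2-3 are importable; the module name and the --format handling are assumptions:

# scripts/run_evals.py (sketch)
import argparse
import json
import sys

from agent_evals import evals, run_eval  # assumption: steps 2-3 live in this module

def main() -> int:
    parser = argparse.ArgumentParser(description="Run agent workflow evals")
    parser.add_argument("--format", default="github")  # output hint for the PR comment formatter
    parser.parse_args()

    results = [
        {"name": cfg["name"], "passed": run_eval(cfg)["passed"]}
        for cfg in evals
    ]

    with open("eval_results.json", "w") as f:
        json.dump(results, f, indent=2)

    # Non-zero exit surfaces failures in CI, even when evals aren't used as a blocking gate
    return 0 if all(r["passed"] for r in results) else 1

if __name__ == "__main__":
    sys.exit(main())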
Handling non-determinism:
The source article notes that the evals work "not nearly as well as I had hoped" due to non-determinism:
- Strong signal: All pass or all fail
- Weak signal: Mixed results
- Mitigation: Retry failed evals (e.g., "at least once in three tries")
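A minimal sketch of the retry mitigation, wrapping run_eval from step 3 so an eval passes if it succeeds at least once within the allowed attempts; the function name and return shape are illustrative:

def run_eval_with_retries(eval_config, attempts: int = 3) -> dict:
    # Pass if the eval succeeds at least once in `attempts` tries ("at least once in three tries")
    last = None
    for attempt in range(1, attempts + 1):
        last = run_eval(eval_config)
        if last["passed"]:
            return {"passed": True, "attempts": attempt, "details": last["details"]}
    return {"passed": False, "attempts": attempts, "details": last["details"]}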
Trade-offs
Pros:
- End-to-end validation: Tests prompts + tools together as a system
- Fast feedback: Catch regressions before they reach production
- Safe testing: Mocked tools avoid side effects during testing
- Clear criteria: Both objective (tool calls) and subjective (quality) measures
- CI/CD integration: Automated validation on every PR
Cons:
- Non-deterministic: LLM variability makes flaky tests common
- Mock maintenance: Need to keep mocks synced with real tool behavior
- Prompt-driven fragility: Workflows that rely on prompts rather than code are flakier
- Not blocking-ready: Hard to use as a CI gate due to run-to-run variability
- Tuning overhead: Need continuous adjustment of prompts and mock responses
- Limited signal: Mixed pass/fail results provide ambiguous guidance
Operational challenges:
"This is working well, but not nearly as well as I had hoped... there's very strong signal when they all fail, and strong signal when they all pass, but most runs are in between."
"Our reliance on prompt-driven workflows rather than code-driven workflows introduces a lot of non-determinism, which I don't have a way to solve without... prompt and mock tuning."
Improvement strategies:
- Retry logic: "At least once in three tries" to reduce flakiness
- Tune prompts: Make eval prompts more precise and deterministic
- Tune mocks: Improve mock responses to be more realistic; keep synced with real tools
- Code over prompts: Move complex workflows from prompt-driven to code-driven
- Directional vs blocking: Use for context rather than CI gates
References
- Building an internal agent: Evals to validate workflows - Will Larson (2025)
- Sierra platform: Simulations approach for agent testing
- LangSmith Evaluation Platform - Tool tracking and custom evaluators
- Promptfoo - Mock API responses and assertion-based testing
- Related: Stop Hook Auto-Continue Pattern - Post-execution testing
- Related: Agent Reinforcement Fine-Tuning - Training on agent workflows