Problem
Unit tests, linters, and typecheckers validate individual components but don't test agent workflows end-to-end. It's easy to create prompts that don't work well despite all underlying pieces being correct.
You need to validate that prompts and tools work together effectively as a system.
Solution
Implement workflow evals (simulations) that test complete agent workflows with mocked tools.
Core components (Sierra-inspired approach):
- Dual tool implementations: Every tool has both true and mock versions:

# True implementation - calls real APIs
def search_knowledge_base_true(query: str) -> str:
    return kb_api.search(query)

# Mock implementation - returns static/test data
def search_knowledge_base_mock(query: str) -> str:
    return TEST_KB_RESULTS.get(query, DEFAULT_RESULT)
- Simulation configuration: Each eval defines:
- Initial prompt: What the agent receives
- Metadata: Situation context available to harness
- Evaluation criteria: Success/failure determination
evals:
  - name: slack_reaction_jira_workflow
    initial_prompt: "Add a smiley reaction to the JIRA ticket in this Slack message"
    metadata:
      situation: "slack_message_with_jira_link"
    expected_tools:
      - slack_get_message
      - jira_get_ticket
      - slack_add_reaction
    evaluation_criteria:
      objective:
        - tools_called: ["slack_get_message", "jira_get_ticket", "slack_add_reaction"]
        - tools_not_called: ["slack_send_message"]
      subjective:
        - agent_judge: "Response was helpful and accurate"
- Dual evaluation criteria:
Objective criteria:
- Which tools were called
- Which tools were NOT called
- Tags/states added to conversation (if applicable)
Subjective criteria:
- Agent-as-judge assessments (e.g., "Was response friendly?")
- LLM evaluations of qualitative outcomes
- CI/CD integration: Run evals automatically on every PR:
# GitHub Actions workflow
on: pull_request
steps:
  - run: python scripts/run_agent_evals.py  # Posts results as PR comment
Eval execution flow:
1. Load eval configuration
2. Swap in mock implementations for all tools
3. Run agent with initial prompt + metadata
4. Track which tools agent calls
5. Evaluate against objective criteria (tool usage)
6. Run agent-as-judge for subjective criteria
7. Report pass/fail with details
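Step 4 (tracking which tools the agent calls) can be handled by wrapping the tool registry. Below is a minimal sketch under the same agent.run(prompt=..., tools=...) interface used later in this section; ToolCallRecorder is an illustrative name, not from the source:

from typing import Callable

class ToolCallRecorder:
    """Wraps a tool registry and records every tool the agent invokes."""

    def __init__(self, registry):
        self.registry = registry
        self.calls: list[str] = []

    def get_tool(self, tool_name: str) -> Callable:
        tool = self.registry.get_tool(tool_name)

        def tracked(*args, **kwargs):
            self.calls.append(tool_name)  # step 4: record the tool call
            return tool(*args, **kwargs)

        return tracked

After a run, recorder.calls holds the ordered list of tool names, which feeds the objective checks in steps 5-7.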
How to use it
Best for:
- Agent workflows where tools have side effects (APIs, databases)
- CI/CD pipelines requiring workflow validation
- Prompt engineering and optimization
- Regression testing for agent behavior changes
Implementation approach:
1. Create mock layer for tools:
class MockToolRegistry:
    def __init__(self, real_tools: dict, mocks: dict, mode: str = "mock"):
        self.real_tools = real_tools
        self.mocks = mocks
        self.mode = mode

    def get_tool(self, tool_name: str):
        if self.mode == "mock":
            return self.mocks[tool_name]
        return self.real_tools[tool_name]

# Register mock implementations
mocks = {
    "slack_send_message": mock_slack_send_message,
    "jira_create_ticket": mock_jira_create_ticket,
    # ...
}
2. Define eval cases:
evals = [
{
"name": "login_support_flow",
"prompt": "User can't log in, help them",
"expected_tools": ["user_lookup", "password_reset"],
"forbidden_tools": ["account_delete"],
"subjective_criteria": "Response was empathetic and helpful"
},
# ... more evals
]
3. Run and evaluate:
def run_eval(eval_config):
# Run agent with mocked tools
result = agent.run(
prompt=eval_config["prompt"],
tools=mock_registry
)
# Check objective criteria
tools_called = result.tools_used
passed = all(t in tools_called for t in eval_config["expected_tools"])
passed &= all(t not in tools_called for t in eval_config["forbidden_tools"])
# Check subjective criteria
if passed:
judge_prompt = f"""
Evaluate this agent response: {result.response}
Criteria: {eval_config['subjective_criteria']}
Pass/fail?
"""
passed = llm_evaluator(judge_prompt) == "PASS"
return {"passed": passed, "details": result}
4. Integrate with CI/CD:
# .github/workflows/agent_evals.yml
name: Agent Evals
on: pull_request
jobs:
evals:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- run: python scripts/run_evals.py --format github
- uses: actions/github-script@v6
with:
script: |
const results = require('./eval_results.json');
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: formatResults(results)
});
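The workflow expects scripts/run_evals.py to produce eval_results.json. A minimal sketch of such a script, assuming the evals list and run_eval function from steps 2-3 are importable; the module name and the --format handling are assumptions:

# scripts/run_evals.py (sketch)
import argparse
import json
import sys

from agent_evals import evals, run_eval  # assumption: steps 2-3 live in this module

def main() -> int:
    parser = argparse.ArgumentParser(description="Run agent workflow evals")
    parser.add_argument("--format", default="github")  # output hint for the PR comment formatter
    parser.parse_args()

    results = [
        {"name": cfg["name"], "passed": run_eval(cfg)["passed"]}
        for cfg in evals
    ]

    with open("eval_results.json", "w") as f:
        json.dump(results, f, indent=2)

    # Non-zero exit surfaces failures in CI, even when evals aren't used as a blocking gate
    return 0 if all(r["passed"] for r in results) else 1

if __name__ == "__main__":
    sys.exit(main())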
Handling non-determinism:
The source article notes that the evals work "not nearly as well as I had hoped" due to non-determinism:
- Strong signal: All pass or all fail
- Weak signal: Mixed results
- Mitigation: Retry failed evals (e.g., "at least once in three tries")
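A minimal sketch of the retry mitigation, wrapping run_eval from step 3 so an eval passes if it succeeds at least once within the allowed attempts; the function name and return shape are illustrative:

def run_eval_with_retries(eval_config, attempts: int = 3) -> dict:
    # Pass if the eval succeeds at least once in `attempts` tries ("at least once in three tries")
    last = None
    for attempt in range(1, attempts + 1):
        last = run_eval(eval_config)
        if last["passed"]:
            return {"passed": True, "attempts": attempt, "details": last["details"]}
    return {"passed": False, "attempts": attempts, "details": last["details"]}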
Trade-offs
Pros:
- End-to-end validation: Tests prompts + tools together as a system
- Fast feedback: Catch regressions before they reach production
- Safe testing: Mocked tools avoid side effects during testing
- Clear criteria: Both objective (tool calls) and subjective (quality) measures
- CI/CD integration: Automated validation on every PR
Cons:
- Non-deterministic: LLM variability makes flaky tests common
- Mock maintenance: Need to keep mocks synced with real tool behavior
- Prompt-driven fragility: Workflows that rely on prompts rather than code are flakier
- Not blocking-ready: Hard to use as a CI gate due to run-to-run variability
- Tuning overhead: Need continuous adjustment of prompts and mock responses
- Limited signal: Mixed pass/fail results provide ambiguous guidance
Operational challenges:
"This is working well, but not nearly as well as I had hoped... there's very strong signal when they all fail, and strong signal when they all pass, but most runs are in between."
"Our reliance on prompt-driven workflows rather than code-driven workflows introduces a lot of non-determinism, which I don't have a way to solve without... prompt and mock tuning."
Improvement strategies:
- Retry logic: "At least once in three tries" to reduce flakiness
- Tune prompts: Make eval prompts more precise and deterministic
- Tune mocks: Improve mock responses to be more realistic; keep synced with real tools
- Code over prompts: Move complex workflows from prompt-driven to code-driven
- Directional vs blocking: Use for context rather than CI gates
References
- Building an internal agent: Evals to validate workflows - Will Larson (2025)
- Sierra platform: Simulations approach for agent testing
- LangSmith Evaluation Platform - Tool tracking and custom evaluators
- Promptfoo - Mock API responses and assertion-based testing
- Related: Stop Hook Auto-Continue Pattern - Post-execution testing
- Related: Agent Reinforcement Fine-Tuning - Training on agent workflows