Learning & Adaptation · Emerging

Agent Reinforcement Fine-Tuning (Agent RFT)

Train model weights end-to-end on agentic tasks via reinforcement learning with real tool calls and custom reward signals, optimizing for domain-specific tool use efficiency and multi-step reasoning performance.

By Nikola Balic (@nibzard)

01

Problem

After optimizing prompts and task design, agents may still underperform on your specific business tasks because:

  • Domain shift: Your tools and business context differ from what the base model was trained on
  • Inefficient tool use: Agents make too many tool calls or use the wrong tools, leading to high latency
  • Suboptimal reasoning: The model doesn't reason well across your specific tool outputs
  • Sample scarcity: Some domains (e.g., new GPU hardware, specialized finance) lack training data

Traditional supervised fine-tuning doesn't close this gap because it only imitates static examples: it can't train the agent end-to-end on multi-step tool interactions with your environment or reward it for the outcomes those interactions produce.

02

Solution

Agent Reinforcement Fine-Tuning (Agent RFT) trains the model weights end-to-end on agentic tasks by allowing the model to:

  1. Explore via actual tool calls: During training rollouts, the agent calls your real tool endpoints, learning from actual responses
  2. Receive custom reward signals: You define what "good" looks like via flexible graders (model-based, endpoint-based, or string-based)
  3. Learn multi-step reasoning: The agent learns to reason across tool outputs in the context window
  4. Optimize for your metrics: Reduce tool calls, improve accuracy, or balance both based on your reward function

Key Components:

  • Tool Endpoints: Host the same tools the agent uses in production; the model calls them during training rollouts
  • Grader Endpoint: Define custom reward logic that evaluates final answers and/or tool call traces
  • Unique Rollout IDs: Each training rollout gets a unique ID for state management across tool calls
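
The tool side can be an ordinary web service. Below is a minimal sketch of a hosted tool endpoint, assuming a FastAPI service and a request body carrying the rollout ID and tool arguments; the field names and the search stub are assumptions, not the provider's exact request schema.

# Hypothetical tool endpoint; request fields are assumed, not the official schema
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Per-rollout scratch space so multi-step calls within one rollout can share state
rollout_state: dict[str, dict] = {}

class ToolCall(BaseModel):
    rollout_id: str
    arguments: dict

def my_search_backend(query: str) -> list[str]:
    # Stub: swap in the same search logic the agent will use in production
    return [f"result for {query!r}"]

@app.post("/search")
async def search(call: ToolCall):
    # Async handlers help absorb the bursty traffic RFT sends at step boundaries
    state = rollout_state.setdefault(call.rollout_id, {"calls": 0})
    state["calls"] += 1
    query = call.arguments.get("query", "")
    return {"results": my_search_backend(query), "call_count": state["calls"]}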

Grader Design Best Practices:

  • Use gradient rewards: Provide 0-1 floating point scores rather than binary 0/1 for clearer learning signals
  • Prevent reward hacking: Evaluate reasoning process, not just final answers; detect "lucky guesses"
  • Align with domain knowledge: Measure grader-human consistency (e.g., Cohen's Kappa) before training
  • Multi-dimensional evaluation: Consider correctness, format compliance, efficiency, and safety (an endpoint-based grader sketch follows the setup code below)
# Agent RFT Training Setup
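# NOTE: parameter and field names in this sketch are illustrative; check the
# current Agent RFT API reference for the exact job, tool, and grader schema.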
from openai import OpenAI

client = OpenAI()

# 1. Define your tools with hosted endpoints
tools = [
    {
        "name": "search",
        "url": "https://your-tools.modal.run/search",
        "headers": {"Authorization": "Bearer YOUR_TOKEN"}
    },
    {
        "name": "read_file",
        "url": "https://your-tools.modal.run/read_file",
        "headers": {"Authorization": "Bearer YOUR_TOKEN"}
    }
]

# 2. Define your grader (model-based or endpoint-based)
grader = {
    "type": "model",  # or "endpoint" for custom grading logic
    "model": "gpt-4o",
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "grader_response",
            "schema": {
                "type": "object",
                "properties": {
                    "score": {"type": "number"},  # 0.0 to 1.0
                    "reasoning": {"type": "string"}
                }
            }
        }
    },
    "prompt": """
    Evaluate the agent's answer based on:
    1. Correctness vs ground truth
    2. Completeness of reasoning

    Ground truth: {ground_truth}
    Agent answer: {final_answer}

    Provide score (0-1) and reasoning.
    """
}

# 3. Start fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4o-2024-08-06",
    method="rft",
    rft={
        "tools": tools,
        "grader": grader,
        "hyperparameters": {
            "n_epochs": 3,
            "batch_size": 16,
            "compute_multiplier": 1  # Exploration factor
        }
    }
)
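
If the model-based grader above is not expressive enough, the grader can instead be an endpoint you host (the "endpoint" grader type). Below is a minimal sketch that applies the best practices listed earlier, returning a gradient 0-1 score and lightly penalizing excessive tool calls; the request and response field names are assumptions.

# Hypothetical endpoint-based grader; field names are assumed, not the official schema
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GradeRequest(BaseModel):
    rollout_id: str
    final_answer: str
    ground_truth: str
    tool_calls: list[dict]  # the rollout's tool-call trace

@app.post("/grade")
async def grade(req: GradeRequest):
    # Correctness: crude token-overlap proxy; swap in your own domain-specific check
    truth = set(req.ground_truth.lower().split())
    answer = set(req.final_answer.lower().split())
    correctness = len(truth & answer) / max(len(truth), 1)

    # Efficiency: mild penalty for excessive tool calls, capped at 0.2
    efficiency_penalty = min(0.02 * max(len(req.tool_calls) - 5, 0), 0.2)

    # Gradient reward in [0, 1], not a binary pass/fail
    score = max(min(correctness - efficiency_penalty, 1.0), 0.0)
    reasoning = f"overlap={correctness:.2f}, tool_calls={len(req.tool_calls)}"
    return {"score": score, "reasoning": reasoning}

Before training, compare the grader's scores with human judgments on a held-out slice (for example via Cohen's Kappa, as noted above) so the reward actually reflects domain correctness.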
03

How to use it

Prerequisites:

  • Well-specified, constrained task with consensus on correct answers
  • Non-zero baseline performance (model sometimes gets it right)
  • Quality training data (100-1000 samples, quality over quantity)
  • Hosted tool endpoints that mirror production behavior
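
For the training data itself, here is a hedged sketch of assembling and uploading a small JSONL dataset; the per-line fields (a messages list plus a reference answer for the grader) are an assumed shape, not the provider's documented schema.

# Hypothetical training-data preparation; the JSONL field layout is an assumption
import json
from openai import OpenAI

samples = [
    {
        "messages": [{"role": "user", "content": "Which config file sets the GPU clock limits?"}],
        "ground_truth": "power/gpu_clocks.yaml",
    },
    # ... 100-1000 high-quality samples, each with an agreed-upon correct answer
]

with open("train.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

client = OpenAI()
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
print(training_file.id)  # pass this as training_file= when creating the job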

Training Process:

  1. Baseline evaluation: Run your base model multiple times per sample to measure variance (see the sketch after this list)
  2. Host tools: Deploy tool endpoints (FastAPI, Modal, etc.) that handle bursty traffic
  3. Design grader: Create a reward function that is hard to game and provides a gradient, not just a binary, signal
  4. Monitor training: Watch reward curves, tool call distributions, and reasoning token counts
  5. Evaluate results: Compare fine-tuned model on validation set for accuracy and latency
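
For step 1, a minimal baseline-evaluation sketch, assuming the train.jsonl format from the data-preparation sketch above and a simple substring match as the correctness check; in practice you would run the full agent loop with its tools and reuse your grader.

# Hypothetical baseline evaluation: run the base model K times per sample
import json
import statistics
from openai import OpenAI

client = OpenAI()
K_RUNS = 4

def is_correct(answer: str, ground_truth: str) -> bool:
    # Assumption: substring match is meaningful for your task; reuse your grader if not
    return ground_truth.strip().lower() in answer.strip().lower()

samples = [json.loads(line) for line in open("train.jsonl")]

per_sample_rates = []
for sample in samples:
    hits = 0
    for _ in range(K_RUNS):
        response = client.chat.completions.create(
            model="gpt-4o-2024-08-06",  # same base model as the training job above
            messages=sample["messages"],
        )
        answer = response.choices[0].message.content or ""
        hits += is_correct(answer, sample["ground_truth"])
    per_sample_rates.append(hits / K_RUNS)

print(f"baseline pass rate: {statistics.mean(per_sample_rates):.2f}")
print(f"variance across samples: {statistics.pvariance(per_sample_rates):.3f}")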

What Agent RFT Optimizes:

  • ML Performance: Better final answer quality through improved reasoning and tool use
  • Latency: Fewer tool calls and reasoning tokens (e.g., reductions around 50% are common)
  • Sample Efficiency: Can achieve strong results with as few as 100 quality samples

Tool Call Optimization Patterns:

Models naturally learn to optimize tool use through exploration:

  • Parallelization: Make independent tool calls simultaneously rather than sequentially
  • Early termination: Stop exploration once sufficient information is gathered
  • Tool selection: Learn which tools are most effective for specific task types
graph TD A[Training Sample] --> B[Model Generates Rollout] B --> C{Tool Call?} C -->|Yes| D[Call Your Tool Endpoint] D --> E[Tool Response] E --> F[Add to Context] F --> B C -->|No| G[Final Answer] G --> H[Call Your Grader] H --> I[Reward Signal] I --> J[Update Model Weights] J --> K[Next Sample] style D fill:#fff3e0,stroke:#f57c00,stroke-width:2px style H fill:#e3f2fd,stroke:#1976d2,stroke-width:2px style J fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
04

Trade-offs

Pros:

  • End-to-end optimization: Trains the entire agent loop, not just final outputs
  • Sample efficient: Can work with 100-1000 samples vs millions for pre-training
  • Flexible rewards: Support for complex, multi-criteria grading logic
  • Natural speedups: Models learn to use fewer tokens and tool calls organically
  • Domain adaptation: Closes distribution gap between base model and your business context

Cons:

  • Infrastructure complexity: Must host robust tool and grader endpoints
  • Bursty traffic: Training sends hundreds of simultaneous requests at training-step boundaries
  • Grader design effort: Requires careful reward engineering to avoid gaming
  • Training cost: More expensive than supervised fine-tuning due to exploration
  • Debugging difficulty: Hard to trace why model learned certain behaviors
06

References