Category: Feedback Loops · Status: Proposed

Inference-Healed Code Review Reward

By Nikola Balic (@nibzard)

Cite This Pattern
APA
Nikola Balic (@nibzard) (2026). Inference-Healed Code Review Reward. In *Awesome Agentic Patterns*. Retrieved March 11, 2026, from https://agentic-patterns.com/patterns/inference-healed-code-review-reward
BibTeX
@misc{agentic_patterns_inference-healed-code-review-reward,
  title = {Inference-Healed Code Review Reward},
  author = {Nikola Balic (@nibzard)},
  year = {2026},
  howpublished = {\url{https://agentic-patterns.com/patterns/inference-healed-code-review-reward}},
  note = {Awesome Agentic Patterns}
}
01

Problem

Simple reward functions that only check for "all tests passed" fail to capture nuanced code quality issues (e.g., performance regressions, style violations, missing edge-case handling). A single binary signal at the end cannot guide the agent to produce maintainable, high-quality code.

  • Verifying only final correctness misses suboptimal commits (e.g., a patch that removes error handling but still passes tests).
  • Reward models that produce a single scalar lack explainability to tell the agent which aspect of the code needs improvement.
02

Solution

Use an inference-healed reward model—a code-review critic that:

1. Decomposes Code Quality into Subcriteria

  • Correctness: Does the code pass all existing and newly added tests?
  • Style: Are linters (e.g., ESLint, pylint) satisfied (zero or minimal warnings)?
  • Performance: Do simple benchmarks show no clear performance regressions?
  • Security: Does a static analyzer (e.g., Bandit, SonarQube) flag no critical issues?
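The four subcriteria above can be sketched as simple normalizers that map raw tool output to scores in [0, 1]. The tool choices follow the pattern, but each formula here is an illustrative assumption, not a fixed rubric; the 2× regression penalty is chosen so that the 50 ms → 65 ms example in the CoT step yields 0.4.

```python
# Illustrative normalizers for the four subcriteria; each formula is an
# assumption, not a prescribed rubric.

def correctness_score(tests_passed: int, tests_total: int) -> float:
    """Fraction of existing and newly added tests that pass."""
    return tests_passed / tests_total if tests_total else 0.0

def style_score(lint_warnings: int, max_warnings: int = 10) -> float:
    """1.0 at zero linter warnings, decaying linearly to 0.0."""
    return max(0.0, 1.0 - lint_warnings / max_warnings)

def performance_score(baseline_ms: float, new_ms: float) -> float:
    """Penalize runtime regressions at 2x: a 30% regression
    (50ms -> 65ms) scores 0.4, matching the CoT example."""
    regression = max(0.0, new_ms / baseline_ms - 1.0)
    return max(0.0, 1.0 - 2.0 * regression)

def security_score(critical_issues: int) -> float:
    """All-or-nothing: any critical static-analysis finding zeroes it."""
    return 1.0 if critical_issues == 0 else 0.0
```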

2. Runs Internal Chain-of-Thought (CoT) Reasoning

  • If uncertain about a subcriterion (e.g., performance), the critic runs a short CoT inside itself:
    "Step: performance check. Baseline runtime: 50ms. New code runtime: 65ms.
    Regression > 20%. Score: 0.4."
    
  • This "inference healing" allows the reward model to explain each sub-score.
  • Use smaller critic models (1–2B parameters) to keep CoT generation cost-efficient.
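Selective healing can be sketched as a loop that triggers a CoT pass only for low subscores. `critic_generate` below is a hypothetical stand-in for a small (1–2B parameter) critic model, stubbed here so the sketch runs standalone.

```python
# Sketch of "inference healing" with selective CoT: only subscores below
# a threshold trigger a chain-of-thought pass from the critic model.

HEAL_THRESHOLD = 0.8

def critic_generate(prompt: str) -> str:
    # Stub; in practice this would call the fine-tuned critic LLM.
    return "Step: performance check. Baseline 50ms, new 65ms. Score: 0.4."

def heal_subscores(subscores: dict) -> dict:
    """Generate a CoT justification for each subscore below the threshold."""
    explanations = {}
    for criterion, score in subscores.items():
        if score < HEAL_THRESHOLD:
            prompt = f"Explain step by step why '{criterion}' scored {score:.2f}."
            explanations[criterion] = critic_generate(prompt)
    return explanations
```

Healthy subscores skip CoT generation entirely, which is what keeps the critic cost-efficient at scale.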

3. Aggregates Subscores

  • Each subcriterion returns a float ∈ [0, 1].
  • A weighted sum (e.g., 0.4 × correctness + 0.2 × style + 0.2 × performance + 0.2 × security) yields the final code-review score.
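Applying the example weights to a sample set of subscores (the same values used in the feedback example in step 4) gives a final score of 0.76:

```python
# Worked aggregation with the example weights from the pattern.
weights = {"correctness": 0.4, "style": 0.2, "performance": 0.2, "security": 0.2}
subscores = {"correctness": 1.0, "style": 0.8, "performance": 0.4, "security": 0.6}

final_score = sum(weights[k] * subscores[k] for k in weights)  # = 0.76
```

Because the weights sum to 1.0 and every subscore lies in [0, 1], the final score is also guaranteed to lie in [0, 1].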

4. Generates Human-Readable Feedback

  • Alongside a numerical score, return a short analysis:
    {
      "correctness": 1.0,
      "style": 0.8,
      "performance": 0.4,
      "security": 0.6,
      "comments": "Performance regression due to O(n²) loop."
    }
    
03

How to use it

  • Critic Dataset Collection: Gather examples of good vs. bad code patches, labeled along each subcriterion.
  • Critic Training: Fine-tune a small LLM (e.g., 1–2B parameters) to produce sub-scores and CoT justifications.
  • Integration into RL Loop: Replace or augment the existing binary "tests-passed" reward with inference_healed_reward(patch).
  • Selective Healing: Generate CoT explanations only for subscores below a threshold (e.g., < 0.8) to reduce latency and cost.
  • Parallel Execution: Run tests, linters, benchmarks, and security scans concurrently to reduce total evaluation time.
  • Human-in-the-Loop Checkpoints: If a patch is borderline (e.g., final_score ∈ [0.5, 0.7]), route it for manual code review to generate better labels for future training.
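The Parallel Execution bullet above can be sketched with a thread pool; the scorer callables are hypothetical stand-ins for real test, linter, benchmark, and security-scan runners.

```python
# Run all subcriterion evaluations concurrently so total wall time
# approaches the slowest evaluator rather than the sum of all of them.
from concurrent.futures import ThreadPoolExecutor

def evaluate_patch(patch: str, scorers: dict) -> dict:
    """scorers maps criterion name -> callable(patch) -> float in [0, 1]."""
    with ThreadPoolExecutor(max_workers=len(scorers)) as pool:
        futures = {name: pool.submit(fn, patch) for name, fn in scorers.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

In practice each scorer would shell out to pytest, a linter, a benchmark harness, or a static analyzer; since those are I/O- and subprocess-bound, threads are sufficient here.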
04

Trade-offs

  • Pros:
    • Explainable Feedback: The agent knows why a patch scored poorly, allowing targeted improvements.
    • Higher Code Quality: Incorporates non-functional criteria (performance, security), leading to more robust code.
  • Cons/Considerations:
    • Compute Overhead: Each reward invocation may involve running tests, linters, benchmarks, and static analysis, adding latency.
    • Critic Maintenance: As coding standards or security rules evolve, retrain or update the critic models and rubrics.
05

Example

# Pseudo-code for one code-review reward invocation
def inference_healed_reward(patch):
    subscores = {
        "correctness": test_critic.score(patch),
        "style": linter_critic.score(patch),
        "performance": perf_critic.score(patch),
        "security": security_critic.score(patch),
    }
    final_score = sum(weights[k] * subscores[k] for k in subscores)
    comments = explain_low_subscores(subscores)  # CoT only for subscores below threshold
    return final_score, subscores, comments
06

References

  • Derived from "inference healing" in reward modeling, as discussed in the Open Source Agent RL talk (May 2025) and by Will Brown (Prime Intellect).

  • Similar principles in "Criterion-Led Reward Models" (DeepMind blog, April 2025).

  • Related academic work: "CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning" (NeurIPS 2022) introduces critic-based reward signals for code generation.

  • Industry validation: Multi-criteria code review deployed at scale by Microsoft (600K+ PRs/month), Tencent (325M lines/month), and Tekion (60% faster time to merge).

  • Primary source: https://www.youtube.com/watch?v=Xkwok_XXQgw