Problem
As AI models generate increasing amounts of code, the bottleneck in software development shifts from code generation to code verification and review. Ensuring that AI-generated code is not only syntactically correct but also semantically correct, aligns with the intended functionality (especially if underspecified), and meets quality standards becomes crucial and time-consuming.
Solution
Develop and employ AI-powered tools and processes specifically designed to assist humans in reviewing and verifying code, whether it's AI-generated or human-written. This can involve:
- AI agents that analyze code changes and highlight potential issues, bugs, or deviations from best practices.
- Tools that help summarize the intent or impact of code changes, making it easier for human reviewers to understand.
- Interactive systems where reviewers can ask the AI to explain parts of the code or justify certain decisions made during generation.
- Mechanisms to ensure the AI's output aligns with the user's "mind's eye" or high-level intent, even if the initial specification was ambiguous.
- Multi-agent approaches where one agent generates code while another critiques and verifies it, iterating until convergence.
- Three-layer workflows stratifying tasks by complexity: AI-only for style/documentation, AI-human collaboration for logic/security, and human-only for architectural decisions.
The goal is to make the code review process more efficient and reliable, building confidence in the (AI-assisted) codebase.
How to use it
- Integrate AI verification tools into the PR review process.
- Prompt agents to explain their generated code or provide rationales for changes.
- Focus human review on verifying alignment with high-level intent and business logic.
Trade-offs
- Pros: Reduces time spent on routine review tasks; enables consistent enforcement of coding standards; can identify issues humans miss through multi-agent debate and self-critique loops.
- Cons: Risk of hallucination (reviewing non-existent code or making uncited claims); high false positive rates can lead to alert fatigue; teams heavily adopting AI assistants have seen significant increases in PR review time due to the volume of AI-generated changes requiring verification.
References
-
Aman Sanger (Cursor) at 0:09:12: "So I think we're going to need to figure out how to make it easier for people to review code, how to to be confident that the agent's making the changes that are not just correct... was it actually what you had in your mind's eye? And so making the process of review much, much better, I think will be really, really important."
-
Primary source: https://www.youtube.com/watch?v=BGgsoIgbT_Y
-
"Evaluating Large Language Models for Code Review" (arXiv 2505.20206, May 2025): GPT-4o achieved 68.50% classification accuracy with problem descriptions; performance declines significantly without context
-
"Automated Code Review In Practice" (arXiv 2412.18531, December 2024): Industry case study examining LLM-based automated code review tools including Qodo, GitHub Copilot, and Coderabbit