An AI agent, designed to manage a software project, is reviewing a code change submitted by another agent. Its goal is to ensure the change is safe to merge. It reads the code, and its internal monologue sounds something like this: “The code looks reasonable. The logic seems correct. It appears to follow our style guide. I don’t see any obvious security vulnerabilities. I’ll approve it.”
The pull request is merged. A week later, the site goes down. A subtle race condition in the “reasonable” code, which was not apparent from a surface-level semantic reading, was triggered under load.
The failure here was not in the agent’s ability to reason, but in the task it was assigned. We asked a semantic, probabilistic tool (an LLM) to perform a deterministic, formal verification task. We asked it to feel whether the code was correct instead of proving it.
To build reliable multi-agent systems, especially those where agents review each other’s work, we must be disciplined about what we ask them to do. Review tasks must be separated into two distinct categories: deterministic gating and semantic review.
Large language models are masters of semantics. They have a powerful, intuitive grasp of what “good code” looks like. They can spot stylistic errors, comment on clarity, and even identify common anti-patterns. This is the “semantic review.” It’s the equivalent of a human code review, focusing on readability, maintainability, and architectural soundness. It is incredibly valuable.
However, an LLM cannot, by its nature, prove that code is correct. It cannot exhaustively trace every logical path, simulate every possible concurrent state, or formally verify that the code adheres to a specification. These tasks belong to a different class of tools: the deterministic gates.
Deterministic gates are tools that provide a binary, provably correct answer to a specific question.
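As a minimal sketch of this idea (the `run_gate` helper and the stand-in commands are illustrative, not from any particular tool), a deterministic gate can be modeled as a subprocess whose exit code *is* the answer:

```python
import subprocess
import sys

def run_gate(name: str, command: list[str]) -> bool:
    """Run a command and reduce its outcome to a binary PASS/FAIL."""
    result = subprocess.run(command, capture_output=True)
    passed = result.returncode == 0
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
    return passed

# Trivially passing and failing gates, as stand-ins for real tools
# such as a compiler, linter, or test runner.
ok = run_gate("noop-check", [sys.executable, "-c", "raise SystemExit(0)"])
bad = run_gate("failing-check", [sys.executable, "-c", "raise SystemExit(1)"])
```

The point is that the answer comes from the process's exit status, not from anyone's judgment: the same input always yields the same verdict.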
When we conflate these two types of review, we build fragile, unreliable systems. We ask the LLM to do the linter’s job, and it misses a formatting error. We ask it to do the test runner’s job, and it approves code with a failing test because the logic “seemed plausible.”
A dangerous anti-pattern is to set up a workflow where one agent’s work is approved or rejected based solely on the opinion of another agent. Agent A submits a pull request. Agent B is given the diff and a prompt: “Review this pull request. If it is high quality and correct, approve it. Otherwise, request changes.”
This is a recipe for disaster. Agent B’s approval is a semantic judgment, not a guarantee of quality. It’s a “vibe check” for code: a confident-sounding approval carries no proof that the code compiles, passes its tests, or is free of vulnerabilities, and the same diff may be approved on one run and rejected on the next.
A robust review process is a pipeline. The work must pass through a series of deterministic gates before it is ever presented to a semantic reviewer (human or AI).
The workflow for merging a change should look like this:
make build -> PASS/FAIL
make lint -> PASS/FAIL
make test -> PASS/FAIL
make scan-vulnerabilities -> PASS/FAIL

In this model, we are using tools for what they are good at.
In the world of AI, the old adage “trust, but verify” takes on a new meaning. We can trust our agents to be creative, to reason semantically, and to generate novel solutions. But we must verify their work with tools that are incapable of lying, guessing, or being “mostly sure.”
Stop asking your agents to perform tasks that require deterministic proof. Let the compiler be the compiler. Let the test runner be the test runner. Let your linter be your linter.
Reserve your powerful language models for the tasks that only they can do: understanding intent, providing architectural guidance, and judging the subjective elegance of a solution. By separating the provable from the plausible, we can build multi-agent systems that are not only fast and innovative but also robust, reliable, and worthy of our trust.
A reliable review process is not monolithic; it is a pipeline that separates provable, objective verification from subjective, semantic assessment. This separation of concerns is a foundational principle for trustworthy automation.
Separate Deterministic Gating from Semantic Review. A system’s first line of defense must be a series of deterministic gates—automated, binary checks like compilers, linters, and test suites. These tools answer objective questions of correctness. Only after work has passed through these gates should it be presented to a semantic reviewer (AI or human) for subjective assessment of clarity, maintainability, and strategic alignment.
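The ordering can be made explicit in the orchestration layer itself. In this hedged sketch, `semantic_review` is a stub standing in for an LLM call, and the checks are arbitrary callables; the point is only the control flow, in which the semantic reviewer is unreachable until every gate passes.

```python
def deterministic_gates_pass(checks) -> bool:
    """Run every binary check; any failure blocks the change outright."""
    return all(check() for check in checks)

def semantic_review(diff: str) -> str:
    """Stub for an LLM reviewer; only reached once the gates pass."""
    return f"Reviewing for clarity and architecture: {len(diff)} chars of diff"

def review(diff: str, checks) -> str:
    if not deterministic_gates_pass(checks):
        return "REJECTED: failed a deterministic gate"
    return semantic_review(diff)

diff = "def add(a, b):\n    return a + b\n"
verdict_fail = review(diff, [lambda: True, lambda: False])
verdict_pass = review(diff, [lambda: True, lambda: True])
```

Structuring the code this way means no prompt engineering can accidentally route unverified work to the subjective reviewer.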
Codify Policy in Tooling. Do not rely on an LLM’s interpretation of a prompt to enforce quality standards. If a standard can be checked by a linter, it must be enforced by the linter. If a behavior can be verified by a test, it must be covered by a test. Governance is most effective when it is encoded into the structure of the execution environment, not suggested in a natural language request.
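As a small illustration of encoding a standard rather than prompting for it, suppose a (hypothetical) project policy says identifiers used as slugs must be lowercase, hyphen-separated ASCII. Instead of asking a reviewer to “check the naming,” the rule becomes an executable check:

```python
import re

# Hypothetical policy: slugs are lowercase alphanumerics joined by hyphens.
SLUG_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def is_valid_slug(s: str) -> bool:
    """Deterministic, binary answer to 'does this name follow policy?'"""
    return bool(SLUG_RE.fullmatch(s))

def test_slug_policy():
    # These assertions run in CI; a violation fails the gate, every time.
    assert is_valid_slug("release-notes-2024")
    assert not is_valid_slug("Release Notes")
    assert not is_valid_slug("release_notes")

test_slug_policy()
```

Once the rule lives in a test, enforcement no longer depends on whether a reviewer (human or model) happens to notice the violation.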
Define the Scope of Review. The role of the semantic reviewer is not to re-do the work of the deterministic gates. Its purpose is to evaluate the aspects of the work that automated tools cannot, such as architectural elegance, alignment with long-term goals, or the exploration of alternative approaches. The prompt for a semantic review should explicitly state what has already been verified, freeing the LLM to focus on higher-level concerns.
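One way to put that into practice is to generate the review prompt from the pipeline's own results, so the reviewer is told exactly what has been proven and what remains. A minimal sketch (the helper name and wording are assumptions, not a prescribed format):

```python
def build_review_prompt(diff: str, verified: list[str]) -> str:
    """Compose a semantic-review prompt that lists what the gates
    already proved, so the reviewer can focus on higher-level concerns."""
    verified_lines = "\n".join(f"- {item}" for item in verified)
    return (
        "The following checks have already passed deterministically:\n"
        f"{verified_lines}\n\n"
        "Do NOT re-verify them. Review the diff below only for "
        "architectural soundness, clarity, and alignment with "
        "long-term goals.\n\n"
        f"{diff}"
    )

prompt = build_review_prompt(
    "def add(a, b):\n    return a + b\n",
    ["build", "lint", "unit tests", "vulnerability scan"],
)
```

Stating the verified facts up front narrows the reviewer's job to exactly the subjective questions that only it can answer.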