One Data Set, Four Different Scores: The AI Consensus Problem
In our previous installment, we looked at the brutal $10-per-query economics of AI-EDA. But even if the cost of inference drops to near zero, a deeper technical question remains: Can you actually trust the answer?
Continuing our series from the DVCon US ’26 Birds of a Feather session, Yatin Trivedi, Head of the Semiconductor Center of Excellence (CoE) at Capgemini Engineering, shared the results of a high-stakes experiment that serves as a reality check for the industry’s "AI-everything" race. It highlights a massive maturity gap that every verification lead needs to understand before integrating LLMs into a sign-off flow.
The Experiment: Grading the Plan
Yatin’s team conducted a controlled trial: they took a final design specification and a manually crafted, human-verified verification plan, then fed both documents to four leading AI platforms and asked each to "grade" the plan for completeness. Could the AI identify gaps in the testing strategy before tape-out?
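To make the setup concrete, here is a minimal sketch of how such a trial might be scripted. The prompt wording and the per-platform callables are our assumptions; the talk does not describe Capgemini's actual harness.

```python
# Hypothetical harness for the grading trial. The prompt text and the
# query callables are illustrative assumptions, not Capgemini's setup.
GRADING_PROMPT = (
    "Given the design specification and the verification plan below, "
    "grade the plan's completeness from 0 to 100 and list any gaps.\n\n"
    "SPECIFICATION:\n{spec}\n\nPLAN:\n{plan}"
)

def grade_plan(spec: str, plan: str, platforms: dict) -> dict:
    """Send the identical spec + plan to every platform; collect the replies."""
    prompt = GRADING_PROMPT.format(spec=spec, plan=plan)
    return {name: ask(prompt) for name, ask in platforms.items()}
```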
If these models were "production-ready," you would expect consensus. Instead, the results were a wake-up call:
- Platform A: 62%
- Platform B: 68%
- Platform C: 87%
- Platform D: 90%
A 28-point spread on the exact same data isn't a margin of error—it’s a fundamental difference in how these models interpret engineering logic. While each AI provided a detailed rationale referencing DDR interfaces and UVM structures, their conclusions on "completeness" were worlds apart.
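A quick back-of-the-envelope check makes the point. The four scores below are from the experiment; the statistics are our own illustration:

```python
# The four published "completeness" grades for the same plan.
from statistics import mean, stdev

scores = {"Platform A": 62, "Platform B": 68,
          "Platform C": 87, "Platform D": 90}

values = list(scores.values())
print(f"mean   = {mean(values):.1f}")               # 76.8
print(f"stdev  = {stdev(values):.1f}")              # 13.8
print(f"spread = {max(values) - min(values)} pts")  # 28
```

A standard deviation of nearly 14 points on identical inputs is not measurement noise; it is disagreement about what "complete" means.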
The Danger of the "Plan of Record"
The primary takeaway for engineering leads is the "Plan of Record" Trap. If a team relies on a single AI platform to validate its verification coverage, it is gambling. Choosing the "optimistic" 90% model might create a false sense of security, while the 62% model might trigger unnecessary rework.
"Do not use a single AI platform as your Plan of Record. If you rely solely on one model's feedback, you risk proceeding with incomplete information that leads to costly bugs during the tape-out phase."
The Efficiency Paradox: Why We Can’t Ignore It
Despite this lack of consistency, Yatin argues that engineers cannot afford to stay on the sidelines. Even with the maturity gap, the AI’s ability to flag under-exercised interfaces and other plan deficiencies helped his team improve the quality of their human-led plan. By not using these tools, companies are "leaving too much efficiency on the table."
The challenge is learning to use AI as a quality improver rather than a final judge.
The AsFigo Perspective: Normalizing the Maturity Gap
At AsFigo, we believe the solution to this "28-point spread" isn't just better prompts—it's Guardrails. When an AI gives a subjective grade, it’s just a "guess" based on text patterns. To turn that guess into a fact, the AI must be tethered to a logical floor.
- Cross-Referencing: Using multiple models to catch edge cases, then filtering those results through deterministic open-source engines such as SVALint or UCIS-based coverage tooling (sketched below).
- Objective Truth: Instead of asking an AI "is this complete?", we use AI to analyze regression logs (available late in the project) to check which assertions fired and which coverage points were actually hit (also sketched below).
- Human-in-the-Loop: AI bridges the gap for human experts, but human expertise remains the bridge for AI maturity.
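Here is the shape of the Cross-Referencing guardrail. This is a sketch, not an implementation: the model callables and the static_check hook (standing in for an engine like SVALint or a UCIS coverage query) are hypothetical.

```python
# Sketch of the cross-referencing guardrail. Each model flags candidate
# gaps; only gaps a deterministic engine can substantiate survive.
from typing import Callable

def cross_reference(plan_text: str,
                    models: list,
                    static_check: Callable[[str], bool]) -> set:
    """Union all model-flagged gaps, then keep only those a
    deterministic checker (linter, coverage query) confirms."""
    flagged = set()
    for ask in models:  # each model: str -> set of gap descriptions
        flagged |= ask(plan_text)
    return {gap for gap in flagged if static_check(gap)}
```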
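And the Objective Truth guardrail, where the question shifts from "is this complete?" to "what does the regression evidence say?" The log format below is invented for illustration; real simulator logs differ.

```python
# Illustrative regression-log scan: the log format is invented, but the
# idea holds -- assertion and coverage results are countable facts.
import re

ASSERT_RE = re.compile(r"ASSERTION (\w+): (PASS|FAIL)")
COVER_RE  = re.compile(r"COVERPOINT (\w+): hits=(\d+)")

def summarize_regression(log_text: str) -> dict:
    """Pull failing assertions and unhit coverpoints out of a log."""
    assertions = ASSERT_RE.findall(log_text)
    coverage   = COVER_RE.findall(log_text)
    return {
        "failing_assertions": [n for n, v in assertions if v == "FAIL"],
        "unhit_coverpoints":  [n for n, h in coverage if int(h) == 0],
    }

demo = "ASSERTION ddr_cas_timing: FAIL\nCOVERPOINT burst_len8: hits=0\n"
print(summarize_regression(demo))
# {'failing_assertions': ['ddr_cas_timing'], 'unhit_coverpoints': ['burst_len8']}
```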
Watch the Full Segment: In this 4-minute video, Yatin Trivedi breaks down the Capgemini experiment and explains why Tier-1 firms are moving toward a multi-model strategy to navigate the current "AI wild west."
Stay tuned for our next installment, where we explore how the "EDA License Gate" is being bypassed to build a massive global talent pipeline for the AI era.