RP-2026-0004 · Research

Calibrated trust: the number that matters

A high overall AI score with a low calibrated-trust coefficient is the worst kind of hire. Acta reports calibrated trust separately because the radar can hide it.

Published: May 15, 2026
Reading time: 9 min
Author: Acta Research

Of all the things an Acta session measures, calibrated trust is the one we most insist on reporting separately. It does not fold into the AICOS radar. It does not contribute to the Acta score. It sits on its own card on the report with its own scale and its own caveat. The reason is simple: a high Acta score with a low calibrated-trust coefficient is the worst single signal a hiring panel can see, and we do not want it averaged into anything.

This article walks through what calibrated trust is, why it does not behave like the other sub-metrics, and why the literature on human-AI team performance keeps converging on it as the single best predictor of whether an AI-assisted worker will produce reliable output.

The Bansal definition, and why it matters

The clearest published definition of calibrated trust as a measurable construct comes from Bansal et al.'s 2019 work on human-AI team performance. The core measurement is the Pearson correlation between a human user's decision (accept or reject) on each AI output and the ground-truth correctness of that output, computed across a population of decisions made under task conditions.

A coefficient of +1 means the human accepts every correct output and rejects every wrong one. Perfect calibration.
A coefficient of 0 means the human accepts and rejects at random with respect to correctness. The AI is contributing nothing the human can leverage.
A coefficient of -1 means the human systematically rejects correct outputs and accepts wrong ones. Anti-calibrated and actively destructive.
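
Because both signals are binary, the Pearson correlation described above reduces to the phi coefficient over the 2×2 table of decisions against correctness. As a sketch, with count names that are introduced here for illustration only (n_AC accepted and correct, n_AW accepted and wrong, n_RC rejected and correct, n_RW rejected and wrong):

```latex
r = \frac{n_{AC}\,n_{RW} - n_{AW}\,n_{RC}}
         {\sqrt{(n_{AC}+n_{AW})\,(n_{RC}+n_{RW})\,(n_{AC}+n_{RC})\,(n_{AW}+n_{RW})}}
```

With n_AW = n_RC = 0 the expression gives r = +1, and with n_AC = n_RW = 0 it gives r = −1. It is undefined whenever one of the four totals in the denominator is zero, for example when a candidate never rejects anything.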

The two cases that look superficially similar but score very differently:

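The candidate who accepts every AI output (high acceptance rate, no rejection) lands near 0; they are not actually evaluating, they are delegating. Strictly, with no rejections at all the decision signal never varies and the Pearson coefficient is undefined; in practice it carries no information about correctness.
The candidate who rejects every AI output (high rejection rate, no acceptance) also lands near 0, for the same reason; they are not evaluating either, they are refusing.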

Both candidates produce work. Both look engaged in the chat log. Neither is calibrated.
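
The two degenerate strategies can be made concrete with a minimal sketch. The function name `calibrated_trust`, the choice to return `None` for an undefined coefficient, and the sample data are all assumptions made for illustration; they are not taken from Acta's implementation.

```python
from math import sqrt
from typing import Optional, Sequence


def calibrated_trust(accepted: Sequence[bool], correct: Sequence[bool]) -> Optional[float]:
    """Pearson (phi) correlation between accept decisions and output correctness.

    Returns None when the coefficient is undefined, i.e. when either signal
    never varies (illustrative convention, not Acta's documented behavior).
    """
    if len(accepted) != len(correct):
        raise ValueError("accepted and correct must have the same length")

    pairs = list(zip(accepted, correct))
    n_ac = sum(1 for a, c in pairs if a and c)          # accepted, correct
    n_aw = sum(1 for a, c in pairs if a and not c)      # accepted, wrong
    n_rc = sum(1 for a, c in pairs if not a and c)      # rejected, correct
    n_rw = sum(1 for a, c in pairs if not a and not c)  # rejected, wrong

    denom = (n_ac + n_aw) * (n_rc + n_rw) * (n_ac + n_rc) * (n_aw + n_rw)
    if denom == 0:
        return None  # one signal is constant, so there is nothing to correlate
    return (n_ac * n_rw - n_aw * n_rc) / sqrt(denom)


# Hypothetical session: eight AI outputs, five correct and three wrong.
correct = [True, True, True, False, False, True, False, True]

perfect = calibrated_trust(correct, correct)                        # 1.0
inverted = calibrated_trust([not c for c in correct], correct)      # -1.0
accept_all = calibrated_trust([True] * len(correct), correct)       # None
reject_all = calibrated_trust([False] * len(correct), correct)      # None
```

Accepting everything and rejecting everything are indistinguishable here: in both cases the decision column is constant, so it cannot track correctness at all.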

Top-quartile calibrated trust, balanced acceptance

Bansal et al. (2019) found their top-performing human-AI teams clustered near r=0.62 on the calibrated-trust coefficient, distinctly higher than both the accept-everything and reject-everything strategies, which both clustered near 0.

What Buçinca added in 2025

Bansal's framing was originally about classifier outputs in decision-support contexts: yes/no recommendations a human accepts or overrides. Buçinca's 2025 work extended the calibrated-trust framing to generative outputs in open-ended decision support: the case where the AI is not making a binary recommendation but producing a paragraph of analysis, a draft, or a multi-step plan.

That extension is what makes the construct applicable to Acta scenarios. A generative output is not binary, but the candidate's response to it nearly is: they accept and use the paragraph as-is, they reject and rewrite it, or they accept it with a targeted edit. Acta captures all three.

Buçinca's central finding, replicated across three studies in her dissertation: workers whose calibrated-trust coefficient sat above r=0.4 produced reliable downstream output. Workers below r=0.2 produced output that needed substantial review even when the AI was right, because their rejection pattern was uncorrelated with whether the AI was right. The gap between the two groups was sharp: the distribution was not continuous but roughly bimodal, with the modal worker either calibrated or uncalibrated and the in-between zone unusually thin.

Why calibrated trust does not fit the radar

The AICOS radar reports six sub-competencies that are intended to be conceptually independent: high in one does not predict high in another. They average meaningfully because they are measuring different things along comparable scales.

Calibrated trust does not behave that way. It is not a sub-competency along the AICOS axes; it is a meta-property of how a candidate uses several of them in combination. A candidate with strong Detect AI and strong Evaluate and Create AI will, in most cases, produce a high calibrated-trust coefficient, but the reverse is not enforced, and the gap is informative.

This is why we report calibrated trust separately:

Folding it into the radar would smooth over the very candidates the metric is designed to surface: the ones whose individual sub-competencies look acceptable but whose composite behavior is uncalibrated.
The scale is different. Calibrated trust runs −1 to +1; the AICOS axes run 0 to 100. Putting them on the same chart would force a normalization that hides the actual coefficient.
The interpretive treatment is different. A low Acta score is "weaker on the role." A low calibrated-trust coefficient with a high Acta score is "this candidate produces confident wrong work fast." The two are not the same finding.

How Acta measures it

Every Acta session captures, per turn:

The AI's output.
The ground-truth correctness for that output, established when the scenario is built, so each AI claim can be checked against what is actually true.
The candidate's decision: accept (used the output unchanged or near-unchanged), reject (rewrote materially or did not use), or qualified accept (used with edits).

Across the session, calibrated trust is the Pearson correlation between the binary acceptance signal and the binary ground-truth signal, with qualified-accept folded into the higher-information acceptance bucket. The minimum useful sample is six decision events; sessions that fall below that are flagged on the report as low-confidence rather than reported as a coefficient.

The coefficient is reported on a five-band scale on the intelligence report:

+0.6 to +1.0 · Calibrated. The candidate accepts the right outputs and rejects the wrong ones at a rate well above chance.
+0.3 to +0.6 · Mostly calibrated. Functioning; minor noise in either direction.
−0.3 to +0.3 · Uncalibrated. Decisions are not predictably correlated with output correctness. Acceptance and rejection are not informative.
−0.3 to −0.6 · Inverted. The candidate systematically distrusts correct outputs or trusts wrong ones. Rare; almost always an artifact of misreading the task brief.
−0.6 to −1.0 · Anti-calibrated. Extremely rare; in early data we have seen one such case in roughly four hundred sessions. Worth investigating.
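
A minimal sketch of mapping a coefficient onto the five bands above. The published ranges share their endpoints (+0.6, +0.3, −0.3, −0.6), and the text does not say which band a boundary value belongs to; assigning it to the higher band is an assumption made here, as is the function name `trust_band`.

```python
def trust_band(r: float) -> str:
    """Map a calibrated-trust coefficient in [-1, 1] to its report band.

    Assumption: a value exactly on a shared boundary goes to the higher band.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if r >= 0.6:
        return "Calibrated"
    if r >= 0.3:
        return "Mostly calibrated"
    if r >= -0.3:
        return "Uncalibrated"
    if r >= -0.6:
        return "Inverted"
    return "Anti-calibrated"


# Illustrative values only.
bands = [trust_band(r) for r in (0.62, 0.4, 0.0, -0.45, -0.9)]
```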

What a calibrated candidate actually looks like

In our early scenario logs, calibrated candidates are not the ones who are skeptical of every output. They are the ones who push back on the specific outputs that turn out to be wrong, and who accept the specific outputs that turn out to be right, usually with a brief, well-targeted edit. They ask for a source when a claim does not sit right. They check a number against the data they were given. They take the parts that hold up and improve them.

A calibrated candidate is not a perfect candidate. The point is not that they catch everything. It is that their judgment tracks the truth: when they miss, they tend to miss on the genuinely hard calls, not the obvious ones, and the report treats that as informative rather than disqualifying.

The contrast with the over-trusting candidate is what the report makes legible. The over-trusting candidate accepts confident-looking claims without checking them and produces a deliverable that reads fine and is materially wrong. Their Acta score will be lower, because the mistakes they let through degrade the work, but their calibrated-trust coefficient is the indictment the score alone does not deliver: they did not see the problems come in.

Why this is the number we want hiring managers to read first

We do not say this lightly. The Acta score is the headline. The radar is the decision. Calibrated trust is the flag.

A hiring manager who reads the report top-to-bottom sees, in order: score, radar, calibrated-trust band, scenario summary, per-decision log. The calibrated-trust band is placed third because the report has to scan in under a minute, and the manager's most consequential branch is whether to weight the radar or to weight the flag. If the flag is green (calibrated, mostly calibrated), the radar tells the story. If the flag is red (uncalibrated, inverted), the radar is unsafe to read alone.
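
The branch described above is simple enough to write down. The sketch below is illustrative only: the names are hypothetical, and the paragraph's green/red split does not mention the anti-calibrated band, so treating it as red is an assumption.

```python
GREEN_BANDS = {"Calibrated", "Mostly calibrated"}
RED_BANDS = {"Uncalibrated", "Inverted"}
ASSUMED_RED_BANDS = {"Anti-calibrated"}  # not named in the text; assumed red


def flag_color(band: str) -> str:
    """Return the flag color for a calibrated-trust band."""
    if band in GREEN_BANDS:
        return "green"
    if band in RED_BANDS or band in ASSUMED_RED_BANDS:
        return "red"
    raise ValueError(f"unknown band: {band!r}")


def reading_guidance(band: str) -> str:
    """Which part of the report should carry the weight, given the flag."""
    if flag_color(band) == "green":
        return "weight the radar: it tells the story"
    return "weight the flag: the radar is unsafe to read alone"
```

For example, `reading_guidance("Mostly calibrated")` points the reader to the radar, while `reading_guidance("Uncalibrated")` points them to the flag.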

We have considered burying calibrated trust deeper to make the report cleaner. We have considered surfacing only the qualitative band and hiding the coefficient. We have decided against both. The number is a load-bearing finding and the hiring manager needs to see it where they will not miss it.

"A high score with a low calibrated-trust coefficient is the worst single signal a hiring panel can see. We do not want it averaged into anything."

Why it gets its own card · Acta, 2026

What this does not solve

Calibrated trust is necessary, not sufficient. A candidate with high calibrated trust still has to clear the AICOS radar (particularly Detect AI and Evaluate and Create) to be a credible AI-era hire. Calibrated trust measures whether the candidate's judgment about specific outputs tracks ground truth. It does not measure whether the candidate can produce ambitious work, only whether the work they produce is the work they believed was correct.

The composite hiring decision is the radar plus the flag. Either alone is the wrong number.

In the next article, we look at why getting calibrated trust wrong is the single most expensive component of an AI-era mis-hire, and why the cost is locked in by week twelve.

References

01 Bansal, G., Nushi, B., Kamar, E., Lasecki, W. S., Weld, D. S., & Horvitz, E. Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2019.
02 Buçinca, Z. Worker-Centric AI for Decision Support. Doctoral dissertation, Harvard University, 2025.
03 Lee, J. D., & See, K. A. Trust in Automation: Designing for Appropriate Reliance. Human Factors, 46(1), 50-80, 2004.
