Where Leadership Analytics Goes Wrong, and How to Fix It

The question that should come first

I started building in this space because the evidence base for better leadership decisions existed and nobody was using it properly. That has not changed. What has changed is how often I now see systems claiming to improve those decisions. 

More and more of them promise faster processing, wider coverage, more data, and better dashboards. Sometimes all of that is real. But none of it answers the question that should come first: what, exactly, does the evidence support, and where does the claim stop?

A system can be computationally sophisticated and still make weak claims. It can look transparent and still hide the choices that matter. I have sat in rooms where a board receives a beautifully presented leadership report, the data looks rich, the language sounds precise, and nobody around the table can say what was actually measured, how it was interpreted, or where the system chose not to answer. The report enters the succession discussion as though it were settled evidence. In practice, it is inference the board cannot inspect. That is not a technology problem. It is a governance failure.

I should be transparent about the position I am writing from. I built a system in this market. The principles in this article apply to my own work as much as to anyone else’s, and the methods appendix at the end applies the same standard to the system I designed. The five questions below are free. Any board can use them to evaluate any provider, including me.

Five questions every board should ask

A NomCo Chair can ask these in a procurement meeting. A General Counsel can build them into due diligence. A CHRO can use them to hold any provider accountable after deployment.

  1. What does the evidence actually support, and where does it stop? This forces the provider to separate description from inference. If the answer stays vague, the board has learned something important about the system and about the provider.


  2. When the evidence is insufficient, does the system say so? An abstaining system is often safer than a comprehensive one. If the provider cannot describe its abstention conditions, the output is overconfident by design.


  3. What would change the conclusion? If nothing would change it, the model is dogmatic. If anything would, it is unstable. A serious system should be able to name the new evidence, changed role requirements, or source corrections that would materially alter the result.


  4. What has the system been wrong about, and what did it do about it? Boards do not need perfection. They need an error culture. If the provider cannot describe its error analysis, corrections, or known weak spots, the system is learning less than it claims.


  5. Can the governance trail survive scrutiny? If a regulator, a general counsel, an audit committee, or a sceptical non-executive asked for the chain of evidence from source to claim, could the provider present it clearly enough to withstand pressure?

These questions work because they are simple enough for any board member to ask and specific enough that a weak answer is immediately visible. Systems that speak in broader, cleaner language often look easier to buy. They appear to remove ambiguity. In practice, they move it out of sight. Governance quality is often inversely related to rhetorical certainty.

The remainder of this article explains why these questions are necessary, by identifying the five failure modes that make them so.

Five failure modes the scholarship identified before the tools did

Academics identified these problems years before the current generation of analytics tools existed. I watched each one reappear, in operational form, in the systems boards were being asked to trust.

  1. Halo contamination

Rosenzweig identified the problem with uncomfortable precision: once a company is seen as successful, observers read that success back onto every feature of the organisation, including the quality of its leaders (Rosenzweig, 2007). Outcomes contaminate the judgments meant to explain them.

Leadership analytics can reproduce the same error in a technical register. When a model is built on labels shaped by status, promotion, compensation, or prior reputation, it may not be measuring leadership capability at all. It may be learning the organisation’s historical reward patterns and presenting them as objective insight. Promotion history, title progression, compensation growth, board exposure, and network centrality: all of these correlate with later advancement, but none necessarily reflect leadership quality. They may reflect access, sponsorship, inherited opportunity, or organisational taste.

If the target variable is contaminated by the outcome it claims to explain, the system gets very good at reproducing a past narrative. That is not intelligence. It is historical reinforcement with a better interface.
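
To make that test concrete, here is one diagnostic a reviewer could run before trusting any potential label. It is a minimal sketch on synthetic data, with illustrative column names rather than any provider’s actual fields: if status proxies alone recover the label this easily, the target is probably encoding the organisation’s reward history rather than leadership capability.

```python
# Leakage diagnostic: can status proxies alone reproduce the "potential" label?
# All data are synthetic and column names are illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "promotion_count": rng.poisson(2, n),
    "comp_growth": rng.normal(0.05, 0.03, n),
    "board_exposure": rng.integers(0, 2, n),
})
# A label deliberately driven by reward history, to show what contamination looks like.
signal = 0.8 * df["promotion_count"] + 20 * df["comp_growth"] + 1.2 * df["board_exposure"] - 3
df["high_potential"] = (signal + rng.normal(0, 1, n) > 0).astype(int)

status_proxies = ["promotion_count", "comp_growth", "board_exposure"]
auc = cross_val_score(
    LogisticRegression(max_iter=1000),
    df[status_proxies],
    df["high_potential"],
    cv=5,
    scoring="roc_auc",
).mean()
print(f"AUC from status proxies alone: {auc:.2f}")
# A high score here means the target is largely recoverable from reward history,
# which is the contamination Rosenzweig describes, now wearing a model's clothes.
```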

  2. Unfalsifiable claims

Pfeffer’s criticism of leadership research is not that leadership is unimportant. It is that too much of the field makes claims that are broad, flattering, and impossible to disconfirm. Antonakis and colleagues made a related point about charisma: studied intensely, still poorly defined, still poorly measured, still unable to inform practical decisions with any precision (Pfeffer, 2015; Antonakis et al., 2016).

That problem walks straight into applied systems. A provider says it measures leadership potential, executive presence, resilience, strategic gravitas, and future readiness. The board’s questions should be simple. Defined how? Against which observable indicators? Under what conditions would the claim fail? If those questions cannot be answered, the construct is doing reputational work, not evidential work.

  3. Method opacity

Simmons, Nelson, and Simonsohn showed that undisclosed flexibility in data collection and analysis can produce significant-looking findings from noise (Simmons et al., 2011). Their argument was about research practice, but the logic applies directly to commercial analytics.

What data were included and what was excluded? How was the target defined? Which benchmarks were chosen? What thresholds turned a messy distribution into a neat category? What narrative layer was added after the score appeared? Each individual choice can be reasonable. Together, they create enormous room for human judgment inside what presents itself as an objective output. The vendor knows how the system works. The board sees only the rendered conclusion.

That asymmetry matters. When a board uses an analytic output in a succession or promotion decision and cannot see the path from source to claim, it is trusting a process it cannot inspect. I have often watched boards skip that request. They assume the output has been through something rigorous. The presentation suggests it. The confidence of the language confirms it. But the governance question is not whether the provider sounds rigorous. It is whether the board can verify the rigour independently. If the answer is no, the board has a governance gap it may not recognise until the decision is challenged.
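
One way to close that gap is to record the path from source to claim as data rather than prose. The sketch below is illustrative only, not a description of any particular product: each claim carries the sources it rests on and the judgment steps applied to them, so a general counsel can walk the chain backwards when the decision is challenged.

```python
# An inspectable evidence chain: every claim keeps pointers back to its sources
# and the judgment steps applied along the way. Field names are illustrative.
from dataclasses import dataclass, field


@dataclass
class EvidenceStep:
    source: str   # e.g. "annual report 2023, p. 41" or "structured referee interview"
    method: str   # how the source was read: extraction, coding, benchmark, human review
    note: str = ""  # any judgment call made at this step


@dataclass
class Claim:
    statement: str
    strength: str  # e.g. "described", "inferred", "not evidenced"
    chain: list[EvidenceStep] = field(default_factory=list)

    def render_trail(self) -> str:
        lines = [f"CLAIM ({self.strength}): {self.statement}"]
        for i, step in enumerate(self.chain, 1):
            suffix = f" [{step.note}]" if step.note else ""
            lines.append(f"  {i}. {step.source} -> {step.method}{suffix}")
        return "\n".join(lines)


claim = Claim(
    statement="Candidate has led two regulated-market turnarounds",
    strength="described",
    chain=[
        EvidenceStep("public filings 2019 to 2023", "document extraction"),
        EvidenceStep("referee interview, March 2024", "structured coding",
                     "referee nominated by the candidate"),
    ],
)
print(claim.render_trail())
```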

  4. Confidence that outruns evidence

In medicine, the GRADE framework became influential because it insisted on separating the quality of the evidence from the strength of the recommendation (Guyatt et al., 2008). Strong decisions can be made under weak evidence, but only when the weakness is named and governed. Trouble starts when thin evidence is paired with overconfidence.

This is where I have most often seen leadership analytics fail the board it is meant to serve. The evidence may be sparse, indirect, noisy, or only partially relevant to the role in question. The output arrives as though the uncertainty has been resolved rather than managed. Ovadia and colleagues showed that even model uncertainty estimates themselves require scrutiny (Ovadia et al., 2019). Not every system that reports confidence does so honestly.

An abstention is awkward but governable. A confident answer based on weak evidence is convenient but indefensible. That is why abstention was a non-negotiable design principle from the start of my own work: a system that says "not evidenced" is behaving more responsibly than one that turns every thin signal into a conclusion. Boards should reward that discipline rather than treating it as a limitation. A system that knows its limits is more trustworthy than one that claims not to have any.
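
What governed abstention can look like is easier to show than to describe. The sketch below is a minimal illustration with made-up thresholds, not any vendor’s method: when coverage, construct match, or source agreement falls short, the output is "not evidenced" rather than a score.

```python
# Abstention as a first-class output: thin or conflicting evidence yields
# "not evidenced" rather than a confident score. Thresholds are illustrative.
from dataclasses import dataclass


@dataclass
class Evidence:
    coverage: float         # share of role-relevant indicators with any evidence (0 to 1)
    consistency: float      # agreement across independent sources (0 to 1)
    construct_match: float  # how closely the evidence maps to the stated construct (0 to 1)


MIN_COVERAGE = 0.6
MIN_CONSISTENCY = 0.5
MIN_MATCH = 0.7


def assess(evidence: Evidence, score: float) -> str:
    """Return a governed conclusion, abstaining when the evidence cannot carry one."""
    if evidence.coverage < MIN_COVERAGE:
        return "not evidenced: coverage too low to support a claim"
    if evidence.construct_match < MIN_MATCH:
        return "not evidenced: evidence does not map to the stated construct"
    if evidence.consistency < MIN_CONSISTENCY:
        return "not evidenced: sources conflict; refer to human review"
    qualifier = "strong" if evidence.consistency > 0.8 else "moderate"
    return f"claim ({qualifier} evidence): indicator score {score:.2f}"


print(assess(Evidence(coverage=0.4, consistency=0.9, construct_match=0.9), score=0.72))
print(assess(Evidence(coverage=0.8, consistency=0.85, construct_match=0.9), score=0.72))
```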

  5. Attribution drift

This problem connects directly to the CEO attribution question I began exploring earlier this month in What Boards Can and Cannot Attribute to a CEO. Hambrick’s upper echelons theory established that top managers shape organisational outcomes through their cognitive frames and strategic choices (Hambrick & Mason, 1984; Hambrick, 2007). But translating that broad theoretical claim into a specific measurement of individual impact is far harder than the systems I have reviewed typically acknowledge. Scholars still disagree about how much firm performance can be attributed to the chief executive once chance, context, and model specification are accounted for (Fitza, 2014; Fitza, 2017; Quigley & Graffin, 2017; Bennedsen et al., 2020).

If a mature academic literature using long-run firm data cannot settle the attribution question, boards should be deeply sceptical of any system that implies it can isolate individual leadership contribution from public signals and organisational outcomes. Attribution drift happens when a system starts with descriptive evidence and ends by implying individual responsibility. Where context, team effects, and inherited conditions carry a large share of the explanatory burden, the system should narrow its claim rather than inflate it.
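
Fitza’s point is easy to demonstrate. In the sketch below, firm performance is pure random noise, yet fitting CEO fixed effects over short tenures still appears to explain a visible share of variance. Everything in it is synthetic and illustrative; the exercise only shows why the measured size of the CEO effect is so sensitive to method.

```python
# Synthetic illustration of Fitza's argument: with short tenures, CEO fixed effects
# absorb variance even when performance is pure noise, so incremental R-squared
# overstates the "CEO effect". Nothing here is estimated from real firms or people.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for firm in range(100):
    for ceo in range(4):          # four CEOs per firm...
        for year in range(5):     # ...each serving five years
            rows.append({
                "firm": firm,
                "ceo": f"{firm}-{ceo}",
                "year": ceo * 5 + year,
                "perf": rng.normal(),   # performance is pure noise
            })
df = pd.DataFrame(rows)

r2_base = smf.ols("perf ~ C(firm) + C(year)", df).fit().rsquared
r2_ceo = smf.ols("perf ~ C(ceo) + C(year)", df).fit().rsquared  # CEO dummies nest firm dummies
print(f"Variance 'explained' by firm and year: {r2_base:.1%}")
print(f"Variance 'explained' once CEO dummies are added: {r2_ceo:.1%}")
print(f"Apparent CEO effect from noise alone: {r2_ceo - r2_base:.1%}")
```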

The Claim Boundary

The five failure modes above share a common root: the absence of a stated limit. Every failure mode describes a system that does something beyond what its evidence supports, without saying so. Halo contamination is a measurement claim that outruns the measurement. Unfalsifiable constructs are definition claims that cannot be tested. Method opacity is a process claim the board cannot verify. Overconfidence is a certainty claim the evidence does not warrant. Attribution drift is a responsibility claim the methodology cannot sustain. In each case, the missing object is the same: a published boundary between what the system can defensibly claim and what it cannot.

A Claim Boundary is that published statement: what a leadership system does claim, does not claim, and cannot yet claim on the evidence available.

The logic follows directly from the failures. If halo contamination means the target variable must be separated from the outcome (Rosenzweig), if unfalsifiable claims mean the construct must be defined sharply enough to fail (Pfeffer, Antonakis), if method opacity means the path from source to claim must be inspectable (Simmons), if overconfidence means the system must abstain when evidence is thin (GRADE, Ovadia), and if attribution drift means individual impact claims must be narrowed rather than inflated (Hambrick, Fitza, Quigley, Bennedsen), then the minimum credible response is a published boundary that makes all five disciplines visible.

So, how do you make it operational?

First, the construct must be named tightly enough that a serious reviewer can understand what is being assessed. Not "leadership quality" in the abstract, but a narrower object defined against observable indicators. If the construct would not survive a question from Pfeffer, it is not sharp enough.

Second, the evidence classes must be stated. Public record, structured inputs, documented track record, role-relevant outcomes, human review. If the board cannot see the source classes, it cannot judge whether the claim is proportionate.

Third, the intended use must be explicit. Widening a search universe is a different use from pressure-testing a shortlist, and the system should not slide between the two without disclosure.

Fourth, excluded uses must be equally explicit. Not a replacement for committee judgment. Not a covert inference about personal qualities. Not a sole basis for an employment decision.

Fifth, abstention conditions must be visible. Low coverage, weak construct match, role ambiguity, sparse evidence, conflicting signals: these are routine conditions in real leadership work. A system that never abstains is not confident. It is undisciplined.

And finally, the audit trail must exist in a form the board can inspect. Not because every board member wants to read it. Because a general counsel, a regulator, or a sceptical non-executive may need to.
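
None of this requires new technology. The six elements above can be published as a short, versioned document that travels with every report. The sketch below shows one possible shape, with illustrative field names and values rather than a prescribed format.

```python
# A Claim Boundary as a small, versionable governance object.
# Field names and values are illustrative, not a prescribed format.
from dataclasses import dataclass


@dataclass
class ClaimBoundary:
    construct: str                    # what is actually being assessed
    evidence_classes: list[str]       # where the evidence comes from
    intended_uses: list[str]
    excluded_uses: list[str]
    abstention_conditions: list[str]  # when the system will say "not evidenced"
    audit_trail_location: str         # where the source-to-claim chain can be inspected
    version: str = "0.1"


boundary = ClaimBoundary(
    construct=("Evidence of role-relevant leadership track record, defined against "
               "observable indicators agreed with the committee"),
    evidence_classes=["public record", "structured inputs", "documented track record",
                      "role-relevant outcomes", "human review"],
    intended_uses=["widening a search universe", "pressure-testing a shortlist"],
    excluded_uses=["replacement for committee judgment",
                   "covert inference about personal qualities",
                   "sole basis for an employment decision"],
    abstention_conditions=["low coverage", "weak construct match", "role ambiguity",
                           "sparse evidence", "conflicting signals"],
    audit_trail_location="appendix of each report, retained for inspection",
)
```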

The Claim Boundary is not a technical audit. It is a governance object. A NomCo Chair does not need to understand the model to ask whether the provider has published one. A General Counsel does not need to evaluate the algorithm to check whether the abstention conditions are stated. The discipline works precisely because it does not require the board to become technical. It requires the provider to become honest about the limits of what its technology can support.

Written by James Nash.

First published on Substack, co-published on inBeta.io. March 2026
Series: The Seven™, by James Nash © inBeta™ Ltd 2026. All rights reserved.

The Author

James Nash

James is the founder of inBeta. He has spent fifteen years working with boards and senior leadership teams at global and publicly listed companies on succession, talent, capability, and leadership governance. He holds executive education from Saïd Business School, University of Oxford, in Artificial Intelligence (including Audit and Ethics), Executive Leadership, Strategic Innovation, and Executive Finance. He founded inBeta because he kept watching boards make their most important decisions on instinct, narrative, and incomplete information, and believed the evidence base existed to do it differently. James is a certified AI Auditor, AI Ethicist, and AI Professional (CAIA, CAIE, CAIP; Oxethica), and a certified practitioner in CliftonStrengths (Gallup), Hogan (including PBC 360), FIRO-B, and Cultural Intelligence (CQC).

Methods appendix. This article forms part of my thinking on evidence-based leadership, a series of thoughts I’m surfacing from spring to summer 2026, arguing for a governance standard, not a single technical method. The appendix discloses the principles behind that standard at a level appropriate for board review. It does not disclose scoring formulae, thresholds, or controlled parameters.

Construct. The Claim Boundary. A published, inspectable statement of what a leadership system does claim, does not claim, and cannot yet claim on the evidence available. A governance object, not a statistical technique.

My intended use. To help boards, governance committees, CHROs, and general counsel evaluate the evidential discipline of leadership analytics systems in procurement, renewal, and oversight.

My excluded uses. My writing and thought leadership are my own and do not evaluate any specific vendor or product. This article does not prescribe a technical method. It does not claim that all leadership analytics is weak. It does not provide legal advice on AI Act compliance or employment law.

Abstention conditions. The standard I’ve written about applies to systems making claims about individual leaders in board-level decisions. It may not apply in the same form to low-stakes, non-individual, or purely descriptive analytics.

Source classes. Three classes of evidence. First, peer-reviewed research: halo contamination (Rosenzweig, 2007), unfalsifiable constructs (Pfeffer, 2015; Antonakis et al., 2016), analytic flexibility (Simmons et al., 2011), overconfidence (Guyatt et al., 2008; Ovadia et al., 2019), attribution (Hambrick & Mason, 1984; Hambrick, 2007; Fitza, 2014; Fitza, 2017; Quigley & Graffin, 2017; Bennedsen et al., 2020). Second, applied governance frameworks: model cards (Mitchell et al., 2019), internal auditing (Raji et al., 2020), hiring transparency (Raghavan et al., 2020), risk management (NIST, 2023). Third, regulatory reference: the EU AI Act and its high-risk classification structure. For leadership and employment uses specifically, high-risk obligations apply where the intended use matches a listed Annex III category; not every AI system used in an employment context qualifies automatically (European Union, 2024; European Commission, 2026).

Bibliography

Relevant to: Where Leadership Analytics Goes Wrong, and How to Fix It
Published: March 2026

Antonakis, J., Bastardoz, N., Jacquart, P., & Shamir, B. (2016). Charisma: An Ill-Defined and Ill-Measured Gift. Annual Review of Organizational Psychology and Organizational Behavior, 3, 293 to 319. https://doi.org/10.1146/annurev-orgpsych-041015-062305

Bennedsen, M., Pérez-González, F., & Wolfenzon, D. (2020). Do CEOs Matter? Evidence from Hospitalization Events. Journal of Finance, 75(4), 1877 to 1911. https://onlinelibrary.wiley.com/doi/abs/10.1111/jofi.12897

European Commission. (2025). AI Act application timeline. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai

European Commission. (2026). Navigating the AI Act. https://digital-strategy.ec.europa.eu/en/faqs/navigating-ai-act

European Union. (2024). Regulation (EU) 2024/1689, Artificial Intelligence Act. https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng

Fitza, M. A. (2014). The Use of Variance Decomposition in the Investigation of CEO Effects: How Large Must the CEO Effect Be to Rule Out Chance? Strategic Management Journal, 35(12), 1839 to 1852. https://doi.org/10.1002/smj.2192

Fitza, M. A. (2017). How Much Do CEOs Really Matter? Reaffirming That the CEO Effect Is Mostly Due to Chance. Strategic Management Journal, 38(3), 802 to 811. https://doi.org/10.1002/smj.2597

Guyatt, G. H., Oxman, A. D., Vist, G. E., Kunz, R., Falck-Ytter, Y., Alonso-Coello, P., & Schünemann, H. J. (2008). GRADE: An Emerging Consensus on Rating Quality of Evidence and Strength of Recommendations. BMJ, 336(7650), 924 to 926. https://doi.org/10.1136/bmj.39489.470347.AD

Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Hutchinson, B., Spitzer, E., Raji, I. D., Vasserman, L., & Gebru, T. (2019). Model Cards for Model Reporting. FAT* '19. https://doi.org/10.1145/3287560.3287596

NIST. (2023). AI Risk Management Framework 1.0. https://airc.nist.gov/airmf-resources/airmf/0-ai-rmf-1-0/

Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J. V., Lakshminarayanan, B., & Snoek, J. (2019). Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty under Dataset Shift. NeurIPS. https://proceedings.neurips.cc/paper/2019/file/8558cb408c1d76621371888657d2eb1d-Paper.pdf

Pfeffer, J. (2015). Leadership BS: Fixing Workplaces and Careers One Truth at a Time. HarperBusiness. https://www.amazon.com/Leadership-BS-Workplaces-Careers-Truth/dp/0062383167

Quigley, T. J., & Graffin, S. D. (2017). Reaffirming the CEO Effect Is Significant and Much Larger Than Chance: A Comment on Fitza (2014). Strategic Management Journal, 38(3), 793 to 801. https://doi.org/10.1002/smj.2503

Raghavan, M., Barocas, S., Kleinberg, J., & Levy, K. (2020). Mitigating Bias in Algorithmic Hiring: Evaluating Claims and Practices. FAT* '20. https://doi.org/10.1145/3351095.3372828

Raji, I. D., Smart, A., White, R. N., Mitchell, M., Gebru, T., Hutchinson, B., Smith-Loud, J., Theron, D., & Barnes, P. (2020). Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing. FAT* '20. https://doi.org/10.1145/3351095.3372873

Rosenzweig, P. (2007). The Halo Effect: How Managers Let Themselves Be Deceived. Simon and Schuster. https://www.simonandschuster.co.uk/books/The-Halo-Effect/Phil-Rosenzweig/9781471137167

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359 to 1366. https://doi.org/10.1177/0956797611417632
