
The Promise and Pitfalls of AI in Strategic Decision-Making

As artificial intelligence (AI) advances rapidly, its potential to transform strategic decision-making in business is becoming increasingly apparent. But how can strategists be sure their AI tools are getting it right? A recent study, “Generative Artificial Intelligence and Evaluating Strategic Decisions,” by Anil R. Doshi, Assistant Professor of Strategy and Entrepreneurship at the UCL School of Management; J Jason Bell, Associate Professor of Marketing at the University of Oxford’s Saïd Business School; Emil Mirzayev, a research fellow at the UCL School of Management; and Bart S. Vanneste, Associate Professor in Strategy and Entrepreneurship at the UCL School of Management, explores how generative AI, particularly large language models (LLMs), can be leveraged to evaluate strategic decisions such as selecting business models. Their findings reveal both the current limitations and the future promise of AI as a tool for strategic foresight.

The paper investigates generative AI’s use in strategic decision-making through two studies: Study 1 evaluates 60 AI-generated business models from various industries, while Study 2 assesses 60 models submitted to a business model competition. Business models were paired within industries and assessed by AI, human experts, and human non-experts. AI evaluations were aggregated across multiple LLMs (from Anthropic, Google, Meta, Mistral, and OpenAI), roles (e.g., founder, investor, industry expert), and prompts to measure the effects of diversity and scale. The approach emphasized systematic comparison through consistent pairwise evaluation: presenting two options and asking which business model was more likely to succeed.
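To make the setup concrete, here is a minimal Python sketch of an aggregation pipeline of this kind. It is not the authors’ code: query_llm is a hypothetical stand-in for a call to one of the LLM APIs (stubbed with a coin flip here), and the model, role, and prompt lists are illustrative.

import itertools
import random

# Hypothetical stand-in for an API call to one of the LLMs in the study;
# a real implementation would send both business model descriptions and
# parse the model's answer. Stubbed with a coin flip for illustration.
def query_llm(model: str, role: str, prompt: str,
              option_a: str, option_b: str) -> str:
    return random.choice(["A", "B"])

MODELS = ["anthropic", "google", "meta", "mistral", "openai"]
ROLES = ["founder", "investor", "industry expert"]
PROMPTS = ["base", "chain-of-thought"]

def aggregate_pairwise(option_a: str, option_b: str) -> float:
    """Fraction of individual evaluations preferring A over B,
    aggregated across every (LLM, role, prompt) combination."""
    votes = [
        query_llm(m, r, p, option_a, option_b)
        for m, r, p in itertools.product(MODELS, ROLES, PROMPTS)
    ]
    return votes.count("A") / len(votes)

Repeating each combination many times, and swapping the order of the two options across repetitions, is what gives the aggregation its scale and guards against the order effects discussed below.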

Key Insight: AI Bias and Inconsistency

“We find that individual generative AI evaluations are inconsistent and biased.” [1]

The researchers found that individual AI evaluations (single pairwise comparisons) often produced inconsistent results. The order in which the business models were presented could change the AI’s choice, and there were systematic biases toward selecting either the first or the second option.

In Study 1, for example, the highest consistency—that is, when the evaluation of business models A and B yielded the same prediction as the evaluation of B and A—among LLMs was 80.9%, achieved by GPT-4 Turbo using the chain-of-thought prompt. Other models showed significantly lower consistency, such as Claude 2 with the base prompt, which reached just 42.2%. Similarly, in Study 2, consistency varied widely, ranging from 29.9% for GPT-3.5 with the chain-of-thought prompt to 78.1% for Llama 3 with the base prompt.
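The consistency measure is mechanical to compute. The sketch below is a hypothetical reconstruction, not the paper’s code: evaluate(x, y) stands in for a single LLM judgment that returns which position ("first" or "second") it prefers.

def consistency_rate(pairs, evaluate):
    """Share of pairs where evaluating (A, B) and the swapped (B, A)
    name the same winner, plus the share of 'first' picks as a
    simple signal of positional (order) bias."""
    consistent = 0
    first_picks = 0
    for a, b in pairs:
        forward = evaluate(a, b)   # winner when a is shown first
        backward = evaluate(b, a)  # winner when b is shown first
        # Map positional answers back to the underlying business model.
        winner_fwd = a if forward == "first" else b
        winner_bwd = b if backward == "first" else a
        consistent += winner_fwd == winner_bwd
        first_picks += (forward == "first") + (backward == "first")
    return consistent / len(pairs), first_picks / (2 * len(pairs))

# Example with a maximally order-biased stub that always picks the
# first option shown: it is never consistent under order swapping.
print(consistency_rate([("A1", "B1"), ("A2", "B2")],
                       lambda x, y: "first"))  # (0.0, 1.0)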

Key Insight: Aggregating AI Evaluations Improves Accuracy

“[A]ggregating these [individual] evaluations results in increased agreement with human experts.” [2]

While individual AI evaluations were problematic, the researchers discovered that aggregating multiple AI evaluations produced results that aligned more closely with human expert judgments. In both studies, the comprehensive AI evaluator achieved a Pearson correlation (1) of about 0.67 with human expert rankings, indicating a strong positive linear relationship. The Spearman correlation (2) in Study 1 was lower, at 0.463, similar to that of human non-experts, but higher in Study 2, at 0.72.
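For readers who want to run this kind of comparison themselves, both statistics are available in SciPy. The snippet below uses made-up win proportions for ten business models, not the study’s data.

from scipy.stats import pearsonr, spearmanr

# Illustrative win proportions (fabricated for the example): the share
# of pairwise comparisons each business model won, as judged by the
# aggregated AI evaluator and by a human expert panel.
ai_wins     = [0.90, 0.70, 0.65, 0.60, 0.55, 0.50, 0.40, 0.35, 0.20, 0.10]
expert_wins = [0.80, 0.75, 0.60, 0.70, 0.50, 0.45, 0.40, 0.30, 0.25, 0.15]

pearson, _ = pearsonr(ai_wins, expert_wins)    # linear association
spearman, _ = spearmanr(ai_wins, expert_wins)  # rank-order association
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")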

The study also used “top choice” and “bottom choice” measures to pick overall winners and losers. These metrics are particularly relevant when the primary goal of the evaluation process is to select the most promising option, as in venture capital funding or incubator programs where identifying and supporting winners is key. In Studies 1 and 2, the aggregated AI evaluation matched human experts in 5 and 4 out of 10 industries, respectively, while human non-experts matched in only 2 of 10 in Study 1.
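Computing that agreement per industry is simple. Here is a small illustrative helper (hypothetical, not from the paper), given win proportions keyed by business model name:

def top_bottom_match(ai_scores: dict, expert_scores: dict) -> tuple:
    """Within one industry, check whether AI and human experts agree on
    the single best ('top choice') and worst ('bottom choice') model."""
    ai_top = max(ai_scores, key=ai_scores.get)
    ai_bottom = min(ai_scores, key=ai_scores.get)
    ex_top = max(expert_scores, key=expert_scores.get)
    ex_bottom = min(expert_scores, key=expert_scores.get)
    return ai_top == ex_top, ai_bottom == ex_bottom

ai = {"model 1": 0.8, "model 2": 0.5, "model 3": 0.2}
experts = {"model 1": 0.7, "model 2": 0.6, "model 3": 0.1}
print(top_bottom_match(ai, experts))  # (True, True)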

Key Insight: Diversity and Scale Both Contribute to Improved AI Performance

“The wisdom of the crowd, or the benefit of aggregating predictions, depends on two mechanisms: the crowd’s diversity and scale.” [3]

The study also examined how diversity and scale influence the effectiveness of AI evaluations. Diversity, achieved by aggregating outputs from multiple LLMs, roles, and prompts, modestly improved alignment with human experts. Scale, increasing the number of evaluations aggregated, had a more substantial impact on agreement. The comprehensive AI evaluator, which combined diversity and scale, outperformed the others. The findings suggest that while diversity offsets errors through varied perspectives, scaling consistently enhances the predictive accuracy of aggregated AI evaluations.
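The scale effect follows classic wisdom-of-the-crowd logic: if each individual evaluation beats chance, a majority vote over more evaluations is right more often. The toy simulation below illustrates this; the 60% per-evaluation accuracy is an assumption chosen for illustration, not a figure from the study.

import random

def majority_accuracy(n_evals: int, accuracy: float = 0.6,
                      trials: int = 10_000) -> float:
    """Toy simulation: each evaluation independently picks the 'true'
    winner with probability `accuracy`; the aggregate takes a majority
    vote over n_evals evaluations (odd n avoids ties)."""
    wins = 0
    for _ in range(trials):
        votes = sum(random.random() < accuracy for _ in range(n_evals))
        wins += votes > n_evals / 2
    return wins / trials

for n in (1, 5, 25, 125):
    print(n, round(majority_accuracy(n), 3))  # accuracy rises with scale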

Why This Matters

For business leaders, this research highlights both the promise and the pitfalls of AI in strategic decision-making. To leverage AI effectively, businesses should aggregate diverse, large-scale AI evaluations rather than relying on a single model’s single answer. The aggregated output can provide valuable data-driven insights to be weighed alongside human judgment and expertise. As AI continues to evolve, it will be increasingly able to augment human decision-making, offering a competitive edge by improving the quality and efficiency of critical business strategies.

Footnotes

(1) The Pearson correlation measures the linear association between two continuous variables. In this study, it reflects the strength and direction of the relationship between the win proportions assigned to business models by AI and by human experts.

(2) The Spearman correlation measures the monotonic association between two variables. In this study, it examines the similarity in the rankings of business models based on their win proportions as assigned by AI and by human experts. This correlation considers only the rank order of the business models and not the actual magnitude of the differences in win proportions.
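In standard notation (the formulas are supplied here for reference; they do not appear in the article), with x_i and y_i the win proportions assigned by AI and by human experts, and d_i the difference between the two ranks of business model i:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}, \qquad \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

(the Spearman formula shown assumes untied ranks).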

References

[1] Anil R. Doshi, J Jason Bell, Emil Mirzayev, and Bart S. Vanneste, “Generative Artificial Intelligence and Evaluating Strategic Decisions,” Strategic Management Journal (forthcoming 2025; available at SSRN: https://ssrn.com/abstract=4714776): 1-37, 27.

[2] Doshi et al., “Generative Artificial Intelligence and Evaluating Strategic Decisions,” 27.

[3] Doshi et al., “Generative Artificial Intelligence and Evaluating Strategic Decisions,” 10.

Meet the Authors

Anil R. Doshi is an Assistant Professor of Strategy and Entrepreneurship at the UCL School of Management. Anil earned his doctorate from the Technology and Operations Management unit at Harvard Business School. He received an A.B. in Economics and Government from Dartmouth College.

J Jason Bell is an Associate Professor of Marketing at the University of Oxford’s Saïd Business School. He works under the Future of Marketing Initiative and studies AI, perception, new products, and choice processes. He uses Bayesian methods to model consumer demand and decision making, and his work has been published in peer-reviewed journals such as Marketing Science and the Journal of Marketing.

Emil Mirzayev is a research fellow at the UCL School of Management. He has a PhD in Management from SKEMA Business School and a PhD in Economics from Université Côte d’Azur.

Bart S. Vanneste is an Associate Professor in Strategy and Entrepreneurship at the UCL School of Management. Bart’s research focuses on artificial intelligence and corporate strategy. He is the Program Director of the AI for Business executive education program at the UCL School of Management.

