New research reveals just how convincingly AI mimics humans
Alan Turing’s original “imitation game,” proposed in 1950, had an elegant simplicity: a human judge conducts a text-based conversation with two hidden parties—one human, one machine—and tries to guess which is which. Today, the question Turing posed has quietly expanded into territory he never mapped. Our digital existence is a kaleidoscope of multi-modal interactions. We don’t just “talk” to the internet: we upload snapshots of our morning coffee, interpret complex visual data in professional dashboards, estimate the mood of a room through a video call, and follow subtle cues of visual attention. “Can Machines Imitate Humans? Integrative Turing-like tests for Language and Vision Demonstrate a Narrowing Gap,” co-written by Hanspeter Pfister, D^3 Associate and An Wang Professor of Computer Science at Harvard SEAS, reports a large-scale study by researchers at 15 organizations around the globe that drags the imitation game into the full complexity of how humans communicate, perceive, and describe the world. Are we already past the point where we can reliably tell machines from humans, and does it matter who’s doing the judging?
Key Insight: A Gauntlet of Language and Vision
“[W]e present an integrative benchmark encompassing a wide range of standard and well-established AI tasks across both language and vision.” [1]
Rather than testing imitation in a single domain, the researchers designed a six-task benchmark spanning language and vision. Language tasks included image captioning, word association, and open-ended conversation. Vision tasks covered color estimation (identifying the dominant color in a scene), object detection (naming three visible items), and attention prediction (comparing human eye-tracking data with AI-generated gaze sequences). The data collection was correspondingly ambitious: 36,499 responses from 636 human participants and 37 AI models, evaluated through 72,191 Turing-like tests administered to 1,916 human judges and 10 AI judges. A subtle but important design choice: the tests were not trying to determine accuracy but to quantify indistinguishability. A system can be wrong and still match human patterns, or be correct and still fail to pass as human.
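To make the scoring concrete, here is a minimal sketch (in Python) of how a two-alternative Turing-like test could be tallied. The names here (`detectability`, `judge_fn`) are hypothetical illustrations, and the paper’s actual protocol and scoring may differ; the point is simply that the metric is the judge’s hit rate, where 50% means human and machine responses are indistinguishable.

```python
import random
from typing import Callable, List, Tuple

def detectability(
    pairs: List[Tuple[str, str]],          # (human_response, ai_response) per stimulus
    judge_fn: Callable[[str, str], int],   # returns the index (0 or 1) judged human
    seed: int = 0,
) -> float:
    """Fraction of trials where the judge correctly picks the human response.

    0.5 is chance performance (human and AI are indistinguishable);
    1.0 means the judge always spots the human.
    """
    rng = random.Random(seed)
    correct = 0
    for human, ai in pairs:
        # Present the two responses in random order so position carries no signal.
        if rng.random() < 0.5:
            options, human_index = (human, ai), 0
        else:
            options, human_index = (ai, human), 1
        correct += int(judge_fn(*options) == human_index)
    return correct / len(pairs)

# Toy demo: a naive judge that guesses the shorter response is the human one.
pairs = [
    ("a dog on a beach", "a photorealistic canine traversing coastal sand"),
    ("red", "a predominantly crimson hue"),
]
print(detectability(pairs, lambda a, b: 0 if len(a) < len(b) else 1))  # -> 1.0
```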
Key Insight: Measurement, Not Myth
“[W]e consider Turing-like tests as a quantitative evaluation of how well current AIs can imitate humans.” [2]
The Turing test has always been contentious. Critics argue that a machine could pass it without genuine understanding, that it measures performance rather than intelligence, or that it’s too narrow to be meaningful. The researchers behind this study sidestep that debate entirely. Their goal isn’t to adjudicate whether AI is intelligent; it’s to measure something more practical: the ability of machines to convincingly replicate human behavior, including our flaws and mistakes. By using imitation detectability, which is the ability of a judge to distinguish between a person and an algorithm, the researchers create a framework for measuring how “human-like” a machine has become. That shift matters because conventional benchmarks often assume a stable ground truth and score models by how accurately they match a single ‘correct’ answer predefined by a human. But many real tasks, especially open-ended ones, don’t have a single ‘correct’ output. And even when they do, being right isn’t the same as acting human: a truly human-like AI should make the same kinds of mistakes we do.
Key Insight: The Vanishing Threshold of Detection
“[C]urrent algorithms are not far from being able to imitate humans in these tasks.” [3]
The results of the study are both a testament to engineering prowess and a warning for the future of digital trust. Across the six tasks, the researchers found that the detectability of AI is plummeting. In a ‘perfect’ world where humans and machines are indistinguishable, a judge would have a 50% chance of being right, essentially a coin flip. The study found that for many tasks, detectability scores hover remarkably close to that 50/50 chance level. In areas like image captioning and word association, the gap has become so thin that even motivated human judges struggle to tell the difference between a person’s description and an AI’s. Yet the convergence is not complete: AI still occasionally trips over the uniqueness of human behavior. For instance, humans are surprisingly consistent in where they look in a picture, while machines sometimes focus on statistically likely but humanly irrelevant details.
At the same time, one of the paper’s most striking results is that AI proves highly effective in a different role: judge rather than subject. Simple machine learning classifiers trained to distinguish human-generated from AI-generated responses outperformed human judges on most tasks. For word associations, the AI judge correctly identified machine-generated responses 91% of the time, compared to just 47% for humans. Machines, it seems, are far better than we are at spotting each other.
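As an illustration of what such an AI judge might look like, here is a minimal sketch in Python using scikit-learn. The toy word-association dataset, the character n-gram features, and the logistic-regression model are stand-ins of my choosing; the study’s actual classifiers, features, and data may differ.

```python
# A minimal sketch of an "AI judge": a simple text classifier trained to label
# responses as human- or machine-generated. Everything below is a toy stand-in,
# not the study's actual setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical word-association responses to the cue word "ocean".
responses = ["water", "beach", "fish", "blue",
             "vast expanse of saline water", "marine ecosystem",
             "aquatic environment", "large body of water"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = human, 1 = AI

judge = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(),
)
judge.fit(responses, labels)

# The trained judge labels unseen responses; its accuracy on held-out data
# plays the role of the 91% detection rate reported for word associations.
print(judge.predict(["sand", "expanse of water"]))
```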
Why This Matters
For executives and business leaders, this research redraws the risk landscape in two directions. First, the near invisibility of AI responses in everyday tasks means fraud, disinformation, and impersonation are no longer theoretical risks; they are feasible at scale today. Second, because automated classifiers outperform human judges, detection can no longer rely on human vigilance alone. It requires infrastructure, and regulators in the EU and elsewhere are already moving toward mandatory AI disclosure requirements. This paper highlights the importance of building transparency tools now, both to be prepared for when they are required and to maintain your customers’ trust.
Bonus
As AI systems get more capable, they’re also getting harder to understand. Another response to this challenge is to build clearer explanations, within a single coherent framework, of why models behave the way they do. To go deeper on this initiative, check out “Unifying AI Attribution: A New Frontier in Understanding Complex Systems.”
References
[1] Mengmi Zhang et al., “Can Machines Imitate Humans? Integrative Turing-like tests for Language and Vision Demonstrate a Narrowing Gap,” arXiv preprint arXiv:2211.13087v3 (2025): 3. https://doi.org/10.48550/arXiv.2211.13087
[2] Zhang et al., “Can Machines Imitate Humans?”: 2.
[3] Zhang et al., “Can Machines Imitate Humans?”: 16.
Meet the Authors

Hanspeter Pfister is the An Wang Professor of Computer Science at the Harvard John A. Paulson School of Engineering and Applied Sciences and a D^3 Associate.
Additional Authors: Mengmi Zhang, Elisa Pavarino, Xiao Liu, Giorgia Dellaferrera, Ankur Sikarwar, Caishun Chen, Marcelo Armendariz, Noga Mudrik, Prachi Agrawal, Spandan Madan, Mranmay Shetty, Andrei Barbu, Haochen Yang, Tanishq Kumar, Shui’Er Han, Aman Raj Singh, Meghna Sadwani, Stella Dellaferrera, Michele Pizzochero, Brandon Tang, Yew Soon Ong, Gabriel Kreiman