
When Giants Stumble: What Multiplication Reveals about AI’s Capabilities

Despite its impressive capabilities in reasoning, planning, and content generation, GenAI still struggles with the kind of mathematics that grade school students are expected to learn and master. What role does the transformer, the core architecture behind Large Language Models (LLMs), play in this problem, and can it be solved? In the new paper “Why Can’t Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls,” a team including D^3 Associate Collaborators Martin Wattenberg and Fernanda Viégas built a transformer that did learn how to multiply, and then took it apart to understand how.

Key Insight: The Architecture of Understanding

“We are interested in understanding the difference in a model trained with standard fine-tuning and ICoT.” [1]

Most LLMs excel at pattern matching, but to perform mathematics such as multi-digit multiplication correctly, they need to gather, store, and reuse information across many digit positions. These ‘long-range dependencies’ are what trip models up, regardless of the number of parameters in the model: a model trained with Standard Fine-Tuning (SFT) failed to correctly carry out steps like carry-over and partial products. The researchers had more success with a model trained via Implicit Chain of Thought (ICoT). Instead of forcing the model to guess the final answer directly, ICoT had the model predict the running sum at each stage of the multiplication and ‘cache’ the partial products. This small change guides the ICoT model to store and reuse intermediate information, and thereby to multiply correctly.
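To make the contrast concrete, here is a minimal Python sketch of the two kinds of training targets. The target formats, function names, and example numbers are illustrative assumptions rather than the paper’s actual data format: standard fine-tuning supervises only the final product, while an ICoT-style target also exposes the running sum after each partial product.

```python
def partial_products(a: int, b: int) -> list[int]:
    """Partial products of a * b, one per digit of b (least significant first)."""
    return [int(d) * a * 10**place for place, d in enumerate(reversed(str(b)))]

def sft_target(a: int, b: int) -> str:
    # Standard fine-tuning: the model must jump straight to the final answer.
    return f"{a} * {b} = {a * b}"

def icot_target(a: int, b: int) -> str:
    # ICoT-style supervision: reveal the running sum after each partial product,
    # so intermediate results are stored and reused rather than guessed.
    running, steps = 0, []
    for p in partial_products(a, b):
        running += p
        steps.append(str(running))
    return f"{a} * {b} = " + " -> ".join(steps)

print(sft_target(47, 96))   # 47 * 96 = 4512
print(icot_target(47, 96))  # 47 * 96 = 282 -> 4512
```

The sketch only illustrates the kind of intermediate signal involved, not the paper’s exact ICoT training procedure.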

Key Insight: The AI That Learned to Multiply

“Mechanistically, the ICoT model encodes long-range dependencies by organizing its attention in a sparse, binary-tree-like graph.” [2]

The researchers dissected the successful ICoT model to understand how it was doing its math. The model had essentially built its own layered memory through a tree-like attention structure: early layers focused on pairs of digits, storing their products, while later layers learned to read back from those stored values. This insight led the researchers back to the SFT model. By adding an auxiliary loss, an additional training signal designed to teach the model which intermediate information to care about, they were able to dramatically improve that model’s multiplication accuracy.
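The auxiliary-loss idea can be sketched roughly as follows. This is an illustrative PyTorch fragment, not the paper’s implementation: the RunningSumHead, the digit-level targets, and the 0.1 weighting are assumptions made here for clarity. The point is simply that the model receives an extra supervised signal about an intermediate quantity (the running sum), on top of the usual next-token loss.

```python
import torch
import torch.nn as nn

class RunningSumHead(nn.Module):
    """Illustrative auxiliary head: predicts a running-sum digit (0-9)
    from each of the transformer's hidden states."""
    def __init__(self, hidden_dim: int, num_classes: int = 10):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) -> per-position digit logits
        return self.proj(hidden_states)

def combined_loss(token_logits: torch.Tensor, token_targets: torch.Tensor,
                  digit_logits: torch.Tensor, digit_targets: torch.Tensor,
                  aux_weight: float = 0.1) -> torch.Tensor:
    ce = nn.CrossEntropyLoss()
    # Standard next-token prediction loss, as in ordinary fine-tuning.
    lm_loss = ce(token_logits.flatten(0, 1), token_targets.flatten())
    # Auxiliary signal: also supervise the intermediate running-sum digits,
    # nudging the model to keep that information in its hidden states.
    aux_loss = ce(digit_logits.flatten(0, 1), digit_targets.flatten())
    return lm_loss + aux_weight * aux_loss
```

Read as a sketch, the key design choice is that intermediate information the model would otherwise have to discover on its own becomes an explicit training target.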

Why This Matters

This research illustrates another example of AI’s Jagged Frontier: just because AI produces impressive results on some tasks doesn’t guarantee competency across all domains, even seemingly simple ones. For executives and business leaders, this matters deeply. AI is already, and increasingly, being integrated into systems that analyze data and recommend actions. Strategies for making AI more logical, transparent, and trustworthy can help businesses plan with more confidence, but ultimately, leaders need to decide how they will implement AI and own the risks when humans are out of the loop. Leaders who stay informed and engaged with these dynamics will be best positioned to separate hype from capability and deploy AI where it adds value responsibly.

References

[1] Xiaoyan Bai et al., “Why Can’t Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls.” arXiv preprint arXiv:2510.00184 (September 30, 2025): 2. Preprint DOI: https://doi.org/10.48550/arXiv.2510.00184 

[2] Bai et al., “Why Can’t Transformers Learn Multiplication?”: 1.

Meet the Authors

Xiaoyan Bai is a PhD student in computer science at the University of Chicago.

Itamar Pres is a PhD student at MIT focusing on artificial intelligence.

Yuntian Deng is an assistant professor at the University of Waterloo.

Chenhao Tan is an associate professor in the Department of Computer Science and Data Science at the University of Chicago.

Stuart Shieber is James O. Welch, Jr. and Virginia B. Welch Professor of Computer Science at Harvard University.

Fernanda Viégas is Gordon McKay Professor of Computer Science at the Harvard John A. Paulson School of Engineering and Applied Sciences, and an Associate Collaborator at the Digital Data Design Institute at Harvard (D^3).

Martin Wattenberg is Gordon McKay Professor of Computer Science at the Harvard John A. Paulson School of Engineering and Applied Sciences, and an Associate Collaborator at the Digital Data Design Institute at Harvard (D^3).

Andrew Lee is a postdoctoral fellow at the Insight + Interaction Lab at the Harvard John A. Paulson School of Engineering and Applied Sciences.
