Mastering Efficiency in AI Training: Insights from Critical Batch Size Research

As businesses increasingly adopt large-scale AI models, optimizing training efficiency is crucial. In “How Does Critical Batch Size Scale in Pre-training?”, Hanlin Zhang and a group of colleagues (see below for author details) explore the critical batch size (CBS): the point beyond which increasing the batch size in data-parallel training, which distributes training data across multiple processors, stops yielding significant returns. Understanding the scaling behavior of the CBS is crucial for efficient pre-training of large models, enabling practitioners to balance computational efficiency and model performance, particularly when resources are limited. Thus, the paper’s findings provide actionable insights for enhancing the efficiency of large language model (LLM) training while effectively managing resources.

Key Insight: Data Size, Not Model Size, Drives CBS

“Our results demonstrate that CBS scales primarily with data size rather than model size.” [1] 

The researchers conducted extensive experiments training autoregressive language models ranging from 85 million to 1.2 billion parameters. Through careful control of factors like batch size, momentum, and learning rate scheduling, they made a surprising discovery—the CBS increases primarily as a function of the amount of training data, rather than the size of the model itself. This finding challenges previous assumptions about the relationship between model size and training efficiency. 
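One way to build intuition for this trade-off is the diminishing-returns relation used in earlier work on large-batch training: below the CBS, doubling the batch size roughly halves the number of optimization steps needed to reach a target loss, while above it extra data parallelism buys little. The sketch below is purely illustrative; the function name and the constants `min_steps` and `critical_batch` are hypothetical stand-ins, not values from the paper.

```python
def steps_to_target(batch_size, min_steps=1000, critical_batch=4096):
    """Illustrative trade-off: optimization steps needed to hit a target loss.

    For batch_size << critical_batch, steps scale roughly like 1/batch_size
    (near-perfect data parallelism); for batch_size >> critical_batch,
    steps flatten out near min_steps (diminishing returns).
    """
    return min_steps * (1 + critical_batch / batch_size)

# Doubling a small batch nearly halves the steps...
small, double_small = steps_to_target(256), steps_to_target(512)
# ...while doubling an already-large batch barely helps.
large, double_large = steps_to_target(16384), steps_to_target(32768)
```

Under this toy model, the paper's finding means that `critical_batch` grows mainly as training data grows, so larger datasets can productively absorb more parallelism.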

Key Insight: The Power of Exponential Weight Averaging

“EWA consistently improves model training efficiency. Training with EWA is slightly worse than Cosine learning rate decay for small batch sizes while outperforming Cosine for large batch sizes.” [2]

The researchers highlight the effectiveness of exponential weight averaging (EWA) (1) in improving model training efficiency, especially for large batch sizes. This technique proves particularly valuable in scenarios where a target loss must be achieved, but practitioners are uncertain about the exact maximum data size for setting up the learning rate schedule. EWA outperformed Cosine learning rate decay (2) for batch sizes larger than 0.52 million tokens, offering a more adaptable approach to long-duration training.

Key Insight: The Model Width vs. Depth Debate

“Increasing width and depth has similar effects in the increase of critical batch size for compute-optimal pre-training.” [3]

The research team explored different strategies for scaling model size, comparing the effects of increasing model width versus depth. Specifically, they compared 604 million-parameter models obtained by scaling a smaller base model either in width (2x) or in depth (4x), and found equivalent efficiency gains in terms of CBS. Their findings indicate that both approaches lead to similar increases in CBS for compute-optimal pre-training.

Why This Matters

Understanding the scaling laws of CBS is crucial for businesses and organizations investing in AI technologies. The insights from Zhang et al.’s research offer practical guidance for optimizing large-scale language model training, potentially leading to significant cost savings and improved performance. By focusing on data size rather than model size for increasing CBS, companies can make more informed decisions about resource allocation in AI development. Additionally, the findings on EWA and model scaling strategies provide actionable techniques for enhancing training efficiency. As AI continues to play an increasingly important role in business operations and decision-making, these insights will help executives and technical teams develop more effective and resource-efficient AI strategies, ultimately driving innovation and competitive advantage in the AI-driven marketplace.

Footnotes

(1) EWA (exponential weight averaging) is a technique used in model training that maintains a running average of model weights. It can be described by the update ξ_{t+1} = τ · ξ_t + (1 − τ) · θ_t, where ξ_t is the averaged weight, θ_t is the current model weight, and τ is the decay rate. This method helps to smooth out noise in the training process and can improve optimization, especially for large batch sizes and longer training durations.
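The EWA update rule can be sketched in a few lines of Python. The decay rate τ = 0.9 and the toy weight values below are hypothetical choices for illustration, not values from the paper, and real implementations apply the update to full weight tensors rather than lists of floats.

```python
def ewa_update(avg_weights, current_weights, tau=0.9):
    """One EWA step: xi_{t+1} = tau * xi_t + (1 - tau) * theta_t, elementwise."""
    return [tau * xi + (1 - tau) * theta
            for xi, theta in zip(avg_weights, current_weights)]

# Maintain the running average alongside the live model weights:
avg = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.2, 1.8], [0.9, 2.1]):
    avg = ewa_update(avg, step_weights, tau=0.9)
```

Because each step blends only a small fraction (1 − τ) of the current weights into the average, the averaged model changes slowly, which is what smooths out the noise from large-batch updates.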

(2) Cosine learning rate decay is a scheduling method for adjusting the learning rate during model training. It follows a cosine function to gradually decrease the learning rate over time. This approach is typically beneficial for small-batch training scenarios, as it helps the model converge more effectively.
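For comparison with EWA, a minimal cosine schedule looks like the sketch below: the learning rate decays from a peak `lr_max` to a floor `lr_min` over a fixed number of steps, following half a cosine period. The specific rates are illustrative assumptions; note that the schedule requires knowing `total_steps` in advance, which is exactly the constraint the EWA discussion above avoids.

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-5):
    """Cosine learning rate decay from lr_max (step 0) to lr_min (final step)."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

The rate starts at `lr_max`, passes the midpoint of the range halfway through training, and ends at `lr_min`.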

References

[1] Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, and Sham Kakade, “How Does Critical Batch Size Scale in Pre-training?” arXiv:2410.21676 [cs.LG] (October 29, 2024): 1-30, 1.

[2] Zhang et al., “How Does Critical Batch Size Scale in Pre-training?”, 5.

[3] Zhang et al., “How Does Critical Batch Size Scale in Pre-training?”, 6.

Meet the Authors

Hanlin Zhang is a CS PhD student at Harvard ML Foundations Group, and is advised by Sham Kakade. He is interested in foundations and social implications of machine learning. He received his Master’s degree in Machine Learning at Carnegie Mellon University and Bachelor’s degree in Computer Science from South China University of Technology.

Depen Morwani is a third-year PhD student in the ML Foundations group at Harvard, advised by Boaz Barak and Sham Kakade. His research focuses on understanding the inductive bias of optimization algorithms in deep learning, and utilizing it for developing better optimization algorithms. He completed his Master’s degree at the Indian Institute of Technology, Madras.

Nikhil Vyas is a Postdoctoral Research Fellow at Harvard’s School of Engineering and Applied Sciences. Before Harvard, he received his PhD from MIT in theoretical computer science. Vyas’ current research focuses on improving and understanding deep learning. He is interested in the topics of improving efficiency of neural network optimization, scaling laws, and understanding feature learning in neural networks. 

Jingfeng Wu is a Postdoctoral Fellow at the Simons Institute at UC Berkeley. He is a part of the NSF/Simons Collaboration on the Theoretical Foundations of Deep Learning. He obtained his PhD in Computer Science at Johns Hopkins University, advised by Vladimir Braverman, and Bachelor’s and Master’s degrees at Peking University. 

Difan Zou is an Assistant Professor in the Department of Computer Science at The University of Hong Kong and The HKU Musketeers Foundation Institute of Data Science. He is interested in machine learning, stochastic optimization, and graph learning, with a special focus on the theoretical/empirical understanding (or Physics) of deep learning (especially foundation models). 

Udaya Ghai is a Senior Machine Learning Scientist at Amazon, where his research focuses on applying reinforcement learning to supply chain optimization. He is interested in theory and algorithms for machine learning, particularly reinforcement learning, control, optimization (convex and non-convex), and online learning. He completed his PhD in Computer Science at Princeton under the supervision of Professor Elad Hazan, where he worked at the intersection of online learning and control theory.

Dean Foster is the Marie and Joseph Melone Professor Emeritus of Statistics at the University of Pennsylvania Wharton School of Business. He received his PhD from the University of Maryland.

Sham Kakade joined the Harvard University faculty in spring 2022. He works on the mathematical foundations of machine learning and AI. He also focuses on the design of provably efficient and practical algorithms that are relevant for a broad range of paradigms. He earned his PhD at the Gatsby Computational Neuroscience Unit at the University College London and came to Harvard from the University of Washington, where he was a professor in computer science and statistics. He has also been a principal research scientist at Microsoft Research in New England and New York City.

Engage With Us

Join Our Community

Ready to dive deeper with the Digital Data Design Institute at Harvard? Subscribe to our newsletter, contribute to the conversation and begin to invent the future for yourself, your business and society as a whole.