The following insights are derived from a recent Assembly Talk featuring Sushant Tripathy, Research Scientist in Machine Learning at Google.
Hybrid AI has gained traction in recent years due to the increasing computational demands of advanced machine learning models. This article delves into the concept of hybrid AI, discussing its opportunities and limitations. The information presented is based on Dr. Tripathy's recent D^3 Assembly presentation, which explored strategies to address the challenges of AI computation and deployment.
The Growing Compute Challenge
Let’s first look back at Moore’s Law, which posits that the number of transistors on a chip will double approximately every two years. While this law has held strong, the compute capacity required for machine learning models has grown even faster, especially with the advent of large language models and transformers. As the complexity of these AI models increases, so does the compute power needed for both training and inference.
Meanwhile, with the rise of AI adoption, cloud data centers face increasing pressure to keep up with the demand for compute capacity. This has resulted in rising operational costs, leading cloud providers like Google Cloud Platform (GCP) and Amazon Web Services (AWS) to increase their pricing. Consequently, investors have raised concerns about the cost-prohibitive nature of the AI field, which potentially hinders startups from entering the market.
Strategies to Address Compute Challenges
In his talk, Sushant discussed three potential strategies to tackle AI compute scarcity:
1. Better hardware on the same footprint: Deploy specialized hardware, such as TPUs and their Edge variants, which perform matrix computations faster and with lower power consumption than contemporary GPUs and CPUs. Since deep neural networks rely heavily on matrix multiplications, this addresses the compute requirements of most modern AI models, such as CNNs, Transformers, and LLMs.
A second emerging trend in specialized hardware is the return to analog neural machines, that is, analog photonic or electronic chips. These machines rely on the fact that, just like human brains, most neural networks’ intermediate or output signals don’t need to propagate or output exact values to be useful. (For instance, for a binary classifier neural network, it is standard practice to consider an output above 0.5 as positive and below 0.5 as negative; in practice, all values on the same side of 0.5 carry the same meaning.) Further, although they’re a bit noisier, analog neural machines are faster and more power efficient than digital machines, because they don’t need to switch between 0 (off) and 1 (on) states. As of 2023, lab prototypes and mass-produced analog chips have already demonstrated success in training and/or running inference for deep learning models such as CNNs.
2. Better software: Optimize models through quantization and pruning to reduce memory footprint and computational requirements while maintaining performance.
(1) Quantization relies on the conversion of 32-bit floating point (FP32) mathematical operations, which are standard in most deep neural networks, to either 16-bit floating point (FP16) operations or 8-bit integer (INT8) operations. Replacing 32-bit floating point values with 16-bit values halves the memory footprint of the neural model, and on some specialized hardware, such as GPUs, it can speed up computation by up to 3 times. However, while applying such a replacement, one must consider whether the loss in precision has a deleterious effect on the model’s performance. For instance, FP32 (12.000000 * 13.000000) has the same value as FP16 (12.0000 * 13.0000); the extra zeros in the more precise FP32 representation make no arithmetic difference whatsoever.
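As a minimal sketch of the FP32-to-FP16 conversion, the snippet below casts a hypothetical weight matrix (a stand-in for one layer of a neural model) to FP16 and measures both the memory saving and the precision lost. The array shape and values are illustrative assumptions, not from the talk.

```python
import numpy as np

# Hypothetical weight matrix standing in for one layer of a neural model.
weights_fp32 = np.random.rand(1024, 1024).astype(np.float32)

# Casting FP32 -> FP16 halves the memory footprint.
weights_fp16 = weights_fp32.astype(np.float16)

print(weights_fp32.nbytes)  # 4194304 bytes (4 MiB)
print(weights_fp16.nbytes)  # 2097152 bytes (2 MiB)

# Measure the worst-case rounding error introduced by the cast.
max_error = np.max(np.abs(weights_fp32 - weights_fp16.astype(np.float32)))
print(max_error)  # small, but nonzero: FP16 keeps only ~3 decimal digits
```

Whether that `max_error` is acceptable depends on the model; this is exactly the precision-versus-performance trade-off described above.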
The conversion from 32-bit floating point arithmetic to 8-bit integer arithmetic further reduces the memory footprint of neural models by up to 4 times, and increases computation speed by more than 3 times on most hardware, since CPUs and GPUs are traditionally faster at integer math than floating point math. This approach makes neural models suitable for running on personal devices (such as mobile phones, tablets, and laptops).
However, this approach is also more complicated. Quantizing a neural model this way requires successfully identifying the most likely ranges of input and intermediate activation values for the neural network. Next, you need to identify the most likely cluster centers within these value ranges and construct an equivalent “calibrated” integer neural model, so that the output values on these points match up between the INT8 and FP32 neural networks.
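The calibration step described above can be sketched as follows: observe a range of activation values on sample data, derive a scale and zero point from that range, and map FP32 values onto the INT8 grid. This is a simplified affine-quantization sketch under assumed data, not the exact procedure from the talk; the function names and percentile-based range choice are illustrative.

```python
import numpy as np

def quantize_int8(values, calib_min, calib_max):
    """Affine-quantize FP32 values to INT8 using a calibrated range."""
    scale = (calib_max - calib_min) / 255.0      # one INT8 step in FP32 units
    zero_point = np.round(-calib_min / scale) - 128.0
    q = np.round(values / scale + zero_point)
    return np.clip(q, -128, 127).astype(np.int8), scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Map INT8 codes back to approximate FP32 values."""
    return (q.astype(np.float32) - zero_point) * scale

# Calibration: observe the likely range of activations on sample data,
# trimming outliers so the 256 integer levels cover the common values.
activations = np.random.randn(1000).astype(np.float32)
lo, hi = np.percentile(activations, [0.5, 99.5])

q, scale, zp = quantize_int8(activations, lo, hi)
recovered = dequantize_int8(q, scale, zp)
```

For values inside the calibrated range, the round-trip error stays within about one quantization step (`scale`); values outside the range are clipped, which is why choosing the range well matters.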
(2) Pruning reduces the number of mathematical operations in the neural model by setting some weak “connections” to zero. For instance, if we consider the following equation of a single neuron:
Output = Tanh(0.5 * input1 + 0.005 * input2 - 0.89 * input3 + 21.0)
where input1, input2, and input3 have similar value ranges, discarding the operation ‘0.005 * input2’ will not significantly alter the output in most cases, and will reduce the compute required (by approximately 30% for the single neuron).
Pruned Output = Tanh(0.5 * input1 - 0.89 * input3 + 21.0)
Hence, after looking at the labeled data for model validation, weak connections can be identified and discarded.
3. Hybrid AI: This is a novel approach that involves processing AI computations on both the user’s device and the cloud, allowing for cost savings, reduced latency, and improved data privacy.
The Promise of Hybrid AI
Hybrid AI addresses some of the challenges posed by traditional cloud-based approaches, where both inference and training occur in centralized data centers. By offloading part of the computation to users’ personal devices, it reduces the dependence on data centers, saving costs for both providers and users. Additionally, hybrid AI lowers network dependency and latency for certain applications, improving overall user experience.
How does it work? One form of hybrid AI for inference involves model splitting. In this approach, the AI model is divided into two parts: the less complex part runs on the user’s device (touching the user’s data), and the more complex part executes in the cloud. This enables complicated models to operate on the user’s data while reducing data transfer latency and computation costs.
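A toy sketch of model splitting, under the assumption of a simple two-stage network: a small on-device feature extractor touches the raw user data, and only its compact output is sent to a larger cloud-side head. The layer sizes and random weights are hypothetical, chosen purely to illustrate the split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-stage model: a small on-device feature extractor
# followed by a larger cloud-side scoring head.
W_device = rng.standard_normal((16, 64)).astype(np.float32)  # runs on device
W_cloud = rng.standard_normal((10, 16)).astype(np.float32)   # runs in cloud

def on_device(raw_input):
    # Touches raw user data locally; only the compact 16-dim
    # representation ever leaves the device.
    return np.maximum(W_device @ raw_input, 0.0)  # linear layer + ReLU

def in_cloud(features):
    # The heavier part of the model, run server-side.
    return W_cloud @ features

raw = rng.standard_normal(64).astype(np.float32)  # stays on the device
features = on_device(raw)   # 16 floats cross the network, not 64
scores = in_cloud(features)
```

Note the privacy and bandwidth angle: the cloud never sees `raw`, only the smaller intermediate `features`, which is what enables the cost, latency, and privacy benefits described above.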
During his talk, Sushant presented two hypothetical case studies to demonstrate the practical applications of hybrid AI in an online marketplace. The first involved unsafe listing filtration, where the AI model ran in the merchant’s browser. By instantly recognizing and warning against inappropriate content, it saved time and reduced server-side validation calls. The second focused on relevant item recommendations, leveraging the user’s search history and on-device computation to provide real-time, improved recommendations.
Limitations and Future Developments
Despite its promise, hybrid AI faces limitations, particularly in the availability and performance of mobile compute. The compute growth of current mobile devices lags Moore’s Law projections due to thermal dissipation constraints. However, Sushant highlighted that chip-cooling technologies are advancing toward resolving these limitations.
The takeaway? Hybrid AI presents a unique and promising approach to address the growing computational demands of advanced AI models. By leveraging both cloud and personal device computations, hybrid AI can reduce costs, improve data privacy, and enhance user experience. While it faces challenges related to mobile compute limitations, ongoing research and advancements in hardware and software technologies may unlock the full potential of hybrid AI in the near future.