2023 was the year that showcased an impressive number of use cases powered by generative AI. This disruptive form of Artificial Intelligence (AI) technology is at the heart of OpenAI's ChatGPT and Google's Gemini AI model, demonstrating how generating text, images, or even audio content from user text prompts can simplify work and advance education. Sounds impressive, doesn't it?
However, what’s the next step for generative AI as it proliferates across our favorite consumer devices? The answer is generative AI at the edge on mobile.
In this blog, we will demonstrate how Large Language Models (LLMs), a form of generative AI inference, can run on the majority of mobile devices built on Arm technology. We will discuss how the Arm CPU is well suited to this type of use case due to the typical batch size and the balance of compute and bandwidth that this kind of AI workload requires. We will also explain the AI capabilities of the Arm CPU and demonstrate how its flexibility and programmability enable clever software optimizations, resulting in great performance and opportunities for many LLM use cases.
There are a wide variety of different network architectures that can be used for generative AI. However, LLMs are certainly attracting a lot of interest due to their ability to interpret and generate text on a scale that has never been seen before.
As the LLM name suggests, these models are anything but small compared to what we were using up until last year. To give some numbers, they can easily have between 100 billion and 1 trillion trainable parameters. This makes them roughly three orders of magnitude larger than BERT (Bidirectional Encoder Representations from Transformers), one of the largest state-of-the-art NLP (Natural Language Processing) models, trained by Google in 2018.
But how does a 100 billion parameter model translate into RAM use? If we considered deploying the model on a processor using floating-point 16-bit acceleration, a 100B parameter model, at two bytes per parameter, would require at least 200GB of RAM!
As a result, these large models end up running on the Cloud. However, this poses three fundamental challenges that could limit the adoption of this technology:
Towards the second half of 2023, we started to see some smaller, more efficient LLMs emerge that will unlock generative AI on mobile, making this technology more pervasive.
In 2023, Llama2 from Meta, Gemini Nano from Google and Phi-2 from Microsoft opened the door to mobile LLM deployment to solve the three challenges previously listed. In fact, these models have 7 billion, 3.25 billion, and 2.7 billion trainable parameters, respectively.
Today’s mobile devices have incredible computational power built on Arm technology that makes them capable of running complex AI algorithms in real-time. In fact, existing flagship and premium smartphones can already run LLMs. Yes, you read it correctly.
The deployment of LLMs on mobile is predicted to accelerate in the future, with the following likely use cases:
Across all these use cases, there will be vast amounts of user data for the model to process. However, because the LLM runs at the edge without an internet connection, that data never leaves the device. This helps protect the privacy of individuals and improves the latency and responsiveness of the user experience. These are certainly compelling reasons for deploying LLMs at the edge on mobile.
Fortunately, almost all smartphones worldwide (around 99 percent) have the technology that is already capable of processing LLMs at the edge today: the Arm CPU.
This was demonstrated through an Arm demo at Mobile World Congress (MWC) 2024 that can be seen in the following video.
The video demonstrates the performance of running the Llama2-7B LLM on existing Android phones using three Arm Cortex-A700 series CPU cores. The video runs at actual speed and, as you can see, the virtual assistant in the Android application is very responsive and fast to reply. It has a very impressive time-to-first-token response and a text generation rate of 9.6 tokens per second, which is faster than the average human reading speed. This is due to existing CPU instructions for AI and dedicated software optimizations for LLMs. Perhaps most importantly, everything runs locally at the edge, on the mobile device.
However, new models continue to emerge, and we at Arm continue to improve the LLM experience on Arm technology. When the latest Llama3 model from Meta and the Phi-3 3.8B model from Microsoft came out recently, we worked quickly to run them on Arm CPUs on mobile. Llama3 and Phi-3 3.8B are larger than their predecessors: Llama2 was 7B while Llama3 is 8B, and Phi-2 was 2.7B while Phi-3 is 3.8B. These new AI models are far more capable and can respond to a wider range of questions.
The new demo features 'Ada', a chatbot specifically trained to be a virtual teaching assistant for science and coding. The Phi-3 3.8B model running in the video shows an equally impressive time-to-first-token response and a text generation rate of just over 15 tokens per second. The demo is based on the pre-existing software optimizations we developed for Llama2 and Phi-2. Even though these models are bigger and more sophisticated, this clearly demonstrates that they can run well on mobile devices powered by Arm CPUs today.
But how did we develop these demos? Well, we are happy to provide some tips for deploying LLMs on the Arm CPU in Android.
Firstly, it's worth saying that the Arm CPU makes life easier for AI developers, so it's unsurprising that 70 percent of AI in today's third-party applications runs on Arm CPUs. Thanks to its extensive flexibility and programmability, AI developers can experiment with novel compression and quantization techniques to make these LLMs smaller and faster everywhere. In fact, the key ingredient that allowed us to run a model with 7 billion parameters was integer quantization, in this case int4.
Quantization is the crucial technique for making AI and Machine Learning (ML) models compact enough to run efficiently on devices with limited RAM. It is therefore indispensable for LLMs, whose billions of trainable parameters are natively stored in floating-point data types, such as 32-bit (FP32) and 16-bit (FP16) floating point. For example, the Llama2-7B variant with FP16 weights needs at least ~14GB of RAM, which is prohibitive for many mobile devices.
By quantizing an FP16 model to 4-bit, we can reduce its size by a factor of four and bring the RAM use down to roughly 4GB. Since the Arm CPU offers tremendous software flexibility, developers can also lower the number of bits per parameter to obtain an even smaller model. However, keep in mind that going down to three or two bits per parameter might lead to a significant loss of accuracy.
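To make the idea concrete, here is a minimal sketch of symmetric block-wise int4 quantization written in C. The block size, struct layout, and function name are illustrative assumptions for this post, not the exact scheme used in the demos or in llama.cpp; they simply show how 32-bit weights can be reduced to a per-block scale plus packed 4-bit values.

// A minimal, illustrative sketch of symmetric block-wise int4 quantization.
// Each block of 32 FP32 weights is stored as one float scale plus 32 signed
// 4-bit values packed two per byte (roughly 4.5 bits per weight).
#include <math.h>
#include <stdint.h>

#define BLOCK_SIZE 32

typedef struct {
    float scale;                     // per-block scale factor
    uint8_t packed[BLOCK_SIZE / 2];  // two 4-bit values per byte
} block_q4;

void quantize_block_q4(const float *weights, block_q4 *out) {
    // Find the largest magnitude in the block.
    float max_abs = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        float a = fabsf(weights[i]);
        if (a > max_abs) max_abs = a;
    }

    // Map the range [-max_abs, max_abs] onto the signed int4 range [-8, 7].
    out->scale = max_abs / 7.0f;
    float inv_scale = (out->scale != 0.0f) ? 1.0f / out->scale : 0.0f;

    for (int i = 0; i < BLOCK_SIZE; i += 2) {
        int lo = (int)roundf(weights[i] * inv_scale);
        int hi = (int)roundf(weights[i + 1] * inv_scale);
        if (lo < -8) lo = -8; if (lo > 7) lo = 7;
        if (hi < -8) hi = -8; if (hi > 7) hi = 7;
        // Store as unsigned nibbles with an offset of 8.
        out->packed[i / 2] = (uint8_t)((lo + 8) | ((hi + 8) << 4));
    }
}

At inference time, each weight is recovered as (nibble - 8) * scale, which is why the quality of the per-block scale matters so much for the model's accuracy.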
When running workloads on the CPU, we suggest a straightforward tip for improving performance: pinning each thread to a specific CPU core by setting its affinity.
Generally speaking, the operating system (OS) is responsible for choosing the core to run the thread on when deploying CPU applications. This decision is not always based on achieving the optimal performance.
However, for a performance-critical application, the developer can force a thread to run on a specific core using thread affinity. This technique helped us to improve latency by over 10 percent.
You can specify the thread affinity through the affinity mask, which is a bitmask where each bit represents a CPU core in your system. For example, let’s assume we have eight cores, four of which are the Arm Cortex-A715 CPUs that are assigned to the most significant bits of the bitmask (0b1111 0000).
To pin a thread to one of the Cortex-A715 CPU cores, we should pass the thread affinity mask to the system scheduler before executing the workload. This can be done in Android using the following syscall function:
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>   // syscall()

pid_t pid = 0;        // 0 = apply the mask to the calling thread
uint8_t mask = 0x80;  // 1000 0000: pin to the core mapped to the most significant bit
syscall(__NR_sched_setaffinity, pid, sizeof(mask), &mask);
For example, if we had two threads, we could give each thread its own bitmask, such as 0x80 (1000 0000) for the first thread and 0x40 (0100 0000) for the second, so that each one is pinned to a different Cortex-A715 core, as in the sketch below.
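The following is a minimal sketch of that two-thread setup using pthreads. The core numbering follows the 0b1111 0000 layout assumed above and the worker function is purely illustrative, not taken from the actual demo code.

// Illustrative only: two worker threads, each pinning itself to a different
// Cortex-A715 core before running its share of the workload.
// Build with: -pthread
#include <pthread.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *worker(void *arg) {
    uint8_t mask = (uint8_t)(uintptr_t)arg;
    // 0 = apply the affinity mask to the calling thread
    syscall(__NR_sched_setaffinity, 0, sizeof(mask), &mask);
    /* ... run this thread's share of the performance-critical workload ... */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)(uintptr_t)0x80); // 1000 0000
    pthread_create(&t2, NULL, worker, (void *)(uintptr_t)0x40); // 0100 0000
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}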
After executing the workload, we should always reset the affinity mask to its default state, as shown in the following code snippet:
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>   // syscall()

pid_t pid = 0;        // 0 = apply the mask to the calling thread
uint8_t mask = 0xff;  // 1111 1111: allow the thread to run on any core again
syscall(__NR_sched_setaffinity, pid, sizeof(mask), &mask);
Thread affinity is a low-effort technique for improving the performance of any CPU workload. However, int4 quantization and thread affinity alone are not enough to get the best performance from LLMs. And we know how crucial low latency is for these models, as it affects the overall user experience.
Therefore, the team at Arm has developed highly optimized int4 matrix-by-vector and matrix-by-matrix CPU routines to improve the performance dramatically.
The matrix-by-matrix and matrix-by-vector routines are performance-critical functions for LLMs. These routines have been optimized for Arm Cortex-A700 series CPUs using the SDOT and SMMLA instructions. Our routines, which will be available soon, helped to improve the time-to-first-token (encoder) performance by over 50 percent and text generation by 20 percent, compared to the native implementation in llama.cpp.
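To give a flavor of what these instructions do, here is a minimal sketch of an int8 dot product using the vdotq_s32 intrinsic, which maps to the SDOT instruction. This is illustration only, not the optimized routines mentioned above: those also handle the int4 packing, blocking, and the SMMLA matrix-by-matrix path.

// Minimal sketch: int8 dot product accelerated with SDOT via Arm NEON
// intrinsics. Requires the dot-product extension, e.g. compile with
// -march=armv8.2-a+dotprod. Assumes n is a multiple of 16.
#include <arm_neon.h>
#include <stdint.h>

int32_t dot_product_s8(const int8_t *a, const int8_t *b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        // SDOT: multiply-accumulate four groups of four int8 values into
        // the four int32 accumulator lanes in a single instruction.
        acc = vdotq_s32(acc, va, vb);
    }
    return vaddvq_s32(acc);  // horizontal sum of the accumulator lanes
}

Replacing scalar multiply-accumulate loops with a handful of these instructions is what gives the CPU routines their throughput on quantized weights.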
Using dedicated AI instructions, CPU thread affinity, and software-optimized routines, the demos showcase a great overall user experience for interactive use cases. The videos demonstrate an immediate time-to-first-token response and a text generation rate that is faster than the average human reading speed. Best of all, this performance is achievable on all Cortex-A700 series-enabled mobile devices.
We are also excited to see the developer open-source community engaged in working with models on Arm. The Arm CPU provides the AI developer community with opportunities to experiment with their own techniques to provide further software optimizations that make LLMs smaller, more efficient and faster. This was demonstrated by the fact that developers in the open-source community managed to have the new models up and running on Arm in around 48 hours. We look forward to seeing more open-source engagement with generative AI on Arm.
However, this is just the beginning of the LLM experience on Arm technology. As LLMs get smaller and more sophisticated, their performance on mobile devices at the edge will continue to improve. In addition, Arm and partners from our industry-leading ecosystem will continue to deliver hardware advancements and software optimizations that accelerate the AI capabilities of the Arm CPU, such as the Scalable Matrix Extension (SME) for the Armv9-A architecture. These advancements will unlock the next era of use cases for LLMs on Arm-based consumer devices throughout 2024 and beyond.