The automotive industry is witnessing a transformative shift with the integration of artificial intelligence, particularly with generative AI (Gen AI). A recent McKinsey survey of automotive and manufacturing executives revealed that more than 40% of respondents are investing up to €5 million in Gen AI research and development, and over 10% are investing more than €20 million.
Automotive companies are using Gen AI services on Amazon Web Services (AWS) for a wide range of optimizations and productivity gains. For example, BMW Group developed a Gen AI assistant to help accelerate its infrastructure optimization on AWS. Audi and Reply worked with AWS to improve their enterprise search experience through a Gen AI chatbot built on Amazon SageMaker. Ferrari used Amazon Bedrock's large language models (LLMs) alongside Amazon Personalize to create a car configurator, and also implemented a Gen AI chatbot for after-sales support and technical assistance.
With the move towards software-defined vehicles (SDVs), the number of lines of code in vehicles is expected to increase from 100 million lines per vehicle to about 300 million lines by 2030. Gen AI for automotive, together with SDVs, is enabling in-vehicle use cases across performance and comfort that help enhance the driving and vehicle experience.
In this blog post, Arm and AWS will present one such in-vehicle Gen AI use case along with its implementation details.
As vehicles become increasingly sophisticated, with the ability to receive post-production feature updates such as parking assist or lane keeping, a new challenge has emerged: keeping vehicle owners informed about these changes and new capabilities. Traditional methods of updating printed or online manuals have proven inadequate, often leaving drivers unaware of the full potential of their vehicles.
To address this challenge, AWS developed a demonstration that uses the power of Gen AI, edge computing, and the Internet of Things (IoT). At the heart of this solution is an in-vehicle application powered by a Small Language Model (SLM), which is designed to enable drivers to access up-to-date vehicle information through natural voice interactions. The demo application is designed to operate offline after deployment, ensuring that drivers have access to critical information about their vehicle even without an internet connection.
The implementation of this solution brings together several advanced technologies to create a seamless and efficient user experience. The demo application deploys a local SLM within the vehicle, optimized for performance using the Arm® KleidiAI routines. With KleidiAI, SLM inference achieved response times of 1 to 3 seconds, compared with 8 to about 19 seconds observed without the KleidiAI optimizations. Using KleidiAI also saved about 6 weeks of development time, because the developer did not need to work on low-level software optimizations.
Arm Virtual Hardware (AVH) provides access to many popular IoT development kits on AWS. Developing and testing on AVH provides time savings for embedded application development when the physical device is unavailable, or inaccessible by globally distributed teams. AWS successfully tested the demo application on the automotive virtual platform, where AVH provided a virtual instance of the Raspberry Pi device. The same KleidiAI optimizations are also available on AVH.
One of the key features of the Gen AI application running on the edge device is its ability to receive over-the-air updates using, in part, AWS IoT Greengrass Lite, helping to ensure the information provided to drivers is always current. AWS IoT Greengrass Lite is memory-efficient, using just 5 MB of RAM on the edge device where it is installed. The solution also incorporates an automated quality monitoring and feedback loop that continuously evaluates the relevance and accuracy of the SLM's responses: a comparison system flags responses falling outside the expected quality threshold for review. The collected feedback data is visualized in near real time through a dashboard on AWS, allowing OEM quality assurance teams to identify areas for improvement and initiate updates as needed.
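The quality-monitoring idea described above can be sketched in a few lines: score each SLM response against a reference answer and flag low scores for review. The token-overlap scoring method and the 0.6 threshold here are illustrative assumptions, not the production implementation.

```python
# Minimal sketch of a response quality check (illustrative assumptions:
# the scoring metric and threshold are NOT from the actual AWS solution).

def overlap_score(response, reference):
    """Jaccard overlap between the token sets of two strings (0.0 to 1.0)."""
    r, ref = set(response.lower().split()), set(reference.lower().split())
    return len(r & ref) / len(r | ref) if r | ref else 1.0

def flag_for_review(pairs, threshold=0.6):
    """Return indices of (response, reference) pairs scoring below threshold."""
    return [i for i, (resp, ref) in enumerate(pairs)
            if overlap_score(resp, ref) < threshold]
```

In a real pipeline, the flagged indices would be pushed to the AWS dashboard for the OEM quality assurance team to review.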
The benefits of this Gen AI-powered solution extend beyond providing accurate information to drivers. It represents a paradigm shift in SDV lifecycle management, enabling a continuous improvement cycle: OEMs add new content based on user interactions, the SLM is fine-tuned with the updated information, and the update is seamlessly deployed over the air. This not only enhances the user experience by keeping vehicle information current but also opens up new possibilities for OEMs to introduce and educate users about new features or purchasable additions. By using the power of Gen AI, IoT, and edge computing, the approach shown in this Vehicle User Guide Gen AI application is paving the way for a more connected, informed, and adaptive driving experience in the age of SDVs.
The diagram below (Figure 1) illustrates the solution architecture for fine-tuning the model, testing it on AVH, and deploying the SLM to the edge device incorporating a feedback collection mechanism:
Figure 1: Solution architecture diagram for Gen AI based vehicle user guide
The numbered references in the previous diagram correspond to the following:
A demonstration of this in-vehicle Gen AI application, powered by an SLM, was showcased at CES 2025 by AWS, running on the Raspberry Pi 5 using the llama.cpp framework with the KleidiAI-optimized routines.
The following sections will dive deeper into the details of KleidiAI and the quantization schema adopted by this demo.
KleidiAI is an open source library designed for AI framework developers. It offers optimized performance-critical routines for Arm® CPUs. Initially introduced in May 2024, the library now provides optimizations for matrix multiplication across various data types, including 32-bit floating point, Bfloat16, and extremely low-precision formats like 4-bit fixed-point. These optimizations support multiple Arm CPU technologies, such as SDOT and I8MM for 8-bit computation and MLA for 32-bit floating-point operations.
Running on the Raspberry Pi 5's four Arm® Cortex-A76 cores, the demo used KleidiAI's SDOT optimizations, built on one of the earliest instructions designed for AI workloads on Arm® CPUs. In fact, SDOT was first introduced as part of Armv8.2-A, which was released in 2016.
The SDOT instruction reflects Arm's long-standing commitment to enhancing AI performance on CPUs. Following SDOT, Arm has progressively introduced new AI-focused instructions, such as I8MM for more efficient 8-bit matrix multiplication, and added Bfloat16 support, which improves performance over 32-bit floating point while halving memory usage.
For the demonstration with the Raspberry Pi 5, KleidiAI was fundamental to speeding up matrix multiplication using integer 4-bit quantization with per-block quantization (also known as Q4_0 in llama.cpp).
The Q4_0 matrix multiplication in llama.cpp involves two input matrices in different formats: the left-hand side (LHS) matrix of activations, stored in 32-bit floating point, and the right-hand side (RHS) matrix of weights, quantized to 4-bit integers with per-block quantization.
Therefore, when referring to 4-bit integer matrix multiplication, it specifically applies to the format of the weights, which is visually represented in the following image:
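The per-block 4-bit scheme for the weights can be sketched as follows. This is an illustrative simplification of the Q4_0 idea, not the exact llama.cpp code: llama.cpp stores each block's scale as a 16-bit float and packs two quantized values per byte, while here the scale stays a Python float and the values stay unpacked.

```python
# Simplified sketch of Q4_0-style per-block 4-bit quantization.
# Each block of 32 weights shares one scale factor.

BLOCK_SIZE = 32

def quantize_q4_block(block):
    """Quantize one block of 32 floats to 4-bit codes (0..15) plus a scale."""
    amax = max(block, key=abs)                 # value with the largest magnitude
    scale = amax / -8.0 if amax != 0 else 1.0  # maps amax onto the int4 range
    quants = [min(15, max(0, round(x / scale) + 8)) for x in block]
    return scale, quants

def dequantize_q4_block(scale, quants):
    """Recover approximate floats: (q - 8) * scale."""
    return [(q - 8) * scale for q in quants]
```

Because one scale serves 32 values, the per-weight storage cost is 4 bits plus a small amortized share of the 16-bit scale.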
At this point, how did KleidiAI leverage the SDOT instruction designed explicitly for 8-bit integer dot products when neither the LHS nor RHS matrices are in 8-bit format?
Both input matrices must be converted to 8-bit integer values.
For the LHS matrix, an additional step is required before the matrix multiplication routine: dynamic quantization to an 8-bit fixed-point format. This process dynamically quantizes the LHS matrix to 8-bit using per-block quantization, where the quantization scale is applied to blocks of 32 consecutive 8-bit integer values and stored as a 16-bit floating-point value, similar to the 4-bit quantization approach.
Dynamic quantization minimizes the risk of accuracy degradation because the quantization scale factor is computed at inference time based on the minimum and maximum values within each block. This approach contrasts with static quantization, where the scale factor is predetermined and remains fixed.
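The dynamic quantization step can be sketched like this. The sketch uses a symmetric scale computed from the block's largest absolute value, a common choice for 8-bit dynamic quantization; the exact llama.cpp routine differs in its details, and it stores the per-block scale as a 16-bit float rather than a Python float.

```python
# Illustrative dynamic per-block 8-bit quantization for the LHS (activation)
# matrix. The scale is computed at inference time from this block's values.

BLOCK_SIZE = 32

def dynamic_quantize_q8_block(block):
    """Quantize one block of 32 floats to signed 8-bit ints plus a scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 127.0 if amax != 0 else 1.0
    quants = [max(-127, min(127, round(x / scale))) for x in block]
    return scale, quants
```

Because the scale adapts to each block's actual value range at run time, no calibration data is needed ahead of time.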
For the RHS matrix, no extra steps are required before the matrix multiplication routine. In fact, the 4-bit quantization acts as a compressed format, while the actual computation is carried out in 8-bit. Therefore, before passing the 4-bit values to the dot product instruction, they are first converted to 8-bit.
The conversion from 4-bit to 8-bit is computationally inexpensive, as it only requires a simple shift/mask operation.
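The shift/mask conversion, and the way the two per-block scales re-enter after the integer dot product, can be sketched as follows. The nibble packing layout here is an assumption for illustration; llama.cpp uses its own memory layout optimized for SIMD access.

```python
# Sketch of unpacking two 4-bit weights from one byte with a shift and a mask,
# then computing a block dot product in integer arithmetic (as the SDOT path
# does), with the float scales applied once per block at the end.

def unpack_nibbles(byte):
    """Extract two signed 4-bit weights from one byte (offset-8 encoding)."""
    lo = (byte & 0x0F) - 8   # low nibble  -> signed value in [-8, 7]
    hi = (byte >> 4) - 8     # high nibble -> signed value in [-8, 7]
    return lo, hi

def block_dot(w_bytes, w_scale, a_quants, a_scale):
    """Integer dot product of unpacked 4-bit weights with 8-bit activations,
    rescaled once by the product of the two per-block scales."""
    weights = []
    for b in w_bytes:
        lo, hi = unpack_nibbles(b)
        weights.extend([lo, hi])
    acc = sum(w * a for w, a in zip(weights, a_quants))  # pure integer math
    return acc * w_scale * a_scale
```

On real hardware, the inner integer accumulation is what SDOT executes four multiply-accumulates at a time; the scales are applied only once per block, so the conversion overhead stays small.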
However, if the conversion is so cheap, why not store the weights in 8-bit directly and eliminate the conversion step?
There are two key advantages to using 4-bit quantization: it halves the memory and storage footprint of the model weights compared with 8-bit, and the smaller weights reduce memory-bandwidth pressure, which often limits LLM inference performance more than raw compute does.
KleidiAI is already integrated into llama.cpp, so developers do not need additional dependencies to get the best performance from Arm® CPUs based on Armv8.2-A and above.
This integration means that developers running llama.cpp on mobile devices, embedded computing platforms, and servers based on Arm® processors can now experience better performance transparently.
While llama.cpp is a good option for running LLMs on Arm® CPUs, developers can use other highly performant frameworks for Gen AI that also embrace KleidiAI optimizations. For example (in alphabetical order): ExecuTorch, MediaPipe, MNN, and PyTorch. Simply select the latest version of the framework.
Therefore, if you are considering deploying Gen AI models on Arm CPUs, exploring these frameworks can help you achieve optimized performance and efficiency.
The convergence of SDVs and Gen AI is ushering in a new era of automotive innovation, where vehicles become increasingly intelligent and user-centric. The demonstration of an in-vehicle Gen AI application, powered by Arm® KleidiAI optimizations and AWS services, showcases how emerging technologies can help solve real-world challenges in the automotive industry. By achieving response times of 1 to 3 seconds and cutting development time by weeks, this solution shows that efficient, offline-capable Gen AI applications are not only possible but also practical for in-vehicle deployments.
The future of automotive technology lies in solutions that seamlessly blend edge computing, IoT capabilities, and AI. As vehicles continue to evolve with increasing software complexity, solutions like the one presented here will become crucial in bridging the gap between advanced vehicle capabilities and users' ability to understand and benefit from them.