The automotive industry is witnessing a transformative shift with the integration of artificial intelligence, particularly with generative AI (Gen AI). A recent McKinsey survey of automotive and manufacturing executives revealed that more than 40% of respondents are investing up to €5 million in Gen AI research and development, and over 10% are investing more than €20 million.
Automotive companies are using Gen AI services on Amazon Web Services (AWS) for a wide range of optimizations and productivity gains. For example, BMW Group developed a Gen AI assistant to help accelerate its infrastructure optimization on AWS. Audi and Reply worked with AWS to improve their enterprise search experience through a Gen AI chatbot built on Amazon SageMaker. Ferrari used Amazon Bedrock's large language models (LLMs) alongside Amazon Personalize to create a car configurator, and also implemented a Gen AI chatbot for after-sales support and technical assistance.
With the move towards software-defined vehicles (SDVs), the number of lines of code in vehicles is expected to increase from 100 million lines per vehicle to about 300 million lines by 2030. Gen AI for automotive, together with SDVs, is enabling in-vehicle use cases across performance and comfort that help enhance the driving and vehicle experience.
In this blog post, Arm and AWS will present one such in-vehicle Gen AI use case along with its implementation details.
As vehicles become increasingly sophisticated, with the ability to receive post-production feature updates such as parking assist or lane keeping, a new challenge has emerged: keeping vehicle owners informed about these changes and new capabilities. Traditional methods of updating printed or online manuals have proven inadequate, often leaving drivers unaware of the full potential of their vehicles.
To address this challenge, AWS developed a demonstration that uses the power of Gen AI, edge computing, and the Internet of Things (IoT). At the heart of this solution is an in-vehicle application powered by a Small Language Model (SLM), which is designed to enable drivers to access up-to-date vehicle information through natural voice interactions. The demo application is designed to operate offline after deployment, ensuring that drivers have access to critical information about their vehicle even without an internet connection.
The implementation of this solution brings together several advanced technologies to create a seamless and efficient user experience. The demo application deploys a local SLM within the vehicle, optimized for performance using the Arm® KleidiAI routines. With KleidiAI, SLM inference achieved response times of 1 to 3 seconds, compared with 8 to about 19 seconds observed without the KleidiAI optimizations. Using KleidiAI also saved about 6 weeks of development time, because the developer did not need to work on low-level software optimizations.
Arm Virtual Hardware (AVH) provides access to many popular IoT development kits on AWS. Developing and testing on AVH provides time savings for embedded application development when the physical device is unavailable, or inaccessible by globally distributed teams. AWS successfully tested the demo application on the automotive virtual platform, where AVH provided a virtual instance of the Raspberry Pi device. The same KleidiAI optimizations are also available on AVH.
One of the key features of the Gen AI application running on the edge device is its ability to receive over-the-air updates using, in part, AWS IoT Greengrass Lite, helping to ensure the information provided to drivers is always current. AWS IoT Greengrass Lite is memory-efficient, using just 5 MB of RAM on the edge device where it is installed. The solution also incorporates an automated quality monitoring and feedback loop that continuously evaluates the relevance and accuracy of the SLM's responses: a comparison system flags responses falling outside the expected quality threshold for review. The collected feedback data is visualized in near real time through a dashboard on AWS, allowing OEM quality assurance teams to identify areas for improvement and initiate updates as needed.
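The quality-monitoring idea described above can be sketched in a few lines: score each SLM response against a reference answer and flag low scores for review. The token-overlap scoring method and the 0.6 threshold here are illustrative assumptions, not the production implementation.

```python
# Minimal sketch of a response quality check (illustrative assumptions:
# the scoring metric and threshold are NOT from the actual AWS solution).

def overlap_score(response, reference):
    """Jaccard overlap between the token sets of two strings (0.0 to 1.0)."""
    r, ref = set(response.lower().split()), set(reference.lower().split())
    return len(r & ref) / len(r | ref) if r | ref else 1.0

def flag_for_review(pairs, threshold=0.6):
    """Return indices of (response, reference) pairs scoring below threshold."""
    return [i for i, (resp, ref) in enumerate(pairs)
            if overlap_score(resp, ref) < threshold]
```

In a real pipeline, the flagged indices would be pushed to the AWS dashboard for the OEM quality assurance team to review.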
The benefits of this Gen AI-powered solution extend beyond providing accurate information to drivers. It represents a paradigm shift in SDV lifecycle management, enabling a continuous improvement cycle: OEMs add new content based on user interactions, the SLM is fine-tuned with the updated information, and the update is seamlessly deployed over the air. This not only enhances the user experience by keeping vehicle information current but also opens up new possibilities for OEMs to introduce and educate users about new features or purchasable additions. By using the power of Gen AI, IoT, and edge computing, the approach shown in this Vehicle User Guide Gen AI application is paving the way for a more connected, informed, and adaptive driving experience in the age of SDVs.
The diagram below (Figure 1) illustrates the solution architecture for fine-tuning the model, testing it on AVH, and deploying the SLM to the edge device incorporating a feedback collection mechanism:
Figure 1: Solution architecture diagram for Gen AI based vehicle user guide
The numbered references in the previous diagram correspond to the following:
A demonstration of this in-vehicle Gen AI application, powered by an SLM, was showcased at CES 2025 by AWS, running on the Raspberry Pi 5 using the llama.cpp framework with the KleidiAI-optimized routines.
The following sections will dive deeper into the details of KleidiAI and the quantization schema adopted by this demo.
KleidiAI is an open source library designed for AI framework developers. It offers optimized performance-critical routines for Arm® CPUs. Initially introduced in May 2024, the library now provides optimizations for matrix multiplication across various data types, including 32-bit floating point, Bfloat16, and extremely low-precision formats like 4-bit fixed-point. These optimizations support multiple Arm CPU technologies, such as SDOT and I8MM for 8-bit computation and MLA for 32-bit floating-point operations.
Running on the Raspberry Pi 5's four Arm® Cortex-A76 cores, the demo used KleidiAI's SDOT optimizations, built on one of the earliest instructions designed for AI workloads on Arm® CPUs. In fact, SDOT was first introduced as part of Armv8.2-A, which was released in 2016.
The SDOT instruction reflects Arm's long-standing commitment to enhancing AI performance on CPUs. Following SDOT, Arm has progressively introduced new AI-focused instructions, such as I8MM for more efficient 8-bit matrix multiplication, and added Bfloat16 support, which improves performance over 32-bit floating point while halving memory usage.
For the demonstration with the Raspberry Pi 5, KleidiAI was fundamental to speeding up matrix multiplication using integer 4-bit quantization with per-block quantization (also known as Q4_0 in llama.cpp).
The Q4_0 matrix multiplication in llama.cpp involves two input matrices in different formats: the left-hand side (LHS) matrix of activations, stored in 32-bit floating point, and the right-hand side (RHS) matrix of weights, quantized to 4-bit integers with per-block quantization.
Therefore, when referring to 4-bit integer matrix multiplication, it specifically applies to the format of the weights, which is visually represented in the following image:
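The per-block 4-bit scheme for the weights can be sketched as follows. This is an illustrative simplification of the Q4_0 idea, not the exact llama.cpp code: llama.cpp stores each block's scale as a 16-bit float and packs two quantized values per byte, while here the scale stays a Python float and the values stay unpacked.

```python
# Simplified sketch of Q4_0-style per-block 4-bit quantization.
# Each block of 32 weights shares one scale factor.

BLOCK_SIZE = 32

def quantize_q4_block(block):
    """Quantize one block of 32 floats to 4-bit codes (0..15) plus a scale."""
    amax = max(block, key=abs)                 # value with the largest magnitude
    scale = amax / -8.0 if amax != 0 else 1.0  # maps amax onto the int4 range
    quants = [min(15, max(0, round(x / scale) + 8)) for x in block]
    return scale, quants

def dequantize_q4_block(scale, quants):
    """Recover approximate floats: (q - 8) * scale."""
    return [(q - 8) * scale for q in quants]
```

Because one scale serves 32 values, the per-weight storage cost is 4 bits plus a small amortized share of the 16-bit scale.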
At this point, how did KleidiAI leverage the SDOT instruction designed explicitly for 8-bit integer dot products when neither the LHS nor RHS matrices are in 8-bit format?
Both input matrices must be converted to 8-bit integer values.
For the LHS matrix, an additional step is required before the matrix multiplication routine: dynamic quantization to an 8-bit fixed-point format. This process dynamically quantizes the LHS matrix to 8-bit using per-block quantization, where the quantization scale is applied to blocks of 32 consecutive 8-bit integer values and stored as a 16-bit floating-point value, similar to the 4-bit quantization approach.
Dynamic quantization minimizes the risk of accuracy degradation because the quantization scale factor is computed at inference time based on the minimum and maximum values within each block. This approach contrasts with static quantization, where the scale factor is predetermined and remains fixed.
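The dynamic quantization step can be sketched like this. The sketch uses a symmetric scale computed from the block's largest absolute value, a common choice for 8-bit dynamic quantization; the exact llama.cpp routine differs in its details, and it stores the per-block scale as a 16-bit float rather than a Python float.

```python
# Illustrative dynamic per-block 8-bit quantization for the LHS (activation)
# matrix. The scale is computed at inference time from this block's values.

BLOCK_SIZE = 32

def dynamic_quantize_q8_block(block):
    """Quantize one block of 32 floats to signed 8-bit ints plus a scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 127.0 if amax != 0 else 1.0
    quants = [max(-127, min(127, round(x / scale))) for x in block]
    return scale, quants
```

Because the scale adapts to each block's actual value range at run time, no calibration data is needed ahead of time.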
For the RHS matrix, no extra steps are required before the matrix multiplication routine. In fact, the 4-bit quantization acts as a compressed format, while the actual computation is carried out in 8-bit. Therefore, before passing the 4-bit values to the dot product instruction, they are first converted to 8-bit.
The conversion from 4-bit to 8-bit is computationally inexpensive, as it only requires a simple shift/mask operation.
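The shift/mask conversion, and the way the two per-block scales re-enter after the integer dot product, can be sketched as follows. The nibble packing layout here is an assumption for illustration; llama.cpp uses its own memory layout optimized for SIMD access.

```python
# Sketch of unpacking two 4-bit weights from one byte with a shift and a mask,
# then computing a block dot product in integer arithmetic (as the SDOT path
# does), with the float scales applied once per block at the end.

def unpack_nibbles(byte):
    """Extract two signed 4-bit weights from one byte (offset-8 encoding)."""
    lo = (byte & 0x0F) - 8   # low nibble  -> signed value in [-8, 7]
    hi = (byte >> 4) - 8     # high nibble -> signed value in [-8, 7]
    return lo, hi

def block_dot(w_bytes, w_scale, a_quants, a_scale):
    """Integer dot product of unpacked 4-bit weights with 8-bit activations,
    rescaled once by the product of the two per-block scales."""
    weights = []
    for b in w_bytes:
        lo, hi = unpack_nibbles(b)
        weights.extend([lo, hi])
    acc = sum(w * a for w, a in zip(weights, a_quants))  # pure integer math
    return acc * w_scale * a_scale
```

On real hardware, the inner integer accumulation is what SDOT executes four multiply-accumulates at a time; the scales are applied only once per block, so the conversion overhead stays small.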
However, if the conversion is so cheap, why not store the weights in 8-bit directly and eliminate the conversion step?
There are two key advantages to using 4-bit quantization: it halves the memory and storage footprint of the model weights compared with 8-bit, and the smaller weights reduce memory-bandwidth pressure, which often limits LLM inference performance more than raw compute does.
KleidiAI is already integrated into llama.cpp, so developers do not need additional dependencies to get the best performance from Arm® CPUs based on Armv8.2-A and above.
This integration means that developers running llama.cpp on mobile devices, embedded computing platforms, and servers based on Arm® processors can now experience better performance transparently.
While llama.cpp is a good option for running LLMs on Arm® CPUs, developers can use other highly performant frameworks for Gen AI that also embrace KleidiAI optimizations. For example (in alphabetical order): ExecuTorch, MediaPipe, MNN, and PyTorch. Simply select the latest version of the framework.
Therefore, if you are considering deploying Gen AI models on Arm CPUs, exploring these frameworks can help you achieve optimized performance and efficiency.
The convergence of SDVs and Gen AI is ushering in a new era of automotive innovation, where vehicles become increasingly intelligent and user-centric. The demonstration of an in-vehicle Gen AI application, powered by Arm® KleidiAI optimizations and AWS services, showcases how emerging technologies can help solve real-world challenges in the automotive industry. By achieving response times of 1 to 3 seconds and cutting development time by weeks, this solution shows that efficient, offline-capable Gen AI applications are not only possible but also practical for in-vehicle deployments.
The future of automotive technology lies in solutions that seamlessly blend edge computing, IoT capabilities, and AI. As vehicles continue to evolve with increasing software complexity, solutions like the one presented here will become crucial in bridging the gap between advanced vehicle capabilities and users' ability to understand and benefit from them.