Building an on-device multimodal assistant for automobiles

September 18, 2025

14 minute read time.

Introduction: the era of AI-defined vehicles

The automotive industry is entering a new era: the AI-defined vehicle. At Arm, we envision embedding AI at the core of automotive compute. This enables vehicles to sense, reason, and adapt in real time to driver needs and the environment. With platforms like Zena CSS, powerful and secure AI processing can now happen directly inside the vehicle. Local intelligence not only enhances privacy and responsiveness but also unlocks new levels of safety, convenience, and personalization.

Our project is a step toward this vision. Building an on-device, agentic multimodal assistant that proves how modular, collaborative AI can enhance in-vehicle experiences. All running efficiently on local hardware.

In this paradigm, AI agents are integral, proactive partners for drivers. Assisting with everything from diagnostics to environmental control.

Glossary

Agent: A modular software component responsible for a specific capability. Each agent processes its own inputs and outputs and communicates with others through defined interfaces.

Supervisor: A coordinating agent that interprets user intent and routes tasks to the right agent or tool. It ensures each request is handled by the correct component.

Retrieval-Augmented Generation (RAG): A technique that improves accuracy by retrieving relevant passages from a local knowledge base (vector store). These passages are provided as context to the language model. This allows the assistant to answer vehicle-specific questions without internet access.

Vector store: A database that stores embeddings, which are numeric representations of text or images. This means semantically similar items can be retrieved quickly.

Quantization: A compression technique that represents model weights with fewer bits (e.g., 32‑bit to 4-bit). This reduces memory usage and speeds up inference, allowing large models to run on resource-constrained on-device hardware.

On-device/edge inference: Running AI models directly on the vehicle’s compute. Therefore, enhancing privacy and reliability.

System architecture: Modular intelligence on the edge

The assistant’s architecture is designed around modular agents. Each agent handles a specific capability, such as guardrail, vehicle control, retrieval, and vision. The Supervisor Agent coordinates these components, dispatching tasks based on user intent and system context.

Figure 1: High-level architecture of the agentic automobile assistant.

Components

Input module: Converts spoken queries to text via whisper.cpp and forwards them to the Guardrail Agent. It also pulls image data from the camera at a configurable frequency for the Vision Agent.

Guardrail agent: Validates user input. It rejects unsafe or malicious instructions, such as access system prompts or override safety-critical controls.

Supervisor agent: Receives validated input and decides which agent or tool to invoke. This interface scales by keeping a registry of available tools and agents, each described in a consistent format. This design allows new functionality to be plugged in easily without altering the Supervisor’s core logic.

Vision agent: Uses a Vision-Language Model (VLM) to analyze camera feeds. For example, whether the driver is following traffic rules.

Output queue: Acts as an intermediary between the core agent system and the output module. This ensures responses are delivered in the correct order, whether for speech synthesis or user interface. This maintains consistency and reliability in driver-facing outputs.

Utilities: Agents interact with a set of local tools exposed through the Model Context Protocol (MCP) server or as direct function calls. These tools are the assistant’s interface to the vehicle and its supporting systems. In the current prototype, they use mock data, but the design anticipates integration with real automotive subsystems.

Driving and Safety
- Contact emergency services in the event of a crash or detected hazard.
- Configure vehicle environment settings such as climate control, sunroof, and cabin lighting.
- Send notifications and alerts to the driver.
Personal Assistance
- Retrieve vehicle information from a local vector database (via RAG).
- Control infotainment features, such as Bluetooth connections and media playback.

By exposing these capabilities as modular tools, the system maintains a clean separation between reasoning (agents) and actuation (tools). This ensures extensibility and easier integration with new automotive features in the future.

Example input-output sequence

Suppose the driver says: “Pair my phone to the car’s Bluetooth.”

Input module: Transcribes the driver’s speech into text.
Core:
1. Guardrail Agent: Validates the request to ensure it is safe and within scope.
2. Supervisor Agent: Interprets the intent as an environment control task and routes it to the Vehicle Control Agent.
3. Vehicle Control Agent: Calls the Automobile MCP tool to handle Bluetooth pairing with the driver’s device and outputs the result.
4. Output queue: Buffers the result, ensuring they are delivered in order and not interrupted by concurrent tasks.
Output module: Sends responses to the TTS client and UI.

The assistant then confirms: “Your device has been paired successfully.”

The following logs show how the system executes the request in real time. Each entry corresponds to a step in the pipeline. Excluding the time required to connect the Bluetooth device, the agent workflow completes in under five seconds.

Figure 2: Agent logs showing Voice Transcription, Agent Handoff, and Tool Calling.

Model and hardware configuration

The system runs with two models in memory:

A large VLM (InternVL3-Instruct 14B) for image-text reasoning in the Vision Agent.

A smaller LLM (Jan-nano 4B), for all other text tasks, including retrieval and tool orchestration.

Both models are quantized to 4-bit and served through llama.cpp. We selected Unsloth’s Dynamic 2.0 GGUF format. It offered the best trade-off between model size and runtime performance among leading quantization methods. This choice allowed both models to fit within 14 GB of VRAM.

We developed the assistant prototype on an Amazon EC2 g5g.4xlarge instance. It is equipped with 16 Arm‑based Graviton2 vCPUs, 32 GB of RAM, and a T4G Tensor Core GPU with 16 GB of VRAM. The entire compute and memory footprint is largely dedicated to running ML workloads. This aligns with real-world vehicle constraints.

A car in 2025 contains 16 GB of DRAM on average, with this amount projected to triple by 2026. This parity ensures our development environment mirrors the resource budget available for production-capable AI agents. The use of Graviton2 also supports Arm’s SOAFEE framework. This enables development and testing of containerized automotive workloads in the cloud before deploying to in-vehicle systems.

Design implications

This architecture supports multimodal interaction. It operates reliably without reliance on external connectivity. It can also be extended by adding new agents, tools, or capabilities with minimal changes to existing components.

Progressive capabilities

The assistant has been developed in stages. Each stage introduces new perception and action capabilities. This incremental approach illustrates how the system advances from basic information retrieval to context-aware autonomy.

Level 1: Inspect vehicle status using voice commands

At the base level, the assistant can retrieve real-time vehicle state from onboard sensors in response to spoken queries. For example, “What is the current cabin temperature?”. This provides immediate, hands-free access to vehicle information.

Level 2: Modify vehicle status using voice commands

Building on retrieval, the assistant can execute driver commands to adjust vehicle settings. For example, “Set the temperature to 70 degrees”. This shifts the role of the assistant from an information source to an active participant in the vehicle operation.

Level 3: Visual interpretation and driver alerts

With vision integrated, the assistant processes the driving environment through real-time visual input. For instance, it can detect when the vehicle is in a High Occupancy Vehicle (HOV) lane and issue a warning if the occupancy requirement is not met.

Here, the assistant reasons with complex, dynamic, real-world situations. It interprets context, not just instructions.

Figure 3: Synthetically generated images showing what the Vision Agent observes from a car’s interior and exterior on the highway.

Figure 4: The assistant interprets visual input, recognizes an HOV lane violation, and provides a safety/compliance warning.

Level 4: Visual-informed autonomous action

The assistant combines deep visual understanding with autonomous decision-making. This enables it to deliver critical safety interventions. For example, if a crash is detected, the assistant can autonomously contact emergency services. Ensuring help is dispatched even if the driver is unable to respond.

This capability marks a shift from passive monitoring to active, situational response. The assistant not only perceives and interprets its environment but also acts decisively to ensure driver safety.

Figure 5: Synthetic images of a car’s interior and exterior during an accident.

Figure 6: The assistant recognizes an emergency and contacts emergency services.

Altogether, these progressive levels show a clear trajectory from simple sensor queries to autonomous, context-aware behavior. Each step brings us closer to in-vehicle assistants that can truly perceive, understand, and act within the vehicle environment. Paving the way toward fully AI-defined mobility.

Key learnings

Designing for scalability

We compared two common approaches to agentic system design.

A monolithic architecture, where a single large agent manages all tasks
A modular architecture, where a supervisor agent acts as a router and facade for specialized sub-agents.

Internal evaluations show that the modular approach scales more effectively and delivers more consistent results. It allows individual components to be updated or extended without disrupting the overall system. Like replacing a car part without rebuilding the entire engine.

Our observations are consistent with findings from LangChain. Their research shows that supervisor-based multi-agent architectures maintain performance as the number of tools grows. In contrast, single-agent systems degrade rapidly when overloaded with context or capabilities.

Figure 7: Agentic Architecture Scalability—Source: LangChain.

Enabling real-time inference on constrained hardware

Running AI models on edge hardware is challenging due to tight compute and memory budgets. To address these limits, we routed requests between the VLM and the compact LLM and applied 4-bit quantization to both.

The larger VLM delivers strong multimodal reasoning but incurs higher latency. As a result, non-vision tasks are routed to the LLM to maintain responsiveness. This division of labor achieves a balance between functionality and responsiveness. It enables real-time multimodal inference for automotive workloads on resource-constrained devices.

Figure 8: Impact of KV cache warmup on end-to-end latency.

We further reduced latency by precomputing the KV cache for the system prompt. The cache reused as the prefix for all conversations. In ablation tests, this warm-up procedure delivered a 2x speedup in end-to-end latency for the input query: “What is the car fuel capacity?”

Enhancing accuracy with on-device RAG

Incorporating RAG improved the assistant’s responsiveness and reliability. By embedding a knowledge base like the car manual, the assistant can answer context-aware, technical questions swiftly and privately, without an internet connection. This significantly improves real-time usefulness and builds driver trust.

Multimodal understanding unlocks new possibilities

Integrating a vision component was pivotal. It transformed the assistant from just a listener to a context-aware observer that could understand visual cues inside and outside the vehicle. Cues such as passenger safety or lane conditions, enabling more intuitive and proactive behavior.

Observability enables faster iteration

We instrumented the system with MLflow tracing and the OpenAI Agents SDK. This provided visibility into every agent’s execution time, tool usage, and handoff flow. The added visibility accelerates debugging, performance tuning, and design improvements as the system matures.

Figure 9: MLflow traces provide fine-grained visibility per request.

Constraining agent capabilities

Our architecture enforces strict constraints on agent behavior to ensure safety and reliability. Each agent only has access to the minimal set of tools it needs. This keeps its scope contained through the principle of least privilege.

We designed containment through the Guardrail Agent. It screens all inputs using rule-based checks, blocking unsafe or out-of-scope commands before they reach sensitive tools. We validate robustness against malicious or adversarial inputs through adversarial testing. We use frameworks like DeepEval to reduce the risk of unintended agent behavior that could result from sophisticated input manipulation.

Call to Action

To move this agentic, on-device automobile assistant from prototype to real-world deployment, we must balance four priorities: model intelligence, latency, memory usage, and safety.

Advancing model intelligence on edge hardware requires progress in post-training techniques such as pruning, quantization, and knowledge distillation. These methods enable compact models to maintain strong performance despite limited parameter capacity.

Further reductions in latency depend on improvements in hardware architecture to address memory-bound inference. For example, placing compute closer to memory to minimize data movement bottlenecks. Optimizing memory usage also calls for hardware-aware strategies, including quantization-aware training, efficient attention mechanisms, and robust lightweight model architectures. Such strategies ensure models can operate within strict VRAM budgets while still remaining functional.

Most critically, safety must be embedded at every layer. As Arm's IP supports ISO/SAE 21434 compliance, this system is well positioned to align with automotive cybersecurity standards. However, deploying such a system at scale requires alignment with industry-wide validation practices and thorough adversarial testing to meet both technical and regulatory requirements.

Addressing these interwoven challenges requires continued optimization, security risk engineering, and cross-industry collaboration. These efforts are vital for bringing safe, capable, and real-time AI assistants to the vehicles of the future.

Looking ahead: the future of in-car intelligence

This prototype is just the beginning. The next frontiers for on-device assistants are:

Proactive, context-aware reasoning

By leveraging world models, AI systems that learn and simulate real-world dynamics, future assistants can not only respond to human inputs and predefined triggers. They can also anticipate scenarios, plan actions, and adapt to long-term consequences. Training these models in rich virtual environments will allow them to handle complex driving conditions safely and reliably before ever hitting the road.

Personalized, continual learning

Future assistants will continuously adapt to each driver. Efficient fine-tuning techniques like QLoRA could allow models to adapt to the users' unique preferences, driving style, and specific vehicle vocabulary over time. Over time, this makes interactions feel more natural and tailored to each individual.

Closing Thoughts

We are only beginning to explore what is possible when powerful, privacy-first AI resides directly in your car. The agentic assistant developed here demonstrates that intelligent, collaborative, and extensible in-vehicle AI is within reach. Every advance in software and hardware brings us closer to cars that are not just a mode of transport, but truly intelligent and intuitive partners on every journey.

To learn more about how Arm is advancing this vision, visit the Arm Automotive Developer Platform and the Automotive Learning Path for resources on building next-generation vehicle software within the Arm ecosystem.

Acknowledgments

Special thanks to the Arm Developer Advocacy team and my team members: Barbara Corriero, Han Yin, and Tim Ko for guidance throughout the internship.

Automotive blog

Building an on-device multimodal assistant for automobiles

Aaron Ang

In this blog post, learn how we are only starting to see what is possible when powerful, privacy-first AI runs directly in your car.
- September 18, 2025
Building safe and scalable automotive systems with Functional Safety, containers, and DDS

odinlmshen

Build safe, modular SDV systems on Arm with functional safety, containerized workloads, and DDS-based real-time communication.
- July 27, 2025
Driving the future of autonomous development: Deploying Open AD Kit on Arm Neoverse

odinlmshen

Accelerate SDV development with Arm’s new Learning Path: deploy Open AD Kit on Neoverse to simulate AV stacks using SOAFEE, ROS 2, and Docker.
- June 6, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog

Tools, Software and IDEs blog

Building an on-device multimodal assistant for automobiles

Introduction: the era of AI-defined vehicles

Glossary

System architecture: Modular intelligence on the edge

Components

Example input-output sequence

Model and hardware configuration

Design implications

Progressive capabilities

Level 1: Inspect vehicle status using voice commands

Level 2: Modify vehicle status using voice commands

Level 3: Visual interpretation and driver alerts

Level 4: Visual-informed autonomous action

Key learnings

Designing for scalability

Enhancing accuracy with on-device RAG

Multimodal understanding unlocks new possibilities

Observability enables faster iteration

Constraining agent capabilities

Call to Action

Looking ahead: the future of in-car intelligence

Proactive, context-aware reasoning

Personalized, continual learning

Closing Thoughts

Acknowledgments

Building an on-device multimodal assistant for automobiles

Building safe and scalable automotive systems with Functional Safety, containers, and DDS

Driving the future of autonomous development: Deploying Open AD Kit on Arm Neoverse