Arm’s latest mobile CPU architecture includes the Scalable Matrix Extension 2 (SME2), a built-in matrix-processing extension. SME2 accelerates the matrix-heavy AI operations behind large language models (LLMs), media processing, speech recognition, computer vision, real-time apps (AI assistants, computational photography, and AI filters), and multimodal apps.
We wanted to put this to the test to see the likely impact on real-world apps running on SME2-based devices, so we developed a Smart Yoga Tutor. This tutor uses AI techniques including pose estimation, LLMs and text-to-speech (TTS). With the newly launched SME2-enabled Arm C1 CPU cluster, we achieved a 2.5x speedup on the AI pipeline in our Smart Yoga Tutor. However, let us start at the beginning of the Yoga Tutor story:
In 2021, we launched a Smart TV fitness application leveraging BlazePose to monitor workout poses. This system provided basic pose matching, comparing the user to a pre-recorded instructor. However, it lacked the ability to offer nuanced feedback or adapt to individual needs.
With the advent of LLMs, we realized that this could be transformed into a personalized instructor: one able to give real-time, conversational feedback tailored to each user.
Our goal was to create an intelligent yoga assistant that goes beyond simple pose detection. By integrating real-time pose estimation with an LLM and TTS, the system offers personalized, conversational feedback, akin to a virtual yoga instructor.
To do this, we needed to overcome several technical hurdles:
We implemented the demo on an Android smartphone, ready for the first SME2 devices, although it could also run on a smart TV or another consumer device. Where SME2 is not available, AI performance is provided by the earlier Neon and SVE technologies.
We first explored incorporating Vision-Language Models (VLMs). However, due to their substantial computational requirements and slower inference times on mobile hardware, they were unsuitable for our real-time application.
To achieve real-time performance on mobile devices, we used MediaPipe’s BlazePose to convert the incoming video stream into a compact set of body landmarks that we could feed to the LLM. We chose BlazePose for its proven accuracy and speed.
For generating personalized feedback, we used Microsoft’s Phi-4-mini language model. It offered the best quality output within the model size and latency we could afford. The llama.cpp framework also supports this model.
We tried using Android’s built-in TTS engine to vocalize the feedback. However, its output was too robotic, detracting from the user experience. Given that our application had available processing capacity, we switched to using Piper, an ONNX-based TTS system that runs on the CPU. While Piper is less performant than the native TTS engine, it delivers more natural-sounding speech, enhancing user engagement.
llama.cpp provides extensive support for many LLMs, including Phi, LLaMA, DeepSeek, Gemma and Qwen. This flexibility allowed us to experiment and identify the optimal balance between performance and output quality. Designed for efficient CPU-based inference, llama.cpp enables on-device LLM execution, reducing latency and enhancing privacy.
llama.cpp also integrates with Arm KleidiAI, a suite of optimized micro-kernels for Arm CPUs. This integration enables efficient execution of our 4-bit quantized LLM on Arm-based devices. KleidiAI’s SME2 optimizations are already available in the frameworks that integrate it, including llama.cpp, Alibaba MNN, Google’s LiteRT and MediaPipe, and Microsoft’s ONNX Runtime.
Overall, llama.cpp delivered the best speed, SME2 integration, and model flexibility.
We needed to integrate llama.cpp into a Kotlin Android app, which is done through the Java Native Interface (JNI). First, llama.cpp had to be cross-compiled for Android with the Android NDK. Then our JNI integration needed careful work so that the app preserved the performance and output quality of the llama.cpp binary when it is called from Kotlin code.
Our Yoga Tutor employs a real-time processing pipeline that delivers immediate, context-appropriate feedback. The application captures live video input from the device’s camera and processes each frame using MediaPipe’s BlazePose model to extract 33 3D body landmarks.
These landmarks are turned into joint angles and compared to a pre-recorded instructor’s pose. Based on any deviations, a score is computed to assess user accuracy.
Using this score and the type of the previous prompt, the application decides on the next prompt type: correction, praise, transition, or general. A prompt is constructed and sent to a fine-tuned LLM, which generates feedback.
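The decision itself is simple state handling that we keep outside the LLM, which helps keep the prompt short and the behavior predictable. The sketch below illustrates the idea in Python; the thresholds, the hold_complete flag, and the PromptType names are illustrative assumptions rather than the app’s exact logic.

```python
from enum import Enum, auto

class PromptType(Enum):
    CORRECTION = auto()
    PRAISE = auto()
    TRANSITION = auto()
    GENERAL = auto()

def next_prompt_type(score, previous, hold_complete):
    """Choose the next prompt type from the pose score and the previous prompt.

    The thresholds and the hold_complete flag are illustrative, not the app's
    exact logic.
    """
    if hold_complete:
        return PromptType.TRANSITION  # pose held long enough: move to the next pose
    if score < 0.6:
        return PromptType.CORRECTION  # large deviation from the instructor
    if score > 0.85 and previous is not PromptType.PRAISE:
        return PromptType.PRAISE      # close match: encourage, without repeating praise
    return PromptType.GENERAL         # filler comment, breathing cue, or general tip
```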
As the LLM outputs tokens, the application monitors punctuation to identify complete phrases. Each complete phrase is then sent to the Piper engine for speech synthesis. This provides immediate and continuous guidance on pose corrections.
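Splitting the token stream on punctuation keeps the audio flowing while the LLM is still generating. A minimal Python sketch of that buffering logic follows; the app implements it in Kotlin, and speak here stands in for the call into Piper.

```python
PHRASE_ENDINGS = (".", "!", "?", ",", ";", ":")

def stream_feedback(token_stream, speak):
    """Buffer LLM tokens and hand complete phrases to the TTS engine.

    token_stream: iterable of token strings from the LLM.
    speak: callback that synthesizes one phrase (a stand-in for the Piper call).
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        phrase = buffer.strip()
        if phrase and phrase.endswith(PHRASE_ENDINGS):
            speak(phrase)  # synthesize the completed phrase while generation continues
            buffer = ""
    if buffer.strip():
        speak(buffer.strip())  # flush any trailing text at the end of generation
```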
Getting a short input prompt with enough information, and a concise, accurate output, required considerable experimentation. Tuning the system prompt was also critical. Supplying the LLM with the full set of 33 3D coordinates from both the user’s and the instructor’s poses made the prompt far too long.
We tried using only the difference between the user’s and instructor’s poses, which halved the input size. While more efficient, it still did not produce consistent results. We then refined the input by calculating specific joint angles from the BlazePose landmarks and supplying only those that showed significant deviation. This reduced the token count and improved the relevance of responses.
Balancing brevity with expressiveness was essential. Concise prompts helped reduce latency, which is critical for responsiveness in real-time applications. However, excessive simplification sometimes limited the variety and nuance of generated feedback. Striking the right balance became a key design consideration in how we format LLM inputs.
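To illustrate the kind of input format we converged on, the sketch below builds a compact prompt from only the joints whose deviation exceeds a threshold. The wording, threshold, and joint limit are illustrative choices, not the exact prompt used in the app.

```python
def build_prompt(pose_name, deviations, threshold_deg=15.0, max_joints=3):
    """Build a short LLM prompt listing only the most significant deviations.

    deviations maps a joint name to (user angle - instructor angle) in degrees.
    """
    significant = {j: d for j, d in deviations.items() if abs(d) >= threshold_deg}
    # Keep only the largest deviations to hold the token count down.
    worst = sorted(significant.items(), key=lambda kv: abs(kv[1]), reverse=True)
    worst = worst[:max_joints]
    if not worst:
        return f"Pose: {pose_name}. The pose matches the instructor well."
    parts = ", ".join(f"{j} off by {d:+.0f} degrees" for j, d in worst)
    return f"Pose: {pose_name}. Deviations: {parts}. Give one short correction."
```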
To process BlazePose outputs for effective feedback, we calculate specific joint angles using the 3D coordinates provided by BlazePose. Each angle is defined by three key points. For example, the elbow angle is computed using the shoulder, elbow, and wrist coordinates.
To assess deviations, we subtract the instructor’s joint angles from the user’s corresponding angles. This simple calculation highlights where the user’s pose differs from the ideal, enabling targeted feedback.
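For reference, here is a minimal Python version of this calculation. The joint angle is the angle at the middle landmark of a three-point chain, and the example indices follow MediaPipe’s 33-landmark BlazePose scheme.

```python
import math

def joint_angle(a, b, c):
    """Angle at point b (in degrees) formed by the 3D points a-b-c.

    For the elbow angle, a = shoulder, b = elbow, c = wrist coordinates.
    """
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(ba[i] * bc[i] for i in range(3))
    norm = math.dist(a, b) * math.dist(c, b)
    if norm == 0.0:
        return 0.0  # degenerate case: coincident landmarks
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

def angle_deviation(user_landmarks, instructor_landmarks, joint):
    """Deviation for one joint: user angle minus instructor angle (degrees)."""
    i, j, k = joint  # indices of the three landmarks that define the joint
    user = joint_angle(user_landmarks[i], user_landmarks[j], user_landmarks[k])
    ref = joint_angle(instructor_landmarks[i], instructor_landmarks[j],
                      instructor_landmarks[k])
    return user - ref

# BlazePose indices: 12 = right shoulder, 14 = right elbow, 16 = right wrist
RIGHT_ELBOW = (12, 14, 16)
```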
In our application, the pose score is not directly used in the LLM prompt. Instead, it drives app behavior. For example, when to trigger transitions, when to provide praise or filler comments, and how to assess user progress.
The score reflects how closely the user's pose matches the instructor's pose. It is calculated based on the angular differences between corresponding joints. Each joint angle can optionally be assigned a weight, allowing more significant joints to influence the score more heavily.
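A minimal sketch of such a weighted score follows; the clamping value and default weights are illustrative, not the app’s exact formula.

```python
def pose_score(deviations, weights=None, max_deviation_deg=45.0):
    """Score in [0, 1], where 1.0 is a perfect match with the instructor.

    deviations maps joint name -> angular difference in degrees.
    weights optionally emphasizes the joints that matter most for the pose.
    """
    if not deviations:
        return 1.0
    total, weight_sum = 0.0, 0.0
    for joint, delta in deviations.items():
        w = 1.0 if weights is None else weights.get(joint, 1.0)
        # Clamp each joint's error so one wild landmark cannot dominate the score.
        error = min(abs(delta), max_deviation_deg) / max_deviation_deg
        total += w * error
        weight_sum += w
    return 1.0 - total / weight_sum
```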
Prompt engineering allowed us to structure inputs for our language model. However, it did not consistently yield accurate or context-appropriate feedback for users’ yoga poses. The model sometimes focused on less important corrections, or suggested moving the right body part in the wrong direction. We turned to fine-tuning to enhance the model’s performance and provide more reliable feedback.
We used Unsloth, a Python library that streamlines and optimizes fine-tuning for LLMs. It provides a suite of notebooks and tools tailored for various models, including Phi-4. We only needed minor changes to Unsloth’s notebooks.
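For readers who want to reproduce this step, the outline below follows Unsloth’s standard LoRA fine-tuning flow. The checkpoint name and hyperparameters are illustrative assumptions; take the exact values from Unsloth’s Phi-4 notebook.

```python
from unsloth import FastLanguageModel

# The checkpoint name and hyperparameters below are assumptions; use the values
# from Unsloth's Phi-4 notebook for real runs.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-4-mini-instruct",  # assumed model identifier
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit base weights keep fine-tuning memory use low
)

# Attach LoRA adapters so only a small fraction of parameters is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Training then proceeds with TRL's SFTTrainer over our synthetic dataset, as in
# the notebook, before the fine-tuned weights are converted to a 4-bit format
# that llama.cpp can load.
```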
Our first approach to building a fine-tuning dataset used the OpenAI API to generate correction responses from yoga pose input. While this sometimes produced helpful results, the quality was inconsistent: only about 7 out of 10 completions were usable. Relying on this method alone would have required manually reviewing and pruning thousands of samples, which was not sustainable for our timeline.
To scale dataset creation, we developed a rule-based Python script that automatically generated synthetic training examples. The script creates random lists of joint-angle deviations, applies a predefined set of rules for each pose, selects the joint with the highest deviation, and generates a corrective instruction based on that joint and the current pose. This method allowed rapid data generation. However, its deterministic nature led to repetitive phrasing, limiting the variety seen after fine-tuning.
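Here is a condensed sketch of that rule-based generator; the pose, joints, and rule wording are simplified examples rather than our full rule set.

```python
import json
import random

# Simplified rule table: for each pose and joint, the correction to give when the
# user's angle is below or above the instructor's. Our real rule set covers more
# poses and joints.
RULES = {
    "warrior_ii": {
        "right_knee": {"too_small": "straighten your right knee a little",
                       "too_large": "bend your right knee a little more"},
        "left_shoulder": {"too_small": "raise your left arm to shoulder height",
                          "too_large": "lower your left arm a little"},
    },
}

def make_example(pose):
    """Generate one synthetic training example for the given pose."""
    joints = list(RULES[pose])
    # Random angular deviations (user angle minus instructor angle), in degrees.
    deviations = {j: random.uniform(-40.0, 40.0) for j in joints}
    worst = max(deviations, key=lambda j: abs(deviations[j]))
    direction = "too_small" if deviations[worst] < 0 else "too_large"
    correction = RULES[pose][worst][direction]
    prompt = (f"Pose: {pose}. Deviations: "
              + ", ".join(f"{j} {d:+.0f} deg" for j, d in deviations.items()))
    return {"prompt": prompt, "response": correction.capitalize() + "."}

if __name__ == "__main__":
    with open("synthetic_yoga.jsonl", "w") as f:
        for _ in range(1000):
            f.write(json.dumps(make_example(random.choice(list(RULES)))) + "\n")
```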
To introduce more diversity, we reintroduced the OpenAI API. This time, not for full generation, but to rewrite outputs from the rule-based script. By randomizing temperature and seed values for each call, we preserved the correctness of the base response while adding stylistic variation. This significantly improved the expressiveness of our dataset and helped the model generalize better during evaluation.
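A sketch of that rewriting pass using the OpenAI Python client; the model name and system prompt here are placeholders, not necessarily what we used.

```python
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rewrite(correction: str) -> str:
    """Paraphrase a rule-based correction while preserving its meaning."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Rewrite this yoga correction in a warm, encouraging tone. "
                        "Keep the same body part and direction. One short sentence."},
            {"role": "user", "content": correction},
        ],
        temperature=random.uniform(0.7, 1.2),  # vary style from call to call
        seed=random.randint(0, 1_000_000),     # vary sampling as well
    )
    return response.choices[0].message.content.strip()
```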
The following shows the inference times for three core components:
This latency is suitable for applications like yoga. Poses are held for several seconds, allowing the system ample time to process and provide feedback. Internal testing indicates that using Arm Scalable Matrix Extension 2 (SME2) within the new Arm C1 CPU cluster for upcoming devices can achieve:
A 4.7x improvement in time-to-first-token (TTFT) for LLMs.
Significant power, battery and heat savings become possible with SME2. For example, using only one core for the LLM instead of the current six will still result in a 2.2x pipeline speedup.
To improve yoga pose correction systems, we need community-driven datasets. The datasets must catalogue yoga poses and include annotations of common mistakes with their corrections. Such datasets would provide a richer training ground for models, enabling them to offer more nuanced feedback to practitioners.
Looking ahead, advances in software and hardware present opportunities to overcome current limitations. Continued improvements in Arm-based chips, along with performance gains from technologies like SME2, could allow us to run more capable LLMs without compromising responsiveness. With more compute available, we could consider using a larger and more accurate model, or even a VLM, to analyze full frames directly. However, enabling VLM-based reasoning would also require more sophisticated datasets that include examples of common mistakes and corrections.
If you would like to know more about SME2, try the Learning Path Accelerate Voice Assistant performance with KleidiAI and SME2. The Voice Assistant has an AI pipeline with many components similar to the Yoga Tutor’s. For more comprehensive SME2 information, see the SME Overview and the SME Programmer's Guide.