This blog post is the second in our Neural Super Sampling (NSS) series. The post explores why we introduced NSS and explains its architecture, training, and inference components.
In August 2025, we announced Arm neural technology that will ship in Arm GPUs in 2026. The first use case of the technology is Neural Super Sampling (NSS). NSS is a next-generation, AI-powered upscaling solution. Developers can already start experimenting with NSS today, as discussed in the first post of this two-part series.
In this blog post, we take a closer look at how NSS works, covering everything from training and network architecture to post-processing and inference. This deep dive is aimed at ML engineers and mobile graphics developers who want to understand the design and how it can be deployed on mobile hardware.
Temporal super sampling (TSS), also known as TAA, has become an industry-standard anti-aliasing solution over the last decade. TSS offers several benefits: it addresses all types of aliasing, is compute-efficient for deferred rendering, and is extensible to upscaling. However, it is not without its challenges. The hand-tuned heuristics commonly used in TSS approaches today can be difficult to scale and require continual adjustment across varied content. Issues like ghosting, disocclusion artifacts, and temporal instability remain, and these problems worsen when combined with upscaling.
NSS overcomes these limitations by using a trained neural model. Instead of relying on static rules, it learns from data. It generalizes across conditions and content types, adapting to motion dynamics and identifying aliasing patterns more effectively. These capabilities help NSS handle edge cases more reliably than approaches such as AMD's FSR 2 and Arm ASR.
NSS is trained on sequences of 540p frames rendered at 1 sample per pixel (spp). Each frame is paired with a 1080p ground-truth image rendered at 16 spp. Sequences are around 100 frames long, which helps the model learn how image content changes over time.
Inputs include rendered buffers, such as color, motion vectors, and depth, alongside engine metadata, such as jitter vectors and camera matrices. The model is trained recurrently: it runs forward across a sequence of multiple frames before each backpropagation step. This approach lets the network propagate gradients through time and learn how to accumulate information.
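As a rough illustration of this training scheme, the following PyTorch sketch runs a model forward over a whole frame sequence and backpropagates once through time. The model, loss, and data interfaces here are hypothetical placeholders, not the actual NSS training code.

```python
import torch

def train_on_sequence(model, optimizer, loss_fn, sequence, device="cuda"):
    """Run the model forward across a frame sequence while carrying a
    recurrent hidden state, then backpropagate once through time.
    `sequence` yields (inputs, ground_truth) pairs; all names are illustrative."""
    model.train()
    hidden = None                      # recurrent state carried across frames
    losses = []

    for inputs, target in sequence:    # ~100 frames per training sequence
        inputs, target = inputs.to(device), target.to(device)
        output, hidden = model(inputs, hidden)
        losses.append(loss_fn(output, target))

    loss = torch.stack(losses).mean()  # gradients flow back through all frames
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```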
The network is trained with a spatiotemporal loss function, which simultaneously penalizes errors in spatial fidelity and temporal consistency. The spatial term keeps each frame sharp, detailed, and visually accurate, preserving edges, textures, and fine structures. The temporal term discourages flickering, jittering, and other forms of temporal noise across consecutive frames.
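The exact loss terms and weights are not detailed here, but a minimal sketch in this spirit might pair an L1 spatial term with a temporal term that penalizes frame-to-frame changes in the prediction that do not appear in the ground truth:

```python
import torch.nn.functional as F

def spatiotemporal_loss(pred, target, prev_pred, prev_target, temporal_weight=0.5):
    """Illustrative only: `temporal_weight` is a made-up knob, and a full
    implementation would typically also warp previous frames by motion
    vectors before comparing them."""
    spatial = F.l1_loss(pred, target)              # per-frame fidelity
    pred_delta = pred - prev_pred                  # how the output changed
    target_delta = target - prev_target            # how it should have changed
    temporal = F.l1_loss(pred_delta, target_delta) # penalize flicker
    return spatial + temporal_weight * temporal
```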
Training is done in PyTorch using well-established practices, including the Adam optimizer, a cosine annealing learning rate schedule, and standard data augmentation strategies. Pre- and post-processing passes are written in Slang for flexibility and performance, and ExecuTorch is used for quantization-aware training.
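For reference, the optimizer and schedule mentioned above map directly onto standard PyTorch APIs; the learning rate and step count below are placeholders, not the values used to train NSS.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(12, 3, kernel_size=3, padding=1)  # stand-in for the NSS network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000)

# Inside the training loop, call scheduler.step() after each optimizer.step().
```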
The NSS network uses a four-level UNet backbone with skip connections to preserve spatial structure, downsampling and upsampling the input data across three encoder and decoder modules.
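For orientation, here is a rough PyTorch sketch of a four-level UNet with three downsampling and upsampling stages and skip connections. Channel counts and block contents are placeholders, and the recurrent and feedback paths of the real network are omitted.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, in_ch=12, out_ch=3, base=16):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]            # four levels
        self.enc = nn.ModuleList([conv_block(in_ch, chs[0]),
                                  conv_block(chs[0], chs[1]),
                                  conv_block(chs[1], chs[2])])
        self.bottleneck = conv_block(chs[2], chs[3])
        self.up = nn.ModuleList([nn.ConvTranspose2d(chs[3], chs[2], 2, stride=2),
                                 nn.ConvTranspose2d(chs[2], chs[1], 2, stride=2),
                                 nn.ConvTranspose2d(chs[1], chs[0], 2, stride=2)])
        self.dec = nn.ModuleList([conv_block(chs[2] * 2, chs[2]),
                                  conv_block(chs[1] * 2, chs[1]),
                                  conv_block(chs[0] * 2, chs[0])])
        self.head = nn.Conv2d(chs[0], out_ch, 1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skips = []
        for enc in self.enc:                 # encoder: three downsampling stages
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = up(x)                        # decoder: upsample and fuse skip
            x = dec(torch.cat([x, skip], dim=1))
        return self.head(x)
```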
We evaluated several approaches:
The network generates three per-pixel outputs:
The network outputs serve two paths:
NSS introduces two key feedback mechanisms to address temporal instabilities:
These inputs help the model maintain temporal stability without relying on handcrafted rules.
A GPU-based pre-processing stage runs before inference. It prepares the inputs required by NSS. This stage gathers per-pixel attributes like color, motion vectors, and depth. It also computes the luma derivative, a temporal signal that flags thin-feature flicker, and a disocclusion mask that highlights stale history. In addition, it reprojects hidden features from previous frames.
These are assembled into a single input tensor for the neural network. This stage runs as a compute shader. It executes before the inference call, which runs on the GPU using Vulkan ML extensions.
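The real pass is a Slang compute shader, but conceptually it packs these per-pixel attributes into one tensor. A hypothetical torch illustration of that packing, with made-up channel counts:

```python
import torch

def build_input_tensor(color, motion, depth, luma_derivative,
                       disocclusion_mask, reprojected_hidden):
    """All arguments are (C, H, W) tensors at the low (e.g. 540p) resolution;
    the channel layout shown here is illustrative, not the NSS layout."""
    return torch.cat([color,                # 3 channels
                      motion,               # 2 channels
                      depth,                # 1 channel
                      luma_derivative,      # 1 channel
                      disocclusion_mask,    # 1 channel
                      reprojected_hidden],  # N hidden-feature channels
                     dim=0).unsqueeze(0)    # add batch dimension
```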
After inference, a post-process stage runs as a compute shader to construct the output color. All steps are integrated into the render graph and are designed to run efficiently on mobile. These steps include:
We assess NSS using several metrics, including PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), and FLIP, a rendering-focused perceptual error metric. These metrics do not always match human perception, but they help surface problem cases, and tracking multiple metrics builds confidence.
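PSNR and SSIM can be computed with off-the-shelf libraries; the snippet below uses scikit-image as one option (FLIP has its own reference implementation from NVIDIA and is not shown here).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frame(output: np.ndarray, reference: np.ndarray) -> dict:
    """`output` and `reference` are HxWx3 float arrays in [0, 1]."""
    return {
        "psnr": peak_signal_noise_ratio(reference, output, data_range=1.0),
        "ssim": structural_similarity(reference, output,
                                      channel_axis=-1, data_range=1.0),
    }
```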
A Continuous Integration (CI) workflow replays test sequences. It logs performance across NSS, Arm Accuracy Super Resolution (ASR), and other baselines. For visual comparisons and perceptual evaluation, please refer to the whitepaper linked below.
Learn more about Neural Super Sampling
In 540p-to-1080p comparisons, NSS improves stability and detail retention. It performs well in scenes with fast motion, partially occluded objects, and thin geometry. Unlike non-neural approaches such as Arm ASR or AMD's FSR 2, NSS also handles particle effects without needing a reactive mask.
While silicon products with Neural Accelerators have not yet been announced, we can estimate whether NSS is fast enough. This estimate is based on minimum performance assumptions and the number of MACs required to perform an inference of the network, so the analysis applies to any accelerator that meets the same assumptions for throughput, power, and utilization. We assume a target of 10 TOP/s per watt of neural acceleration is achievable at a sustainable GPU clock frequency.
We target ≤4 ms per frame for the upscaler under sustained performance conditions. The shader stages before and after inference take about 1.4 ms on a low-frequency GPU. Within this budget, NSS must stay below approximately 27 GOPs. Our parameter prediction network uses about 10 GOPs, which fits comfortably within that range, even at 40% neural accelerator efficiency.
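As a back-of-the-envelope check on those figures, the sketch below assumes roughly 10 TOP/s of sustained neural throughput (an assumption for illustration, not a product specification) and is one way to arrive at the budget and effective cost quoted above.

```python
# Values taken from the text; throughput and the exact accounting are assumptions.
frame_budget_ms = 4.0                       # target for the whole upscaler
shader_cost_ms  = 1.4                       # pre/post compute-shader passes
inference_ms    = frame_budget_ms - shader_cost_ms      # ~2.6 ms left for inference

throughput_ops_per_s = 10e12                # assumed ~10 TOP/s of neural acceleration
ops_budget = throughput_ops_per_s * inference_ms / 1000
print(f"budget ≈ {ops_budget / 1e9:.0f} GOPs")          # ~26 GOPs, in line with ~27 above

network_ops = 10e9                          # NSS inference cost from the text
effective_cost = network_ops / 0.40         # at 40% accelerator efficiency
print(f"effective cost ≈ {effective_cost / 1e9:.0f} GOPs")  # ~25 GOPs, under budget
```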
Early simulation data shows NSS costs approximately 75% of Arm ASR’s runtime at 1.5× upscaling (balanced mode), and it is projected to outperform Arm ASR at 2× upscaling (balanced mode). The efficiency gains come from replacing complex heuristics with a streamlined inference pass.
NSS introduces a practical, ML-powered approach to temporal super sampling. It replaces hand-tuned heuristics with learned filters and stability cues. It also runs within the real-time constraints of mobile hardware. Its training approach, compact architecture, and use of ML extensions for Vulkan make it performant and adaptable. For ML engineers building neural rendering solutions, NSS is a deployable, well-structured example of inference running inside the graphics pipeline.
To explore the Arm Neural Graphics Development Kit, visit the NSS page on the Arm Developer Hub. There you can find sample code and review the network structure. We welcome feedback from developers using the SDK or retraining NSS for their own content. Your insights can help shape the future of neural rendering on mobile.
Get started