This blog post is the second in our Neural Super Sampling (NSS) series. The post explores why we introduced NSS and explains its architecture, training, and inference components.
In August 2025, we announced Arm neural technology that will ship in Arm GPUs in 2026. The first use case of the technology is Neural Super Sampling (NSS). NSS is a next-generation, AI-powered upscaling solution. Developers can already start experimenting with NSS today, as discussed in the first post of this two-part series.
In this blog post, we take a closer look at how NSS works, covering everything from training and network architecture to post-processing and inference. This deep dive is aimed at ML engineers and mobile graphics developers who want to understand the design and how it can be deployed on mobile hardware.
Temporal super sampling (TSS), also known as TAA, has become an industry-standard anti-aliasing solution over the last decade. TSS offers several benefits: it addresses all types of aliasing, is compute-efficient for deferred rendering, and is extensible to upscaling. However, it is not without its challenges. The hand-tuned heuristics commonly used in TSS approaches today can be difficult to scale and require continual adjustment across varied content. Issues like ghosting, disocclusion artifacts, and temporal instability remain, and these problems worsen when combined with upscaling.
NSS overcomes these limitations by using a trained neural model. Instead of relying on static rules, it learns from data. It generalizes across conditions and content types, adapting to motion dynamics and identifying aliasing patterns more effectively. These capabilities help NSS handle edge cases more reliably than approaches such as AMD's FSR 2 and Arm ASR.
NSS is trained on sequences of 540p frames rendered at 1 sample per pixel (spp). Each frame is paired with a 1080p ground-truth image rendered at 16 spp. Sequences are around 100 frames long, which helps the model learn how image content changes over time.
Inputs include rendered buffers, such as color, motion vectors, and depth, alongside engine metadata, such as jitter vectors and camera matrices. The model is trained recurrently: it runs forward across a sequence of multiple frames before each backpropagation step. This approach lets the network propagate gradients through time and learn how to accumulate information.
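As a rough illustration of this training scheme, the following PyTorch sketch runs a model forward over a whole frame sequence and backpropagates once through time. The model, loss, and data interfaces here are hypothetical placeholders, not the actual NSS training code.

```python
import torch

def train_on_sequence(model, optimizer, loss_fn, sequence, device="cuda"):
    """Run the model forward across a frame sequence while carrying a
    recurrent hidden state, then backpropagate once through time.
    `sequence` yields (inputs, ground_truth) pairs; all names are illustrative."""
    model.train()
    hidden = None                      # recurrent state carried across frames
    losses = []

    for inputs, target in sequence:    # ~100 frames per training sequence
        inputs, target = inputs.to(device), target.to(device)
        output, hidden = model(inputs, hidden)
        losses.append(loss_fn(output, target))

    loss = torch.stack(losses).mean()  # gradients flow back through all frames
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```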
The network is trained with a spatiotemporal loss function, which simultaneously penalizes errors in spatial fidelity and temporal consistency. The spatial term keeps each frame sharp, detailed, and visually accurate, preserving edges, textures, and fine structures. The temporal term discourages flickering, jittering, and other forms of temporal noise across consecutive frames.
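The exact loss terms and weights are not detailed here, but a minimal sketch in this spirit might pair an L1 spatial term with a temporal term that penalizes frame-to-frame changes in the prediction that do not appear in the ground truth:

```python
import torch.nn.functional as F

def spatiotemporal_loss(pred, target, prev_pred, prev_target, temporal_weight=0.5):
    """Illustrative only: `temporal_weight` is a made-up knob, and a full
    implementation would typically also warp previous frames by motion
    vectors before comparing them."""
    spatial = F.l1_loss(pred, target)              # per-frame fidelity
    pred_delta = pred - prev_pred                  # how the output changed
    target_delta = target - prev_target            # how it should have changed
    temporal = F.l1_loss(pred_delta, target_delta) # penalize flicker
    return spatial + temporal_weight * temporal
```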
Training is done in PyTorch using well-established practices, including the Adam optimizer, a cosine annealing learning rate schedule, and standard data augmentation strategies. Pre- and post-processing passes are written in Slang for flexibility and performance, and ExecuTorch is used for quantization-aware training.
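For reference, the optimizer and schedule mentioned above map directly onto standard PyTorch APIs; the learning rate and step count below are placeholders, not the values used to train NSS.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(12, 3, kernel_size=3, padding=1)  # stand-in for the NSS network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000)

# Inside the training loop, call scheduler.step() after each optimizer.step().
```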
The NSS network uses a four-level UNet backbone with skip connections to preserve spatial structure, downsampling and upsampling the input data across three encoder and decoder modules.
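For orientation, here is a rough PyTorch sketch of a four-level UNet with three downsampling and upsampling stages and skip connections. Channel counts and block contents are placeholders, and the recurrent and feedback paths of the real network are omitted.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, in_ch=12, out_ch=3, base=16):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]            # four levels
        self.enc = nn.ModuleList([conv_block(in_ch, chs[0]),
                                  conv_block(chs[0], chs[1]),
                                  conv_block(chs[1], chs[2])])
        self.bottleneck = conv_block(chs[2], chs[3])
        self.up = nn.ModuleList([nn.ConvTranspose2d(chs[3], chs[2], 2, stride=2),
                                 nn.ConvTranspose2d(chs[2], chs[1], 2, stride=2),
                                 nn.ConvTranspose2d(chs[1], chs[0], 2, stride=2)])
        self.dec = nn.ModuleList([conv_block(chs[2] * 2, chs[2]),
                                  conv_block(chs[1] * 2, chs[1]),
                                  conv_block(chs[0] * 2, chs[0])])
        self.head = nn.Conv2d(chs[0], out_ch, 1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skips = []
        for enc in self.enc:                 # encoder: three downsampling stages
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = up(x)                        # decoder: upsample and fuse skip
            x = dec(torch.cat([x, skip], dim=1))
        return self.head(x)
```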
We evaluated several approaches:
The network generates three per-pixel outputs:
The network outputs serve two paths:
NSS introduces two key feedback mechanisms to address temporal instabilities:
These inputs help the model maintain temporal stability without relying on handcrafted rules.
A GPU-based pre-processing stage runs before inference. It prepares the inputs required by NSS. This stage gathers per-pixel attributes like color, motion vectors, and depth. It also computes the luma derivative, a temporal signal that flags thin-feature flicker, and a disocclusion mask that highlights stale history. In addition, it reprojects hidden features from previous frames.
These are assembled into a single input tensor for the neural network. This stage runs as a compute shader. It executes before the inference call, which runs on the GPU using Vulkan ML extensions.
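The real pass is a Slang compute shader, but conceptually it packs these per-pixel attributes into one tensor. A hypothetical torch illustration of that packing, with made-up channel counts:

```python
import torch

def build_input_tensor(color, motion, depth, luma_derivative,
                       disocclusion_mask, reprojected_hidden):
    """All arguments are (C, H, W) tensors at the low (e.g. 540p) resolution;
    the channel layout shown here is illustrative, not the NSS layout."""
    return torch.cat([color,                # 3 channels
                      motion,               # 2 channels
                      depth,                # 1 channel
                      luma_derivative,      # 1 channel
                      disocclusion_mask,    # 1 channel
                      reprojected_hidden],  # N hidden-feature channels
                     dim=0).unsqueeze(0)    # add batch dimension
```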
After inference, a post-process stage runs as a compute shader to construct the output color. All steps are integrated into the render graph and are designed to run efficiently on mobile. These steps include:
We assess NSS using several metrics, including PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), and FLIP, a rendering-focused perceptual error metric. These metrics do not always match human perception, but they help surface problem cases, and tracking multiple metrics builds confidence.
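PSNR and SSIM can be computed with off-the-shelf libraries; the snippet below uses scikit-image as one option (FLIP has its own reference implementation from NVIDIA and is not shown here).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frame(output: np.ndarray, reference: np.ndarray) -> dict:
    """`output` and `reference` are HxWx3 float arrays in [0, 1]."""
    return {
        "psnr": peak_signal_noise_ratio(reference, output, data_range=1.0),
        "ssim": structural_similarity(reference, output,
                                      channel_axis=-1, data_range=1.0),
    }
```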
A Continuous Integration (CI) workflow replays test sequences. It logs performance across NSS, Arm Accuracy Super Resolution (ASR), and other baselines. For visual comparisons and perceptual evaluation, please refer to the whitepaper linked below.
Learn more about Neural Super Sampling
In 540p-to-1080p comparisons, NSS improves stability and detail retention. It performs well in scenes with fast motion, partially occluded objects, and thin geometry. Unlike non-neural approaches such as Arm ASR or AMD's FSR 2, NSS also handles particle effects without needing a reactive mask.
While silicon products with Neural Accelerators have not yet been announced, we can estimate whether NSS is fast enough. This estimate is based on minimum performance assumptions and the number of MACs required to perform an inference of the network, so the analysis applies to any accelerator that meets the same assumptions for throughput, power, and utilization. We assume a target of 10 TOP/s per watt of neural acceleration is achievable at a sustainable GPU clock frequency.
We target ≤4 ms per frame for the upscaler under sustained performance conditions. The shader stages before and after inference take about 1.4 ms on a low-frequency GPU. Within this budget, NSS must stay below approximately 27 GOPs. Our parameter prediction network uses about 10 GOPs, which fits comfortably within that range, even at 40% neural accelerator efficiency.
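As a back-of-the-envelope check on those figures, the sketch below assumes roughly 10 TOP/s of sustained neural throughput (an assumption for illustration, not a product specification) and is one way to arrive at the budget and effective cost quoted above.

```python
# Values taken from the text; throughput and the exact accounting are assumptions.
frame_budget_ms = 4.0                       # target for the whole upscaler
shader_cost_ms  = 1.4                       # pre/post compute-shader passes
inference_ms    = frame_budget_ms - shader_cost_ms      # ~2.6 ms left for inference

throughput_ops_per_s = 10e12                # assumed ~10 TOP/s of neural acceleration
ops_budget = throughput_ops_per_s * inference_ms / 1000
print(f"budget ≈ {ops_budget / 1e9:.0f} GOPs")          # ~26 GOPs, in line with ~27 above

network_ops = 10e9                          # NSS inference cost from the text
effective_cost = network_ops / 0.40         # at 40% accelerator efficiency
print(f"effective cost ≈ {effective_cost / 1e9:.0f} GOPs")  # ~25 GOPs, under budget
```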
Early simulation data shows NSS costs approximately 75% of Arm ASR’s runtime at 1.5× upscaling (balanced mode), and it is projected to outperform Arm ASR at 2× upscaling (balanced mode). The efficiency gains come from replacing complex heuristics with a streamlined inference pass.
NSS introduces a practical, ML-powered approach to temporal super sampling. It replaces hand-tuned heuristics with learned filters and stability cues. It also runs within the real-time constraints of mobile hardware. Its training approach, compact architecture, and use of ML extensions for Vulkan make it performant and adaptable. For ML engineers building neural rendering solutions, NSS is a deployable, well-structured example of inference running inside the graphics pipeline.
To explore the Arm Neural Graphics Development Kit, visit the NSS page on the Arm Developer Hub. There you can find sample code and review the network structure. We welcome feedback from developers using the SDK or retraining NSS for their own content. Your insights can help shape the future of neural rendering on mobile.
Get started