The engineering team at Unity recently asked us to help them investigate a developer bug report. The problem they encountered was poor image quality when using ASTC to compress RGBM-encoded light maps, compared to ETC2+EAC at the same bit rate. Our investigation highlighted several interesting lessons for getting the most effective compression out of ASTC. Unity have kindly let us use it as a case study for this blog.
Just to set the scene, the developer was compressing this RGBM-encoded skybox image (shown converted back to normal RGB data):
The images below show a zoomed-in region of this skybox for several different compression formats:
It is clear from these images that the ASTC version is suffering from objectionable block artifacts. No wonder the developer was unhappy with image quality! Let’s see what we can do about it.
Before we get started it is useful to have an overview of the input data format. RGBM data stores high-dynamic range (HDR) RGB data encoded into a low-dynamic range (LDR) texture. The RGB channels in the texture store the base color, and the M channel stores a multiplier scaling factor for the RGB channels. Reconstructing an actual HDR color sample requires some additional logic in the shader code:
vec4 data = texture(sampler, uv);
// Scale the base color by the stored multiplier and the fixed 0-5 range factor
data.rgb = data.rgb * data.a * 5.0;
data.a = 1.0;
This RGBM encoding therefore gives Unity the ability to extract high-dynamic-range RGB data, with values between 0 and 5, from low-dynamic range storage formats.
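To make the round-trip concrete, here is a minimal sketch of an RGBM encode/decode pair. The function names and the choice of the smallest multiplier that fits are our own illustration, not Unity's actual packing code; it assumes HDR inputs already lie in the 0 to 5 range used above.

```python
MAX_RANGE = 5.0  # matches the * 5.0 factor in the shader reconstruction

def rgbm_encode(r, g, b):
    """Pack linear HDR RGB (each in [0, 5]) into 8-bit RGBM values."""
    # Smallest multiplier that still covers the brightest channel.
    m = max(r, g, b, 1e-6) / MAX_RANGE
    m = min(max(m, 1.0 / 255.0), 1.0)
    m = round(m * 255.0) / 255.0          # M is stored as an 8-bit unorm
    scale = 1.0 / (m * MAX_RANGE)
    return (round(r * scale * 255.0),
            round(g * scale * 255.0),
            round(b * scale * 255.0),
            round(m * 255.0))

def rgbm_decode(r8, g8, b8, m8):
    """Mirror of the shader reconstruction: rgb * m * 5."""
    m = m8 / 255.0
    return tuple((c / 255.0) * m * MAX_RANGE for c in (r8, g8, b8))
```

Encoding (2.0, 1.0, 0.5) and decoding it again recovers the input to within the 8-bit quantization error, which is the behavior the shader snippet above relies on.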
ASTC is a complex format which I will not cover in detail here, but there are some useful essentials which you need to know.
Like other GPU compression formats, it is a lossy block-based format. It compresses an input block of NxM texels into 128 bits of output. For each block the encoding stores texel colors by storing two color values – called endpoints – that are shared by all texels in a partition. Per-texel colors are recovered by using a weight for each texel that controls the mixing ratios of the two endpoint colors.
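The endpoint-and-weight scheme can be sketched as a simple blend; ASTC proper works with quantized integer endpoints and weights, so this float version only illustrates the idea.

```python
def reconstruct_texel(endpoint0, endpoint1, weight):
    """Blend two endpoint colors by a per-texel weight in [0, 1].

    Simplified model of block decode: each texel stores only a weight,
    and the two endpoint colors are shared across the partition.
    """
    return tuple(e0 + (e1 - e0) * weight
                 for e0, e1 in zip(endpoint0, endpoint1))

# A smooth dark-to-light block needs just two endpoints plus one small
# weight per texel:
low, high = (10, 20, 30), (200, 210, 220)
mid = reconstruct_texel(low, high, 0.5)   # halfway between the endpoints
```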
For situations where there is complex color variation inside each block, such as a red ball resting on green grass, up to four separate partitions can be defined. Each partition stores unique endpoint colors, but shares interpolation weights with the other partitions.
For situations where there is uncorrelated data across the color channels, such as an alpha mask packed with RGB data, one color component can be assigned a unique set of per-texel weights. This allows the interpolation mix for that channel to be different from the other three.
All of this is relatively standard technology for block-based formats. What is unique about ASTC is the amount of control the compressor gets over assigning storage to each of these things. The compressor can dynamically choose how many bits to assign to encoding color endpoints, and how many to assign to encoding interpolation weights. It can also choose how many color endpoint partitions to encode, and whether to use a second set of interpolation weights or not. This is what makes ASTC so flexible and, for the most part, high quality. It is not forced to spend bits on things which are not important for any given compression block.
What is universally true however, is that the compressor never gets enough bits to do everything it would like. The compression is lossy – so we always lose some accuracy somewhere – so the job of the compressor is to make sure it spends the bits available on the things that give the most benefit.
Knowing the above, let us think through what RGBM is likely to need from the data encoding.
First, it is storing 4 color channels, so we have full-sized color endpoints and cannot save any bits there.
Second, the “RGB” channels are likely to be well correlated to each other; color data normally is. However, the “M” channel is not directly storing a color and is therefore likely to be poorly correlated with the RGB channels. This is likely to force the compressor to try to use a second set of texel weights.
Third, the encoding accuracy of the “M” channel is going to be important. Any errors in the compressed alpha value are multiplied by 5 in the shader code doing the HDR data reconstruction.
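The amplification is easy to see with a toy calculation; the numbers here are our own illustration.

```python
# A one-LSB error in the stored M channel is amplified by the RGB value
# and the fixed range factor during reconstruction.
MAX_RANGE = 5.0

rgb = 0.8                    # stored base color (unorm)
m_true = 100 / 255.0         # correct multiplier
m_wrong = 101 / 255.0        # one LSB of compression error in M

true_value = rgb * m_true * MAX_RANGE
wrong_value = rgb * m_wrong * MAX_RANGE
error = wrong_value - true_value     # = rgb * MAX_RANGE / 255, ~0.016 here
```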
Finally, we are likely to be sensitive to how a value is split across a color channel and the multiplier channel. If we look at this in the integer domain, ignoring the unorm scaling, let us think about storing two 8-bit luminance values – 50 and 55. One valid encoding would be:
This is nice to encode; there is little variation across RGB endpoints, and no variation across the multiplier channel. However, the same data could also be stored as:
This has a larger numeric range in both channels. This would need more bits to encode even though it is functionally the same as the first encoding shown for this use case.
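Because reconstruction is a product, this equivalence is easy to demonstrate; the pairs below are our own illustrative numbers in the same integer domain, ignoring unorm scaling.

```python
# Different (value, multiplier) pairs decode to identical results, but
# differ hugely in how compressible they are.
flat   = [(50, 1), (55, 1)]   # flat multiplier channel, small value range
spread = [(25, 2), (11, 5)]   # variation pushed into both channels

decode = lambda pairs: [v * m for v, m in pairs]
assert decode(flat) == decode(spread) == [50, 55]
```

The `flat` encoding keeps one channel constant and the other in a narrow range, while `spread` forces the compressor to represent wide variation in both channels for exactly the same decoded result.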
In total, this is a hard set of requirements for the compressor because the first three all want high bit rate to be assigned to different things. We simply don’t have enough storage capacity in the compressed format to assign enough bits to all of them. The final part of this also leaves the compressor at the mercy of input data which is poorly transformed being much harder to compress. The compressor does not know this RGBM-specific data equivalency and therefore cannot automatically optimize the encoding based on it.
We know that ETC2+EAC compresses this content better than ASTC, so it is useful to consider why that is.
First, the format explicitly splits the RGB (encoded using ETC2) and M (encoded using EAC) data. This means the M data is explicitly non-correlated and gets a natural increase in bit rate significance because the EAC data, at 4bpp, only stores a single channel. Also, the bit rate split between the two is fixed. This avoids the situation we get with ASTC, where the compressor heuristics can over-assign bits to one at the expense of the other.
Second, the type of error that ETC2+EAC tends to produce looks more like dither speckle than block artifacts. This can result in a worse PSNR, but as the perceptual impact of this type of error is lower it often produces a more usable output image.
Our first attempt to fix this used the error weighting functionality in the astcenc compressor to bias the encoding towards alpha channel accuracy. When using the “-ch 1 1 1 5” command line option, we tell the compressor to weight the alpha channel five times more heavily when it is considering encoding choices. This weighting aligns with the scaling that is applied during the RGBM data reconstruction in the shader program.
This helped; the PSNR for 4x4 blocks improved from 48.0 dB to 50.9 dB, and even exceeded ETC2+EAC, which had a PSNR of 48.7 dB. Unfortunately, this test also highlighted why PSNR is regarded as a poor perceptual metric. The output might be a closer match to the original on average, which gives the better PSNR score, but it still suffers from visibly worse block artifact problems than the ETC2+EAC compressed image.
For the second attempt to fix this we went back to looking at the RGBM data format to see what we could do to change the input data. We were looking for a modification which would help the compressor avoid spending bits on things because we could “design them out” in the data format.
For ASTC the most expensive encoding option is including a second plane of per-texel weights, which is needed to cope with a non-correlated data channel. This either costs many bits to store the weights accurately, leaving fewer for the endpoint colors, or you accept less accurate weights, which gives less accurate reconstruction. Either option was problematic because we needed both accurate endpoints and accurate interpolation to ensure an accurate M value, and we already knew that any error in the M channel gets scaled five times by the reconstruction process. "Less accurate" quickly becomes a liability for this use case. We quickly decided we had to find a way to avoid storing a non-correlated channel in the encoded data.
Our chosen solution here was to create a helper utility application which pre-processes the input data before compression. It writes a modified uncompressed image back to disk before invoking the astcenc compressor. This pre-processing pass iterates across the image, repacking the data block-by-block at the targeted ASTC block size. This repacking forces all texels in a block to use the same M value, scaling the RGB values by the necessary amount to keep the reconstructed values as close as possible to the original. To reduce block artifacts that are caused by rapid changes in M values, we also added a limit on how fast the M value could change across neighboring blocks.
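A minimal sketch of the per-block repacking idea looks like this. The function name and the choice of the smallest M that covers the block are our own; the real tool also applies the neighbour-block ratio limit mentioned above, which this sketch omits.

```python
import math

def blockify(texels, max_range=5.0):
    """Force all RGBM texels in one block to share a single M value.

    texels: list of (r, g, b, m) tuples of 8-bit values.
    Returns the repacked texels, all carrying the same M.
    """
    # Reconstruct the HDR values this block must represent.
    hdr = [tuple((c / 255.0) * (m / 255.0) * max_range for c in (r, g, b))
           for r, g, b, m in texels]
    # Smallest shared multiplier that still covers the block's peak value.
    peak = max(max(px) for px in hdr)
    shared_m = min(255, max(1, math.ceil(peak / max_range * 255.0)))
    # Rescale the base colors so reconstruction stays close to the original.
    scale = 255.0 / ((shared_m / 255.0) * max_range)
    return [tuple(min(255, round(c * scale)) for c in px) + (shared_m,)
            for px in hdr]
```

With the M channel now flat inside each block, the compressor no longer sees a non-correlated channel and does not need a second plane of weights.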
This one change solves most of the issues we identified earlier:
This worked fantastically well, although we still needed to manually bias the encoder error weights to ensure good M-channel endpoint accuracy. The PSNR for ASTC 4x4 blocks increases from 50.8 dB to 56.2 dB, 7.5 dB better than ETC2+EAC at the same bit rate, with no visible block artifacts in the image.
When we had solved the image quality problems, it was time to optimize the content to make it even better for mobile devices. The true super-power of ASTC is the bit rate choice that it gives content creators. By choosing different texel block sizes, which all encode to 128-bit output, artists can trade off quality against bit rate. Compression rates can vary from 8bpp (4x4 blocks) all the way down to 0.89bpp (12x12 blocks), with fine steps in between.
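Since every block compresses to 128 bits regardless of its footprint, the bit rate for any block size is a one-line calculation:

```python
# ASTC always emits 128 bits per block, so bits-per-texel is just
# 128 / (block width * block height).
def astc_bpp(width, height):
    return 128 / (width * height)

for w, h in [(4, 4), (5, 5), (6, 6), (8, 8), (12, 12)]:
    print(f"{w}x{h}: {astc_bpp(w, h):.2f} bpp")
# 4x4 prints 8.00 bpp; 12x12 prints 0.89 bpp
```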
For mobile devices DRAM access is a major power consumer, so to finish off our investigation we wanted to see if we could use larger block sizes to lower the bit rate, while still beating ETC2+EAC in terms of visual quality. We tried the commonly used block sizes: 5x5 blocks (5.12bpp), 6x6 blocks (3.56bpp), 8x6 blocks (2.65bpp), and 8x8 blocks (2bpp).
The results were clear: we could significantly reduce the bit rate and still beat ETC2+EAC in terms of image quality. The first three of these block sizes are better than ETC2+EAC, with significant PSNR advantages despite a ~3x bit rate disadvantage in the case of ASTC using 8x6 blocks.
With the 8x6 blocks we are starting to lose some detail, and some minor block artifacts start to appear. Despite this, they are still very usable images, and at less than half the bit rate of the ETC2+EAC version they are compelling.
With 8x8 blocks at 2bpp ASTC starts to struggle. The PSNR is slightly worse than ETC2+EAC and the block artifacts start to become clearly noticeable. Given it is at a 4x bit rate disadvantage, perhaps this is not too surprising.
This is a good opportunity to highlight some best practice recommendations for using ASTC with the new Mali Valhall GPU family. This includes the new Mali-G57 and Mali-G77 GPUs, which will both start shipping in consumer devices in 2020.
For Valhall we have doubled the texturing throughput. These two GPUs keep the same two pixels per core per clock throughput as the previous generation of Mali GPUs, but they can now perform four bilinear (LINEAR_MIP_NEAREST) or two trilinear (LINEAR_MIP_LINEAR) samples per core per clock. The downside of filtering twice as fast is that we need to pull twice as much data out of the texture caches, so this faster throughput is only possible when the texture format uses 32-bits per texel after decompression.
The ETC2, ETC2+EAC, and the sRGB ASTC formats always decompress into 32 bits per sample. They therefore automatically benefit from the faster Valhall texturing path. Unfortunately, the non-sRGB LDR ASTC format, and the HDR ASTC format, are specified to decompress into a four-channel fp16 intermediate result. This means that the faster filtering path cannot be used by default.
The ASTC decode mode extensions for OpenGL ES (EXT_texture_compression_astc_decode_mode) and for Vulkan (VK_EXT_astc_decode_mode) are now published by Khronos. These allow developers to opt in to using a lower precision intermediate format. Using these extensions will improve texture caching and allow the new fast filtering path to be used.
Note: It is recommended to use these extensions on all Mali GPUs that expose them, even if the faster filtering path is not available, due to the improved texture cache benefits.
This investigation has highlighted the importance of how input data is encoded for textures which do not store well correlated data channels. Content creators are reminded that, while lossy format compressors can compress anything, the bits can only be spent once. It makes sense to help the compressor spend them wisely.
In fixing the format problems, we have also highlighted the efficiency and flexibility of ASTC. We have shown how artists can turn down the dial to 2.65bpp (8x6 blocks, 51.9 dB). This reduces bandwidth and storage requirements by a factor of 3, while keeping a higher PSNR than ETC2+EAC (8bpp, 48.7 dB) with no significant block artifacts.
ASTC is widely supported in the mainstream game engines, and ASTC LDR textures are universally supported on devices supporting OpenGL ES 3.2. Note that all Mali GPUs that support ASTC implement the optional ASTC HDR profile, which would allow the HDR source data for this to be directly encoded, without needing RGBM encoding.
The reference compressor and the RGBM M-channel blockify utility tool can be downloaded from GitHub:
Test image source:
Just a short follow-up to this one. The prototype astc_rgbm_blockify helper script on GitHub applies a naive algorithm, by default applying to every block the M value which preserves the most precision. This works fine for NEAREST filtering, but can cause problems with LINEAR filtering. Under filtering the color channels are sampled and filtered separately (i.e. before RGBM reconstruction), so blocks with wildly different M values can cause artifacts along block boundaries. This is unrelated to ASTC - it's due to the preprocess and would occur without any compression at all.
The ratio parameter to the utility aims to help mitigate this by restricting the maximum size of "step" between blocks, the default recommendation being a 30% delta between blocks (a ratio parameter of 0.7). As we can see in the image above this isn't sufficient to stop these block edge artifacts under filtering. If you have textures with a very fast HDR roll off like this one, a higher ratio value (0.9 - a 10% step change) helps, but by restricting the rate of M roll off you start introducing quantization issues in the region where the M value is being held higher than it would normally be by the ratio limiter.
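The ratio limiter can be sketched as a simple clamping sweep over block M values. This 1D version is our own simplification; the real tool works in 2D across the whole image.

```python
def limit_m_rolloff(m_values, ratio=0.7):
    """Stop any block's M dropping below ratio * its neighbour's M.

    m_values: one row of per-block M values (floats).
    A ratio of 0.7 allows at most a 30% step between adjacent blocks.
    """
    out = list(m_values)
    # Forward pass: each block may fall to at most ratio * previous block.
    for i in range(1, len(out)):
        out[i] = max(out[i], out[i - 1] * ratio)
    # Backward pass so the limit applies in both directions.
    for i in range(len(out) - 2, -1, -1):
        out[i] = max(out[i], out[i + 1] * ratio)
    return out

limit_m_rolloff([1.0, 0.2, 0.1])  # roughly [1.0, 0.7, 0.49]
```

Note how the limiter holds M higher than the content needs in the second and third blocks, which is exactly the quantization trade-off described above: fewer of the 8 RGB bits are left to represent the actual signal in those blocks.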
Another solution would be to apply "blockify" selectively to only blocks where the M value is similar to neighboring blocks, perhaps in conjunction with some ratio-like control over M to make things "more similar" for small quantization impacts.
Smaller ASTC blocks also help to mitigate this. Smaller blocks cover less space, so naturally cover fewer M values, and give a faster roll-off capability within a certain number of texels. This all comes at the expense of more bit rate, of course.
There is no magic fix here - RGBM is a technique which places a data encoding link between the RGB channels and the M channel, and any kind of GPU hardware filtering is simply ignoring that data encoding significance completely.