Hi! Half a year later, I'm back with the same question, albeit regarding Vulkan :)
We're developing for G76/G77 devices, such as the Redmi Note 8 Pro and Samsung S20FE, and we can't see any effect of F16 support on any of our Mali GPUs. By "effect" I mean that even intentional half overflows in shader calculations don't produce any artifacts on Mali GPUs, while they do show up on Adreno and desktop hardware. Performance is also the same with or without the F16 extensions.
Here's how we create the VkDevice:
VkPhysicalDeviceFeatures deviceFeatures = {};
deviceFeatures.imageCubeArray = true;
deviceFeatures.independentBlend = true;
devCreateInfo.pEnabledFeatures = &deviceFeatures;
devCreateInfo.enabledLayerCount = 0;
devCreateInfo.ppEnabledExtensionNames = deviceExtensions.data();
devCreateInfo.enabledExtensionCount = (uint32_t)deviceExtensions.size();
VkPhysicalDeviceFloat16Int8FeaturesKHR float16Features = { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FLOAT16_INT8_FEATURES_KHR };
float16Features.shaderFloat16 = true;
devCreateInfo.pNext = &float16Features;
VK_CALL( vkCreateDevice( physDevice, &devCreateInfo, nullptr, &m_device.device ) );
We also include VK_KHR_shader_float16_int8 in ppEnabledExtensionNames.
Strangely, although the GLES extensions app reports support for the VK_KHR_shader_float16_int8 extension, RenderDoc fails when we replay captures on Mali hardware, stating that VK_KHR_shader_float16_int8 is not supported. We've also tried forcing the compiler to emit RelaxedPrecision decorations, but this made no visible difference.
Can you please clarify:
Hi Ivan,
A relatively long answer, sorry ...
From the GPU side of things, all Mali GPUs support 16-bit calculations, generally implemented as vec2 issue down a 32-bit data path. Not every hardware instruction has a 16-bit variant, and there are cases where we force 32-bit texture coords in fragment shaders because of many content issues in the wild. The compiler may also choose not to use a 16-bit operation if it doesn't make sense (e.g. when the cost of type conversion is higher than the saving from switching to narrower precision).
Lack of performance gain can occur for a couple of reasons:
I'm not aware of any vendors blocking fp16 support - it would be a major power efficiency and performance hit with no obvious upside.
In the general case there isn't automatic overflow clamping, so fp16 overflows should be visible if an fp16 type is being used.
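To make the overflow point concrete, here is a minimal CPU-side sketch of what rounding to fp16 does to a value. It is only an illustration under stated assumptions: roundToHalf is a hypothetical helper (not part of Vulkan or any Mali tooling), it assumes the default round-to-nearest-even mode, and it says nothing about how a particular GPU schedules its 16-bit operations.

```cpp
#include <cmath>
#include <limits>

// Round a value to the nearest IEEE 754 binary16 (fp16) value and return it
// as a float. Hypothetical helper for illustration only; assumes the default
// round-to-nearest-even rounding mode is active.
float roundToHalf(float value)
{
    if (std::isnan(value) || std::isinf(value) || value == 0.0f)
        return value;

    const double maxHalf = 65504.0;                 // largest finite fp16 value
    const double minNormal = std::ldexp(1.0, -14);  // smallest normal fp16 value
    const double magnitude = std::fabs(static_cast<double>(value));
    double rounded = 0.0;

    if (magnitude < minNormal) {
        // Subnormal range: values are spaced 2^-24 apart.
        rounded = std::ldexp(std::nearbyint(std::ldexp(magnitude, 24)), -24);
    } else {
        // Normal range: keep 11 significant bits (1 implicit + 10 stored).
        int exponent = 0;
        const double mantissa = std::frexp(magnitude, &exponent);  // in [0.5, 1)
        rounded = std::ldexp(std::nearbyint(std::ldexp(mantissa, 11)), exponent - 11);
    }

    // No clamping: anything that rounds past the largest finite value becomes infinity.
    if (rounded > maxHalf)
        rounded = std::numeric_limits<double>::infinity();

    return static_cast<float>(std::copysign(rounded, static_cast<double>(value)));
}
```

With this helper, roundToHalf(300.0f * 300.0f) is infinity because 90000 exceeds 65504, and roundToHalf(1024.3f) is exactly 1024 because fp16 values between 1024 and 2048 are spaced 1.0 apart. Large intermediate products and large, slowly changing offsets are therefore the kind of inputs where a genuinely 16-bit shader should look visibly wrong.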
For Vulkan, RelaxedPrecision is fine. On newer drivers with the extension for explicit types, that should work too (but it has the same limitation: it might not be explicit in reality if the hardware operation isn't physically available in a 16-bit flavor).
What toolchain are you using to generate your SPIR-V? We have seen a few cases where the final SPIR-V has lost RelaxedPrecision annotation by the time it reaches the final output we get given. If you're able to share a SPIR-V file we can check what's going on (feel free to email me via "mobilestudio at arm dot com" if you can't share publicly).
Cheers, Pete
Hi Peter! I've sent an email to the address you mentioned. It contains SPIR-V asm extracted from RenderDoc. It's compiled from HLSL using DXC v1.6.2104.52 with the following command-line arguments:
-spirv -fspv-target-env=vulkan1.0 -fvk-use-dx-layout -Zpr -HV 2018 -enable-16bit-types -O3
Is there any general advice you can give about using DXC, and potentially SPIRV-Tools, when targeting Mali hardware, such as recommended optimization layers, input arguments for DXC, etc.?
We continued the discussion by email and eventually found the core issue: the drivers for the Redmi Note 8 Pro (G76) and Samsung S20FE (G77) ignore true half-precision types, but do honor relaxed-precision operations, such as those generated for min16float. Using min16float instead of half types produces SPIR-V with RelaxedPrecision decorations, and running spvtools::CreateRelaxFloatOpsPass() also produces shaders that exhibit F16 precision artifacts, such as distorted vertices and broken texture scrolling. The driver simply expects RelaxedPrecision decorations, and the visual artifacts differ from those on Adreno. They even differ between GPU generations: with spvtools::CreateRelaxFloatOpsPass(), both G76 and G77 output broken results, but in slightly different ways.
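For anyone who wants to check which of the two forms their toolchain actually emitted, here is a rough sketch of a SPIR-V word scanner. It is an assumption-laden illustration, not the tool used in this thread: PrecisionSummary and summarizePrecision are hypothetical names, the numeric opcode, capability and decoration values are taken from the SPIR-V specification, and the input is assumed to be a module already loaded as host-endian 32-bit words.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Precision-related facts about a SPIR-V module (hypothetical helper type).
struct PrecisionSummary {
    bool validHeader = false;                // magic number matched
    bool declaresFloat16Capability = false;  // OpCapability Float16
    bool declaresFloat16Type = false;        // OpTypeFloat with width 16
    std::size_t relaxedPrecisionCount = 0;   // RelaxedPrecision decorations
};

PrecisionSummary summarizePrecision(const std::vector<std::uint32_t>& words)
{
    // Numeric values from the SPIR-V specification.
    constexpr std::uint32_t kMagic = 0x07230203u;
    constexpr std::uint32_t kOpCapability = 17;
    constexpr std::uint32_t kOpTypeFloat = 22;
    constexpr std::uint32_t kOpDecorate = 71;
    constexpr std::uint32_t kOpMemberDecorate = 72;
    constexpr std::uint32_t kCapabilityFloat16 = 9;
    constexpr std::uint32_t kDecorationRelaxedPrecision = 0;
    constexpr std::size_t kHeaderWords = 5;

    PrecisionSummary summary;
    if (words.size() < kHeaderWords || words[0] != kMagic)
        return summary;
    summary.validHeader = true;

    std::size_t i = kHeaderWords;
    while (i < words.size()) {
        // First word of each instruction: high 16 bits = word count, low 16 bits = opcode.
        const std::uint32_t wordCount = words[i] >> 16;
        const std::uint32_t opcode = words[i] & 0xFFFFu;
        if (wordCount == 0 || i + wordCount > words.size())
            break;  // malformed stream, stop scanning

        switch (opcode) {
        case kOpCapability:
            if (wordCount >= 2 && words[i + 1] == kCapabilityFloat16)
                summary.declaresFloat16Capability = true;
            break;
        case kOpTypeFloat:  // operands: result id, width
            if (wordCount >= 3 && words[i + 2] == 16)
                summary.declaresFloat16Type = true;
            break;
        case kOpDecorate:  // operands: target id, decoration
            if (wordCount >= 3 && words[i + 2] == kDecorationRelaxedPrecision)
                ++summary.relaxedPrecisionCount;
            break;
        case kOpMemberDecorate:  // operands: struct type id, member index, decoration
            if (wordCount >= 4 && words[i + 3] == kDecorationRelaxedPrecision)
                ++summary.relaxedPrecisionCount;
            break;
        default:
            break;
        }
        i += wordCount;
    }
    return summary;
}
```

A module that reports declaresFloat16Type with a relaxedPrecisionCount of zero would be using explicit 16-bit types only, while the reverse would indicate the relaxed-precision form that these drivers responded to. The same information can of course be read directly from a disassembly.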
We're in the process of evaluating energy/performance impact of F16 in our case, but, in the end, my original claim was wrong and F16 operations on Mali do work.
For now, we've settled on the following definitions at the beginning of our HLSL shaders:
"#define half min16float\n";
"#define half2 min16float2\n";
"#define half3 min16float3\n";
"#define half4 min16float4\n";
"#define half3x3 min16float3x3\n";
"#define half4x4 min16float4x4\n\n";