Will the compute kernel compilers (OpenCL 1.2 or some future Vulkan 1.1) still support explicit vector style programming (e.g. float4, uint4, int4, half8) on the Bifrost GPUs with 4-wide execution engines?
I'm asking because I have some "embarrassingly parallel" algorithms that map well to SIMD-style vector programming but benefit from inter-lane communications.
On a scalar-per-thread design, this can be accomplished with shuffles.
But if shuffles aren't available I would prefer to use explicit vectors and permutations.
Any tips on whether this is possible in OpenCL?
Or, maybe, VK 1.1 will bring subgroup shuffles to G7x/G31?
Thanks,
-ASM
Pre-Bifrost Mali is a SIMD architecture.
Bifrost is 4-wide SIMT with a 32-bit data path. Narrower types treat that 32-bit path as a small SIMD unit, so to get the efficiency benefit for fp16 computation you need code that maps cleanly onto SIMD vec2 operations.
For vector operations on 32-bit types, such as your float4/uint4/int4 examples, the two architectures should behave similarly. Explicit vector code compiles to efficient SIMD code on the pre-Bifrost architecture, and the compiler can always scalarize that same code for a SIMT architecture such as Bifrost. The inverse is not true: automatically vectorizing scalar code can be difficult, so always try to write vector code where you can.
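To make the scalarization point concrete, here is a minimal host-side C sketch, not kernel code. It emulates what a 4-wide permutation does within a single work-item and shows the same operation written in vector form and in scalarized form. The `float4` struct, `shuffle4`, `add_reversed_vector`, `add_reversed_scalar` and `self_check` are hypothetical names made up for this illustration. In a real OpenCL C kernel, `float4` is a built-in type and the permutation would be a swizzle such as `v.wzyx` or the built-in `shuffle(v, mask)`.

```c
#include <assert.h>

/* Host-side stand-in for OpenCL C's built-in float4 (hypothetical emulation;
   in a kernel the type is provided by the language itself). */
typedef struct { float s[4]; } float4;

/* Emulates OpenCL C's shuffle(v, mask) for 4-wide vectors: result lane i takes
   source lane mask[i]. Only the low 2 bits of each mask entry are used, which
   mirrors the built-in ignoring the remaining mask bits. */
static float4 shuffle4(float4 v, const unsigned mask[4]) {
    float4 r;
    for (int i = 0; i < 4; ++i) {
        r.s[i] = v.s[mask[i] & 3u];
    }
    return r;
}

/* Vector form: add each lane to its mirrored lane.
   Roughly `v + v.wzyx` in OpenCL C. */
static float4 add_reversed_vector(float4 v) {
    const unsigned rev[4] = {3u, 2u, 1u, 0u};
    float4 p = shuffle4(v, rev);
    float4 r;
    for (int i = 0; i < 4; ++i) {
        r.s[i] = v.s[i] + p.s[i];
    }
    return r;
}

/* Scalarized form: the permutation becomes plain component selection, which
   is why going from vector code to scalar code is the easy direction. */
static float4 add_reversed_scalar(float4 v) {
    float4 r;
    r.s[0] = v.s[0] + v.s[3];
    r.s[1] = v.s[1] + v.s[2];
    r.s[2] = v.s[2] + v.s[1];
    r.s[3] = v.s[3] + v.s[0];
    return r;
}

/* Checks that both forms agree lane by lane on one sample input. */
static void self_check(void) {
    float4 v = {{1.0f, 2.0f, 3.0f, 4.0f}};
    float4 a = add_reversed_vector(v);
    float4 b = add_reversed_scalar(v);
    for (int i = 0; i < 4; ++i) {
        assert(a.s[i] == b.s[i]);
    }
}
```

Note that these permutations move data between the components of one work-item's vector. They are not a substitute for cross-thread shuffles, which depend on subgroup support in the API and driver.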