Will the compute kernel compilers (OpenCL 1.2, or some future Vulkan 1.1) still support explicit vector-style programming (e.g. float4, uint4, int4, half8) on the Bifrost GPUs with their 4-wide execution engines?
I'm asking because I have some "embarrassingly parallel" algorithms that map well to SIMD-style vector programming but benefit from inter-lane communications.
On a scalar-per-thread design, this can be accomplished with shuffles.
But if shuffles aren't available I would prefer to use explicit vectors and permutations.
Any tips on whether this is possible on OpenCL?
Or maybe VK 1.1 will bring subgroup shuffles to G7x/G31?
Thanks,
-ASM
The OpenCL specification requires support for vector types, so they will of course still be supported. The compiler will handle any mapping to the underlying ISA (SIMD or otherwise).
Bifrost generally will still benefit from vector data types, in particular for:
- memory accesses, where vector loads and stores are more efficient than scalar ones, and
- 8-bit and 16-bit data types, which are executed as packed SIMD operations within the 32-bit data path.
Do you have a specific example you can share? It's hard to provide advice when the question is so generic.
HTH, Pete
I'd like to use vector component addressing to simulate VK 1.1 shuffles (assuming they aren't supported). For example, an "inclusive scan max" would be:
uint4 scan_inclusive_max(uint4 const v)
{
  // 0123
  // 0012 max
  // ----
  // 0123
  // 0101 max
  // ----
  // 0123
  uint4 const w = max(v, v.s0012);
  uint4 const x = max(w, w.s0101);
  return x;
}
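Hand-checking that with a concrete input (my arithmetic, not run on a device):

// v       = (3, 1, 4, 1)
// v.s0012 = (3, 3, 1, 4)  =>  w = max(v, v.s0012) = (3, 3, 4, 4)
// w.s0101 = (3, 3, 3, 3)  =>  x = max(w, w.s0101) = (3, 3, 4, 4)
// i.e. lane i ends up holding max(v.s0 .. lane i), the inclusive scan-max.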
Another example might be simulating an XOR shuffle (butterfly):
// SIMT / Scalar
uint a = ...;
uint b = ...;
int lt = a < subgroupShuffleXor(b, 3);

// SIMD / Vector
uint4 a = ...;
uint4 b = ...;
int4 lt = a < b.s3210;
(Forgive any typos/bugs!)
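For completeness, each quad-local XOR mask maps to a fixed swizzle (lane i reads lane i ^ mask):

// mask 1: b.s1032  (swap adjacent lanes)
// mask 2: b.s2301  (swap lane pairs)
// mask 3: b.s3210  (full reversal, as above)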
For 32-bit types, using swizzle selectors should work just fine for this - it's core in the specification, so it is available on all GPUs which support the API, irrespective of the underlying ISA.
For 8-bit and 16-bit types you might lose some efficiency for some swizzles if they don't vectorize neatly into 32-bit register accesses. E.g. for a short4, the .s01 and .s23 swizzles are fine because each stays within a single 32-bit register, whereas .s02 might require two accesses because the lanes being accessed are not in the same 32 bits of storage.
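A sketch of the distinction (illustrative only; the actual code generation is compiler-dependent):

// A short4 packs into two 32-bit registers: [s0 s1] and [s2 s3].
short4 v = (short4)(1, 2, 3, 4);
short2 a = v.s01;  // both lanes in the first register: one access
short2 b = v.s23;  // both lanes in the second register: one access
short2 c = v.s02;  // lanes straddle both registers: may need two accesses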
Got it.
I probably wasn't clear enough... I understand that it will compile, but I'm really asking whether a vector-style program will be performant on Bifrost and compile down to fast native opcodes for the 4-wide execution engine.
I'm trying to avoid using local memory to share data between a subgroup's lanes, and reverting to vector-style programming with swizzles would let me do exactly that. Scalar-per-thread (SIMT) programming requires bouncing data through local memory if there is no support for subgroup operations.
So instead of launching a workgroup of Nx4 scalar-per-thread work items, I would like to launch N work items and work on quads.
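For concreteness, here's roughly the SIMT fallback I'm trying to avoid - a sketch only, with an illustrative kernel name, and assuming the work-group size is a multiple of 4:

kernel void quad_scan_max_simt(global uint *buf, local uint *scratch)
{
  size_t const lid = get_local_id(0);
  uint v = buf[get_global_id(0)];

  // No subgroup shuffles, so publish every lane's value first.
  scratch[lid] = v;
  barrier(CLK_LOCAL_MEM_FENCE);

  // Naive inclusive scan-max across each aligned quad of work items.
  size_t const lane = lid & 3;     // position within the quad
  size_t const base = lid - lane;  // first work item of the quad
  for (size_t i = 0; i < lane; ++i)
    v = max(v, scratch[base + i]);

  buf[get_global_id(0)] = v;
}

Even this single round trip through local memory (plus the barrier) is the overhead the swizzle version avoids.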
Other GPU vendors do not support vector-style programming, but Bifrost looks like it could potentially support both SIMD and SIMT, since its execution engine width is a narrow quad and supporting both styles seems practical.

I have a HiKey 960 with OpenCL installed and working, so I suppose I could just try it, but I wanted to hear from the experts first. :)
Pre-Bifrost Mali is a SIMD architecture.
Bifrost is 4-wide SIMT with a 32-bit data path. Narrower types treat the 32-bit path as a small SIMD unit (e.g. to get efficiency benefits for fp16 computation you need something which converts into clean SIMD vec2 operations).
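As an illustration (assuming cl_khr_fp16 is available), this is the kind of fp16 code that does and does not map onto the packed path:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

half2 a = (half2)(1.0h, 2.0h);
half2 b = (half2)(3.0h, 4.0h);
half2 c = a + b;        // clean vec2 op: one packed add in the 32-bit path
half  d = a.s0 + b.s1;  // cross-lane scalar mix: may not pack cleanly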
For vector operations on 32-bit types, such as your examples, the two architectures should be similar. You'll generate efficient SIMD code, and the compiler can always scalarize the equivalent of the SIMD code for SIMT architectures. The inverse is not true - vectorizing code can be difficult - so always try to write vector code where you can.
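To make that concrete (a schematic example, not actual Bifrost codegen):

// Vector source: trivially scalarized for a SIMT target -
// each lane simply executes one scalar add.
uint4 c = a + b;

// Scalar source: to fuse these back into a uint4 add, the compiler
// must first prove the four operations are independent and the data
// contiguous - which is much harder in general.
uint c0 = a0 + b0;
uint c1 = a1 + b1;
uint c2 = a2 + b2;
uint c3 = a3 + b3;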