Hello, Since Mali GPUs lack dedicated local memory, I am trying to use subgroups the way Intel does in its clDNN library (Intel GPUs do have local memory, but exchanging data through registers is even faster). I have three questions about subgroups in the Bifrost and Valhall implementations of OpenCL:
Hi,
1. Broadcast and reduction operations (and, more generally, all subgroup operations) don't touch memory. The data stays within the GPU cores. 32 bits is indeed the maximum amount of data that can be exchanged with a single machine operation.
2. Yes, it is the best way. There is talk of supporting all vector types in future extensions, but on existing GPUs we would implement it that way under the hood anyway.
3. Not currently, but the mapping is always the same, so you can predict it. Subgroups are allocated in local linear ID order, which is well defined (see www.khronos.org/.../get_local_linear_id.html).
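As a rough sketch of that prediction, the plain C helpers below reproduce the formula the OpenCL specification gives for get_local_linear_id() and derive from it which subgroup and lane a work-item should land in. The helper names are made up for illustration, and the division step assumes subgroups are filled with consecutive linear IDs. The subgroup size is device-dependent, so pass in whatever your device reports (for example the value of get_sub_group_size() inside the kernel).

```c
#include <stddef.h>

/* Same formula the OpenCL specification gives for get_local_linear_id():
 * id[] holds get_local_id(0..2), size[] holds get_local_size(0..2). */
static size_t local_linear_id(const size_t id[3], const size_t size[3])
{
    return id[2] * size[1] * size[0] + id[1] * size[0] + id[0];
}

/* Hypothetical helper: predicted subgroup index of a work-item,
 * ASSUMING subgroups are filled with consecutive linear IDs. */
static size_t predicted_sub_group_id(size_t linear_id, size_t sub_group_size)
{
    return linear_id / sub_group_size;
}

/* Hypothetical helper: predicted lane of the work-item within its
 * subgroup, under the same assumption. */
static size_t predicted_sub_group_local_id(size_t linear_id, size_t sub_group_size)
{
    return linear_id % sub_group_size;
}
```

For example, in an 8x4x1 work-group the work-item with local ID (3, 2, 0) has linear ID 2 * 8 + 3 = 19, so with a subgroup size of 16 it would be expected in subgroup 1, lane 3. It is worth checking this against get_sub_group_id() and get_sub_group_local_id() on the actual device before relying on it.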
Let me know if you have other questions.
Regards,
Kévin
Awesome! Thank you so much!