cl_khr_subgroups questions

Since Mali GPUs lack dedicated local memory, I am trying to use subgroups the way Intel does in its clDNN library. (Intel GPUs do have local memory, but exchanging data through registers is even faster.) I have three questions about subgroups in the Bifrost and Valhall implementations of OpenCL:

  1. How efficient is this extension? Is it implemented as a real register exchange, or through local memory (which on Mali GPUs is really global memory)? I’m particularly interested in the sub_group_broadcast and sub_group_reduce_<op> functions. Is exchanging 32-bit values the most efficient option?
  2. What is the best way to exchange vectors between adjacent threads? For example, to exchange a half16 vector, my guess is to pack it into 32-bit (not 64-bit) values like this:
    half16 in; half16 out;
    // Each broadcast moves two halves as one 32-bit value from subgroup lane 7.
    out.s01 = as_half2(sub_group_broadcast(as_int(in.s01), 7));
    out.s23 = as_half2(sub_group_broadcast(as_int(in.s23), 7));
    out.s45 = as_half2(sub_group_broadcast(as_int(in.s45), 7));
    out.s67 = as_half2(sub_group_broadcast(as_int(in.s67), 7));
    out.s89 = as_half2(sub_group_broadcast(as_int(in.s89), 7));
    out.sab = as_half2(sub_group_broadcast(as_int(in.sab), 7));
    out.scd = as_half2(sub_group_broadcast(as_int(in.scd), 7));
    out.sef = as_half2(sub_group_broadcast(as_int(in.sef), 7));
  3. Is there a way to control how subgroups are mapped within 2D or 3D workgroups?
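
For question 2, the same exchange can be written more compactly by reinterpreting the whole vector once. This is only a sketch of the packing idea, not a statement about what is fastest on Mali. It assumes cl_khr_fp16 and cl_khr_subgroups are both supported, that the subgroup has more work-items than the source lane index, and that the lane argument is the same for every work-item in the subgroup. The helper name broadcast_half16 is hypothetical.

```opencl
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

// Hypothetical helper: broadcast a half16 from subgroup lane `lane`
// as eight 32-bit values (two halves per broadcast).
// `lane` must be uniform across the subgroup.
inline half16 broadcast_half16(half16 in, uint lane)
{
    // half16 and int8 are both 32 bytes, so as_int8 only reinterprets the bits.
    int8 packed = as_int8(in);
    int8 result;

    result.s0 = sub_group_broadcast(packed.s0, lane);
    result.s1 = sub_group_broadcast(packed.s1, lane);
    result.s2 = sub_group_broadcast(packed.s2, lane);
    result.s3 = sub_group_broadcast(packed.s3, lane);
    result.s4 = sub_group_broadcast(packed.s4, lane);
    result.s5 = sub_group_broadcast(packed.s5, lane);
    result.s6 = sub_group_broadcast(packed.s6, lane);
    result.s7 = sub_group_broadcast(packed.s7, lane);

    // Reinterpret the eight 32-bit values back as sixteen halves.
    return as_half16(result);
}
```

With this helper, the block in question 2 would reduce to out = broadcast_half16(in, 7);. It still issues eight scalar broadcasts, so it changes readability rather than the number of exchanges.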