cl_khr_subgroups questions

Hello,
Because Mali has no dedicated local memory, I am trying to use subgroups the way Intel does in the clDNN library. Intel GPUs do have local memory, but exchanging registers is even faster than going through it. I have three questions about subgroups in the Bifrost and Valhall implementations of OpenCL:

  1. How efficient is this extension? Is it implemented through a real exchange of registers, or through local memory (which on Mali GPUs is really global memory)? I’m particularly interested in the sub_group_broadcast and sub_group_reduce_<op> functions. Is exchanging 32-bit variables the most efficient option?
  2. What is the best way to exchange vectors between adjacent threads? For example, to exchange a half16 vector, my guess is to pack it into 32-bit (not 64-bit) values like this:
    half16 in; half16 out;
    // Broadcast lane 7's vector, one 32-bit half2 pair at a time.
    out.s01 = as_half2(sub_group_broadcast(as_int(in.s01), 7));
    out.s23 = as_half2(sub_group_broadcast(as_int(in.s23), 7));
    out.s45 = as_half2(sub_group_broadcast(as_int(in.s45), 7));
    out.s67 = as_half2(sub_group_broadcast(as_int(in.s67), 7));
    out.s89 = as_half2(sub_group_broadcast(as_int(in.s89), 7));
    out.sab = as_half2(sub_group_broadcast(as_int(in.sab), 7));
    out.scd = as_half2(sub_group_broadcast(as_int(in.scd), 7));
    out.sef = as_half2(sub_group_broadcast(as_int(in.sef), 7));
  3. Is there a way to control how subgroups are mapped inside 2D or 3D workgroups?
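
The packing idea in question 2 can be checked on the host before touching a kernel. The following is a minimal sketch in plain C, not device code: the helper names `pack_half16` and `unpack_half16` are hypothetical, half values are represented by their raw 16-bit patterns, and the low-half-first ordering assumes a little-endian device. It only shows that a half16 maps to exactly eight 32-bit words, so eight broadcasts move the whole vector.

```c
#include <assert.h>
#include <stdint.h>

#define HALF16_ELEMS 16  /* elements in a half16 vector */
#define HALF16_WORDS 8   /* 32-bit words needed to carry them */

/* Hypothetical helper: pack 16 raw half bit patterns into 8 words.
   Element 2*i goes in the low 16 bits, element 2*i+1 in the high 16 bits
   (assumed to match as_int(half2) on a little-endian device). */
static void pack_half16(const uint16_t in[HALF16_ELEMS],
                        uint32_t words[HALF16_WORDS])
{
    assert(in != 0 && words != 0);
    for (int i = 0; i < HALF16_WORDS; ++i) {
        words[i] = (uint32_t)in[2 * i] | ((uint32_t)in[2 * i + 1] << 16);
    }
}

/* Hypothetical helper: inverse of pack_half16. */
static void unpack_half16(const uint32_t words[HALF16_WORDS],
                          uint16_t out[HALF16_ELEMS])
{
    assert(words != 0 && out != 0);
    for (int i = 0; i < HALF16_WORDS; ++i) {
        out[2 * i]     = (uint16_t)(words[i] & 0xFFFFu);
        out[2 * i + 1] = (uint16_t)(words[i] >> 16);
    }
}
```

Each of the eight words corresponds to one sub_group_broadcast call in the kernel snippet above; the round trip through `pack_half16` and `unpack_half16` leaves the bit patterns unchanged.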
  • Hi,

    1. Broadcast and reduction operations (and, in general, all subgroup operations) don't touch memory. The data stays within the GPU cores. 32 bits is indeed the maximum amount of data that can be exchanged with a single machine operation.

    2. Yes, that is the best way. There is talk of supporting all vector types in future extensions, but we'd implement it like that under the hood for existing GPUs anyway.

    3. Not currently, but the mapping is always the same and you can predict it. Subgroups are allocated in local linear id order, which is well defined (see www.khronos.org/.../get_local_linear_id.html).
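
To make point 3 concrete, here is a minimal host-side sketch in plain C of how the mapping could be predicted. `local_linear_id` follows the formula OpenCL defines for get_local_linear_id(); the two `predicted_*` helpers are hypothetical and simply assume that subgroups are filled in linear id order with a fixed subgroup size. The subgroup size is a parameter here because it depends on the device and kernel; query it at run time rather than hard-coding it.

```c
#include <assert.h>
#include <stddef.h>

/* Local linear id as OpenCL defines it for get_local_linear_id():
   x varies fastest, then y, then z. */
static size_t local_linear_id(size_t x, size_t y, size_t z,
                              size_t size_x, size_t size_y)
{
    return z * size_y * size_x + y * size_x + x;
}

/* Hypothetical helper: subgroup index, assuming subgroups are
   allocated in local linear id order with a fixed size. */
static size_t predicted_sub_group_id(size_t linear_id, size_t sub_group_size)
{
    assert(sub_group_size > 0);
    return linear_id / sub_group_size;
}

/* Hypothetical helper: lane index within the subgroup,
   under the same assumption. */
static size_t predicted_sub_group_local_id(size_t linear_id,
                                           size_t sub_group_size)
{
    assert(sub_group_size > 0);
    return linear_id % sub_group_size;
}
```

For example, in an 8x4 workgroup with an assumed subgroup size of 8, the work-item at (3, 2, 0) has linear id 19, so it would land in subgroup 2 at lane 3. In that configuration each row of the workgroup forms one subgroup, so choosing the workgroup's x size is the indirect way to shape the subgroups.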

    Let me know if you have other questions.

    Regards,

    Kévin
