cl_khr_subgroups questions

Yury over 5 years ago

Hello,
Because Mali lacks dedicated local memory, I am trying to use subgroups the way Intel does in its clDNN library (Intel GPUs do have local memory, but exchanging registers is even faster than going through it). I have three questions about subgroups in the Bifrost and Valhall implementations of OpenCL:

1. How efficient is this extension? Is it implemented through a real exchange of registers, or through local memory (which on Mali GPUs is really global memory)? I am particularly interested in the sub_group_broadcast and sub_group_reduce_<op> functions. Is exchanging 32-bit variables the most efficient option?

2. What is the best way to exchange vectors between adjacent threads? For example, to exchange a half16 vector, my guess is to pack it into 32-bit (not 64-bit) values like this:

    half16 in; half16 out;
    out.s01 = as_half2(sub_group_broadcast(as_int(in.s01), 7));
    out.s23 = as_half2(sub_group_broadcast(as_int(in.s23), 7));
    out.s45 = as_half2(sub_group_broadcast(as_int(in.s45), 7));
    out.s67 = as_half2(sub_group_broadcast(as_int(in.s67), 7));
    out.s89 = as_half2(sub_group_broadcast(as_int(in.s89), 7));
    out.sab = as_half2(sub_group_broadcast(as_int(in.sab), 7));
    out.scd = as_half2(sub_group_broadcast(as_int(in.scd), 7));
    out.sef = as_half2(sub_group_broadcast(as_int(in.sef), 7));

3. Is there a way to control how subgroups map onto 2D or 3D workgroups?

Kévin Petit over 5 years ago +1 verified

Hi,

1. Broadcast and reduction operations (and generally all subgroup operations) don't touch memory. The data stays within the GPU cores. 32-bit is indeed the max amount of data that can be exchanged with a single machine operation.

2. Yes, it is the best way. There are talks of supporting all vector types in future extensions but we'd implement it like that under the hood for existing GPUs anyway.

3. Not currently, but the mapping is always the same, so you can predict it. Subgroups are allocated following the local linear ID order, which is well defined (see www.khronos.org/.../get_local_linear_id.html).
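
To illustrate point 3, here is a minimal host-side C sketch of how the mapping could be predicted. The linear ID formula is the one the OpenCL specification gives for get_local_linear_id(). The division into subgroups is an assumption drawn from the statement above that subgroups follow local linear ID order. The names wi3, local_linear_id, predicted_sub_group_id and predicted_sub_group_local_id are hypothetical helpers, and the subgroup size is a parameter you would have to obtain from the device (for example with get_max_sub_group_size() inside a kernel).

```c
#include <stddef.h>

/* Hypothetical helper type: a 3D index or extent, mirroring what
   get_local_id(d) and get_local_size(d) return for d = 0, 1, 2. */
typedef struct { size_t x, y, z; } wi3;

/* Same formula as the OpenCL built-in get_local_linear_id(). */
static size_t local_linear_id(wi3 id, wi3 size)
{
    return id.z * size.y * size.x + id.y * size.x + id.x;
}

/* ASSUMPTION: subgroups are filled in local linear ID order, so
   consecutive linear IDs land in the same subgroup. */
static size_t predicted_sub_group_id(size_t linear_id, size_t sub_group_size)
{
    return linear_id / sub_group_size;
}

/* ASSUMPTION: the lane within the subgroup follows the same order. */
static size_t predicted_sub_group_local_id(size_t linear_id, size_t sub_group_size)
{
    return linear_id % sub_group_size;
}
```

For example, in an 8x4 work-group with an assumed subgroup size of 16, work-item (3, 2, 0) has linear ID 2 * 8 + 3 = 19, so under these assumptions it would be lane 3 of subgroup 1, and each subgroup would cover two full rows of the work-group. Inside a real kernel, get_sub_group_id() and get_sub_group_local_id() report the actual values and can be used to check the prediction.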

Let me know if you have other questions.

Regards,

Kévin

0 Yury over 5 years ago in reply to Kévin Petit

Awesome! Thank you so much!