
OpenCL work-group and G76 core

Hi,

I would like to ask whether a work-group with 192 work items can run on multiple G76 cores.

I thought that, as on other GPUs, one work-group could only run on one shader core. However, that seems not to be the case.

I measured similar latency for a work-group with 192 work items and a work-group with 24 work items, but one core should only be able to run 24 (3x8) work items in parallel.

Does this mean the 192 work items actually ran on multiple cores?

Thank you!

  • Hi,

    Individual work-groups run wholly on a single core. Work-groups are batched before being distributed to cores. That batching is controlled by the driver. By default the driver will configure batching such that each core has access to enough work to be fully loaded (where possible). You can change how that batching operates using [1].
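
    For illustration, here is a minimal sketch of tuning the batch size with clSetKernelExecInfo, assuming your driver exposes the cl_arm_scheduling_controls extension [1]. Check the device's extension string first; the enum name below is the one defined by that extension's headers, so treat this as a sketch rather than verified code:

        #include <CL/cl.h>
        #include <CL/cl_ext.h>   /* cl_arm_scheduling_controls definitions */

        /* Sketch: request a batch size of one work-group, so the driver can
         * hand individual work-groups to each shader core. Assumes "kernel"
         * is a valid cl_kernel on a device reporting the extension. */
        cl_int set_small_batches(cl_kernel kernel)
        {
            cl_uint batch_size = 1; /* work-groups per batch */
            return clSetKernelExecInfo(kernel,
                                       CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM,
                                       sizeof(batch_size), &batch_size);
        }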

    If, for example, you want to run a kernel with only 192 work-items, you probably want to reduce the work-group size (so that the work can be spread across cores) and perhaps reduce batch sizes below the driver's default using [1] to spread the work around further; see the sketch below. Note that this example assumes you are not running other kernels in parallel and that the GPU was idle when the kernel was submitted.
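
    As a concrete sketch of the work-group-size side (standard OpenCL; context, queue and kernel setup plus error handling omitted), launching the same 192 work-items as eight groups of 24 lets the driver distribute the groups across cores:

        #include <CL/cl.h>

        /* Enqueue 192 work-items as 8 work-groups of 24, instead of a
         * single work-group of 192 that must stay on one core. */
        void enqueue_spread(cl_command_queue queue, cl_kernel kernel)
        {
            size_t global_size = 192;
            size_t local_size  = 24;  /* 192 / 24 = 8 work-groups */

            clEnqueueNDRangeKernel(queue, kernel,
                                   1,            /* work_dim */
                                   NULL,         /* global offset */
                                   &global_size, &local_size,
                                   0, NULL, NULL);
        }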

    Hope this helps.

    Regards,

    Kevin

    [1] www.khronos.org/.../cl_arm_scheduling_controls.html
