
About the Mali-G76 MP12 GPU and its micro-architecture

1) Does the GPU context switch between warps to hide memory-access latency when the kernel contains memory operations?

2) I saw in the datasheet that the maximum thread count is 768. Is it correct that this means 256 threads per execution engine?

As far as I know, there are 8 lanes (an 8-wide warp) per execution engine. How can they run 768 threads simultaneously? (With context switching, or are there more lanes?)

I want to understand how threads are executed from a micro-architecture perspective.

3) If they can run 768 threads simultaneously and the work-group size is only 24, do they run 24 warps (8-wide warps × 3 engines) with the same work-group ID per core?

If the work-group size is 8, do the remaining lanes (24 lanes − 8 = 16 lanes) sit idle?

(In the case of Nvidia, multiple warps from the same work-group run on one SM.)

Please help me!

1) Does the GPU context switch between warps to hide memory-access latency when the kernel contains memory operations?

Massively multithreaded architectures like GPU shader cores context switch all the time.

BUT don't think of this as the heavyweight context switching a CPU does; that's simply not how this type of hardware works. The contexts for all threads are live in registers all the time, and the GPU can pick any active warp on any clock cycle.
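To picture how cheap that switching is, here is a minimal, purely illustrative sketch (this is not Arm's actual scheduler; the 32-warp pool size and the round-robin policy are assumptions for the example). The point is that choosing the next warp is just an index selection over state that is already resident:

```c
#include <stdbool.h>

/* Hypothetical pool of resident warps for one execution engine.
 * Every warp's registers stay allocated in hardware, so "switching"
 * never saves or restores anything. */
#define RESIDENT_WARPS 32   /* illustrative figure, not a Mali-G76 spec */

typedef struct {
    bool waiting_on_memory; /* true while a load/texture fetch is outstanding */
    unsigned pc;            /* per-warp program counter */
} warp_state;

/* Each cycle, scan round-robin from the last issued warp and pick the
 * first one that is ready; return -1 if everything is stalled. */
int pick_warp(const warp_state warps[RESIDENT_WARPS], int last_issued)
{
    for (int i = 1; i <= RESIDENT_WARPS; ++i) {
        int w = (last_issued + i) % RESIDENT_WARPS;
        if (!warps[w].waiting_on_memory)
            return w;       /* issue this warp's next instruction */
    }
    return -1;              /* bubble cycle: all warps waiting on memory */
}
```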

As far as I know, there are 8 lanes (an 8-wide warp) per execution engine. How can they run 768 threads simultaneously? (With context switching, or are there more lanes?)

The first point is: don't confuse issue width (new operations issued per cycle) with execution capacity (total number of in-flight operations). Most operations have a latency of multiple cycles, so you can have more than one warp live in a single hardware unit at the same time purely due to pipeline length. Some of the fixed-function units (the texture unit, for example) have very long pipelines.
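To make the issue-width/capacity distinction concrete, here is a rough back-of-the-envelope calculation using the figures already in this thread (3 engines × 8 lanes per core, 768 resident threads per core); treat the even split across engines as an assumption:

```c
#include <stdio.h>

int main(void)
{
    const int engines_per_core = 3;    /* Mali-G76 execution engines per core */
    const int lanes_per_engine = 8;    /* one 8-wide warp issued per cycle    */
    const int max_threads_core = 768;  /* maximum in-flight threads per core  */

    int issue_width  = engines_per_core * lanes_per_engine; /* 24 threads/cycle  */
    int warps_core   = max_threads_core / lanes_per_engine; /* 96 resident warps */
    int warps_engine = warps_core / engines_per_core;       /* 32 per engine (assumed even split) */

    printf("issue width   : %d threads per cycle\n", issue_width);
    printf("resident warps: %d per core, %d per engine\n", warps_core, warps_engine);
    /* Only 3 warps issue a new instruction each cycle; the other resident
     * warps are covering pipeline and memory latency. */
    return 0;
}
```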

The other point is that the purpose of having this many threads is to hide the latency of memory fetches. A significant portion of the 768 are likely to be idle waiting for data from memory; the others keep the hardware busy.

3) If they can run 768 threads simultaneously and the work-group size is only 24, do they run 24 warps (8-wide warps × 3 engines) with the same work-group ID per core?

If your work-group size is 24 threads, you'll get 3 warps (24 / 8) per work group, and 32 (768 / 24) different work groups running in each core at the same time.
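As a sketch of that arithmetic (the 8-wide warp and the 768-thread ceiling are taken from the discussion above; real occupancy is also bounded by register and local-memory usage, which this ignores):

```c
#include <stdio.h>

/* Round up to whole warps: a partial warp still occupies a full 8-wide slot. */
static int warps_per_workgroup(int wg_size, int warp_width)
{
    return (wg_size + warp_width - 1) / warp_width;
}

int main(void)
{
    const int warp_width       = 8;    /* lanes per execution engine        */
    const int max_threads_core = 768;  /* maximum resident threads per core */

    int sizes[] = { 24, 8, 4 };
    for (int i = 0; i < 3; ++i) {
        int wg    = sizes[i];
        int warps = warps_per_workgroup(wg, warp_width);
        /* Upper bound only: register/local-memory pressure can lower this. */
        int concurrent_wgs = max_threads_core / (warps * warp_width);
        printf("work-group %2d -> %d warp(s), up to %d work-groups per core\n",
               wg, warps, concurrent_wgs);
    }
    return 0;
}
```

For 24, 8, and 4 this prints 32, 96, and 96 concurrent work groups respectively, matching the figures below; the 4-wide case occupies just as many warp slots as the 8-wide case while doing half the work.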

If the work-group size is 8, do the remaining lanes (24 lanes − 8 = 16 lanes) sit idle?

If the work-group size is 8, then you get 1 warp per work group, and 96 different work groups running in each core at the same time. The hardware is three independent 8-wide units, so an 8-wide work group is fine; you'll just issue three different work groups in parallel.

What you want to avoid is work groups narrower than 8. A 4-wide work group would only use half of the 8-wide hardware, so keep work-group sizes at 8 or above (and on newer hardware, 16 or above).
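If you're writing OpenCL, you don't need to hard-code the 8 (or 16): the runtime reports the warp width through CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. A minimal sketch, assuming a valid `kernel` and `device` already exist (the factor of 4 is just an arbitrary starting point for tuning, not a recommendation from this thread):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Pick a local work size that is a multiple of the device's preferred
 * width (8 on Mali-G76, larger on some newer GPUs). Assumes `kernel`
 * and `device` were created/queried earlier. */
size_t choose_local_size(cl_kernel kernel, cl_device_id device)
{
    size_t preferred = 1;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred), &preferred, NULL);

    /* A few multiples of the warp width is a reasonable starting point;
     * tune from there with real measurements. */
    size_t local = preferred * 4;

    size_t max_wg = 1;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    if (local > max_wg)
        local = max_wg;

    printf("preferred multiple %zu, chosen local size %zu\n", preferred, local);
    return local;
}
```

The returned value would then be passed as the local_work_size argument to clEnqueueNDRangeKernel, with a global size padded up to a multiple of it.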

HTH,

Pete
