About mali-g76 MP12 GPU and micro-architecture

1) Does the GPU context switch between warps to hide memory access latency when the kernel has memory operations?

2) The datasheet gives a max thread count of 768. Is that 256 threads per execution engine?

As far as I know, there are 8 lanes (8-wide warp) per execution engine. How can it run 768 threads simultaneously? (With context switching? Or are there more lanes?)

I want to understand how threads are executed at the micro-architecture level.

3) If it can run 768 threads simultaneously and the work-group size is only 24, does it run 24 warps (8-wide warps * 3 engines) with the same work-group ID per core?

If the work-group size is 8, do the remaining lanes (24 lanes - 8 = 16 lanes) sit idle?

(On Nvidia, multiple warps from the same work-group run per SM.)

Please help me!

  • 1) Does the GPU context switch between warps to hide memory access latency when the kernel has memory operations?

    Massively multithreaded architectures such as GPU shader cores context switch all the time.

    BUT don't think of this as the heavyweight context switching a CPU does; that's simply not how this type of hardware works. The contexts for all threads are live in registers all the time, and the GPU can pick any active warp on any clock cycle.

    As far as I know, there are 8 lanes (8-wide warp) per execution engine. How can it run 768 threads simultaneously? (With context switching? Or are there more lanes?)

    The first point is don't confuse issue width (new operations issued per cycle) with execution capacity (total number of in-flight operations). Most operations have a latency of multiple cycles, so you can have more than one warp live in a single hardware unit at the same time due to pipeline length. Some of the fixed-function units (the texture unit, for example) have very long pipelines.

    The other point is that the purpose of having this many threads is to hide the latency of memory fetches. A significant portion of the 768 are likely to be idle waiting for data from memory; the others keep the hardware busy.
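
A toy model may help make the latency-hiding argument concrete. The sketch below is not a model of Mali hardware: the single issue slot, the fixed-priority warp pick, the 100-cycle memory latency, and the 4 ALU operations per load are all made-up assumptions chosen only to show the trend.

```python
def simulate(num_warps, mem_latency, alu_ops_per_load, cycles=10_000):
    """Toy model (hypothetical, not Mali-specific).

    One issue slot per cycle. Each warp repeats: `alu_ops_per_load`
    single-cycle ALU ops, then one load that stalls that warp for
    `mem_latency` cycles. All warp state stays resident, so switching
    warps costs nothing. Returns the fraction of cycles that issued work.
    """
    ready_at = [0] * num_warps                  # cycle at which each warp may issue
    ops_left = [alu_ops_per_load] * num_warps   # ALU ops before the next load
    issued = 0
    for cycle in range(cycles):
        for w in range(num_warps):              # pick the first ready warp
            if ready_at[w] <= cycle:
                issued += 1
                if ops_left[w] > 0:
                    ops_left[w] -= 1
                    ready_at[w] = cycle + 1
                else:
                    # Issue the load; this warp sleeps until the data returns.
                    ops_left[w] = alu_ops_per_load
                    ready_at[w] = cycle + mem_latency
                break
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 4, 16, 32):
        print(n, "warps ->", round(simulate(n, 100, 4), 3), "issue utilisation")
```

With a single warp the issue slot sits idle for most cycles, and utilisation rises as more warps are resident, which is the reason for keeping many threads in flight.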

    3) If it can run 768 threads simultaneously and the work-group size is only 24, does it run 24 warps (8-wide warps * 3 engines) with the same work-group ID per core?

    If your work-group size is 24 threads, you'll get 3 warps (24 / 8) per work-group, and 32 (768 / 24) different work-groups running in each core at the same time.

    If the work-group size is 8, do the remaining lanes (24 lanes - 8 = 16 lanes) sit idle?

    If the work-group size is 8, then you get 1 warp per work-group, and 96 (768 / 8) different work-groups running in each core at the same time. The hardware is three independent 8-wide units, so an 8-wide work-group is fine; you'll just issue three different work-groups in parallel.

    What you want to avoid is work-groups narrower than 8. A 4-wide work-group would only use half of the 8-wide hardware, so keep work-groups at least 8 wide (and at least 16 wide on newer hardware).
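
The work-group arithmetic above can be sketched as a small helper. The 8-wide warp and the 768-thread limit come from this thread; the function name, the assumption that a partial warp still occupies a full warp slot, and the assumption that nothing else (registers, local memory) limits occupancy are mine.

```python
import math

WARP_WIDTH = 8      # lanes per execution engine, as stated in the thread
MAX_THREADS = 768   # maximum resident threads per core, as stated in the thread


def occupancy(work_group_size, warp_width=WARP_WIDTH, max_threads=MAX_THREADS):
    """Return (warps per work-group, work-groups per core, lane utilisation).

    Assumption: a partially filled warp still reserves a full warp slot,
    and thread count is the only limit on residency.
    """
    warps_per_group = math.ceil(work_group_size / warp_width)
    threads_reserved = warps_per_group * warp_width
    groups_per_core = max_threads // threads_reserved
    lane_utilisation = work_group_size / threads_reserved
    return warps_per_group, groups_per_core, lane_utilisation


if __name__ == "__main__":
    for size in (24, 8, 4):
        print(size, occupancy(size))
```

Sizes 24 and 8 fill every lane of the warps they use, while size 4 leaves half of each warp's lanes empty.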

    HTH, 

    Pete

  • Thank you for your kind answer! (peter harris)

    I understand clearly now. It is a big help, and I am honored to have your answer :)

    I have one more question.

    I ran the Mali offline profiler.

    It reports the number of work registers and uniform registers the kernel uses.

    1) How many work registers are there per engine, or per core? (The datasheet says a maximum of 64 work registers per thread; if I use 64 registers Mali runs 384 threads, and if I use 32 registers Mali runs 768 threads. Is it right that performance drops if I use more than 32 registers?)

    2) There are 8 lanes (8 threads) per engine. So is it 8 * 64 registers = 512 registers per engine?

    3) What is a uniform register? (The count increases when I use a buffer.)

    4) How many uniform registers are there per engine, or per core? (Is it right that all threads in the engine share the uniform registers? At what number of uniform registers does performance drop?)

    I want to know the exact register file size, and when performance becomes bad.

    5) The micro-architecture has a quad manager, and Mali-G76 has 8 lanes. Does the quad manager pack threads in groups of 4 or 8?

    Please help me.

  • 1) How many work registers are there per engine, or per core? (The datasheet says a maximum of 64 work registers per thread; if I use 64 registers Mali runs 384 threads, and if I use 32 registers Mali runs 768 threads. Is it right that performance drops if I use more than 32 registers?)

    There are enough registers in the hardware for 32 * 768 to be allocated concurrently. If you use more than 32 then the allocation size for a thread doubles to 64, so we can only allocate half as many threads.

    This _might_ impact performance, but it depends on what you are doing and the ratio of the workloads through the different pipes, so it's not a simple "yes / no" answer, I'm afraid. In general, if you have a high percentage of texturing in a shader you need more threads, because the texture unit is relatively high latency, but for vertex and compute workloads you might not see much slowdown at all with 64 registers.
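
The step change in thread count can be written down directly. This is a minimal sketch of the rule as described in this thread (32 or fewer work registers gives 768 threads, more than 32 gives 384); the function name and the range check are illustrative assumptions.

```python
def max_threads_per_core(work_registers, full_thread_count=768):
    """Thread capacity of one shader core for a given work register count.

    Rule taken from the thread: a thread is allocated either 32 or 64
    work registers, and the 64-register allocation halves the thread count.
    """
    if not 1 <= work_registers <= 64:
        raise ValueError("work_registers must be between 1 and 64")
    if work_registers <= 32:
        return full_thread_count
    return full_thread_count // 2


if __name__ == "__main__":
    for regs in (16, 32, 33, 64):
        print(regs, "work registers ->", max_threads_per_core(regs), "threads")
```

Going from 32 to 33 registers costs as much thread capacity as going to 64, so a shader that reports just over 32 work registers is the case most worth trimming, if the lower thread count turns out to matter for that workload.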

    2) There are 8 lanes (8 threads) per engine. So is it 8 * 64 registers = 512 registers per engine?

    8192 (768 * 32 / 3) per engine, 24576 per shader core (768 * 32).
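
As a quick check of the register file arithmetic, using only the figures given in this thread (768 threads, 32 registers each, 3 engines per core). The byte size assumes 32-bit work registers, which is my assumption and is not stated here.

```python
THREADS_PER_CORE = 768   # from the thread
REGS_PER_THREAD = 32     # allocation size at full thread count, from the thread
ENGINES_PER_CORE = 3     # from the thread
BYTES_PER_REGISTER = 4   # ASSUMPTION: 32-bit work registers

regs_per_core = THREADS_PER_CORE * REGS_PER_THREAD     # 24576
regs_per_engine = regs_per_core // ENGINES_PER_CORE    # 8192
bytes_per_core = regs_per_core * BYTES_PER_REGISTER

# The halved thread count with 64 registers per thread uses the same storage.
regs_per_core_at_64 = (THREADS_PER_CORE // 2) * 64

if __name__ == "__main__":
    print("per core:", regs_per_core, "per engine:", regs_per_engine)
    print("bytes per core (assuming 32-bit registers):", bytes_per_core)
```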

    3) What is a uniform register? (The count increases when I use a buffer.)

    A read-only register for storing uniforms and other draw-time constants that the shader might need.

    4) How many uniform registers are there per engine, or per core?

    All threads in a draw share the uniform register file. It's a fixed size, so there isn't any step change as you see with work registers. The size is not publicly documented, but in general I'd say it's large enough you don't have to worry. 

    5) The micro-architecture has a quad manager, and Mali-G76 has 8 lanes. Does the quad manager pack threads in groups of 4 or 8?

    In the general case all GPUs fragment shade in blocks of 2x2 pixels, due to the need for derivatives. Hardware will pack out warps as needed to fill the available width. 

    HTH,
    Pete