About the Mali-G76 MP12 GPU and its micro-architecture

1) Does the GPU context switch between warps to hide memory access latency when a kernel performs memory operations?

2) The datasheet gives a maximum thread count of 768. Is it right that this is 256 threads per execution engine?

As far as I know, each execution engine has 8 lanes (8-wide warps). How can the GPU run 768 threads simultaneously? (By context switching, or are there more lanes?)

I want to understand how threads are executed at the micro-architecture level.

3) If it can run 768 threads simultaneously and the work-group size is only 24, does it run 24 warps (8-wide warps * 3 engines) with the same work-group ID per core?

If the work-group size is 8, do the remaining lanes (24 lanes - 8 = 16 lanes) sit idle?

(On Nvidia, multiple warps from the same work-group run per SM.)

Please help me!

  • Thank you for your kind answer! (Peter Harris)

    I understand clearly now. It is a big help, and I am honored to have your answer :)

    I have one more question.

    I ran the Mali offline profiler.

    It reports the number of work registers and uniform registers the kernel uses.

    1) How many work registers are there per engine, or per core? (According to the datasheet, the maximum is 64 work registers per thread. If I use 64 registers, Mali runs 384 threads, and if I use 32 registers, Mali runs 768 threads. Is it right that performance drops if I use more than 32 registers?)

    2) There are 8 lanes (8 threads) per engine. So is it 8 * 64 registers = 512 registers per engine?

    3) What is a uniform register? (The count increased when I used a buffer.)

    4) How many uniform registers are there per engine, or per core? (Is it right that all threads in the engine share the uniform registers? At what number of uniform registers does performance drop?)

    I want to know the exact register file size, and when performance becomes bad.

    5) The micro-architecture includes a quad manager, and Mali-G76 has 8 lanes. Does the quad manager pack threads in groups of 4 or 8?

    Please help me.

  • 1) How many work registers are there per engine, or per core? (According to the datasheet, the maximum is 64 work registers per thread. If I use 64 registers, Mali runs 384 threads, and if I use 32 registers, Mali runs 768 threads. Is it right that performance drops if I use more than 32 registers?)

    The hardware has enough registers to allocate 32 registers to each of 768 threads concurrently. If you use more than 32, the allocation size for a thread doubles to 64, so we can only allocate half as many threads (384).

    This _might_ impact performance, but it depends on what you are doing and the ratio of the workloads through the different pipes, so it's not a simple "yes / no" answer I'm afraid. In general, if you have a high percentage of texturing in a shader you need more threads, because the texture unit is relatively high latency, but for vertex and compute workloads you might not see much slowdown at all with 64 registers.
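    As a rough illustration of the step change described above, the allocation rule can be sketched in a few lines. The function name and defaults are hypothetical; the only inputs taken from this thread are the 768-thread maximum, the 64-register per-thread limit, and the 32/64 allocation sizes.

```python
def max_threads_per_core(work_registers, full_occupancy_threads=768):
    """Estimate resident threads per shader core for a given work register count.

    Assumption from the thread: a thread is allocated either 32 or 64 work
    registers, and the register file holds 32 registers for each of 768 threads.
    """
    if not 1 <= work_registers <= 64:
        raise ValueError("work_registers must be between 1 and 64")
    # The allocation rounds up to 32 or 64; there is no size in between.
    allocation = 32 if work_registers <= 32 else 64
    return full_occupancy_threads * 32 // allocation


for regs in (16, 32, 33, 64):
    print(regs, "work registers ->", max_threads_per_core(regs), "threads")
```

    Going from 32 to 33 work registers halves the thread count, while going from 33 to 64 changes nothing, so the figure to watch in the offline tool's report is whether the shader crosses 32.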

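    2) There are 8 lanes (8 threads) per engine. So is it 8 * 64 registers = 512 registers per engine?

    No. The register file is sized for all the threads that can be allocated concurrently, not just the 8 lanes executing at one time: 8192 per engine (768 * 32 / 3) and 24576 per shader core (768 * 32).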
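    The register file arithmetic can be checked with a short sketch. The constant names are made up for illustration; the values (768 threads per core, 32 registers per thread at full occupancy, 3 execution engines per core) come from this thread.

```python
# Figures quoted in this thread for one Mali-G76 shader core.
THREADS_PER_CORE = 768      # maximum concurrently allocated threads
REGS_PER_THREAD = 32        # work registers per thread at full occupancy
ENGINES_PER_CORE = 3        # execution engines in one shader core

# Total work registers needed to hold every resident thread.
regs_per_core = THREADS_PER_CORE * REGS_PER_THREAD

# The threads, and therefore the registers, are split evenly across engines.
threads_per_engine = THREADS_PER_CORE // ENGINES_PER_CORE
regs_per_engine = regs_per_core // ENGINES_PER_CORE

print("threads per engine:", threads_per_engine)
print("work registers per engine:", regs_per_engine)
print("work registers per core:", regs_per_core)
```

    The 8 * 64 = 512 estimate in the question counts only the threads executing in the lanes on a given cycle. Each engine also holds the registers of every other resident thread, which is what lets it switch to another warp while one waits on memory.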

    3) What is a uniform register? (The count increased when I used a buffer.)

    A read-only register for storing uniforms and other draw-time constants that the shader might need.

    4) How many uniform registers are there per engine, or per core?

    All threads in a draw share the uniform register file. It's a fixed size, so there isn't any step change like the one you see with work registers. The size is not publicly documented, but in general I'd say it's large enough that you don't have to worry.

    5) The micro-architecture includes a quad manager, and Mali-G76 has 8 lanes. Does the quad manager pack threads in groups of 4 or 8?

    In the general case all GPUs shade fragments in blocks of 2x2 pixels (quads), due to the need for derivatives. The hardware will pack quads into warps as needed to fill the available width.

    HTH,
    Pete