I'm actively optimizing my OpenCL programs for the Mali-G76.
Here are some points of confusion about the programmable shader core:
1. Can we expect the 8 threads in a warp to execute in parallel? I'm not certain about this because ARM's documentation says: "A warp is made up of multiples of quads. Quads are groups of four threads."
It seems the quad is the smallest parallel execution unit. If so, will the 2 quads execute sequentially?
2. The Bifrost optimization guide says: "Load and store operations are faster if all threads in a quad load from the same cache line."
For better performance, should we ensure that the threads in a quad (not in a warp) load from the same cache line?
3. Related to Question 1: each G76 shader core can perform at most 24 FP32 FMA instructions per clock.
Does this mean 24 threads (3 engines, 8 threads per engine) can execute 24 FMAs in parallel?
Thanks very much in advance to anyone who can answer my questions.
zengzeng.sun said: Can we expect the 8 threads in a warp to execute in parallel?
Yes.
zengzeng.sun said: For better performance, should we ensure that the threads in a quad (not in a warp) load from the same cache line?
Optimizing for access locality across the whole warp is recommended and the most future-proof optimization point. It's good practice for any GPU architecture, and many other GPUs have wider warps than Mali today.
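To make the locality point concrete, here is a minimal Python sketch that counts how many cache lines one warp's loads touch for coalesced versus strided access. The 64-byte line size and 16-wide warp are illustrative assumptions, not documented Mali figures:

```python
# Count distinct cache lines touched by one warp's loads.
# LINE_BYTES = 64 is an assumed cache-line size, not an official Mali figure.
LINE_BYTES = 64

def lines_touched(byte_addresses, line_bytes=LINE_BYTES):
    """Number of distinct cache lines covering the given byte addresses."""
    return len({addr // line_bytes for addr in byte_addresses})

WARP = 16  # hypothetical warp width, for illustration only

# Coalesced: thread i loads the i-th consecutive 4-byte float.
coalesced = [4 * i for i in range(WARP)]
# Strided: each thread's float sits 64 bytes from its neighbour's.
strided = [64 * i for i in range(WARP)]

print(lines_touched(coalesced))  # 1  -> the whole warp is served by one line
print(lines_touched(strided))    # 16 -> one line fetched per thread
```

The coalesced pattern needs a single line fetch for the entire warp, while the strided pattern fetches sixteen, which is the gap the optimization guide's advice is closing.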
zengzeng.sun said: Does this mean 24 threads (3 engines, 8 threads per engine) can execute 24 FMAs in parallel?
Yes.
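As a sanity check on that arithmetic, a small Python sketch; the clock frequency and core count below are placeholders chosen only to illustrate the calculation, not a specific G76 configuration:

```python
# Peak FP32 FMA throughput per shader core, from the figures above:
# 3 execution engines x 8 lanes = 24 FMAs issued per clock per core.
ENGINES_PER_CORE = 3
LANES_PER_ENGINE = 8

fmas_per_clock_per_core = ENGINES_PER_CORE * LANES_PER_ENGINE
print(fmas_per_clock_per_core)  # 24

# An FMA counts as 2 FLOPs. Clock and core count are hypothetical values;
# real G76 implementations ship in a range of configurations.
clock_hz = 720e6
num_cores = 12
peak_gflops = fmas_per_clock_per_core * 2 * clock_hz * num_cores / 1e9
print(round(peak_gflops, 2))  # 414.72
```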
Kind regards, Pete
Dear Peter,
Thanks so much for your answers.
I have some more questions:
1. Is it possible for a single execution engine to execute multiple warps in a fully parallel manner? For example, 32 threads in 2 warps (warp size 16) running in parallel on just one engine.
2. Can we expect more benefit if we ensure access locality across all parallel threads of different engines or different warps, instead of just the threads within a warp?
Hi Zengzeng,
zengzeng.sun said: 1. Is it possible for a single execution engine to execute multiple warps in a fully parallel manner?
Like any processor the functional units are deeply pipelined, and like any GPU there are a lot of warps live at the same time. There will be multiple warps being processed at the same time in different pipeline stages (in one engine, across engines, and across shader cores).
zengzeng.sun said: 2. Can we expect more benefit if we ensure access locality across all parallel threads of different engines or different warps, instead of just the threads within a warp?
Spatial and temporal locality is *always* a good thing for caches, irrespective of what processor you are using. Your data set will nearly always be bigger than your L1 (and often the L2) data cache, so good locality is essential to getting the best cache hit rate.
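A toy Python model of a tiny LRU cache illustrates why traversal order, and hence locality, dominates the hit rate once the data set outgrows the cache. The line size and cache capacity below are arbitrary illustrative choices, not Mali parameters:

```python
from collections import OrderedDict

def cache_misses(byte_addresses, line_bytes=64, capacity_lines=8):
    """Count misses in a tiny fully-associative LRU cache of whole lines."""
    cache = OrderedDict()
    misses = 0
    for addr in byte_addresses:
        line = addr // line_bytes
        if line in cache:
            cache.move_to_end(line)        # refresh LRU position on a hit
        else:
            misses += 1
            cache[line] = None
            if len(cache) > capacity_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses

N = 64  # 64x64 array of 4-byte floats, stored row-major in memory
row_major = [4 * (r * N + c) for r in range(N) for c in range(N)]
col_major = [4 * (r * N + c) for c in range(N) for r in range(N)]

print(cache_misses(row_major))  # 256  -> only compulsory misses
print(cache_misses(col_major))  # 4096 -> every access misses
```

With the cache-friendly (row-major) order every line fetched is fully consumed before eviction, so only the compulsory misses remain; the column-major order evicts every line before it is reused.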