I'm actively optimizing my OpenCL programs for the Mali-G76.
There are a few things about the programmable shader core that confuse me:
1. Can we expect the 8 threads in a warp to execute in parallel? I'm not certain about this, because Arm's documentation says "A warp is made up of multiples of quads. Quads are groups of four threads."
That suggests the quad is the smallest parallel execution unit. If so, will the 2 quads in a warp execute sequentially?
2. The Bifrost optimization guide says "Load and store operations are faster if all threads in a quad load from the same cache-line".
So for better performance, we should make sure that the threads in a quad (not in a warp) load from the same cache line, right?
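To make question 2 concrete, here is a small sketch of how I am reasoning about it. It is only a model, not anything taken from Arm's documentation: I am assuming a 64-byte cache line and that four consecutive work-item IDs land in the same quad, and `cache_lines_per_quad` is just a helper name I made up. It lists which cache lines each quad would touch for a given per-thread stride.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache line
QUAD_SIZE = 4          # a quad is a group of four threads (per Arm's doc)


def cache_lines_per_quad(num_threads, stride_bytes, elem_bytes=4):
    """For each quad, return the set of cache-line indices its threads touch.

    Assumes thread `tid` reads `elem_bytes` bytes starting at
    `tid * stride_bytes`, and that threads 4k..4k+3 form one quad.
    """
    result = []
    for quad_start in range(0, num_threads, QUAD_SIZE):
        lines = set()
        quad_end = min(quad_start + QUAD_SIZE, num_threads)
        for tid in range(quad_start, quad_end):
            first_byte = tid * stride_bytes
            last_byte = first_byte + elem_bytes - 1
            # Record every cache line the access overlaps.
            first_line = first_byte // CACHE_LINE_BYTES
            last_line = last_byte // CACHE_LINE_BYTES
            lines.update(range(first_line, last_line + 1))
        result.append(lines)
    return result


if __name__ == "__main__":
    # Unit-stride float loads: each quad stays within one cache line.
    print(cache_lines_per_quad(8, stride_bytes=4))
    # 64-byte stride: every thread in a quad hits a different cache line.
    print(cache_lines_per_quad(8, stride_bytes=64))
```

If my understanding is right, the first access pattern is the "fast" case the guide describes and the second is the slow one, regardless of what the other quad in the warp does.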
3. Related to question 1: on the G76, each shader core can perform at most 24 FP32 FMA instructions per clock cycle.
Does that mean 24 threads (3 engines with 8 threads per engine) can execute 24 FMAs in parallel?
Thanks very much in advance to anyone who can answer my questions.