I'm actively optimizing my OpenCL programs for the Mali-G76.
A few things about the programmable shader core confuse me:
1. Can we expect the 8 threads in a warp to execute in parallel? I'm not certain about this because ARM's documentation says "A warp is made up of multiples of quads. Quads are groups of four threads."
That makes it sound as if the quad is the smallest parallel execution unit. If so, do the 2 quads in a warp execute sequentially?
2. The Bifrost optimization guide says "Load and store operations are faster if all threads in a quad load from the same cache-line".
So for better performance, we should make sure that the threads in a quad (not in a warp) load from the same cache line, right?
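To make question 2 concrete, here is a minimal C sketch of the check I have in mind. It assumes a 64-byte cache line and that a quad is four consecutive work-item IDs starting at a multiple of four; both are my assumptions, not something I have confirmed in ARM's documentation. The helper names are hypothetical, and the check only looks at the first byte of each access, not its width.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* assumption: 64-byte cache line */
#define QUAD_SIZE 4u         /* from ARM's doc: a quad is four threads */

/* Byte address touched by work-item `gid` for a simple strided access. */
static uint64_t access_address(uint64_t base, uint64_t gid, uint64_t stride_bytes)
{
    return base + gid * stride_bytes;
}

/* True if the four threads of the quad starting at `first_gid` all start
 * their access inside the same cache line. Assumes quads are formed from
 * consecutive work-item IDs, which is a guess on my part. */
static bool quad_hits_one_line(uint64_t base, uint64_t first_gid, uint64_t stride_bytes)
{
    assert(first_gid % QUAD_SIZE == 0);
    uint64_t line = access_address(base, first_gid, stride_bytes) / CACHE_LINE_BYTES;
    for (uint64_t t = 1; t < QUAD_SIZE; ++t) {
        if (access_address(base, first_gid + t, stride_bytes) / CACHE_LINE_BYTES != line)
            return false;
    }
    return true;
}
```

Under these assumptions, a quad reading consecutive `float` values (4-byte stride) from a line-aligned base stays within one line, while a 64-byte stride puts each thread on a different line.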
3. Related to question 1: each G76 shader core can perform at most 24 FP32 FMA instructions per clock.
Does that mean 24 threads (3 engines with 8 threads per engine) can each execute one FMA in parallel, for 24 FMAs at once?
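The arithmetic behind question 3, as I understand it, is sketched below in C. The 3 engines and 8 lanes per engine are the figures from my question; the function names are hypothetical, and counting one FMA as two floating-point operations is the usual convention rather than something specific to this GPU. Core count and clock frequency are inputs because they depend on the particular SoC.

```c
#include <assert.h>

#define ENGINES_PER_CORE 3 /* execution engines per G76 shader core */
#define LANES_PER_ENGINE 8 /* threads (lanes) per engine, i.e. one warp */

/* FP32 FMAs one shader core could issue per clock if every lane is busy. */
static unsigned fma_per_clock_per_core(void)
{
    return ENGINES_PER_CORE * LANES_PER_ENGINE;
}

/* Theoretical peak FP32 GFLOPS, counting each FMA as 2 operations.
 * `cores` and `clock_mhz` are properties of the specific device. */
static double peak_gflops(unsigned cores, double clock_mhz)
{
    assert(clock_mhz >= 0.0);
    return cores * fma_per_clock_per_core() * 2.0 * clock_mhz / 1000.0;
}
```

This is only an upper bound. Whether the 8 lanes of a warp really issue in the same cycle is exactly what I am asking in question 1.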
Thanks very much in advance if anyone can answer my questions.