How should I understand the warp size and the execution engines of a shader core?

I'm actively optimizing my OpenCL programs for the Mali-G76.

Here are some points that confuse me about the programmable shader core:

1. I think the warp is a concept from SIMT: the threads in a warp are scheduled together and executed in parallel by an execution engine. As far as I know, the warp size for G76 is 8.

    Can we expect the 8 threads in a warp to execute in parallel? I'm not certain about this, because ARM's documentation says: "A warp is made up of multiples of quads. Quads are groups of four threads."

    It seems the quad is the smallest unit of parallel execution. If so, do the 2 quads of a warp execute sequentially?

2. The Bifrost optimization guide says: "Load and store operations are faster if all threads in a quad load from the same cache-line".

    So for better performance, we should make sure that the threads in a quad (not in a warp) load from the same cache line, right?

3. Related to question 1: on G76, each shader core can perform at most 24 FP32 FMA instructions per clock cycle.

    Does that mean 24 threads (3 engines with 8 threads per engine) can execute 24 FMAs in parallel?
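
To make the arithmetic in question 3 concrete, here is a small sketch of how I arrive at the number 24. It assumes that the figure is per clock cycle, that every lane of an engine retires one FMA per cycle, and that an FMA counts as 2 floating-point operations. The function names are my own and only for illustration.

```python
# Sketch of the per-core throughput arithmetic from question 3.
# Assumptions (mine, not confirmed by ARM's docs): each lane retires
# one FP32 FMA per clock cycle, and one FMA counts as 2 FLOPs.

ENGINES_PER_CORE = 3   # execution engines per shader core, as in question 3
LANES_PER_ENGINE = 8   # equal to the warp size of 8 from question 1


def fma_per_cycle(engines, lanes_per_engine):
    """FMAs issued per clock cycle if every lane of every engine is busy."""
    return engines * lanes_per_engine


def flops_per_cycle(engines, lanes_per_engine):
    """Same figure counted as floating-point operations (multiply + add)."""
    return 2 * fma_per_cycle(engines, lanes_per_engine)


if __name__ == "__main__":
    print(fma_per_cycle(ENGINES_PER_CORE, LANES_PER_ENGINE))
    print(flops_per_cycle(ENGINES_PER_CORE, LANES_PER_ENGINE))
```

If the two quads of a warp actually ran one after the other, I would expect the per-cycle figure to be half of this, which is why questions 1 and 3 are linked for me.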

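Going back to question 2, this is how I currently check whether a quad or a warp stays within one cache line. It is only a sketch: the 64-byte cache-line size and the idea that consecutive work-item IDs end up in the same quad are my assumptions, not something I found in the guide, and `cache_lines_touched` is a helper I made up.

```python
# Count how many cache lines a group of consecutive threads touches
# when thread i reads the address base + i * stride_bytes.
# Assumptions (mine): 64-byte cache lines, and consecutive work-item
# IDs are packed into the same quad / warp.

CACHE_LINE_BYTES = 64  # assumed, not taken from the quoted guide
QUAD_SIZE = 4          # "Quads are groups of four threads"
WARP_SIZE = 8          # warp size for G76 as stated in question 1


def cache_lines_touched(first_thread, group_size, stride_bytes, base=0):
    """Return the set of cache-line indices read by one group of threads."""
    return {
        (base + (first_thread + i) * stride_bytes) // CACHE_LINE_BYTES
        for i in range(group_size)
    }


if __name__ == "__main__":
    # One float4 (16 bytes) per thread, buffer aligned to a cache line.
    print(len(cache_lines_touched(0, QUAD_SIZE, 16)))  # first quad
    print(len(cache_lines_touched(4, QUAD_SIZE, 16)))  # second quad
    print(len(cache_lines_touched(0, WARP_SIZE, 16)))  # whole warp
```

Under these assumptions, with one `float4` per thread each quad stays within a single cache line, while the whole warp spans two. That is the case where the difference between "per quad" and "per warp" matters for my kernels.
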
Thanks very much in advance to anyone who can answer my questions.

