
How to understand warp size and execution engine for shader cores?

I'm actively optimizing my OpenCL programs for the Mali-G76.

Here are some points I'm confused about regarding the programmable shader core:

  1. I think the warp is a concept from SIMT: the threads in a warp are scheduled together and executed by an execution engine in parallel. As we know, the warp size for the G76 is 8.

   Can we expect that the 8 threads in a warp execute in parallel? I'm not certain about this, because Arm's documentation says: "A warp is made up of multiples of quads. Quads are groups of four threads."

   It seems the quad is the smallest parallel execution unit. If so, do the 2 quads of a warp execute sequentially?

2. The Bifrost optimization guide says: "Load and store operations are faster if all threads in a quad load from the same cache-line."

    For better performance, we should therefore ensure that the threads in a quad (not the whole warp) load from the same cache line, right? (See the kernel sketch after this list.)

3. Related to question 1: on the G76, each shader core can perform at most 24 FP32 FMA operations per clock.

    Does that mean 24 threads (3 engines × 8 threads per engine) can execute 24 FMAs in parallel? (A rough peak-throughput sketch follows after this list.)
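
To make question 2 concrete, here is a minimal kernel sketch of what quad-friendly addressing looks like. The kernel names and the assumption of a 64-byte cache line are mine, not from the guide: with stride-1 addressing the four work-items of a quad read adjacent floats and usually share one cache line, while a large stride makes each work-item pull a different line.

```c
// Quad-friendly sketch: work-item i reads in[i], so the four work-items of a
// quad (and the eight of a warp) touch consecutive floats that will normally
// fall in the same cache line (assumed 64 bytes here).
__kernel void scale_contiguous(__global const float *in,
                               __global float *out,
                               const float factor)
{
    const size_t i = get_global_id(0);
    out[i] = in[i] * factor;          // stride-1: one line serves the quad
}

// Counter-example: a large stride means every work-item in the quad hits a
// different cache line, so the same warp needs more cache transactions.
__kernel void scale_strided(__global const float *in,
                            __global float *out,
                            const float factor,
                            const int stride)
{
    const size_t i = get_global_id(0) * (size_t)stride;
    out[i] = in[i] * factor;          // quad-unfriendly when stride is large
}
```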
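
And for questions 1 and 3, a rough host-side sketch of how I would sanity-check the numbers (my own sketch, not from Arm's docs): CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE is usually reported as the warp width on Mali, and combining the 24-FMA-per-core figure quoted above with the core count and reported clock gives an estimated peak. Whether CL_DEVICE_MAX_CLOCK_FREQUENCY reflects the real shader clock is driver-dependent, so treat the result as an estimate only.

```c
#include <stdio.h>
#include <CL/cl.h>

/* Assumes `device` and a built `kernel` already exist; error handling omitted. */
static void print_warp_and_peak(cl_device_id device, cl_kernel kernel)
{
    size_t warp_width = 0;   /* preferred work-group multiple, typically the warp width */
    cl_uint cores = 0;       /* shader cores exposed as OpenCL compute units   */
    cl_uint clock_mhz = 0;   /* nominal clock reported by the driver, in MHz   */

    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(warp_width), &warp_width, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(cores), &cores, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY,
                    sizeof(clock_mhz), &clock_mhz, NULL);

    /* 24 FP32 FMAs per core per clock is the G76 figure quoted above;
     * one FMA counts as 2 FLOPs. */
    double peak_gflops = (double)cores * 24.0 * 2.0 * (double)clock_mhz * 1e-3;

    printf("warp width (preferred multiple): %zu\n", warp_width);
    printf("estimated peak FP32: %.1f GFLOPS\n", peak_gflops);
}
```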

Thanks very much in advance to anyone who can answer my questions.

  • Hi Zengzeng, 

    1. Is it possible that a single execution engine executes multiple warps in a fully parallel manner?

    Like any processor, the functional units are deeply pipelined, and like any GPU, there are a lot of warps live at the same time. Multiple warps will be in flight simultaneously in different pipeline stages (within one engine, across engines, and across shader cores).

    2. Can we expect more benefit if we ensure access locality across all parallel threads of different engines or different warps, instead of just the threads within a warp?

    Spatial and temporal locality is *always* a good thing for caches, irrespective of what processor you are using. Your data set will nearly always be bigger than your L1 (and often the L2) data cache, so good locality is essential to getting the best cache hit rate.

    Kind regards, 
    Pete