Mobile, Graphics, and Gaming forum How to understand warp size and execution engine for shader cores?


How to understand warp size and execution engine for shader cores?

zengzeng.sun over 6 years ago

I'm actively optimizing my OpenCL programs for G76.

A few things about the programmable shader core confuse me:

1. As I understand it, the warp is a concept from SIMT: the threads in a warp are scheduled together and executed in parallel by an execution engine. As we know, the warp size for G76 is 8.

Can we expect all 8 threads in a warp to execute in parallel? I'm not certain about this, because Arm's documentation says "A warp is made up of multiples of quads. Quads are groups of four threads."

That suggests the quad is the smallest parallel execution unit. If so, do the 2 quads in a warp execute sequentially?
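To make the grouping in this question concrete, here is a minimal Python sketch of how a linear thread ID would fall into warps and quads. It assumes, purely for illustration, that consecutive thread IDs are packed into the same warp and quad; the post does not state the actual hardware mapping. The names `WARP_SIZE`, `QUAD_SIZE`, and `warp_and_quad` are hypothetical.

```python
WARP_SIZE = 8  # warp size for G76, as stated in the question
QUAD_SIZE = 4  # "Quads are groups of four threads"


def warp_and_quad(thread_id):
    """Return (warp index, quad index within that warp) for a linear thread ID.

    Assumption: consecutive IDs share a warp and a quad. This is an
    illustration only, not a documented hardware mapping.
    """
    warp = thread_id // WARP_SIZE
    quad = (thread_id % WARP_SIZE) // QUAD_SIZE
    return warp, quad


# Show the grouping for the first two warps (16 threads).
for tid in range(16):
    print(tid, warp_and_quad(tid))
```

Under this assumed mapping, each warp of 8 threads holds exactly 2 quads, which are the two quads the question asks about.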

2. The Bifrost optimization guide says "Load and store operations are faster if all threads in a quad load from the same cache-line".

So for best performance, we should make sure that the threads in a quad (not in a warp) load from the same cache line, right?
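One rough way to reason about this offline is to compute which cache line each thread in a quad would touch for a given access stride. The sketch below is in Python and rests on two assumptions that are not in the post: a 64-byte cache line, and consecutive thread IDs forming a quad. `CACHE_LINE_BYTES`, `quad_addresses`, and `quad_shares_cache_line` are hypothetical helper names.

```python
CACHE_LINE_BYTES = 64  # assumed cache-line size; not given in the post
QUAD_SIZE = 4          # "Quads are groups of four threads"


def quad_addresses(first_thread, stride_bytes, base=0):
    """Byte addresses accessed by the 4 threads of a quad.

    Assumption: thread i accesses base + i * stride_bytes, and the quad
    is made of 4 consecutive thread IDs starting at first_thread.
    """
    return [base + (first_thread + i) * stride_bytes for i in range(QUAD_SIZE)]


def quad_shares_cache_line(byte_addresses):
    """True if every address falls within the same cache line."""
    lines = {addr // CACHE_LINE_BYTES for addr in byte_addresses}
    return len(lines) == 1


# 4-byte stride (e.g. consecutive floats): one cache line for the quad.
print(quad_shares_cache_line(quad_addresses(0, 4)))
# 64-byte stride: every thread lands on a different cache line.
print(quad_shares_cache_line(quad_addresses(0, 64)))
```

The same check could be run with a warp-sized group of 8 to compare the two readings of the guide.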

3. Related to question 1: on G76, each shader core can perform at most 24 FP32 FMA instructions.

Does that mean 24 threads (3 engines with 8 threads per engine) can execute 24 FMAs in parallel?
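The arithmetic behind this reading can be written out as a short Python sketch. It assumes each thread contributes one FP32 FMA at a time, which is the interpretation being asked about and not a confirmed fact; the constant and function names are hypothetical.

```python
ENGINES_PER_CORE = 3  # execution engines per shader core, from the question
THREADS_PER_WARP = 8  # warp size for G76, from the question


def fma_lanes_per_core():
    """FMA lanes per shader core if every thread of one warp per engine
    issues one FP32 FMA at the same time (assumed interpretation)."""
    return ENGINES_PER_CORE * THREADS_PER_WARP


print(fma_lanes_per_core())
```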

Thanks very much in advance if anyone can answer my questions.

Top replies