1. What is the relationship between "FP32 operations/clock" and thread count? My understanding is that FP32 operations/clock should equal the thread count. Does "FP32 operations/clock" refer to a single shader core? If so, multiplying it by the number of cores does match the thread count for some GPUs, but there are exceptions, such as the T760. Also, if this table is per shader core, then the thread count in it should be per core as well, so FP32 operations/clock * number of cores should not equal the thread count.
2. The thread count in the second table varies with the number of registers. Does that mean the number of registers used by the entire shader, or the number used by a specific instruction? How do we evaluate the impact of this number on our game? For example, the offline compiler reports 100+ registers for one of our shaders. What impact does that have on the thread count?
3. What is the relationship between thread count, FP32 operations/clock, and the processing capacity of the arithmetic processing units in the core? Details below:
Valhall: There are 2 FMA units in the shader core, each 16 wide, which gives 32 FP32 FMAs per clock. So why does the datasheet list 64 FP32 operations/clock?
Midgard: According to the document, the T760 has 2 A pipelines, each of which can process 4 FP32 values at the same time, that is, 8 FP32 in total. So why does the datasheet say 28, and why 34 in the document?
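To make the arithmetic in the Valhall question concrete, here is a minimal sketch of the calculation. The function name and the `ops_per_lane` parameter are hypothetical, introduced only for illustration. One possible reading, which nothing in this thread confirms, is that the datasheet counts each fused multiply-add as two operations (one multiply plus one add); under that assumption the two figures line up. The sketch does not attempt to explain the Midgard numbers.

```python
def fp32_ops_per_clock(pipelines: int, lanes_per_pipeline: int, ops_per_lane: int) -> int:
    """Peak FP32 operations per clock for one shader core (illustrative helper)."""
    return pipelines * lanes_per_pipeline * ops_per_lane


# Figures quoted in the question: 2 FMA units, each 16 lanes wide.
valhall_fma_per_clock = fp32_ops_per_clock(2, 16, ops_per_lane=1)

# ASSUMPTION (not confirmed here): one FMA is counted as two operations,
# a multiply and an add.
valhall_ops_per_clock = fp32_ops_per_clock(2, 16, ops_per_lane=2)

print("FMAs per clock:", valhall_fma_per_clock)
print("Operations per clock if FMA counts twice:", valhall_ops_per_clock)
```

If that counting convention is what the datasheet uses, the same factor of two would apply to any per-core FMA figure before comparing it with the table.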
Thank you very much for your reply.
So the thread count only indicates how many threads are in flight at the same time; they are not all actually computing. For example, some threads are idle, waiting for I/O to return.
And the “FP32/load/store/sample per clock” figures are what really determine the performance throughput?
Is my understanding correct?
Yes, exactly that.
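The point agreed on above can be sketched as a simple bottleneck model: if enough threads are in flight to hide latency, the slowest unit sets the cost. This is only an illustration, not a description of how any particular Mali core schedules work. The function name, the unit names, and every number below are hypothetical placeholders, not datasheet values.

```python
def estimated_cycles(work: dict, per_clock: dict) -> tuple:
    """Return (cycles, limiting_unit) assuming all units run fully in parallel."""
    # Cycles each unit would need if it were the only constraint.
    cycles = {unit: work[unit] / per_clock[unit] for unit in work}
    # The unit that needs the most cycles bounds the whole workload.
    limiter = max(cycles, key=cycles.get)
    return cycles[limiter], limiter


# HYPOTHETICAL per-core rates (operations per clock), for illustration only.
per_clock = {"fp32": 64, "load_store": 16, "sample": 4}

# HYPOTHETICAL total work issued by a shader workload.
work = {"fp32": 6400, "load_store": 800, "sample": 1000}

cycles, limiter = estimated_cycles(work, per_clock)
print(f"limited by {limiter}: about {cycles:.0f} cycles")
```

In this made-up case the texture sampling rate is the limit, so adding more threads would not help; only reducing samples or raising the sample rate would.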