
Confusion of thread count and FP32 operations

1.  What is the relationship between FP32 operations/clock and thread count? My understanding is that FP32 operations/clock should equal the thread count. Does "FP32 operations/clock" refer to a single shader core? If so, multiplying it by the number of cores does match the thread count for some GPUs, but there are exceptions, such as the T760. Also, if this table is per shader core, then the thread count in it should be per core as well, so FP32 operations/clock * NumOfCore = thread count should not hold.

2.  The thread count in the second table varies with the number of registers. Does the number of registers refer to the registers used by the entire shader or to the registers used by a specific instruction? How do we evaluate the impact of this number on our game? For example, the offline compiler reports 100+ registers for one of our shaders. What impact will that have on the thread count?

3. What is the relationship between thread count, FP32/clock, and the processing capacity of the arithmetic processing unit in the core? See the descriptions below for details.

Valhall: There are 2 FMAs in the shader core, each 16 wide, which gives 32 FP32 FMAs, so why is the FP32 operations/clock figure in the data sheet 64?

Midgard: The document shows that the T760 has 2 A pipelines, each of which can process 4 FP32 operations at the same time, for 8 FP32 in total, so why does the database say 28, and the document 34?

  • Hi Shawn, 

    1. What is the relationship between FP32 operations/clock and Thread count?    

    There isn't really a relationship here. The FP32 operations/clock is the width of the datapath for instruction issue in a single clock cycle. A design needs "enough" threads to fill the data path, but that's typically much lower than the total GPU thread count (more on that in the next section).

     2.  The thread count in the second table will vary according to the number of registers. Does the number of registers refer to the number of registers used by the entire shader or the number of registers used by a specific instruction? How do we evaluate the impact of this number on our game?  

    It's the number of work registers (i.e. excluding uniform registers) used by the whole shader program. The performance impact depends on the content. GPUs have lots of threads so we can hide the latency of data fetches from memory; we have far more threads than we can issue in any single clock cycle, and many threads will be sleeping waiting for memory fetches.

    Shaders that run with a reduced thread count due to higher register load will still have enough threads available to keep the data pipelines busy, but will have fewer "spare" threads for latency hiding. If the content gets a high number of cache misses, or has a high load-to-arithmetic ratio, then it is more likely that the shader core will hit idle cycles in which no thread is eligible to issue.
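The latency-hiding argument above can be illustrated with a toy model. This is only a sketch under loud assumptions: each thread is assumed to alternate a fixed number of arithmetic issue cycles with a fixed memory stall, and the thread counts and cycle figures are hypothetical, not taken from any Mali data sheet. The function name `pipeline_utilization` is made up for this example.

```python
def pipeline_utilization(resident_threads, arith_cycles, stall_cycles):
    """Fraction of cycles the arithmetic pipe has work to issue (toy model).

    Assumption: every thread repeats `arith_cycles` of issue followed by
    `stall_cycles` of waiting on memory, and threads interleave perfectly.
    """
    demand = resident_threads * arith_cycles / (arith_cycles + stall_cycles)
    return min(1.0, demand)


# Hypothetical numbers: halving the resident threads (e.g. because the
# shader uses more work registers) with two different memory stall lengths.
for stall in (300, 500):
    full = pipeline_utilization(64, arith_cycles=10, stall_cycles=stall)
    half = pipeline_utilization(32, arith_cycles=10, stall_cycles=stall)
    print(f"stall={stall}: full={full:.2f} half={half:.2f}")
```

In this model the shorter stall leaves the pipe fully busy at either thread count, while the longer stall only hurts the reduced thread count, which is the "fewer spare threads" effect described above.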

     3. What is the relationship between Thread count, FP32/clock and the processing capacity of the Arithmetic processing unit in the core?  

    There is no relationship between arithmetic processing capacity and thread count.

    FP32 slots per clock shows the total number of 32-bit operations (of any type) that can be issued to the arithmetic pipelines, so it is an approximate measure of the arithmetic performance. Not all pipelines can perform the same operations. For example, in Valhall there are two types of general-purpose arithmetic pipe (FMA and CVT), plus a special functions unit (SFU), each of which processes different types of instruction. Not all of these "ops" are interchangeable: you can only do 32 float FMAs, but you can also do 32 CVT operations in parallel.

    Valhall: There are 2 FMAs in the shader core, each of which is 16 wide, which is 32 FP32 FMA, but the FP32 operations/clock on the dataset is 64?

    In the data sheet we count this (16-wide FMA + 16-wide CVT) * 2 PU = 64 32-bit ops per clock.

    Midgard: It can be seen from the document that T760 has 2 A pipelines, each of which can process 4 FP32 at the same time, that is, 8 FP32 in general, so why is 28 written in the database? 34 in the document?

    The Midgard arithmetic pipeline is complex: one instruction can do multiple-component SIMD and scalar operations. We count "ops/cy" as 3 * 4-wide SIMD plus 2 * 1-wide scalar = 14 per pipe. As above, not all operations are equal. Only 8 of the 14 can be multiplies, for example.
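The two counting rules described in this reply can be written out as plain arithmetic. This is a minimal sketch: the function and parameter names are invented for illustration, and the only figures used are the ones quoted in this thread (16-wide FMA and CVT across 2 processing units for Valhall; 3 4-wide SIMD units plus 2 scalar units per Midgard pipe; 2 arithmetic pipes on the T760, as stated in the question).

```python
def valhall_ops_per_clock(fma_width=16, cvt_width=16, processing_units=2):
    """Data-sheet style count: (FMA lanes + CVT lanes) per PU, times PUs."""
    return (fma_width + cvt_width) * processing_units


def midgard_ops_per_pipe(simd_units=3, simd_width=4, scalar_units=2):
    """Data-sheet style count for one Midgard arithmetic pipe."""
    return simd_units * simd_width + scalar_units * 1


valhall_total = valhall_ops_per_clock()              # (16 + 16) * 2
valhall_fma_only = valhall_ops_per_clock(cvt_width=0)  # FMA lanes only
midgard_pipe = midgard_ops_per_pipe()                # 3 * 4 + 2 * 1
t760_core = midgard_pipe * 2                         # 2 pipes, per the question

print(valhall_total, valhall_fma_only, midgard_pipe, t760_core)
```

The FMA-only figure of 32 is what the question computed, and the 64 in the data sheet comes from also counting the CVT lanes. Likewise, 14 per pipe times the 2 pipes mentioned in the question gives the 28 quoted from the database.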

    Kind regards,  Pete

  • Thank you very much for your reply.

    So the thread count only indicates how much data is in flight at the same time; not all of those threads are actually computing. For example, some threads are idle, waiting for I/O to return.

    And the “fp32/load/store/sample per clock” figures are what really determine the performance throughput?

    Is my understanding correct?