Expected Increase in Throughput for Int8 vs FP32 Multiplication

What increase in throughput can I expect on my device from changing a sequence of

```
fmla  v1.4s, v1.4s, v1.4s
```

to

```
mla  v1.16b, v1.16b, v1.16b
```

?

My device consists of X3, A715 and A510 processors.

In profiling peakflops, I measured a ~2x increase in throughput. I would have expected a 4x increase.

Is there any matrix-multiplication-related instruction on Arm for which I can expect a 4x increase in throughput by using int8 data types (possibly with a widening accumulator)?

Parents
  • Thanks a lot for your reply.  

    Based on your suggestion, I read the optimization guides and can confirm that the observed throughput is as expected.

    The latency of both FMLA and MLA is four cycles on all three cores (X3, A510, A715). The throughput of MLA is half that of FMLA on the X3 and A715. Although the number of elements processed per instruction increases by a factor of 4, only half the resources are available, resulting in a 2x increase. In contrast, the throughput stays constant on the A510, yielding a 4x increase. Below is a summary of the throughput per core:

    ```
    |      | X3 | A510 | A715 |
    |------|----|------|------|
    | FMLA | 4  | 2    | 2    |
    | MLA  | 2  | 2    | 1    |
    ```

    To cross-reference the predictions from the hardware guides, I would like to understand which socket corresponds to which core type. So far, the sockets are "anonymous" to me, as I cannot relate a socket ID to its core name. Do you know of any way to do that?
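
The 2x and 4x figures in the reply above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the table entries are instructions issued per cycle and that both instructions operate on 128-bit vectors (4 FP32 lanes for `.4s`, 16 int8 lanes for `.16b`). The names `THROUGHPUT`, `LANES`, `elements_per_cycle` and `int8_speedup` are made up for illustration.

```python
# Throughput values copied from the table in the reply above
# (assumed to mean instructions issued per cycle).
THROUGHPUT = {
    "X3":   {"FMLA": 4, "MLA": 2},
    "A510": {"FMLA": 2, "MLA": 2},
    "A715": {"FMLA": 2, "MLA": 1},
}

# Lanes per 128-bit vector: .4s holds 4 FP32 values, .16b holds 16 int8 values.
LANES = {"FMLA": 4, "MLA": 16}


def elements_per_cycle(core, instr):
    """Multiply-accumulate elements processed per cycle for one instruction type."""
    return THROUGHPUT[core][instr] * LANES[instr]


def int8_speedup(core):
    """Ratio of int8 MLA elements per cycle to FP32 FMLA elements per cycle."""
    return elements_per_cycle(core, "MLA") / elements_per_cycle(core, "FMLA")


for core in THROUGHPUT:
    print(core, int8_speedup(core))
```

With the table values as given, the ratio is 2.0 for the X3 and A715 and 4.0 for the A510. Frequency differences between the clusters are ignored here.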

Children
  • Great to hear that you found an explanation for the throughput increase.

    I don't have an S9 tablet datasheet at hand, but there is a generic method to check the Arm CPU type.

    On Android or a Linux-like OS, you can run the command "cat /proc/cpuinfo". An example is shown further down.

    Check the CPU part number. Once you know the CPU type of each CPU ID, you can try to map it to the socket ID.

    • Cortex-X3 part number is 0xD4E.   
    • Cortex-A715 part number is 0xD4D.
    • Cortex-A510 part number is 0xD46.

    ```
    # cat /proc/cpuinfo
    processor : 0
    BogoMIPS : 26.00
    Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 bti
    CPU implementer : 0x41
    CPU architecture: 8
    CPU variant : 0x0
    CPU part : 0xd46
    CPU revision : 2

    processor : 1
    BogoMIPS : 26.00
    Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 bti
    CPU implementer : 0x41
    CPU architecture: 8
    CPU variant : 0x0
    CPU part : 0xd46
    CPU revision : 2
    ```
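
The lookup described above can be automated. The following is a minimal sketch, assuming the `/proc/cpuinfo` layout shown in the example output and only the three part numbers listed in this reply. The names `ARM_PARTS` and `cores_by_part` are hypothetical helpers, not part of any tool. It maps each `processor` index to a core name; relating that index to a profiler's socket ID is a separate step not covered here.

```python
# Part numbers listed in the reply above; valid for implementer 0x41 (Arm).
ARM_IMPLEMENTER = 0x41
ARM_PARTS = {
    0xD4E: "Cortex-X3",
    0xD4D: "Cortex-A715",
    0xD46: "Cortex-A510",
}


def cores_by_part(cpuinfo_text):
    """Map each 'processor' index to a core name via its 'CPU part' field."""
    names = {}
    cpu = None
    implementer = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            cpu, implementer = int(value), None
        elif key == "CPU implementer":
            implementer = int(value, 16)
        elif key == "CPU part" and cpu is not None:
            part = int(value, 16)
            known = ARM_PARTS.get(part) if implementer == ARM_IMPLEMENTER else None
            names[cpu] = known or f"unknown (part 0x{part:03x})"
    return names


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file and return the processor-to-core mapping."""
    with open(path) as f:
        return cores_by_part(f.read())
```

On the device, `read_cpuinfo()` returns a dictionary such as `{0: "Cortex-A510", 1: "Cortex-A510", ...}` for the two entries shown in the example output. Parts not in the table are reported as unknown rather than guessed.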