
Can any NEON instruction be dual-issued with vector long multiply-accumulate (SMLAL) on Cortex-A53 or Cortex-A55?

Anyone know if this is possible? I've tried several instruction mixes:

                                  A53     A55    (instructions/cycle)
smlal.int8 & 64-bit load (ld1)    0.97    1.72
smlal.int8 & 64-bit dup           0.97    0.96
smlal.int8 & 64-bit or            0.97    0.96
mla.int8   & 64-bit dup           1.79    1.74
mla.int16  & 64-bit dup           0.97    0.95

The Cortex-A55 software optimization guide has contradictory info: https://static.docs.arm.com/epm128372/20/arm_cortex_a55_software_optimization_guide_v2.pdf

In one place, it says dual issue = 01, which in my understanding means the instruction can only issue in slot 0, but allows another SIMD instruction to issue in slot 1.

In a second place, it says dual issue = 00, which means dual issue is prevented entirely. Does that mean it occupies both slots?

What's even more surprising is that even a simple SIMD OR can't execute in parallel with it. What I really want to do is dual-issue dup (of a specific lane) with smlal.

What's going on? Is smlal using both SIMD units?

One suggestion was that it could be a register-file bandwidth bottleneck:

https://stackoverflow.com/questions/34037900/can-cortex-a57-dual-issue-128-bit-neon-instructions

That makes sense: unlike almost every other instruction, smlal reads 3 operands and writes a result that's twice as wide as its inputs, while regular mla does 3 reads and 1 same-width write. It would be unreasonable to add a 3rd write port to the NEON register file just to support writing smlal's 128-bit result plus another 64-bit result from a 2nd instruction in the same cycle.
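The port arithmetic behind that guess can be made explicit. This is a sketch under an assumption, not anything from ARM documentation: suppose write bandwidth is counted in 64-bit port slots and only two are available per cycle.

```python
# Hypothetical write-port accounting for one issue cycle, assuming the
# NEON register file offers two 64-bit write ports per cycle (an assumption).
WRITE_PORTS = 2

def write_ports_needed(dest_bits_a, dest_bits_b):
    """64-bit write ports needed to retire both results in the same cycle."""
    return dest_bits_a // 64 + dest_bits_b // 64

# smlal (128-bit result) paired with a 64-bit dup: needs 3 slots, have 2.
print(write_ports_needed(128, 64))  # 3 -> exceeds WRITE_PORTS, no dual issue
# mla.int8 (64-bit result) paired with a 64-bit dup: fits.
print(write_ports_needed(64, 64))   # 2 -> fits, dual issue possible
```

Under that assumption, smlal plus anything that writes a register needs 3 write slots, which would explain why it pairs with nothing, while 64-bit mla + dup fits in 2 and does pair.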

Another explanation is that some instructions can't be dual-issued every cycle (e.g. ncnn's rule of thumb to "issue vector load every three fmla"):

https://github.com/Tencent/ncnn/wiki/arm-a53-a55-dual-issue
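A toy steady-state model shows what such a restriction does to throughput (a sketch, not from any ARM documentation): if a pairing is only allowed once every k cycles and every other cycle issues a single instruction, the achievable IPC falls between 1 and 2.

```python
# Toy IPC model: instruction B may pair with A only once every k cycles;
# all other cycles issue one instruction. In k cycles we retire
# 2 + (k - 1) = k + 1 instructions.
def ipc(k):
    return (k + 1) / k

print(ipc(1))  # pairing every cycle -> 2.0 IPC
print(ipc(3))  # e.g. a load pairing only every 3rd cycle -> ~1.33 IPC
```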

----------------------------------------benchmark program--------------------------------------------

// N is assumed to be defined elsewhere (total iteration count).
// data is only ever read, so its uninitialized contents don't matter,
// and neither do the stale values in the smlal accumulators.
uint8_t data[32] __attribute__((aligned(32)));
for (int i = 0; i < N; i += 16)
{
    asm volatile(
        "smlal v9.8h,  v0.8b,  v0.8b\n"
        "ld1   {v0.8b}, %0\n"
        "smlal v10.8h, v1.8b,  v1.8b\n"
        "ld1   {v1.8b}, %0\n"
        "smlal v11.8h, v2.8b,  v2.8b\n"
        "ld1   {v2.8b}, %0\n"
        "smlal v12.8h, v3.8b,  v3.8b\n"
        "ld1   {v3.8b}, %0\n"
        "smlal v13.8h, v4.8b,  v4.8b\n"
        "ld1   {v4.8b}, %0\n"
        "smlal v14.8h, v5.8b,  v5.8b\n"
        "ld1   {v5.8b}, %0\n"
        "smlal v15.8h, v6.8b,  v6.8b\n"
        "ld1   {v6.8b}, %0\n"
        "smlal v0.8h,  v7.8b,  v7.8b\n"
        "ld1   {v7.8b}, %0\n"
        "smlal v1.8h,  v8.8b,  v8.8b\n"
        "ld1   {v8.8b}, %0\n"
        "smlal v2.8h,  v9.8b,  v9.8b\n"
        "ld1   {v9.8b}, %0\n"
        "smlal v3.8h,  v10.8b, v10.8b\n"
        "ld1   {v10.8b}, %0\n"
        "smlal v4.8h,  v11.8b, v11.8b\n"
        "ld1   {v11.8b}, %0\n"
        "smlal v5.8h,  v12.8b, v12.8b\n"
        "ld1   {v12.8b}, %0\n"
        "smlal v6.8h,  v13.8b, v13.8b\n"
        "ld1   {v13.8b}, %0\n"
        "smlal v7.8h,  v14.8b, v14.8b\n"
        "ld1   {v14.8b}, %0\n"
        "smlal v8.8h,  v15.8b, v15.8b\n"
        "ld1   {v15.8b}, %0\n"
        : /* no outputs */
        : "Q"(data) /* "Q": base-register-only address, the only mode ld1 accepts */
        : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
          "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15");
}
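For reference, the instructions/cycle figures in the table are just the ratio of retired instructions to elapsed cycles (read from perf or the PMU). With the loop above, each iteration issues 16 smlal + 16 ld1; the counter values below are made up for illustration:

```python
# 32 instructions per loop iteration (16 smlal + 16 ld1).
INSNS_PER_ITER = 16 + 16

def loop_ipc(iterations, cycles):
    """Instructions per cycle for the benchmark loop; counts are measured inputs."""
    return iterations * INSNS_PER_ITER / cycles

# Hypothetical measurement: 1,000,000 iterations in 32,989,691 cycles.
print(round(loop_ipc(1_000_000, 32_989_691), 2))  # ~0.97, like the smlal rows
```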