Can any NEON instruction be dual-issued with vector long multiply-accumulate (SMLAL) on Cortex-A53 or Cortex-A55?

Does anyone know if this is possible? I've tried several instruction mixes (all figures are instructions/cycle):

Instruction mix                    A53     A55

smlal.int8 & 64-bit load (ld1)     0.97    1.72
smlal.int8 & 64-bit dup            0.97    0.96
smlal.int8 & 64-bit OR (orr)       0.97    0.96
mla.int8 & 64-bit dup              1.79    1.74
mla.int16 & 64-bit dup             0.97    0.95

The Cortex-A55 optimization guide has contradictory info: https://static.docs.arm.com/epm128372/20/arm_cortex_a55_software_optimization_guide_v2.pdf

In one place, it says dual issue = 01, which as I understand it means smlal can only issue in slot 0 but still allows another SIMD instruction to issue in slot 1.

In another place, it says dual issue = 00, which means it prevents dual issue. Does that mean it uses both slots?

What's even more surprising is that even a simple SIMD OR can't execute in parallel with smlal. What I really want to do is dual-issue dup (from a specific lane) with smlal.
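For readers unfamiliar with the pair in question, here is a scalar C sketch of what a dup from one lane followed by smlal on 8-bit data computes. As far as I know, the by-element form of smlal exists only for 16-bit and 32-bit elements, which is why a separate dup is needed for int8. The function name dup_lane_smlal_s8 is made up for illustration, and this models the arithmetic only, not the timing.

```c
#include <stdint.h>

/* Scalar model (hypothetical helper) of the instruction pair
 *     dup   vT.8b,   vB.b[lane]
 *     smlal vAcc.8h, vA.8b, vT.8b
 * acc[i] += (int16)a[i] * (int16)b[lane] for each of the 8 lanes. */
static void dup_lane_smlal_s8(int16_t acc[8], const int8_t a[8],
                              const int8_t b[8], unsigned lane)
{
    /* dup: broadcast one 8-bit lane of b to every lane of a temporary. */
    int8_t t = b[lane & 7u];

    /* smlal: widen both 8-bit inputs, multiply, and accumulate into
     * 16-bit lanes. The cast back to int16_t truncates to 16 bits on
     * common compilers, matching the wrap-around of the instruction. */
    for (int i = 0; i < 8; i++) {
        acc[i] = (int16_t)(acc[i] + (int16_t)a[i] * (int16_t)t);
    }
}
```

Each call stands for two instructions, so a kernel built from this pair only benefits if the two can issue in the same cycle.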

What's going on? Is smlal using both SIMD units?

One suggestion was that it could be a register file bandwidth bottleneck:

https://stackoverflow.com/questions/34037900/can-cortex-a57-dual-issue-128-bit-neon-instructions

That makes sense because, unlike almost every other instruction, smlal takes 3 operands and writes an output that's 2x as wide, while regular mla only does 3 reads and 1 write. It would be unreasonable to add a 3rd write port to the NEON register file just to support writing smlal's 128-bit output together with another 64-bit output from a 2nd instruction.
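The register-bandwidth argument can be written down as a small counting model. This is only a sketch of the hypothesis: the port counts (6 read, 2 write, in 64-bit units) are assumptions chosen for illustration and are not taken from any Arm document, and all names are hypothetical. Reads for smlal count the 128-bit accumulator as two 64-bit units.

```c
#include <stdbool.h>

/* Register-file traffic of one instruction, in 64-bit units. */
struct rf_demand {
    unsigned reads;
    unsigned writes;
};

/* Assumed port budget per cycle (illustrative, not documented). */
#define HYP_READ_PORTS  6u
#define HYP_WRITE_PORTS 2u

/* smlal Vd.8h, Vn.8b, Vm.8b: reads the 128-bit accumulator (2 units)
 * and two 64-bit sources, writes a 128-bit result (2 units). */
static const struct rf_demand SMLAL_8B = { 4u, 2u };

/* mla Vd.8b, Vn.8b, Vm.8b: reads three 64-bit registers, writes one. */
static const struct rf_demand MLA_8B = { 3u, 1u };

/* dup Vd.8b, Vn.b[i]: reads one register, writes one. */
static const struct rf_demand DUP_8B = { 1u, 1u };

/* Two instructions can pair in this model only if their combined
 * traffic fits within the given port budget. */
static bool can_pair(struct rf_demand a, struct rf_demand b,
                     unsigned read_ports, unsigned write_ports)
{
    return a.reads + b.reads <= read_ports &&
           a.writes + b.writes <= write_ports;
}
```

With these assumed budgets, can_pair(SMLAL_8B, DUP_8B, HYP_READ_PORTS, HYP_WRITE_PORTS) is false (3 writes needed) while can_pair(MLA_8B, DUP_8B, HYP_READ_PORTS, HYP_WRITE_PORTS) is true, which matches the smlal.int8 and mla.int8 rows. The same model would give mla.int16 the same demand as mla.int8, yet the table shows mla.int16 & dup not pairing, so register bandwidth alone may not be the whole story.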

Another explanation is that some instructions can't be dual-issued every cycle (e.g. "issue vector load every three fmla"):

https://github.com/Tencent/ncnn/wiki/arm-a53-a55-dual-issue

----------------------------------------benchmark program--------------------------------------------

uint8_t data[32] __attribute__((aligned(32)));
for (int i = 0; i < N; i += 16)
{
    /* 16 smlal interleaved with 16 64-bit loads per iteration. */
    asm volatile(
        "smlal v9.8h, v0.8b, v0.8b\n"
        "ld1 {v0.8b},%0\n"
        "smlal v10.8h, v1.8b, v1.8b\n"
        "ld1 {v1.8b},%0\n"
        "smlal v11.8h, v2.8b, v2.8b\n"
        "ld1 {v2.8b},%0\n"
        "smlal v12.8h, v3.8b, v3.8b\n"
        "ld1 {v3.8b},%0\n"
        "smlal v13.8h, v4.8b, v4.8b\n"
        "ld1 {v4.8b},%0\n"
        "smlal v14.8h, v5.8b, v5.8b\n"
        "ld1 {v5.8b},%0\n"
        "smlal v15.8h, v6.8b, v6.8b\n"
        "ld1 {v6.8b},%0\n"
        "smlal v0.8h, v7.8b, v7.8b\n"
        "ld1 {v7.8b},%0\n"
        "smlal v1.8h, v8.8b, v8.8b\n"
        "ld1 {v8.8b},%0\n"
        "smlal v2.8h, v9.8b, v9.8b\n"
        "ld1 {v9.8b},%0\n"
        "smlal v3.8h, v10.8b, v10.8b\n"
        "ld1 {v10.8b},%0\n"
        "smlal v4.8h, v11.8b, v11.8b\n"
        "ld1 {v11.8b},%0\n"
        "smlal v5.8h, v12.8b, v12.8b\n"
        "ld1 {v12.8b},%0\n"
        "smlal v6.8h, v13.8b, v13.8b\n"
        "ld1 {v13.8b},%0\n"
        "smlal v7.8h, v14.8b, v14.8b\n"
        "ld1 {v14.8b},%0\n"
        "smlal v8.8h, v15.8b, v15.8b\n"
        "ld1 {v15.8b},%0\n"
        : /* no outputs */
        /* "Q" = base register with no offset, the only form ld1 accepts */
        : "Q"(data)
        /* tell the compiler which vector registers the block overwrites */
        : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
          "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15");
}
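For anyone reproducing the numbers, instructions/cycle for this loop is just the instruction count divided by the elapsed cycles. The sketch below shows that arithmetic only. It assumes the cycle count comes from some external source (a PMU cycle counter or a profiler, for example), counts only the 32 instructions inside the asm block and ignores loop overhead, and uses a hypothetical helper name.

```c
#include <stdint.h>

/* The asm block above contains 16 smlal + 16 ld1 instructions. */
#define INSNS_PER_ITER 32u

/* Hypothetical helper: instructions per cycle for `iterations` passes
 * through the loop body, given a cycle count measured elsewhere.
 * Loop-control instructions (add, compare, branch) are not counted. */
static double instructions_per_cycle(uint64_t iterations, uint64_t cycles)
{
    if (cycles == 0) {
        return 0.0; /* avoid dividing by zero on a bad measurement */
    }
    return (double)(iterations * INSNS_PER_ITER) / (double)cycles;
}
```

A result near 1.0 means the pair is issuing one instruction per cycle, and a result approaching 2.0 means the two instructions are dual-issuing.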
