Does anyone know if this is possible? I've tried several instruction mixes:
instruction mix                   A53    A55    (instructions/cycle)
smlal.int8 & 64bit load (ld1)     0.97   1.72
smlal.int8 & 64bit dup            0.97   0.96
smlal.int8 & 64bit or             0.97   0.96
mla.int8   & 64bit dup            1.79   1.74
mla.int16  & 64bit dup            0.97   0.95
The Cortex A55 optimization guide has contradictory info: https://static.docs.arm.com/epm128372/20/arm_cortex_a55_software_optimization_guide_v2.pdf
In one place, it says dual issue = 01, which in my understanding means the instruction can only issue on slot 0, but still allows another SIMD instruction to issue on slot 1.
In a second place, it says dual issue = 00, which means it prevents dual issue. Does that mean it uses both slots?
What's even more surprising is that even a simple SIMD OR can't execute in parallel with smlal. What I really want to do is dual issue dup (specific lane) with smlal.
What's going on? Is smlal using both SIMD units?
One suggestion was that it could be a register file bandwidth bottleneck:
https://stackoverflow.com/questions/34037900/can-cortex-a57-dual-issue-128-bit-neon-instructions
That makes sense because, unlike almost every other instruction, smlal reads 3 operands and writes an output that's 2x as wide as its sources, while regular mla only does 3 reads and 1 write of the same width. It would be very unreasonable to add a 3rd write port to the NEON register file just to support writing smlal's 128-bit output and another 64-bit output from a 2nd instruction in the same cycle.
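To make the bandwidth argument concrete, here is a rough tally of how many register-file bits each instruction would read and write, and what a dual-issued pair would need in one cycle. The struct, the function names, and counting dup as a single 64-bit read are my own illustrative assumptions, not figures from the Arm guide; this only does the arithmetic from the operand widths.

```c
#include <stdio.h>

/* Register-file traffic of one NEON instruction, in bits (illustrative model). */
struct rf_traffic {
    int read_bits;
    int write_bits;
};

/* smlal Vd.8h, Vn.8b, Vm.8b: reads the 128-bit accumulator and two
 * 64-bit sources, then writes the 128-bit accumulator back. */
static struct rf_traffic smlal_8b(void) {
    struct rf_traffic t = { 128 + 64 + 64, 128 };
    return t;
}

/* mla Vd.8b, Vn.8b, Vm.8b: three 64-bit reads, one 64-bit write. */
static struct rf_traffic mla_8b(void) {
    struct rf_traffic t = { 64 + 64 + 64, 64 };
    return t;
}

/* dup Vd.8b, Vn.b[i]: ASSUMPTION: counted as one 64-bit read. */
static struct rf_traffic dup_8b(void) {
    struct rf_traffic t = { 64, 64 };
    return t;
}

/* Combined traffic if two instructions issued in the same cycle. */
static struct rf_traffic pair(struct rf_traffic a, struct rf_traffic b) {
    struct rf_traffic t = { a.read_bits + b.read_bits,
                            a.write_bits + b.write_bits };
    return t;
}

/* Hypothetical helper: print the totals for the two pairings of interest. */
static void print_pairs(void) {
    struct rf_traffic s = pair(smlal_8b(), dup_8b());
    struct rf_traffic m = pair(mla_8b(), dup_8b());
    printf("smlal+dup: read %d, write %d bits\n", s.read_bits, s.write_bits);
    printf("mla+dup:   read %d, write %d bits\n", m.read_bits, m.write_bits);
}
```

Under this count, smlal + dup needs 192 bits of write bandwidth in one cycle versus 128 bits for mla + dup. Whether 128 bits is the real limit on the A53/A55 is unconfirmed, and this model alone would not explain the mla.int16 result, since a 64-bit mla.int16 has the same traffic as mla.int8.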
Another explanation is that some instructions can't be dual issued every cycle (e.g. "issue vector load every three fmla"):
https://github.com/Tencent/ncnn/wiki/arm-a53-a55-dual-issue
----------------------------------------benchmark program--------------------------------------------
uint8_t data[32] __attribute__((aligned(32)));

for (int i = 0; i < N; i += 16) {
    /* Alternate smlal with a 64-bit ld1 so each pair is a dual-issue candidate. */
    asm volatile(
        "smlal v9.8h, v0.8b, v0.8b\n"
        "ld1 {v0.8b},%0\n"
        "smlal v10.8h, v1.8b, v1.8b\n"
        "ld1 {v1.8b},%0\n"
        "smlal v11.8h, v2.8b, v2.8b\n"
        "ld1 {v2.8b},%0\n"
        "smlal v12.8h, v3.8b, v3.8b\n"
        "ld1 {v3.8b},%0\n"
        "smlal v13.8h, v4.8b, v4.8b\n"
        "ld1 {v4.8b},%0\n"
        "smlal v14.8h, v5.8b, v5.8b\n"
        "ld1 {v5.8b},%0\n"
        "smlal v15.8h, v6.8b, v6.8b\n"
        "ld1 {v6.8b},%0\n"
        "smlal v0.8h, v7.8b, v7.8b\n"
        "ld1 {v7.8b},%0\n"
        "smlal v1.8h, v8.8b, v8.8b\n"
        "ld1 {v8.8b},%0\n"
        "smlal v2.8h, v9.8b, v9.8b\n"
        "ld1 {v9.8b},%0\n"
        "smlal v3.8h, v10.8b, v10.8b\n"
        "ld1 {v10.8b},%0\n"
        "smlal v4.8h, v11.8b, v11.8b\n"
        "ld1 {v11.8b},%0\n"
        "smlal v5.8h, v12.8b, v12.8b\n"
        "ld1 {v12.8b},%0\n"
        "smlal v6.8h, v13.8b, v13.8b\n"
        "ld1 {v13.8b},%0\n"
        "smlal v7.8h, v14.8b, v14.8b\n"
        "ld1 {v14.8b},%0\n"
        "smlal v8.8h, v15.8b, v15.8b\n"
        "ld1 {v15.8b},%0\n"
        :
        : "m"(data)
        /* v0-v15 are overwritten above, so declare them as clobbered. */
        : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
          "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15");
}
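For anyone reproducing the numbers: one possible way to turn a timed run of the loop into an instructions/cycle figure is sketched below. This is an assumption about methodology, not a description of how the table was produced. The helper names are hypothetical, and it assumes the core is pinned to a known, fixed clock frequency (a hardware cycle counter would be more precise where it is accessible).

```c
#include <stdint.h>

/* Instructions executed by a loop whose body holds insns_per_iter
 * instructions of interest (loop overhead is ignored). */
static uint64_t loop_instructions(uint64_t iterations, uint64_t insns_per_iter) {
    return iterations * insns_per_iter;
}

/* Instructions per cycle from a wall-clock measurement.
 * ASSUMPTION: the core ran at core_hz for the whole measurement. */
static double ipc_from_time(uint64_t instructions, double seconds, double core_hz) {
    double cycles = seconds * core_hz;
    return cycles > 0.0 ? (double)instructions / cycles : 0.0;
}
```

With the 32 instructions in the asm body, a result close to 2 would mean each smlal/ld1 pair dual issued, and a result close to 1 would mean they issued one at a time.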