These are the per-instruction timings I measured with MegPeak. The time for the stp instruction differs greatly from the official documentation (Arm_Cortex-A76_Software_Optimization_Guide).

bandwidth: 19.067337 Gbps
ldd throughput: 0.221717 ns 3547478.000000 runs 16000000
ldq throughput: 0.221717 ns 3547478.000000 runs 16000000
stq throughput: 1.327807 ns 21244908.000000 runs 16000000
ldpq throughput: 0.442687 ns 5666398.000000 runs 12800000
lddx2 throughput: 0.442888 ns 7086206.000000 runs 16000000
ld1q throughput: 0.221316 ns 3541061.000000 runs 16000000
eor throughput: 0.221280 ns 3540478.000000 runs 16000000
fmla throughput: 0.221590 ns 3545436.000000 runs 16000000
fmlad throughput: 0.221480 ns 3543687.000000 runs 16000000
fmla_x2 throughput: 0.475682 ns 7610905.000000 runs 16000000
mla throughput: 0.884828 ns 14157246.000000 runs 16000000
fmul throughput: 0.221298 ns 3540769.000000 runs 16000000
mul throughput: 0.884700 ns 14155203.000000 runs 16000000
addp throughput: 0.221262 ns 3540187.000000 runs 16000000
sadalp throughput: 0.442833 ns 7085331.000000 runs 16000000
add throughput: 0.221262 ns 3540186.000000 runs 16000000
fadd throughput: 0.221590 ns 3545436.000000 runs 16000000
smull throughput: 0.442432 ns 7078915.000000 runs 16000000
smlal_4b throughput: 0.442724 ns 7083581.000000 runs 16000000
smlal_8b throughput: 0.442851 ns 7085622.000000 runs 16000000
dupd_lane_s8 throughput: 0.221280 ns 3540478.000000 runs 16000000
mlaq_lane_s16 throughput: 0.885192 ns 10622309.000000 runs 12000000
sshll throughput: 0.442706 ns 7083289.000000 runs 16000000
tbl throughput: 0.221262 ns 3540187.000000 runs 16000000
ins throughput: 0.442651 ns 7082415.000000 runs 16000000
sqrdmulh throughput: 0.884609 ns 14153745.000000 runs 16000000
usubl throughput: 0.221207 ns 3539311.000000 runs 16000000
abs throughput: 0.221553 ns 3544853.000000 runs 16000000
fcvtzs throughput: 0.885320 ns 14165121.000000 runs 16000000
scvtf throughput: 0.884828 ns 14157246.000000 runs 16000000
fcvtns throughput: 0.884810 ns 14156954.000000 runs 16000000
fcvtms throughput: 0.884773 ns 14156371.000000 runs 16000000
fcvtps throughput: 0.885265 ns 14164246.000000 runs 16000000
fcvtas throughput: 0.884427 ns 14150829.000000 runs 16000000
fcvtn throughput: 0.884554 ns 14152871.000000 runs 16000000
fcvtl throughput: 0.884974 ns 14159579.000000 runs 16000000
ins_ldd throughput: 0.442824 ns 5668148.000000 runs 12800000
ldq_fmlaq throughput: 0.232800 ns 3724808.000000 runs 16000000
ldd_fmlaq_sep throughput: 0.249211 ns 3189901.000000 runs 12800000
ldd_fmlaq_lane_sep throughput: 0.243519 ns 3896305.000000 runs 16000000
ldd_ldx_ins_fmlaq_lane_sep throughput: 0.364600 ns 4666874.000000 runs 12800000
ins_fmlaq_lane_1_4_sep throughput: 0.381871 ns 4887954.000000 runs 12800000
ldd_fmlaq_lane_1_4_sep throughput: 0.221891 ns 2840199.000000 runs 12800000
ins_fmlaq_lane_sep throughput: 1.089538 ns 17432604.000000 runs 16000000
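
To make the outliers easier to spot, here is a small sketch that parses result entries of the form shown above and expresses each one relative to a baseline instruction. The helper names `parse_results` and `relative_cost` are hypothetical and not part of MegPeak. The field meanings are my assumption: the first number is ns per instruction, and the second looks like the total elapsed time, since dividing it by the run count reproduces the first number. The choice of fmla as the baseline is arbitrary, and only four entries are included as sample input.

```python
import re

# One result entry: "<name> throughput: <ns> ns <total> runs <count>".
# Field meanings are inferred from the numbers, not taken from MegPeak docs.
ENTRY = re.compile(r"([A-Za-z_]\w*) throughput: ([\d.]+) ns ([\d.]+) runs (\d+)")


def parse_results(text):
    """Return one dict per entry found in the pasted log text."""
    rows = []
    for name, ns_per_inst, total, runs in ENTRY.findall(text):
        rows.append({
            "name": name,
            "ns": float(ns_per_inst),   # assumed: time per instruction
            "total": float(total),      # assumed: total elapsed time
            "runs": int(runs),
        })
    return rows


def relative_cost(rows, baseline="fmla"):
    """Express each entry's time as a multiple of the baseline entry."""
    base = next(r["ns"] for r in rows if r["name"] == baseline)
    return {r["name"]: r["ns"] / base for r in rows}


# A few entries copied from the log above.
SAMPLE = """\
ldq throughput: 0.221717 ns 3547478.000000 runs 16000000
stq throughput: 1.327807 ns 21244908.000000 runs 16000000
fmla throughput: 0.221590 ns 3545436.000000 runs 16000000
mla throughput: 0.884828 ns 14157246.000000 runs 16000000
"""

if __name__ == "__main__":
    rows = parse_results(SAMPLE)
    ratios = relative_cost(rows)
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name:6s} {ratio:5.2f}x fmla")
```

On these four entries stq comes out at roughly 6 times the fmla time and mla at roughly 4 times, which is the kind of gap worth checking against the optimization guide.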
MegPeak: github.com/.../MegPeak