=======================================
For 4x4 matrix multiplication, hand-written NEON code is slower than plain code compiled with
the auto-vectorization option. (Xilinx Zynq 702 EVM board, Cortex-A9, with GCC compiler options
-mfloat-abi=softfp -mfpu=neon-fp16 -ftree…