Performance: single-threaded runs look OK, but multi-threaded runs (8 threads) are poor compared to linux64.
The arm machine has 64 cores, and I have verified that 8 threads are started. So why is performance so much slower?
The same flow runs roughly twice as slow on arm64 as it does with Intel MKL. I am wondering whether it is a usage or linking issue.
The linking step is probably not the issue. If the link had failed, you would not have been able to produce a final binary.
By the way, isn't Intel MKL a specific library for x86 architecture (with SSE and AVX SIMD instruction set extensions)? How do you run the engine written specifically with Intel MKL on an arm64 machine?
Intel MKL is indeed based on the x86 architecture. I am using the same test input and have replaced the MKL functions with their arm64 equivalents.
The single-threaded result matches the MKL one, while the multi-threaded result is poor on arm64 (roughly twice as slow as with Intel MKL). I am sure that multiple threads are active on arm64. So why is performance poor?
Please share the following information so we can make progress:
Could you share details of the arm and x86 machines that run your code? (machine model, CPU model, RAM size, CPU and RAM frequencies, etc.)
Would it be possible to provide the actual numbers from your experiment?
Also, would it be possible to narrow down the code sections that perform worse on the arm machine?
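For the last two questions, here is a minimal sketch of how per-section timings could be collected on both machines. It assumes the engine can call a section of work from plain C++ threads. The names `time_section` and `speedup` are hypothetical helpers, not part of MKL or any arm64 library, and the work passed in is a stand-in for whatever section you want to isolate.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: runs work(thread_index) on num_threads threads and
// returns the elapsed wall-clock time in seconds, including thread start/join.
double time_section(std::size_t num_threads,
                    const std::function<void(std::size_t)>& work) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    pool.reserve(num_threads);
    for (std::size_t t = 0; t < num_threads; ++t) {
        pool.emplace_back(work, t);  // each thread receives its own index
    }
    for (auto& th : pool) {
        th.join();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical helper: ratio of single-thread time to multi-thread time
// for the same total amount of work.
double speedup(double t_single, double t_multi) {
    return t_single / t_multi;
}
```

Running the same section with 1 thread and with 8 threads on each machine, then posting the four timings and the two speedup ratios, would show whether the slowdown comes from one section scaling badly on arm64 or from the whole flow.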