Performance: single-threaded runs look OK, but multi-threaded runs (8 threads) are poor compared to linux64.
The Arm machine has 64 cores, and I have verified that 8 threads are started. Why is performance so much slower?
It depends on many things, and there is not enough context here for us to work with.
Two possibilities, though: 1) With more threads and more cores in use, the temperature could rise sharply, and the thermal management driver could then kick in and throttle the CPUs.
2) The workload uses a shared memory space with global variables and suffers from false sharing. A single-threaded version causes no memory thrashing, while a multithreaded version may thrash memory on every access (cache invalidation, cleaning, and fetching back and forth between cores).
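To make item 2 concrete, here is a minimal sketch of false sharing in C with POSIX threads. It is not taken from the engine in question: the names `run_counters` and `counter_sum`, the thread count of 8, and the 64-byte cache line are all assumptions for illustration (some Arm cores use a different line size). Each thread increments only its own counter, so the result is correct in both layouts; only the memory layout differs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

#define NTHREADS 8      /* hypothetical: matches the 8 threads in the question */
#define CACHE_LINE 64   /* assumed line size; check the actual value for your CPU */

/* Packed layout: the 8-byte counters are adjacent, so several share a cache line. */
static volatile uint64_t packed[NTHREADS];

/* Padded layout: the alignment forces each counter onto its own cache line. */
struct padded_counter {
    _Alignas(CACHE_LINE) volatile uint64_t value;
};
static struct padded_counter padded[NTHREADS];

static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "each padded counter should fill exactly one cache line");

struct job {
    volatile uint64_t *counter;  /* the one counter this thread owns */
    uint64_t iters;
};

static void *worker(void *arg)
{
    struct job *j = arg;
    /* volatile forces a real load and store on every iteration */
    for (uint64_t i = 0; i < j->iters; i++)
        (*j->counter)++;
    return NULL;
}

/* Runs NTHREADS workers and returns elapsed wall-clock seconds,
 * or -1.0 if a thread could not be created. */
double run_counters(int use_padding, uint64_t iters)
{
    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    struct timespec t0, t1;
    int started = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREADS; i++) {
        jobs[i].counter = use_padding ? &padded[i].value : &packed[i];
        *jobs[i].counter = 0;
        jobs[i].iters = iters;
        if (pthread_create(&tid[i], NULL, worker, &jobs[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (started != NTHREADS)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Sum of all counters in the chosen layout, to check nothing was lost. */
uint64_t counter_sum(int use_padding)
{
    uint64_t sum = 0;
    for (int i = 0; i < NTHREADS; i++)
        sum += use_padding ? padded[i].value : packed[i];
    return sum;
}
```

Build with `-pthread`, call `run_counters(0, n)` and `run_counters(1, n)` from `main` with a large `n`, and compare the two times. If the packed layout is clearly slower on the Arm machine, false sharing is a plausible suspect; if both times are similar, look elsewhere.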
Thank you for replying.
As for item 2 that you mentioned, our flow works well with the Intel MKL library, while the same engine performs poorly with the Arm library in multi-threaded runs.
I am wondering what the difference is between libarmpl.a and libarmpl_mp.a. Is anything special required when linking libarmpl_mp.a into a user's engine?
I'm not an expert on this library, but I guess the libarmpl_mp.a version is for thread safety of some functions (the same idea as the *_r functions in the C library). Please refer to developer.arm.com/.../Access-Arm-Performance-Libraries to verify how to link properly.
How "poor" or "much slower" are we talking about? Would it be possible to provide comparative numbers from your experiment? Any details of the machines that run your code? Would it be possible to narrow down the code sections that do not seem to reach their full potential on the Arm machine?
The same flow runs one time slower on arm64 than with Intel MKL. I am wondering whether it's a usage or linking issue.
The linking step is probably not the issue. If you had not linked properly, you would not be able to produce a final binary.
By the way, isn't Intel MKL a library specific to the x86 architecture (with the SSE and AVX SIMD instruction set extensions)? How do you run an engine written specifically for Intel MKL on an arm64 machine?
Intel MKL is based on the x86 architecture. I am using the same test input and replacing the MKL functions with their arm64 counterparts.
The single-threaded result is the same as the MKL one, while multi-threaded runs are poor on arm64 (one time slower than with Intel MKL). I am sure that multiple threads are active on arm64. But why is the performance poor?
Please share the following information so that we can make progress:
Could you share details of the Arm and x86 machines that run your code (machine model, CPU model, RAM size, CPU and RAM frequencies, etc.)?
Would it be possible to provide real numbers from your experiment?
Also, would it be possible to narrow down the code sections that perform worse on the Arm machine?
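One way to collect such numbers is a small timing harness like the sketch below. It is only an illustration under stated assumptions: `time_section`, `speedup`, `parallel_efficiency`, and `example_section` are hypothetical names, and `example_section` is a placeholder for whichever part of the engine you want to measure. The idea is to time the same section with 1 thread and with N threads on each machine and compare the ratios.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Best-of-reps wall-clock time of fn(arg), in seconds.
 * Taking the minimum reduces noise from other processes. */
double time_section(void (*fn)(void *), void *arg, int reps)
{
    assert(reps > 0);
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        double t0 = now_seconds();
        fn(arg);
        double dt = now_seconds() - t0;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}

/* Speedup of the n-thread run over the 1-thread run. */
double speedup(double t1, double tn)
{
    return t1 / tn;
}

/* Fraction of ideal scaling achieved: 1.0 means perfect scaling. */
double parallel_efficiency(double t1, double tn, int nthreads)
{
    return t1 / (tn * (double)nthreads);
}

/* Hypothetical stand-in for a section of the real engine:
 * arg points to a long holding the loop count. */
void example_section(void *arg)
{
    volatile double acc = 0.0;
    long n = *(long *)arg;
    for (long i = 0; i < n; i++)
        acc += (double)i * 0.5;
}

/* Prints one line of results for a section measured at 1 and n threads. */
void report(const char *label, double t1, double tn, int nthreads)
{
    printf("%s: 1 thread %.6f s, %d threads %.6f s, "
           "speedup %.2f, efficiency %.2f\n",
           label, t1, nthreads, tn,
           speedup(t1, tn), parallel_efficiency(t1, tn, nthreads));
}
```

Wrap each suspect section in a function with this signature, measure it once with the library limited to one thread and once with 8, and post the `report` lines for both the Arm and x86 machines. Sections whose efficiency drops much further on Arm than on x86 are the ones worth investigating first.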