Hi all,

I am using the HiKey970 board to run neural network inference. The board comprises ARM Cortex-A73 and ARM Cortex-A53 cores. I am using `taskset` to pin the inference process (which spawns 4 threads) once on the LITTLE cores (0-3) and once on the big cores (4-7). Contrary to what I was expecting, the inference time is almost double when running on the big cores compared to the LITTLE cores.

Is there an explanation for this behavior? Are there tools that can help me understand why the threads are slower on the big cores?

To be more precise, the board is flashed with kernel version 4.9.78-147538-g244928755bbe, and the code that I am using can be found in this repo.
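Concretely, I run the same executable twice, along these lines (`./inference_app` here is just a placeholder for the actual benchmark binary from the repo):

```
# Pin the process (and its 4 threads) to the LITTLE cluster (Cortex-A53)
taskset -c 0-3 ./inference_app

# Pin the same process to the big cluster (Cortex-A73)
taskset -c 4-7 ./inference_app
```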
Please make sure that DVFS is disabled and that the CPU frequencies are set to a fixed value, e.g. with:
```
# Take the LITTLE cores offline
for i in 0 1 2 3; do
    echo 0 > /sys/devices/system/cpu/cpu$i/online
done

# Fix the big cores at 1805 MHz
for i in 4 5 6 7; do
    echo 1805000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_max_freq
    echo 1805000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_min_freq
done
```
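It may also help to pin the governor and confirm that the frequency actually sticks (sysfs paths as on mainline cpufreq; adjust if your BSP differs or the governor is not built in):

```
# Optionally select a governor that will not rescale the clock
for i in 4 5 6 7; do
    echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor
done

# Verify the current frequency
cat /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq
```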
Thank you for your response. I tried the above and I observe the same results. I unplugged the cores that were not used and set the frequency to a constant value.

Would you suggest another strategy to get a clear view of why I am getting this unexpected performance? The executable I am running is the same; I just use `taskset` each time to specify the cores I want to run on.

May I ask, is there a benefit in unplugging the cores I am not using, or is that an alternative approach to using `taskset`? I wasn't setting `/sys/devices/system/cpu/cpu{0-3,4-7}/online` to 0 before.
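For what it's worth, I also double-check that the affinity mask really takes effect while the benchmark runs (`inference_app` again stands in for the actual binary name):

```
# Show the affinity mask the kernel actually applies to the process
taskset -p $(pidof inference_app)

# The same information from procfs
grep Cpus_allowed_list /proc/$(pidof inference_app)/status
```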
Hi zois,
Is it possible that your inference tasks in fact run on the NPU?
If that were the case, the task would not be CPU-bound, which would explain the behaviour you are seeing.
Hi vstehle,

I don't think this is the case: the API for the NPU is not very open, and the code I am using does not make calls to it. What is strange to me is that this behavior appears only for one network, ResNet50. The other networks behave as expected, performing better or equally well on the big cores.

I am now looking into whether it is a synchronization issue in the implementation of that specific network.
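To that end, I am comparing basic counters between the two clusters, along these lines (assuming `perf` is available for this 4.9 kernel; `inference_app` is a placeholder):

```
# High context-switch or migration counts on the big cluster would
# hint at a synchronization problem rather than raw compute speed
taskset -c 0-3 perf stat -e cycles,instructions,context-switches,cpu-migrations ./inference_app
taskset -c 4-7 perf stat -e cycles,instructions,context-switches,cpu-migrations ./inference_app
```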
I also tried moving all system processes/threads to the cores that I am not using for inference, using `cset`. The behavior is still the same: the LITTLE cores demonstrate better performance, almost double that of the big cores.
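Concretely, I shield the cluster under test roughly like this (a sketch based on the stock `cset shield` interface; exact flags may differ between versions):

```
# Reserve the big cluster for the benchmark and move movable
# kernel threads off it as well
cset shield --cpu=4-7 --kthread=on

# Run the workload inside the shield
cset shield --exec ./inference_app

# Tear the shield down afterwards
cset shield --reset
```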