Hi all,

I am using the HiKey970 board to run inferences on neural networks. The board comprises four ARM Cortex-A73 and four ARM Cortex-A53 cores. I am using `taskset` to pin the inference process (which spawns 4 threads) once on the LITTLE cores (0-3) and once on the big cores (4-7). Contrary to what I was expecting, the inference time is almost double when running on the big cores compared to the LITTLE cores.

Is there an explanation for this behavior? Are there tools that can help me understand why the threads are slower when using the big cores?

To be more precise, the board is flashed with kernel version 4.9.78-147538-g244928755bbe, and the code that I am using can be found in this repo.
Hi zois,
The big cores consume more energy and hence generate more heat. It is possible that thermal throttling is the cause of the poor performance when running on the big cluster.
You can check the transition tables between cooling-device states triggered by temperature with:
cat /sys/devices/virtual/thermal/cooling_device*/stats/trans_table
You can check if the corresponding cooling_device affects CPU frequency:
cat /sys/devices/virtual/thermal/cooling_device*/type
You could run this script before and after your application and check differences.
#!/bin/bash
for cdev in `ls -d /sys/devices/virtual/thermal/cooling_device*`
do
    echo ${cdev}
    cat ${cdev}/type
    cat ${cdev}/stats/trans_table
    echo ""
done
Hi Willy Wolff,

Thank you for the suggestion. The specific distribution does not provide trans_table, but I monitored
/sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
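For reference, a minimal sketch of comparing that counter around a run (the benchmark invocation is a placeholder, and the script falls back to 0 if the file is absent):

```shell
#!/bin/bash
# Snapshot the big-cluster frequency-transition counter before and after a
# run; a nonzero delta means the CPU frequency changed during the benchmark.
STAT=/sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
read_trans() { cat "$STAT" 2>/dev/null || echo 0; }  # 0 if the file is absent

before=$(read_trans)
# ./your_inference_binary   # placeholder for the actual benchmark
after=$(read_trans)
echo "frequency transitions during run: $((after - before))"
```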
Could you try running your benchmark with all frequencies set to the maximum?
#!/bin/bash
for cpufreq in `ls -d /sys/devices/system/cpu/cpufreq/policy*`
do
    echo "Working on ${cpufreq}"
    gov=`cat ${cpufreq}/scaling_governor`
    echo "${cpufreq} currently using \"${gov}\" governor, will force to performance"
    sudo bash -c "echo performance > ${cpufreq}/scaling_governor"
    sudo bash -c "echo 0 > ${cpufreq}/stats/reset"
    cat ${cpufreq}/stats/trans_table
    echo ""
done

for devfreq in `ls -d /sys/class/devfreq/*`
do
    echo "Working on ${devfreq}"
    IFS=', ' read -r -a array <<< `cat ${devfreq}/available_governors`
    # echo "${array}"
    if [[ " ${array[@]} " =~ " performance " ]]; then
        echo "== performance governor available for ${devfreq}, will use it"
        sudo bash -c "echo performance > ${devfreq}/governor"
    fi
    sudo bash -c "echo 0 > ${devfreq}/trans_stat"
    cat ${devfreq}/trans_stat
done

echo "remember to reset governors at the end"

# YOUR BENCHMARK

for cpufreq in `ls -d /sys/devices/system/cpu/cpufreq/policy*`
do
    echo "Working on ${cpufreq}"
    cat ${cpufreq}/stats/trans_table
done

for devfreq in `ls -d /sys/class/devfreq/*`
do
    echo "Working on ${devfreq}"
    cat ${devfreq}/trans_stat
done
Also, for how long does your program run?
Are your results "consistent" when you run your program multiple times?
==
zois said:

May I ask, is there a benefit in unplugging the cores I am not using? Or is that an alternative approach to using `taskset`?
By offlining all CPU cores of a big or LITTLE cluster, you remove the cluster from the data-cache-coherency domain at the hardware level.
See big.LITTLE.Whitepaper.pdf (hexus.net).
With `taskset`, you just force the Linux scheduler to run your code on the CPUs provided.
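As a sketch of the two approaches (hedged: CPU numbering as on the HiKey970, `$BENCH` is a placeholder for the inference binary, and the hotplug writes only happen when you opt in with APPLY=1):

```shell
#!/bin/bash
# Two ways to keep work off the LITTLE cluster (CPUs 0-3 on the HiKey970).
BENCH=${BENCH:-true}   # placeholder; substitute the actual inference binary
APPLY=${APPLY:-0}      # set APPLY=1 (as root) to really toggle the CPUs

# 1) taskset: scheduler-level pinning; the other cluster stays online and
#    remains part of the data-cache-coherency domain.
if [ "$(nproc)" -ge 8 ]; then
    taskset -c 4-7 $BENCH
fi

# 2) Hotplug offlining: with every LITTLE CPU offline, the cluster drops
#    out of the coherency domain at the hardware level (needs root).
set_online() {   # set_online <0|1> -- toggle CPUs 0-3
    for cpu in 0 1 2 3; do
        [ "$APPLY" = 1 ] && echo "$1" > /sys/devices/system/cpu/cpu${cpu}/online
    done
    return 0
}
set_online 0
$BENCH
set_online 1     # bring the LITTLE cores back afterwards
```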
Thanks for the suggestions and the clarification on offlining cores. I tried the suggested approach, though not with the same script, since some of the files do not exist in the flashed distribution.
With the maximum frequency set on each cluster, the results are consistent across consecutive runs. The program runs for 2 minutes each time. I have also tried inserting a 10-second sleep before invoking the next run, to allow some time for the CPUs to cool. Yet the behavior is the same: the LITTLE cores perform better, and comparing `/sys/devices/system/cpu/cpufreq/policy*/stats/total_trans` before and after running, the number of transitions remains the same.

One question: since the device I am using does not provide a performance governor for the DDR (userspace and simple_ondemand are available), is there a suggested way to set the memory frequency to maximum? Tools such as `lshw -C memory` and `dmidecode` do not work on this board. I tried writing to `/sys/devices/platform/ddr_devfreq/devfreq/ddr_devfreq/cur_freq`, but it does not have the effect I want.
I think that inserting a 10-second sleep is not really useful if you have not seen any cooling action, but you can keep it in case that changes.
A 'performance' governor for the DDR is not a requirement. The idea is to have consistent frequencies everywhere, so there are no surprises when a policy or governor starts playing. I don't know what your end goal is, but if it is not OS/hardware configuration tweaking, it is a good idea to fix all frequencies to limit surprises in your results.
For instance, on another board that implements a different big.LITTLE configuration (Hardkernel Odroid-XU3/4, with 4x A7 and 4x A15), I have seen some nasty, weird results under "bad" cluster frequency configurations. The issue was with the data-cache-coherency mechanism, which behaved poorly at certain CPU frequency combinations (see https://dl.acm.org/doi/10.1145/3372799.3394370). However, on the HiKey970 it is unlikely that the poor performance on the big cluster is related to the issue found in that study, as the data-cache-coherency mechanism is a bit different thanks to a snoop filter.
One way to get the DDR frequency control and the other files I mentioned working might be to bump your kernel version, but I don't know whether a newer kernel supports this board.
I suggest you check performance counters to get a more in-depth clue about what the hardware is actually doing beneath the OS.
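For example (a sketch, assuming `perf` is built for this kernel; the generic event names below are a starting point, and `true` stands in for the inference binary):

```shell
#!/bin/bash
# Count a few generic hardware events for a big-cluster run and a
# LITTLE-cluster run; large gaps between the two point at the culprit.
EVENTS=cycles,instructions,cache-references,cache-misses
if command -v perf >/dev/null && [ "$(nproc)" -ge 8 ]; then
    perf stat -e "$EVENTS" taskset -c 4-7 true || true  # big cores
    perf stat -e "$EVENTS" taskset -c 0-3 true || true  # LITTLE cores
fi
```

The A53 and A73 PMUs expose many more implementation-specific events; `perf list` shows what is available on the board.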
Also, you could try running both clusters at the same (high) frequency, if that is possible. That way you could see whether the hardware favours your code through deeper hardware features available on one or the other CPU type.
After a bit of research: this paper https://arxiv.org/pdf/1908.11450.pdf seems related to what you're looking for. Their results on ResNet50 seem to contradict yours. Maybe triple-check your scripts?
Meanwhile, I will try to get my hands on this board.
[...]
Finally, I got my hands on a board. It seems that the board heats up quite easily.
Please try to monitor the thermal states in parallel while running your code:
#!/bin/bash
while true; do
    echo "==="
    echo "temperature"
    cat /sys/devices/virtual/thermal/thermal_zone*/temp
    echo "cooling state"
    cat /sys/devices/virtual/thermal/cooling_device*/cur_state
    sleep 0.1
done
I'm curious, what are your results with VGG?
Hi Willy,

It seems that I am not able to keep the DDR frequency constant. Bumping the kernel could help, but it is out of my time scope; I would need to cover some background in order to port it to the board.
You are right, I need to look closer at the hardware counters to understand how the architecture of each cluster affects execution. I plan to do that next.

Regarding the clusters' frequencies, I have set both to the highest possible value: 1.86 GHz for LITTLE and 2.36 GHz for big. Even with that configuration, I notice the performance difference I mentioned. Additionally, I monitored the frequency while running ResNet50 and saw no CPU frequency transitions during execution, so I think that can be ruled out as a cause.

Thanks for the publication. I checked the scripts and don't see anything funny or unexpected in the configuration of the board or the arguments to the executables.

I monitored the temperature as you suggested. There is a ~9°C difference on the board between using the big and LITTLE cores. Though, as I mentioned previously, I was also monitoring the core frequencies and did not observe any transition, even with the higher temperature.
VGG (16 and 19) shows the expected behavior: performance is ~40% better on the big cores compared to the LITTLE cores. The best performance is observed when using the GPU on the board.
Interesting.
You may need to investigate deeper, probably with performance counters. You could use Streamline to do so: developer.arm.com/.../streamline-performance-analyzer
Having a fixed DDR frequency would be nice. Instead of the unavailable performance governor, you could force a min and max range:
cat /sys/class/devfreq/ddr_devfreq/available_frequencies
415000000 830000000 1244000000 1866000000

sudo bash -c "echo 1866000000 > /sys/class/devfreq/ddr_devfreq/min_freq"
sudo bash -c "echo 1866000000 > /sys/class/devfreq/ddr_devfreq/max_freq"
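If those min_freq/max_freq files are also missing on this kernel, another route worth trying (a sketch I have not verified on the board) is the `userspace` devfreq governor you said is available, which exposes a `set_freq` knob:

```shell
DEV=/sys/class/devfreq/ddr_devfreq
sudo bash -c "echo userspace > ${DEV}/governor"
sudo bash -c "echo 1866000000 > ${DEV}/userspace/set_freq"
cat ${DEV}/cur_freq   # check that the frequency actually stuck
```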
Also, try to use the same frequency on both clusters.
cat /sys/devices/system/cpu/cpufreq/policy0/scaling_available_frequencies
509000 1018000 1210000 1402000 1556000 1690000 1844000

cat /sys/devices/system/cpu/cpufreq/policy4/scaling_available_frequencies
682000 1018000 1210000 1364000 1498000 1652000 1863000 2093000 2362000

sudo bash -c "echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor"
sudo bash -c "echo 1018000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq"
sudo bash -c "echo performance > /sys/devices/system/cpu/cpufreq/policy4/scaling_governor"
sudo bash -c "echo 1018000 > /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq"
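As a small sanity check on those listings, the two clusters actually share more than one frequency; a quick way to compute the overlap (values copied from the sysfs output above):

```shell
#!/bin/bash
# Intersect the two scaling_available_frequencies lists (kHz) to find the
# frequencies both clusters can run at.
little="509000 1018000 1210000 1402000 1556000 1690000 1844000"
big="682000 1018000 1210000 1364000 1498000 1652000 1863000 2093000 2362000"
common=$(comm -12 <(tr ' ' '\n' <<<"$little" | sort) \
                  <(tr ' ' '\n' <<<"$big"    | sort))
echo "$common"
```

So 1018000 kHz (used above) and 1210000 kHz are the two values both clusters support.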
Remember to keep monitoring temperature and frequency changes while you investigate. You could also limit distractions for your hardware by removing "useless" processes: stop all unneeded background services and run everything from the serial console, without any GUI running. This limits context switching and other annoyances for your application.
Thanks for the advice, I will try it out.

Posting the following about setting the memory and GPU frequency, in case someone else comes across this thread. For the specific distribution, kernel, and device (Lebian, 4.9.78-147538-g244928755bbe, HiKey 970):
/sys/devices/platform/ddr_devfreq/devfreq/ddr_devfreq/cur_freq
/sys/devices/platform/ddr_devfreq/devfreq/ddr_devfreq