Hi all,

I am using the HiKey970 board to run inferences on neural networks. The board comprises four ARM Cortex-A73 and four ARM Cortex-A53 cores. I am using `taskset` to pin the inference process (which spawns 4 threads) once on the LITTLE cores (0-3) and once on the big cores (4-7). Contrary to what I was expecting, the inference time is almost double when running on the big cores compared to the LITTLE cores.

Is there an explanation for this behavior? Are there tools that can help me understand why the threads are slower when using the big cores?

To be more precise, the board is flashed with kernel version 4.9.78-147538-g244928755bbe, and the code that I am using can be found in this repo.
Hi zois,
The big cores consume more energy and hence generate more heat. It is possible that thermal throttling is the cause of the poor performance when running on the big cluster.
You can check the transition tables between cooling-device states triggered by temperature with:
cat /sys/devices/virtual/thermal/cooling_device*/stats/trans_table
You can check if the corresponding cooling_device affects CPU frequency:
cat /sys/devices/virtual/thermal/cooling_device*/type
You could run this script before and after your application and check differences.
#!/bin/bash
for cdev in `ls -d /sys/devices/virtual/thermal/cooling_device*`
do
    echo ${cdev}
    cat ${cdev}/type
    cat ${cdev}/stats/trans_table
    echo ""
done
Hi Willy Wolff,

Thank you for the suggestion. The specific distribution does not provide trans_table, but I monitored
/sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
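For reference, a minimal sketch of comparing that counter around a run (the benchmark invocation is a placeholder, and the script falls back to 0 if the file is absent):

```shell
#!/bin/bash
# Snapshot the big-cluster frequency-transition counter before and after a
# run; a nonzero delta means the CPU frequency changed during the benchmark.
STAT=/sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
read_trans() { cat "$STAT" 2>/dev/null || echo 0; }  # 0 if the file is absent

before=$(read_trans)
# ./your_inference_binary   # placeholder for the actual benchmark
after=$(read_trans)
echo "frequency transitions during run: $((after - before))"
```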
Could you try running your benchmark with all frequencies set to the maximum?
#!/bin/bash
for cpufreq in `ls -d /sys/devices/system/cpu/cpufreq/policy*`
do
    echo "Working on ${cpufreq}"
    gov=`cat ${cpufreq}/scaling_governor`
    echo "${cpufreq} currently using \"${gov}\" governor, will force to performance"
    sudo bash -c "echo performance > ${cpufreq}/scaling_governor"
    sudo bash -c "echo 0 > ${cpufreq}/stats/reset"
    cat ${cpufreq}/stats/trans_table
    echo ""
done

for devfreq in `ls -d /sys/class/devfreq/*`
do
    echo "Working on ${devfreq}"
    IFS=', ' read -r -a array <<< `cat ${devfreq}/available_governors`
    # echo "${array}"
    if [[ " ${array[@]} " =~ " performance " ]]; then
        echo "== performance governor available for ${devfreq}, will use it"
        sudo bash -c "echo performance > ${devfreq}/governor"
    fi
    sudo bash -c "echo 0 > ${devfreq}/trans_stat"
    cat ${devfreq}/trans_stat
done

echo "remember to reset governors at the end"

# YOUR BENCHMARK

for cpufreq in `ls -d /sys/devices/system/cpu/cpufreq/policy*`
do
    echo "Working on ${cpufreq}"
    cat ${cpufreq}/stats/trans_table
done

for devfreq in `ls -d /sys/class/devfreq/*`
do
    echo "Working on ${devfreq}"
    cat ${devfreq}/trans_stat
done
Also, for how long does your program run?
Are your results "consistent" when you run your program multiple times?
==
zois said:

May I ask, is there a benefit in unplugging the cores I am not using? Or is that an alternative approach to using `taskset`?
By offlining all CPU cores of a big or LITTLE cluster, you remove the cluster from the data-cache-coherency domain at the hardware level.
See big.LITTLE.Whitepaper.pdf (hexus.net).
With `taskset`, you just force the Linux scheduler to run your code on the CPUs provided.
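As a sketch of the two approaches (hedged: CPU numbering as on the HiKey970, `$BENCH` is a placeholder for the inference binary, and the hotplug writes only happen when you opt in with APPLY=1):

```shell
#!/bin/bash
# Two ways to keep work off the LITTLE cluster (CPUs 0-3 on the HiKey970).
BENCH=${BENCH:-true}   # placeholder; substitute the actual inference binary
APPLY=${APPLY:-0}      # set APPLY=1 (as root) to really toggle the CPUs

# 1) taskset: scheduler-level pinning; the other cluster stays online and
#    remains part of the data-cache-coherency domain.
if [ "$(nproc)" -ge 8 ]; then
    taskset -c 4-7 $BENCH
fi

# 2) Hotplug offlining: with every LITTLE CPU offline, the cluster drops
#    out of the coherency domain at the hardware level (needs root).
set_online() {   # set_online <0|1> -- toggle CPUs 0-3
    for cpu in 0 1 2 3; do
        [ "$APPLY" = 1 ] && echo "$1" > /sys/devices/system/cpu/cpu${cpu}/online
    done
    return 0
}
set_online 0
$BENCH
set_online 1     # bring the LITTLE cores back afterwards
```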
Thanks for the suggestions and the clarification on offlining cores. I tried the suggested approach, though not with the same script, since some of the files do not exist in the flashed distribution.
With the maximum frequency set on each cluster, the results are consistent across consecutive runs. The program runs for 2 minutes each time. I have also tried inserting a 10-second sleep before invoking the next run, to allow some time for the CPUs to cool. Yet the behavior is the same: the LITTLE cores perform better, and comparing `/sys/devices/system/cpu/cpufreq/policy*/stats/total_trans` before and after running, the number of transitions remains the same.

One question: since the device I am using does not provide a performance governor for the DDR (userspace and simple_ondemand are available), is there a suggested way to set the memory frequency to maximum? Tools such as `lshw -C memory` and `dmidecode` do not work on this board. I tried writing to `/sys/devices/platform/ddr_devfreq/devfreq/ddr_devfreq/cur_freq`, but it does not have the effect I want.
I think that inserting a 10-second sleep is not really useful if you have not seen any cooling action, but you can keep it in case that changes.
A 'performance' governor for the DDR is not a requirement. The idea is to have consistent frequencies everywhere, so there are no surprises when a policy or governor starts playing. I don't know what your end goal is, but if it is not OS/hardware configuration tweaking, it is a good idea to fix all frequencies to limit surprises in your results.
For instance, on another board that implements a different big.LITTLE configuration (Hardkernel Odroid-XU3/4, with 4x A7 and 4x A15), I have seen some nasty, weird results under "bad" cluster frequency configurations. The issue was with the data-cache-coherency mechanism, which behaved poorly at certain CPU frequency combinations (see https://dl.acm.org/doi/10.1145/3372799.3394370). However, on the HiKey970 it is unlikely that the poor performance on the big cluster is related to the issue found in that study, as the data-cache-coherency mechanism is a bit different thanks to a snoop filter.
One way to get the DDR frequency control and the other files I mentioned working might be to bump your kernel version, but I don't know whether a newer kernel supports this board.
I suggest you check performance counters to get a more in-depth clue about what the hardware is actually doing beneath the OS.
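For example (a sketch, assuming `perf` is built for this kernel; the generic event names below are a starting point, and `true` stands in for the inference binary):

```shell
#!/bin/bash
# Count a few generic hardware events for a big-cluster run and a
# LITTLE-cluster run; large gaps between the two point at the culprit.
EVENTS=cycles,instructions,cache-references,cache-misses
if command -v perf >/dev/null && [ "$(nproc)" -ge 8 ]; then
    perf stat -e "$EVENTS" taskset -c 4-7 true || true  # big cores
    perf stat -e "$EVENTS" taskset -c 0-3 true || true  # LITTLE cores
fi
```

The A53 and A73 PMUs expose many more implementation-specific events; `perf list` shows what is available on the board.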
Also, you could try running both clusters at the same (high) frequency, if that is possible. That way you could see whether the hardware favours your code through deeper hardware features available on one or the other CPU type.
After a bit of research: this paper https://arxiv.org/pdf/1908.11450.pdf seems related to what you're looking for. Their results on ResNet50 seem to contradict yours. Maybe triple-check your scripts?
Meanwhile, I will try to get my hands on this board.
[...]
Finally, I got my hands on a board. It seems that the board heats up quite easily.
Please try to monitor the thermal states in parallel while running your code:
#!/bin/bash
while true; do
    echo "==="
    echo "temperature"
    cat /sys/devices/virtual/thermal/thermal_zone*/temp
    echo "cooling state"
    cat /sys/devices/virtual/thermal/cooling_device*/cur_state
    sleep 0.1
done
I'm curious, what are your results with VGG?
Hi Willy,

It seems that I am not able to keep the DDR frequency constant. Bumping the kernel could help, but it is out of my time scope; I would need to cover some background in order to port it to the board.
You are right, I need to look closer at the hardware counters to understand how the architecture of each cluster affects execution. I plan to do that next.

Regarding the clusters' frequencies, I have set both to the highest possible value: 1.86 GHz for LITTLE and 2.36 GHz for big. Even with that configuration, I notice the performance difference I mentioned. Additionally, I monitored the frequency while running ResNet50 and saw no CPU frequency transitions during execution, so I think that can be ruled out as a cause.

Thanks for the publication. I checked the scripts and don't see anything funny or unexpected in the configuration of the board or the arguments to the executables.

I monitored the temperature as you suggested. There is a ~9°C difference on the board between using the big and LITTLE cores. Though, as I mentioned previously, I was also monitoring the core frequencies and did not observe any transition, even with the higher temperature.
VGG (16 and 19) shows the expected behavior: performance is ~40% better on the big cores compared to the LITTLE cores. The best performance is observed when using the GPU on the board.
Interesting.
You may need to investigate deeper, probably with performance counters. You could use Streamline to do so: developer.arm.com/.../streamline-performance-analyzer
Having a fixed DDR frequency would be nice. Instead of the unavailable performance governor, you could force a min and max range:
cat /sys/class/devfreq/ddr_devfreq/available_frequencies
415000000 830000000 1244000000 1866000000

sudo bash -c "echo 1866000000 > /sys/class/devfreq/ddr_devfreq/min_freq"
sudo bash -c "echo 1866000000 > /sys/class/devfreq/ddr_devfreq/max_freq"
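If those min_freq/max_freq files are also missing on this kernel, another route worth trying (a sketch I have not verified on the board) is the `userspace` devfreq governor you said is available, which exposes a `set_freq` knob:

```shell
DEV=/sys/class/devfreq/ddr_devfreq
sudo bash -c "echo userspace > ${DEV}/governor"
sudo bash -c "echo 1866000000 > ${DEV}/userspace/set_freq"
cat ${DEV}/cur_freq   # check that the frequency actually stuck
```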
Also, try to use the same frequency on both clusters.
cat /sys/devices/system/cpu/cpufreq/policy0/scaling_available_frequencies
509000 1018000 1210000 1402000 1556000 1690000 1844000

cat /sys/devices/system/cpu/cpufreq/policy4/scaling_available_frequencies
682000 1018000 1210000 1364000 1498000 1652000 1863000 2093000 2362000

sudo bash -c "echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor"
sudo bash -c "echo 1018000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq"
sudo bash -c "echo performance > /sys/devices/system/cpu/cpufreq/policy4/scaling_governor"
sudo bash -c "echo 1018000 > /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq"
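As a small sanity check on those listings, the two clusters actually share more than one frequency; a quick way to compute the overlap (values copied from the sysfs output above):

```shell
#!/bin/bash
# Intersect the two scaling_available_frequencies lists (kHz) to find the
# frequencies both clusters can run at.
little="509000 1018000 1210000 1402000 1556000 1690000 1844000"
big="682000 1018000 1210000 1364000 1498000 1652000 1863000 2093000 2362000"
common=$(comm -12 <(tr ' ' '\n' <<<"$little" | sort) \
                  <(tr ' ' '\n' <<<"$big"    | sort))
echo "$common"
```

So 1018000 kHz (used above) and 1210000 kHz are the two values both clusters support.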
Remember to keep monitoring temperature and frequency changes while you investigate. You could also limit distractions for your hardware by removing "useless" processes: stop all unneeded background services and run everything from the serial console, without any GUI running. This limits context switching and other annoyances for your application.
Thanks for the advice, I will try it out.

Posting the following about setting the memory and GPU frequency, in case someone else comes across this thread. For the specific distribution, kernel, and device (Lebian, 4.9.78-147538-g244928755bbe, HiKey 970):
/sys/devices/platform/ddr_devfreq/devfreq/ddr_devfreq/cur_freq
/sys/devices/platform/ddr_devfreq/devfreq/ddr_devfreq