Why is performance higher on LITTLE cores?

Hi all,

I am using the HiKey970 board to run neural network inference. The board has ARM Cortex-A73 (big) and ARM Cortex-A53 (LITTLE) cores.
I am using `taskset` to pin the inference process (which spawns 4 threads) first to the LITTLE cores (0-3) and then to the big cores (4-7). Contrary to what I expected, the inference time almost doubles on the big cores compared to the LITTLE cores.

Is there an explanation for this behavior? Are there tools that can help me understand why the threads are slower when using big cores?

To be more precise, the board is flashed with kernel version 4.9.78-147538-g244928755bbe, and the code that I am using can be found in this repo.
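
For reference, the pinned comparison described above could be scripted roughly as follows. This is only a sketch: `time_pinned` is a made-up helper name, `./inference` is a placeholder for the real binary, and the timing assumes a `date` that supports `%N` (GNU coreutils).

```shell
# Run a command pinned to a CPU list and print "<cpus> <elapsed milliseconds>".
# Assumes taskset (util-linux) and a date that supports %N (nanoseconds).
time_pinned() {
    cpus="$1"
    shift
    start=$(date +%s%N)
    taskset -c "$cpus" "$@" > /dev/null
    end=$(date +%s%N)
    echo "$cpus $(( (end - start) / 1000000 ))"
}

# Example (./inference is a placeholder, not the actual program name):
# time_pinned 0-3 ./inference
# time_pinned 4-7 ./inference
```

Repeating each measurement several times should make it easier to tell a consistent gap from run-to-run noise.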

  • Hi zois,

    Interesting.

    You may need to investigate deeper, probably with performance counters.
    You could use Streamline to do so:
    developer.arm.com/.../streamline-performance-analyzer

    Having a fixed DDR frequency would help. Since the performance governor is not available for the DDR devfreq device, you could instead force the frequency by setting the minimum and maximum to the same value:

    cat /sys/class/devfreq/ddr_devfreq/available_frequencies
    415000000 830000000 1244000000 1866000000
    
    sudo bash -c "echo 1866000000 > /sys/class/devfreq/ddr_devfreq/min_freq"
    sudo bash -c "echo 1866000000 > /sys/class/devfreq/ddr_devfreq/max_freq"
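
    The two writes above could be wrapped in a small helper that first checks the requested value against available_frequencies. This is a rough sketch, not a tested tool: `pin_devfreq` is a made-up name, it has to run as root on the real sysfs, and writing max before min is an assumption that suits pinning to the highest frequency, since some kernels appear to reject a minimum above the current maximum.

```shell
# Pin a devfreq device directory to a single frequency (Hz).
# Refuses values that are not listed in available_frequencies.
pin_devfreq() {
    dev="$1"
    freq="$2"
    if ! grep -qw -- "$freq" "$dev/available_frequencies"; then
        echo "frequency $freq is not offered by $dev" >&2
        return 1
    fi
    # Raise max first, then min (assumes the target is not below the current min).
    echo "$freq" > "$dev/max_freq" &&
    echo "$freq" > "$dev/min_freq"
}

# Example, as root:
# pin_devfreq /sys/class/devfreq/ddr_devfreq 1866000000
```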


    Also, try to use the same frequency on both clusters:

    cat /sys/devices/system/cpu/cpufreq/policy0/scaling_available_frequencies
    509000 1018000 1210000 1402000 1556000 1690000 1844000
    cat /sys/devices/system/cpu/cpufreq/policy4/scaling_available_frequencies
    682000 1018000 1210000 1364000 1498000 1652000 1863000 2093000 2362000
    
    sudo bash -c "echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor"
    sudo bash -c "echo 1018000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq"
    
    sudo bash -c "echo performance > /sys/devices/system/cpu/cpufreq/policy4/scaling_governor"
    sudo bash -c "echo 1018000 > /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq"
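
    To see which frequencies the two clusters actually share, the two lists can be intersected. A minimal sketch follows; `common_freqs` is a made-up helper name, and it only assumes the space-separated format shown above.

```shell
# Print every frequency (kHz) that appears in both frequency lists.
# Arguments are two files in the scaling_available_frequencies format.
common_freqs() {
    awk 'NR == FNR { for (i = 1; i <= NF; i++) seen[$i] = 1; next }
         { for (i = 1; i <= NF; i++) if ($i in seen) print $i }' "$1" "$2"
}

# Example on the board:
# p=/sys/devices/system/cpu/cpufreq
# common_freqs "$p/policy0/scaling_available_frequencies" "$p/policy4/scaling_available_frequencies"
```

    With the two lists shown above, this should print 1018000 and 1210000, so either value could be used for a like-for-like comparison.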

    Remember to keep monitoring temperature and frequency changes while you're investigating.
    You could also reduce interference by stopping unneeded background services and running everything from a serial console, without any GUI. This cuts down somewhat on context switching and other disturbances to your application.
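
    One way to keep an eye on frequency and temperature is to poll sysfs from a second terminal. The sketch below assumes the standard cpufreq and thermal sysfs layout (scaling_cur_freq in kHz, thermal zone temp in millidegrees Celsius); `snapshot` is a made-up name, and which thermal zones exist on this board is not something I have checked.

```shell
# Print one line per cpufreq policy and per thermal zone found under a sysfs root.
# The root defaults to /sys; passing another directory is only useful for dry runs.
snapshot() {
    root="${1:-/sys}"
    for f in "$root"/devices/system/cpu/cpufreq/policy*/scaling_cur_freq; do
        [ -r "$f" ] || continue
        echo "$(basename "$(dirname "$f")") $(cat "$f") kHz"
    done
    for f in "$root"/class/thermal/thermal_zone*/temp; do
        [ -r "$f" ] || continue
        echo "$(basename "$(dirname "$f")") $(cat "$f") millidegC"
    done
}
```

    Calling `snapshot` once per second in a loop while the inference runs would show whether a cluster drops below the frequency you set, which would point towards thermal throttling.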
