This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Inexplicable performance on big.LITTLE technology (on Android)

I apologise for the long question, but I am trying to measure performance of different indexing techniques on various platforms, one of which is Adaptive Radix tree.

I have run tests where the basic steps look like this (c/c++):

Step 1: Generate or load data (few million key-value pairs)
Step 2: Insert into index and measure time taken (insert_time)
Step 3: Retrieve from index and measure time taken (retrieve_time)

I find that always insert_time > retrieve_time on most platforms such as Intel desktops (i386/amd64), iPad (Apple A9), Android (ARMv7) and Raspberry Pi 3 (ARMv8). This is expected, as insert complexity is higher than retrieve complexity.

But when I run the steps on big.LITTLE platforms, specifically Snapdragon 845 (Xiaomi POCO F1) and HiSilicon Kirin 659 (Honor 9 lite), I find insert_time < retrieve_time, except when data size is too low.

To diagnose what could be wrong, I went through the following steps:

Ensure that the thread is running at maximum speed by using following code:

void set_thread_priority() {
    nice(-20);
    int policy = 0;
    struct sched_param param;
    pthread_getschedparam(pthread_self(), &policy, &param);
    param.sched_priority = sched_get_priority_max(policy);
    pthread_setschedparam(pthread_self(), policy, &param);
}

I could see that the nice value is reflected against the process and the thread runs 100% CPU in most cases (it is basically single thread algorithm).

Set CPU affinity using following code:

void set_affinity() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(4, &mask);
    CPU_SET(5, &mask);
    CPU_SET(6, &mask);
    CPU_SET(7, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}

This code also reflects well on big.LITTLE because when I set CPUs as 0, 1, 2, 3, the code runs much slower than when I set CPUs as 4, 5, 6, 7. Even then insert_time < retrieve_time in both cases.

Ensure that sufficient free RAM is available for my dataset
To avoid the possibility that Step 3 might retrieve from virtual memory, I added Step 4, which is just repeating Step 3:
```
Step 4: Retrieve from index and measure time taken again (retrieve_time2)
```

To my surprise, retrieve_time2 > retrieve_time > insert_time (by 2 to 3 seconds for 10 million records).

As for my code, the insert code looks like this:

    it1 = m.begin();
    start = getTimeVal();
    for (; it1 != m.end(); ++it1) {
        art_insert(&at, (unsigned char*) it1->first.c_str(),
               (int) it1->first.length() + 1, (void *) it1->second.c_str(),
               (int) it1->second.length());
        ctr++;
    }
    stop = getTimeVal();

and retrieve code looks like this:

    it1 = m.begin();
    start = getTimeVal();
    for (; it1 != m.end(); ++it1) {
        int len;
        char *value = (char *) art_search(&at,
            (unsigned char*) it1->first.c_str(), (int) it1->first.length() + 1, &len);
        ctr++;
    }
    stop = getTimeVal();

Any pointers as to what I could do further? Or is there an explanation for this from the platform perspective?

Top replies

0 Zhifei Yang over 5 years ago

It's not easy for me to say what's the exact problem. But I hope you can investigate from below aspects:

1) Can you do the CPU hot-plug-out to shut down all big or LITTLE CPUs, so that the big.LITTLE SoC regresses into a non-big.LITTLE CPU topology?
Then you can test the insert/retrieve time instead of setting the CPU affinity.

2) "Ensure that the thread is running at maximum speed by using following code:" Can you double check that what's the CPU frequency governor in your big.LITTLE system? In the recent Linux kernels, Energy Aware Scheduling (EAS) governor was deployed. If your insert/retrieve operations are distributed into different threads or even in single thread but running in different CPU time, the CPU frequency may be adjusted based on the current CPU workloads.

See https://developer.arm.com/open-source/energy-aware-scheduling

3) Is it possible for you to add some memory barrier instructions in the high level C++ STL language?
Just to make sure the memory results of insert operations are flushed into memory, not temporarily stored in cache. So that your next retrieve operations may not fetch the data from cache.

4) "To my surprise, retrieve_time2 > retrieve_time > insert_time (by 2 to 3 seconds for 10 million records)."
It depends on your code design. If the data that retrieve_time2 wants to access is stored in the memory but the retrieve_time can access the data from cache, you may observe the result as what you mentioned. In a huge loop ( the loop count is too larger compared to a small TLB/L1 or L2 cache size), the retrieve_time operations may dither the TLB/L1/L2 cache so that some data are flushed out into next level cache or memory.
Cancel
Up +1 Down

Cancel
0 Arundale Ramanathan over 5 years ago in reply to Zhifei Yang
Dear Zhifei Yang - Thank you for the considered and categorical response.

About (1) and (2), I am unable to hotplug out CPUs or change the governor policy as I don't have root access on the new phone (Xiaomi Poco F1). However, I have read access to /sys/bus/cpu/devices/cpu?/cpufreq and I found some interesting things that may throw light on what is happening:

scaling_governor is set as schedutil

when I run my program (single threaded), I find that it affects cpu4 to cpu7 only (affinity) and the frequency changes to 2.6GHz and stays there during both insert and retrieve. I can see this from stats/time_in_state and stats/total_trans. There is hardly any change in speed transition (total_trans) for the duration of my program run.

Since it appears that the CPU runs at constant speed (although not maximum 2.8GHz), can I conclude that the problem is not caused by CPU?

About (3), I can completely rule out reading from cache as I insert into multiple different index technologies (not just ART) before retrieving. So considering the size of data set there cannot be inserted data remaining in cache.

Another observation is that, when I use the top utility, initially the VIRT column shows 4.5G and after insert it shows 5.4G. My dataset including source buffer and index does not exceed 2GB. The phone has 6GB ram and 1.4GB free ram after insert.

I don't understand why VIRT column increases eventhough I have enough free ram. That is why I retrieve twice to rule out reading from virtual memory, but that also does not improve retrieve speed.

Could it be because of the type of memory caches used in SDM845? Or could it be because of type of low power memory used (LPDDR4x)?

I am stuck because I am unable to explain this anomaly on SDM845. Please let me know if you have any ideas.

Regards

Arun
Cancel
Up 0 Down

Cancel
0 Jason Andrews over 5 years ago

Hi,

It would be interesting to try the Arm Streamline performance analysis tool and see if you can learn more about what is happening.

There are a few blogs about it in the Software Tools area and a recent one about running on non-rooted Android which should help.

https://community.arm.com/tools/b/blog/posts/streamline-performance-analyzer-for-android-application-profiling

Let me know if you need any help with Streamline.

Thanks,

Jason
Cancel
Up +1 Down

Cancel
0 Zhifei Yang over 5 years ago in reply to Arundale Ramanathan

For the 2), "schedutil" is the name of EAS cpu governor, which means your CPU frequency is managed by EAS dynamically. If your CPUs are running with different workloads ( workload will be calculated by EAS algorithm automatically ), your CPU frequencies may vary from time to time. So your benchmark tests may be run under different CPU frequencies.
Cancel
Up +1 Down

Cancel
0 Arundale Ramanathan over 5 years ago in reply to Zhifei Yang

Hi Zhifei Yang,

Thank you for replying. From what I see from stats/time_in_state and stats/total_trans in linux. there seems to be no change in speed transition (total_trans). So I am not sure if EAS varies the frequencies.

My insert operations mainly depend on memcpy. Is it possible that memcpy directly copies within external memory without first bringing data into cache? If this is so, then the behaviour is explained, because my retrieve operations have to bring data into cache and would be slower.
Cancel
Up 0 Down

Cancel
0 Zhifei Yang over 5 years ago in reply to Arundale Ramanathan

Yes. For some special data operation APIs such as memset(), memcpy(), they may trigger the Writing Steaming Mode for the CPU feature. In this mode, the memory data will not be allocated into cache to avoid the cache pollution because those kind of data usually are used only once. You have to check the Technical Reference Manual of the CPU you are using to know whether it supports Writing Steaming Mode or not.
Cancel
Up 0 Down

Cancel
0 qweiczzz over 5 years ago

It's not easy for me to say what's the exact problem. But I hope you can investigate from below aspects:

1) Can you do the CPU hot-plug-out to shut down all big or LITTLE CPUs, so that the big.LITTLE SoC regresses into a non-big.LITTLE CPU topology?
Then you can test the insert/retrieve time instead of setting the CPU affinity.

2) "Ensure that the thread is running at maximum speed by using following code:" Can you double check that what's the CPU frequency governor in your big.LITTLE system? In the recent Linux kernels, Energy Aware Scheduling (EAS) governor was deployed. If your insert/retrieve operations are distributed into different threads or even in single thread but running in different CPU time, the CPU frequency may be adjusted based on the current CPU workloads.

See https://www.allicdata.com

3) Is it possible for you to add some memory barrier instructions in the high level C++ STL language?
Just to make sure the memory results of insert operations are flushed into memory, not temporarily stored in cache. So that your next retrieve operations may not fetch the data from cache.

4) "To my surprise, retrieve_time2 > retrieve_time > insert_time (by 2 to 3 seconds for 10 million records)."
It depends on your code design. If the data that retrieve_time2 wants to access is stored in the memory but the retrieve_time can access the data from cache, you may observe the result as what you mentioned. In a huge loop ( the loop count is too larger compared to a small TLB/L1 or L2 cache size), the retrieve_time operations may dither the TLB/L1/L2 cache so that some data are flushed out into next level cache or memory.
Cancel
Up 0 Down

Cancel
0 drake00 over 4 years ago

This code likewise considers well big.LITTLE in light of the fact that when I set CPUs as 0, 1, 2, 3, the code runs much slower than when I set CPUs as 4, 5, 6, 7. And still, at the end of the day insert_time < retrieve_time in the two cases.

Guarantee that adequate free RAM is accessible for my dataset

[url=https://appsync.biz/dafont/][color=#000000][size=1]Dafont[/size][/color][/url][url=https://showbox.bio/][color=#000000][size=1]Showbox[/size][/color][/url][url=https://www.oovoo.onl/adam4adam/][color=#000000][size=1]Adam4adam[/size][/color][/url]

To maintain a strategic distance from the likelihood that Step 3 may recover from virtual memory, I included Step 4, which is simply rehashing Step 3:
Cancel
Up 0 Down

Cancel
0 Arundale Ramanathan over 4 years ago in reply to drake00

This answer is not clear to me. Please explain.
Cancel
Up 0 Down

Cancel
0 jhoney over 4 years ago

Ensure that the thread is running at maximum speed by using following code
Cancel
Up 0 Down

Cancel