I apologise for the long question, but I am trying to measure the performance of different indexing techniques on various platforms, one of which is the Adaptive Radix Tree (ART).
I have run tests where the basic steps look like this (c/c++):
Step 1: Generate or load the data (a few million key-value pairs)
Step 2: Insert into the index and measure the time taken (insert_time)
Step 3: Retrieve from the index and measure the time taken (retrieve_time)
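Throughout, getTimeVal() is just a wall-clock helper. As a minimal sketch (assuming it wraps clock_gettime() with CLOCK_MONOTONIC and returns milliseconds; the actual implementation may differ):

    #include <time.h>

    /* Hypothetical timing helper: monotonic timestamp in milliseconds. */
    static long getTimeVal() {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
    }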
On most platforms, such as Intel desktops (i386/amd64), iPad (Apple A9), Android (ARMv7) and Raspberry Pi 3 (ARMv8), I consistently find that insert_time > retrieve_time. This is expected, as insert complexity is higher than retrieve complexity.
But when I run the same steps on big.LITTLE platforms, specifically Snapdragon 845 (Xiaomi POCO F1) and HiSilicon Kirin 659 (Honor 9 Lite), I find insert_time < retrieve_time, except when the data size is very small.
To diagnose what could be wrong, I went through the following steps:
Ensure that the thread is running at maximum speed by using the following code:
    #include <unistd.h>
    #include <pthread.h>
    #include <sched.h>

    void set_thread_priority() {
        nice(-20);  /* raise the process priority as far as allowed */
        int policy = 0;
        struct sched_param param;
        pthread_getschedparam(pthread_self(), &policy, &param);
        param.sched_priority = sched_get_priority_max(policy);
        pthread_setschedparam(pthread_self(), policy, &param);
    }
I could see that the nice value was reflected against the process, and the thread ran at 100% CPU in most cases (it is basically a single-threaded algorithm).
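To double-check that these settings took effect, the values can be read back with standard POSIX calls (a small sketch, not part of my benchmark code):

    #include <stdio.h>
    #include <pthread.h>
    #include <sys/resource.h>

    /* Sketch: read back the nice value and scheduler settings to
       confirm set_thread_priority() actually took effect. */
    void check_thread_priority() {
        int nice_val = getpriority(PRIO_PROCESS, 0);  /* expect -20 */
        int policy = 0;
        struct sched_param param;
        pthread_getschedparam(pthread_self(), &policy, &param);
        printf("nice=%d policy=%d priority=%d\n",
               nice_val, policy, param.sched_priority);
    }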
Set CPU affinity using the following code:
    #define _GNU_SOURCE
    #include <sched.h>

    void set_affinity() {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(4, &mask);  /* cores 4-7: the big cluster on these SoCs */
        CPU_SET(5, &mask);
        CPU_SET(6, &mask);
        CPU_SET(7, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);
    }
This also behaves as expected on big.LITTLE: when I set the CPUs to 0, 1, 2, 3, the code runs much slower than when I set them to 4, 5, 6, 7. Even so, insert_time < retrieve_time in both cases.
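To verify which cores the thread is actually allowed to run on, the mask can be read back with sched_getaffinity() (again a small sketch, not part of my benchmark code):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Sketch: print the CPUs the current thread may be scheduled on. */
    void check_affinity() {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        sched_getaffinity(0, sizeof(mask), &mask);
        for (int cpu = 0; cpu < 8; cpu++)
            if (CPU_ISSET(cpu, &mask))
                printf("CPU %d allowed\n", cpu);
    }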
Ensure that sufficient free RAM is available for my dataset.
To avoid the possibility that Step 3 might be retrieving from virtual memory, I added Step 4, which simply repeats Step 3:
Step 4: Retrieve from the index and measure the time taken again (retrieve_time2)
To my surprise, retrieve_time2 > retrieve_time > insert_time (by 2 to 3 seconds for 10 million records).
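One factor I have not fully ruled out is the scheduler migrating the thread between cores mid-run. A minimal sketch (not in my original code) that logs the current core around each timed phase would show whether that happens:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Sketch: log which core the thread is on before/after a phase,
       to detect core migration during the measurement. */
    void log_cpu(const char *phase) {
        printf("%s: running on CPU %d\n", phase, sched_getcpu());
    }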
As for my code, the insert code looks like this:
    it1 = m.begin();
    start = getTimeVal();
    for (; it1 != m.end(); ++it1) {
        art_insert(&at, (unsigned char *) it1->first.c_str(),
                   (int) it1->first.length() + 1,
                   (void *) it1->second.c_str(),
                   (int) it1->second.length());
        ctr++;
    }
    stop = getTimeVal();
and the retrieve code looks like this:
    it1 = m.begin();
    start = getTimeVal();
    for (; it1 != m.end(); ++it1) {
        int len;
        char *value = (char *) art_search(&at,
                (unsigned char *) it1->first.c_str(),
                (int) it1->first.length() + 1, &len);
        ctr++;
    }
    stop = getTimeVal();
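One caveat in this loop: value is never used, so in principle an aggressively optimizing build could discard part of the work, and a failed lookup would also go unnoticed. A small hedged tweak (not in my original code), accumulating something from each result, guards against both:

    size_t checksum = 0;
    it1 = m.begin();
    start = getTimeVal();
    for (; it1 != m.end(); ++it1) {
        int len;
        char *value = (char *) art_search(&at,
                (unsigned char *) it1->first.c_str(),
                (int) it1->first.length() + 1, &len);
        if (value)
            checksum += (size_t) len;  /* keep the result live */
        ctr++;
    }
    stop = getTimeVal();
    printf("checksum=%zu\n", checksum);  /* forces the sum to be kept */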
Any pointers as to what I could do further? Or is there an explanation for this from the platform perspective?
Hi,
It would be interesting to try the Arm Streamline performance analysis tool and see if you can learn more about what is happening.
There are a few blogs about it in the Software Tools area, and a recent one about running on non-rooted Android devices should help:
https://community.arm.com/tools/b/blog/posts/streamline-performance-analyzer-for-android-application-profiling
Let me know if you need any help with Streamline.
Thanks,
Jason