I apologise for the long question, but I am trying to measure performance of different indexing techniques on various platforms, one of which is Adaptive Radix tree.
I have run tests where the basic steps look like this (c/c++):
Step 1: Generate or load data (few million key-value pairs) Step 2: Insert into index and measure time taken (insert_time) Step 3: Retrieve from index and measure time taken (retrieve_time)
I find that always insert_time > retrieve_time on most platforms such as Intel desktops (i386/amd64), iPad (Apple A9), Android (ARMv7) and Raspberry Pi 3 (ARMv8). This is expected, as insert complexity is higher than retrieve complexity.
But when I run the steps on big.LITTLE platforms, specifically Snapdragon 845 (Xiaomi POCO F1) and HiSilicon Kirin 659 (Honor 9 lite), I find insert_time < retrieve_time, except when data size is too low.
To diagnose what could be wrong, I went through the following steps:
Ensure that the thread is running at maximum speed by using following code:
void set_thread_priority() { nice(-20); int policy = 0; struct sched_param param; pthread_getschedparam(pthread_self(), &policy, ¶m); param.sched_priority = sched_get_priority_max(policy); pthread_setschedparam(pthread_self(), policy, ¶m); }
I could see that the nice value is reflected against the process and the thread runs 100% CPU in most cases (it is basically single thread algorithm).
Set CPU affinity using following code:
void set_affinity() { cpu_set_t mask; CPU_ZERO(&mask); CPU_SET(4, &mask); CPU_SET(5, &mask); CPU_SET(6, &mask); CPU_SET(7, &mask); sched_setaffinity(0, sizeof(mask), &mask); }
This code also reflects well on big.LITTLE because when I set CPUs as 0, 1, 2, 3, the code runs much slower than when I set CPUs as 4, 5, 6, 7. Even then insert_time < retrieve_time in both cases.
Ensure that sufficient free RAM is available for my dataset
To avoid the possibility that Step 3 might retrieve from virtual memory, I added Step 4, which is just repeating Step 3:
Step 4: Retrieve from index and measure time taken again (retrieve_time2)
To my surprise, retrieve_time2 > retrieve_time > insert_time (by 2 to 3 seconds for 10 million records).
As for my code, the insert code looks like this:
it1 = m.begin(); start = getTimeVal(); for (; it1 != m.end(); ++it1) { art_insert(&at, (unsigned char*) it1->first.c_str(), (int) it1->first.length() + 1, (void *) it1->second.c_str(), (int) it1->second.length()); ctr++; } stop = getTimeVal();
and retrieve code looks like this:
it1 = m.begin(); start = getTimeVal(); for (; it1 != m.end(); ++it1) { int len; char *value = (char *) art_search(&at, (unsigned char*) it1->first.c_str(), (int) it1->first.length() + 1, &len); ctr++; } stop = getTimeVal();
Any pointers as to what I could do further? Or is there an explanation for this from the platform perspective?
This answer is not clear to me. Please explain.