I apologise for the long question, but I am trying to measure performance of different indexing techniques on various platforms, one of which is Adaptive Radix tree.
I have run tests where the basic steps look like this (c/c++):
Step 1: Generate or load data (few million key-value pairs) Step 2: Insert into index and measure time taken (insert_time) Step 3: Retrieve from index and measure time taken (retrieve_time)
I find that always insert_time > retrieve_time on most platforms such as Intel desktops (i386/amd64), iPad (Apple A9), Android (ARMv7) and Raspberry Pi 3 (ARMv8). This is expected, as insert complexity is higher than retrieve complexity.
But when I run the steps on big.LITTLE platforms, specifically Snapdragon 845 (Xiaomi POCO F1) and HiSilicon Kirin 659 (Honor 9 lite), I find insert_time < retrieve_time, except when data size is too low.
To diagnose what could be wrong, I went through the following steps:
Ensure that the thread is running at maximum speed by using following code:
void set_thread_priority() { nice(-20); int policy = 0; struct sched_param param; pthread_getschedparam(pthread_self(), &policy, ¶m); param.sched_priority = sched_get_priority_max(policy); pthread_setschedparam(pthread_self(), policy, ¶m); }
I could see that the nice value is reflected against the process and the thread runs 100% CPU in most cases (it is basically single thread algorithm).
Set CPU affinity using following code:
void set_affinity() { cpu_set_t mask; CPU_ZERO(&mask); CPU_SET(4, &mask); CPU_SET(5, &mask); CPU_SET(6, &mask); CPU_SET(7, &mask); sched_setaffinity(0, sizeof(mask), &mask); }
This code also reflects well on big.LITTLE because when I set CPUs as 0, 1, 2, 3, the code runs much slower than when I set CPUs as 4, 5, 6, 7. Even then insert_time < retrieve_time in both cases.
Ensure that sufficient free RAM is available for my dataset
To avoid the possibility that Step 3 might retrieve from virtual memory, I added Step 4, which is just repeating Step 3:
Step 4: Retrieve from index and measure time taken again (retrieve_time2)
To my surprise, retrieve_time2 > retrieve_time > insert_time (by 2 to 3 seconds for 10 million records).
As for my code, the insert code looks like this:
it1 = m.begin(); start = getTimeVal(); for (; it1 != m.end(); ++it1) { art_insert(&at, (unsigned char*) it1->first.c_str(), (int) it1->first.length() + 1, (void *) it1->second.c_str(), (int) it1->second.length()); ctr++; } stop = getTimeVal();
and retrieve code looks like this:
it1 = m.begin(); start = getTimeVal(); for (; it1 != m.end(); ++it1) { int len; char *value = (char *) art_search(&at, (unsigned char*) it1->first.c_str(), (int) it1->first.length() + 1, &len); ctr++; } stop = getTimeVal();
Any pointers as to what I could do further? Or is there an explanation for this from the platform perspective?
Yes. For some special data operation APIs such as memset(), memcpy(), they may trigger the Writing Steaming Mode for the CPU feature. In this mode, the memory data will not be allocated into cache to avoid the cache pollution because those kind of data usually are used only once. You have to check the Technical Reference Manual of the CPU you are using to know whether it supports Writing Steaming Mode or not.