I apologise for the long question, but I am trying to measure the performance of different indexing techniques on various platforms, one of which is the Adaptive Radix Tree (ART).
I have run tests where the basic steps look like this (c/c++):
Step 1: Generate or load data (a few million key-value pairs)
Step 2: Insert into index and measure time taken (insert_time)
Step 3: Retrieve from index and measure time taken (retrieve_time)
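For reference, the three steps can be sketched like this, using std::unordered_map as a stand-in for the real index and std::chrono in place of my getTimeVal(); the function and data-set names here are illustrative, not the actual benchmark code:

```cpp
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Step 1: generate synthetic key-value pairs.
std::vector<std::pair<std::string, std::string>> make_data(size_t n) {
    std::vector<std::pair<std::string, std::string>> data;
    data.reserve(n);
    for (size_t i = 0; i < n; i++)
        data.emplace_back("key" + std::to_string(i), "val" + std::to_string(i));
    return data;
}

// Steps 2 and 3: insert, then retrieve, timing each phase separately.
// Returns the number of keys found, so the lookup loop cannot be
// optimised away by the compiler.
size_t run_benchmark(const std::vector<std::pair<std::string, std::string>> &data,
                     double *insert_ms, double *retrieve_ms) {
    std::unordered_map<std::string, std::string> index;  // stand-in for ART

    auto t0 = Clock::now();
    for (const auto &kv : data)                          // Step 2
        index.emplace(kv.first, kv.second);
    auto t1 = Clock::now();

    size_t found = 0;
    for (const auto &kv : data)                          // Step 3
        found += index.count(kv.first);
    auto t2 = Clock::now();

    *insert_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    *retrieve_ms = std::chrono::duration<double, std::milli>(t2 - t1).count();
    return found;
}
```

The real tests swap the stand-in container for ART (and the other index structures) and scale n up to millions of records.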
I find that insert_time > retrieve_time on most platforms, such as Intel desktops (i386/amd64), the iPad (Apple A9), Android (ARMv7), and the Raspberry Pi 3 (ARMv8). This is expected, as insert complexity is higher than retrieve complexity.
But when I run the steps on big.LITTLE platforms, specifically the Snapdragon 845 (Xiaomi POCO F1) and the HiSilicon Kirin 659 (Honor 9 Lite), I find insert_time < retrieve_time, except when the data size is very small.
To diagnose what could be wrong, I went through the following steps:
Ensure that the thread is running at maximum priority by using the following code:
void set_thread_priority() {
    nice(-20);
    int policy = 0;
    struct sched_param param;
    pthread_getschedparam(pthread_self(), &policy, &param);
    param.sched_priority = sched_get_priority_max(policy);
    pthread_setschedparam(pthread_self(), policy, &param);
}
I could see that the nice value is reflected against the process, and the thread runs at 100% CPU in most cases (it is basically a single-threaded algorithm).
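As a sanity check, whether nice(-20) actually took effect can be confirmed by reading the value back; on Android, unprivileged processes may be refused negative nice values without any error being reported by nice() itself. A small sketch (the function name is mine, not part of the benchmark):

```cpp
#include <cerrno>
#include <sys/resource.h>
#include <unistd.h>

// Try to raise priority, then read back the effective nice value.
// Lower means higher priority; -20 is the maximum boost.
int effective_nice_after_boost() {
    nice(-20);  // may silently fail without CAP_SYS_NICE / root
    errno = 0;
    // getpriority() can legitimately return -1, so errno must be
    // cleared first and checked afterwards.
    int n = getpriority(PRIO_PROCESS, 0);
    if (errno != 0)
        return 0;  // could not read it back; assume the default
    return n;
}
```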
Set CPU affinity using the following code:
void set_affinity() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(4, &mask);
    CPU_SET(5, &mask);
    CPU_SET(6, &mask);
    CPU_SET(7, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}
This code also behaves as expected on big.LITTLE: when I set the CPUs to 0, 1, 2, 3, the code runs much slower than when I set them to 4, 5, 6, 7. Even so, insert_time < retrieve_time in both cases.
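One detail worth checking is that sched_setaffinity actually succeeded: its return value is ignored in the snippet above, and on some Android kernels pinning to certain cores fails with EPERM or EINVAL. A sketch that sets the mask and reads it back (pinning to CPU 0 here only so the example runs on any machine; the big cores would be 4-7 on the SDM845):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>

// Pin the calling thread to a single CPU and confirm the kernel
// accepted it, instead of silently ignoring the request.
bool pin_and_verify(int cpu) {
    cpu_set_t want, got;
    CPU_ZERO(&want);
    CPU_SET(cpu, &want);
    if (sched_setaffinity(0, sizeof(want), &want) != 0)
        return false;  // e.g. EPERM/EINVAL on restricted cores
    CPU_ZERO(&got);
    if (sched_getaffinity(0, sizeof(got), &got) != 0)
        return false;
    return CPU_ISSET(cpu, &got);
}
```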
Ensure that sufficient free RAM is available for my dataset
To avoid the possibility that Step 3 might retrieve from virtual memory, I added Step 4, which is just repeating Step 3:
Step 4: Retrieve from index and measure time taken again (retrieve_time2)
To my surprise, retrieve_time2 > retrieve_time > insert_time (by 2 to 3 seconds for 10 million records).
As for my code, the insert code looks like this:
it1 = m.begin();
start = getTimeVal();
for (; it1 != m.end(); ++it1) {
    art_insert(&at, (unsigned char *) it1->first.c_str(),
               (int) it1->first.length() + 1,
               (void *) it1->second.c_str(),
               (int) it1->second.length());
    ctr++;
}
stop = getTimeVal();
and retrieve code looks like this:
it1 = m.begin();
start = getTimeVal();
for (; it1 != m.end(); ++it1) {
    int len;
    char *value = (char *) art_search(&at, (unsigned char *) it1->first.c_str(),
                                      (int) it1->first.length() + 1, &len);
    ctr++;
}
stop = getTimeVal();
Any pointers as to what I could do further? Or is there an explanation for this from the platform perspective?
Dear Zhifei Yang - Thank you for the considered and categorical response.
About (1) and (2), I am unable to hotplug out CPUs or change the governor policy as I don't have root access on the new phone (Xiaomi Poco F1). However, I have read access to /sys/bus/cpu/devices/cpu?/cpufreq and I found some interesting things that may throw light on what is happening:
Since it appears that the CPU runs at a constant speed (although not the maximum 2.8 GHz), can I conclude that the problem is not caused by the CPU?
About (3), I can completely rule out reading from cache, as I insert into multiple different index technologies (not just ART) before retrieving. So, considering the size of the data set, there cannot be any inserted data remaining in the cache.
Another observation is that, when I use the top utility, the VIRT column initially shows 4.5G and after insert it shows 5.4G. My dataset, including the source buffer and index, does not exceed 2 GB. The phone has 6 GB of RAM and 1.4 GB free after insert.
I don't understand why the VIRT column increases even though I have enough free RAM. That is why I retrieve twice, to rule out reading from virtual memory, but that also does not improve the retrieve speed.
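One way to tell whether the VIRT growth corresponds to memory that is actually paged in is to compare VmSize (address space reserved) against VmRSS (pages resident in RAM) in /proc/self/status: allocators commonly reserve large arenas that are never touched, so a growing VIRT by itself does not imply swapping. A Linux-only sketch (field names as documented in proc(5)):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Read a "VmSize:" / "VmRSS:" style field (in kB) from /proc/self/status.
// Returns -1 if the field is not found (e.g. on non-Linux systems).
long read_status_kb(const std::string &field) {
    std::ifstream f("/proc/self/status");
    std::string line;
    while (std::getline(f, line)) {
        if (line.compare(0, field.size(), field) == 0) {
            std::istringstream ss(line.substr(field.size()));
            long kb = -1;
            ss >> kb;  // skips the leading whitespace, reads the number
            return kb;
        }
    }
    return -1;
}
```

Sampling VmRSS before and after the insert phase would show whether the extra gigabyte is resident data or just reserved address space.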
Could it be because of the type of memory caches used in the SDM845? Or could it be because of the type of low-power memory used (LPDDR4X)?
I am stuck because I am unable to explain this anomaly on SDM845. Please let me know if you have any ideas.
Regards
Arun
Regarding (2): "schedutil" is the name of the EAS CPU governor, which means your CPU frequency is managed dynamically by EAS. If your CPUs are running different workloads (the workload is calculated automatically by the EAS algorithm), your CPU frequencies may vary over time, so your benchmark tests may run at different CPU frequencies.
Hi Zhifei Yang,
Thank you for replying. From what I see in stats/time_in_state and stats/total_trans on Linux, there seem to be no frequency transitions (total_trans does not change). So I am not sure that EAS is varying the frequencies.
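For anyone wanting to reproduce this check: total_trans is a cumulative transition counter per CPU, so sampling it immediately before and after a benchmark run shows whether the governor changed frequency at all during the run. A small sketch (reads need no root, but the kernel must expose cpufreq stats; the function returns -1 when the file is absent):

```cpp
#include <fstream>
#include <string>

// Read the cumulative frequency-transition count for one CPU from the
// cpufreq stats. Returns -1 if the file is unavailable (e.g. cpufreq
// stats not compiled into the kernel, or no such CPU).
long read_total_trans(int cpu) {
    std::string path = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                       "/cpufreq/stats/total_trans";
    std::ifstream f(path);
    long n = -1;
    f >> n;
    return n;
}
```

If read_total_trans(4) returns the same value before and after the run, the big core did not change frequency while the benchmark was executing.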
My insert operations mainly depend on memcpy. Is it possible that memcpy writes directly to external memory without first bringing the data into the cache? If so, the behaviour is explained, because my retrieve operations have to bring the data into the cache and would be slower.
Yes. Some bulk data operations, such as memset() and memcpy(), may trigger the CPU's write streaming mode. In this mode, the written data is not allocated into the cache, to avoid cache pollution, because such data is usually used only once. You have to check the Technical Reference Manual of the CPU you are using to know whether it supports write streaming mode.
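Write streaming mode cannot be toggled from user space, and whether a given core enters it is hardware-dependent, but its effect can be observed by timing a large bulk write against the first read pass over the same buffer: if the writes streamed past the cache, the read pass must fetch everything from DRAM. A sketch that reports both timings rather than asserting a relation between them, since the behaviour varies by core:

```cpp
#include <chrono>
#include <cstring>
#include <vector>

// Time a bulk write (a candidate for write streaming / non-allocating
// stores) and then a first read of the same buffer. The checksum keeps
// the read loop from being optimised away.
void write_then_read_ms(size_t bytes, double *write_ms, double *read_ms,
                        unsigned long *checksum) {
    std::vector<unsigned char> buf(bytes);
    auto t0 = std::chrono::steady_clock::now();
    std::memset(buf.data(), 0x5a, bytes);  // bulk write
    auto t1 = std::chrono::steady_clock::now();
    unsigned long sum = 0;
    for (size_t i = 0; i < bytes; i++)
        sum += buf[i];                     // first read pass
    auto t2 = std::chrono::steady_clock::now();
    *write_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    *read_ms = std::chrono::duration<double, std::milli>(t2 - t1).count();
    *checksum = sum;
}
```

On a core that streams the writes, the read pass should be noticeably slower than the write pass for buffers much larger than the last-level cache.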