Inexplicable performance on big.LITTLE technology (on Android)

I apologise for the long question. I am measuring the performance of different indexing techniques, one of which is the Adaptive Radix Tree (ART), on various platforms.

I have run tests whose basic steps look like this (C/C++):

Step 1: Generate or load data (few million key-value pairs)
Step 2: Insert into index and measure time taken (insert_time)
Step 3: Retrieve from index and measure time taken (retrieve_time)
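
For reference, the measurement pattern in Steps 1 to 3 amounts to something like the sketch below. It is only an illustration: `std::map` stands in for the index under test, and `KeyValues`, `time_seconds`, `BenchResult` and `run_benchmark` are names made up for this example, not part of the real test code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using KeyValues = std::vector<std::pair<std::string, std::string>>;

// Hypothetical helper: runs fn once and returns the elapsed seconds,
// measured with a monotonic clock so wall-clock adjustments cannot skew it.
template <typename Fn>
double time_seconds(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

struct BenchResult {
    double insert_time = 0.0;    // Step 2
    double retrieve_time = 0.0;  // Step 3
    std::size_t hits = 0;        // lookups that found their key
};

// Assumption: std::map is only a stand-in for the index being measured.
BenchResult run_benchmark(const KeyValues& data) {
    std::map<std::string, std::string> index;
    BenchResult r;

    r.insert_time = time_seconds([&] {
        for (const auto& kv : data) index[kv.first] = kv.second;
    });

    r.retrieve_time = time_seconds([&] {
        for (const auto& kv : data) {
            // Counting hits uses each lookup result, so the compiler
            // cannot discard the search as dead code.
            if (index.find(kv.first) != index.end()) ++r.hits;
        }
    });

    assert(r.hits == data.size());  // every inserted key must be found
    return r;
}
```

Calling `run_benchmark(data)` on the loaded key-value pairs returns both timings, which can then be compared per platform.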

On most platforms, such as Intel desktops (i386/amd64), iPad (Apple A9), Android (ARMv7) and Raspberry Pi 3 (ARMv8), I consistently find insert_time > retrieve_time. This is expected, as insertion is more complex than retrieval.

But when I run the same steps on big.LITTLE platforms, specifically Snapdragon 845 (Xiaomi POCO F1) and HiSilicon Kirin 659 (Honor 9 Lite), I find insert_time < retrieve_time, except when the data set is small.

To diagnose what could be wrong, I went through the following steps:

  1. Ensure that the thread is running at maximum speed by using the following code:

    void set_thread_priority() {
        nice(-20);
        int policy = 0;
        struct sched_param param;
        pthread_getschedparam(pthread_self(), &policy, &param);
        param.sched_priority = sched_get_priority_max(policy);
        pthread_setschedparam(pthread_self(), policy, &param);
    }

I could see that the nice value is applied to the process and that the thread runs at 100% CPU in most cases (it is essentially a single-threaded algorithm).
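
The function above ignores every return value, so a variant that reports what each call did may help with diagnosis. The sketch below is a hypothetical checked version (`PriorityReport` and `set_thread_priority_checked` are names invented here). Two points worth keeping in mind: `nice(-20)` usually needs elevated privileges, and under the default Linux `SCHED_OTHER` policy the static priority range is 0 to 0, so asking for the maximum priority changes nothing in that case.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

// Hypothetical report type: records what each scheduling call actually did.
struct PriorityReport {
    int nice_errno = 0;    // 0 if nice(-20) succeeded
    int policy = 0;        // e.g. SCHED_OTHER, SCHED_FIFO, SCHED_RR
    int min_priority = 0;  // lowest static priority for this policy
    int max_priority = 0;  // highest static priority for this policy
    int set_error = 0;     // 0 if the pthread scheduling calls succeeded
};

PriorityReport set_thread_priority_checked() {
    PriorityReport rep;

    // nice() may legitimately return -1, so errno is the only reliable signal.
    errno = 0;
    if (nice(-20) == -1 && errno != 0) rep.nice_errno = errno;

    sched_param param{};
    int rc = pthread_getschedparam(pthread_self(), &rep.policy, &param);
    if (rc != 0) {
        rep.set_error = rc;
        return rep;
    }

    rep.min_priority = sched_get_priority_min(rep.policy);
    rep.max_priority = sched_get_priority_max(rep.policy);
    assert(rep.min_priority <= rep.max_priority);

    param.sched_priority = rep.max_priority;
    rep.set_error = pthread_setschedparam(pthread_self(), rep.policy, &param);

    std::printf("nice: %s, policy=%d, priority range=[%d, %d], set: %s\n",
                rep.nice_errno ? std::strerror(rep.nice_errno) : "ok",
                rep.policy, rep.min_priority, rep.max_priority,
                rep.set_error ? std::strerror(rep.set_error) : "ok");
    return rep;
}
```

If the printed range is `[0, 0]`, the thread is under a policy where only the nice value has any influence.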

  2. Set CPU affinity using the following code:

    void set_affinity() {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(4, &mask);
        CPU_SET(5, &mask);
        CPU_SET(6, &mask);
        CPU_SET(7, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);
    }

This code also takes effect on big.LITTLE: when I set the CPUs to 0, 1, 2, 3, the code runs much slower than when I set them to 4, 5, 6, 7. Even so, insert_time < retrieve_time in both cases.
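
To confirm where the thread really runs, and at what clock, one option is to sample the current CPU and its reported frequency before and after each step. The sketch below is a minimal example of that idea. The function and struct names are hypothetical, and the sysfs path is the standard Linux cpufreq location; whether it is readable on a particular Android device is an assumption.

```cpp
#include <cassert>
#include <fstream>
#include <sched.h>
#include <string>

// Number of CPUs the calling thread is allowed to run on, or -1 on error.
int allowed_cpu_count() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return -1;
    return CPU_COUNT(&mask);
}

// Current frequency of the given CPU in kHz, or -1 if the cpufreq sysfs
// entry is missing or unreadable (for example in a sandbox or container).
long cpu_cur_freq_khz(int cpu) {
    assert(cpu >= 0);
    std::ifstream in("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                     "/cpufreq/scaling_cur_freq");
    long khz = -1;
    if (!(in >> khz)) return -1;
    return khz;
}

struct CpuSample {
    int cpu;        // CPU the thread was on when sampled (-1 on error)
    long freq_khz;  // that CPU's reported frequency (-1 if unavailable)
};

// Samples the CPU the calling thread is on right now and its frequency.
CpuSample sample_current_cpu() {
    int cpu = sched_getcpu();
    return {cpu, cpu >= 0 ? cpu_cur_freq_khz(cpu) : -1};
}
```

Logging `sample_current_cpu()` around the insert and retrieve loops would show whether the core or its frequency differs between the two phases.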

  3. Ensure that sufficient free RAM is available for my dataset

  4. To avoid the possibility that Step 3 might retrieve from virtual memory, I added Step 4, which simply repeats Step 3:

    Step 4: Retrieve from index and measure time taken again (retrieve_time2)

To my surprise, retrieve_time2 > retrieve_time > insert_time (by 2 to 3 seconds for 10 million records).
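
The paging concern behind Step 4 could also be checked directly by counting page faults around each step with `getrusage`. The following is a rough sketch under that assumption; `FaultCounts`, `current_faults` and `faults_during` are hypothetical names, and the counters cover the whole process, not only the measuring thread.

```cpp
#include <cassert>
#include <sys/resource.h>

struct FaultCounts {
    long minor;  // faults served without I/O (page already in memory)
    long major;  // faults that needed I/O (page had to be read from storage)
};

// Snapshot of the process-wide page fault counters.
FaultCounts current_faults() {
    struct rusage ru{};
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);  // cannot fail for RUSAGE_SELF with a valid pointer
    (void)rc;
    return {ru.ru_minflt, ru.ru_majflt};
}

// Runs fn once and returns how many faults occurred while it ran.
template <typename Fn>
FaultCounts faults_during(Fn&& fn) {
    const FaultCounts before = current_faults();
    fn();
    const FaultCounts after = current_faults();
    return {after.minor - before.minor, after.major - before.major};
}
```

Wrapping the retrieve loop in `faults_during` and seeing a major fault count near zero would suggest that paging from storage is not what slows retrieval.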

As for my code, the insert code looks like this:

    it1 = m.begin();
    start = getTimeVal();
    for (; it1 != m.end(); ++it1) {
        art_insert(&at, (unsigned char*) it1->first.c_str(),
               (int) it1->first.length() + 1, (void *) it1->second.c_str(),
               (int) it1->second.length());
        ctr++;
    }
    stop = getTimeVal();

and the retrieve code looks like this:

    it1 = m.begin();
    start = getTimeVal();
    for (; it1 != m.end(); ++it1) {
        int len;
        char *value = (char *) art_search(&at,
            (unsigned char*) it1->first.c_str(), (int) it1->first.length() + 1, &len);
        ctr++;
    }
    stop = getTimeVal();

Any pointers as to what I could do further? Or is there an explanation for this from the platform perspective?