Performance of Memory Benchmark Slowly Improves on Cortex-A72

I have a 16-core machine with Cortex-A72 processors. The physical layout is shown at the end of the post. Each core has its own 48KB L1i cache and 32KB L1d cache. Clusters of 4 cores have a shared 2048KB L2 cache, and the machine has 31GB of memory in total.

I have a lightweight C program that runs a simple memory benchmark: it allocates a 400MB buffer and performs (pseudo)random accesses to it. I run this benchmark a few hundred times and record the running time in CPU cycles. The goal is to see how consistent I can get the performance to be. Here is the code in its entirety:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>

#define KILOBYTE 1024
#define MEGABYTE (KILOBYTE * 1024)
#define numTrials 400

/* Small RNG to avoid calling rand() and going through a library. Magic numbers
 * taken from glibc source code.
 */
uint32_t myRand(uint32_t *state){
	uint32_t newVal = ((*state * 1103515245U) + 12345U) & 0x7fffffff;
	*state = newVal;
	return newVal;
}

/**
 * Randomly read values from a buffer.
 */
int memRandAccessBenchmark(char *buf, size_t size){
	unsigned long long x = 0;
	uint32_t state = 1;
	for(size_t ind = 0; ind < 16 * 1024 * 1024; ind++){
		x = ((x ^ 0x123) + buf[myRand(&state) % size] * 3) % 123456;
	}
	return x;
}

/**
 * Allocate a buffer of size `sz` and populate it with arbitrary data.
 */
char *allocAndInitBuf(size_t sz){
	char *buf = calloc(sz, 1);
	if(buf == NULL){
		fprintf(stderr, "Failed to allocate %zu bytes\n", sz);
		exit(EXIT_FAILURE);
	}
	for(size_t ind = 0; ind < sz; ind++){
		buf[ind] = ind % 255;
	}
	return buf;
}

int main(){
	size_t sz = 400 * MEGABYTE;
	char *mainBuf = allocAndInitBuf(sz);
	printf("Running %d trials of sz %lu\n", numTrials, sz);

	uint64_t
		timeStart[numTrials] = {0},
		timeEnd[numTrials] = {0};

	uint64_t val;
	// Configure PMCR_EL0: clear D (bit 3, the divide-by-64 clock divider) so
	// PMCCNTR_EL0 counts every cycle, and set LC (bit 6) for 64-bit overflow behavior.
	asm volatile("mrs %0, pmcr_el0" : "=r" (val));
	asm volatile("msr pmcr_el0, %0" : : "r"
		((val & ~(((uint64_t)1) << 3)) | (((uint64_t)1) << 6)));

	// Run once to warm up the i-cache/d-cache.
	int accumulator = memRandAccessBenchmark(mainBuf, sz);

	for(int trial = 0; trial < numTrials; trial++){
		// Reset the event counters and the cycle counter (PMCR_EL0.P and PMCR_EL0.C).
		asm volatile("mrs %0, pmcr_el0" : "=r" (val));
		asm volatile("msr pmcr_el0, %0" : : "r" ((val | (((uint64_t)1) << 1) | (((uint64_t)1) << 2))));

		asm volatile("isb; mrs %0, PMCCNTR_EL0" : "=r" (timeStart[trial]));
		accumulator += memRandAccessBenchmark(mainBuf, sz);
		asm volatile("isb; mrs %0, PMCCNTR_EL0" : "=r" (timeEnd[trial]));
	}

	for(int ind = 0; ind < numTrials; ind++){
		uint64_t
			start = timeStart[ind],
			end = timeEnd[ind];

		if(start > end){
			printf("%" PRId64 " > %" PRId64 " (overflow detected)\n", start, end);
			return 1;
		}

		printf("%" PRId64 "\n", end - start);
	}
	printf("Accumulator value: %d\n", accumulator);
	return EXIT_SUCCESS;
}

main() allocates the buffer via allocAndInitBuf(), and runs the memRandAccessBenchmark() benchmark on it 400 times. Note that each run of memRandAccessBenchmark() does the exact same random accesses, because we seed the RNG at the beginning of the function with a value of 1! What's extremely surprising to me is the resulting behavior:

The above graph plots the time taken by each memRandAccessBenchmark() call, and it gradually speeds up! This would be less surprising if the improvement happened over the first 2 or 3 runs (maybe the caches take a little while to settle), but I can't imagine why performance would keep improving over several dozen runs of the function. Is it the caches? That doesn't make sense to me: in my simple mental model, the caches should reach a steady state after just one or two runs, since every run performs exactly the same accesses. Besides, the 400MB working set is about 200 times the size of the 2MB shared L2, so the vast majority of the random accesses should miss the caches no matter what. I'm not sure how to figure out what's causing this slow improvement.
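
One way to narrow this down might be to program one of the PMU event counters alongside the cycle counter, e.g. L2 cache refills (architectural event 0x17) or L1 data TLB refills (event 0x05), and check whether that count drifts downward across trials the same way the cycle count does. Here is a minimal sketch of what that could look like, assuming user-space access to the PMU is already enabled (it must be, or the PMCCNTR_EL0 reads above would fault); the helper names are just illustrative:

/**
 * Program PMU event counter 0 to count the given architectural event
 * (e.g. 0x17 = L2D_CACHE_REFILL, 0x05 = L1D_TLB_REFILL) and enable it.
 */
void setupEventCounter0(uint64_t event){
	// Select event counter 0 and program the event it should count.
	asm volatile("msr pmselr_el0, %0" : : "r" ((uint64_t)0));
	asm volatile("isb");
	asm volatile("msr pmxevtyper_el0, %0" : : "r" (event));
	// Enable counter 0 (bit 0 of PMCNTENSET_EL0).
	asm volatile("msr pmcntenset_el0, %0" : : "r" ((uint64_t)1));
	asm volatile("isb");
}

/**
 * Read PMU event counter 0 (PMSELR_EL0 still selects counter 0 from setup).
 */
uint64_t readEventCounter0(void){
	uint64_t count;
	asm volatile("isb; mrs %0, pmxevcntr_el0" : "=r" (count));
	return count;
}

The PMCR_EL0.P write at the top of each trial already resets the event counters, so calling setupEventCounter0() once before the trial loop and recording readEventCounter0() right after each memRandAccessBenchmark() call would give a per-trial count. If that count stays flat while the cycle count keeps dropping, the drift is probably not coming from cache or TLB behavior; if it drifts the same way, that would point more directly at the memory hierarchy.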

The testing environment

The machine has 16 cores, of which cores 8-15 are isolated with isolcpus. Those cores also have nohz_full enabled (roughly speaking, they don't process timer ticks). All kernel workqueues and movable IRQs are pinned to core 1, and the benchmark is run on one of the isolated cores. A few necessary system processes are running; everything else is disabled. This setup generally gives me extremely consistent numbers: for a CPU-bound workload (e.g. a simple for loop doing some math), I see 0-cycle variability between runs. All this is to say that the numbers above should be meaningful.
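
For anyone trying to reproduce a setup along these lines: the isolation described above corresponds to boot parameters of roughly the form isolcpus=8-15 nohz_full=8-15, with the benchmark launched pinned to one of the isolated cores (e.g. with taskset -c <core>). The pinning can also be done from inside the program; a minimal sketch using sched_setaffinity(), with the core number as a placeholder:

/* Pin the calling process to a single core. _GNU_SOURCE must be defined
 * before any #include for cpu_set_t, CPU_ZERO, and CPU_SET to be visible.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pinToCore(int core){
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(core, &set);
	// A pid of 0 means "the calling thread".
	if(sched_setaffinity(0, sizeof(set), &set) != 0){
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}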

Machine Layout: