Hello,
I am trying to optimize the implementation of a FIR filter. One major improvement would be to use LDM to load all data and coefficients, minimizing the number of cycles spent on memory access.
How can I get the compiler to do this consistently?
    long fir_coef[] = ...;
    long dbuf[] = ...;
    long out_buf;

    void test(void)
    {
        register long c1, c2, c3, c4;
        register long d1, d2, d3, d4;
        register long accu;

        c1 = fir_coef[0];
        c2 = fir_coef[1];
        c3 = fir_coef[2];
        c4 = fir_coef[3];

        d1 = dbuf[0];
        d2 = dbuf[1];
        d3 = dbuf[2];
        d4 = dbuf[3];

        accu = 0;
        accu += c1 * d1;
        accu += c2 * d2;
        accu += c3 * d3;
        accu += c4 * d4;

        out_buf = accu;
    }

With this code, the compiler seems to use LDM only sporadically, loading two registers at once (at -O3), but ideally I would like to see only two LDM instructions for the code above. Can this be done in C, or is it time to get out the assembler?
I unrolled the innermost loops (the filter acts on several channels of data and produces several output samples per call), since the ARM architecture does not have zero-overhead looping.
However, I found that working with incremented pointers actually slows things down, since after every output sample I need to reset the coefficient pointer to the start of the filter. Instead, I used regular indexing:

    acc  = a[0] * b[0];
    acc += a[1] * b[1];
    ...

since it does not matter to the processor whether it writes back the modified address or just uses a temporary index.
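
To make the indexing approach concrete, here is a minimal sketch of a fully unrolled 4-tap filter that produces several output samples per call. The function name `fir_indexed`, its parameters, and the tap count of 4 are assumptions chosen for illustration, not taken from my actual code. It computes out[n] as the sum of coef[k] * in[n + k] for k = 0..3; a different tap ordering would only change the indices.

```c
#include <stddef.h>

/* Hypothetical example: 4-tap FIR with the inner loop fully unrolled.
 * coef: 4 coefficients
 * in:   at least nout + 3 input samples
 * out:  receives nout output samples */
void fir_indexed(const long *coef, const long *in, long *out, size_t nout)
{
    size_t n;

    for (n = 0; n < nout; n++) {
        const long *d = in + n;  /* only the data base moves per sample */
        long acc;

        /* Constant offsets from two fixed base addresses: coef is never
         * incremented, so it never has to be reset between samples. */
        acc  = coef[0] * d[0];
        acc += coef[1] * d[1];
        acc += coef[2] * d[2];
        acc += coef[3] * d[3];

        out[n] = acc;
    }
}
```

Whether the compiler turns the four adjacent loads into LDM instructions still depends on the toolchain and options, so the generated assembly needs to be checked.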