Getting the compiler to consistently use LDM ?

Hello,

I am trying to optimize the implementation of a FIR filter. One major improvement would be to use LDM for loading all data/coefficients, to minimize the number of cycles spent on memory access.

How can I get the compiler to do this consistently ?

long fir_coef[] = ...;
long dbuf[] = ...;
long out_buf;

void test(void)
{
register long c1, c2, c3, c4;
register long d1, d2, d3, d4;
register long accu;

c1 = fir_coef[0];
c2 = fir_coef[1];
c3 = fir_coef[2];
c4 = fir_coef[3];
d1 = dbuf[0];
d2 = dbuf[1];
d3 = dbuf[2];
d4 = dbuf[3];

accu = 0;
accu += c1 * d1;
accu += c2 * d2;
accu += c3 * d3;
accu += c4 * d4;

out_buf = accu;
}

The compiler seems to use LDM only sporadically, to load two registers at once (at -O3), but ideally I would like to see only two LDM instructions in the above code.

Can this be done in C, or is it time to get out the assembler ?


  • It is not generally a good idea to declare variables with the register qualifier. The RV compiler is a very good optimizing compiler, and you will hamper its decisions when you force register variables.

    Your code needs 12 registers if it holds everything in the cpu context. The compiler uses some registers to hold memory base pointers and interworking veneers, so it might be deciding not to load all the coefficients at once.

    If you want to go to assembly, you have a few alternatives: a) you can use the inline assembler and code a 'virtual registers' version of the algorithm, since it is a very straightforward computation, or b) you can write it as an assembly function, and use the full set of cpu registers for the filter computation. The compiler will save the needed registers prior to calling the assembly module.

  • It is not generally a good idea to declare variables with the register qualifier. The RV compiler is a very good optimizing compiler, and you will hamper its decisions when you force register variables.

    I believe the compiler will ignore the register qualifier anyway. I put it there to show which variables I'd like to see put in registers.

    If you want to go to assembly, you have a few alternatives: a) you can use the inline assembler and code a 'virtual registers' version of the algorithm, since it is a very straightforward computation,

    Is there any explanation in the docs about these "virtual registers" ? Any time I try to use inline assembly, I keep running into

    main01.c(242): warning:  #d1267-D: Implicit physical register R0 should be defined as a variable
    

    warnings, and an

    main01.c(242): error  #549: variable "R0" is used before its value is set
    

    error when I try to access one of the function arguments which should be passed in R0. I believe this has something to do with "virtual registers", or do I need to do anything to be able to use the "raw" registers in my inline assembly (like all of the examples seem to do) ?

  • Ok, I think I found it. I should have checked the RealView docs earlier instead of looking at the CARM docs.

    Silly me.

  • Ok, after playing around with the source code for a while, the "pure C" version of the filter runs 4% faster than the version with optimized embedded C.

    Back to the drawing board.

  • It might sometimes be good to do something like:

    acc += *a++ * *b++;
    acc += *a++ * *b++;
    acc += *a++ * *b++;
    acc += *a++ * *b++;
    

    when filtering data. A lot depends on what external loops you need, i.e. how much code is part of a filter kernel in comparison to the number of iterations over the sample data.
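
For reference, the post-increment form above could be wrapped into a small kernel along these lines. This is only a sketch: the function name fir4_ptr and the four-tap length are assumptions made for illustration, and whether the compiler emits LDM or a series of post-indexed LDR instructions for it depends on the compiler and optimization level.

```c
/* Hypothetical helper (name and tap count are assumptions, not from the thread):
 * 4-tap dot product using post-incremented pointers, fully unrolled. */
static long fir4_ptr(const long *a, const long *b)
{
    long acc = 0;

    /* Each line reads one coefficient and one sample, then advances both pointers. */
    acc += *a++ * *b++;
    acc += *a++ * *b++;
    acc += *a++ * *b++;
    acc += *a++ * *b++;

    /* a and b are local copies, so the caller's pointers are unchanged. */
    return acc;
}
```

Called with coefficients {1, 2, 3, 4} and samples {10, 20, 30, 40}, this returns 1*10 + 2*20 + 3*30 + 4*40 = 300.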

  • I unrolled the innermost loops (the filter acts on several channels of data, and produces several output samples per call), since the ARM architecture does not have zero-overhead looping.

    However, I found working with pointers that are incremented actually slows things down since after every output sample, I need to reset the coefficient pointer to the start of the filter. Instead, I used regular indexing:

    acc = a[0] * b[0];
    acc += a[1] * b[1];
    ...
    

    since it does not matter to the processor whether it writes back the modified address or just uses a temporary index.
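
A minimal sketch of that indexed approach for several output samples per call might look like the following. The names fir4_block and NTAPS, the four-tap length, and the sliding-window layout of the input are assumptions for illustration; the actual multi-channel code is not shown in the thread.

```c
#include <stddef.h>

#define NTAPS 4  /* tap count assumed for illustration */

/* Hypothetical helper: computes nout output samples.
 * 'in' must hold at least nout + NTAPS - 1 samples. */
static void fir4_block(const long *coef, const long *in, long *out, size_t nout)
{
    size_t n;

    for (n = 0; n < nout; n++) {
        const long *b = in + n;  /* start of the data window for this output */
        long acc;

        /* Constant indices: coef is never modified, so there is no
         * pointer to reset between output samples. */
        acc  = coef[0] * b[0];
        acc += coef[1] * b[1];
        acc += coef[2] * b[2];
        acc += coef[3] * b[3];

        out[n] = acc;
    }
}
```

With coefficients {1, 2, 3, 4} and input {1, 1, 1, 1, 2, 2}, three output samples come out as 10, 14 and 17. Because the indices are compile-time constants, each load can use a fixed offset from an unchanged base address.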