
Problem with _global_reg()

Hi guys,

I use a _global_reg var in my program to avoid save/load:

_global_var uint32_t finished_pixels;

This variable is updated in the ADC interrupt.

In main, I have the following code:

register uint32_t processed_pixels = 0;

finished_pixels = 0;     // Initial global reg

while (processed_pixels >= finished_pixels)
{
     // Process new pixel
     ...
}

But armcc optimizes away all the code in the while loop, i.e., the "// Process new pixel" section.

I tried "volatile _global_var uint32_t finished_pixels;", but the compiler says it has no effect and the same thing happens.

Now I use "volatile register uint32_t processed_pixels = 0;" to avoid the trap, but it looks ugly.

Is that a compiler bug?

Are there any better solutions?

Thanks a lot!

  • Yes, volatile is necessary if you want a variable to hold a sensible value when an interrupt can change it in the middle of your code.

    However, the ARM Information Center says that volatile is ignored for global register variables.

    This also seems rather worrying. If you want to reserve a register like this and have it valid in interrupts, it must be reserved in all user-level code at all times, and in all the library routines as well, such as memcpy and printf. That really means a change to the ABI, recompiling everything, and checking the assembly routines. That is fine for someone like an RTOS developer, but I wouldn't recommend it unless a reserved register is guaranteed as a facility of the OS you are using, and guaranteed not just at the interfaces but never used at all inside assembler routines like memcpy. Typically such a register would hold something like a pointer to thread data, and even then it is only guaranteed at interfaces, not within routines. Using it to hold a counter seems a bit of a waste to me.

    Without volatile, all that is guaranteed is that the register holds the correct value at the points where you call, enter, or return from a subroutine compiled with that option.
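For comparison, here is a minimal sketch of the ordinary alternative: keep the counter in memory, where volatile is honoured. It assumes a plain memory variable instead of a reserved register. `fake_adc_isr` and `drain_pixels` are hypothetical names, and the loop condition is written as `processed_pixels < finished_pixels`, which is a guess at the intent of the original loop.

```c
#include <stdint.h>

/* Counter shared between the interrupt handler and the main loop.
 * It lives in memory, so volatile is honoured and every evaluation
 * of the loop condition performs a real load. */
volatile uint32_t finished_pixels;

/* Hypothetical stand-in for the ADC interrupt handler. */
void fake_adc_isr(void)
{
    finished_pixels = finished_pixels + 1u;
}

/* Process every pixel the handler has reported so far and return
 * the updated processed count. */
uint32_t drain_pixels(uint32_t processed_pixels)
{
    while (processed_pixels < finished_pixels) {
        /* Process new pixel here. */
        processed_pixels++;
    }
    return processed_pixels;
}
```

The cost is one load per test of the condition, which is the save/load the original post wanted to avoid.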

  • Hi daith,

    I use "--global_reg=1,2" in the C/C++ options to make the compiler reserve r4 and r5 for global variables.

    Because my code has to process 1024 points using a specific algorithm within 0.25us, I must use a "naked system".

    I reviewed the assembly code: "volatile register uint32_t processed_pixels = 0;" causes memory save/load for the local variable "processed_pixels".

    That is, "volatile" makes the compiler ignore the "register" that follows it.

    Now I have to use "__ASM("mov r4, #0");" to reset "_global_var uint32_t finished_pixels;" so that the necessary code in the while loop is kept.

    I hope armcc will fix the problem, or that someone can offer a better solution.

    Thanks!

  • If you're not doing anything else you could just stay uninterruptible for 0.25 usec and poll for the values. I'd put limit counts on the wait loops in case something goes wrong; you don't want to stay uninterruptible for too long. I assume there's nothing else critical to be done within 0.25 usec. If it is just about possible to do the job with interrupts, it should be easy this way.

    Um, sorry, I see you said 1024 values in 0.25 usec. That's rather over the top, silly me! I guess you mean each point takes 0.25 usec, so 256 usec in total. That is still less than the critical time for most things, so what I said might still be okay.
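The "limit counts on the wait loops" idea could look roughly like this. It is only a sketch: `sample_ready` is a hypothetical flag standing in for whatever status bit the hardware provides, and `wait_for_sample` is an invented helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status flag; on real hardware this would be a
 * memory-mapped peripheral status register. */
volatile uint32_t sample_ready;

/* Poll until the flag is set or max_spins iterations have passed.
 * Returns true if the flag was seen, false on timeout. */
bool wait_for_sample(uint32_t max_spins)
{
    for (uint32_t spins = 0; spins < max_spins; spins++) {
        if (sample_ready != 0u) {
            sample_ready = 0u;   /* consume the flag */
            return true;
        }
    }
    return false;   /* give up so interrupts can be re-enabled */
}
```

On a timeout the caller would leave the uninterruptible section and report the fault instead of spinning forever.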

  • > I use a _global_reg var in my program to avoid save/load

    > I use "--global_reg=1,2" in C/C++ option to make compiler to reserve r4, r5 for global var.


    Do you have any evidence that you actually need to do this? You don't say what processor you are using, or what frequency it is running at, but this is one of those optimizations which is often more hassle than it is worth.


    Stacking variables on ARM is relatively painless so you are not saving much, and loss of two registers for "normal code" is likely to degrade performance significantly for algorithms with any kind of complexity. You are avoiding save and load on interrupt, but will force functions with a lot of live variables to stack a lot more aggressively during normal execution as they have fewer registers available. In particular this optimization can _force_ more save and load simply because you are giving the compiler less flexibility to schedule registers intelligently so you won't stack your two counters, but you'll end up stacking other variables instead - the end result is much the same.


    If you are processing bulk pixel data I would expect the cost of saving and restoring a couple of registers' worth of data to be in the noise.


    Pete

  • Sorry daith, I made a mistake. As you said, one pixel takes up to 0.25us.

    I use a sub-pixel algorithm to accumulate pixel values and location-weighted pixel values.

    Note: val_sum is a global uint32_t and weighted_sum is a uint64_t variable.

    The assembly code uses a lot of save/load, as follows:

                      ADC1_2_IRQHandler PROC

    ;;;787   

    ;;;788    void ADC1_2_IRQHandler(void)

    000254  b4c0              PUSH     {r6,r7}

    ;;;789    {

    ;;;790      /* Test on ADC end of conversion interrupt */

    ;;;791      // if(ADC_GetITStatus(ADC1, ADC_IT_EOC))

    ;;;792      {

    ;;;793        register uint32_t val;

    ;;;794   

    ;;;795    #if USE_INTERPOLATION

    ;;;796        register uint32_t interpolation;

    ;;;797    #endif

    ;;;798   

    ;;;799        /* Clear ADC end of conversion interrupt -- cleared automatically when ADC1->DR is read */

    ;;;800        // ADC_ClearITPendingBit(ADC1, ADC_IT_EOC)

    ;;;801   

    ;;;802        /* ADC value */

    ;;;803        val = ADC1->DR;

    000256  f04f40a0          MOV      r0,#0x50000000

    00025a  6c00              LDR      r0,[r0,#0x40]

    ;;;804   

    ;;;805    #if USE_INTERPOLATION

    ;;;806        interpolation = (prev_val + val) >> 1;

    ;;;807        prev_val = val;

    ;;;808   

    ;;;809        interpolation *= interpolation;

    ;;;810        val_sum += interpolation;

    ;;;811        curr_pixel++;

    00025c  1c64              ADDS     r4,r4,#1

    00025e  1829              ADDS     r1,r5,r0              ;806

    000260  0849              LSRS     r1,r1,#1              ;806

    000262  fb01f201          MUL      r2,r1,r1              ;809

    000266  4912              LDR      r1,|L1.688|

    000268  4605              MOV      r5,r0                 ;807

    00026a  68cb              LDR      r3,[r1,#0xc]          ;810  ; val_sum

    ;;;812        weighted_sum += interpolation * curr_pixel;

    00026c  e9d1c704          LDRD     r12,r7,[r1,#0x10]

    000270  189e              ADDS     r6,r3,r2              ;810

    000272  fb02f304          MUL      r3,r2,r4

    000276  2200              MOVS     r2,#0

    000278  eb130c0c          ADDS     r12,r3,r12

    00027c  eb420307          ADC      r3,r2,r7

    ;;;813    #endif

    ;;;814   

    ;;;815        val *= val;

    000280  4368              MULS     r0,r5,r0

    ;;;816        val_sum += val;

    000282  4406              ADD      r6,r6,r0

    ;;;817        curr_pixel++;

    000284  1c64              ADDS     r4,r4,#1

    ;;;818        weighted_sum += val * curr_pixel;

    000286  4360              MULS     r0,r4,r0

    000288  eb10000c          ADDS     r0,r0,r12

    00028c  415a              ADCS     r2,r2,r3

    00028e  60ce              STR      r6,[r1,#0xc]  ; val_sum

    ;;;819   

    ;;;820    #if USE_INTERPOLATION

    ;;;821        if (curr_pixel >= (PIXEL_CNT << 1))

    000290  e9c10204          STRD     r0,r2,[r1,#0x10]

    000294  f5b46f00          CMP      r4,#0x800

    ;;;822    #else

    ;;;823        if (curr_pixel >= PIXEL_CNT)

    ;;;824    #endif

    ;;;825        {

    ;;;826          /* Enough pixels have been collected; turn off the ADC interrupt */

    ;;;827          ADC_ITConfig(ADC1, ADC_IT_EOC, DISABLE);

    ;;;828        }

    ;;;829      }

    ;;;830    }

    000298  bf3c              ITT      CC

    00029a  bcc0              POPCC    {r6,r7}

    00029c  4770              BXCC     lr

    00029e  2200              MOVS     r2,#0                 ;827

    0002a0  bcc0              POP      {r6,r7}               ;827

    0002a2  2104              MOVS     r1,#4                 ;827

    0002a4  f04f40a0          MOV      r0,#0x50000000        ;827

    0002a8  f7ffbffe          B.W      ADC_ITConfig

    ;;;831   

                              ENDP

    For an STM32F302CBT6 at 72 MHz, only 93M/s * 0.25us = 23 single-cycle instructions are allowed.

    Also, the raw pixel values have to be sent over the UART in calibration mode.

    So I tried to get the data in the ADC interrupt and process it in the main loop/calibration loop to avoid save/load.

    Could you give me any further suggestions?

    Thanks.
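The per-pixel budget is easy to check with a small calculation. This is a sketch only, and `cycles_in_window` is a hypothetical helper. At a 72 MHz core clock a 0.25us window is 18 core cycles; the figure of 23 corresponds to a rate of 93 M instructions per second. Which of the two rates really applies is not settled in this thread.

```c
#include <stdint.h>

/* Core cycles available in a window of window_ns nanoseconds at a
 * clock of clock_hz, rounded down. The 64-bit intermediate keeps
 * the product from overflowing. */
uint32_t cycles_in_window(uint32_t clock_hz, uint32_t window_ns)
{
    return (uint32_t)(((uint64_t)clock_hz * window_ns) / 1000000000u);
}
```

For example, `cycles_in_window(72000000u, 250u)` gives 18 and `cycles_in_window(93000000u, 250u)` gives 23.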

  • Hi Peter,

    Thanks for your help.

    I use an STM32F302CBT6 at 72 MHz in my research. As I mentioned in my reply to daith, I have to save instructions so that one pixel is processed within 23 instructions.

    Maybe I should use an STM32F2 chip, but at present the F2 family doesn't have an ADC with more than 4Msps at 12-bit resolution.

    Anyway, the experiments are a lot of fun.

  • >If you are processing bulk pixel data I would expect the cost of saving and restoring a couple of registers' worth of data to be in the noise.

    That's true.

    I am now thinking of using DMA for one frame and SIMD to process the data in batches.

    But then I will have a "long" DMA interrupt handler. I don't know whether it is a problem for one interrupt to take almost all the CPU time while it processes a frame.

    Thanks.
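A batch version of the accumulation in the handler listing might look like the sketch below. It follows the non-interpolation path of that listing (square the sample, add it to val_sum, add the square times the 1-based pixel index to weighted_sum). The names `frame_sums_t` and `process_frame` are hypothetical, and both sums are 64-bit here, which is an assumption made to rule out overflow rather than something taken from the original code.

```c
#include <stdint.h>

typedef struct {
    uint64_t val_sum;        /* sum of squared samples */
    uint64_t weighted_sum;   /* sum of squared samples times 1-based index */
} frame_sums_t;

/* Accumulate both sums over one captured frame of ADC samples. */
frame_sums_t process_frame(const uint16_t *samples, uint32_t count)
{
    frame_sums_t sums = { 0u, 0u };

    for (uint32_t i = 0; i < count; i++) {
        uint32_t sq = (uint32_t)samples[i] * samples[i];

        sums.val_sum += sq;
        sums.weighted_sum += (uint64_t)sq * (i + 1u);
    }
    return sums;
}
```

Run over a buffer that DMA has already filled, this keeps the globals in registers for the whole loop instead of loading and storing them once per sample.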

  • Well, yes, SIMD is practically designed for work like this.

    Blocking up data and processing it in batches is normally a very good thing too, though one would normally make the blocks as small as is reasonable, to avoid latency and to take up less space. One doesn't want chunks of work to get bigger than the cache.

    I don't understand why you say much time would be taken up in the interrupt handler if you do that. Wouldn't you use, say, three blocks in a loop, filling one while processing another, with a third free to be filled? The processing of the blocks could be interruptible.
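The three-block rotation could be sketched like this. All names and the block size are hypothetical, and the sketch assumes one producer (for instance a DMA-complete handler that only calls `block_filled`) and one consumer (the main loop) on a single core.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_COUNT 3u
#define BLOCK_LEN   64u   /* hypothetical block size, in samples */

typedef enum { BLOCK_FREE, BLOCK_FILLING, BLOCK_READY } block_state_t;

typedef struct {
    volatile block_state_t state;
    uint16_t samples[BLOCK_LEN];
} block_t;

block_t blocks[BLOCK_COUNT];
uint32_t fill_index;    /* block being filled; producer only */
uint32_t drain_index;   /* next block to process; consumer only */

void blocks_init(void)
{
    for (uint32_t i = 0; i < BLOCK_COUNT; i++) {
        blocks[i].state = BLOCK_FREE;
    }
    fill_index = 0u;
    drain_index = 0u;
    blocks[0].state = BLOCK_FILLING;
}

/* Producer: the current block is full. Returns the buffer to fill
 * next, or NULL if the consumer has fallen behind (overrun), in
 * which case the current block stays in the FILLING state. */
uint16_t *block_filled(void)
{
    uint32_t next = (fill_index + 1u) % BLOCK_COUNT;

    if (blocks[next].state != BLOCK_FREE) {
        return NULL;
    }
    blocks[next].state = BLOCK_FILLING;
    blocks[fill_index].state = BLOCK_READY;
    fill_index = next;
    return blocks[next].samples;
}

/* Consumer: returns the oldest READY block, or NULL if none. */
block_t *block_take(void)
{
    block_t *b = &blocks[drain_index];

    return (b->state == BLOCK_READY) ? b : NULL;
}

/* Consumer: hand a processed block back to the producer. */
void block_release(block_t *b)
{
    b->state = BLOCK_FREE;
    drain_index = (drain_index + 1u) % BLOCK_COUNT;
}
```

The handler then does almost nothing per block, and the long processing runs in the main loop where it can be interrupted.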

  • Yes, double buffering plus DMA will be a good solution.

    But I wonder whether an STM32F302 at 72 MHz can process 1024 pixels within 256us.

    It seems hard, but you two have given me useful suggestions.

    Anyway, is it an armcc compiler bug that a _global_reg(x) variable in a loop condition is optimized as if its value never changes?

    Thanks again.

  • I'm not sure what you are saying about global_reg.

    The timing looks very tight. I think it should be possible. I think it may be possible to time the getting of the ADC values with the DMA which would be nice. If you do two values at a time and pack them in halfwords I think you probably can just about do the job using the SIMD instructions.
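To make the packed-halfword idea concrete, here is a portable model of it. `dual_mac16` is a hypothetical helper that mirrors what the Cortex-M4 SMLAD instruction computes (two signed 16x16 multiplies added to an accumulator). It is plain C, not the instruction itself, so it shows the arithmetic and says nothing about speed. The 64-bit total is an assumption to avoid overflow.

```c
#include <stdint.h>

/* Pack two samples into one word: first in the low halfword,
 * second in the high halfword. */
uint32_t pack_samples(uint16_t lo, uint16_t hi)
{
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Portable model of a dual 16-bit multiply-accumulate:
 * acc + lo(x)*lo(y) + hi(x)*hi(y), halves treated as signed. */
int32_t dual_mac16(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t xl = (int16_t)(x & 0xFFFFu);
    int16_t xh = (int16_t)(x >> 16);
    int16_t yl = (int16_t)(y & 0xFFFFu);
    int16_t yh = (int16_t)(y >> 16);

    return acc + (int32_t)xl * yl + (int32_t)xh * yh;
}

/* Sum of squares over packed words, two 12-bit samples per word. */
uint64_t sum_of_squares_packed(const uint32_t *packed, uint32_t pairs)
{
    uint64_t total = 0u;

    for (uint32_t i = 0; i < pairs; i++) {
        /* One dual MAC squares both samples of the pair. */
        total += (uint32_t)dual_mac16(packed[i], packed[i], 0);
    }
    return total;
}
```

With 12-bit samples both halves are non-negative, so treating them as signed does not change the result.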

  • The problem is that a _global_reg variable can't be specified as volatile.

    Thus, the armcc compiler assumes it never changes.

  • You could talk to them about marking it as volatile, which I think is the cleanest way of saying it could change at any time; as their documentation says, they don't currently support that. I would be rather loath to depend on such a facility unless the ABI and RTOS supported having a register reserved in this way, so I guess they haven't had any requests for it. If you are careful to place your loads and stores so they don't cause holdups, they can be quite cheap. So if in C one had some calculations followed by a test of a volatile variable, the load of the volatile could be moved back over the calculation.