Problem with _global_reg()

HE Bin over 12 years ago

Hi guys,

I use a _global_reg var in my program to avoid save/load:

_global_var uint32_t finished_pixels;

This variable is updated in the ADC interrupt.

In main, I have the following code:

finished_pixels = 0; // Initial global reg

while (processed_pixels >= finished_pixels)
{
    // Process new pixel
    ...
}

But armcc optimizes away all the code in the while loop, i.e. the "// Process new pixel" section.

I tried "volatile _global_var uint32_t finished_pixels;", but the compiler says it has no effect, and the same thing happens.

Now I use "volatile register uint32_t processed_pixels = 0;" to avoid the trap, but it looks ugly.

Is that a compiler bug?

Any better solutions?

Thanks a lot!

daith over 12 years ago

Yes, volatile is necessary if you want a variable to be handled correctly when an interrupt can modify it.
However, the ARM Information Center says volatile is ignored for global register variables.
Also, this all seems rather worrying. If you want to reserve a register like this and have it valid in interrupts, you need to have it reserved in all the user-level code at all times and in all the library routines as well, like memcpy and printf, which really means a change to the ABI, recompiling everything, and checking assembly routines. That is fine for someone like an RTOS developer, but I wouldn't recommend it unless a reserved register is guaranteed as a facility in the OS you are using, and not just guaranteed at the interfaces but actually not used at all within assembler routines like memcpy. Typically such a register would hold something like a pointer to thread data, and even then it is only guaranteed at interfaces, not within routines; using it to hold a counter seems a bit of a waste to me.
Without volatile, all that is guaranteed is that the register holds the correct value at the points where you call, enter, or return from a subroutine compiled with that option.
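For comparison, here is a minimal sketch of the conventional alternative: a volatile counter kept in memory rather than in a reserved register. It does not avoid the load/store the original poster wants to eliminate; it only shows why volatile keeps the loop test alive. The names fake_adc_isr and process_available are hypothetical, and the "interrupt handler" is an ordinary function so the code can run on a host.

```c
#include <stdint.h>

/* Written by the (simulated) interrupt handler, read by the main loop.
 * volatile forces the compiler to re-read it from memory on every access,
 * so the loop condition below cannot be folded to a constant. */
static volatile uint32_t finished_pixels = 0;
static uint32_t processed_pixels = 0;

/* Hypothetical stand-in for the ADC interrupt handler. */
void fake_adc_isr(void)
{
    finished_pixels++;
}

/* Process every pixel delivered so far.
 * Returns the number of pixels handled in this call. */
uint32_t process_available(void)
{
    uint32_t handled = 0;
    while (processed_pixels < finished_pixels) {
        /* Process new pixel here. */
        processed_pixels++;
        handled++;
    }
    return handled;
}
```

Calling fake_adc_isr() three times and then process_available() once handles three pixels; a second call handles none.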

HE Bin over 12 years ago in reply to daith

Hi daith,
I use "--global_reg=1,2" in the C/C++ options to make the compiler reserve r4 and r5 for global variables.
Because my code has to process 1024 points using a specific algorithm within 0.25us, I must use a "naked system".
I reviewed the assembly code: "volatile register uint32_t processed_pixels = 0;" causes memory save/load for the local variable "processed_pixels".
That is, "volatile" makes the compiler ignore the "register" that follows it.
Now I have to use "__ASM("mov r4, #0");" to reset "_global_var uint32_t finished_pixels;" and keep the code in the while loop, which is necessary.
I hope armcc will fix the problem, or that someone can offer a better solution.
Thanks!

daith over 12 years ago in reply to HE Bin

If you're not doing anything else, you could just stay uninterruptible for 0.25 usec and poll for the values. I'd put limit counts on the wait loops in case something goes wrong, since you don't want to stay uninterruptible for too long. I assume there's nothing else critical to be done within 0.25 usec. If it was just about possible to do the job with interrupts, it should be easy this way.
Um, sorry, I see you said 1024 values in 0.25 usec. That's rather over the top, silly me! I guess you mean each point takes 0.25 usec, so a total of 256 usec. That is still less than the critical time for most things, so what I said might still be okay.
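A rough sketch of a wait loop with a limit count, as suggested above. The callback type ready_fn and the demo stubs are hypothetical; on real hardware the callback would read the ADC end-of-conversion flag, and the caller would disable interrupts around the loop.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback type: returns true once a new sample is ready. */
typedef bool (*ready_fn)(void);

/* Poll ready() at most max_spins times.
 * Returns true as soon as data is ready, false if the limit is reached. */
bool wait_ready(ready_fn ready, uint32_t max_spins)
{
    for (uint32_t i = 0; i < max_spins; i++) {
        if (ready()) {
            return true;
        }
    }
    return false; /* Timed out: the caller can re-enable interrupts and recover. */
}

/* Demo stubs so the sketch runs on a host. */
static uint32_t demo_countdown = 0;

/* Reports "not ready" demo_countdown times, then "ready". */
bool demo_ready_after_countdown(void)
{
    if (demo_countdown == 0) {
        return true;
    }
    demo_countdown--;
    return false;
}

/* Simulates a stuck converter that never delivers a sample. */
bool demo_never_ready(void)
{
    return false;
}
```

The limit turns a hung peripheral into a reported timeout instead of an endless uninterruptible spin.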

Peter Harris over 12 years ago

> I use a _global_reg var in my program to avoid save/load
> I use "--global_reg=1,2" in C/C++ option to make compiler to reserve r4, r5 for global var.

Do you have any evidence that you actually need to do this? You don't say what processor you are using, or what frequency it is running at, but this is one of those optimizations which is often more hassle than it is worth.

Stacking variables on ARM is relatively painless so you are not saving much, and loss of two registers for "normal code" is likely to degrade performance significantly for algorithms with any kind of complexity. You are avoiding save and load on interrupt, but will force functions with a lot of live variables to stack a lot more aggressively during normal execution as they have fewer registers available. In particular this optimization can _force_ more save and load simply because you are giving the compiler less flexibility to schedule registers intelligently so you won't stack your two counters, but you'll end up stacking other variables instead - the end result is much the same.

If you are processing bulk pixel data I would expect the cost of saving and restoring a couple of registers' worth of data to be in the noise.

Pete

HE Bin over 12 years ago in reply to daith

Sorry daith, I made a mistake. As you said, one pixel takes up to 0.25us.
I use a sub-pixel algorithm to accumulate pixel values and location-weighted pixel values.
Note: val_sum is a global uint32_t and weighted_sum is a uint64_t variable.
The assembly code uses a lot of save/load, as follows:
                  ADC1_2_IRQHandler PROC
;;;787
;;;788    void ADC1_2_IRQHandler(void)
000254 b4c0              PUSH     {r6,r7}
;;;789    {
;;;790      /* Test on ADC end of conversion interrupt */
;;;791      // if(ADC_GetITStatus(ADC1, ADC_IT_EOC))
;;;792      {
;;;793        register uint32_t val;
;;;794
;;;795    #if USE_INTERPOLATION
;;;796        register uint32_t interpolation;
;;;797    #endif
;;;798
;;;799        /* Clear ADC end of conversion interrupt -- cleared automatically when ADC1->DR is read */
;;;800        // ADC_ClearITPendingBit(ADC1, ADC_IT_EOC)
;;;801
;;;802        /* ADC value */
;;;803        val = ADC1->DR;
000256 f04f40a0          MOV      r0,#0x50000000
00025a 6c00              LDR      r0,[r0,#0x40]
;;;804
;;;805    #if USE_INTERPOLATION
;;;806        interpolation = (prev_val + val) >> 1;
;;;807        prev_val = val;
;;;808
;;;809        interpolation *= interpolation;
;;;810        val_sum += interpolation;
;;;811        curr_pixel++;
00025c 1c64              ADDS     r4,r4,#1
00025e 1829              ADDS     r1,r5,r0              ;806
000260 0849              LSRS     r1,r1,#1              ;806
000262 fb01f201          MUL      r2,r1,r1              ;809
000266 4912              LDR      r1,|L1.688|
000268 4605              MOV      r5,r0                 ;807
00026a 68cb              LDR      r3,[r1,#0xc]          ;810 ; val_sum
;;;812        weighted_sum += interpolation * curr_pixel;
00026c e9d1c704          LDRD     r12,r7,[r1,#0x10]
000270 189e              ADDS     r6,r3,r2              ;810
000272 fb02f304          MUL      r3,r2,r4
000276 2200              MOVS     r2,#0
000278 eb130c0c          ADDS     r12,r3,r12
00027c eb420307          ADC      r3,r2,r7
;;;813    #endif
;;;814
;;;815        val *= val;
000280 4368              MULS     r0,r5,r0
;;;816        val_sum += val;
000282 4406              ADD      r6,r6,r0
;;;817        curr_pixel++;
000284 1c64              ADDS     r4,r4,#1
;;;818        weighted_sum += val * curr_pixel;
000286 4360              MULS     r0,r4,r0
000288 eb10000c          ADDS     r0,r0,r12
00028c 415a              ADCS     r2,r2,r3
00028e 60ce              STR      r6,[r1,#0xc] ; val_sum
;;;819
;;;820    #if USE_INTERPOLATION
;;;821        if (curr_pixel >= (PIXEL_CNT << 1))
000290 e9c10204          STRD     r0,r2,[r1,#0x10]
000294 f5b46f00          CMP      r4,#0x800
;;;822    #else
;;;823        if (curr_pixel >= PIXEL_CNT)
;;;824    #endif
;;;825        {
;;;826          /* Enough pixels have been acquired; turn off the ADC interrupt */
;;;827          ADC_ITConfig(ADC1, ADC_IT_EOC, DISABLE);
;;;828        }
;;;829      }
;;;830    }
000298 bf3c              ITT      CC
00029a bcc0              POPCC    {r6,r7}
00029c 4770              BXCC     lr
00029e 2200              MOVS     r2,#0                 ;827
0002a0 bcc0              POP      {r6,r7}               ;827
0002a2 2104              MOVS     r1,#4                 ;827
0002a4 f04f40a0          MOV      r0,#0x50000000        ;827
0002a8 f7ffbffe          B.W      ADC_ITConfig
;;;831
                          ENDP
For an STM32F302CBT6 at 72 MHz, only 93M/s * 0.25us = 23 single-cycle instructions are allowed per pixel.
Also, the raw pixel values should be sent to the UART in calibration mode.
So I tried to get the data in the ADC interrupt and process it in the main loop/calibration loop to avoid save/load.
Could you give me further suggestions?
Thanks.
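The per-pixel budget quoted in this post can be checked with a little arithmetic. This sketch assumes the 93M/s figure is an instruction-throughput estimate rather than the clock rate (the post does not say); the helper name budget is hypothetical. At the raw 72 MHz clock, 0.25us is 18 clock cycles.

```c
#include <stdint.h>

/* Whole events (clock cycles or instructions) that fit in a time window.
 * rate_hz:   events per second
 * window_ns: window length in nanoseconds */
uint32_t budget(uint64_t rate_hz, uint64_t window_ns)
{
    return (uint32_t)((rate_hz * window_ns) / 1000000000ULL);
}
```

With these inputs, budget(72000000, 250) gives 18 and budget(93000000, 250) gives 23, so the 23 matches the figure above only if 93M instructions per second is really achieved.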

HE Bin over 12 years ago in reply to Peter Harris

Hi peter,
Thanks for your help.
I use an STM32F302CBT6 at 72 MHz in my research. As I mentioned in the reply to daith, I have to save instructions so that one pixel is processed within 23 instructions.
Maybe I should use an STM32F2 chip, but for now the F2 family doesn't have an ADC with more than 4 Msps at 12-bit resolution.
Anyway, the experiments are a lot of fun.

HE Bin over 12 years ago in reply to Peter Harris

>If you are processing bulk pixel data I would expect the cost of saving and restoring a couple of registers' worth of data to be in the noise.
That's true.
I am now thinking of using DMA for one frame & SIMD to process batch data.
But I will have a "long" DMA interrupt handler. I don't know whether it's a problem for an interrupt to take almost all the CPU time needed to process one frame.
Thanks.

daith over 12 years ago in reply to HE Bin

Well yes, SIMD is practically designed for work like this.
Blocking up data and doing a batch is normally a very good thing too, though one would normally try to make the blocks as small as reasonable to avoid latency delays and to take up less space; one doesn't want chunks of work to get bigger than the cache.
I don't understand why you are saying much time would be taken up in the interrupt handler if you do what you say. Wouldn't you use, say, three blocks in a loop, filling one while processing another, with a third free to be filled? The processing of the blocks could be interruptible.
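The three-block rotation described above could be sketched roughly as follows. All names here (block_ring, ring_next_free, and so on) are hypothetical, BLOCK_LEN is kept tiny for illustration, and the producer side would be the DMA-complete handler on real hardware. This is a single-producer, single-consumer sketch, not a tested driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BLOCKS 3u
#define BLOCK_LEN  4u   /* Tiny block for illustration only. */

typedef struct {
    uint16_t data[NUM_BLOCKS][BLOCK_LEN];
    volatile uint32_t filled;    /* Total blocks committed by the producer. */
    volatile uint32_t consumed;  /* Total blocks released by the consumer.  */
    uint32_t write_idx;          /* Touched by the producer only. */
    uint32_t read_idx;           /* Touched by the consumer only. */
} block_ring;

/* Producer: block to fill next, or NULL if every block is still waiting
 * to be processed (the consumer has fallen behind). */
uint16_t *ring_next_free(block_ring *r)
{
    if (r->filled - r->consumed >= NUM_BLOCKS) {
        return NULL;
    }
    return r->data[r->write_idx];
}

/* Producer: mark the current block as full and advance. */
void ring_commit(block_ring *r)
{
    r->write_idx = (r->write_idx + 1u) % NUM_BLOCKS;
    r->filled++;
}

/* Consumer: oldest full block, or NULL if nothing is ready. */
const uint16_t *ring_next_filled(const block_ring *r)
{
    if (r->filled == r->consumed) {
        return NULL;
    }
    return r->data[r->read_idx];
}

/* Consumer: hand the oldest block back to the producer. */
void ring_release(block_ring *r)
{
    assert(r->filled != r->consumed); /* Releasing an empty ring is a bug. */
    r->read_idx = (r->read_idx + 1u) % NUM_BLOCKS;
    r->consumed++;
}
```

The interrupt handler only calls ring_commit and re-arms the transfer with ring_next_free, so it stays short; the main loop does the heavy work between ring_next_filled and ring_release and can be interrupted while doing so.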

HE Bin over 12 years ago in reply to daith

Yes, double buffer + DMA will be a good solution.
But I wonder whether an STM32F302 at 72 MHz could process 1024 pixels within 256us.
It seems hard, but you two have given me useful suggestions.
Anyway, is "a _global_reg(x) variable in a condition is optimized as if its value never changes" an armcc compiler bug?
Thanks again.

daith over 12 years ago in reply to HE Bin

I'm not sure what you are saying about global_reg.
The timing looks very tight, but I think it should be possible. I think it may be possible to time the acquisition of the ADC values with the DMA, which would be nice. If you do two values at a time and pack them in halfwords, I think you can probably just about do the job using the SIMD instructions.
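To illustrate the packing idea, here is a portable sketch that emulates a dual 16-bit multiply-add on two samples packed into one 32-bit word. The function names are hypothetical. On a Cortex-M4 the CMSIS intrinsic __SMUAD should perform the same dual multiply with addition in one instruction, but treat that mapping as an assumption to check against the CMSIS documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 12-bit ADC samples into one word:
 * lo in bits 0..15, hi in bits 16..31. */
uint32_t pack2(uint16_t lo, uint16_t hi)
{
    assert(lo <= 0x0FFFu && hi <= 0x0FFFu); /* 12-bit samples assumed. */
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Portable emulation of a dual 16x16 multiply with add:
 * lo(a) * lo(b) + hi(a) * hi(b), halfwords treated as signed. */
int32_t dual_mul_add(uint32_t a, uint32_t b)
{
    int16_t a_lo = (int16_t)(a & 0xFFFFu);
    int16_t a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16);
    return (int32_t)a_lo * b_lo + (int32_t)a_hi * b_hi;
}

/* Sum of squares of all samples in a buffer of packed pairs.
 * A 64-bit accumulator is used so that long buffers cannot overflow. */
uint64_t sum_of_squares(const uint32_t *packed, uint32_t n_words)
{
    uint64_t sum = 0;
    for (uint32_t i = 0; i < n_words; i++) {
        sum += (uint64_t)dual_mul_add(packed[i], packed[i]);
    }
    return sum;
}
```

Each loop iteration squares and adds two samples at once, which is where the saving over the one-sample-per-interrupt version would come from.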

HE Bin over 12 years ago in reply to daith

The problem is that a _global_reg variable can't be specified as volatile.
Thus, the ARMCC compiler assumes it never changes.

daith over 12 years ago in reply to HE Bin

You could talk to them about marking it as volatile, which I think is the cleanest way of saying it could change at any time; as their documentation says, they don't currently support that. I would be rather loath to depend on such a facility unless the ABI and RTOS supported having a register reserved in such a way, so I guess they haven't had any requests for it. If you are careful placing your loads and stores so they don't cause holdups, they can be quite cheap. So if in C one had some calculations followed by testing the value of a volatile variable, the load of the volatile can be moved back over the calculation.