Hi guys,
I use a __global_reg var in my program to avoid save/load:
__global_reg(1) uint32_t finished_pixels;
This variable is updated in the ADC interrupt.
In main, I have the following code:
register uint32_t processed_pixels = 0;
finished_pixels = 0; // Initial global reg
while (processed_pixels >= finished_pixels)
{
// Process new pixel
...
}
But armcc optimizes away all the code in the while loop, i.e. the "// Process new pixel" section.
I tried "volatile __global_reg(1) uint32_t finished_pixels;", but the compiler says it has no effect and the same thing happens.
Now I use "volatile register uint32_t processed_pixels = 0;" to avoid the trap, but it looks ugly.
Is that a compiler bug?
Any better solutions?
Thanks a lot!
Yes, the volatile is necessary if you want to have something okay in the middle of an interrupt.
However, the ARM Information Center says volatile is ignored for global regs.
Also, this all seems rather worrying. If you want to reserve a register like this and have it valid in interrupts, it has to be reserved in all the user-level code at all times and in all the library routines as well, like memcpy and printf, which really means a change to the ABI, recompiling everything and checking assembly routines. That's fine for someone like an RTOS developer, but I wouldn't recommend it unless a reserved register is guaranteed as a facility in the OS you are using, and not just guaranteed at the interfaces but actually not used at all within assembler routines like memcpy. Typically such a register would hold something like a pointer to thread data, and even then it is only guaranteed at interfaces, not within routines. Using it to hold a counter seems a bit of a waste to me.
Without the volatile all that is guaranteed is that when you call or enter or return from a subroutine compiled with that then the register has the correct value at that point.
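For comparison, here is a minimal sketch of the ordinary alternative: keep the counter in memory and mark it volatile, so the compiler has to reload it every time the loop condition is tested. The names adc_isr_sim and drain_pixels are made up for illustration, and the loop condition is written as processed_pixels < finished_pixels, which is my assumption about the intent.

```c
#include <stdint.h>

/* Shared between the interrupt handler and main. volatile forces a real
   load on every read, so the loop test cannot be folded to a constant. */
static volatile uint32_t finished_pixels;

/* Stand-in for the ADC interrupt handler (hypothetical name). */
void adc_isr_sim(void)
{
    finished_pixels++;
}

/* Process every pixel published so far; returns the new processed count. */
uint32_t drain_pixels(uint32_t processed_pixels)
{
    while (processed_pixels < finished_pixels) {
        /* Process new pixel here. */
        processed_pixels++;
    }
    return processed_pixels;
}
```

This costs one load per loop test, but it needs no reserved register and no change to the libraries.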
Hi daith,
I use "--global_reg=1,2" in the C/C++ options to make the compiler reserve r4 and r5 for global vars.
Because my code has to process 1024 points using a specific algorithm within 0.25us, I must use a "naked system".
I reviewed the assembly code: "volatile register uint32_t processed_pixels = 0;" causes memory save/load for the local variable "processed_pixels".
That is, "volatile" makes the compiler forget the "register" that follows.
Now I have to use "__ASM("mov r4, #0");" to reset "__global_reg(1) uint32_t finished_pixels;" and keep the code in the while loop, which is necessary.
I hope armcc will fix the problem, or someone can offer a better solution.
Thanks!
If you're not doing anything else you could just stay uninterruptible for 0.25 usec and poll for the values. I'd have limit counts on the wait loops just in case something goes wrong - you don't want to stay uninterruptible too long. I assume there's nothing else critical to be done within 0.25 usec. If it was just about possible to do the job with interrupts it should be easy this way.
Um, sorry, I see you said 1024 values in 0.25usec. That's rather a bit over the top, silly me! I guess you mean each point takes 0.25usec, so a total of 256usec. That is still less than the critical time for most things, so what I said might still be okay.
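The limit count on the wait loop could look something like this. It is only a sketch: ready_fn and ready_after_countdown are hypothetical stand-ins, and on real hardware the check would test a status flag in the peripheral instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "is a new sample ready?" check. */
typedef bool (*ready_fn)(void *ctx);

/* Spin until ready() reports true or max_spins polls have been made.
   Returns true on success, false if the limit was reached. */
bool wait_ready_bounded(ready_fn ready, void *ctx, uint32_t max_spins)
{
    for (uint32_t i = 0; i < max_spins; i++) {
        if (ready(ctx)) {
            return true;
        }
    }
    return false;
}

/* Demo helper: reports ready once the countdown in ctx reaches zero. */
bool ready_after_countdown(void *ctx)
{
    uint32_t *remaining = ctx;
    if (*remaining == 0) {
        return true;
    }
    (*remaining)--;
    return false;
}
```

If the call returns false, the caller can re-enable interrupts and report the fault instead of staying uninterruptible.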
> I use a _global_reg var in my program to avoid save/load
> I use "--global_reg=1,2" in C/C++ option to make compiler to reserve r4, r5 for global var.
Do you have any evidence that you actually need to do this? You don't say what processor you are using, or what frequency it is running at, but this is one of those optimizations which is often more hassle than it is worth.
Stacking variables on ARM is relatively painless so you are not saving much, and loss of two registers for "normal code" is likely to degrade performance significantly for algorithms with any kind of complexity. You are avoiding save and load on interrupt, but will force functions with a lot of live variables to stack a lot more aggressively during normal execution as they have fewer registers available. In particular this optimization can _force_ more save and load simply because you are giving the compiler less flexibility to schedule registers intelligently so you won't stack your two counters, but you'll end up stacking other variables instead - the end result is much the same.
If you are processing bulk pixel data I would expect the cost of saving and restoring a couple of registers' worth of data to be in the noise.
Pete
Sorry daith, I made a mistake. As you said, one pixel takes up to 0.25us.
I use a sub-pixel algorithm to accumulate pixel values and location-weighted pixel values.
Note: val_sum is a global uint32_t and weighted_sum is a uint64_t var.
The assembly code uses lots of save/load, as follows:
ADC1_2_IRQHandler PROC
;;;787
;;;788 void ADC1_2_IRQHandler(void)
000254 b4c0 PUSH {r6,r7}
;;;789 {
;;;790 /* Test on ADC end of conversion interrupt */
;;;791 // if(ADC_GetITStatus(ADC1, ADC_IT_EOC))
;;;792 {
;;;793 register uint32_t val;
;;;794
;;;795 #if USE_INTERPOLATION
;;;796 register uint32_t interpolation;
;;;797 #endif
;;;798
;;;799 /* Clear ADC end of conversion interrupt -- cleared automatically when ADC1->DR is read */
;;;800 // ADC_ClearITPendingBit(ADC1, ADC_IT_EOC)
;;;801
;;;802 /* ADC value */
;;;803 val = ADC1->DR;
000256 f04f40a0 MOV r0,#0x50000000
00025a 6c00 LDR r0,[r0,#0x40]
;;;804
;;;805 #if USE_INTERPOLATION
;;;806 interpolation = (prev_val + val) >> 1;
;;;807 prev_val = val;
;;;808
;;;809 interpolation *= interpolation;
;;;810 val_sum += interpolation;
;;;811 curr_pixel++;
00025c 1c64 ADDS r4,r4,#1
00025e 1829 ADDS r1,r5,r0 ;806
000260 0849 LSRS r1,r1,#1 ;806
000262 fb01f201 MUL r2,r1,r1 ;809
000266 4912 LDR r1,|L1.688|
000268 4605 MOV r5,r0 ;807
00026a 68cb LDR r3,[r1,#0xc] ;810 ; val_sum
;;;812 weighted_sum += interpolation * curr_pixel;
00026c e9d1c704 LDRD r12,r7,[r1,#0x10]
000270 189e ADDS r6,r3,r2 ;810
000272 fb02f304 MUL r3,r2,r4
000276 2200 MOVS r2,#0
000278 eb130c0c ADDS r12,r3,r12
00027c eb420307 ADC r3,r2,r7
;;;813 #endif
;;;814
;;;815 val *= val;
000280 4368 MULS r0,r5,r0
;;;816 val_sum += val;
000282 4406 ADD r6,r6,r0
;;;817 curr_pixel++;
000284 1c64 ADDS r4,r4,#1
;;;818 weighted_sum += val * curr_pixel;
000286 4360 MULS r0,r4,r0
000288 eb10000c ADDS r0,r0,r12
00028c 415a ADCS r2,r2,r3
00028e 60ce STR r6,[r1,#0xc] ; val_sum
;;;819
;;;820 #if USE_INTERPOLATION
;;;821 if (curr_pixel >= (PIXEL_CNT << 1))
000290 e9c10204 STRD r0,r2,[r1,#0x10]
000294 f5b46f00 CMP r4,#0x800
;;;822 #else
;;;823 if (curr_pixel >= PIXEL_CNT)
;;;824 #endif
;;;825 {
;;;826 /* Enough pixels have been collected; turn off the ADC interrupt */
;;;827 ADC_ITConfig(ADC1, ADC_IT_EOC, DISABLE);
;;;828 }
;;;829 }
;;;830 }
000298 bf3c ITT CC
00029a bcc0 POPCC {r6,r7}
00029c 4770 BXCC lr
00029e 2200 MOVS r2,#0 ;827
0002a0 bcc0 POP {r6,r7} ;827
0002a2 2104 MOVS r1,#4 ;827
0002a4 f04f40a0 MOV r0,#0x50000000 ;827
0002a8 f7ffbffe B.W ADC_ITConfig
;;;831
ENDP
For an STM32F302CBT6@72MHZ, only 93M/s*0.25us=23 single-cycle instructions are allowed.
Also, the raw pixel values should be sent to the UART in calibrate mode.
So I tried to get the data in the ADC interrupt and process it in the main loop/calibrate loop to avoid save/load.
Could you give me further suggestions?
Thanks.
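The budget arithmetic can be checked with a couple of lines of integer math. The 93M/s rate and the 0.25us window are the figures quoted above; the 72 MHz case (one instruction per core clock) is an extra assumption added here only for comparison. The function names are made up.

```c
#include <stdint.h>

/* Whole operations that fit in window_ns at rate_mhz million per second. */
uint32_t ops_in_window(uint32_t rate_mhz, uint32_t window_ns)
{
    return (rate_mhz * window_ns) / 1000u;
}

/* Total frame time in microseconds for n_pixels at ns_per_pixel each. */
uint32_t frame_time_us(uint32_t n_pixels, uint32_t ns_per_pixel)
{
    return (n_pixels * ns_per_pixel) / 1000u;
}
```

ops_in_window(93, 250) gives 23, ops_in_window(72, 250) gives 18, and frame_time_us(1024, 250) gives 256, which matches the 256us per frame mentioned earlier in the thread.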
Hi peter,
Thanks for your help.
I use an STM32F302CBT6@72MHZ in my research. As I mentioned in the reply to daith, I have to save instructions so that one pixel is processed within 23 instructions.
Maybe I should use an STM32F2 chip, but for now the F2 family doesn't have an ADC with more than 4Msps@12bit resolution.
Anyway, the experiments are a lot of fun for me.
>If you are processing bulk pixel data I would expect the cost of saving and restoring a couple of registers' worth of data to be in the noise.
That's true.
I am now thinking of using DMA for one frame & SIMD to process batch data.
But I will have a "long" DMA interrupt handler. I don't know if it's a problem that an interrupt takes almost all the CPU time to process one frame.
Well yes SIMD is designed practically for work like this.
Blocking up data and doing a batch is normally a very good thing too, though one would normally try and make them as small as reasonable to avoid latency delays and to take up less space - one doesn't want chunks of work to get bigger than the cache.
I don't understand why you are saying much time would be taken up in the interrupt handler if you do what you say. Wouldn't you use, say, three blocks in a loop, filling up one while processing another and having another free to be filled? The processing of the blocks could be interruptible.
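A rough sketch of the three-block rotation, with the producer side meant to be called from something like a DMA-complete handler and the consumer side from interruptible code. All the names (block_ring, ring_next_to_fill and so on) and the block length are assumptions for illustration, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_BLOCKS 3u   /* three blocks, as suggested above */
#define BLOCK_LEN  64u  /* samples per block; an arbitrary choice */

typedef struct {
    uint16_t data[NUM_BLOCKS][BLOCK_LEN];
    volatile uint32_t filled;   /* blocks completed; written by producer only */
    volatile uint32_t consumed; /* blocks released; written by consumer only  */
    uint32_t fill_idx;          /* producer's current block */
    uint32_t proc_idx;          /* consumer's current block */
} block_ring;

/* Producer: block to fill next, or NULL if all blocks are still pending. */
uint16_t *ring_next_to_fill(block_ring *r)
{
    if (r->filled - r->consumed >= NUM_BLOCKS) {
        return NULL; /* overrun: processing has fallen behind */
    }
    return r->data[r->fill_idx];
}

/* Producer: publish the block that was just filled. */
void ring_commit_fill(block_ring *r)
{
    r->fill_idx = (r->fill_idx + 1u) % NUM_BLOCKS;
    r->filled++;
}

/* Consumer: oldest filled block, or NULL if nothing is waiting. */
uint16_t *ring_next_to_process(block_ring *r)
{
    if (r->consumed == r->filled) {
        return NULL;
    }
    return r->data[r->proc_idx];
}

/* Consumer: hand the processed block back to the producer. */
void ring_release(block_ring *r)
{
    r->proc_idx = (r->proc_idx + 1u) % NUM_BLOCKS;
    r->consumed++;
}
```

Each counter has a single writer, so the handler itself only needs to commit a block and pick the next one; the heavy processing stays outside the interrupt.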
Yes, double buffer + DMA will be a good solution.
But I wonder if an STM32F302@72MHZ could process 1024 pixels within 256us.
It seems hard, but you two guys have given me useful suggestions.
Anyway, is "a __global_reg(x) variable in a condition gets optimized as if its value will never change" an armcc compiler bug?
Thanks again.
I'm not sure what you are saying about global_reg.
The timing looks very tight. I think it should be possible. I think it may be possible to time the getting of the ADC values with the DMA which would be nice. If you do two values at a time and pack them in halfwords I think you probably can just about do the job using the SIMD instructions.
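To show what packing two values into halfwords buys, here is a portable C model of a dual 16x16 multiply-accumulate used for the sum of squares. It is a sketch, not the real thing: pack2, dual_mac and sum_squares_packed are made-up names, and samples are assumed to be 12-bit (0..4095). On a Cortex-M4 I would expect the body of dual_mac to be replaced by a single SIMD instruction such as SMLAD (the CMSIS __SMLAD intrinsic, if your headers provide it).

```c
#include <stddef.h>
#include <stdint.h>

/* Pack two 16-bit samples into one word: first sample in the low halfword. */
uint32_t pack2(uint16_t lo, uint16_t hi)
{
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Model of a dual multiply-accumulate:
   acc + lo(x)*lo(y) + hi(x)*hi(y).
   With 12-bit samples, signed and unsigned halfwords give the same result. */
uint32_t dual_mac(uint32_t x, uint32_t y, uint32_t acc)
{
    uint32_t lo = (x & 0xFFFFu) * (y & 0xFFFFu);
    uint32_t hi = (x >> 16) * (y >> 16);
    return acc + lo + hi;
}

/* Sum of squares over n_pairs packed words (two samples per word). */
uint32_t sum_squares_packed(const uint32_t *packed, size_t n_pairs)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n_pairs; i++) {
        acc = dual_mac(packed[i], packed[i], acc);
    }
    return acc;
}
```

Two samples are squared and accumulated per loop iteration, which halves the iteration count for the val_sum part of the work.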
The problem is, a __global_reg variable can't be specified as volatile.
Thus, the ARMCC compiler will assume it never changes.
You could talk to them about supporting volatile on it, which I think is the cleanest way of saying it could change at any time; as their documentation says, they don't currently support that. I would be rather loath to depend on such a facility unless the ABI and RTOS supported having a register reserved in such a way, so I guess they haven't had any requests for it. If you are careful placing your loads and stores so they don't cause holdups, they can be quite cheap. So if in C one had some calculations followed by a test of a volatile variable, the load of the volatile can be moved back to before the calculations.