Hi, I'm searching for a fast way to subtract a dark image from another image. If the dark image pixel value is greater than the corresponding image pixel, the resulting image pixel should be zero. Otherwise, the dark value should simply be subtracted.
Are there special functions for doing this job?
I have an AT91SAM9260 processor, and the image data is 10 bits deep, stored in a 16-bit array.
Many thanks, Stefan
I think the LUT approach only works for a constant dark image pixel value.
As I said - as long as the maximum value of the dark image isn't too large, you could use a two-dimensional lookup table if the dark image pixel value isn't constant.
Once again, let's assume that the image is 4 bits deep, and the dark image is only 2 bits deep. In that case, you'd build the following lookup table:
#define IMAGE_RANGE 16
#define DARK_IMAGE_RANGE 4

int lookup_table[DARK_IMAGE_RANGE][IMAGE_RANGE] = {
    {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15},
    {0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14},
    {0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13},
    {0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
};
and then you would do the following operation:
new_image[x][y] = lookup_table[dark_image[x][y]][image[x][y]];
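For larger ranges the table would be tedious to type by hand, so it could be filled at start-up instead. The following is only a sketch for the 4-bit/2-bit example above; the function name `build_lookup_table` and the `uint8_t` entry type are assumptions, not part of the original post.

```c
#include <stdint.h>

#define IMAGE_RANGE      16  /* 4-bit image samples, as in the example */
#define DARK_IMAGE_RANGE 4   /* 2-bit dark image samples */

uint8_t lookup_table[DARK_IMAGE_RANGE][IMAGE_RANGE];

/* Fill every entry with max(image - dark, 0). */
void build_lookup_table(void)
{
    for (unsigned dark = 0; dark < DARK_IMAGE_RANGE; dark++) {
        for (unsigned pix = 0; pix < IMAGE_RANGE; pix++) {
            lookup_table[dark][pix] = (uint8_t)(pix > dark ? pix - dark : 0);
        }
    }
}
```

Call `build_lookup_table()` once before the first lookup; the resulting rows match the hand-written table.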
Still, I am not sure whether this will be faster on an ARM than just doing a straight subtraction and comparison, as in
signed int tmp;
...
tmp = (signed int) image[x][y] - dark_image[x][y];
if (tmp < 0) {
    tmp = 0;
}
new_image[x][y] = tmp;
... especially when this approach is optimized (e.g. by turning it into a one-dimensional loop, which should not be a problem if the size of your images is fixed).
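A one-dimensional version of that loop might look like the sketch below. The function name `subtract_dark` and the flat-buffer layout are assumptions made for illustration; the comparison is done on unsigned values, so no signed temporary is needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Subtract the dark frame from the image, clamping negative results to zero.
 * All three buffers hold n samples; out may be the same buffer as image. */
void subtract_dark(const uint16_t *image, const uint16_t *dark,
                   uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t p = image[i];
        uint16_t d = dark[i];
        out[i] = (uint16_t)(p > d ? p - d : 0);
    }
}
```

For a fixed-size image, `n` is simply width times height.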
Christoph Franck wrote:
> As I said - as long as the maximum value of the dark image isn't too large, you could use a two-dimensional lookup table if the dark image pixel value isn't constant.
Wasn't Stefan talking about 10bit samples for both pics? That would be quite a lot of memory for the LUT.
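To put a number on "quite a lot": a full two-dimensional table needs one entry per (dark, image) pair. The helper below is a hypothetical illustration of that arithmetic, not something from the thread.

```c
#include <stddef.h>

/* Bytes needed for a full 2-D lookup table with 2^dark_bits rows
 * and 2^image_bits columns. */
size_t lut_bytes(unsigned image_bits, unsigned dark_bits, size_t entry_size)
{
    return ((size_t)1 << image_bits) * ((size_t)1 << dark_bits) * entry_size;
}
```

With 10 bits for both images and 16-bit entries, `lut_bytes(10, 10, 2)` comes to 2,097,152 bytes (2 MiB); the 4-bit/2-bit toy example with one-byte entries needs only 64 bytes.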
A C-based algorithmic solution is of course the easiest to set up and maintain. If the margin is big enough, do it that way.
Unfortunately, ARM9 doesn't have a specific instruction to speed this up. An optimized C routine (reverse-engineered assembler, in fact) is a bit quicker than that, but hardly any prettier than writing assembler to begin with. The attached assembler code is about 30% faster than the straightforward C solution (two pixels/iteration):
__asm void subtract_pic_v5(uint16_t *restrict p, uint16_t *restrict d, uint16_t *restrict diff)
{
        PUSH    {r4,r5,r6,r7,lr}
        LDR     lr, =0xFFFF0000
        LDR     r3, =__cpp(NUM_SAMPLES)
loop1
        LDR     r4, [r0], #4            ; *p++
        LDR     r5, [r1], #4            ; *d++
        AND     r6, r4, lr
        AND     r12, r5, lr
        SUBS    r12, r6, r12            ; *(p+2) - *(d+2)
        MOVLT   r12, #0
        LSL     r4, #16
        SUBS    r6, r4, r5, LSL #16     ; *(p+0) - *(d+0)
        MOVLT   r6, #0
        ORR     r12, r12, r6, LSR #16
        STR     r12, [r2], #4
        SUBS    r3, #2
        BNE     loop1
        POP     {r4,r5,r6,r7,pc}
}
Just for kicks, here is an ARMv6 (ARM11 in case you want to wait for AT91SAM11 :) version taking advantage of SIMD instructions. With this code we process eight pixels per iteration.
#if __TARGET_ARCH_ARM >= 6
__asm void subtract_pic_v6(uint16_t *restrict p, uint16_t *restrict d, uint16_t *restrict diff)
{
        PUSH    {r4-r11}
        MOV     r12, #0
        LDR     r3, =__cpp(NUM_SAMPLES)
loop2
        LDM     r0!, {r4,r6,r8,r10}
        LDM     r1!, {r5,r7,r9,r11}
        USUB16  r4, r4, r5
        SEL     r4, r4, r12
        USUB16  r6, r6, r7
        SEL     r6, r6, r12
        USUB16  r8, r8, r9
        SEL     r8, r8, r12
        USUB16  r10, r10, r11
        SEL     r10, r10, r12
        STM     r2!, {r4,r6,r8,r10}
        SUBS    r3, #8
        BNE     loop2
        POP     {r4-r11}
        BX      lr
}
#endif
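For checking the assembler output on a host machine, a portable C model of one USUB16/SEL pair can be handy. This is a sketch only; the name `usub16_clamp` is made up here. It treats a 32-bit word as two packed 16-bit samples and keeps the difference only in lanes where no borrow occurred, which is what SEL with a zero register achieves via the GE flags.

```c
#include <stdint.h>

/* Per 16-bit lane: max(a - b, 0), mirroring USUB16 followed by SEL with zero. */
uint32_t usub16_clamp(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, b_lo = b & 0xFFFFu;
    uint32_t a_hi = a >> 16,     b_hi = b >> 16;

    uint32_t lo = (a_lo >= b_lo) ? a_lo - b_lo : 0u;  /* low lane  */
    uint32_t hi = (a_hi >= b_hi) ? a_hi - b_hi : 0u;  /* high lane */

    return (hi << 16) | lo;
}
```

For example, `usub16_clamp(0x00050003, 0x00020004)` yields `0x00030000`: the high lane is 5 - 2 = 3 and the low lane clamps to zero because 3 < 4.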
In any case, make sure the data is aligned to 32-byte boundaries, since that is the size of an ARM926 cache line. Look at the cache architecture. Try to avoid having the frame buffers at addresses that carry a risk of cache thrashing.
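One way to get 32-byte alignment for statically allocated frame buffers is C11 `alignas`, sketched below. This assumes a C11-capable compiler; older toolchains offer their own alignment extensions instead. The buffer names and the `NUM_SAMPLES` value are placeholders.

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32    /* ARM926 cache line size, as stated above */
#define NUM_SAMPLES      1024  /* hypothetical frame size */

/* Each buffer starts on a cache line boundary. */
alignas(CACHE_LINE_BYTES) uint16_t image_frame[NUM_SAMPLES];
alignas(CACHE_LINE_BYTES) uint16_t dark_frame[NUM_SAMPLES];
alignas(CACHE_LINE_BYTES) uint16_t diff_frame[NUM_SAMPLES];

/* Returns nonzero if ptr sits on a cache line boundary. */
int is_cache_line_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % CACHE_LINE_BYTES) == 0;
}
```

Alignment alone does not rule out thrashing; where the three buffers land relative to each other in the cache still has to be checked against the cache geometry.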
Regards Marcus http://www.doulos.com/arm/
> Wasn't Stefan talking about 10bit samples for both pics?
I assumed that he was talking about the format of the (dark) image, not its actual dynamic range. Depending on the application, it may be possible that the dark image only has a small dynamic range, in which case the LUT solution would be feasible, depending on the amount of available RAM. If the dark image's dynamic range can occupy most of the 10 bits, the LUT isn't an option.
I believe the LDM/STM instructions aren't limited to ARMv6. Processing multiple samples per iteration would be a good way to speed up the first example further, even without SIMD instructions, since it cuts down the number of cycles spent on memory accesses and loop overhead.
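In C, processing several samples per iteration amounts to unrolling the loop by hand. The sketch below handles four samples per pass plus a tail loop; the names `clamp_sub` and `subtract_dark_x4` are assumptions, and whether the compiler turns this into load-multiple instructions depends on the toolchain and optimization settings.

```c
#include <stddef.h>
#include <stdint.h>

static inline uint16_t clamp_sub(uint16_t p, uint16_t d)
{
    return (uint16_t)(p > d ? p - d : 0);
}

/* Dark-frame subtraction, four samples per loop iteration. */
void subtract_dark_x4(const uint16_t *image, const uint16_t *dark,
                      uint16_t *out, size_t n)
{
    size_t i = 0;

    /* Main loop: one loop-counter update per four samples. */
    for (; i + 4 <= n; i += 4) {
        out[i + 0] = clamp_sub(image[i + 0], dark[i + 0]);
        out[i + 1] = clamp_sub(image[i + 1], dark[i + 1]);
        out[i + 2] = clamp_sub(image[i + 2], dark[i + 2]);
        out[i + 3] = clamp_sub(image[i + 3], dark[i + 3]);
    }

    /* Tail: up to three leftover samples. */
    for (; i < n; i++) {
        out[i] = clamp_sub(image[i], dark[i]);
    }
}
```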
> Depending on the application, it may be possible that the dark image only has a small dynamic range, in which case the LUT solution would be feasible, depending on the amount of available RAM.
True, although the drawback of a LUT in this case is that this type of data structure has poor cache performance. In fact, it might evict cache lines occupied by samples.
> I believe the LDM/STM instructions aren't limited to ARMv6, and
Never said so. But USUB16/SEL are. With a cached core, LDM/STM themselves don't give much of a performance benefit unless you can process many samples per iteration.
> processing multiple samples per iteration would be a good way to speed up the first example even more even if there are no SIMD instructions, since it cuts down the number of cycles spent on accessing memory and the loop overhead
The v5 example already processes two samples per iteration. I don't believe that the loop would benefit much from increasing that number. Again, assuming that the cache is enabled. You'd have to benchmark this in the function's actual context.
Many thanks for all the answers. I will test the current proposals.
> The v5 example already processes two samples per iteration. I don't believe that the loop would benefit much from increasing that number.
Probably not on an ARM9, where only the loop overhead would shrink: LDR takes 1 cycle and LDM takes n cycles, so using LDM does not reduce the cycle count. I was thinking of the ARM7, where LDR takes 3 cycles and LDM takes n+2 cycles, so the load-multiple instructions can reduce the cycle count significantly.
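Using the ARM7 figures quoted in this post (3 cycles per LDR, n+2 cycles for an LDM of n registers), the difference can be worked out with two hypothetical helpers. They only restate the arithmetic above and ignore memory wait states.

```c
/* Cycles to load n words on ARM7 with n separate LDR instructions. */
unsigned arm7_ldr_cycles(unsigned n)
{
    return 3u * n;
}

/* Cycles to load n words on ARM7 with a single LDM. */
unsigned arm7_ldm_cycles(unsigned n)
{
    return n + 2u;
}
```

For four words that is 12 cycles with LDR against 6 with LDM; for eight words, 24 against 10.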