Fast dark picture subtraction

Hi,
I'm searching for a fast way to subtract a dark image from another image.
If the dark image pixel value is greater than the corresponding image pixel, the resulting pixel should be zero. Otherwise, it should simply be subtracted.

Are there special functions for doing this job?

I have an AT91SAM9260 processor, and the image data is 10 bits deep, stored in a 16-bit array.

Many thanks,
Stefan

Christoph Franck over 16 years ago in reply to Stefan Hartwig
> I think the approach with the LUT only works for a constant dark image pixel value.

As I said - as long as the maximum value of the dark image isn't too large, you could use a two-dimensional lookup table if the dark image pixel value isn't constant.

Once again, let's assume that the image is 4 bits deep, and the dark image is only 2 bits deep. In that case, you'd build the following lookup table:

#define IMAGE_RANGE 16
#define DARK_IMAGE_RANGE 4

int lookup_table[DARK_IMAGE_RANGE][IMAGE_RANGE] = {
    {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15},
    {0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14},
    {0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13},
    {0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
};

and then you would do the following operation:

new_image[x][y] = lookup_table[dark_image[x][y]][image[x][y]];
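For ranges larger than this toy example, the table would more likely be filled at start-up than typed in by hand. A minimal sketch, reusing the names IMAGE_RANGE, DARK_IMAGE_RANGE and lookup_table from above; the function name build_lookup_table and the uint16_t element type are assumptions made for illustration:

```c
#include <stdint.h>

#define IMAGE_RANGE      16  /* 4-bit image, as in the example above */
#define DARK_IMAGE_RANGE 4   /* 2-bit dark image, as in the example above */

static uint16_t lookup_table[DARK_IMAGE_RANGE][IMAGE_RANGE];

/* Hypothetical helper: fill entry [dark][pixel] with pixel - dark,
 * clamped at zero. */
static void build_lookup_table(void)
{
    for (unsigned dark = 0; dark < DARK_IMAGE_RANGE; ++dark) {
        for (unsigned pixel = 0; pixel < IMAGE_RANGE; ++pixel) {
            lookup_table[dark][pixel] =
                (uint16_t)(pixel > dark ? pixel - dark : 0u);
        }
    }
}
```

After build_lookup_table() has run once, the lookup itself is unchanged.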

Still, I am not sure whether this will be faster on an ARM than just doing a straight subtraction and comparison, as in

signed int tmp;
...
tmp = (signed int) image[x][y] - dark_image[x][y];
if (tmp < 0) {
    tmp = 0;
}
new_image[x][y] = tmp;

... especially when this approach is optimized (e.g. by turning it into a one-dimensional loop, which should not be a problem if the size of your images is fixed).
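As a sketch of the one-dimensional variant: if each image is stored contiguously, the two indices collapse into a single loop over all samples. The function name subtract_dark and the explicit count parameter are assumptions for illustration, not part of the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* diff[i] = image[i] - dark[i], clamped at zero, over a flat pixel array.
 * count is the total number of pixels (width * height). */
static void subtract_dark(const uint16_t *image, const uint16_t *dark,
                          uint16_t *diff, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        int tmp = (int)image[i] - (int)dark[i];
        if (tmp < 0) {
            tmp = 0;
        }
        diff[i] = (uint16_t)tmp;
    }
}
```

Whether the compiler turns the comparison into a conditional instruction rather than a branch depends on the toolchain and optimization level, so the generated code is worth inspecting.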

Marcus Harnisch over 16 years ago in reply to Christoph Franck

Christoph Franck wrote:

> As I said - as long as the maximum value of the dark image isn't
> too large, you could use a two-dimensional lookup table if the dark
> image pixel value isn't constant.

Wasn't Stefan talking about 10-bit samples for both pics? That would be quite a lot of memory for the LUT.
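To put a number on that: a full table indexed by two 10-bit values has 1024 x 1024 entries. A small sketch of the arithmetic; the helper name lut_bytes is made up for illustration:

```c
#include <stddef.h>

/* Bytes needed for a full two-dimensional table indexed by [dark][pixel],
 * given the bit depth of each index and the size of one entry. */
static size_t lut_bytes(unsigned image_bits, unsigned dark_bits,
                        size_t entry_size)
{
    return ((size_t)1 << image_bits) * ((size_t)1 << dark_bits) * entry_size;
}
```

With 16-bit entries, lut_bytes(10, 10, 2) gives 2,097,152 bytes (2 MiB), whereas the 4-bit/2-bit toy table needs only lut_bytes(4, 2, 2) = 128 bytes.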

A C-based algorithmic solution is of course the easiest to set up and maintain. If the performance margin is big enough, do it that way.

Unfortunately, ARM9 doesn't have a specific instruction to speed this up. An optimized C routine (reverse-engineered from assembler, in fact) is a bit quicker than the plain version but hardly any prettier than writing assembler to begin with. The following assembler code is about 30% faster than the straightforward C solution and processes two pixels per iteration:

__asm void subtract_pic_v5(uint16_t *restrict p, uint16_t *restrict d, uint16_t *restrict diff)
{
        PUSH    {r4,r5,r6,r7,lr}

        LDR     lr, =0xFFFF0000
        LDR     r3, =__cpp(NUM_SAMPLES)

loop1
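        LDR     r4, [r0], #4    ; load two pixels from p, advance p
        LDR     r5, [r1], #4    ; load two pixels from d, advance d
        AND     r6, r4, lr      ; upper pixel of p (bits 31:16)
        AND     r12, r5, lr     ; upper pixel of d (bits 31:16)
        SUBS    r12, r6, r12    ; upper pixels: p - d
        MOVLT   r12, #0         ; clamp a negative result to zero
        LSL     r4, #16         ; move lower pixel of p into bits 31:16
        SUBS    r6, r4, r5, LSL #16 ; lower pixels: p - d
        MOVLT   r6, #0          ; clamp a negative result to zero
        ORR     r12, r12, r6, LSR #16 ; repack both results into one word
        STR     r12, [r2], #4   ; store two pixels to diff, advance diff
        SUBS    r3, #2          ; two samples done; NUM_SAMPLES must be even
        BNE     loop1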
        POP     {r4,r5,r6,r7,pc}
}
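For checking the routine on a host machine, the per-word arithmetic can be modelled in portable C. This is only a sketch of what one loop iteration computes, not the original C routine; the name sub_sat_pair is an assumption, and like the assembler it relies on every sample being below 0x8000 (true for 10-bit data) so that the signed comparison is valid.

```c
#include <stdint.h>

/* Model of one loop iteration: subtract two packed 16-bit pixels held in
 * one 32-bit word and clamp each difference at zero.
 * Assumes every sample is < 0x8000, which holds for 10-bit data. */
static uint32_t sub_sat_pair(uint32_t p, uint32_t d)
{
    /* Upper pixels, compared in place in bits 31:16. */
    int32_t hi = (int32_t)(p & 0xFFFF0000u) - (int32_t)(d & 0xFFFF0000u);
    /* Lower pixels, shifted up so the sign bit reflects the borrow. */
    int32_t lo = (int32_t)(p << 16) - (int32_t)(d << 16);

    if (hi < 0) {
        hi = 0;
    }
    if (lo < 0) {
        lo = 0;
    }
    /* Repack: upper result stays in place, lower result moves back down. */
    return (uint32_t)hi | ((uint32_t)lo >> 16);
}
```

Because both halves of the word are treated identically, the model does not depend on which pixel of the pair sits in the upper halfword.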

Just for kicks, here is an ARMv6 version (ARM11, in case you want to wait for an AT91SAM11 :) that takes advantage of SIMD instructions. This code processes eight pixels per iteration.

#if __TARGET_ARCH_ARM >= 6
__asm void subtract_pic_v6(uint16_t *restrict p, uint16_t *restrict d, uint16_t *restrict diff)
{
        PUSH    {r4-r11}

        MOV     r12, #0
        LDR     r3, =__cpp(NUM_SAMPLES)
loop2
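        LDM     r0!, {r4,r6,r8,r10}     ; load eight pixels from p
        LDM     r1!, {r5,r7,r9,r11}     ; load eight pixels from d
        USUB16  r4, r4, r5              ; per-halfword p - d, sets GE flags
        SEL     r4, r4, r12             ; keep difference where p >= d, else 0
        USUB16  r6, r6, r7
        SEL     r6, r6, r12
        USUB16  r8, r8, r9
        SEL     r8, r8, r12
        USUB16  r10, r10, r11
        SEL     r10, r10, r12
        STM     r2!, {r4,r6,r8,r10}     ; store eight pixels to diff
        SUBS    r3, #8                  ; NUM_SAMPLES must be a multiple of 8
        BNE     loop2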
        POP     {r4-r11}
        BX      lr
}
#endif
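For readers without an ARMv6 target at hand, the effect of one USUB16/SEL pair can be modelled in portable C. The sketch below is a host-side model of the instruction semantics as used here (SEL choosing between the difference and the zero held in r12), not a replacement for the routine; the function name usub16_sel_model is made up for illustration.

```c
#include <stdint.h>

/* Model of "USUB16 r, a, b" followed by "SEL r, r, zero":
 * each 16-bit lane yields a - b if a >= b, otherwise 0. */
static uint32_t usub16_sel_model(uint32_t a, uint32_t b)
{
    uint32_t result = 0;

    for (int lane = 0; lane < 2; ++lane) {
        uint32_t x = (a >> (16 * lane)) & 0xFFFFu;
        uint32_t y = (b >> (16 * lane)) & 0xFFFFu;
        int ge = (x >= y);                  /* GE flags: set when no borrow */
        uint32_t diff = (x - y) & 0xFFFFu;  /* USUB16 lane result (wraps) */
        uint32_t sel = ge ? diff : 0u;      /* SEL: difference or zero */

        result |= sel << (16 * lane);
    }
    return result;
}
```

Unlike the v5 routine, this unsigned comparison is valid for the full 16-bit range, not just for 10-bit samples.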

In any case, make sure the data is aligned to 32-byte boundaries, since that is the size of an ARM926 cache line. Look at the cache architecture, and try to avoid placing the frame buffers at addresses that carry a risk of cache thrashing.
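One way to request that alignment for statically allocated buffers is the C11 alignment specifier, sketched below. The frame size and the names CACHE_LINE_BYTES and is_cache_aligned are assumptions for illustration; toolchains without C11 support typically provide their own alignment attribute instead.

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 32          /* ARM926 cache line size, as stated above */
#define NUM_SAMPLES      (640 * 480) /* hypothetical frame size */

/* C11 alignment specifier: each buffer starts on a cache-line boundary. */
static _Alignas(CACHE_LINE_BYTES) uint16_t image[NUM_SAMPLES];
static _Alignas(CACHE_LINE_BYTES) uint16_t dark_image[NUM_SAMPLES];
static _Alignas(CACHE_LINE_BYTES) uint16_t new_image[NUM_SAMPLES];

/* Returns 1 if ptr sits on a cache-line boundary, 0 otherwise. */
static int is_cache_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % CACHE_LINE_BYTES) == 0;
}
```

Alignment alone does not rule out thrashing; where the three buffers land relative to each other in the cache still depends on the linker placement.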

Regards
Marcus
http://www.doulos.com/arm/

Christoph Franck over 16 years ago in reply to Marcus Harnisch

> Wasn't Stefan talking about 10-bit samples for both pics?

I assumed that he was talking about the format of the (dark) image, not its actual dynamic range. Depending on the application, the dark image may have only a small dynamic range, in which case the LUT solution would be feasible, given enough available RAM. If the dark image's dynamic range can occupy most of the 10 bits, the LUT isn't an option.
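Whether that is the case could be checked once, when the dark frame is captured, by scanning it for its largest value. A minimal sketch; the function name dark_image_max and the flat-array layout are assumptions for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Largest sample in the dark frame; a two-dimensional table would then
 * need (max + 1) rows. */
static uint16_t dark_image_max(const uint16_t *dark, size_t count)
{
    uint16_t max = 0;

    for (size_t i = 0; i < count; ++i) {
        if (dark[i] > max) {
            max = dark[i];
        }
    }
    return max;
}
```

If the scan returned 15, for example, the table would need 16 rows of 1024 16-bit entries, i.e. 32 KiB.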

> Just for kicks, here is an ARMv6 version (ARM11, in case you want to wait for an AT91SAM11 :) that takes advantage of SIMD instructions. This code processes eight pixels per iteration.

I believe the LDM/STM instructions aren't limited to ARMv6, and processing multiple samples per iteration would be a good way to speed up the first example even more, even if there are no SIMD instructions, since it cuts down the number of cycles spent on accessing memory and the loop overhead.
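In C terms, the idea amounts to unrolling the loop so that the counter update and branch are paid once per group of pixels. A sketch with four pixels per iteration; the names sub_clamp and subtract_dark_x4 are hypothetical, and count is assumed to be a multiple of 4:

```c
#include <stddef.h>
#include <stdint.h>

/* a - b, clamped at zero. */
static inline uint16_t sub_clamp(uint16_t a, uint16_t b)
{
    return (uint16_t)(a > b ? a - b : 0);
}

/* Four pixels per iteration; count must be a multiple of 4. */
static void subtract_dark_x4(const uint16_t *image, const uint16_t *dark,
                             uint16_t *diff, size_t count)
{
    for (size_t i = 0; i < count; i += 4) {
        diff[i]     = sub_clamp(image[i],     dark[i]);
        diff[i + 1] = sub_clamp(image[i + 1], dark[i + 1]);
        diff[i + 2] = sub_clamp(image[i + 2], dark[i + 2]);
        diff[i + 3] = sub_clamp(image[i + 3], dark[i + 3]);
    }
}
```

Whether the compiler actually emits load-multiple instructions for this is not guaranteed, so any gain has to be measured on the target.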

Marcus Harnisch over 16 years ago in reply to Christoph Franck

> Depending on the application, it may be possible that the dark image
> only has a small dynamic range, in which case the LUT solution would
> be feasible, depending on the amount of available RAM.

True, although the drawback of a LUT in this case is that this type of data structure has poor cache performance. In fact, it might evict cache lines occupied by samples.

> I believe the LDM/STM instructions aren't limited to ARMv6, and

Never said so. But USUB16/SEL are. With a cached core, LDM/STM themselves don't give much of a performance benefit unless you can process many samples per iteration.

> processing multiple samples per iteration would be a good way to speed
> up the first example even more even if there are no SIMD instructions,
> since it cuts down the number of cycles spent on accessing memory and
> the loop overhead

The v5 example already processes two samples per iteration. I don't believe that the loop would benefit much from increasing that number, again assuming that the cache is enabled. You'd have to benchmark this in the function's actual context.

Regards
Marcus
http://www.doulos.com/arm/

Stefan Hartwig over 16 years ago in reply to Marcus Harnisch

Many thanks for all the answers.
I will test the current proposals.

Christoph Franck over 16 years ago in reply to Marcus Harnisch

> The v5 example already processes two samples per iteration. I don't believe that the loop would benefit much from increasing that number.

Probably not on an ARM9, where only the loop overhead would be smaller: LDR takes 1 cycle on ARM9 and LDM takes n cycles, so using LDM doesn't reduce the cycle count. I was thinking of ARM7, where LDR takes 3 cycles and LDM takes n+2 cycles, so the load-multiple instructions can reduce the cycle count significantly.
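Taking those figures at face value, the difference for loading n words can be worked out directly. The helper names below are made up for illustration, and the formulas simply restate the cycle counts quoted above; they ignore cache misses and interlocks.

```c
/* Cycles to load n words, using the per-instruction figures quoted above. */
static unsigned arm7_ldr_cycles(unsigned n) { return 3u * n; } /* n single loads */
static unsigned arm7_ldm_cycles(unsigned n) { return n + 2u; } /* one LDM */
static unsigned arm9_ldr_cycles(unsigned n) { return n; }      /* n single loads */
static unsigned arm9_ldm_cycles(unsigned n) { return n; }      /* one LDM */
```

For n = 4 this gives 12 versus 6 cycles on ARM7, but 4 versus 4 on ARM9.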