
Fast dark picture subtraction

Hi,
I'm searching for a fast way to subtract a dark image from another image.
If the dark image pixel value is greater than the corresponding image pixel, the resulting image pixel should be zero. Otherwise, it should simply be subtracted.

Are there special functions for doing this job?

I have an AT91SAM9260 processor, and the image data is 10-bit depth, lying in a 16-bit array.

Many thanks,
Stefan

  • I suppose the answer is hand-crafted assembly code. Get an ARM9 instruction set guide, with instruction timings. And don't forget to maximize memory throughput as well.

  • I suppose the answer is hand-crafted assembly code.

    Possibly, but the compiler should be able to take care of much of the necessary optimization. At least that's my experience - I tried beating the compiler with hand-crafted assembly in a similar scenario (by maximizing register usage and memory throughput by using load/store multiple instructions), and the compiler still ended up ahead.

    However, some things (like manually unrolling the loop, at least partially) might still have to be done by hand.

  • Stefan;
    I don't know your image size or your dynamic range, but for a limited range a Look Up Table (LUT) would be a very fast routine. The table would be filled with your dark image values, and the real image value would index into the dark image array for the new value. This is a common hardware approach to pre-processing of images for machine control. Of course, if your dark image values are dynamic as well as the real image, this would not work.
    For a number of algorithms, look at "The Handbook of Astronomical Image Processing" by Richard Berry. ISBN 0-943396-67-0. I'm sure there is a much later edition.
    While it leans to Astrometry, images are images.

  • Hi,
    thanks for the current answers.

    Al: Can you post a short example of the idea you mean? I don't quite get it...
    The pictures are 752*480 pixels with 10-bit color depth (grayscale).

  • Stefan;
    I must leave the office quickly. Just go to Wikipedia and look up LUTs.

  • Al: Can you post a short example of the idea you mean? I don't quite get it...

    I'll give it a try. I think the approach assumes that each pixel of the dark image has the same value, else you'd be looking at a two-dimensional lookup table that's [maximum image value] long in the first and [maximum dark image value] long in the second dimension. (This _might_ be feasible if the maximum dark image value is very small).

    For simplicity, let's assume that the image is only 4 bits deep. The dark image value is set to be 5. Then you would build the following lookup table:

    int image_lut[16] = {0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    

    and perform the following operation for each pixel of the image:

    new_pixel[x][y] = image_lut[old_pixel[x][y]];
    

  • Note that it can mean a lot if you can do the operation in-place, with only one pointer to walk over the image data, instead of having a source and a destination pointer.

  • Note that it can mean a lot if you can do the operation in-place, with only one pointer to walk over the image data, instead of having a source and a destination pointer.

    Possibly, but I'd guess it's irrelevant on the ARM architecture, which has plenty of registers to keep the pointers in.

    I would think that if the dark image really has a constant value, doing the actual subtraction + max operation instead of using a lookup table might be faster on an ARM. Memory accesses eat up cycles like crazy, compared to doing operations on registers.

    If the OP is using assembly, he might also be able to use some clever tricks like loading two 16-bit pixels with one 32-bit read operation.
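
    The two-pixels-per-word idea can also be sketched in plain C and left to the compiler. This is only a rough illustration (the function and variable names are mine, not from the thread, and it assumes 32-bit aligned buffers and an even pixel count):

    #include <stdint.h>
    
    /* In-place dark subtraction, two 16-bit pixels per 32-bit word,
       clamping negative results to zero. */
    void subtract_dark_pairs(uint16_t *img, const uint16_t *dark, unsigned pixels)
    {
        uint32_t *pi = (uint32_t *)img;          /* type-punned word access */
        const uint32_t *pd = (const uint32_t *)dark;
        unsigned i;
    
        for (i = 0; i < pixels / 2; i++) {
            uint32_t v = pi[i];                  /* two image pixels */
            uint32_t d = pd[i];                  /* two dark pixels  */
            int32_t lo = (int32_t)(v & 0xFFFFu) - (int32_t)(d & 0xFFFFu);
            int32_t hi = (int32_t)(v >> 16)     - (int32_t)(d >> 16);
    
            if (lo < 0) lo = 0;                  /* clamp to zero */
            if (hi < 0) hi = 0;
    
            pi[i] = (uint32_t)lo | ((uint32_t)hi << 16);
        }
    }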

  • I think the approach with the LUT only works for a constant dark image pixel value.
    But what I want to do is subtract the fixed-pattern noise (which is part of the dark image) and extraneous light.

    So, what I have is:
    dark image [752][480]
    and
    another image [752][480]

    E.g:

      {{4,5,4,4}{7,4,4,3}{1,4,4,6}{1,4,5,5}} == image
    - {{1,3,2,0}{7,2,1,2}{0,1,1,4}{3,0,0,1}} == dark image
    ________________________________________
      {{3,2,2,4}{0,2,3,1}{1,3,3,2}{0,4,5,4}}
    

  • You also mentioned saturation. Is it correct to assume that what you want is this:

    for (x = 0; x < 752; x++)
        for (y = 0; y < 480; y++)
            if (dark[x][y] < image[x][y])
                image[x][y] -= dark[x][y];
            else
                image[x][y] = 0;
    


    What's wrong with an implementation in C? Did you try it? Is it too slow?

  • Currently I just use a similar implementation in C.

    Since I have tight performance constraints, I want it to run as fast as possible.
    Maybe there is a way to make it faster using "special" instructions or inline assembler (e.g. QSUB - if I'm right).

  • enum {
        // max value in dark image, which also represents
        // most negative result from *img - *dark
        OFFSET = 100,
    };
    // Switch from two-dimensional representation to a
    // linear array, to simplify loop.
    uint16_t* pimg = (uint16_t*)image;
    uint16_t* pdark = (uint16_t*)dark_image;
    uint16_t *pend = pimg + width*height;
    const uint16_t lut[1024+OFFSET] = {
        // Clamp all negative values - i.e image darker than
        // the dark image, to zero. Note correlation with
        // OFFSET constant above.
        0,0,0,0,0,...,
        1,2,3,4,5,6
    };
    
    while (pimg < pend) {
        *pimg = lut[OFFSET + *pimg - *pdark++]; pimg++;
        *pimg = lut[OFFSET + *pimg - *pdark++]; pimg++;
        *pimg = lut[OFFSET + *pimg - *pdark++]; pimg++;
        *pimg = lut[OFFSET + *pimg - *pdark++]; pimg++;
    }
    


    Normally with a bit of loop unrolling. If the pixel count isn't evenly divisible by the unroll factor, then it is often good to allocate a couple of dummy pixels at the end of the image, and convert these too.
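
    As a rough illustration of the padding idea (names and the unroll factor here are just placeholders; 752*480 happens to be divisible by 4 already, so this only matters for other sizes or unroll factors):

    #include <stdint.h>
    
    #define WIDTH   752
    #define HEIGHT  480
    #define UNROLL  4
    #define PIXELS  (WIDTH * HEIGHT)
    /* Round the pixel count up to the next multiple of the unroll factor. */
    #define PADDED  ((PIXELS + UNROLL - 1) / UNROLL * UNROLL)
    
    uint16_t image[PADDED];       /* entries beyond PIXELS are dummy pixels */
    uint16_t dark_image[PADDED];  /* keep the dummy dark values at zero     */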

  • I think the approach with the LUT only works for a constant dark image pixel value.

    As I said - as long as the maximum value of the dark image isn't too large, you could use a two-dimensional lookup table if the dark image pixel value isn't constant.

    Once again, let's assume that the image is 4 bits deep, and the dark image is only 2 bits deep. In that case, you'd build the following lookup table:

    #define IMAGE_RANGE 16
    #define DARK_IMAGE_RANGE 4
    
    int lookup_table[DARK_IMAGE_RANGE][IMAGE_RANGE] =
    {
    {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15},
    {0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14},
    {0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13},
    {0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
    };
    

    and then you would do the following operation:

    new_image[x][y] = lookup_table[dark_image[x][y]][image[x][y]];
    

    Still, I am not sure whether this will be faster on an ARM than just doing a straight subtraction and comparison, as in

    signed int tmp;
    ...
    tmp = (signed int) image[x][y] - dark_image[x][y];
    if(tmp < 0)
        {
        tmp = 0;
        }
    new_image[x][y] = tmp;
    

    ... especially when this approach is optimized (e.g. making it into a one-dimensional loop, which should not be a problem if the size of your images is fixed).
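
    A flattened version might look roughly like this (a sketch; it treats the buffers as one-dimensional arrays, which works if the images are stored contiguously):

    signed int tmp;
    unsigned int i;
    ...
    for (i = 0; i < 752 * 480; i++)
        {
        tmp = (signed int) image[i] - dark_image[i];
        if (tmp < 0)
            {
            tmp = 0;
            }
        new_image[i] = tmp;
        }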

  • Christoph Franck wrote:

    > As I said - as long as the maximum value of the dark image isn't
    > too large, you could use a two-dimensional lookup table if the dark
    > image pixel value isn't constant.

    Wasn't Stefan talking about 10-bit samples for both pics? That would be quite a lot of memory for the LUT.

    A C based algorithmic solution is of course the easiest to set up and maintain. If the margin is big enough, do it that way.

    Unfortunately, ARM9 doesn't have specific instructions to speed this up. An optimized C routine (reverse-engineered assembler, in fact) is a bit quicker than that, but hardly any prettier than writing assembler to begin with. The attached assembler code is about 30% faster than the straightforward C solution (two pixels/iteration):

    __asm void subtract_pic_v5(uint16_t *restrict p, uint16_t *restrict d, uint16_t *restrict diff)
    {
            PUSH    {r4,r5,r6,r7,lr}
    
            LDR     lr, =0xFFFF0000
            LDR     r3, =__cpp(NUM_SAMPLES)
    
    loop1
            LDR     r4, [r0], #4    ; *p++
            LDR     r5, [r1], #4    ; *d++
            AND     r6, r4, lr
            AND     r12, r5, lr
            SUBS    r12, r6, r12    ; *(p+2) - *(d+2)
            MOVLT   r12, #0
            LSL     r4, #16
            SUBS    r6, r4, r5, LSL #16 ; *(p+0) - *(d+0)
            MOVLT   r6, #0
            ORR     r12, r12, r6, LSR #16
            STR     r12, [r2], #4
            SUBS    r3, #2
            BNE     loop1
            POP     {r4,r5,r6,r7,pc}
    }
    

    Just for kicks, here is an ARMv6 (ARM11, in case you want to wait for an AT91SAM11 :) version taking advantage of SIMD instructions. With this code we process eight pixels per iteration.

    #if __TARGET_ARCH_ARM >= 6
    __asm void subtract_pic_v6(uint16_t *restrict p, uint16_t *restrict d, uint16_t *restrict diff)
    {
            PUSH    {r4-r11}
    
            MOV     r12, #0
            LDR     r3, =__cpp(NUM_SAMPLES)
    loop2
            LDM     r0!, {r4,r6,r8,r10}
            LDM     r1!, {r5,r7,r9,r11}
            USUB16  r4, r4, r5
            SEL     r4, r4, r12
            USUB16  r6, r6, r7
            SEL     r6, r6, r12
            USUB16  r8, r8, r9
            SEL     r8, r8, r12
            USUB16  r10, r10, r11
            SEL     r10, r10, r12
            STM     r2!, {r4,r6,r8,r10}
            SUBS    r3, #8
            BNE     loop2
            POP     {r4-r11}
            BX      lr
    }
    #endif
    

    In any case, make sure the data is aligned to 32-byte boundaries, since that is the size of an ARM926 cache line. Look at the cache architecture. Try to avoid having the frame buffers at addresses that carry a risk of cache thrashing.
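
    For statically allocated buffers, that could look roughly like this (a sketch; __align is the armcc keyword, with GCC you would use __attribute__((aligned(32))) instead):

    /* Place both frame buffers on 32-byte (cache line) boundaries. */
    __align(32) uint16_t image[752*480];
    __align(32) uint16_t dark_image[752*480];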

    Regards
    Marcus
    http://www.doulos.com/arm/

  • Wasn't Stefan talking about 10-bit samples for both pics?

    I assumed that he was talking about the format of the (dark) image, not its actual dynamic range. Depending on the application, it may be possible that the dark image only has a small dynamic range, in which case the LUT solution would be feasible, depending on the amount of available RAM. If the dark image's dynamic range can occupy most of the 10 bits, the LUT isn't an option.

    Just for kicks, here is an ARMv6 (ARM11, in case you want to wait for an AT91SAM11 :) version taking advantage of SIMD instructions. With this code we process eight pixels per iteration.

    I believe the LDM/STM instructions aren't limited to ARMv6, and processing multiple samples per iteration would be a good way to speed up the first example even further, even if there are no SIMD instructions, since it cuts down both the number of cycles spent on accessing memory and the loop overhead.