Fast dark picture subtraction

Hi,
I'm looking for a fast way to subtract a dark image from another image.
If the dark image pixel value is greater than the corresponding image pixel, the resulting pixel should be zero. Otherwise, the dark value should simply be subtracted.

Are there special functions for doing this job?

I have an AT91SAM9260 processor, and the image data has 10-bit depth, stored in a 16-bit array.
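
For reference, the requirement described above can be written as a plain C loop. This is only a minimal baseline sketch, not an optimized routine: the function name `dark_subtract` and the in-place update of the image buffer are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Subtract a dark frame from an image in place, clamping at zero.
 * Assumption: both buffers hold 10-bit samples stored in uint16_t. */
void dark_subtract(uint16_t *img, const uint16_t *dark, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t p = img[i];
        uint16_t d = dark[i];
        /* If the dark value is not smaller than the pixel, the result is zero. */
        img[i] = (p > d) ? (uint16_t)(p - d) : 0;
    }
}
```

Any faster variant can be checked against this version for correctness.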

Many thanks,
Stefan

Parents
  • Wasn't Stefan talking about 10bit samples for both pics?

    I assumed that he was talking about the format of the (dark) image, not its actual dynamic range. Depending on the application, the dark image may only have a small dynamic range, in which case the LUT solution would be feasible, depending on the amount of available RAM. If the dark image's dynamic range can occupy most of the 10 bits, the LUT isn't an option.
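
To make the RAM trade-off concrete, here is a rough sketch of one way such a table could be laid out. The LUT proposal itself is not shown in this part of the thread, so the two-dimensional layout, the names `lut`, `lut_init` and `dark_subtract_lut`, and the limit of 16 dark levels are all assumptions chosen purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define PIXEL_LEVELS 1024u  /* 10-bit image samples */
#define DARK_LEVELS    16u  /* hypothetical: dark frame values stay below 16 */

/* 16 * 1024 * 2 bytes = 32 KiB.
 * A dark frame using the full 10 bits would need 1024 * 1024 * 2 bytes = 2 MiB. */
static uint16_t lut[DARK_LEVELS][PIXEL_LEVELS];

/* Fill the table once with the clamped difference for every (dark, pixel) pair. */
void lut_init(void)
{
    for (uint32_t d = 0; d < DARK_LEVELS; d++)
        for (uint32_t p = 0; p < PIXEL_LEVELS; p++)
            lut[d][p] = (uint16_t)((p > d) ? (p - d) : 0u);
}

/* The caller must guarantee dark[i] < DARK_LEVELS and img[i] < PIXEL_LEVELS. */
void dark_subtract_lut(uint16_t *img, const uint16_t *dark, size_t n)
{
    for (size_t i = 0; i < n; i++)
        img[i] = lut[dark[i]][img[i]];
}
```

With this layout the table size grows linearly with the dark frame's dynamic range, which is why a narrow range fits in on-chip or cached RAM while a full 10-bit range does not.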

    Just for kicks, here is an ARMv6 (ARM11, in case you want to wait for an AT91SAM11 :) version that takes advantage of the SIMD instructions. This code processes eight pixels per iteration.

    I believe the LDM/STM instructions aren't limited to ARMv6. Processing multiple samples per iteration would be a good way to speed up the first example further, even without SIMD instructions, since it cuts down both the number of cycles spent on memory accesses and the loop overhead.

Children
  • > Depending on the application, it may be possible that the dark image
    > only has a small dynamic range, in which case the LUT solution would
    > be feasible, depending on the amount of available RAM.

    True. The drawback of a LUT in this case, though, is that this kind of data structure has poor cache performance. In fact, it might evict cache lines occupied by samples.

    > I believe the LDM/STM instructions aren't limited to ARMv6, and

    Never said so. But USUB16/SEL are. With a cached core, LDM/STM themselves don't give much of a performance benefit unless you can process many samples per iteration.
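
For readers without an ARMv6 core at hand, the effect of the USUB16/SEL pair on one 32-bit word can be modelled in portable C. This is only a sketch of the instruction semantics as I understand them (USUB16 subtracts the two halfwords independently and sets the GE flags where no borrow occurred; SEL then picks between two sources according to those flags). It is not the assembly from the earlier post, and the function name `sub_sat_2x16` is made up for illustration.

```c
#include <stdint.h>

/* Portable model of USUB16 followed by SEL against a zero register.
 * Each 32-bit word is assumed to hold two 16-bit pixels. */
uint32_t sub_sat_2x16(uint32_t pix, uint32_t dark)
{
    uint32_t lo_p = pix & 0xFFFFu, lo_d = dark & 0xFFFFu;
    uint32_t hi_p = pix >> 16,     hi_d = dark >> 16;

    /* USUB16: per-halfword difference modulo 2^16 */
    uint32_t lo = (lo_p - lo_d) & 0xFFFFu;
    uint32_t hi = (hi_p - hi_d) & 0xFFFFu;

    /* GE flags: set for a halfword when the subtraction did not borrow */
    int ge_lo = lo_p >= lo_d;
    int ge_hi = hi_p >= hi_d;

    /* SEL: keep the difference where GE is set, otherwise take zero */
    return (ge_lo ? lo : 0u) | ((ge_hi ? hi : 0u) << 16);
}
```

Applying this to four words loaded with one LDM would correspond to the eight pixels per iteration mentioned above.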

    > processing multiple samples per iteration would be a good way to speed
    > up the first example even more even if there are no SIMD instructions,
    > since it cuts down the number of cycles spent on accessing memory and
    > the loop overhead

    The v5 example already processes two samples per iteration. I don't believe that the loop would benefit much from increasing that number. Again, assuming that the cache is enabled. You'd have to benchmark this in the function's actual context.

    Regards
    Marcus
    http://www.doulos.com/arm/

  • Many thanks for all the answers.
    I will test the current proposals.

  • The v5 example already processes two samples per iteration. I don't believe that the loop would benefit much from increasing that number.

    Probably not on an ARM9, where only the loop overhead would be smaller: LDR takes 1 cycle on ARM9 and LDM takes n cycles, so using LDM does not reduce the cycle count. I was thinking of ARM7, where LDR takes 3 cycles and LDM takes n+2 cycles, so the cycle count reduction from using the load-multiple instructions can be significant.
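
The figures quoted above can be turned into a quick comparison. The helper names below are hypothetical, and the formulas only restate the numbers from this post; real timings also depend on wait states, cache behaviour and interlocks.

```c
/* Load cycle counts for n words, using the figures quoted in the post above. */
unsigned arm7_ldr_cycles(unsigned n) { return 3u * n; }  /* n separate LDRs, 3 cycles each */
unsigned arm7_ldm_cycles(unsigned n) { return n + 2u; }  /* one LDM of n registers */
unsigned arm9_ldr_cycles(unsigned n) { return n; }       /* n separate LDRs, 1 cycle each */
unsigned arm9_ldm_cycles(unsigned n) { return n; }       /* one LDM of n registers */
```

For four words, that gives 12 cycles against 6 on ARM7, but 4 against 4 on ARM9, which matches the point that LDM mainly pays off on ARM7.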