Hi, I'm searching for a fast way to subtract a dark image from another image. If the dark image pixel value is greater than the corresponding image pixel, the resulting image pixel should be zero. Otherwise, it should simply be subtracted.
Are there special functions for doing this job?
I have an AT91SAM9260 processor and the image data is 10-bit depth, lying in a 16-bit array.
Many thanks, Stefan
I suppose the answer is hand-crafted assembly code. Get an ARM9 instruction set guide, with instruction timings. And don't forget to maximize memory throughput as well.
I suppose the answer is hand-crafted assembly code.
Possibly, but the compiler should be able to take care of much of the necessary optimization. At least that's my experience - I tried beating the compiler with hand-crafted assembly in a similar scenario (by maximizing register usage and memory throughput by using load/store multiple instructions), and the compiler still ended up ahead.
However, some things (like manually unrolling the loop, at least partially) might still have to be done by hand.
Stefan; I don't know your image size or your dynamic range but for limited range a Look Up Table (LUT) would be a very fast routine. The table would be filled with your dark image values and the real image value would index into the dark image array for the new value. This is a common hardware approach to pre-processing of images for machine control. Of course if your dark image values are dynamic as well as the real image this would not work. For a number of algorithms, look at "The Handbook of Astronomical Image Processing" by Richard Berry. ISBN 0-943396-67-0. I'm sure there is a much later edition. While it leans to Astrometry, images are images.
Hi, thanks for the current answers.
Al: Can you post a short example of the idea you mean? I don't get it... The pictures are 752*480 pixels with 10-bit color depth (grayscale).
Stefan; I must leave the office quickly. Just go to Wikipedia and look up LUTs.
Al: Can you post a short example of the idea you mean? I don't get it...
I'll give it a try. I think the approach assumes that each pixel of the dark image has the same value, else you'd be looking at a two-dimensional lookup table that's [maximum image value] long in the first and [maximum dark image value] long in the second dimension. (This _might_ be feasible if the maximum dark image value is very small).
For simplicity, let's assume that the image is only 4 bits deep. The dark image value is set to be 5. Then you would build the following lookup table:
int image_lut[16] = {0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
and perform the following operation for each pixel of the image:
new_pixel[x][y] = image_lut[old_pixel[x][y]];
Note that it can mean a lot if you can do the operation in-place, with only one pointer to walk over the image data, instead of having a source and a destination pointer.
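A minimal sketch of that in-place idea, using the image_lut from the example above (WIDTH, HEIGHT and the buffer layout are my assumptions, not anyone's actual code):

    uint16_t *p   = &image[0][0];      /* image assumed stored as uint16_t[HEIGHT][WIDTH] */
    uint16_t *end = p + WIDTH * HEIGHT;

    while (p < end) {
        *p = image_lut[*p];            /* replace each pixel in place; one pointer only */
        p++;
    }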
Note that it can mean a lot if you can do the operation in-place, with only one pointer to walk over the image data, instead of having a source and a destination pointer.
Possibly, but I'd guess it's irrelevant on the ARM architecture, which has plenty of registers to keep the pointers in.
I would think that if the dark image really has a constant value, doing the actual subtraction + max operation instead of using a lookup table might be faster on an ARM. Memory accesses eat up cycles like crazy, compared to doing operations on registers.
If the OP is using assembly, he might also be able to use some clever tricks like loading two 16-bit pixels with one 32-bit read operation.
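As a rough C illustration of that packing trick (the function and parameter names are made up, and it assumes 32-bit aligned buffers and an even pixel count):

    #include <stdint.h>
    #include <stddef.h>

    /* Process two 16-bit pixels per 32-bit word: one read fetches two pixels,
       one write stores two pixels. */
    void subtract_dark_packed(uint32_t *img32, const uint32_t *dark32, size_t n_pixels)
    {
        for (size_t i = 0; i < n_pixels / 2; i++) {
            uint32_t iw = img32[i];
            uint32_t dw = dark32[i];

            uint16_t lo_i = (uint16_t)iw,         lo_d = (uint16_t)dw;
            uint16_t hi_i = (uint16_t)(iw >> 16), hi_d = (uint16_t)(dw >> 16);

            uint16_t lo = (lo_i > lo_d) ? (uint16_t)(lo_i - lo_d) : 0;   /* clamp at zero */
            uint16_t hi = (hi_i > hi_d) ? (uint16_t)(hi_i - hi_d) : 0;

            img32[i] = ((uint32_t)hi << 16) | lo;
        }
    }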
I think the approach with the LUT only works for a constant dark image pixel value. But what I want to do is to subtract the fixed pattern noise (which is part of the dark image) and extraneous light.
So, what I have is: dark image [752][480] and another image [752][480]
E.g:
  {{4,5,4,4},{7,4,4,3},{1,4,4,6},{1,4,5,5}}   == image
- {{1,3,2,0},{7,2,1,2},{0,1,1,4},{3,0,0,1}}   == dark image
-------------------------------------------
  {{3,2,2,4},{0,2,3,1},{1,3,3,2},{0,4,5,4}}
You also mentioned saturation. Is it correct to assume that what you want is this:
for (x = 0; x < 752; x++)
    for (y = 0; y < 480; y++)
        if (dark[x][y] < image[x][y])
            image[x][y] -= dark[x][y];
        else
            image[x][y] = 0;
What's wrong with implementation in C? Did you try it? Is it too slow?
Currently I use just such a similar implementation in C.
Since I have strict performance constraints, I want it to run as fast as possible. Maybe there is a way to get it faster using "special" commands or inline assembler (e.g. the QSUB instruction - if I'm right).
enum {
    // max value in dark image, which also represents
    // the most negative result from *img - *dark
    OFFSET = 100,
};

// Switch from two-dimensional representation to a
// linear array, to simplify the loop.
uint16_t *pimg  = (uint16_t*)image;
uint16_t *pdark = (uint16_t*)dark_image;
uint16_t *pend  = pimg + width*height;

const uint16_t lut[1024+OFFSET] = {
    // Clamp all negative values - i.e. image darker than
    // the dark image - to zero. Note correlation with
    // the OFFSET constant above.
    0,0,0,0,0,...,
    1,2,3,4,5,6
};

while (pimg < pend) {
    *pimg = lut[OFFSET + *pimg - *pdark++]; pimg++;
    *pimg = lut[OFFSET + *pimg - *pdark++]; pimg++;
    *pimg = lut[OFFSET + *pimg - *pdark++]; pimg++;
    *pimg = lut[OFFSET + *pimg - *pdark++]; pimg++;
}
Normally with a bit of loop unrolling. If the pixel count isn't evenly divisible by the unroll factor, it is often good to allocate a couple of dummy pixels at the end of the image and convert those too.
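A hypothetical sketch of that padding trick (UNROLL, width, height and the allocation are illustrative only; assumes <stdint.h> and <stdlib.h>):

    #define UNROLL 4                                              /* illustrative unroll factor */

    size_t n        = (size_t)width * height;
    size_t n_padded = (n + UNROLL - 1) & ~(size_t)(UNROLL - 1);   /* round up to a multiple of UNROLL */
    uint16_t *buf   = malloc(n_padded * sizeof *buf);             /* the extra dummy pixels at the end
                                                                     get converted too, then ignored   */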
I think the approach with the LUT only works for a constant dark image pixel value.
As I said - as long as the maximum value of the dark image isn't too large, you could use a two-dimensional lookup table if the dark image pixel value isn't constant.
Once again, let's assume that the image is 4 bits deep, and the dark image is only 2 bits deep. In that case, you'd build the following lookup table:
#define IMAGE_RANGE      16
#define DARK_IMAGE_RANGE 4

int lookup_table[DARK_IMAGE_RANGE][IMAGE_RANGE] = {
    {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15},
    {0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14},
    {0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13},
    {0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
};
and then you would do the following operation:
new_image[x][y] = lookup_table[dark_image[x][y]][image[x][y]];
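If the real ranges are bigger than this toy example, the table could also be filled at start-up instead of being written out by hand; a sketch using the same names as above:

    for (int d = 0; d < DARK_IMAGE_RANGE; d++)
        for (int v = 0; v < IMAGE_RANGE; v++)
            lookup_table[d][v] = (v > d) ? (v - d) : 0;   /* clamp negative results to zero */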
Still, I am not sure whether this will be faster on an ARM than just doing a straight subtraction and comparison, as in
signed int tmp;
...
tmp = (signed int) image[x][y] - dark_image[x][y];
if (tmp < 0) {
    tmp = 0;
}
new_image[x][y] = tmp;
... especially when this approach is optimized (e.g. making it into a one-dimensional loop, which should not be a problem if the size of your images is fixed).
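For example, a one-dimensional version might look like this (a sketch only; the function name and the contiguous-storage assumption are mine):

    void subtract_dark(const uint16_t *image, const uint16_t *dark,
                       uint16_t *new_image, size_t n_pixels)
    {
        for (size_t i = 0; i < n_pixels; i++) {
            int tmp = (int)image[i] - (int)dark[i];
            new_image[i] = (tmp < 0) ? 0 : (uint16_t)tmp;   /* clamp at zero */
        }
    }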
Christoph Franck wrote:
> As I said - as long as the maximum value of the dark image isn't
> too large, you could use a two-dimensional lookup table if the dark
> image pixel value isn't constant.
Wasn't Stefan talking about 10bit samples for both pics? That would be quite a lot of memory for the LUT.
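(With 10-bit values on both axes and 16-bit entries, a full two-dimensional table would be 1024 x 1024 x 2 bytes = 2 MB.)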
A C based algorithmic solution is of course the easiest to set up and maintain. If the margin is big enough, do it that way.
Unfortunately, the ARM9 doesn't have a specific instruction to speed this up. An optimized C routine (reverse-engineered assembler, in fact) is a bit quicker than that, but hardly any prettier than writing assembler to begin with. The attached assembler code is about 30% faster than the straightforward C solution (two pixels/iteration):
__asm void subtract_pic_v5(uint16_t *restrict p, uint16_t *restrict d,
                           uint16_t *restrict diff)
{
        PUSH    {r4,r5,r6,r7,lr}
        LDR     lr, =0xFFFF0000
        LDR     r3, =__cpp(NUM_SAMPLES)
loop1
        LDR     r4, [r0], #4            ; *p++
        LDR     r5, [r1], #4            ; *d++
        AND     r6, r4, lr
        AND     r12, r5, lr
        SUBS    r12, r6, r12            ; *(p+2) - *(d+2)
        MOVLT   r12, #0
        LSL     r4, #16
        SUBS    r6, r4, r5, LSL #16     ; *(p+0) - *(d+0)
        MOVLT   r6, #0
        ORR     r12, r12, r6, LSR #16
        STR     r12, [r2], #4
        SUBS    r3, #2
        BNE     loop1
        POP     {r4,r5,r6,r7,pc}
}
Just for kicks, here is an ARMv6 (ARM11, in case you want to wait for an AT91SAM11 :) version taking advantage of SIMD instructions. With this code we process eight pixels per iteration.
#if __TARGET_ARCH_ARM >= 6
__asm void subtract_pic_v6(uint16_t *restrict p, uint16_t *restrict d,
                           uint16_t *restrict diff)
{
        PUSH    {r4-r11}
        MOV     r12, #0
        LDR     r3, =__cpp(NUM_SAMPLES)
loop2
        LDM     r0!, {r4,r6,r8,r10}
        LDM     r1!, {r5,r7,r9,r11}
        USUB16  r4, r4, r5
        SEL     r4, r4, r12
        USUB16  r6, r6, r7
        SEL     r6, r6, r12
        USUB16  r8, r8, r9
        SEL     r8, r8, r12
        USUB16  r10, r10, r11
        SEL     r10, r10, r12
        STM     r2!, {r4,r6,r8,r10}
        SUBS    r3, #8
        BNE     loop2
        POP     {r4-r11}
        BX      lr
}
#endif
In any case, make sure the data is aligned to 32-byte boundaries, since that is the size of an ARM926 cache line. Look at the cache architecture. Try to avoid having the frame buffers at addresses that carry a risk of cache thrashing.
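For example, with a GCC-style toolchain the buffers could be forced onto cache-line boundaries like this (attribute syntax varies between compilers; treat this as a sketch):

    /* Put the frame buffers on 32-byte (ARM926 cache line) boundaries. */
    static uint16_t image[480][752]      __attribute__((aligned(32)));
    static uint16_t dark_image[480][752] __attribute__((aligned(32)));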
Regards Marcus http://www.doulos.com/arm/
Wasn't Stefan talking about 10bit samples for both pics?
I assumed that he was talking about the format of the (dark) image, not its actual dynamic range. Depending on the application, it may be possible that the dark image only has a small dynamic range, in which case the LUT solution would be feasible, depending on the amount of available RAM. If the dark image's dynamic range can occupy most of the 10 bits, the LUT isn't an option.
I believe the LDM/STM instructions aren't limited to ARMv6, and processing multiple samples per iteration would be a good way to speed up the first example even more, even without SIMD instructions, since it cuts down the number of cycles spent on accessing memory as well as the loop overhead.