With an image that big there is a good chance you are spending most of your time waiting for data from main memory, because the image is a lot bigger than your cache.

Can you try with a smaller image (say half the size of your L2 cache), loop the benchmark inside the application multiple times, and average the result, so that the timing uses a "warm cache"? That should at least rule out memory system effects and ensure you are timing the algorithm, not the memory system latency.

If you need to handle large data, consider using "preload data (PLD)" instructions to pull the data into the cache a few hundred cycles ahead of when you need it. This ensures that the CPU doesn't stall waiting for data. Most compilers have an intrinsic for this when you are using C code.
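To make both suggestions concrete, here is a rough sketch in C rather than a drop-in solution. It assumes GCC or Clang, where `__builtin_prefetch` is the prefetch intrinsic (on Arm targets it is normally emitted as a PLD). `sum_pixels` is a hypothetical stand-in for your real image routine, and the 64-byte line size and 256-byte prefetch distance are guesses that you would need to tune on your own hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Assumed values, not taken from the original post: tune both on target. */
#define CACHE_LINE_BYTES 64
#define PREFETCH_AHEAD   256

/* Hypothetical kernel standing in for the real image routine:
 * sums every byte of the buffer. */
uint64_t sum_pixels(const uint8_t *img, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Once per assumed cache line, hint that data further ahead will be
         * read soon. The bounds check keeps the pointer inside the buffer. */
        if ((i % CACHE_LINE_BYTES) == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(img + i + PREFETCH_AHEAD);
        sum += img[i];
    }
    return sum;
}

/* Runs the kernel once untimed to warm the cache, then times `repeats`
 * further runs and returns the average CPU time per run in seconds. */
double average_seconds_warm(const uint8_t *img, size_t n, int repeats,
                            uint64_t *result)
{
    assert(repeats > 0);

    uint64_t r = sum_pixels(img, n);            /* warm-up pass, not timed */

    clock_t start = clock();
    for (int k = 0; k < repeats; k++) {
        r = sum_pixels(img, n);
        /* GCC/Clang compiler barrier: stops the optimizer from collapsing
         * the repeated runs into a single one. */
        __asm__ volatile("" ::: "memory");
    }
    clock_t end = clock();

    *result = r;
    return (double)(end - start) / CLOCKS_PER_SEC / repeats;
}
```

Call `average_seconds_warm` with a buffer about half the size of your L2 cache and a few hundred repeats to get the warm-cache figure, then compare it against a run on the full-size image. `clock()` has coarse resolution, so for short kernels raise the repeat count or swap in a cycle counter if your platform exposes one.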