Yes, the two implementations of NEON are different, so I'd expect different performance numbers between the two cores. Can you give us an example of an algorithm you are trying, and how you are building it? The fact that you see absolutely no performance difference is "suspicious" - I'd expect some difference, even if only small. Check you are not running the same binary 3 times - it seems like the obvious conclusion to three identical performance numbers =)
That's strange... It's possible that your tests take the same time if you have written good code that checks whether NEON is available... In that case, maybe you don't have NEON on your Cortex-A9 (Tegra 2, for example). I can't find any information about your processor: http://www.amlogic.com/product01.htm - they don't mention NEON, so! In that case all your functions would call the basic ARM assembly code, and that could explain the identical timing results!
Etienne
Or by trying this code: http://pulsar.websha...lng=fr&sample=3
Sure. If you haven't added a specific runtime test, your app can't fall back to default code when NEON isn't present. What is the size of your pixel array? 107 ms is very slow, in fact!!!
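For reference, a minimal sketch of such a runtime check on ARM Linux/Android - an assumption, since the thread doesn't show how the app actually detects NEON. getauxval() needs glibc >= 2.16 or bionic, and HWCAP_NEON comes from the ARM <asm/hwcap.h>:

/* Runtime NEON detection sketch for ARM Linux/Android.
 * Assumption: this is one common way to do the check; the
 * original app's detection code isn't shown in the thread. */
#include <stdio.h>
#include <sys/auxv.h>    /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>   /* HWCAP_NEON (ARM-specific header) */

int main(void)
{
    if (getauxval(AT_HWCAP) & HWCAP_NEON)
        printf("NEON available: use the NEON path\n");
    else
        printf("No NEON: fall back to the plain ARM path\n");
    return 0;
}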
I haven't tested NEON on Cortex-A9 directly, but according to available information the following should be true:

- On Cortex-A8 a NEON instruction can dual-issue a load, store, or permute type instruction with any other type of instruction. On Cortex-A9 the NEON unit is described as only accepting one dispatch per cycle, so this probably precludes this sort of dual-issue.

- On Cortex-A8 the NEON pipeline begins after the main pipeline is completely done, while on Cortex-A9 it runs in parallel, with dispatch to it (presumably to a queue like in A8) occurring fairly early in the pipeline. However, in the A8 pipeline loads to NEON registers are queued and serviced well before the NEON pipeline itself begins. This allows for hiding latency, not only from L1 cache (load-use penalty) but even some or all from L2 cache. The queuing also allows for limited out-of-order loading (allowing hit under miss). So on A9 NEON loads will suffer from higher latency.

- On the other hand, preloads on Cortex-A9 go to L1 cache instead of L2 cache, and there's now an automatic preload engine (at least as an option; I don't know if the amlogic SoC implements it). So there'll be a higher L1 hit rate for streaming data.

So you can see the interface between the NEON unit and the rest of the core changed, but as far as I'm aware the NEON unit itself didn't. So the dispatch and latencies of the instructions should be the same, and they appear to be from the cycle charts. Note that on A9 NEON instructions still execute in order.

These differences could make a major difference in performance if you're loading from L2 cache or main memory, if there's no automatic prefetch or somehow it isn't kicking in. But I agree with everyone else that getting the exact same performance looks extremely suspicious. The amlogic SoC does have NEON (I've seen its datasheet); it also only has 128KB of L2 cache. It's possible NEON is disabled, but the only way you'd get the same performance is if a non-NEON path was compiled and executed. And if the non-NEON path is compiled from intrinsics it's hard to imagine that it'd end up being the same as the non-vectorized version, but for simple code like this it's possible. But that still wouldn't explain the ASM version performing the same. Benchmarking error seems like the most viable explanation...

I think the best way to get your bearings straight on this is to start with the simplest possible control loops and ensure that you're getting the right timings for some integer code running for a known number of cycles. Like, start with a loop with some nops, and grow it by a cycle or so at a time, adding independent instructions. Then start adding NEON instructions and see what happens - something like the sketch below.
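A minimal calibration sketch of that idea, assuming Linux-style clock_gettime is available and that CPU_FREQ_HZ is set to your board's actual core clock (an assumption - adjust for your SoC). Grow the asm block one independent instruction at a time and check that the measured cycles/iteration tracks your expectation:

/* Timing-calibration loop sketch: measure cycles per iteration
 * of a loop whose cost you can predict, before trusting any
 * NEON benchmark numbers. */
#include <stdio.h>
#include <time.h>

#define ITERATIONS  10000000UL
#define CPU_FREQ_HZ 800000000.0  /* assumption: set to your core clock */

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < ITERATIONS; i++) {
        /* grow this block by one independent instruction at a time */
        __asm__ volatile("nop\n\tnop\n\tnop\n\tnop");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec)
               + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("cycles/iteration ~= %.2f\n",
           sec * CPU_FREQ_HZ / ITERATIONS);
    return 0;
}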
With an image that big there is a large chance you are spending all of your time waiting for data from main memory, because it is a lot bigger than your cache.

Can you try with a smaller image (say, half the size of your L2 cache) and loop the benchmark inside the application multiple times, averaging the result, so that the timing uses a "warm cache"? That should at least rule out memory system effects and ensure you are timing the algorithm, not the memory system latency.

If you need to handle large data, consider using "preload data (PLD)" instructions to pull the data into the cache a few hundred cycles ahead of when you need it. This ensures that the CPU doesn't stall waiting for data. Most compilers have an intrinsic for this when you are using C code - something like the sketch below.
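A hedged sketch of software prefetch in a pixel loop. GCC's __builtin_prefetch emits PLD on ARM; the multiply-and-shift body and the 256-byte prefetch distance are illustrative assumptions - tune the distance to your memory latency:

/* Software-prefetch sketch: pull data toward the cache well
 * ahead of use so the load doesn't stall the pipeline. */
#include <stddef.h>
#include <stdint.h>

void process(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* prefetch a few hundred bytes ahead of the current read */
        if (i + 256 < n)
            __builtin_prefetch(src + i + 256);
        dst[i] = (uint8_t)((src[i] * 77) >> 8);  /* example operation */
    }
}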
Hmm, I can't believe that this is the problem. It doesn't explain why the times are different on the Cortex-A8... unless there is no cache in this low-cost SoC. Maybe you're right!
Ok, and finally: is the Cortex-A9 faster than the Cortex-A8? Can you give your results (C / asm / NEON) for both processors with the small picture? Can you give the frequency of your processors too? Thanks!
Could you tell us precisely how large the image is (an exact pixel count) and how many times you're calling the function to get the numbers you're getting? Then we can put together some rough cycles/iteration counts and analyze the loop to see how the numbers compare with what we expect.

It's actually interesting that the memory performance was holding you back more on the amlogic board than the i.MX51. I was actually considering using AML8276-M for a device over i.MX535... guess there would have been a good reason not to...
I mean, if my A9 doesn't have NEON, the app should crash and exit, and I wouldn't get any results from it, right?
Thank you so much!!! You are right: when I changed the image size from 10MB to 50KB, I got the expected time - about 5-6 times faster. I didn't know memory access was so time-consuming before. I can move forward now, thanks again.
So 16 cycles, as predicted. Note that you'd get a lot better performance if you unrolled this loop to fill up the latency after the last multiply and shift; unrolling it 4 times should be sufficient - something like the sketch below.
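A hedged sketch of that 4x unrolling, assuming the inner loop is a NEON widening multiply plus narrowing shift on 8-bit pixels (the exact kernel isn't shown in the thread; scale_pixels and the factor k are illustrative). Four independent dependency chains let later loads and multiplies issue while earlier results are still in flight, hiding the multiply/shift latency:

/* 4x-unrolled NEON multiply-and-shift sketch (compile with
 * -mfpu=neon). Each vld1/vmull/vrshrn/vst1 chain is independent,
 * so the core can overlap their latencies. */
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

void scale_pixels(uint8_t *dst, const uint8_t *src, size_t n, uint8_t k)
{
    uint8x8_t vk = vdup_n_u8(k);
    for (size_t i = 0; i + 32 <= n; i += 32) {
        uint8x8_t a = vld1_u8(src + i);
        uint8x8_t b = vld1_u8(src + i + 8);
        uint8x8_t c = vld1_u8(src + i + 16);
        uint8x8_t d = vld1_u8(src + i + 24);
        /* widen-multiply to 16 bits, then narrow back with a
         * rounding shift right by 8 */
        vst1_u8(dst + i,      vrshrn_n_u16(vmull_u8(a, vk), 8));
        vst1_u8(dst + i + 8,  vrshrn_n_u16(vmull_u8(b, vk), 8));
        vst1_u8(dst + i + 16, vrshrn_n_u16(vmull_u8(c, vk), 8));
        vst1_u8(dst + i + 24, vrshrn_n_u16(vmull_u8(d, vk), 8));
    }
    /* leftover pixels (n not a multiple of 32) would need a
     * scalar tail loop, omitted here for brevity */
}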