<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="https://community.arm.com/utility/feedstylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/developer/tools-software/tools/f/armds-forum/898/differences-between-neon-in-cortex-a8-and-a9</link><description> Note: This was originally posted on 25th July 2011 at http://forums.arm.com Currently i am working on a Cortex-A9 single-core chip(AML8726-m if you want to know more), and in the datasheet it&amp;#39;s said there is a neon in it. But when i test the code here</description><dc:language>en-US</dc:language><generator>Telligent Community 10</generator><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2606?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:20 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:376175df-a5d4-46ec-b762-9163076e42b4</guid><dc:creator>Mohamed Jauhar</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 29th November 2012 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Any one have document about Cortex-A9 pipeline ?&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2607?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:20 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:8d6c86de-e425-4065-96fe-f9bbaf55233a</guid><dc:creator>Liad Weinberger</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 8th August 2012 at &lt;a 
href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vld3.8&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; {d0-d2}, [r1]!&amp;#160;&amp;#160; @ cycles 0-3, result in N2 of last cycle&lt;br /&gt;&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmull.u8&amp;#160;&amp;#160;&amp;#160; q3, d0, d5&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; @ cycle 4 (can&amp;#39;t dual issue due to previous result in N2)&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmlal.u8&amp;#160;&amp;#160;&amp;#160; q3, d1, d4&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; @ cycle 5&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmlal.u8&amp;#160;&amp;#160;&amp;#160; q3, d2, d3&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; @ cycle 6, result in N6&lt;br /&gt;&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vshrn.u16&amp;#160;&amp;#160; d6, q3, #8&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; @ cycle 12 (value needed in N1, 5 cycle stall), result in N3&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vst1.8&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; {d6}, [r0]!&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; @ cycle 15 (value needed in N1, 2 cycle stall)&lt;br /&gt;&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; subs&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; r2, r2, #1&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; @ overlaps w/NEON&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; 
bne&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; .loop&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; @ overlaps w/NEON&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;So 16 cycles like predicted. Note that you&amp;#39;d get a lot better performance if you unrolled this loop to fill up the latency after the last multiply and shift. Doing it 4 times should be sufficient.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;I&amp;#39;m coming to this a bit late, so sorry if it doesn&amp;#39;t interest you anymore. However I found some problems in this analysis, or maybe I&amp;#39;m falling short on something. Please correct me if I&amp;#39;m wrong:&lt;/span&gt;&lt;ul&gt;&lt;li&gt;In the &lt;strong&gt;vshrn.u16&lt;/strong&gt; instruction you said 5-cycle stall, which I agree on, however you counted 6 cycles. The same extra cycle is counted in the &lt;strong&gt;vst1.8&lt;/strong&gt; instruction, which is supposed to stall for 2 cycles, yet stalls for 3. If this is correct then your analysis should have shown 14 cycles, not 16.&lt;/li&gt;&lt;li&gt;Now, don&amp;#39;t the &lt;strong&gt;vmlal.u8&lt;/strong&gt; instructions require &lt;strong&gt;q3&lt;/strong&gt; as source in N3, which would stall their execution by 3 cycles each?&lt;/li&gt;&lt;li&gt;This is just an observation about the reasoning, but the fact that the &lt;strong&gt;vmull.u8&lt;/strong&gt; instruction is at cycle 4 has nothing to do with waiting for the result of the load instruction. 
The load instruction just takes 4 cycles to issue.&lt;/li&gt;&lt;/ul&gt;&lt;span&gt;If I&amp;#39;m correct, then this could be scheduled in 18 cycles, not 16 (or 14).&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2605?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:19 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:5e7bcbbf-13cd-48be-8619-61dddcbcd4c9</guid><dc:creator>Krish ks</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 23rd March 2013 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Hi, &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;I ran a NEON test on a Linux board. I am doing 4*4 matrix multiplication using ARM &amp;amp; NEON instructions.&lt;/span&gt;&lt;br /&gt;&lt;span&gt; (1)&amp;#160;&amp;#160; Matrix multiplication: Method of&amp;#160; calculating one by one. Here I have used only S registers. (Normal ARM instructions)&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&amp;#160; Here I am loading the float array content to S registers (32-bit)&amp;#160; using &amp;quot;vldmia&amp;quot; and then &amp;quot;vmul.f32&amp;quot; and &amp;quot;vmla.f32&amp;quot; to perform matrix&amp;#160; multiplication using S registers as operands and to hold the result.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt; (2)&amp;#160;&amp;#160; Matrix multiplication: Since 128&amp;#160; bit calculation is done, the&amp;#160; number of instructions will become 1/4 compared to&amp;#160; (1). Here I have&amp;#160; used Q and D registers. 
(Neon instructions)&lt;/span&gt;&lt;br /&gt;&lt;span&gt;Here&amp;#160; I am loading the complete float array content to Q registers(128&amp;#160; bit) using &amp;quot;vldmia&amp;quot; and then &amp;quot;vmul.f32&amp;quot; and &amp;quot;vmla.f32&amp;quot; to perform matrix&amp;#160; multiplication using Q(128 bit) and D (64 bit) registers which will&amp;#160; obviously reduce the number of instructions ( Load, store,&amp;#160; multiplication) to 1/4 th of (1) code.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;I am using linux 3.0.35&amp;#160; and test code is executed on Linux platform (Cortex-a9 architecture) .&lt;/span&gt;&lt;br /&gt;&lt;span&gt;But there is no speed difference between (1) and (2). &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;In my Linux kernel configuration following options enabled&lt;/span&gt;&lt;br /&gt;&lt;span&gt;CONFIG_VFP=y&lt;/span&gt;&lt;br /&gt;&lt;span&gt;CONFIG_VFPv3=y&lt;/span&gt;&lt;br /&gt;&lt;span&gt;CONFIG_NEON=y&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Following gcc command I have used to build the NEON application and&amp;#160; gcc compiler version is gcc 4.6.2&lt;/span&gt;&lt;br /&gt;&lt;span&gt;gcc&amp;#160; -march=armv7-a -mtune=cortex-a9 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=hard&amp;#160; -o test.out test.c&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Why I dint find any performance difference between normal ARM and NEON codes?&lt;/span&gt;&lt;br /&gt;&lt;span&gt;I have tested same code with the Cortex-A8 and I am able to achieve the performance difference.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Thanks in advance&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2604?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:19 GMT</pubDate><guid 
isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:6e8f35ea-a09f-4e22-be86-d03df324c49a</guid><dc:creator>Krish ks</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 23rd March 2013 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Hi Shervin,&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;I asked&amp;#160; question in this post because I have got performance difference with Cortex-a8 cpu, but not with cortex-A9 cpu.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;So just I was eager to know why that is happening..(May be because of any Neon difference in Cortex-a8 and A9).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Regards,&lt;/span&gt;&lt;br /&gt;&lt;span&gt;Krishna&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2603?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:19 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:23d099aa-4c7e-4953-9fce-befcc31c7ea1</guid><dc:creator>Peter Harris</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 8th August 2012 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;R.E. (1) - it&amp;#39;s a 5 cycle stall and 1 cycles to issue - 6 cycles in total.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;R.E. (2) - I&amp;#39;m not sure in this specific case, but this is a common usage so most MAC instructions tend to have a special forwarding path for the accumulator register, so there is no stall.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;R.E. 
(3) - Correct.&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2602?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:19 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:5992fda8-215c-438f-818c-7e13af37ec4e</guid><dc:creator>Peter Harris</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 26th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;With an image that big there is a good chance you are spending all of your time waiting for data from main memory, because it is a lot bigger than your cache.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Can you try with a smaller image (say half the size of your L2 cache) and loop the benchmark inside the application multiple times and average the result, so that the timing is using a &amp;quot;warm cache&amp;quot;? That should at least rule out memory system effects and ensure you are timing the algorithm, not the memory system latency.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;If you need to handle large data consider using &amp;quot;preload data (PLD)&amp;quot; instructions to pull the data into the cache a few hundred cycles ahead of when you need it. This ensures that the CPU doesn&amp;#39;t stall waiting for data. 
Most compilers have an intrinsic for this when you are using C code.&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2601?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:18 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:f39fa24f-b51b-4ff9-96ad-d3eda00295cf</guid><dc:creator>Peter Harris</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 25th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Yes, the two implementations of NEON are different, so I&amp;#39;d expect different performance numbers between the two cores.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Can you give us an example of an algorithm you are trying, and how you are building it? The fact you see absolutely no performance difference is &amp;quot;suspicious&amp;quot; - I&amp;#39;d expect some difference, even if only small. Check you are not running the same binary 3 times - it seems like the obvious explanation for three identical performance numbers =)&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2600?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:18 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:694cd15c-7660-4684-b13b-8ee1aabf8196</guid><dc:creator>Gilead Kutnick</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 27th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Yeah, should have said unrolling and/or software pipelining. 
Although you still left one stall cycle there &lt;/span&gt;&lt;a href="http://forums.arm.com/public/style_emoticons/default/wink.gif"&gt;&lt;img alt=";)" src="http://forums.arm.com/public/style_emoticons/default/wink.gif" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2599?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:18 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:0f0b63e3-83db-4c73-a4ba-432ebff20cce</guid><dc:creator>Gilead Kutnick</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 27th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Okay, so with 16384 pixels and 8 pixels per iteration that&amp;#39;s 2048 iterations per loop. That&amp;#39;s a little low for trying to remove the function overhead, but it should still be a small fraction of a percent so I&amp;#39;ll just ignore it for now. The bigger error is going to be the roundoff on the time measurement. 400 calls makes 819200 iterations. 
17ms&amp;#160;&amp;#160;&amp;#160;&amp;#160; 13ms&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; 20ms&amp;#160;&amp;#160;&amp;#160; NEON-ASM-CODE&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;That makes:&lt;/span&gt;&lt;br /&gt;&lt;span&gt;20.75ns/loop on the i.MX51&lt;/span&gt;&lt;br /&gt;&lt;span&gt;15.86ns/loop on the S5PC110&lt;/span&gt;&lt;br /&gt;&lt;span&gt;24.41ns/loop on the AML8726-M&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;In cycles:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;16.6 cycles/loop on the i.MX51&lt;/span&gt;&lt;br /&gt;&lt;span&gt;15.86 cycles/loop on the S5PC110&lt;/span&gt;&lt;br /&gt;&lt;span&gt;19.52 cycles/loop on the AML8726-M&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;On the A8 you&amp;#39;d expect:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vld3.8&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; {d0-d2}, [r1]!&amp;#160;&amp;#160; @ cycles 0-3, result in N2 of last cycle&lt;br /&gt;&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmull.u8&amp;#160;&amp;#160;&amp;#160; q3, d0, d5&amp;#160;&amp;#160;&amp;#160; @ cycle 4 (can&amp;#39;t dual issue due to previous result in N2)&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmlal.u8&amp;#160;&amp;#160;&amp;#160; q3, d1, d4&amp;#160;&amp;#160;&amp;#160; @ cycle 5&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmlal.u8&amp;#160;&amp;#160;&amp;#160; q3, d2, d3&amp;#160;&amp;#160;&amp;#160; @ cycle 6, result in N6&lt;br /&gt;&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vshrn.u16&amp;#160;&amp;#160; d6, q3, #8&amp;#160;&amp;#160;&amp;#160; @ cycle 12 (value needed in N1, 5 cycle stall), result in N3&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vst1.8&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; {d6}, 
[r0]!&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; @ cycle 15 (value needed in N1, 2 cycle stall)&lt;br /&gt;&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; subs&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; r2, r2, #1&amp;#160;&amp;#160;&amp;#160; @ overlaps w/NEON&lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; bne&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; .loop&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; @ overlaps w/NEON&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;So 16 cycles like predicted. Note that you&amp;#39;d get a lot better performance if you unrolled this loop to fill up the latency after the last multiply and shift. Doing it 4 times should be sufficient.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Your total image size is exhausting L1 data-cache on all platforms, so at least some of the time the loads will come from L2. This is where you might be hit by latency on the Cortex-A9. It wouldn&amp;#39;t seem like you&amp;#39;re hitting the full latency, although it&amp;#39;s possible you&amp;#39;re only missing in L1 cache 33% of the time on the AM8726-M (32KB of L1 data cache), and the vld itself would be hiding some of the latency.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;It&amp;#39;d be interesting to try it again with a smaller image that fits entirely in L1 cache, and with far more calls to the function (to get in the thousands of ms instead of tens)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;From the very beginning, I&lt;br /&gt;don&amp;#39;t think AML8726-M is a good platform for its 128KB L2 and 65nm fab&lt;br /&gt;process, but its multimedia performance is pretty well, 1080P, Mali&lt;br /&gt;400. 
&lt;br /&gt;What are the differences between imx515 and imx535, freq?&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;i.MX53 is a die shrink to 45nm with some new features. The CPU clock goes up to 1.2GHz and the memory clock up to 400MHz (but the CPU bus clock is only 200MHz); it adds support for LPDDR2 and DDR3, runs the GPU at up to 200MHz with its SRAM doubled, and has 1080p decoding. This document describes it: &lt;/span&gt;&lt;a href="http://www.freescale.com/files/32bit/doc/app_note/AN4271.pdf" rel="nofollow"&gt;http://www.freescale...note/AN4271.pdf&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;To me Mali-400 in AML8726-M doesn&amp;#39;t seem like a strong competitor since it&amp;#39;s probably only single core. The SGX 540 in S5PC110 can surely beat it.. if given a choice between the two I&amp;#39;d definitely go for the Samsung part.&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2598?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:18 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:75e82547-9719-46dd-8d18-595e187897a7</guid><dc:creator>Gilead Kutnick</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 26th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Could you tell us precisely how large the image is (in pixels, an exact count) and how many times you&amp;#39;re calling the function to get the numbers you&amp;#39;re getting? Then we can put together some rough cycles/iteration counts and analyze the loop to see how the numbers compare with what we expect.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;It&amp;#39;s actually interesting that the memory performance was holding you back more on the amlogic board than the i.MX51. 
I was actually considering using AML8726-M for a device over i.MX535.. guess there would have been a good reason not to..&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2596?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:18 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:fd061551-8d08-430d-81fd-4412941e98a7</guid><dc:creator>Gilead Kutnick</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 8th August 2012 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;isogen&amp;#39;s answers are right.. to elaborate a little bit more: if you have an instruction that outputs in N3 and the next one right after it needs its result in N2 then there&amp;#39;ll be a cycle in between where the NEON unit is doing nothing. So the second one will start two cycles after the first one.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;You should try setting up a test loop that runs iterations of code like this many times, so you can time how long it takes and see for yourself. 
Then you can change instructions one at a time and see what happens.&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2597?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:18 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:93f5e4d3-4c04-422e-bd4d-9cd595ece983</guid><dc:creator>Gilead Kutnick</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 25th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;I haven&amp;#39;t tested NEON on Cortex-A9 directly, but according to available information the following should be true:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;- On Cortex-A8 the NEON unit can dual issue a load, store, or permute type instruction with any other type of instruction. On Cortex-A9 the NEON unit is described as only accepting one dispatch per cycle, so this probably precludes this sort of dual-issue.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;- On Cortex-A8 the NEON pipeline begins after the main pipeline is completely done, while on Cortex-A9 it runs in parallel, with dispatch to it (presumably to a queue like in A8) occurring fairly early in the pipeline. However, in the A8 pipeline loads to NEON registers are queued and serviced well before the NEON pipeline itself begins. This allows for hiding latency, not only from L1 cache (load-use penalty) but even some or all from L2 cache. The queuing also allows for limited out-of-order loading (allowing hit under miss). So on A9 NEON loads will suffer from higher latency.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;- On the other hand, preloads on Cortex-A9 go to L1 cache instead of L2 cache, and there&amp;#39;s now an automatic preload engine (at least as an option, don&amp;#39;t know if the amlogic SoC implements it). 
So there&amp;#39;ll be a higher L1 hit-rate for streaming data.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;So you can see the interface between the NEON unit and the rest of the core changed, but as far as I&amp;#39;m aware the NEON unit itself didn&amp;#39;t. So the dispatch and latencies of the instructions should be the same, and would appear to be from the cycle charts. Note that on A9 NEON instructions still execute in order.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;These differences could have a major change in performance if you&amp;#39;re loading from L2 cache or main memory, if there&amp;#39;s no automatic prefetch or somehow it isn&amp;#39;t kicking in. But I agree with everyone else that getting the exact same performance looks extremely suspicious. The amlogic SoC does have NEON (I&amp;#39;ve seen its datasheet), it also only has 128KB of L2 cache. It&amp;#39;s possible NEON is disabled, but the only way you&amp;#39;d get the same performance is if a non-NEON path was compiled and executed. And if the non-NEON path is compiled from intrinsics it&amp;#39;s hard to imagine that it&amp;#39;d end up being the same as the non-vectorized version, but for simple code like this it&amp;#39;s possible. But that still wouldn&amp;#39;t explain the ASM version performing the same. Benchmarking error seems like the most viable explanation...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;I think the best way to get your bearings straight on this is to start with the simplest possible control loops and ensure that you&amp;#39;re getting the right timings for some integer code running for some number of cycles. Like, start with a loop with some nops, and grow it by a cycle or so at a time adding independent instructions. 
Then start adding NEON instructions and see what happens.&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2595?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:17 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:ab145226-6d9c-48e1-bd96-7d9bc56e6623</guid><dc:creator>Shervin Emami</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 23rd March 2013 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;KP100, please don&amp;#39;t ask the same question on 2 different posts. I already answered on your other post, saying that float multiply hardware is just 32 bits wide so it doesn&amp;#39;t matter if you use S registers or Q registers, there won&amp;#39;t be a speed difference, whereas other operations like addition have wider hardware so they can be faster in Q registers than S registers.&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2594?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:17 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:217d2e68-a86e-43e6-aaa0-23aed3b413b4</guid><dc:creator>Shervin Emami</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 9th August 2012 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;I&amp;#39;ve also found the same sort of problems in most of my image processing code, where NEON typically gives about a 20x boost on a Cortex-A8 but only about a 3x boost on a Cortex-A9 CPU! 
Like the guys have mentioned already in this post, there are many reasons why Cortex-A9 is faster in some ways and slower in other ways (I also compare Cortex-A8 with Cortex-A9 on my webpage &amp;quot;&lt;/span&gt;&lt;a href="http://www.shervinemami.info/armAssembly.html" rel="nofollow"&gt;http://www.shervinemami.info/armAssembly.html&lt;/a&gt;&lt;span&gt;&amp;quot;). But as you&amp;#39;ve noticed, it&amp;#39;s very important that you try different amounts &amp;amp; positions for Cache Preloading using PLD instructions, because like someone else mentioned early in the post, your device is mostly just waiting on data from memory, rather than doing NEON operations on it!&lt;/span&gt;&lt;br /&gt;&lt;span&gt; &lt;/span&gt;&lt;br /&gt;&lt;span&gt;So if you are working with megapixel images then you should worry less about counting NEON clock cycles and think in terms of memory stalls, because that is where most of the time will go to!&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2592?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:17 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:a7b5652c-42a7-47e7-acb8-227597790031</guid><dc:creator>Shervin Emami</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 30th November 2012 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;Any one have document about Cortex-A9 pipeline ?&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;ARM does :-) Actually the full specs for Cortex-A9 are in several different documents. Google for &amp;quot;ARM Cortex-A9 TRM&amp;quot; to get the main official document, and &amp;quot;ARM Cortex-A9 NEON TRM&amp;quot; for the one about NEON. 
I also highly recommend reading the Programmer&amp;#39;s Guide (Google for &amp;quot;ARM Cortex-A Series Programmers Guide&amp;quot;), it provides a lot of useful info.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;-Shervin.&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2591?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:17 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:0d127485-d492-439b-9b7a-b89af3c333d3</guid><dc:creator>Etienne SOBOLE</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 1st August 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;This time, with a small image, 128*128 resolution, the time is shortened from 16.7ms to 11.3ms on my i.MX51.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;I don&amp;#39;t remember the performance improvement I got when I ran the test!&lt;/span&gt;&lt;br /&gt;&lt;span&gt;I thought it was nearly 2 times faster!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;&lt;br /&gt;But on my A9, the improvement is so tiny, just 1ms, from 20ms to 19ms.&lt;br /&gt;So I&amp;#39;m confused again.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Well.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;I don&amp;#39;t know why, but it is not really a surprise.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;The Cortex A9 focuses on out-of-order execution and high-frequency SoCs.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;The cycle table is not detailed, but what is given leads me to suppose the Cortex A9 is slower than the Cortex A8 (at the same frequency).&lt;/span&gt;&lt;br /&gt;&lt;span&gt;With NEON (and so with the code you tried) there should be no difference on a same-frequency 
proc.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;On the other hand, the Cortex-A9 should be able to run at a higher frequency than the Cortex-A8.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Finally, the Cortex-A9 seems designed to speed up the poor code produced by compilers, and should not do as well with Cortex-A8-optimized code.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;For me, this CPU (the A9) is not a good choice for the moment. Below 1.2 or 1.5 GHz, it is not a valid choice for an assembly coder.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Maybe one day ARM will give us the pipeline stages of the A9 instructions, and then we&amp;#39;ll know a little more about it.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;But that doesn&amp;#39;t seem to be for now!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Etienne&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2590?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:17 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:655e8f0e-d94d-4746-81e0-6cde56fccc7f</guid><dc:creator>Etienne SOBOLE</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 28th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;Although you still left one stall cycle there &lt;a href="http://forums.arm.com/public/style_emoticons/default/wink.gif"&gt;&lt;img alt=";)" src="http://forums.arm.com/public/style_emoticons/default/wink.gif" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Yes, that way developers will have the pleasure of optimizing a little bit more &lt;/span&gt;&lt;a href="http://forums.arm.com/public/style_emoticons/default/wink.gif"&gt;&lt;img alt=";)" 
src="http://forums.arm.com/public/style_emoticons/default/wink.gif" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2589?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:17 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:b1b87957-a1f2-4584-9d40-303dddf39d58</guid><dc:creator>Etienne SOBOLE</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 27th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;So 16 cycles like predicted. Note that you&amp;#39;d get a lot better performance if you unrolled this loop to fill up the latency after the last multiply and shift. Doing it 4 times should be sufficient.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;or by trying this code&lt;/span&gt;&lt;br /&gt;&lt;a href="http://pulsar.webshaker.net/ccc/result.php?lng=fr&amp;amp;sample=3" rel="nofollow"&gt;http://pulsar.webshaker.net/ccc/result.php?lng=fr&amp;amp;sample=3&lt;/a&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2588?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:16 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:a06279f2-29a8-4be5-8a39-12074ff24c86</guid><dc:creator>Etienne SOBOLE</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 26th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;Thank you so much!!!&lt;br /&gt;You are right, when I changed the image size from 10MB to 50KB, I got the wanted time----about 5-6 times 
faster&lt;br /&gt;I didn&amp;#39;t know the memory access is so time consuming before.&lt;br /&gt;I can move forward now, thanks again.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Ok.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;And finally:&lt;/span&gt;&lt;br /&gt;&lt;span&gt;Is the Cortex-A9 faster than the Cortex-A8?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Can you give your results (C / asm / NEON) for both processors with the small picture?&lt;/span&gt;&lt;br /&gt;&lt;span&gt;Can you give the frequency of your processors too?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Thanks&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2587?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:16 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:e841517b-f1b7-48e0-a874-f70cac00e1f0</guid><dc:creator>Etienne SOBOLE</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 25th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;That&amp;#39;s strange...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;It&amp;#39;s possible that your tests take the same time if your code properly checks whether NEON is available...&lt;/span&gt;&lt;br /&gt;&lt;span&gt;In that case, maybe you don&amp;#39;t have NEON on your Cortex-A9 (Tegra 2, for example).&lt;/span&gt;&lt;br /&gt;&lt;span&gt;I could not find any information about your processor.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.amlogic.com/product01.htm" rel="nofollow"&gt;http://www.amlogic.com/product01.htm&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;They don&amp;#39;t mention NEON... 
so!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;In that case all your functions call the basic ARM assembly code!&lt;/span&gt;&lt;br /&gt;&lt;span&gt;That could explain the identical timings!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Etienne&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2586?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:16 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:42d0326a-699a-440d-bbb9-39588ae62347</guid><dc:creator>Etienne SOBOLE</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 26th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;With an image that big there is a large chance you are spending all of your time waiting for data from main memory, because it is a lot bigger than your cache.&lt;br /&gt;&lt;br /&gt;Can you try with a smaller image (say half the size of your L2 cache) and loop the benchmark inside the application multiple times and average the result, so that the timing is using a &amp;quot;warm cache&amp;quot;. That should at least rule out memory system effects and ensure you are timing the algorithm, not the memory system latency.&lt;br /&gt;&lt;br /&gt;If you need to handle large data consider using &amp;quot;preload data (PLD)&amp;quot; instructions to pull the data into the cache a few hundred cycles ahead of when you need it. This ensure that the CPU doesn&amp;#39;t stall waiting for data. Most compilers have an intrinsic for this when you are using C code.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Hmm. 
I can&amp;#39;t believe that this is the problem.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;It does not explain why the times are different on the Cortex-A8...&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Unless this low-cost SoC has no cache.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Maybe you&amp;#39;re right!&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2585?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:16 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:15e2809d-681b-4028-92f0-a56a19f3850f</guid><dc:creator>Etienne SOBOLE</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 26th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;I mean if my A9 doesn&amp;#39;t have NEON, I think the app should crash and exit and I cannot get any results from it, right?&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Sure.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;If you haven&amp;#39;t added a specific check, your app can&amp;#39;t fall back to default code when NEON is not present.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;What is the size of your pixel array?&lt;/span&gt;&lt;br /&gt;&lt;span&gt;107 ms is very slow, in fact!!!&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2584?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:16 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:d22974b0-2eb2-4e2c-ba1a-67cf1c1d624f</guid><dc:creator>Kun Feng</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 27th July 2011 at &lt;a 
href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;Could you tell us precisely how large the image is (in pixels, an exact count) and how many times you&amp;#39;re calling the function to get the numbers you&amp;#39;re getting? Then we can put together some rough cycles/iteration counts and analyze the loop to see how the numbers compare with what we expect.&lt;br /&gt;&lt;br /&gt;It&amp;#39;s actually interesting that the memory performance was holding you back more on the amlogic board than the i.MX51. I was actually considering using AML8276-M for a device over i.MX535.. guess there would have been a good reason not to..&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;The resolution is 128*128, and I repeated it 400 times. The freq is 800MHz, so it&amp;#39;s about 20ms*800MHz/400times/128*128pixels=2.44 cycle/pixel? I don&amp;#39;t know how to calculate it actually.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;From the very beginning, I don&amp;#39;t think AML8726-M is a good platform for its 128KB L2 and 65nm fab process, but its multimedia performance is pretty well, 1080P, Mali 400. 
&lt;/span&gt;&lt;br /&gt;&lt;span&gt;What are the differences between the i.MX515 and the i.MX535, frequency?&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2583?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:16 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:25e9ac65-d448-4f9b-a786-05caedd09a80</guid><dc:creator>Kun Feng</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 26th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;Ok.&lt;br /&gt;&lt;br /&gt;And finally:&lt;br /&gt;Is the Cortex-A9 faster than the Cortex-A8?&lt;br /&gt;&lt;br /&gt;Can you give your results (C / asm / NEON) for both processors with the small picture?&lt;br /&gt;Can you give the frequency of your processors too?&lt;br /&gt;&lt;br /&gt;Thanks&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;I have two A8s and one A9:&lt;/span&gt;&lt;br /&gt;&lt;span&gt;A8: i.MX515(800MHz, 256KB L2) S5PC110(1GHz, 256KB L2)&lt;/span&gt;&lt;br /&gt;&lt;span&gt;A9: AML8726-M(800MHz, 128KB L2)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Here are the results on the three platforms:&lt;/span&gt;&lt;br /&gt;&lt;span&gt;i.MX515&amp;#160;&amp;#160;&amp;#160;&amp;#160; S5PC110&amp;#160;&amp;#160;&amp;#160;&amp;#160; AML8726-M&lt;/span&gt;&lt;br /&gt;&lt;span&gt;135ms&amp;#160;&amp;#160;&amp;#160;&amp;#160; 108ms&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; 117ms&amp;#160;&amp;#160;&amp;#160; ARM-C-CODE&lt;/span&gt;&lt;br /&gt;&lt;span&gt;76ms&amp;#160;&amp;#160; 60ms&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; 48ms&amp;#160;&amp;#160;&amp;#160; NEON-C-CODE&lt;/span&gt;&lt;br /&gt;&lt;span&gt;17ms&amp;#160;&amp;#160;&amp;#160;&amp;#160; 13ms&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; 
20ms&amp;#160;&amp;#160;&amp;#160; NEON-ASM-CODE&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;So, the A9 is not the fastest.&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Differences between NEON in Cortex-A8 and A9</title><link>https://community.arm.com/thread/2582?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:04:15 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:f16b0627-a24f-4b8c-8960-8e2e4f21a126</guid><dc:creator>Kun Feng</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 26th July 2011 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;Hmm. I can&amp;#39;t believe that this is the problem.&lt;br /&gt;It does not explain why the times are different on the Cortex-A8...&lt;br /&gt;&lt;br /&gt;Unless this low-cost SoC has no cache.&lt;br /&gt; &lt;br /&gt;Maybe you&amp;#39;re right!&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;I&amp;#39;ve fixed it now; it was memory latency that caused the problem.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;And I think I was a little bit &amp;quot;lucky&amp;quot; to get three almost identical times on my A9 platform.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;So my A9 has poor memory performance, which I will have to watch out for in future.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;Thanks for your help!!!&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item></channel></rss>