<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="https://community.arm.com/utility/feedstylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Cascading multiplication; NEON;</title><link>https://community.arm.com/developer/tools-software/tools/f/armds-forum/1121/cascading-multiplication-neon</link><description> Note: This was originally posted on 5th March 2013 at http://forums.arm.com Hi Guys! I&#39;ve just found another thing that confuses me about cascading multiplication in the NEON module. source1- http://pulsar.websha...sample-25fce1da source2- http://pulsar.websha</description><dc:language>en-US</dc:language><generator>Telligent Community 10</generator><item><title>RE: Cascading multiplication; NEON;</title><link>https://community.arm.com/thread/3488?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:08:53 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:b632f44a-4b17-4c0a-9a26-191857cf382c</guid><dc:creator>Gilead Kutnick</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 19th March 2013 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Well, this isn&amp;#39;t something that ARM documents, but there are a lot of things about NEON timing that you can&amp;#39;t find in datasheets.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;I hadn&amp;#39;t noticed originally that the second set of examples was using 128-bit registers. I think it&amp;#39;s this, and not the fact that you&amp;#39;re using scalars, that&amp;#39;s causing the stalls. The fact that you don&amp;#39;t get them with 64-bit multiplications with scalars also supports this.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;If you think about it, this behavior is what you would expect. 
The NEON unit on Cortex-A8 and Cortex-A9 can do 4x16-bit multiplications in one cycle. Doing an 8x16-bit multiplication is like doing two 4x16-bit ones back to back, and takes two cycles. Because it&amp;#39;s like alternating between two different multiplications it can&amp;#39;t forward the result because it&amp;#39;d have to forward two things where there&amp;#39;s probably only one internal accumulator.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;If the forwarding gets broken you have to wait the full latency which is 6 cycles. The second multiplication hides one of the latency cycles so you get a 4 cycle stall instead of 5 cycles.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;You can also find the forwarding broken if you put some (maybe any, haven&amp;#39;t tested) NEON instructions between the multiplications.&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Cascading multiplication; NEON;</title><link>https://community.arm.com/thread/3489?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:08:52 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:3403f6b5-32af-4414-bb38-bb53ceaec9b5</guid><dc:creator>Gilead Kutnick</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 5th March 2013 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;You need to try timing it on real hardware to see if it&amp;#39;s a limitation there and not just a problem with webshaker&amp;#39;s simulator. 
I can confirm that you can issue dependent vmla instructions back to back, but I haven&amp;#39;t tried the scalar operand version.&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Cascading multiplication; NEON;</title><link>https://community.arm.com/thread/3487?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:08:52 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:15ea7063-b454-4e92-9a17-57e03361fee1</guid><dc:creator>Gilead Kutnick</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 2nd April 2013 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;4x16-bit and 8x8-bit both take one cycle to issue. 8x16-bit and 16x8-bit take two cycles. You can see this in the Cortex-A8 and A9 TRMs.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;You can&amp;#39;t issue the two-cycle versions back to back without stalling for more cycles since it breaks the forwarding.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Basically, as far as the NEON unit is concerned, what you&amp;#39;re doing here:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;vmul.s16 q12,q4 ,d0[0]&lt;br /&gt;vmla.s16 q12,q5 ,d0[1]&lt;br /&gt;vmla.s16 q12,q6 ,d0[2]&lt;br /&gt;vmla.s16 q12,q7 ,d0[3]&lt;br /&gt;vmla.s16 q12,q8 ,d1[0]&lt;br /&gt;vmla.s16 q12,q9 ,d1[1]&lt;br /&gt;vmla.s16 q12,q10,d1[2]&lt;br /&gt;vmla.s16 q12,q11,d1[3]&lt;br /&gt;vmla.s16 q12,q1 ,q2&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;is the same as this:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;vmul.s16 d24,d8 ,d0[0]&lt;br /&gt;vmul.s16 d25,d9 ,d0[0]&lt;br /&gt;vmla.s16 d24,d10,d0[1]&lt;br /&gt;vmla.s16 d25,d11,d0[1]&lt;br /&gt;vmla.s16 d24,d12,d0[2]&lt;br /&gt;vmla.s16 d25,d13,d0[2]&lt;br /&gt;vmla.s16 d24,d14,d0[3]&lt;br /&gt;vmla.s16 d25,d15,d0[3]&lt;br /&gt;vmla.s16 d24,d16,d1[0]&lt;br /&gt;vmla.s16 d25,d17,d1[0]&lt;br /&gt;vmla.s16 d24,d18,d1[1]&lt;br 
/&gt;vmla.s16 d25,d19,d1[1]&lt;br /&gt;vmla.s16 d24,d20,d1[2]&lt;br /&gt;vmla.s16 d25,d21,d1[2]&lt;br /&gt;vmla.s16 d24,d22,d1[3]&lt;br /&gt;vmla.s16 d25,d23,d1[3]&lt;br /&gt;vmla.s16 d24,d2 ,d4&lt;br /&gt;vmla.s16 d25,d3 ,d5&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;It stalls because the operations aren&amp;#39;t really back to back and it can&amp;#39;t forward between two interleaved operations.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;If you instead did this manually:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;vmul.s16 d24,d8 ,d0[0]&lt;br /&gt;vmla.s16 d24,d10,d0[1]&lt;br /&gt;vmla.s16 d24,d12,d0[2]&lt;br /&gt;vmla.s16 d24,d14,d0[3]&lt;br /&gt;vmla.s16 d24,d16,d1[0]&lt;br /&gt;vmla.s16 d24,d18,d1[1]&lt;br /&gt;vmla.s16 d24,d20,d1[2]&lt;br /&gt;vmla.s16 d24,d22,d1[3]&lt;br /&gt;vmla.s16 d24,d2 ,d4&lt;br /&gt;&lt;br /&gt;vmul.s16 d25,d9 ,d0[0]&lt;br /&gt;vmla.s16 d25,d11,d0[1]&lt;br /&gt;vmla.s16 d25,d13,d0[2]&lt;br /&gt;vmla.s16 d25,d15,d0[3]&lt;br /&gt;vmla.s16 d25,d17,d1[0]&lt;br /&gt;vmla.s16 d25,d19,d1[1]&lt;br /&gt;vmla.s16 d25,d21,d1[2]&lt;br /&gt;vmla.s16 d25,d23,d1[3]&lt;br /&gt;vmla.s16 d25,d3 ,d5&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;The result should be the same but you shouldn&amp;#39;t get any stalls.&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Cascading multiplication; NEON;</title><link>https://community.arm.com/thread/3485?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:08:52 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:f4648287-8255-404d-94c8-7728cb459819</guid><dc:creator>Green Troll</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 2nd April 2013 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;If you think about it, this behavior is what you would expect. 
The NEON unit on Cortex-A8 and Cortex-A9 can do 4x16-bit multiplications in one cycle. Doing an 8x16-bit multiplication is like doing two 4x16-bit ones back to back, and takes two cycles. Because it&amp;#39;s like alternating between two different multiplications it can&amp;#39;t forward the result because it&amp;#39;d have to forward two things where there&amp;#39;s probably only one internal accumulator.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Exophase!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Where did you get the information about &amp;quot;4x16-bit&amp;quot; and &amp;quot;8x16-bit&amp;quot;? Actually, in the first example there were 8x8-bit multiplications, which worked well....&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Cascading multiplication; NEON;</title><link>https://community.arm.com/thread/3484?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:08:52 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:2c583f08-141a-4ebe-b2de-c4af59751b3c</guid><dc:creator>Green Troll</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 15th March 2013 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;UP!&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Cascading multiplication; NEON;</title><link>https://community.arm.com/thread/3486?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 11:08:52 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:750b6867-b757-443a-be26-72f0276e0924</guid><dc:creator>Green Troll</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 6th March 2013 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Exophase!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;To be 
honest, I tried this web simulator after I saw a significant difference in performance on a Motorola phone (Cortex-A9, 1.2 GHz). &lt;/span&gt;&lt;br /&gt;&lt;span&gt;As you can see, the function interpolates a data block. For an 8x8 block run 500*10^6 times I got 188 sec in one case against 101 sec in the other, so the software simulator and the hardware gave almost the same results: Perf1/Perf2 = 2.02 in the simulator and 1.86 on hardware.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;So it seems that the simulator works correctly.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Also, you can see another variant of multiplication - &lt;/span&gt;&lt;a href="http://pulsar.webshaker.net/ccc/sample-42bd665a" rel="nofollow"&gt;http://pulsar.websha...sample-42bd665a&lt;/a&gt;&lt;br /&gt;&lt;span&gt;In this case, all multiplications follow each other without delays, in spite of using scalar operands.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;n.9-0&amp;#160;&amp;#160;&amp;#160; 1c n0&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmull.s16 q13,d9 ,d0[0]&lt;br /&gt;n.10-0&amp;#160;&amp;#160; 1c n0&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmlal.s16 q13,d11,d0[1]&lt;br /&gt;n.11-0&amp;#160;&amp;#160; 1c n0&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmlal.s16 q13,d13,d0[2]&lt;br /&gt;n.12-0&amp;#160;&amp;#160; 1c n0&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmlal.s16 q13,d15,d0[3]&lt;br /&gt;n.13-0&amp;#160;&amp;#160; 1c n0&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmlal.s16 q13,d17,d1[0]&lt;br /&gt;n.14-0&amp;#160;&amp;#160; 1c n0&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmlal.s16 q13,d19,d1[1]&lt;br /&gt;n.15-0&amp;#160;&amp;#160; 1c n0&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmlal.s16 q13,d21,d1[2]&lt;br /&gt;n.16-0&amp;#160;&amp;#160; 1c n0&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmlal.s16 q13,d23,d1[3]&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;I&amp;#39;m completely confused by all these options&lt;/span&gt;&lt;a 
href="/cfs-file/__key/communityserver-discussions-components-files/15/2146.blink.gif"&gt;&lt;img border="0" height="20" src="/cfs-file/__key/communityserver-discussions-components-files/15/2146.blink.gif" width="20" alt=" " /&gt;&lt;/a&gt;&lt;span&gt;.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;All in all, my question is still open.&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item></channel></rss>