<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="https://community.arm.com/utility/feedstylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Neon reg to ARM reg data transfer</title><link>https://community.arm.com/developer/tools-software/tools/f/armds-forum/647/neon-reg-to-arm-reg-data-transfer</link><description>Note: This was originally posted on 30th July 2009 at http://forums.arm.com. I'm transferring data from a NEON register to an ARM register, which is very costly, i.e., each vmov.32 r6,do[2] takes around 13 to 18 cycles. This is proving to be very</description><dc:language>en-US</dc:language><generator>Telligent Community 10</generator><item><title>RE: Neon reg to ARM reg data transfer</title><link>https://community.arm.com/thread/1630?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 10:59:12 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:26640574-4ca6-40bf-b6f6-e679306d2b42</guid><dc:creator>Peter Harris</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 30th July 2009 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;You cannot reduce the NEON-ARM or ARM-NEON register move costs - these are caused by the microarchitecture of the NEON unit for the Cortex-A8.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;One of the key parts of vectorizing for the NEON unit is to try to minimize the interaction between the main ARM pipeline and the NEON unit, pushing relatively large blocks of code through the NEON unit without too much dependence on what is happening in the ARM pipeline. 
Any interaction is fairly time-consuming, as you have noted, but for many common tasks such as media codecs the need for interaction is actually quite low.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;It is worth noting that the NEON unit has its own load/store hardware, so NEON code can make its own memory accesses and does *not* need the ARM to load data for it (data loaded that way then has to be vmov&amp;#39;d into the NEON unit, which is slow).&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Neon reg to ARM reg data transfer</title><link>https://community.arm.com/thread/1631?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 10:59:12 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:e1ad0302-0783-4122-ac51-f95e1ad5dae9</guid><dc:creator>unclep unclep</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 31st July 2009 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;You might find example Cortex-A8-optimized code useful, such as this:&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&lt;a href="http://www.arm.com/products/multimedia/openmax/" target="_blank"&gt;http://www.arm.com/products/multimedia/openmax/&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Neon reg to ARM reg data transfer</title><link>https://community.arm.com/thread/1629?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 10:59:12 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:10b9a679-85d2-4ccf-a003-4e8892a6f3ca</guid><dc:creator>Vishwa Vishwa</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 13th August 2009 at &lt;a 
href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Hi guys,&lt;/span&gt;&lt;br /&gt;&lt;span&gt;I found a way around this. There were some 16 such instructions (NEON-to-ARM transfer instructions) in my function, accounting for some 208 cycles per call (latency for vmov.u32 r6,d26[0] = 14), and this function was being called some 20 thousand times. It was eating up a lot of time.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;But I had an opportunity here: interleaving 14 independent instructions in between. 
&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;The code was something like this:&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmov.32 r6,d[0]&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; - instructions which use r6 -&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; - instruction -&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; - instruction -&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; - instruction - ...&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&amp;#160;&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160; vmov.32 r6,d[0]&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; - instructions which use r6 -&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; - instruction -&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; - instruction -&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; - instruction - ...&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160; ... etc.&lt;/span&gt;&lt;br /&gt;&lt;span&gt;So, looking at the code structure, the remaining option is to interleave. 
Take the instructions that use the result in r6 immediately after the transfer and move them to just before the next NEON-to-ARM transfer instruction, so that the instructions in between cover the transfer latency. This assumes that the instructions in between are independent. This gave a gain of around 4%.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Thanks a lot for your help!&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Neon reg to ARM reg data transfer</title><link>https://community.arm.com/thread/1627?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 10:59:12 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:f575034b-5389-4a26-9178-2663efed27bc</guid><dc:creator>Vishwa Vishwa</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 3rd August 2009 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Thanks!&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: Neon reg to ARM reg data transfer</title><link>https://community.arm.com/thread/1628?ContentTypeID=1</link><pubDate>Wed, 11 Sep 2013 10:59:12 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:dfd10693-6739-4d3b-9481-9e30ced90ccd</guid><dc:creator>Vishwa Vishwa</dc:creator><description>&lt;div&gt;&lt;i&gt;Note: This was originally posted on 31st July 2009 at &lt;a href="http://forums.arm.com"&gt;http://forums.arm.com&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;Thanks!&lt;/span&gt;&lt;br /&gt;&lt;span&gt;So one option would be, if possible, to convert all operations that involve the ARM core into NEON operations.&lt;/span&gt;&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item></channel></rss>