<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="https://community.arm.com/utility/feedstylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>M4 (Thumb2) to M0+ (Thumb) assembly language</title><link>https://community.arm.com/developer/ip-products/processors/f/cortex-m-forum/47974/m4-thumb2-to-m0-thumb-assembly-language</link><description> I have spent a long time trying to find the fastest ARM M0+ macros for C code, but I am trying to write for an M0+ processor and I have discovered that, for an MP3 decoder, the processor&amp;#39;s ability comes down to its speed in the polyphase section. Just 9 lines of code</description><dc:language>en-US</dc:language><generator>Telligent Community 10</generator><item><title>RE: M4 (Thumb2) to M0+ (Thumb) assembly language</title><link>https://community.arm.com/thread/168519?ContentTypeID=1</link><pubDate>Sat, 07 Nov 2020 17:55:50 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:358e4057-392a-4dbd-8a48-581da826b9b7</guid><dc:creator>42Bastian Schick</dc:creator><description>&lt;p&gt;I understand, you have 32-bit fixed-point values. Do you really need the accuracy of summing up all 64 bits of the multiplication?&lt;/p&gt;
&lt;p&gt;When writing games or demos, we make assumptions such as a resolution of 160x102, so we know of bounds that cannot be exceeded when doing calculations. Are there any such bounds for the decoder?&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: M4 (Thumb2) to M0+ (Thumb) assembly language</title><link>https://community.arm.com/thread/168504?ContentTypeID=1</link><pubDate>Fri, 06 Nov 2020 14:57:01 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:f7f7bbf5-c9ff-495f-886d-c3d12e201544</guid><dc:creator>Sean Dunlevy</dc:creator><description>&lt;p&gt;Another stunning piece of code, Bastian. The mulshift32 is your 32-bit x 32-bit signed multiply. Only the top 32 bits are used, which is why I keep looking and looking. Calculating each 16-bit PCM output uses 32 mulshift32s. I&amp;#39;m sure you can see how much just 1 cycle less would save. I have looked at &amp;#39;Hacker&amp;#39;s Delight&amp;#39;, &amp;#39;Bit Twiddling&amp;#39; and all of those places.
I even looked at Karatsuba multiplication to see if the error (due to overflow when adding ints) can be pushed into the bottom 2 bits...&lt;br /&gt;&lt;br /&gt;There IS a form of simplified MP3 decoder which uses signed 16-bit x 16-bit multiplies, but given that the SMULS takes 3-5 cycles, your software solution using 17 cycles is a lot better than people might think...&lt;br /&gt;&lt;br /&gt;I do suspect that it will have to be 100% assembly language.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: M4 (Thumb2) to M0+ (Thumb) assembly language</title><link>https://community.arm.com/thread/168490?ContentTypeID=1</link><pubDate>Fri, 06 Nov 2020 06:45:32 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:37509e3b-0e05-4588-aed1-f11732672991</guid><dc:creator>42Bastian Schick</dc:creator><description>&lt;p&gt;After some brain-racking, I think you can save 3 cycles if you unroll:&lt;br /&gt;&lt;pre class="ui-code" data-mode="text"&gt;;r0-r4 used by MULSHIFT32
;r12 lo &amp;amp; r5 hi of sum1L
;r14 lo &amp;amp; r6 hi of sum2L
;r7 base-address of vb1+23
;r8 c1
;r9 c2
;r10 vLo/vHi
;r11 address of vb1

MC2M:
mov r11,r7	; vb1+x
add r7,#$5c	; vb1+(23-x)

REPEAT 23
.inner_loop:
pop {r0-r1} ;get c1 &amp;amp; c2
mov r9,r1 ;store c2
mov r8,r0 ;store c1

mov r2,r11
ldmia r2!,{r0}
mov r11,r2
mov r10,r0

mulshift32

add r12,r0 ;
cmp r0,r12 ;sum1L += (vLo x c1)
adcs r5,r1 ;

mov r0,r9 ;c2
mov r1,r10 ;vLo

mulshift32

add r14,r0 ;
cmp r0,r14 ;sum2L += (vLo x c2)
adcs r6,r1 ;

ldr r0,[r7] ;vHi
subs r7,r7,#4
mov r10,r0

mov r1,r9 ;-c2
neg r1,r1 ;

mulshift32

add r12,r0 ;
cmp r0,r12 ;sum1L += (vHi x -c2)
adcs r5,r1 ;

mov r0,r8 ;c1
mov r1,r10 ;vHi

mulshift32

add r14,r0 ;
cmp r0,r14 ;sum2L += (vHi x c1)
adcs r6,r1 ;
ENDR
----
29
&lt;/pre&gt;&lt;/p&gt;
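&lt;p&gt;For anyone following along without the macro source, here is a rough Python model of what the loop above relies on: a 32-bit x 32-bit multiply built from four 16-bit partial products, followed by a 64-bit accumulate with carry. It is only a sketch under assumptions, not the actual macro. The names umul32x32 and acc64 are made up for illustration, the operands are treated as unsigned (as the uxth/lsrs routine in this thread does), and the carry is obtained with divmod rather than with the add/cmp/adcs sequence.&lt;/p&gt;
&lt;pre class="ui-code" data-mode="text"&gt;
```python
WORD = 0x100000000   # 2**32, one 32-bit register
HALF = 0x10000       # 2**16, one 16-bit half


def umul32x32(x, y):
    """Unsigned 32x32 multiply from four 16x16 products; returns (hi, lo)."""
    a, b = divmod(x, HALF)          # x = a*2^16 + b
    c, d = divmod(y, HALF)          # y = c*2^16 + d
    lo, hi = b * d, a * c           # bd seeds the low word, ac the high word
    for cross in (a * d, b * c):    # the two middle products ad and bc
        cross_hi, cross_lo = divmod(cross, HALF)
        carry, lo = divmod(lo + cross_lo * HALF, WORD)   # models adds
        hi = (hi + cross_hi + carry) % WORD              # models adcs
    return hi, lo


def acc64(sum_hi, sum_lo, prod_hi, prod_lo):
    """Add a 64-bit product to a 64-bit sum held as two 32-bit words."""
    carry, lo = divmod(sum_lo + prod_lo, WORD)
    hi = (sum_hi + prod_hi + carry) % WORD
    return hi, lo
```
&lt;/pre&gt;
&lt;p&gt;As a check, umul32x32(0xFFFFFFFF, 0xFFFFFFFF) gives (0xFFFFFFFE, 0x00000001), which matches (2^32 - 1)^2 = 2^64 - 2^33 + 1.&lt;/p&gt;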
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;After the unrolling, r7 is again vb1.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: M4 (Thumb2) to M0+ (Thumb) assembly language</title><link>https://community.arm.com/thread/168487?ContentTypeID=1</link><pubDate>Fri, 06 Nov 2020 05:46:23 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:be1d1ca1-6f55-4ba1-8eb0-6f37f0d8010e</guid><dc:creator>42Bastian Schick</dc:creator><description>&lt;p&gt;From the other thread:&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;mov r0,r10&lt;br /&gt; cmp r0,#$24&lt;br /&gt; bl .inner_loop&lt;/p&gt;
&lt;p&gt;Seems wrong, did you mean:&lt;/p&gt;
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;mov r0,r11&lt;br /&gt; cmp r0,#$24&lt;br /&gt; blo .inner_loop&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>RE: M4 (Thumb2) to M0+ (Thumb) assembly language</title><link>https://community.arm.com/thread/168477?ContentTypeID=1</link><pubDate>Thu, 05 Nov 2020 15:23:10 GMT</pubDate><guid isPermaLink="false">dd9e70c8-6d3c-4c71-b136-2456382a7b5c:3367f2ee-9cfe-44ea-8dc3-b4a590f5c585</guid><dc:creator>42Bastian Schick</dc:creator><description>[quote userid="16207" url="~/developer/ip-products/processors/f/cortex-m-forum/47974/m4-thumb2-to-m0-thumb-assembly-language"]Jens provided us all with a 17-cycle 32-bit x 32-bit --&amp;gt; 64-bit signed multiply which corrupts r0-r5 while giving the results in r0 &amp; r1[/quote]
&lt;p&gt;My version uses r0-r4 only:&lt;/p&gt;
&lt;p&gt;&lt;pre class="ui-code" data-mode="text"&gt;    mov	r12,r4
	//  17 cycles (if muls takes 1 cycle)
	// 141 cycles (if muls takes 32 cycles)
	//
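	// ab*cd, where a,c are the high and b,d the low 16-bit halves:
	// (a*2^16 + b) * (c*2^16 + d) = ac*2^32 + (ad + bc)*2^16 + bd
	// ac
	//  ad
	//  bc
	//   bd
	// ------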

	uxth	r2,r0			// b
	lsrs	r0,r0,#16		// a
	lsrs	r3,r1,#16		// c
	uxth	r1,r1			// d
	movs	r4,r1			// d

	muls	r1,r2			// bd
	muls	r4,r0			// ad
	muls	r0,r3			// ac
	muls	r3,r2			// bc

	lsls	r2,r4,#16		// ad =&amp;gt; d0
	lsrs	r4,r4,#16		// ad =&amp;gt; 0a
	adds	r1,r1,r2		// bd + d0
	adcs	r0,r4			// ac + 0a + C
	lsls	r2,r3,#16		// bc =&amp;gt; c0
	lsrs	r3,r3,#16		// bc =&amp;gt; 0b
	adds	r1,r1,r2		// bd + c0
	adcs	r0,r3			// ac + 0b + C

	mov	r4,r12&lt;/pre&gt;&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item></channel></rss>