Hi, I have just got a Cortex-M0 (LPC1114) based dev board. I'm reading about the architecture and instructions. My understanding is that it supports most 16-bit Thumb instructions and a handful of 32-bit Thumb-2 instructions. If the processor has a 32-bit bus over which instructions are fetched (I'm assuming this; I also have limited knowledge of a CPU's inner workings), why doesn't it support more Thumb-2 instructions? It seems like a waste, or maybe it fetches two 16-bit instructions at once?
My other issue is related to my first question. I'm trying to set the core registers to immediate values using MOV. I read that you can use MOV for the lower 16 bits and MOVT for the upper 16 bits (this is only for cores which support ARM32, I suppose). However, it seems MOVT is not supported by the Cortex-M0; arm-none-eabi-as reports: "Error: selected processor does not support Thumb mode `movt r0 , r1'". Also, when I read the ARMv6-M reference manual, I found that MOV can only set an immediate value up to 8 bits long. This all seems very strange to me. I got a hint on IRC that you're really supposed to use PC-relative addressing to set registers "directly", which I haven't read much into. Is there no efficient way to set immediate 32-bit values in registers using MOV or other data instructions?
Thanks for responses!
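For readers following the thread, one way to see why PC-relative loads are usually suggested is to count what a MOV-based sequence would cost. The sketch below is in Python rather than assembly, and it models only three ARMv6-M instructions (MOVS with an 8-bit immediate, LSLS by an immediate, ADDS with an 8-bit immediate). The helper names `build_constant` and `simulate` are made up for illustration, and the sequence it produces is a simple byte-by-byte scheme, not necessarily the shortest one possible.

```python
def build_constant(value):
    """Return a list of (mnemonic, immediate) pairs that leave `value` in r0.

    Assumption: only MOVS #imm8, LSLS #8 and ADDS #imm8 are used, building
    the constant one byte at a time from the most significant byte down.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")

    # Split into four bytes, most significant first.
    parts = [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]

    # Leading zero bytes need no instructions.
    while len(parts) > 1 and parts[0] == 0:
        parts.pop(0)

    seq = [("MOVS", parts[0])]
    for byte in parts[1:]:
        seq.append(("LSLS", 8))        # make room for the next byte
        if byte:
            seq.append(("ADDS", byte))  # drop the byte into the low 8 bits
    return seq


def simulate(seq):
    """Run the sequence on a toy 32-bit register and return its final value."""
    r0 = 0
    for op, imm in seq:
        if op == "MOVS":
            r0 = imm
        elif op == "LSLS":
            r0 = (r0 << imm) & 0xFFFFFFFF
        elif op == "ADDS":
            r0 = (r0 + imm) & 0xFFFFFFFF
        else:
            raise ValueError("unknown op: " + op)
    return r0


if __name__ == "__main__":
    seq = build_constant(0x12345678)
    for op, imm in seq:
        print(op, "r0, #" + hex(imm))
    print("instructions:", len(seq), "result:", hex(simulate(seq)))
```

For 0x12345678 this scheme needs seven 16-bit instructions (14 bytes). A literal-pool load, such as the GNU assembler's `ldr r0, =0x12345678` pseudo-instruction, is a single 16-bit LDR plus a 4-byte constant stored near the code, which is likely why PC-relative addressing was the hint given on IRC.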
Hello,
What do you mean by the word 'collision'?
Does 'IF' mean the Instruction Fetch stage?
The Cortex-M0 fetches 32 bits of data at every fetch stage.
That is two instructions if both are 16 bits long, or one instruction if it is 32 bits long.
'LDR PC' is 16 bits long, so the next instruction can also be fetched at the same time.
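The "two 16-bit or one 32-bit" rule above can be sketched in Python. In the Thumb encoding, a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 is the first half of a 32-bit instruction, and anything else is a complete 16-bit instruction. The function names here are hypothetical, and the sketch assumes the fetched word is aligned so that its first halfword starts an instruction.

```python
def thumb_length(first_halfword):
    """Return the instruction length in bytes, judged from its first halfword."""
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def describe_fetch(hw0, hw1):
    """Describe what one 32-bit fetch contains.

    hw0 is the halfword at the lower address. Assumption: hw0 is the start
    of an instruction (not the second half of an earlier 32-bit one).
    """
    if thumb_length(hw0) == 4:
        return ["32-bit instruction"]
    if thumb_length(hw1) == 2:
        return ["16-bit instruction", "16-bit instruction"]
    # A 32-bit instruction may start in the upper half and spill into
    # the next fetched word.
    return ["16-bit instruction", "first half of a 32-bit instruction"]


if __name__ == "__main__":
    # 0x2001 = MOVS r0, #1 ; 0xBF00 = NOP ; 0xF000 0xF800 = BL (32-bit)
    print(describe_fetch(0x2001, 0xBF00))
    print(describe_fetch(0xF000, 0xF800))
    print(describe_fetch(0xBF00, 0xF000))
```

On ARMv6-M the only 32-bit encodings are BL, DMB, DSB, ISB, MRS and MSR, so in practice nearly every fetch delivers two instructions.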
Best regards,
Yasuhiko Koumoto.
So if the first instruction is 16-bit and is a B, is the following instruction thrown away? I ask because some 32-bit CPUs with 16-bit instructions still perform the instruction IF it is nuclear, e.g. LD R0,#0. You can see the advantage: no empty slots in the pipeline.
I ask because the BBC are having a custom M0 produced for the BBC Microbit (my target) and I wondered if it's a possible modification to approach the dhrystone per MIP boundary.
Thank you for your expert advice
So if the first instruction is 16-bit and is a B, the following instruction is thrown away? if so, does that mean the following instruction is thrown away?
Yes.
I ask because some 32-bit CPUs with 16-bit instructions still perform the instruction IF it is nuclear e.g.LD R0,#0. You can see the advantage. No empty slots in the pipeline.
Are you talking about the branch delay slot?
I don't know of any 32-bit CPU with 16-bit instructions that supports a branch delay slot, other than SH.
In some cases it is useful, but it can sometimes be a disadvantage for code size.
That is because a compiler can only rarely fill the delay slot with a useful instruction.
Dear Koumoto San, since the BBC is buying over 1 million custom chips, I am wondering whether it would be difficult to alter the M0 to remove the cache-clear. I found that I could speed up SH2 code significantly by using this simple trick. This was with the GNU compiler chain, so I rewrote all of the felide-constructors to use this feature. I was able to achieve 0 NOP instructions in the whole object code of Tomb Raider on the Saturn. In most cases, it would move the result of a C call into R0. A minor thing, but it reduced the size of the object code.
I fear I have an unnatural hatred of a single cycle being wasted, so I look very hard at an instruction set and the pipeline, as well as the cache (if any), to ensure that the chip operates without a single wasted cycle. Like SH, the M0 always reads 32 bits of instructions at a time, so placing a B as the second of two 16-bit instructions means that at least less is wasted. Of course, it isn't as simple as 32-bit boundaries. A coder would have to work through the code from its start to ensure the order... but I will do that if it gives me extra performance. As you know, games programming in the 80s and 90s relied on someone hand-optimizing a mostly C project to get 95%+ of the theoretical CPU performance in order to compete. If I have a task that takes almost all of the CPU time (CELP decode, for example), such tricks can mean the difference between success and failure.
I thank you for your valuable time and I will try to keep my questions to a minimum.
As I have posted, I am looking at software for the BBC Microbit, specifically an audiobook (CELP) and a language lab in which the teacher broadcasts the language to all pupils and can listen to a single pupil. The Nordic Semiconductor chip also has an M0 core, so I'm hoping, depending on how the CPU and Bluetooth chips communicate, that this CPU may also be used when it is otherwise idle.
If all else fails, I CAN use the ARM7TDMI used by all current SanDisk memory sticks, but that may prove difficult as they have a habit of altering control codes.
I thank you for your patience and wish you a good day,
With sincere thanks, Sean Wain Dunlevy