Hi, I have just got a Cortex-M0 (LPC1114) based dev board. I'm reading about the architecture and instructions. My understanding is that it supports most Thumb 16-bit instructions and a handful of Thumb-2 32-bit instructions. If the processor has a 32-bit bus over which instructions are fetched (I'm assuming this; I also have limited knowledge of a CPU's inner workings), why doesn't it support more Thumb-2 instructions? It seems like a waste, or maybe it fetches two 16-bit instructions?
My other issue is related to my first question. I'm trying to set the core registers with immediate values using MOV. I read that you can use MOV for the lower 16 bits and MOVT for the upper 16 bits (this is only for cores which support ARM32, I suppose). However, it seems MOVT is not supported by the Cortex-M0; arm-none-eabi-as reports: "Error: selected processor does not support Thumb mode `movt r0 , r1'". Also, when I read the ARMv6-M reference manual, I read that MOV can only set an immediate value up to 8 bits long. This all seems very strange to me. I got a hint on IRC that you're really supposed to use PC-relative addressing to set registers "directly", which I haven't read much into. Is there no efficient way to set immediate 32-bit values for registers, using MOV or other data instructions?
Thanks for responses!
Hello,
I think the reason why the Cortex-M0 only supports Thumb (not Thumb-2) is to reduce core size by simplifying the decode logic. As you imagine, the internal bus width is 32 bits and the prefetch logic can fetch 2 Thumb instructions in one read cycle, which reduces the prefetch power.
Regarding 32-bit immediates, Thumb code usually uses the PC-relative load instruction. It loads the 32-bit immediate word from the literal pool in memory. This is the most code-size-efficient way. That is, a 16-bit instruction can load a 32-bit immediate.
For example,
     LDR Rt,[pc,#imm]   @ pc+imm indicates the label Mem
     .....
     .....
Mem: .word 0x12345678
acts as
MOV Rt,#0x12345678.
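To make the "pc+imm" arithmetic above concrete, here is a minimal Python sketch of how the 16-bit LDR (literal) form finds its word. It assumes the ARMv6-M rules that the PC reads as the instruction address plus 4, that the base is rounded down to a word boundary, and that the offset is an unsigned 8-bit word count. The function names and the addresses 0x100 and 0x108 are made up for illustration.

```python
def ldr_literal_target(instr_addr, imm8):
    """Address read by a 16-bit Thumb LDR Rt,[PC,#imm8*4] located at instr_addr."""
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("imm8 must fit in 8 bits")
    pc = instr_addr + 4      # in Thumb state the PC reads as instruction address + 4
    base = pc & ~0x3         # the base is the PC rounded down to a word boundary
    return base + imm8 * 4   # the offset is an unsigned count of words


def encode_ldr_literal(rt, imm8):
    """16-bit encoding of LDR (literal): bits 01001, then Rt (3 bits), then imm8."""
    if not 0 <= rt <= 7:
        raise ValueError("only r0-r7 can be encoded")
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("imm8 must fit in 8 bits")
    return 0x4800 | (rt << 8) | imm8


# Hypothetical layout: the LDR sits at 0x100 and the literal word at 0x108.
print(hex(ldr_literal_target(0x100, 1)))   # 0x108
print(hex(encode_ldr_literal(0, 1)))       # 0x4801
```

Because of the rounding, an LDR at 0x102 with the same imm8 reads the same word at 0x108.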
Best regards,
Yasuhiko Koumoto.
Thanks for the great answer, that explains it! I'll just learn to use pc relative load instead.
Best Regards!
Hi ei24 and welcome to the community!
As Yasuhikok explained, the Cortex-M0 supports almost exclusively 16-bit instructions; only a handful of 32-bit instructions (BL, DMB, DSB, ISB, MRS and MSR) are supported.
That means that the MOVT instruction does not exist for the Cortex-M0.
A more convenient way of loading a 32-bit constant into a register from the literal pool is to use the LDR Rt,=imm pseudo-instruction.
This pseudo-instruction automatically puts the immediate value in the literal pool, and creates a LDR instruction that fetches it.
You should make sure that your literal pool is close by, since the Cortex-M0's addressing range for the LDR instruction is quite limited.
That means you should place the keyword '.pool' (when using the GNU assembler) right after an unconditional b or bx (branch) instruction, where the pool data cannot be reached and executed as code.
I do not know which keyword is used in other assemblers; it might be the same, but without the dot prefix.
At the ARM Information Center, you may find the Cortex-M0 Technical Reference Manual and the Cortex-M0 Devices Generic User Guide helpful in addition to the ARMv6-M reference (which you have already).
Some time ago I wrote an article on ARM Cortex-M0 assembly programming tips and tricks - it seems that it's now revived and has gotten quite popular again.
Thanks, it seems like a nice community!
Can't I place the fetched word right before the instruction? I guess it doesn't matter either way, but I wonder.
I used .word. I have read your article; maybe I will use .pool if I need to port code in the future.
The community definitely is a nice place if you ask me (I'm just a user like you). Unfortunately, we might not be able to answer all questions as there are not so many of us developers here; but of course we want to if we can.
For LPC-specific questions, you may also find the lpcware.com forum helpful; this is usually where developers who work with LPC devices hang out.
The reason that you can't place the fetched word right before the instruction is that the 16-bit instruction set is quite limited.
As far as I recall, negative offsets are not available for PC-relative addressing, thus you'll have to place them after the instruction at some point.
Since negative offsets are not available, there will be a wider range of positive offsets available, thus you can have some code between the loading and the .pool.
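As a rough illustration of the range described above, the following Python sketch checks whether a literal can be reached by the 16-bit PC-relative LDR. It assumes the ARMv6-M encoding, in which the offset is an unsigned 8-bit word count (0 to 1020 bytes) added to the word-aligned value of the instruction address plus 4. The function name and all addresses are hypothetical.

```python
def ldr_literal_imm8(instr_addr, literal_addr):
    """Return the imm8 field for LDR Rt,[PC,#imm8*4], or None if out of reach."""
    base = (instr_addr + 4) & ~0x3   # PC as seen by the instruction, word-aligned
    offset = literal_addr - base
    if offset < 0:                   # literal placed before the instruction
        return None
    if offset > 1020:                # more than 255 words ahead
        return None
    if offset % 4 != 0:              # literals must be word-aligned
        return None
    return offset // 4


# Hypothetical LDR at 0x200, so the base is 0x204.
print(ldr_literal_imm8(0x200, 0x1FC))   # None (negative offset)
print(ldr_literal_imm8(0x200, 0x204))   # 0
print(ldr_literal_imm8(0x200, 0x600))   # 255 (the furthest reachable word)
print(ldr_literal_imm8(0x200, 0x604))   # None (too far away)
```

In practice the assembler performs this check itself and reports an error when the pool is out of range, which is the cue to add another .pool closer to the load.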
In some cases, you may want to have a register point permanently to an area in memory, which contains data or literals, but normally the literal pool is preferred when you need to load constant values (because it does not cause any overhead).
If you need to port the code, it might be a good idea to have an include file which contains macros. Doing this makes it possible to 'emulate' directives from other assemblers; for instance, the KEIL assembler does not use the dot prefix as far as I know, so you could write a macro called "pool", which just contains ".pool" if you're using the GNU assembler.
You may also find my article on Useful assembler directives and macros for the GNU assembler interesting.
I'll keep the LPC forum in mind.
Yeah, there is no real reason why I would want to fetch a word that's placed before the LDR.
Thanks, but I already bookmarked your article earlier today.
Since the bus is 32-bit, are instructions fetched in pairs? If so, will a memory fetch that occurs when the IF isn't using the bus be faster than one that occurs during a collision?
Hello,
What do you mean by the word 'collision'?
Does 'IF' mean the Instruction Fetch stage?
The Cortex-M0 fetches 32 bits of data at every fetch stage.
That is, a fetch holds 2 instructions if both instructions are 16 bits long, and 1 instruction if the instruction is 32 bits long.
The PC-relative LDR is 16 bits long, so the next instruction can be fetched at the same time.
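The cases described above can be sketched in Python. This is only a model of the encoding rule, not of the real prefetch hardware: it assumes a little-endian, word-aligned fetch that starts on an instruction boundary, and it uses the Thumb rule that a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 is the first half of a 32-bit instruction. The function names are made up for illustration.

```python
def is_32bit_first_halfword(hw):
    """True if hw is the first halfword of a 32-bit Thumb instruction."""
    return (hw >> 11) in (0b11101, 0b11110, 0b11111)


def describe_fetch(word):
    """Describe what one little-endian 32-bit fetch contains."""
    first = word & 0xFFFF            # halfword at the lower address
    second = (word >> 16) & 0xFFFF   # halfword at the higher address
    if is_32bit_first_halfword(first):
        return "one 32-bit instruction"
    if is_32bit_first_halfword(second):
        return "one 16-bit instruction and the first half of a 32-bit one"
    return "two 16-bit instructions"


# MOVS r0,#0 (0x2000) followed by BX lr (0x4770)
print(describe_fetch(0x47702000))   # two 16-bit instructions
# BL with a zero offset: halfwords 0xF000, 0xF800
print(describe_fetch(0xF800F000))   # one 32-bit instruction
# MOVS r0,#0 followed by the first half of a BL
print(describe_fetch(0xF0002000))   # one 16-bit instruction and the first half of a 32-bit one
```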
Best regards,
Yasuhiko Koumoto.
So if the first instruction is 16-bit and is a B, is the following instruction thrown away? I ask because some 32-bit CPUs with 16-bit instructions still perform the instruction IF it is nuclear, e.g. LD R0,#0. You can see the advantage: no empty slots in the pipeline.
I ask because the BBC are having a custom M0 produced for the BBC Microbit (my target) and I wondered if it's a possible modification to approach the dhrystone per MIP boundary.
Thank you for your expert advice
muffin wrote: Since the bus is 32-bit, are instructions fetched in pairs? If so, will a memory fetch that occurs when the IF isn't using the bus be faster than one that occurs during a collision?
Unfortunately it won't, as the instruction timing is fixed on the Cortex-M0.
The disadvantage is obvious, but there are two advantages to this:
1: The core will use less silicon space.
2: It's easier to calculate timings "by hand".
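As an example of calculating timings "by hand", here is a small Python sketch that counts the cycles of a two-instruction delay loop (SUBS r0,#1 followed by BNE back to the SUBS). The cycle counts are taken as assumptions from the Cortex-M0 Technical Reference Manual (SUBS: 1 cycle, BNE: 3 cycles when taken and 1 when not taken) and assume zero-wait-state memory; the function name is hypothetical.

```python
# Assumed cycle counts (Cortex-M0 TRM, zero wait states).
CYCLES_SUBS = 1
CYCLES_BNE_TAKEN = 3
CYCLES_BNE_NOT_TAKEN = 1


def delay_loop_cycles(n):
    """Cycles for 'loop: SUBS r0,#1 / BNE loop' entered with r0 = n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    taken = n - 1   # the branch is taken on every pass except the last one
    return n * CYCLES_SUBS + taken * CYCLES_BNE_TAKEN + CYCLES_BNE_NOT_TAKEN


print(delay_loop_cycles(1))      # 2
print(delay_loop_cycles(1000))   # 3998
```

The result is always 4n - 2 cycles, which is the kind of exact figure that a fixed instruction timing makes possible.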
So if the first instruction is 16-bit and is a B, is the following instruction thrown away?
Yes.
I ask because some 32-bit CPUs with 16-bit instructions still perform the instruction IF it is nuclear, e.g. LD R0,#0. You can see the advantage: no empty slots in the pipeline.
Are you referring to the branch delay slot?
I don't know of any 32-bit CPU with 16-bit instructions, other than SH, that supports a branch delay slot.
In some cases it is useful, but it can sometimes be a disadvantage for code size.
That is because a compiler can only rarely fill the delay slot with a useful instruction.
Dear Koumoto San, Since the BBC is buying over 1 million custom chips, I am wondering if it is difficult to alter the M0 to remove the cache-clear. I found that I could speed up SH2 code significantly by using this simple trick. This was using the GNU compiler-chain so I rewrote all of the felide-constructors to use this feature. I was able to achieve 0 NOP instructions in the whole object code of Tombraider on the Saturn. In most cases, it would move the result of a C call into R0. A minor thing, but it reduced the size of object code.
I fear I have an unnatural hatred of a single cycle being wasted, and so I look very hard at an instruction set and the pipeline as well as the cache (if any) to ensure that the chip operates without a single wasted cycle. Like SH, the M0 always fetches 32 bits at a time, so placing a B as the second of two 16-bit instructions prevents a fetched instruction from being thrown away; at least less is wasted. Of course, it isn't as simple as 32-bit boundaries. A coder would have to work through the code from its start to ensure the order... but I will do that if it gives me extra performance. As you know, games programming in the 80s & 90s relied on someone hand-optimizing a mostly C project to get 95%+ of the theoretical CPU performance to compete. If I have a task that takes almost all of the CPU time (CELP decode for example), such tricks can mean the difference between success and failure.
I thank you for your valuable time and I will try to keep my questions to a minimum.
As I have posted, I am looking at software for the BBC Microbit and specifically an audiobook (CELP) and a language-lab in which the teacher broadcasts the language to all pupils and can listen to a single pupil. The Nordic Semiconductors chip also has an M0 core so I'm hoping, depending on how the CPU & bluetooth chips communicate, that when not in use, this CPU may also be used.
If all else fails, I CAN use the ARM7TDMI used by all current Sandisk memory sticks, but that may prove difficult as they have a habit of altering control codes.
I thank you for your patience and wish you a good day,
With sincere thanks,
Sean Wain Dunlevy