Has anybody come across a list of ARM & THUMB instructions that cause deviation from the linear instruction stream?
I've been trying to figure out gdb-stub single stepping using software interrupts, and in single stepping you need to find
the next instruction(s) where the next breakpoint instruction needs to be set.
There are three cases:
1) current instruction doesn't change the execution path. Next instruction is the next word.
2) current instruction is a jump. The operand defines the next instruction address
3) current instruction is conditional branch. One possible next instruction is the next word, the other possible
instruction address is defined by the operand. (That includes conditional add with PC as the target, and the like).
To implement single stepping, I need to tell those cases apart and figure out how to find out the possible branching address.
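As a rough illustration of the three cases, here is a minimal C sketch that handles only the ARM-state B/BL (immediate) encoding. The names `step_kind_t`, `step_info_t` and `arm_step_info` are made up for this example, and everything other than B/BL is simply treated as linear, which a real stub could not do.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef enum { STEP_LINEAR, STEP_JUMP, STEP_COND_BRANCH } step_kind_t;

typedef struct {
    step_kind_t kind;
    uint32_t next;    /* fall-through address (addr + 4 in ARM state) */
    uint32_t target;  /* branch target, meaningful unless kind == STEP_LINEAR */
} step_info_t;

/* addr is the address of the instruction, instr is the 32-bit ARM opcode. */
static step_info_t arm_step_info(uint32_t addr, uint32_t instr)
{
    step_info_t info = { STEP_LINEAR, addr + 4u, 0u };
    uint32_t cond = instr >> 28;

    /* B/BL (immediate): bits 27..25 == 101 and cond != 1111 */
    if ((instr & 0x0E000000u) == 0x0A000000u && cond != 0xFu) {
        uint32_t imm24 = instr & 0x00FFFFFFu;
        uint32_t offset = imm24 << 2;
        if (imm24 & 0x00800000u)
            offset |= 0xFC000000u;         /* sign-extend the 26-bit offset */
        info.target = addr + 8u + offset;  /* PC reads as addr + 8 in ARM state */
        info.kind = (cond == 0xEu) ? STEP_JUMP : STEP_COND_BRANCH;
    }
    return info;
}
```

For STEP_COND_BRANCH a stub would place breakpoints at both `next` and `target`; for STEP_JUMP only at `target`.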
I could go through manuals of numerous processors instruction by instruction and maybe I'd be done within the next couple of years,
or I could find a list of instructions to check, or a paper that explains how to "decode" the instructions in a useful way.
Also, there don't seem to be many sources of ARM gdb servers or stubs around that use software breakpoints.
A couple of observations...
Depends on what you mean by "word". Usually word means 4 bytes for ARM processors. ARM instructions are word sized, Thumb instructions can be word or halfword sized.
On (2/3), this category covers a fairly hefty portion of the instruction set. Most data processing and load instructions allow the PC as the destination. For example "ADD pc, r0, r1" would cause the processor to branch to the address "r0+r1".
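To make the "ADD pc, r0, r1" case concrete, here is a hedged C sketch that recognises only the ARM-state ADD (register) encoding with an unshifted operand and PC as destination. The function name and the `regs` layout (regs[15] holding the address of the instruction) are assumptions for this example. Shifted operands, other opcodes, and interworking/alignment of the result are not handled.

```c
#include <stdint.h>
#include <stdbool.h>

/* Read a register as the instruction would see it; PC reads as addr + 8.
 * Assumes regs[15] holds the address of the instruction being stepped. */
static uint32_t read_reg(const uint32_t regs[16], uint32_t n)
{
    return (n == 15u) ? regs[15] + 8u : regs[n];
}

/* Returns true and stores the raw sum in *target if instr is
 * "ADD{S} pc, Rn, Rm" with no shift. Hypothetical helper. */
static bool arm_add_reg_to_pc(uint32_t instr, const uint32_t regs[16],
                              uint32_t *target)
{
    if ((instr >> 28) == 0xFu)
        return false;                            /* unconditional space */
    if ((instr & 0x0FE00010u) != 0x00800000u)
        return false;                            /* not ADD (register) */
    if (((instr >> 12) & 0xFu) != 15u)
        return false;                            /* Rd is not PC */
    if ((instr & 0x00000FE0u) != 0u)
        return false;                            /* shifted operand: not handled */

    uint32_t rn = (instr >> 16) & 0xFu;
    uint32_t rm = instr & 0xFu;
    *target = read_reg(regs, rn) + read_reg(regs, rm);
    return true;
}
```

The same "is Rd the PC?" check would have to be repeated for the other data-processing opcodes and for loads, each with its own operand decoding.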
There are also a number of special cases. For example SVC, SMC and HVC all cause exceptions, which you could think of as a special kind of branch. Similarly, "SUBS pc, lr, #4" would perform an exception return - which again you could think of as a special type of branch.
Then there are other types of exception to think of. Something like "VADD.F32 s0, s1, s2" would either perform a 32-bit floating point addition, or trigger an exception if the FPU wasn't enabled. Any kind of load/store could trigger an exception due to MMU/MPU checks.
I do not have a list of instructions, but try to think in the "disassembler" direction.
First of all, make a table, where each entry consists of a mask, a value and a handler.
The actual instruction is ANDed with the mask, then compared against the value.
If the values match, you jump to the handler.
The handler could for instance handle adding r0+r1 like in Martin's example above.
This would simplify the task a little.
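A minimal C sketch of such a mask/value/handler table could look like the following. The type and function names are invented for this example, and only two ARM-state encodings (BX and B/BL immediate) are listed; a usable table would need many more entries, ordered from most to least specific.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: what kind of flow change was recognised. */
typedef enum { KIND_LINEAR, KIND_B_IMM, KIND_BX } kind_t;

typedef kind_t (*handler_fn)(uint32_t instr);

typedef struct {
    uint32_t mask;       /* bits that identify the encoding */
    uint32_t value;      /* expected value of those bits */
    handler_fn handler;  /* called when (instr & mask) == value */
} decode_entry_t;

static kind_t handle_bx(uint32_t instr)    { (void)instr; return KIND_BX; }
static kind_t handle_b_imm(uint32_t instr) { (void)instr; return KIND_B_IMM; }

/* More specific patterns come first; the first match wins. */
static const decode_entry_t decode_table[] = {
    { 0x0FFFFFF0u, 0x012FFF10u, handle_bx },    /* BX Rm */
    { 0x0E000000u, 0x0A000000u, handle_b_imm }, /* B/BL imm (also matches BLX imm) */
};

static kind_t decode(uint32_t instr)
{
    for (size_t i = 0; i < sizeof decode_table / sizeof decode_table[0]; i++) {
        if ((instr & decode_table[i].mask) == decode_table[i].value)
            return decode_table[i].handler(instr);
    }
    return KIND_LINEAR;  /* nothing matched: assume no flow change */
}
```

In a real stub the handlers would compute the candidate next addresses instead of just returning a tag.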
I know that OpenOCD can disassemble, so I just made a quick grep -iR 'rrx' * in order to find the disassembler; it seems it's located in src/target/arm_disassembler.c
Unfortunately, it appears to be a bit bulky, but I think it would be helpful anyway.
In addition to this, you might want to head over to the ARM Information Center > ARM architecture > Reference Manuals and download these PDF files. You can download them for free; you just need to register first.
I especially find the Thumb Instruction Set Encoding chapter in ARMv7-M_ARM.pdf interesting.
Sorry about my sloppy use of 'words' (pun intended).
It was just easier to talk about words (as 32-bit entities, like ARM instructions).
I'm painfully aware of the abundance of instructions that change the instruction flow, but the main thing is single stepping (implementation of a gdb stub).
Even if ADD can have PC as target (causing a jump), there are many adds that do not. Only those with PC as the target count. Similarly, interrupts and HW exceptions are not single stepped. SW exceptions _might_ be (although maybe later).
Usually FP errors are considered HW faults in these kinds of situations. Also, conditional execution doesn't change the instruction flow, but just changes the instruction functionality (if condition is false, the instruction becomes NOP).
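If a conditional instruction that writes the PC is treated as "either it executes or it is a NOP", the stub can evaluate the condition field against the saved CPSR to pick a single next address. A minimal sketch, assuming the standard ARM condition codes and the N/Z/C/V flags in CPSR bits 31..28 (the function name is made up):

```c
#include <stdint.h>
#include <stdbool.h>

/* cond is the 4-bit condition field (instruction bits 31..28 in ARM state),
 * cpsr is the saved status register of the program being debugged. */
static bool arm_cond_passed(uint32_t cond, uint32_t cpsr)
{
    bool n = (cpsr >> 31) & 1u;
    bool z = (cpsr >> 30) & 1u;
    bool c = (cpsr >> 29) & 1u;
    bool v = (cpsr >> 28) & 1u;
    bool result;

    cond &= 0xFu;
    switch (cond >> 1) {
    case 0:  result = z;               break;  /* EQ / NE */
    case 1:  result = c;               break;  /* CS / CC */
    case 2:  result = n;               break;  /* MI / PL */
    case 3:  result = v;               break;  /* VS / VC */
    case 4:  result = c && !z;         break;  /* HI / LS */
    case 5:  result = (n == v);        break;  /* GE / LT */
    case 6:  result = (n == v) && !z;  break;  /* GT / LE */
    default: result = true;            break;  /* AL */
    }
    /* Odd condition codes are the inverse of the even ones, except 1111. */
    if ((cond & 1u) && cond != 0xFu)
        result = !result;
    return result;
}
```

When the condition fails, the next instruction is simply the fall-through address.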
There is such code in the gdb client, but it looks like it uses the disassembler code, so it's quite complicated.
On the Atari ST, I wrote a debugger, which could single-step in various ways.
One of the ways included copying the instruction to a local buffer in RAM (because when the instruction is located in ROM, you can't set a breakpoint, since a breakpoint is an instruction).
So I copied the instruction to the local buffer and placed a return-from-exception right after it, then executed the instruction by temporarily changing PC to that local buffer.
Doing such things requires you to know the size and behaviour of the instruction. For instance, a relative jump would not be suitable for copying there.
When executing conditional Thumb instructions (e.g. on Cortex-M), you may have to take the IT instruction into account; I have not experimented with this myself, so I do not have any experience - but I can imagine that you might want to set breakpoints on all of the (up to) 4 instructions in the IT block. There will probably not be any problems with the IT state, though, because it's possible to have IT instructions inside an interrupt, so I believe the state is saved on the stack.
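For what it's worth, the IT state lives in the saved program status register, so a stub can inspect it there. A small sketch, assuming the ARMv7 layout where ITSTATE[7:2] is in PSR bits 15..10 and ITSTATE[1:0] in bits 26..25 (helper names are hypothetical):

```c
#include <stdint.h>
#include <stdbool.h>

/* Reassemble the 8-bit ITSTATE field from a saved CPSR/xPSR value. */
static uint8_t itstate_from_psr(uint32_t psr)
{
    uint32_t hi = (psr >> 10) & 0x3Fu;  /* ITSTATE[7:2] */
    uint32_t lo = (psr >> 25) & 0x3u;   /* ITSTATE[1:0] */
    return (uint8_t)((hi << 2) | lo);
}

/* Non-zero low four bits mean an IT block is in progress. */
static bool in_it_block(uint32_t psr)
{
    return (itstate_from_psr(psr) & 0x0Fu) != 0u;
}

/* Condition that applies to the next instruction inside an IT block. */
static uint32_t it_current_cond(uint32_t psr)
{
    return (uint32_t)itstate_from_psr(psr) >> 4;
}
```

Inside an IT block, the value from `it_current_cond` could be checked against the flags to see whether the next instruction will actually execute.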
I'm playing with Cortex A7, and I've been reading The ARMv7-AR Architecture Reference Manual (downloaded) and Cortex-A Series Programmer’s Guide (downloaded).
It's just that going through every instruction (both ARM and Thumb) and figuring out the 'rules' to tell where to put the next breakpoint instruction(s) would take quite a long time. First going through all instructions and picking those that could change the address where the next instruction is fetched. Then going through the picked instruction encodings and finding out the 'common factors'. I'm quite new to the ARM world, and I don't remember the instructions. I have to look all of them up in the manuals.
BTW, jensbauer, you might remember me asking about a standalone gdb-stub. I decided to write it, and I have resume-from-breakpoint and single-stepping missing from my initial code. Most of the commands (for an almost-minimal stub), exception handling and serial I/O are written. When I get the missing parts cleared and coded, then I get to try running it and see where the smoke comes out.
Nice 'blinky'? (Except that it doesn't blink anything.)
bash-4.2$ wc *.[chS]
  1201  3820 28965 gdb.c
    18    33   244 gdb.h
    28    62   443 io_dev.h
    36    74   520 loader.c
   396  1561 10754 rpi2.c
    70   238  1659 rpi2.h
   458  1605 10316 serial.c
    29    71   569 serial.h
    41   115   666 start.S
    24    55   348 start1.c
   228   730  4101 util.c
    29   116   861 util.h
  2558  8480 59446 total
Great work! I'm convinced that you'll get there soon.
It might be a good idea to put the instruction size in the table-entry I mentioned earlier.
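For Thumb code the size does not even need a table lookup, since it can be derived from the first halfword: in Thumb-2, a first halfword whose top five bits are 11101, 11110 or 11111 starts a 32-bit instruction. A small sketch (the function name is made up for this example):

```c
#include <stdint.h>

/* Returns the size in bytes (2 or 4) of the Thumb instruction whose
 * first halfword is hw1. */
static unsigned thumb_instr_size(uint16_t hw1)
{
    unsigned top5 = (unsigned)hw1 >> 11;
    return (top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu) ? 4u : 2u;
}
```

The fall-through address for a Thumb instruction is then its address plus this size.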
You should probably also keep in mind that Cortex-A can optionally be Big Endian. This does not mean that the instruction set changes, but I think it means that the load/store behaviour changes for 16/32/64-bit accesses.
-Thus if you load an instruction with a mask from memory, I believe you would be safe as long as you do not use two (or more) instructions to construct the 'immediate' values.
Eg. If you use MOV to load the value into a register and then use LDR to read from memory, you may get the value byte-reversed when using the LDR, but the value that MOV loaded would not be byte-reversed, thus there would be no match when you expect it to.
-So in this case, I believe loading from a table is the best solution.
I trust C-compiler knows its business.
But I probably have to use some kind of table. Depends on the complexity of the instruction handling needed.
Most C-compilers would store the 32-bit words in the literal pool, however, if turning on optimizing this part would break on Big Endian platforms.
I highly recommend the table. That would make things easier and also make sure it would work on any Endian machine.
To check at runtime whether your code is running on big or little endian, you could do this:
static const uint32_t isLittleEndian = 0x00000001;
if(*(const uint8_t *)&isLittleEndian)
On little endian, the low-byte would be first, on big endian, the lowbyte would be last, thus on big endian, the result would be 0, on little endian, the result would be 1.
-You could make it an inline function if necessary - or just a global.
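Wrapped up as an inline function, the same test could look like this (a sketch; the function and variable names are just examples):

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true when the byte at the lowest address of a 32-bit 1 is 1,
 * i.e. when data accesses are currently little endian. */
static inline bool is_little_endian(void)
{
    static const uint32_t probe = 0x00000001u;
    return *(const uint8_t *)&probe != 0u;
}
```

Reading the object through a `uint8_t` pointer is allowed by the C aliasing rules, so this does not depend on compiler-specific behaviour.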
I know. I once had to rewrite an autotools (compile-time) endianness test, because we were cross-compiling and the default test needed to run a compiled program. Intel processor didn't run PowerPC code too well...
This is news to me, though:
"Most C-compilers would store the 32-bit words in the literal pool, however, if turning on optimizing this part would break on Big Endian platforms."
Actually, I better correct this. It only applies if you use the same binary on a multi-endian platform.
Imagine that your code is built for - say - Little Endian. It's now moved to a platform, where you do not know if the platform is running big or little endian. This can be switched by hardware; usually setting a pin high or low at boot.
Thus your program would need to determine the endianness at runtime.
If the masks and data constants are stored in table entries, they will match what you read from memory, no matter whether your load instruction swaps the data or not.
But MOVW and MOVT do not swap the data (here the data is fixed).
This means that the data would not be correct if your architecture's endianness does not match your binary. Thus you would need two binaries.
If you're running an operating system, such as Linux, then you would not have that problem, because it would most likely allow only loading .elf files where the endianness matches the architecture.
Another caveat is when you load your data and bit-shift / mask out bits: you'll probably need to load byte-by-byte, insert the data into a 32-bit word and then do the comparison; again because the load instruction swaps the bytes on loading from memory on one type of endianness compared to the other.
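A byte-by-byte load of that kind might look like the sketch below. It assumes the instruction bytes sit in memory in little-endian order, which is an assumption to verify for the actual target configuration; the helper names are invented for this example.

```c
#include <stdint.h>

/* Assemble a 32-bit value from four bytes stored least significant first.
 * The result is the same whatever the data endianness of the CPU is. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Same idea for a 16-bit Thumb halfword. */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)((uint16_t)p[0] | ((uint16_t)p[1] << 8));
}
```

The assembled word can then be ANDed with a mask and compared against a constant without caring how the constant itself was loaded.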
A few days ago, I had to make some code, which I wrote for PowerPC (I'm still on PPC) work on an Intel-Mac, and I really got in trouble because it seems Apple changed the picture format for 32-bit ARGB (offscreen) pictures.
In addition, they made some changes to how caching works, and finally I had to fight against endian-problems on bit-shifting.
This was really a brain-twisting experience I won't recommend!
-I'm pretty convinced that there are still some bugs hidden in my code, because I ended up not knowing what I was doing, but it does work for now...
I think I have to try to get the thing together with a subset of the instructions first.
I've been going through the encoding of A1 (been doing it for a couple of days) and I still have quite some instructions to go through. There doesn't seem to be a single document that says it all - I've been reading 3 documents in parallel, and I still have done some guesswork too. The documents are "ARM® Cortex™-A Series, Version: 4.0, Programmer’s Guide",
"ARM® Architecture Reference Manual, ARMv7-A and ARMv7-R edition, Issue C.c" and "ARM Architecture Reference Manual, Issue I". Figuring out enough about all the instructions will take a couple of weeks still - probably longer than everything else together. Quite tiring and frustrating work.
Funny how some info seems to be dropped in updates. Like the bits PUNWL for LDC/LDC2/STC/STC2. I couldn't find the explanation in the ARMv7-A ARM anywhere, and the main instruction encoding table in ARM ARM could have been nice in ARMv7-A ARM too.
To not lose all I've done this far, I put my effort on GitHub. It compiles, but quite some code is still missing.
My struggle with the instructions is there in the file: instr.txt in case someone is interested.
The code is still "initial draft" so don't shoot me.
The repo is: https://github.com/turboscrew/rpi_stub
It looks like figuring out the ARM ISA on bit level is becoming the most tedious and time-consuming task.
When (if?) I get it figured out, I hope I still remember there was a project it was done for.
It doesn't help that aliases and pseudo instructions are treated in the document just like the 'native' instructions.
I think I just have to go through the instructions in the ARMv7-A ARM one by one and manually list all instructions and the bit patterns of all encodings in a text file for easier manipulation and sort them out there.
The HTML-pages are slow for that, and the copying works funny with PDFs.
The time estimate to finish the project just grew fourfold (at least).
I know OpenOCD does single-stepping too. Perhaps this can be of some help to you ?
In the file cortex_a.c there is a breakpoint setting function and single stepping function, but they get the address as a parameter. I still haven't found where the address is decided, but I think it's somewhere there, because in the file /src/server/gdb_server.c the function fetch_packet implements the remote serial protocol - that's what I've been working on.
Looks very helpful. Thanks, jensbauer.
(There seems to be no 'helpful answer' button, so I clicked 'correct answer' even if I'm not yet sure if this solves my problem. Odds look good though, so it can't go very wrong.)
Just to let possibly interested people know:
I generated a list of encodings (both ARM and Thumb) for Cortex-A7.
The lists may still contain errors, like missing decodings or the like - they were awk-script generated from ARMv7-A/R ARM.
Since I didn't find lists of that kind on the net, I decided to make them.
The lists can be downloaded from https://github.com/turboscrew/rpi_stub (the files ARM_instructions.txt and Thumb_instructions.txt ).
They are text files to allow easier manipulation.