Why the different encodings?

Juha Aaltonen over 11 years ago

Why are there different encodings of instructions?

What's the idea/background/etc for their co-existence?

Can different encodings be mixed in the code? (Not ARM encodings with Thumb encodings without an ARM/Thumb mode change, but, say, A1 with A2, or T1 with T2.)

I'm trying to put together a gdb stub, and for single stepping the machine code needs to be partially decoded.

How can one tell apart the encoding of an instruction in a machine code program (binary)?

Oh, and an additional question: what do the bit values in parentheses mean?

Encoding T1 ARMv6T2, ARMv7

LDREX<c> <Rt>, [<Rn>{, #<imm>}]

First halfword, bits 15-0: 1 1 1 0 1 0 0 0 0 1 0 1 Rn

Second halfword, bits 15-0: Rt (1) (1) (1) (1) imm8

Encoding A1 ARMv6*, ARMv7

LDREX<c> <Rt>, [<Rn>]

Bits 31-0: cond 0 0 0 1 1 0 0 1 Rn Rt (1) (1) (1) (1) 1 0 0 1 (1) (1) (1) (1)

For the case when cond is 0b1111, see Unconditional instructions on page A5-216.

t = UInt(Rt); n = UInt(Rn); imm32 = Zeros(32); // Zero offset

if t == 15 || n == 15 then UNPREDICTABLE;

Yasuhiko Koumoto over 11 years ago

Hello,

The different encodings of the same instruction, one for ARM and another for Thumb, come from the different encoding schemes of the two instruction sets.
The class of an ARM instruction is determined by bits 27-25 and bit 4 of the instruction word.
The class of a Thumb instruction is determined by bits 15-10 of the instruction halfword.
Basically, an ARM instruction is 32 bits long and a Thumb instruction is 16 bits long.
Because Thumb was designed after ARM and the basic instruction length differs, the two could not share the same encoding.
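As a rough illustration of the bit fields mentioned above, here is a minimal Python sketch that pulls out bits 27-25 and bit 4 of an ARM word and bits 15-10 of a Thumb halfword. The function names are made up for this example. The `thumb_is_32bit` check relies on the ARMv7 rule that a first halfword whose bits 15-11 are 0b11101, 0b11110 or 0b11111 starts a 32-bit Thumb instruction; that rule is not stated in this thread.

```python
def arm_class_bits(insn):
    """Return (bits 27-25, bit 4) of a 32-bit ARM instruction word."""
    insn &= 0xFFFFFFFF
    return (insn >> 25) & 0x7, (insn >> 4) & 0x1


def thumb_opcode_bits(halfword):
    """Return bits 15-10 of a Thumb instruction halfword."""
    return ((halfword & 0xFFFF) >> 10) & 0x3F


def thumb_is_32bit(first_halfword):
    """True when the halfword is the first half of a 32-bit Thumb instruction."""
    return ((first_halfword & 0xFFFF) >> 11) in (0b11101, 0b11110, 0b11111)
```

For a stub, a check like `thumb_is_32bit` would decide whether one or two halfwords have to be fetched before decoding any further.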
In an ELF file, the encoding (ARM or Thumb) can vary from function to function.
However, I don't know how the ELF file records that attribute for each function.
Regarding the '(n)' notation: if a bit marked this way has a value other than 'n', the instruction is (in general) UNPREDICTABLE.
Section A5.1.2 of the ARM Architecture Reference Manual states the following.

An instruction is UNPREDICTABLE if:
• it is declared as UNPREDICTABLE in an instruction description or in this chapter
• the pseudocode for that encoding does not indicate that a different special case applies, and a bit marked (0) or (1) in the encoding diagram of an instruction is not 0 or 1 respectively.

From this, a disassembler MIGHT ignore the bits shown in parentheses.
For such an encoding, the objdump command reports UNDEFINED.
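To make the "(1)" rule concrete, here is a small Python sketch for the LDREX A1 diagram quoted in the question. The mask constants are transcribed by hand from that diagram and the names are invented for this example. It only checks the parenthesised bits; the separate `t == 15 || n == 15` rule and the cond == 0b1111 case are left out.

```python
# Fixed bits of LDREX A1: bits 27-20 = 0001 1001, bits 7-4 = 1001.
LDREX_A1_MASK = 0x0FF000F0
LDREX_A1_VALUE = 0x01900090
# Bits 11-8 and 3-0 are drawn as (1): "should be one".
LDREX_A1_SBO = 0x00000F0F


def check_ldrex_a1(insn):
    """Classify a 32-bit ARM word against the LDREX A1 diagram."""
    if (insn & LDREX_A1_MASK) != LDREX_A1_VALUE:
        return "no match"
    if (insn & LDREX_A1_SBO) != LDREX_A1_SBO:
        # A (1) bit is 0: UNPREDICTABLE per the rule quoted above.
        return "UNPREDICTABLE"
    return "ok"
```

A decoder that only needs the next PC could match on the fixed bits alone and treat the parenthesised bits as "don't care", which seems to be what the MIGHT above hints at.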
Best regards,
Yasuhiko Koumoto.
Juha Aaltonen over 11 years ago in reply to Yasuhiko Koumoto

Thanks, it didn't occur to me to look into the "UNDEFINED and UNPREDICTABLE" description (I didn't know the "()" notation was related to that). I tried searching the notation conventions and the like.
"ARM instruction is defined by bit 27-25 and bit 4 of the instruction word."
I guess it would be a good idea to re-check the ARM instructions from this point of view. That might simplify things.
I've been decoding ARM instructions (mainly A1) starting from 27-25 and then using 24,23 and 4.
But why are there different ARM encodings (A1, A2) and Thumb encodings (T1, T2, T3, T4)?
I can understand that ARM and Thumb encodings can have the same bit patterns, because which instruction is "seen" depends on the mode bit, but can A1 and A2 encoded instructions have the same bit patterns?
Or is it that instructions with different encodings in the same instruction set (ARM or Thumb) are different instructions, and the bit patterns don't overlap? And that the encoding class (like A1, A2) just tells that part of the instruction is encoded using a different 'encoding rule' to 're-use' instruction bits?
I'm asking this because I haven't found a complete instruction list (either with or without the binary form) anywhere.
I could make it myself, but I think it's a (time-consuming) project of its own and I don't have time for that at the moment.
Mike Clark over 11 years ago in reply to Juha Aaltonen

Turbs
Have you looked at the ARMv7-A and v7-R Architecture Reference Manual? That has, I believe, a complete list of the -A and -R instruction sets. As for the T1, T2 notation, these refer to Thumb 1 and Thumb 2 respectively. What T3 and T4 are, I can find no references whatever; it may be that T3 refers to the ill-fated ThumbEE. Likewise, the difference between A1 and A2 is a mystery, although the ARM instruction set was upgraded at the time that Thumb 2 was released, for compatibility reasons. Does anyone out there know where we can find an explanation of these mnemonics?
Juha Aaltonen over 11 years ago in reply to Mike Clark

From ARM ARM, ARMv7-A and ARMv7-R edition, Issue C:
(Sorry, copy from that pdf works funny. Boldings are mine.)

A8.8.203 STR (immediate, Thumb)

Store Register (immediate) calculates an address from a base register value and an immediate offset, and stores a word from a register to memory. It can use offset, post-indexed, or pre-indexed addressing. For information about memory accesses see Memory accesses on page A8-294.

Encoding T1 ARMv4T, ARMv5T*, ARMv6*, ARMv7
STR<c> <Rt>, [<Rn>{, #<imm>}]
Bits 15-0: 0 1 1 0 0 imm5 Rn Rt

Encoding T2 ARMv4T, ARMv5T*, ARMv6*, ARMv7
STR<c> <Rt>, [SP, #<imm>]
Bits 15-0: 1 0 0 1 0 Rt imm8

Encoding T3 ARMv6T2, ARMv7
STR<c>.W <Rt>, [<Rn>, #<imm12>]
First halfword, bits 15-0: 1 1 1 1 1 0 0 0 1 1 0 0 Rn
Second halfword, bits 15-0: Rt imm12

Encoding T4 ARMv6T2, ARMv7
STR<c> <Rt>, [<Rn>, #-<imm8>]
STR<c> <Rt>, [<Rn>], #+/-<imm8>
STR<c> <Rt>, [<Rn>, #+/-<imm8>]!
First halfword, bits 15-0: 1 1 1 1 1 0 0 0 0 1 0 0 Rn
Second halfword, bits 15-0: Rt 1 P U W imm8

and

A8.8.71 LDRBT

Load Register Byte Unprivileged loads a byte from memory, zero-extends it to form a 32-bit word, and writes it to a register. For information about memory accesses see Memory accesses on page A8-294.

The memory access is restricted as if the processor were running in User mode. This makes no difference if the processor is actually running in User mode.

LDRBT is UNPREDICTABLE in Hyp mode.

The Thumb instruction uses an offset addressing mode, that calculates the address used for the memory access from a base register value and an immediate offset, and leaves the base register unchanged.

The ARM instruction uses a post-indexed addressing mode, that uses a base register value as the address for the memory access, and calculates a new address from a base register value and an offset and writes it back to the base register. The offset can be an immediate value or an optionally-shifted register value.

Encoding T1 ARMv6T2, ARMv7
LDRBT<c> <Rt>, [<Rn>, #<imm8>]
First halfword, bits 15-0: 1 1 1 1 1 0 0 0 0 0 0 1 Rn
Second halfword, bits 15-0: Rt 1 1 1 0 imm8

if Rn == ‘1111’ then SEE LDRB (literal);
t = UInt(Rt); n = UInt(Rn); postindex = FALSE; add = TRUE;
register_form = FALSE; imm32 = ZeroExtend(imm8, 32);
if t IN {13,15} then UNPREDICTABLE;

Encoding A1 ARMv4*, ARMv5T*, ARMv6*, ARMv7
LDRBT<c> <Rt>, [<Rn>], #+/-<imm12>
Bits 31-0: cond 0 1 0 0 U 1 1 1 Rn Rt imm12

For the case when cond is 0b1111, see Unconditional instructions on page A5-216.
t = UInt(Rt); n = UInt(Rn); postindex = TRUE; add = (U == ‘1’);
register_form = FALSE; imm32 = ZeroExtend(imm12, 32);
if t == 15 || n == 15 || n == t then UNPREDICTABLE;

Encoding A2 ARMv4*, ARMv5T*, ARMv6*, ARMv7
LDRBT<c> <Rt>, [<Rn>],+/-<Rm>{, <shift>}
Bits 31-0: cond 0 1 1 0 U 1 1 1 Rn Rt imm5 type 0 Rm

For the case when cond is 0b1111, see Unconditional instructions on page A5-216.
t = UInt(Rt); n = UInt(Rn); m = UInt(Rm); postindex = TRUE; add = (U == ‘1’);
register_form = TRUE; (shift_t, shift_n) = DecodeImmShift(type, imm5);
if t == 15 || n == 15 || n == t || m == 15 then UNPREDICTABLE;
if ArchVersion() < 6 && m == n then UNPREDICTABLE;

I just haven't found an explanation for the T1-T4 and A1-A2.
And because I haven't found any simple instruction list either, and the binary forms are on average about one page apart in the ARM ARM, it would be a small project to compare the binary forms of the instructions just to see whether the encodings are merely a classification of instruction structures (classes of instructions that have a similar bit pattern) or something that affects the decoding (the same bit patterns interpreted in a different way).
It LOOKS like the encodings are just classes of instruction structures that are used for discussing the way instructions map into bits. But what's their purpose? Why do they exist?
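One way to test that guess without comparing pages by eye is to write each diagram as a (mask, value) pair and check whether two pairs can ever match the same word. A minimal Python sketch, with the pairs transcribed by hand from the LDRBT A1/A2 and STR T3/T4 diagrams above; the table and function names are made up, and the special cases in the pseudocode (SEE ..., UNPREDICTABLE, cond == 0b1111) are ignored:

```python
# An instruction word w matches a pattern when (w & mask) == value.
# 32-bit Thumb words are written as (first_halfword << 16) | second_halfword.
ARM_LDRBT = [
    ("LDRBT A1", 0x0F700000, 0x04700000),  # cond 0100 U111 ...
    ("LDRBT A2", 0x0F700010, 0x06700000),  # cond 0110 U111 ..., bit 4 = 0
]
THUMB32_STR = [
    ("STR T3", 0xFFF00000, 0xF8C00000),    # 1111 1000 1100 Rn | Rt imm12
    ("STR T4", 0xFFF00800, 0xF8400800),    # 1111 1000 0100 Rn | Rt 1PUW imm8
]


def can_overlap(a, b):
    """True if some word could match both patterns a and b."""
    _, mask_a, value_a = a
    _, mask_b, value_b = b
    # They can only collide if they agree on every bit both of them fix.
    return ((value_a ^ value_b) & mask_a & mask_b) == 0


def classify(word, table):
    """Return the names of all patterns in table that match word."""
    return [name for name, mask, value in table if (word & mask) == value]
```

With these tables, `can_overlap` is false for both pairs: A1 and A2 differ in bit 25, and T3 and T4 differ in bit 23 of the combined word. At least for these two instructions, the encoding labels name different bit patterns of one instruction set, not alternative readings of the same bits.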
Mike Clark over 11 years ago in reply to Juha Aaltonen

There are several issues here, really:
1. There IS a need for a succinct description of what T1-T4 and A1-A2 mean, and it should be placed in all documents which refer to them. Any comments, Mr Rampon?
2. If you want a useful list of all ARM and Thumb instructions, have you seen the various ARM Instruction Set Reference Cards? Google et al. will find you the PDFs; there is one for ARM and Thumb-2, and at least one for ARM only. They're a bit old by now, but very useful.
3. On the question of the different encodings, there are two factors; firstly, there are different generations (age-wise) of both instruction sets, and the newer the version, typically, the more functionality it provides. This is done in the immutable context that, whatever the instruction does, it has to fit into either a 16-bit or a 32-bit binary field. So, what was a good design for version 1 might be pretty useless for version 3. You know the saying: "If I was going there, I wouldn't start here!"
Juha Aaltonen over 11 years ago in reply to Mike Clark

At least QRC0001_UAL.pdf doesn't contain the binary representation, and I need to be sure that all possible (working) instructions of ARMv7-A are handled.
Mike Clark over 11 years ago in reply to Juha Aaltonen

Your starting point is the Quick Reference Card - that gives you your target instruction checklist. Then, for all its flaws, you need the Architecture Reference Manual, because it has all of the binary patterns IN CONTEXT. Finally, you might want to download the official GDB source code - it's free! The truth is, whichever way you slice it, you've got a Magnum Opus on your hands here! Best of luck, and please let us know how you get on.
Juha Aaltonen over 11 years ago in reply to Mike Clark

"The truth is, whichever way you slice it, you've got a Magnum Opus on your hands here!"

I've come to realize that. It's a long, laborious and frustrating task.
I did get the gdb sources (gdb_7.9.0), but without some cross-referencer it's very hard to find stuff there.
The same goes for OpenOCD. Function pointers everywhere, and no clue where they are set.
[edit]
I found a pretty good example of how to do the decoding far enough to handle single stepping.
gdbserver sources: arm-tdep.c, arm_get_next_pc_raw() and thumb_get_next_pc_raw().
The decoding is really done with style! The experience with ARM instruction sets shows.
I think I'll go with it and return to the full instruction list later.
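To give an idea of what "decoding far enough for single stepping" means, here is a heavily simplified Python sketch of a next-PC calculation for ARM state. It is not the gdb code: the function names are hypothetical, and only B/BL (bits 27-25 = 101) is decoded. Every other instruction is assumed to fall through to pc + 4, which is wrong for anything else that writes the PC (BX, LDR/LDM into pc, data processing with pc as destination), so a real stub needs many more cases.

```python
def condition_passed(cond, cpsr):
    """Evaluate an ARM condition code (0-14) against the CPSR N, Z, C, V flags."""
    n = (cpsr >> 31) & 1
    z = (cpsr >> 30) & 1
    c = (cpsr >> 29) & 1
    v = (cpsr >> 28) & 1
    base = [
        z == 1,              # EQ (odd code: NE)
        c == 1,              # CS (CC)
        n == 1,              # MI (PL)
        v == 1,              # VS (VC)
        c == 1 and z == 0,   # HI (LS)
        n == v,              # GE (LT)
        n == v and z == 0,   # GT (LE)
        True,                # AL
    ][cond >> 1]
    # Odd condition codes invert the test, except 0b1111.
    if (cond & 1) and cond != 0b1111:
        return not base
    return base


def arm_next_pc(pc, insn, cpsr):
    """Address of the next instruction to execute, or None if not handled here."""
    insn &= 0xFFFFFFFF
    cond = insn >> 28
    if cond == 0b1111:
        return None  # unconditional instruction space: not handled in this sketch
    if not condition_passed(cond, cpsr):
        return (pc + 4) & 0xFFFFFFFF
    if (insn >> 25) & 0x7 == 0b101:      # B or BL
        offset = (insn & 0x00FFFFFF) << 2
        if offset & 0x02000000:          # sign-extend the 26-bit offset
            offset -= 0x04000000
        return (pc + 8 + offset) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF         # assumption: no other PC write
```

The stub would then plant a temporary breakpoint at the returned address and resume, instead of relying on hardware single-step.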