I was going through the ARMv8 Architecture Reference Manual and found that it does not support many instructions that the ARMv7 architecture supported. For example, ARMv8 does not support conditional execution and instead has a separate instruction, CSEL, for implementing the same thing. On further reading I learned that conditional execution was removed to relieve pressure on the instruction encoding and because branch predictors have improved. However, as I came across more instructions, I found that ARM has also removed the RSB instruction; instead you need to use a combination of SUB and NEG instructions to achieve the same result. My question is: doesn't increasing the number of instructions also increase the number of cycles required to perform the same operation? Similarly, you can't load or store multiple registers on the stack with one instruction, only a pair of registers at a time. There are also various NEON instructions in ARMv7 that have no equivalent instruction in ARMv8, so doesn't that affect the performance of the program?
Hi Natesh,
ARMv8-A is focused on high performance and high throughput. The features you are concerned about, such as conditional execution and LDM/STM, get in the way of high-performance implementations of the ARM architecture that use superscalar, out-of-order execution. So they were removed at the architecture level. That's why the A72 has a large performance increase over the A15.
Does that mean that even if we use more instructions to achieve the same functionality, the performance is still better than on ARMv7? I find that a bit difficult to grasp. Can you please elaborate on this part?
Have you studied computer architecture? Reading about out-of-order execution will give you more background.
Take conditional execution as an example: it limits the instruction issue rate and increases hardware effort, while software tests show that conditional execution does not deliver good code density.
LDM/STM: the hardware needs to split an LDM/STM into many micro-ops (uops) and then send them to the functional units. In the write-back stage, the hardware needs to merge them. Hardware complexity increases, yet memory access throughput is not balanced. A64 uses load/store pair instructions (LDP/STP) to replace them.
Regarding RSB, I don't know much about it. I agree with you that it is a good instruction. Maybe it is not very useful to C/C++ compilers.
Any comment is welcome.
The performance metric relevant to your question is the total execution time consumed by a particular job. Total execution time (T) can be determined by:
T = (total number of clock cycles) x (clock period)
T = (total number of clock cycles) ÷ (clock frequency)
T = Σ_n ((cycles for instruction n) x (clock period))
T = Σ_n ((cycles for instruction n) ÷ (clock frequency))
If all the instructions used take the same number of clock cycles to execute:
T = (total number of instructions) x (cycles per instruction) x (clock period)
T = (total number of instructions) x (cycles per instruction) ÷ (clock frequency)
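As a rough illustration of the formulas above, here is a minimal Python sketch. The function name `execution_time`, the cycle counts, and the 1 GHz clock are all made-up assumptions for the example, not measurements of any real core. It sums per-instruction cycle counts and checks that, when every instruction takes the same number of cycles, the sum reduces to the simpler product form.

```python
def execution_time(cycles_per_instruction, clock_hz):
    """T = (sum of cycles over all executed instructions) / (clock frequency)."""
    return sum(cycles_per_instruction) / clock_hz

# Hypothetical instruction stream: one entry per executed instruction,
# each value being the cycles that instruction takes (illustrative only).
mixed_cycles = [1, 1, 3, 1, 2]
clock_hz = 1_000_000_000  # assumed 1 GHz clock

t_mixed = execution_time(mixed_cycles, clock_hz)

# Uniform case: every instruction takes the same number of cycles,
# so T = (number of instructions) * (CPI) / (clock frequency).
n_instructions, cpi = 5, 2
t_uniform = execution_time([cpi] * n_instructions, clock_hz)
t_closed_form = n_instructions * cpi / clock_hz

print(f"mixed stream:   {t_mixed * 1e9:.1f} ns")
print(f"uniform stream: {t_uniform * 1e9:.1f} ns")
```

This simple model ignores overlap between instructions in a pipelined or superscalar core; it only mirrors the arithmetic of the formulas.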
One approach is to minimize T by minimizing the number of instructions needed for the overall job. To achieve this, complex instructions that each perform more operations are employed. Complex instructions typically require more cycles per instruction (CPI) and have the side effect of lowering the maximum clock frequency; both factors counteract the objective of minimizing T. The other approach is to minimize T by decreasing the CPI and increasing the clock frequency. So even when more instructions are needed to accomplish a job, performance in terms of T can improve if the CPI is low enough and a higher clock frequency is attainable.
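To put numbers on that trade-off, here is a small Python sketch comparing two hypothetical designs. Every figure in it (instruction counts, CPI values, clock frequencies) is invented purely for illustration and does not describe any real ARMv7 or ARMv8 core; `total_time` is just a helper name chosen for this example.

```python
def total_time(instructions, cpi, clock_hz):
    """T = (total number of instructions) * (cycles per instruction) / (clock frequency)."""
    return instructions * cpi / clock_hz

# Design A (hypothetical): fewer, more complex instructions,
# at the cost of a higher CPI and a lower clock frequency.
t_a = total_time(instructions=1_000_000, cpi=2.0, clock_hz=1.0e9)

# Design B (hypothetical): 20% more, simpler instructions,
# but a lower CPI and a higher attainable clock frequency.
t_b = total_time(instructions=1_200_000, cpi=1.0, clock_hz=1.5e9)

print(f"Design A: {t_a * 1e3:.2f} ms")
print(f"Design B: {t_b * 1e3:.2f} ms")
print("B is faster" if t_b < t_a else "A is faster")
```

With these assumed numbers, design B finishes sooner even though it executes more instructions, because the drop in CPI and the higher clock outweigh the larger instruction count. Different numbers could of course tip the balance the other way.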