One of the key reasons for the fast performance of ARM microcontrollers is pipelining. The ARM7 core has a three-stage pipeline that can increase instruction throughput up to threefold. Each instruction is executed in three stages:
- Fetch – instruction is fetched from memory and placed in the pipeline;
- Decode – instruction is decoded and data-path signals prepared for the next cycle;
- Execute – the instruction uses the prepared data path: operands are read from the register bank, an operand is shifted, the ALU generates the result, and the result is written to the destination register.
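The way the three stages overlap can be sketched with a small simulation. This is only an idealized model, not a description of the actual ARM7 hardware: it assumes every instruction spends exactly one clock cycle in each stage, and the function name `pipeline_timeline` and the sample mnemonics are made up for illustration.

```python
# Idealized three-stage pipeline: one cycle per stage, no stalls, no branches.
STAGES = ("Fetch", "Decode", "Execute")


def pipeline_timeline(instructions):
    """Return one dict per clock cycle mapping stage name -> instruction or None."""
    n = len(instructions)
    # The last instruction needs (number of stages - 1) extra cycles to drain.
    total_cycles = n + len(STAGES) - 1 if n else 0
    timeline = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            # The instruction in a stage entered the pipeline `depth` cycles ago.
            idx = cycle - depth
            row[stage] = instructions[idx] if 0 <= idx < n else None
        timeline.append(row)
    return timeline


if __name__ == "__main__":
    program = ["ADD", "SUB", "MOV", "CMP"]  # hypothetical instruction stream
    for cycle, row in enumerate(pipeline_timeline(program), start=1):
        cells = "  ".join(f"{stage}: {row[stage] or '-':<4}" for stage in STAGES)
        print(f"cycle {cycle}: {cells}")
```

From the third cycle onward all three stages are busy at once, so under these assumptions one instruction leaves the Execute stage on every clock.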
Pipelining is implemented at the hardware level. The pipeline is linear, which means that in straight-line data-processing code the processor completes one instruction per clock cycle, even though each individual instruction takes three clock cycles to pass through the pipeline. When the program branches, however, the pipeline runs into difficulties because it cannot predict which instruction will come next. On a taken branch the pipeline is flushed and has to be refilled, which means that in the worst case execution speed drops to one instruction per three clock cycles. In practice it is rarely that bad. ARM instructions have features that let small branches in code run smoothly: most instructions can be executed conditionally, so a short branch can often be avoided altogether and the pipeline stays full. At the hardware level, the PC (Program Counter) always points 8 bytes ahead of the instruction currently being executed, because two further instructions have already been fetched, and branch offsets are calculated relative to that value.
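The cost of a flush and the 8-byte PC offset can be illustrated with a little arithmetic. This is a rough sketch under stated assumptions, not a cycle-accurate model: it assumes every instruction takes one cycle in Execute, that each taken branch costs two extra cycles to refill the pipeline, and that the core runs in ARM state with 4-byte instructions. Memory wait states and multi-cycle instructions such as loads and multiplies are ignored, and the function names are hypothetical.

```python
PIPELINE_DEPTH = 3          # Fetch, Decode, Execute
ARM_INSTRUCTION_BYTES = 4   # assumption: ARM state, not Thumb


def cycles_needed(n_instructions, taken_branches=0):
    """Estimate clock cycles for a run of instructions in the simplified model."""
    if n_instructions == 0:
        return 0
    refill = PIPELINE_DEPTH - 1  # cycles lost filling the two stages before Execute
    # One cycle per instruction, one initial fill, one refill per taken branch.
    return n_instructions + refill + taken_branches * refill


def pc_seen_by(executing_address):
    """Value of the PC while the instruction at executing_address is in Execute."""
    # Two later instructions have already been fetched, so the PC is 2 * 4 bytes ahead.
    return executing_address + (PIPELINE_DEPTH - 1) * ARM_INSTRUCTION_BYTES


if __name__ == "__main__":
    print(cycles_needed(100))                      # straight-line code
    print(cycles_needed(100, taken_branches=10))   # a few taken branches
    print(cycles_needed(100, taken_branches=100))  # worst case: every instruction branches
    print(hex(pc_seen_by(0x8000)))
```

In this model, 100 straight-line instructions take 102 cycles, close to one per clock, while the worst case of 100 taken branches takes 302 cycles, close to three per instruction. That is the gap that avoiding short branches helps to close.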