One of the key features behind the fast performance of ARM microcontrollers is pipelining. The ARM7 core has a three-stage pipeline that increases instruction throughput by up to three times. Each instruction is executed in three stages:
Fetch – the instruction is fetched from memory and placed in the pipeline;
Decode – the instruction is decoded and the data-path control signals are prepared for the next cycle;
Execute – the operands are read from the register bank, an operand is shifted, the ALU generates the result, and the result is written back to the destination register.
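To make the overlap of the three stages concrete, here is a minimal sketch that prints which instruction occupies each stage on every clock cycle. It is an idealized model, not a description of the real core: it assumes every instruction spends exactly one cycle per stage and that there are no branches or memory stalls. The function name pipeline_timeline and the sample mnemonics are made up for illustration.

```python
STAGES = ("Fetch", "Decode", "Execute")


def pipeline_timeline(instructions):
    """Return one dict per clock cycle mapping each stage to the
    instruction it holds (or None if the stage is empty).

    Idealized assumption: one cycle per stage, no stalls, no branches.
    """
    depth = len(STAGES)
    # The last instruction needs depth - 1 extra cycles to drain.
    total_cycles = len(instructions) + depth - 1
    timeline = []
    for cycle in range(total_cycles):
        row = {}
        for stage_index, stage in enumerate(STAGES):
            # The instruction in a later stage entered the pipeline earlier.
            i = cycle - stage_index
            row[stage] = instructions[i] if 0 <= i < len(instructions) else None
        timeline.append(row)
    return timeline


if __name__ == "__main__":
    program = ["ADD", "SUB", "MOV", "AND"]  # hypothetical instruction stream
    for cycle, row in enumerate(pipeline_timeline(program), start=1):
        print(cycle, row)
```

Once the pipeline is full (from the third cycle on), one instruction leaves the Execute stage on every cycle, even though each instruction still needs three cycles from fetch to completion.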
Pipelining is implemented at the hardware level. The pipeline is linear, which means that in simple data-processing code the processor completes one instruction per clock cycle, even though each individual instruction takes three clock cycles to pass through the pipeline. When the program branches, however, the pipeline runs into difficulties, because it cannot predict which instruction will come next. In that case the pipeline is flushed and has to be refilled, so execution speed drops to one instruction per three clock cycles. In practice it is not quite that bad. ARM instructions have a useful feature, conditional execution, that smooths out small branches in code and helps keep performance close to optimal. Also note that, because of the pipeline, the PC (Program Counter) is always 8 bytes ahead of the instruction currently being executed (in ARM state, where each instruction is 4 bytes long).
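The cycle counts and the PC offset described above can be checked with a little arithmetic. The sketch below is a simplified model under stated assumptions: every instruction takes one cycle in Execute, each taken branch costs two extra cycles to refill the pipeline, memory wait states are ignored, and the core is in ARM state with 4-byte instructions. The names total_cycles and pc_seen_by are hypothetical helpers, not part of any ARM tool.

```python
PIPELINE_DEPTH = 3         # Fetch, Decode, Execute
ARM_INSTRUCTION_BYTES = 4  # assumption: ARM state, 32-bit instructions


def total_cycles(instruction_count, taken_branches=0):
    """Clock cycles needed in the simplified model.

    The pipeline is filled once (depth - 1 cycles), then one instruction
    completes per cycle. Each taken branch discards the two instructions
    already fetched behind it, costing depth - 1 extra cycles.
    """
    if instruction_count == 0:
        return 0
    fill = PIPELINE_DEPTH - 1
    flush_penalty = PIPELINE_DEPTH - 1
    return instruction_count + fill + taken_branches * flush_penalty


def pc_seen_by(address):
    """Value the PC holds while the instruction at `address` executes.

    Two further instructions have already been fetched, so the PC is
    two instruction widths (8 bytes) ahead.
    """
    return address + 2 * ARM_INSTRUCTION_BYTES


if __name__ == "__main__":
    print(total_cycles(10))                     # straight-line code
    print(total_cycles(10, taken_branches=1))   # one taken branch
    print(hex(pc_seen_by(0x8000)))
```

In this model ten straight-line instructions take 12 cycles, and a single taken branch among them raises that to 14, which is why replacing a short branch with conditionally executed instructions can pay off.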