I have been optimising one of my C programs and needed one function to be as fast as possible, so I decided to use inline ASM for it. Here are a few notes about how it works.
There are a few rules that must be followed. Each ASM statement is divided by colons into three (or up to four) parts:
- Assembler instructions part;
- A list of output operands (comma separated);
- A list of input operands (comma separated);
- A list of clobbered registers (optional, usually left empty).
asm(code : output operand list : input operand list [: clobber list]);
Because of its optimisation strategy, the compiler decides which registers the ASM code uses, and it may move the inline ASM code or even drop it altogether if its outputs appear unused. To prevent the code from being removed, it is recommended to use the keyword volatile:
asm volatile(code : output operand list : input operand list [: clobber list]);
Let's go through it with some examples.
Let us say we want to enable or disable global interrupts. A single-instruction inline ASM statement (sei to enable, cli to disable) does this.
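As a minimal sketch of that idea, the two instructions can be wrapped in small helper functions. The names irq_enable, irq_disable and fake_i_flag are hypothetical and not part of any library, and the #else branch is only an assumption that lets the same file build on a PC for testing. The "memory" clobber is a cautious extra that keeps memory accesses from being moved across the instruction.

```c
#include <stdint.h>

#if defined(__AVR__)
/* On the AVR target each helper is exactly one instruction, no operands. */
static inline void irq_enable(void)  { __asm__ __volatile__("sei" ::: "memory"); }
static inline void irq_disable(void) { __asm__ __volatile__("cli" ::: "memory"); }
#else
/* Host build (assumption, for testing only): model the I flag with a variable. */
static uint8_t fake_i_flag;
static inline void irq_enable(void)  { fake_i_flag = 1; }
static inline void irq_disable(void) { fake_i_flag = 0; }
#endif
```

avr-libc already ships sei() and cli() macros in <avr/interrupt.h>, so in real code those are usually the better choice.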
An empty command (nop) may be inserted like this:
asm volatile( "nop ;this is comment" "\n\t"
"nop ;this ASM inline includes 2 nops" "\n\t"
::);
Note: the newline in "\n\t" separates the instructions; the tab only makes the generated assembler listing easier to read.
When inserting inline ASM code into a C program, it is possible to use some special registers that do not have to be assigned to any variable:
|__SREG__||Status register at address 0x3F|
|__SP_H__||Stack pointer high byte at address 0x3E|
|__SP_L__||Stack pointer low byte at address 0x3D|
|__tmp_reg__||Register r0, used for temporary storage|
|__zero_reg__||Register r1, always zero|
Input and output operands are described by a constraint string followed by a C expression in parentheses:
|Constraint||Used for||Range|
|a||Simple upper registers||r16 to r23|
|b||Base pointer registers pairs||y, z|
|d||Upper register||r16 to r31|
|e||Pointer register pairs||x, y, z|
|G||Floating point constant||0.0|
|I||6-bit positive integer constant||0 to 63|
|J||6-bit negative integer constant||-63 to 0|
|l||Lower registers||r0 to r15|
|M||8-bit integer constant||0 to 255|
|O||Integer constant||8, 16, 24|
|q||Stack pointer register||SPH:SPL|
|r||Any register||r0 to r31|
|w||Special upper register pairs||r24, r26, r28, r30|
|x||Pointer register pair X||x (r27:r26)|
|y||Pointer register pair Y||y (r29:r28)|
|z||Pointer register pair Z||z (r31:r30)|
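To make the table a little more concrete, here is a small sketch that combines two of the constraints. The ldi instruction only works on the upper registers and takes an 8-bit constant, so the output uses "d" and the constant uses "M". The function name load_answer is hypothetical, and the #else branch is only an assumption for building the file on a PC.

```c
#include <stdint.h>

static inline uint8_t load_answer(void)
{
    uint8_t result;
#if defined(__AVR__)
    /* "=d": write-only output in r16..r31, "M": constant in 0..255 */
    __asm__ __volatile__("ldi %0, %1" : "=d"(result) : "M"(42));
#else
    /* Host fallback (assumption, for testing only) */
    result = 42;
#endif
    return result;
}
```

With "=r" instead of "=d" the compiler would be free to pick a lower register, and the assembler would then reject the ldi.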
The avr-libc manual (linked at the end of this post) has a table of all assembler mnemonics which require operands, together with the related constraints.
Constraint characters may be prepended by a single constraint modifier. Constraints without a modifier specify read-only operands. Modifiers are:
|=||Write-only operand, usually used for all output operands.|
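|+||Read-write operand (older avr-gcc releases rejected it in inline assembler; current GCC accepts it)|
|&||Early clobber: the register is used for output only and must not be shared with an input operand|
Note: With those older compilers, output operands must always be write-only.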
An input operand does not have to be read-only, for instance when you need the same register for input and output. In that case you may use a digit in the constraint string:
asm volatile("swap %0" : "=r" (value) : "0" (value));
Constraint "0" tells the compiler to use the same register as operand number 0 (%0). Let's look at another example:
asm volatile("in %0,%1" "\n\t"
"out %1, %2" "\n\t"
: "=&r" (input)
: "I" (_SFR_IO_ADDR(PORTD)), "r" (output)
);
Let's take a look at the first line, "in %0,%1". The operand %0 is replaced with the register in which the input value is stored. That register is write-only and is used for output only (the & modifier). The operand %1 is replaced with "I" (_SFR_IO_ADDR(PORTD)), which resolves to the I/O address of PORTD.
Note: An I/O register address always has to be passed as an input operand.
The second line of ASM code is similar, except that the %2 operand may be placed in any register (r0 to r31).
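The swap statement can be wrapped in a function to see the digit constraint at work. This is only a sketch: the function name swap_nibbles is hypothetical, and the #else branch is an assumed plain C equivalent so the file can also be built and checked on a PC.

```c
#include <stdint.h>

static inline uint8_t swap_nibbles(uint8_t value)
{
#if defined(__AVR__)
    /* "0" ties the input to the same register as output operand %0 */
    __asm__ __volatile__("swap %0" : "=r"(value) : "0"(value));
#else
    /* Host fallback (assumption, for testing only) */
    value = (uint8_t)((value << 4) | (value >> 4));
#endif
    return value;
}
```

For example, swap_nibbles(0x12) gives 0x21.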
What if we need to pass a 32-bit value to inline ASM? Then we can use letters that refer to the individual 8-bit registers of the operand:
uint32_t value=0xffffffff;
asm volatile("mov __tmp_reg__, %A0" "\n\t"
"mov %A0, %D0" "\n\t"
"mov %D0, __tmp_reg__" "\n\t"
"mov __tmp_reg__, %B0" "\n\t"
"mov %B0, %C0" "\n\t"
"mov %C0, __tmp_reg__" "\n\t"
: "=r" (value)
: "0" (value)
);
%A0 is the lowest byte of the 32-bit value and %D0 is the highest byte; all operations are performed on these bytes separately. The result is then returned as a 32-bit output operand by using a digit as the input constraint ("0" in this example).
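The six mov instructions above exchange byte A with byte D and byte B with byte C, so the 32-bit value ends up byte-reversed. A sketch of the same thing as a function follows; the name reverse_bytes32 is hypothetical, and the #else branch is an assumed plain C equivalent for checking the result on a PC.

```c
#include <stdint.h>

static inline uint32_t reverse_bytes32(uint32_t value)
{
#if defined(__AVR__)
    __asm__ __volatile__(
        "mov __tmp_reg__, %A0" "\n\t"   /* swap lowest and highest byte */
        "mov %A0, %D0"         "\n\t"
        "mov %D0, __tmp_reg__" "\n\t"
        "mov __tmp_reg__, %B0" "\n\t"   /* swap the two middle bytes */
        "mov %B0, %C0"         "\n\t"
        "mov %C0, __tmp_reg__" "\n\t"
        : "=r"(value)
        : "0"(value));
#else
    /* Host fallback (assumption, for testing only) */
    value = (value >> 24)
          | ((value >> 8) & 0x0000FF00UL)
          | ((value << 8) & 0x00FF0000UL)
          | (value << 24);
#endif
    return value;
}
```

For example, reverse_bytes32(0x11223344) gives 0x44332211.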
The last thing I would like to cover is pointers. A pointer is passed as an input operand with a pointer register constraint such as "e" or "z". Suppose the compiler selects register Z (r31:r30). Then:
%A0 refers to r30
%B0 refers to r31
But if you need to access the memory location whose address is stored in the Z register, as in
ld r24, Z
then you need to use the lower-case letter form of the operand:
ld r24, %a0
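A small sketch of a pointer operand in use: the function below reads one byte through a pointer, with %a1 expanding to whichever pointer register (X, Y or Z) the compiler picked. The name load_byte is hypothetical, the "memory" clobber is a cautious assumption so that pending writes are flushed before the load, and the #else branch only exists for building on a PC.

```c
#include <stdint.h>

static inline uint8_t load_byte(const uint8_t *ptr)
{
    uint8_t result;
#if defined(__AVR__)
    /* %0 = destination register, %a1 = pointer register pair as X, Y or Z */
    __asm__ __volatile__("ld %0, %a1"
                         : "=r"(result)
                         : "e"(ptr)
                         : "memory");
#else
    /* Host fallback (assumption, for testing only) */
    result = *ptr;
#endif
    return result;
}
```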
A few words about clobbers. When you use registers that have not been passed as operands, you need to inform the compiler through the clobber list. For instance:
asm volatile( "cli" "\n\t"
"ld r24, %a0" "\n\t"
"inc r24" "\n\t"
"st %a0, r24" "\n\t"
"sei" "\n\t"
:
: "e" (ptr)
: "r24"
);
In this example we are using the r24 register. The compiler produces the following code fragment in the listing:
cli
ld r24, Z
inc r24
st Z, r24
sei
Another clobber definition is "memory", which tells the compiler that the assembler code may modify any memory location. It forces the compiler to write back all cached variables before executing the ASM code and to reload them afterwards. Try to avoid clobbers where possible, because this gives the compiler more freedom to optimise the code.
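One way to avoid the fixed r24 clobber is to let the compiler pick the scratch register itself, by declaring a local variable as an early-clobber output ("=&r"). The following is a sketch of the increment example rewritten that way. The function name increment_byte is hypothetical, the "memory" clobber is kept because the code writes through the pointer, and the #else branch is only an assumption for building on a PC.

```c
#include <stdint.h>

static inline void increment_byte(volatile uint8_t *ptr)
{
#if defined(__AVR__)
    uint8_t scratch;                 /* register chosen by the compiler */
    __asm__ __volatile__(
        "cli"          "\n\t"
        "ld %0, %a1"   "\n\t"
        "inc %0"       "\n\t"
        "st %a1, %0"   "\n\t"
        "sei"          "\n\t"
        : "=&r"(scratch)             /* & keeps it apart from the pointer */
        : "e"(ptr)
        : "memory");
#else
    /* Host fallback (assumption, for testing only) */
    (*ptr)++;
#endif
}
```

Like the original fragment, this always re-enables interrupts with sei, even if they were disabled before the call.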
If you need to reuse some assembler parts more than once, it is recommended to define macros. In avr-libc you may find many of them. To avoid compiler warnings (for example in strict ANSI mode) use __asm__ instead of asm and __volatile__ instead of volatile. The other options are the same as in regular inline assembler:
#define loop_until_bit_is_clear(port,bit) \
__asm__ __volatile__ ( \
"1: " "sbic %0, %1" "\n\t" \
"rjmp 1b" \
: /* no outputs */ \
: "I" (_SFR_IO_ADDR(port)), "I" (bit) \
)
For my AVR controlled generator I wrote a stub function (a function that contains nothing but assembler code). For larger routines it is better to write such stub functions, because a macro inserts its whole body at every place it is used instead of calling it, which can inflate code size. My stub function for the AVR DDS generator:
void signalOUT(const uint8_t *signal, uint8_t ad2, uint8_t ad1, uint8_t ad0)
{
asm volatile( "eor r28, r28 ;r28<-0" "\n\t"
"eor r29, r29 ;r29<-0" "\n\t"
"Loop1:" "\n\t"
"add r28, %0 ;1 cycle" "\n\t"
"adc r29, %1 ;1 cycle" "\n\t"
"adc %A0, %2 ;1 cycle" "\n\t"
"lpm __tmp_reg__, %a3+ ;3 cycles" "\n\t"
"out %4, __tmp_reg__ ;1 cycle" "\n\t"
"rjmp Loop1 ;2 cycles. Total 9 cycles" "\n\t"
: /* no outputs */
:"r" (ad0),"r" (ad1),"r" (ad2),"z" (signal),"I" (_SFR_IO_ADDR(PORTD)) /* "z": lpm only accepts the Z pointer */
);
}
Listing output fragment:
1768 /* #APP */
1769 00f6 CC27 eor r28, r28 ;r28<-0
1770 00f8 DD27 eor r29, r29 ;r29<-0
1772 00fa C20F add r28, r18 ;1 cycle
1773 00fc D41F adc r29, r20 ;1 cycle
1774 00fe 261F adc r18, r22 ;1 cycle
1775 0100 0590 lpm __tmp_reg__, Z+ ;3 cycles
1776 0102 02BA out 18, __tmp_reg__ ;1 cycle
1777 0104 FACF rjmp Loop1 ;2 cycles. Total 9 cycles
1779 /* #NOAPP */
Note: The /* #APP */ and /* #NOAPP */ comments are generated by the compiler to mark the statements that were not generated by the compiler itself (inline ASM).
I wanted the loop to be as short as possible, and managed to get it down to 9 clock cycles per iteration. The code fragment is from https://www.myplace.nu/avr/minidds/minidds.asm
On the other hand, it is easier to calculate signal timings, because the inline ASM is not affected by compiler optimisation.
Read more about using inline ASM with WinAVR at https://www.nongnu.org/avr-libc/user-manual/inline_asm.html