I have been optimizing one of my C programs and needed one function to be as fast as possible, so I decided to use inline ASM. Here are a few notes about it.
There are a few rules to follow. Each ASM statement is divided by colons into three or four parts:
- Assembler instructions part;
- A list of output operands (comma separated);
- A list of input operands (comma separated);
- A list of clobbered registers (optional, often left empty).
asm(code : output operand list : input operand list [: clobber list]);
The compiler chooses which registers the ASM code uses, and its optimizer may also drop the inserted inline ASM code entirely if the result appears to be unused. To prevent the code from being removed, it is recommended to use the keyword volatile:
asm volatile(code : output operand list : input operand list [: clobber list]);
Let's go through it with some examples.
Say we want to enable or disable global interrupts. A single inline ASM statement containing the sei or cli instruction does this.
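A minimal sketch of such statements, wrapped in two helper functions. The names interrupts_on and interrupts_off are hypothetical, and the non-AVR branch (with its simulated_i_flag variable) is only an assumption added here so the snippet can be compiled and checked on a PC; on the AVR only the asm lines matter.

```c
#include <stdint.h>

#if !defined(__AVR__)
/* Host-only stand-in for the I flag in SREG (hypothetical, for testing). */
static uint8_t simulated_i_flag = 1;
#endif

/* Disable global interrupts: cli clears the I flag in SREG. */
static inline void interrupts_off(void)
{
#if defined(__AVR__)
    asm volatile("cli" ::);
#else
    simulated_i_flag = 0;
#endif
}

/* Enable global interrupts: sei sets the I flag in SREG. */
static inline void interrupts_on(void)
{
#if defined(__AVR__)
    asm volatile("sei" ::);
#else
    simulated_i_flag = 1;
#endif
}
```

Neither statement has operands, so the output and input lists are simply left empty.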
An empty command (nop) may be inserted like this:
asm volatile( "nop ;this is comment" "\n\t"
              "nop ;this ASM inline includes 2 nops" "\n\t"
              ::);
Note: the newline in "\n\t" separates the instructions; the tab is there only to make the generated listing easier to read.
When inserting inline ASM code into a C program, you can use some special registers that do not have to be passed as operands:
| Symbol | Register |
|---|---|
| __SREG__ | Status register at address 0x3F |
| __SP_H__ | Stack pointer high byte at address 0x3E |
| __SP_L__ | Stack pointer low byte at address 0x3D |
| __tmp_reg__ | Register r0, used for temporary storage |
| __zero_reg__ | Register r1, always zero |
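As a rough illustration of these special registers, the sketch below adds an 8-bit value to a 16-bit one and uses __zero_reg__ (avr-gcc's symbol for r1) to propagate the carry into the high byte. The function name add_u8_to_u16 is hypothetical, and the non-AVR branch is an assumption added only so the snippet can be checked on a PC. The %A0 and %B0 notation selects the low and high byte of operand 0 and is covered further on.

```c
#include <stdint.h>

/* Add an 8-bit value to a 16-bit accumulator. */
static inline uint16_t add_u8_to_u16(uint16_t acc, uint8_t x)
{
#if defined(__AVR__)
    asm volatile(
        "add %A0, %2"           "\n\t"  /* low byte += x */
        "adc %B0, __zero_reg__" "\n\t"  /* high byte += carry (r1 is always 0) */
        : "=r" (acc)
        : "0" (acc), "r" (x)
    );
#else
    /* Portable fallback with the same result. */
    acc = (uint16_t)(acc + x);
#endif
    return acc;
}
```

Because __zero_reg__ is not an operand, it does not appear in the operand lists, and since it is only read here, nothing has to be declared as clobbered.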
Input and output operands are each described by a constraint string followed by a C expression in parentheses. The constraint characters are:
| Constraint | Used for | Range |
|---|---|---|
| a | Simple upper registers | r16 to r23 |
| b | Base pointer register pairs | y, z |
| d | Upper register | r16 to r31 |
| e | Pointer register pairs | x, y, z |
| G | Floating point constant | 0.0 |
| I | 6-bit positive integer constant | 0 to 63 |
| J | 6-bit negative integer constant | -63 to 0 |
| l | Lower registers | r0 to r15 |
| M | 8-bit integer constant | 0 to 255 |
| O | Integer constant | 8, 16, 24 |
| q | Stack pointer register | SPH:SPL |
| r | Any register | r0 to r31 |
| w | Special upper register pairs | r24, r26, r28, r30 |
| x | Pointer register pair X | x (r27:r26) |
| y | Pointer register pair Y | y (r29:r28) |
| z | Pointer register pair Z | z (r31:r30) |
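To see why the constraint matters, consider the andi instruction, which only works on the upper registers r16 to r31 and takes an 8-bit constant. A sketch under those assumptions: the "d" constraint keeps the operand in an upper register and "M" passes the constant. The function name low_nibble is hypothetical, and the non-AVR branch is added only so the snippet can be checked on a PC.

```c
#include <stdint.h>

/* Keep only the low four bits of v. */
static inline uint8_t low_nibble(uint8_t v)
{
#if defined(__AVR__)
    asm volatile(
        "andi %0, %2" "\n\t"      /* andi needs r16..r31 and an 8-bit constant */
        : "=d" (v)                /* "d": upper register r16 to r31 */
        : "0" (v), "M" (0x0F)     /* "M": 8-bit integer constant */
    );
#else
    /* Portable fallback with the same result. */
    v &= 0x0F;
#endif
    return v;
}
```

With a plain "r" constraint the compiler would be free to pick a register below r16, and the assembler would then reject the instruction.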
The following table shows all assembler mnemonics which require operands and related constraints.
Constraint characters may be prepended by a single constraint modifier. Constraints without a modifier specify read-only operands. Modifiers are:
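| Modifier | Specifies |
|---|---|
| = | Write-only operand, usually used for all output operands. |
| + | Read-write operand (listed as not supported by inline assembler in older toolchain documentation; current GCC releases accept it). |
| & | Register should be used for output only. |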
Note: Output operands always must be write-only.
An input operand does not have to be read-only, for instance when you need the same register for input and output. In that case you may use a digit in the constraint string:
asm volatile("swap %0" : "=r" (value) : "0" (value));
The constraint "0" tells the compiler to use the same register as operand 0 (%0).
Let's look at another example:
asm volatile("in %0,%1"    "\n\t"
             "out %1, %2"  "\n\t"
             : "=&r" (input)
             : "I" (_SFR_IO_ADDR(PORTD)), "r" (output) );
Let's take a look at the first line, "in %0,%1". The operand %0 is replaced with the register in which the value of input is stored. The register is write-only, and the & modifier tells the compiler not to share it with any input operand. The operand %1 is replaced with "I" (_SFR_IO_ADDR(PORTD)), which resolves to the I/O address of PORTD.
Note: an I/O register address always has to be passed as an input operand.
The second line of ASM code is similar, except that the %2 operand may be placed in any register (r0 to r31) and holds the value of output.
What if we need to pass a 32-bit value to inline ASM? In that case a letter placed between % and the operand number selects one of the 8-bit registers that hold the value:
uint32_t value = 0xffffffff;
asm volatile("mov __tmp_reg__, %A0" "\n\t"
             "mov %A0, %D0"         "\n\t"
             "mov %D0, __tmp_reg__" "\n\t"
             "mov __tmp_reg__, %B0" "\n\t"
             "mov %B0, %C0"         "\n\t"
             "mov %C0, __tmp_reg__" "\n\t"
             : "=r" (value)
             : "0" (value) );
%A0 is the lowest byte of the 32-bit value and %D0 is the highest; %B0 and %C0 are the bytes in between. All operations are performed on these bytes separately. The result comes back as a single 32-bit output operand because the input is tied to it with a digit constraint ("0" in this example).
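The instruction sequence above swaps byte A with byte D and byte B with byte C, which reverses the byte order of the value. A sketch that wraps it in a function so the effect can be checked: the name reverse_bytes is hypothetical, and the non-AVR branch is a plain C equivalent added here as an assumption for testing on a PC.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value. */
static inline uint32_t reverse_bytes(uint32_t value)
{
#if defined(__AVR__)
    asm volatile(
        "mov __tmp_reg__, %A0" "\n\t"  /* swap lowest and highest byte */
        "mov %A0, %D0"         "\n\t"
        "mov %D0, __tmp_reg__" "\n\t"
        "mov __tmp_reg__, %B0" "\n\t"  /* swap the two middle bytes */
        "mov %B0, %C0"         "\n\t"
        "mov %C0, __tmp_reg__" "\n\t"
        : "=r" (value)
        : "0" (value)
    );
#else
    /* Portable fallback with the same result. */
    value = (value >> 24)
          | ((value >> 8) & 0x0000FF00UL)
          | ((value << 8) & 0x00FF0000UL)
          | (value << 24);
#endif
    return value;
}
```

A test value with four distinct bytes, such as 0x11223344, makes the swap visible, which 0xffffffff does not.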
The last thing I would like to cover is pointers. A pointer can be passed as an input operand with a pointer-register constraint, for example "e" (ptr). If the compiler then selects register pair Z (r31:r30):
%A0 refers to r30
%B0 refers to r31
But if you need to access the memory location whose address is stored in the Z register, like
ld r24, Z
then you need to write the operand with a lowercase letter:
ld r24, %a0
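Putting the pointer notation together, here is a small sketch that loads one byte through a pointer operand. The function name load_byte is hypothetical, and the non-AVR branch is an assumption added only so the snippet can be checked on a PC. The "memory" clobber is included because the ASM code reads memory that the compiler does not otherwise know about.

```c
#include <stdint.h>

/* Read the byte that ptr points to. */
static inline uint8_t load_byte(const uint8_t *ptr)
{
    uint8_t result;
#if defined(__AVR__)
    asm volatile(
        "ld %0, %a1" "\n\t"   /* %a1 expands to X, Y or Z, whichever was chosen */
        : "=r" (result)
        : "e" (ptr)           /* "e": any pointer register pair */
        : "memory"            /* the asm reads memory behind the compiler's back */
    );
#else
    /* Portable fallback with the same result. */
    result = *ptr;
#endif
    return result;
}
```

Writing %1 instead of %a1 here would expand to a plain register name such as r30, which ld does not accept as an address.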
A few words about clobbers. When the ASM code uses registers that have not been passed as operands, you need to inform the compiler by listing them as clobbers. For instance:
asm volatile( "cli"         "\n\t"
              "ld r24, %a0" "\n\t"
              "inc r24"     "\n\t"
              "st %a0, r24" "\n\t"
              "sei"         "\n\t"
              :
              : "e" (ptr)
              : "r24" );
In this example we are using the r24 register. The compiler produces the following code fragment in the listing:
cli
ld r24, Z
inc r24
st Z, r24
sei
Another clobber definition is "memory", which means that the assembler code may modify any memory location. It forces the compiler to write all cached variables back to memory before executing the ASM code and to reload them afterwards. Avoid clobbers where possible, because this gives the compiler more freedom to optimize the code.
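One way to avoid a register clobber is to let the compiler pick the scratch register by declaring it as an output operand. The sketch below is a variation on the increment example under that assumption: it also saves and restores the status register through __SREG__ instead of unconditionally executing sei. The function name increment_byte is hypothetical, and the non-AVR branch is added only so the snippet can be checked on a PC.

```c
#include <stdint.h>

/* Increment the byte at ptr with interrupts disabled. */
static inline void increment_byte(uint8_t *ptr)
{
#if defined(__AVR__)
    uint8_t tmp;   /* scratch register chosen by the compiler */
    uint8_t sreg;  /* saved copy of the status register */
    asm volatile(
        "in  %1, __SREG__"  "\n\t"  /* save status register */
        "cli"               "\n\t"  /* disable interrupts */
        "ld  %0, %a2"       "\n\t"  /* load the byte */
        "inc %0"            "\n\t"
        "st  %a2, %0"       "\n\t"  /* store it back */
        "out __SREG__, %1"  "\n\t"  /* restore previous interrupt state */
        : "=&r" (tmp), "=&r" (sreg) /* & keeps them apart from the pointer */
        : "e" (ptr)
        : "memory"                  /* the byte at ptr is modified */
    );
#else
    /* Portable fallback with the same result (no interrupt handling). */
    (*ptr)++;
#endif
}
```

No register name appears in the clobber list, so the compiler is free to use whichever registers happen to be available at the call site.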
If you need to reuse some assembler parts more than once, it is recommended to define macros. In avr-libc you may find many of them. To avoid compiler warnings in strict ANSI C mode, use __asm__ instead of asm and __volatile__ instead of volatile. Everything else is the same as in a regular inline assembler statement:
#define loop_until_bit_is_clear(port,bit) \
    __asm__ __volatile__ ( \
        "1: " "sbic %0, %1" "\n\t" \
        "rjmp 1b" \
        : /* no outputs */ \
        : "I" (_SFR_IO_ADDR(port)), "I" (bit) \
    )
I wrote a stub function (a function that contains nothing but assembler code). Larger routines are better written as stub functions, because a macro inserts its full code at every place it is used instead of calling it, which can bloat the program. My stub function for the AVR DDS generator:
void signalOUT(const uint8_t *signal, uint8_t ad2, uint8_t ad1, uint8_t ad0)
{
asm volatile( "eor r28, r28 ;r28<-0" "\n\t"
"eor r29, r29 ;r29<-0" "\n\t"
"Loop1:" "\n\t"
"add r28, %0 ;1 cycle" "\n\t"
"adc r29, %1 ;1 cycle" "\n\t"
"adc %A0, %2 ;1 cycle" "\n\t"
"lpm __tmp_reg__, %a3+ ;3 cycles" "\n\t"
"out %4, __tmp_reg__ ;1 cycle" "\n\t"
"rjmp Loop1 ;2 cycles. Total 9 cycles" "\n\t"
: /* no outputs */
:"r" (ad0),"r" (ad1),"r" (ad2),"e" (signal),"I" (_SFR_IO_ADDR(PORTD))
);
}
Listing output fragment:
1768 /* #APP */
1769 00f6 CC27 eor r28, r28 ;r28<-0
1770 00f8 DD27 eor r29, r29 ;r29<-0
1772 00fa C20F add r28, r18 ;1 cycle
1773 00fc D41F adc r29, r20 ;1 cycle
1774 00fe 261F adc r18, r22 ;1 cycle
1775 0100 0590 lpm __tmp_reg__, Z+ ;3 cycles
1776 0102 02BA out 18, __tmp_reg__ ;1 cycle
1777 0104 FACF rjmp Loop1 ;2 cycles. Total 9 cycles
1779 /* #NOAPP */
Note: the /* #APP */ and /* #NOAPP */ comments are generated by the compiler to mark the statements that come from inline ASM rather than from compiled C code.
I wanted to make the loop as short as possible, and I managed to get it down to 9 clock cycles per iteration. The code fragment is from https://www.myplace.nu/avr/minidds/minidds.asm
A side benefit is that signal timings are easier to calculate, because the inline asm is not altered by compiler optimization.
Read more about using inline asm using WinAVR from https://www.nongnu.org/avr-libc/user-manual/inline_asm.html