Functions and Stack Management in Arm64 Assembly

Introduction

Functions are the cornerstone of modular programming, allowing code reuse, abstraction, and maintainability. In the previous tutorials, we learned about registers, basic instructions, and control flow. Now we'll explore how to properly implement functions in Arm64 assembly, following the Procedure Call Standard for the Arm 64-bit Architecture (AAPCS64).

Understanding function calling conventions is essential for:

  • Interoperability: Calling C/C++ functions from assembly and vice versa
  • Correctness: Properly preserving register values across function calls
  • Debugging: Understanding stack frames and backtraces
  • Optimization: Writing efficient function prologues and epilogues

This tutorial covers the AAPCS64 calling convention, stack management, parameter passing, return values, and practical function patterns.

AAPCS64 Overview

The Procedure Call Standard for the Arm 64-bit Architecture (AAPCS64) defines how functions interact:

Key Principles

  1. Register Usage: Defines which registers are preserved across calls
  2. Parameter Passing: First 8 integer args in x0-x7, floating-point in v0-v7
  3. Return Values: Results in x0 (or x0+x1, or v0)
  4. Stack Alignment: Stack pointer must be 16-byte aligned at public interfaces
  5. Stack Growth: Stack grows downward (from high to low addresses)

Register Preservation Rules

Register(s)   Role                              Preserved?   Notes
-----------   -------------------------------   ----------   ------------------------------------------
x0 - x7       Arguments/results                 No           Caller-saved (scratch)
x8            Indirect result location          No           Caller-saved
x9 - x15      Temporary                         No           Caller-saved
x16 - x17     IP0, IP1 (intra-procedure-call)   No           Linker scratch
x18           Platform register                 Maybe        Platform dependent
x19 - x28     General purpose                   Yes          Callee-saved
x29 (FP)      Frame pointer                     Yes          Callee-saved
x30 (LR)      Link register                     Special      Holds the return address; bl overwrites it,
                                                             so a function saves it before making a call
SP            Stack pointer                     Yes          Must be 16-byte aligned

What "Preserved" Means

// Caller's perspective:
main:
    mov     x19, #42        // x19 = 42
    bl      some_function
    // x19 is still 42 here (callee must preserve it)

    mov     x9, #100        // x9 = 100
    bl      some_function
    // x9 might be changed (caller-saved)

// Callee's perspective:
some_function:
    // If we use x19, must save/restore it
    stp     x19, x20, [sp, #-16]!
    mov     x19, #999       // OK to modify now
    ldp     x19, x20, [sp], #16    // Restore before return

    // x9 can be used without saving
    mov     x9, #777        // No need to preserve
    ret
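
The flip side of these rules is that a caller which needs a temporary such as x9 to survive a call has to save it itself. A minimal sketch, assuming a hypothetical caller named keep_temp (not part of the original example) and the some_function shown above:

```asm
// Hypothetical caller that needs x9 to survive a call
keep_temp:
    stp     x29, x30, [sp, #-32]!   // Save FP and LR, reserve 16 extra bytes
    mov     x29, sp

    mov     x9, #100                // x9 is caller-saved
    str     x9, [sp, #16]           // Spill it before the call
    bl      some_function           // May clobber x0-x17
    ldr     x9, [sp, #16]           // Reload: x9 = 100 again

    mov     x0, x9                  // Return the preserved value
    ldp     x29, x30, [sp], #32
    ret
```

In practice it is often cheaper to keep such a value in one of x19-x28 and pay for the save/restore once in the prologue and epilogue, rather than spilling around every call.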

Stack Frame Structure

A typical stack frame, as built by the prologues used in this tutorial (stp x29, x30, [sp, #-48]! followed by mov x29, sp), looks like this. Each label gives the address of the slot directly above it, relative to SP after the prologue:

High addresses
+------------------+ <- SP + 48 (SP on entry; caller's frame is above)
| Local var 2      |
+------------------+ <- SP + 40
| Local var 1      |
+------------------+ <- SP + 32
| Saved x20        |
+------------------+ <- SP + 24
| Saved x19        |
+------------------+ <- SP + 16
| Saved LR (x30)   |
+------------------+ <- SP + 8
| Saved FP (x29)   |
+------------------+ <- SP = FP (x29), 16-byte aligned
Low addresses

Frame Pointer (FP / x29)

The frame pointer provides a stable reference point for:

  • Accessing local variables
  • Debugging (stack unwinding)
  • Exception handling

function:
    // Establish frame
    stp     x29, x30, [sp, #-48]!   // Save FP, LR and allocate space
    mov     x29, sp                  // FP points to saved FP

    // Access local variables via FP
    str     x0, [x29, #16]          // Store at FP + 16
    ldr     x1, [x29, #16]          // Load from FP + 16

    // Tear down frame
    ldp     x29, x30, [sp], #48
    ret
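
Because every frame built this way starts with a saved FP/LR pair, the saved FP values form a linked list, which is what a debugger follows when it unwinds the stack. A minimal sketch of that walk, assuming a hypothetical helper count_frames, that every function on the stack maintains x29 as shown above, and that the outermost frame holds a saved FP of zero (a common convention, but not something the hardware guarantees):

```asm
// Hypothetical helper: returns the number of frame records reachable from x29
count_frames:
    mov     x0, #0              // Frame count
    mov     x1, x29             // x1 = caller's frame record (this leaf builds none)
walk_loop:
    cbz     x1, walk_done       // A zero FP ends the chain
    add     x0, x0, #1
    ldr     x1, [x1]            // [record + 0] = previous FP, [record + 8] = saved LR
    b       walk_loop
walk_done:
    ret
```

A backtrace tool would additionally read the saved LR at offset 8 of each record to recover the return address for that frame.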

Function Prologue and Epilogue

Minimal Prologue/Epilogue (Leaf Function)

A leaf function doesn't call other functions:

// Leaf function that doesn't use callee-saved registers
add_two_numbers:
    add     x0, x0, x1      // x0 = x0 + x1
    ret                     // No prologue/epilogue needed!

// Leaf function that uses callee-saved registers
multiply_add:
    // Prologue: save x19
    str     x19, [sp, #-16]!

    // Body
    mul     x19, x0, x1     // x19 = x0 * x1
    add     x0, x19, x2     // x0 = (x0 * x1) + x2

    // Epilogue: restore x19
    ldr     x19, [sp], #16
    ret

Standard Prologue/Epilogue (Non-Leaf Function)

Functions that call other functions must save LR:

function:
    // Prologue
    stp     x29, x30, [sp, #-32]!   // Save FP and LR
    mov     x29, sp                  // Set up frame pointer
    stp     x19, x20, [sp, #16]     // Save callee-saved regs if used

    // Function body
    // ... can call other functions safely ...
    bl      other_function

    // Epilogue
    ldp     x19, x20, [sp, #16]     // Restore callee-saved regs
    ldp     x29, x30, [sp], #32     // Restore FP and LR
    ret
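
When a function needs an odd number of callee-saved registers, the frame is still rounded up to a multiple of 16 bytes and one slot is left as padding. A sketch, assuming a hypothetical function three_regs that keeps three arguments in x19-x21 across a call to the other_function used above:

```asm
three_regs:
    // Frame: 16 (FP/LR) + 24 (x19-x21) = 40 bytes, rounded up to 48
    stp     x29, x30, [sp, #-48]!
    mov     x29, sp
    stp     x19, x20, [sp, #16]
    str     x21, [sp, #32]          // [sp, #40] is padding

    mov     x19, x0                 // Keep three arguments alive across the call
    mov     x20, x1
    mov     x21, x2
    bl      other_function
    add     x0, x0, x19             // Combine the result with the saved arguments
    add     x0, x0, x20
    add     x0, x0, x21

    ldr     x21, [sp, #32]
    ldp     x19, x20, [sp, #16]
    ldp     x29, x30, [sp], #48
    ret
```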

Complete Example with Local Variables

// Function with parameters, locals, and calls
// int compute(int a, int b, int c) {
//     int x = a * b;
//     int y = x + c;
//     int z = helper(y);
//     return z + x;
// }

compute:
    // Prologue: allocate 32 bytes
    // [sp+0]:  saved FP
    // [sp+8]:  saved LR
    // [sp+16]: saved x19 (will hold x)
    // [sp+24]: saved x20 (will hold y)
    stp     x29, x30, [sp, #-32]!
    mov     x29, sp
    stp     x19, x20, [sp, #16]

    // Body: a=x0, b=x1, c=x2
    mul     x19, x0, x1     // x19 = x = a * b
    add     x20, x19, x2    // x20 = y = x + c

    mov     x0, x20         // Argument for helper
    bl      helper          // Call helper(y)
    // x0 now contains z

    add     x0, x0, x19     // return z + x

    // Epilogue
    ldp     x19, x20, [sp, #16]
    ldp     x29, x30, [sp], #32
    ret

Parameter Passing

Integer and Pointer Parameters

The first eight integer or pointer parameters are passed in x0-x7:

// long func(long a, long b, long c, long d, long e, long f, long g, long h);
// a=x0, b=x1, c=x2, d=x3, e=x4, f=x5, g=x6, h=x7

func:
    add     x0, x0, x1      // a + b
    add     x0, x0, x2      // + c
    add     x0, x0, x3      // + d
    add     x0, x0, x4      // + e
    add     x0, x0, x5      // + f
    add     x0, x0, x6      // + g
    add     x0, x0, x7      // + h
    ret                     // Return sum in x0

Stack Parameters (More Than 8)

Parameters beyond 8 are passed on the stack:

// long func(long p0, ..., long p7, long p8, long p9);
// p0-p7 in x0-x7
// On entry to func, p8 is at [sp, #0]
// and p9 is at [sp, #8]

func:
    stp     x29, x30, [sp, #-16]!
    mov     x29, sp

    // Access p0-p7 normally
    add     x0, x0, x1      // p0 + p1

    // Access p8 and p9 from stack
    // They're at [x29, #16] and [x29, #24] (after our frame)
    ldr     x10, [x29, #16] // p8
    ldr     x11, [x29, #24] // p9
    add     x0, x0, x10
    add     x0, x0, x11

    ldp     x29, x30, [sp], #16
    ret

// Calling with 10 parameters:
caller:
    stp     x29, x30, [sp, #-16]!   // Save FP and LR (bl overwrites LR)
    mov     x29, sp
    sub     sp, sp, #16             // Reserve two 8-byte stack argument slots

    mov     x0, #0
    mov     x1, #1
    mov     x2, #2
    mov     x3, #3
    mov     x4, #4
    mov     x5, #5
    mov     x6, #6
    mov     x7, #7

    // Store p8 and p9 in the outgoing argument area
    mov     x9, #8
    mov     x10, #9
    stp     x9, x10, [sp]           // p8 at [sp, #0], p9 at [sp, #8]

    bl      func
    add     sp, sp, #16             // Clean up stack arguments
    ldp     x29, x30, [sp], #16     // Restore FP and LR
    ret

Floating-Point Parameters

First 8 FP parameters use v0-v7:

// double add_doubles(double a, double b);
// a=d0, b=d1, return in d0

add_doubles:
    fadd    d0, d0, d1      // d0 = a + b
    ret

// float multiply(float a, float b, float c, float d);
// a=s0, b=s1, c=s2, d=s3

multiply:
    fmul    s0, s0, s1      // s0 = a * b
    fmul    s0, s0, s2      // s0 *= c
    fmul    s0, s0, s3      // s0 *= d
    ret

Mixed Integer and Floating-Point

// double mixed(int a, double b, int c, double d);
// a=x0 (w0), b=d0, c=x1 (w1), d=d1

mixed:
    scvtf   d2, w0          // Convert int a to double
    fadd    d0, d2, d0      // d0 = (double)a + b
    scvtf   d2, w1          // Convert int c to double
    fadd    d0, d0, d2      // d0 += (double)c
    fadd    d0, d0, d1      // d0 += d
    ret
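
Integer and floating-point arguments are numbered independently, so a caller fills the x (or w) and v registers separately. A sketch of a call equivalent to mixed(1, 2.0, 3, 4.0), assuming a hypothetical caller named call_mixed:

```asm
call_mixed:
    stp     x29, x30, [sp, #-16]!   // Save FP and LR
    mov     x29, sp

    mov     w0, #1              // a: first integer argument
    fmov    d0, #2.0            // b: first floating-point argument
    mov     w1, #3              // c: second integer argument
    fmov    d1, #4.0            // d: second floating-point argument
    bl      mixed               // d0 = 1 + 2.0 + 3 + 4.0 = 10.0

    ldp     x29, x30, [sp], #16
    ret                         // Result is still in d0
```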

Structure Parameters

Small Structures (≤ 16 bytes)

Passed in registers:

// struct Point { long x, y; };  // 16 bytes
// long process_point(struct Point p);
// p.x in x0, p.y in x1, result in x0

process_point:
    add     x0, x0, x1      // x = x + y
    ret

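// Caller:
caller:
    stp     x29, x30, [sp, #-16]!   // bl overwrites LR, so save it first
    mov     x29, sp
    mov     x0, #10         // p.x = 10
    mov     x1, #20         // p.y = 20
    bl      process_point   // x0 = 30
    ldp     x29, x30, [sp], #16
    ret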

Large Structures (> 16 bytes)

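A large structure passed as an argument is copied to memory by the caller, and the address of that copy is passed in the next available argument register (x0-x7) or on the stack. When a large structure is returned, the caller instead passes the address of the result buffer in x8:

// struct Large { long a, b, c, d; };  // 32 bytes
// struct Large create_large(long value);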

create_large:
    // x8 points to memory where caller wants result
    // x0 contains value parameter
    str     x0, [x8, #0]    // result.a = value
    str     x0, [x8, #8]    // result.b = value
    str     x0, [x8, #16]   // result.c = value
    str     x0, [x8, #24]   // result.d = value
    ret                     // Return (x8 unchanged)

// Caller:
caller:
    stp     x29, x30, [sp, #-48]!   // Save FP/LR and reserve 32 bytes for the result
    mov     x29, sp
    add     x8, sp, #16     // x8 points to result location
    mov     x0, #42         // value parameter
    bl      create_large
    // Result now at [sp, #16] through [sp, #47]
    ldp     x0, x1, [sp, #16]   // Load first two fields
    ldp     x29, x30, [sp], #48
    ret
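
Note that x8 is used only for returned structures. For comparison, here is a sketch of a large structure passed as an argument, where under AAPCS64 the caller makes a copy in memory and passes its address in the next free argument register. The names sum_large and call_sum_large are hypothetical and reuse the struct Large layout above:

```asm
// long sum_large(struct Large s);   // hypothetical; s is 32 bytes
// x0 = address of the caller's copy of s
sum_large:
    ldp     x1, x2, [x0, #0]        // s.a, s.b
    ldp     x3, x4, [x0, #16]       // s.c, s.d
    add     x0, x1, x2
    add     x0, x0, x3
    add     x0, x0, x4              // Return a + b + c + d
    ret

call_sum_large:
    stp     x29, x30, [sp, #-48]!   // FP/LR plus 32 bytes for the copy
    mov     x29, sp
    mov     x9, #1
    mov     x10, #2
    stp     x9, x10, [sp, #16]      // copy.a = 1, copy.b = 2
    mov     x9, #3
    mov     x10, #4
    stp     x9, x10, [sp, #32]      // copy.c = 3, copy.d = 4
    add     x0, sp, #16             // x0 = address of the copy
    bl      sum_large               // x0 = 10
    ldp     x29, x30, [sp], #48
    ret
```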

Return Values

Integer Returns

// Single 64-bit value in x0
return_int:
    mov     x0, #42
    ret

// Two 64-bit values in x0 and x1
return_pair:
    mov     x0, #10     // First return value
    mov     x1, #20     // Second return value
    ret

// 128-bit value in x0 (low) and x1 (high)
// These constants are too wide for a single mov, so load them from the literal pool
return_128bit:
    ldr     x0, =0x123456789ABCDEF0    // Low 64 bits
    ldr     x1, =0xFEDCBA9876543210    // High 64 bits
    ret

Floating-Point Returns

// Single precision in s0
return_float:
    fmov    s0, #1.5
    ret

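// Double precision in d0
// 3.14 cannot be encoded as an fmov immediate, so load it from memory
return_double:
    ldr     d0, double_const    // PC-relative literal load
    ret
double_const:
    .double 3.14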

// Complex number (two doubles) in d0 and d1
return_complex:
    fmov    d0, #1.0    // Real part
    fmov    d1, #2.0    // Imaginary part
    ret

Structure Returns

// Small struct (≤ 16 bytes) in registers
// struct Point { long x, y; };
return_point:
    mov     x0, #100    // result.x
    mov     x1, #200    // result.y
    ret

// Large struct via x8 (shown earlier)

Nested Function Calls

When calling functions from within functions, manage LR carefully:

// Function A calls B which calls C
function_a:
    stp     x29, x30, [sp, #-16]!
    mov     x29, sp

    bl      function_b      // LR is overwritten!

    ldp     x29, x30, [sp], #16
    ret                     // Returns to A's caller (LR was saved)

function_b:
    stp     x29, x30, [sp, #-16]!
    mov     x29, sp

    bl      function_c      // LR is overwritten again!

    ldp     x29, x30, [sp], #16
    ret                     // Returns to A (LR was saved)

function_c:
    // Leaf function - no need to save LR
    mov     x0, #42
    ret                     // Returns to B

Deep Call Stack Example

main:
    stp     x29, x30, [sp, #-16]!
    mov     x29, sp

    mov     x0, #5
    bl      factorial       // Calculate 5!
    // x0 = 120

    ldp     x29, x30, [sp], #16
    ret

// Recursive factorial
factorial:
    cmp     x0, #1
    b.le    base_case

    // Recursive case: save LR and argument
    stp     x29, x30, [sp, #-32]!
    mov     x29, sp
    str     x19, [sp, #16]      // Save x19 for n

    mov     x19, x0             // x19 = n
    sub     x0, x0, #1          // x0 = n - 1
    bl      factorial           // factorial(n-1)

    mul     x0, x19, x0         // n * factorial(n-1)

    ldr     x19, [sp, #16]
    ldp     x29, x30, [sp], #32
    ret

base_case:
    mov     x0, #1
    ret

Variable-Length Arguments (Varargs)

Implementing functions like printf:

// void my_printf(const char *format, ...);
// format in x0
// Variable args start at x1

my_printf:
    // Frame: 16 (FP/LR) + 56 (x1-x7) + 8 (padding) + 64 (d0-d7) = 144 bytes
    stp     x29, x30, [sp, #-144]!  // Save FP, LR
    mov     x29, sp

    // Save all potential integer arguments (x1-x7) to stack
    stp     x1, x2, [sp, #16]
    stp     x3, x4, [sp, #32]
    stp     x5, x6, [sp, #48]
    str     x7, [sp, #64]           // [sp, #72] is padding

    // Save all potential FP arguments (low 64 bits of v0-v7)
    stp     d0, d1, [sp, #80]
    stp     d2, d3, [sp, #96]
    stp     d4, d5, [sp, #112]
    stp     d6, d7, [sp, #128]

    // Process format string in x0
    // Access arguments from stack as needed

    ldp     x29, x30, [sp], #144
    ret

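// Calling varargs function:
caller:
    stp     x29, x30, [sp, #-16]!   // bl overwrites LR, so save it first
    mov     x29, sp
    ldr     x0, =format_string  // "Value: %d, Float: %f"
    mov     x1, #42             // Integer argument
    fmov    d0, #1.5            // Floating-point argument (must be an fmov-encodable constant)
    bl      my_printf
    ldp     x29, x30, [sp], #16
    ret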

Practical Examples

Example 1: String Copy

// char* strcpy(char *dest, const char *src);
// dest=x0, src=x1, return=x0

strcpy:
    mov     x2, x0              // Save dest for return
strcpy_loop:
    ldrb    w3, [x1], #1        // Load byte from src, increment
    strb    w3, [x0], #1        // Store to dest, increment
    cbnz    w3, strcpy_loop     // Continue if not null terminator

    mov     x0, x2              // Return original dest
    ret

Example 2: String Compare

// int strcmp(const char *s1, const char *s2);
// s1=x0, s2=x1
// Returns: <0 if s1<s2, 0 if equal, >0 if s1>s2

strcmp:
strcmp_loop:
    ldrb    w2, [x0], #1        // Load from s1
    ldrb    w3, [x1], #1        // Load from s2
    cmp     w2, w3
    b.ne    strcmp_diff         // Different characters
    cbz     w2, strcmp_equal    // Both null terminators
    b       strcmp_loop

strcmp_diff:
    sub     w0, w2, w3          // Return difference
    ret

strcmp_equal:
    mov     w0, #0              // Return 0 (equal)
    ret

Example 3: Array Sum (Using Stack)

// long sum_array(long *array, int count);
// array=x0, count=w1

sum_array:
    // Save callee-saved registers
    stp     x19, x20, [sp, #-16]!

    mov     x19, x0             // x19 = array pointer
    mov     w20, w1             // w20 = count
    mov     x0, #0              // sum = 0

sum_loop:
    cbz     w20, sum_done
    ldr     x2, [x19], #8       // Load element
    add     x0, x0, x2          // sum += element
    sub     w20, w20, #1
    b       sum_loop

sum_done:
    ldp     x19, x20, [sp], #16
    ret

Example 4: Bubble Sort

// void bubble_sort(long *array, int count);
// array=x0, count=w1

bubble_sort:
    stp     x29, x30, [sp, #-32]!
    mov     x29, sp
    stp     x19, x20, [sp, #16]

    mov     x19, x0             // x19 = array
    mov     w20, w1             // w20 = count

outer_loop:
    cmp     w20, #1
    b.le    sort_done

    mov     x2, x19             // x2 = array pointer
    sub     w3, w20, #1         // w3 = count - 1
    mov     w4, #0              // swapped = false

inner_loop:
    cbz     w3, check_swapped

    ldp     x5, x6, [x2]        // Load two adjacent elements
    cmp     x5, x6
    b.le    no_swap

    // Swap elements
    stp     x6, x5, [x2]
    mov     w4, #1              // swapped = true

no_swap:
    add     x2, x2, #8
    sub     w3, w3, #1
    b       inner_loop

check_swapped:
    cbz     w4, sort_done       // If no swaps, we're done
    sub     w20, w20, #1        // count--
    b       outer_loop

sort_done:
    ldp     x19, x20, [sp, #16]
    ldp     x29, x30, [sp], #32
    ret

Example 5: Matrix Multiplication

// void matrix_mult(long result[2][2], long a[2][2], long b[2][2]);
// result=x0, a=x1, b=x2

matrix_mult:
    // result[0][0] = a[0][0]*b[0][0] + a[0][1]*b[1][0]
    ldr     x3, [x1, #0]        // a[0][0]
    ldr     x4, [x2, #0]        // b[0][0]
    mul     x5, x3, x4

    ldr     x3, [x1, #8]        // a[0][1]
    ldr     x4, [x2, #16]       // b[1][0]
    madd    x5, x3, x4, x5      // x5 = a[0][0]*b[0][0] + a[0][1]*b[1][0]
    str     x5, [x0, #0]        // result[0][0]

    // result[0][1] = a[0][0]*b[0][1] + a[0][1]*b[1][1]
    ldr     x3, [x1, #0]        // a[0][0]
    ldr     x4, [x2, #8]        // b[0][1]
    mul     x5, x3, x4

    ldr     x3, [x1, #8]        // a[0][1]
    ldr     x4, [x2, #24]       // b[1][1]
    madd    x5, x3, x4, x5
    str     x5, [x0, #8]        // result[0][1]

    // result[1][0] = a[1][0]*b[0][0] + a[1][1]*b[1][0]
    ldr     x3, [x1, #16]       // a[1][0]
    ldr     x4, [x2, #0]        // b[0][0]
    mul     x5, x3, x4

    ldr     x3, [x1, #24]       // a[1][1]
    ldr     x4, [x2, #16]       // b[1][0]
    madd    x5, x3, x4, x5
    str     x5, [x0, #16]       // result[1][0]

    // result[1][1] = a[1][0]*b[0][1] + a[1][1]*b[1][1]
    ldr     x3, [x1, #16]       // a[1][0]
    ldr     x4, [x2, #8]        // b[0][1]
    mul     x5, x3, x4

    ldr     x3, [x1, #24]       // a[1][1]
    ldr     x4, [x2, #24]       // b[1][1]
    madd    x5, x3, x4, x5
    str     x5, [x0, #24]       // result[1][1]

    ret

Tail Call Optimization

When the last action is calling another function, optimize by jumping instead:

// Without tail call optimization:
wrapper:
    stp     x29, x30, [sp, #-16]!
    bl      actual_function
    ldp     x29, x30, [sp], #16
    ret

// With tail call optimization:
wrapper:
    b       actual_function     // Jump instead of call
    // No need to save/restore LR or adjust stack!

// Recursive example:
// int sum_tail(int n, int acc) {
//     if (n == 0) return acc;
//     return sum_tail(n-1, acc+n);
// }

// int arguments occupy only w0 and w1; the upper 32 bits of x0/x1 are unspecified
sum_tail:
    cbz     w0, return_acc      // if n == 0, return acc
    add     w1, w1, w0          // acc = acc + n
    sub     w0, w0, #1          // n = n - 1
    b       sum_tail            // Tail call (no stack needed!)

return_acc:
    mov     w0, w1              // Return acc
    ret
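
A function that has built a frame can still make a tail call, provided it tears the frame down first so that SP, FP, and LR hold their entry values when the branch is taken. A sketch, assuming hypothetical functions log_then_run, log_event, and run:

```asm
// long log_then_run(long x) { log_event(x); return run(x); }   // hypothetical
log_then_run:
    stp     x29, x30, [sp, #-32]!
    mov     x29, sp
    str     x19, [sp, #16]

    mov     x19, x0             // Keep x across the first call
    bl      log_event           // Ordinary call: control must come back here

    mov     x0, x19             // Argument for run
    ldr     x19, [sp, #16]      // Run the epilogue first...
    ldp     x29, x30, [sp], #32
    b       run                 // ...then branch: run returns directly to our caller
```

The same restriction applies as for any tail call: nothing in the current frame may be needed after the branch, since the frame no longer exists by then.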

Complete Function Template

Here's a comprehensive template for a complex function:

// Type: Non-leaf function with local variables and callee-saved registers
// Parameters: x0, x1, x2
// Uses: x19, x20 (callee-saved)
// Locals: 16 bytes
// Calls: other functions

my_function:
    // ========== PROLOGUE ==========
    // Allocate stack frame:
    // [sp+0]:  saved FP (x29)
    // [sp+8]:  saved LR (x30)
    // [sp+16]: saved x19
    // [sp+24]: saved x20
    // [sp+32]: local variable 1 (8 bytes)
    // [sp+40]: local variable 2 (8 bytes)
    // Total: 48 bytes (16-byte aligned)

    stp     x29, x30, [sp, #-48]!   // Save FP and LR
    mov     x29, sp                  // Set up frame pointer
    stp     x19, x20, [sp, #16]     // Save callee-saved registers

    // ========== BODY ==========
    // Save parameters to callee-saved registers if needed
    mov     x19, x0             // Save param 1
    mov     x20, x1             // Save param 2

    // Store local variables
    str     x2, [sp, #32]       // local1 = param 3

    // Do work, possibly calling other functions
    bl      helper_function

    // Use local variables
    ldr     x2, [sp, #32]       // Load local1
    add     x0, x0, x19         // result += param1
    add     x0, x0, x20         // result += param2
    add     x0, x0, x2          // result += local1

    // ========== EPILOGUE ==========
    ldp     x19, x20, [sp, #16]     // Restore callee-saved registers
    ldp     x29, x30, [sp], #48     // Restore FP and LR, deallocate
    ret

Stack Alignment Debugging

Common mistake: misaligned stack

// WRONG: 8-byte allocation (misaligned!)
buggy_function:
    sub     sp, sp, #8      // SP is now 8-byte aligned, not 16!
    str     x30, [sp]       // Faults where SP alignment checking is enabled (e.g. Linux)
    bl      some_function   // May crash or behave incorrectly!
    ldr     x30, [sp]
    add     sp, sp, #8
    ret

// CORRECT: 16-byte allocation
correct_function:
    sub     sp, sp, #16     // SP remains 16-byte aligned
    str     x0, [sp, #0]    // Lower 8 bytes: saved argument
    str     x30, [sp, #8]   // Upper 8 bytes: saved LR (bl overwrites x30)
    bl      some_function   // Safe!
    ldr     x0, [sp, #0]
    ldr     x30, [sp, #8]
    add     sp, sp, #16
    ret
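
One way to avoid this class of bug is to let the assembler round the frame size up to a multiple of 16. A minimal sketch in GNU assembler syntax, assuming hypothetical symbols LOCALS_SIZE and FRAME_SIZE and a function that needs 20 bytes of locals:

```asm
.equ LOCALS_SIZE, 20                                // Hypothetical: bytes of locals needed
.equ FRAME_SIZE, (16 + LOCALS_SIZE + 15) & ~15      // FP/LR + locals, rounded up: 48

aligned_function:
    stp     x29, x30, [sp, #-FRAME_SIZE]!   // SP stays 16-byte aligned
    mov     x29, sp
    str     w0, [sp, #16]                   // Locals occupy [sp, #16] .. [sp, #35]
    bl      some_function
    ldr     w0, [sp, #16]
    ldp     x29, x30, [sp], #FRAME_SIZE
    ret
```

The expression (n + 15) & ~15 rounds n up to the next multiple of 16, so changing LOCALS_SIZE later cannot leave SP misaligned.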

Summary

In this tutorial, we covered:

AAPCS64 Calling Convention

  • ✅ Register preservation rules (caller-saved vs callee-saved)
  • ✅ Stack alignment requirements (16 bytes)
  • ✅ Frame pointer usage

Stack Management

  • ✅ Stack frame structure
  • ✅ Prologue and epilogue patterns
  • ✅ Local variable allocation

Parameter Passing

  • ✅ Integer parameters (x0-x7, then stack)
  • ✅ Floating-point parameters (v0-v7)
  • ✅ Structure parameters (small vs large)
  • ✅ Mixed parameter types
  • ✅ Variable-length arguments

Return Values

  • ✅ Integer returns (x0, x0+x1)
  • ✅ Floating-point returns (d0, s0)
  • ✅ Structure returns

Advanced Topics

  • ✅ Nested function calls
  • ✅ Recursive functions
  • ✅ Tail call optimization
  • ✅ Complete function templates

Practical Examples

  • ✅ String operations (strcpy, strcmp)
  • ✅ Array algorithms (sum, bubble sort)
  • ✅ Matrix multiplication
  • ✅ Recursive tail-optimized functions

Next Steps

In the final tutorial, we'll cover:

  • Interfacing with C++: Calling assembly from C++ and vice versa
  • Inline Assembly: Embedding assembly in C++ code
  • GPIO Control: Direct hardware access in assembly
  • LED Control Example: Complete practical project
  • Performance Optimization: Writing faster code than the compiler
  • SIMD/NEON: Vector instructions for parallel processing
  • Debugging Mixed Code: GDB with C++ and assembly

This will tie together everything we've learned and show how to use assembly for real-world Raspberry Pi projects.