x86 Basics
Understanding x86-64 assembly is not about writing kernels in assembly; it is about reading what the compiler produces. When a PyTorch operator is slow, when a C++ extension crashes with a segfault, or when you want to verify that the compiler vectorized your inner loop, the ability to read assembly transforms you from someone who guesses to someone who knows. This chapter covers the x86-64 register set, the most common instructions, calling conventions, and how to read compiler output.
Registers
x86-64 has 16 general-purpose 64-bit registers. Each register has a conventional role defined by the calling convention (the table below shows the System V AMD64 roles), but any register can hold any value:
| Register | ABI Role | Caller/Callee Saved | Notes |
|---|---|---|---|
| RAX | Return value | Caller-saved | Also used as accumulator |
| RBX | General purpose | Callee-saved | Preserved across function calls |
| RCX | 4th integer argument | Caller-saved | Also used for shift counts |
| RDX | 3rd integer argument | Caller-saved | Also upper 64 bits of 128-bit return |
| RSI | 2nd integer argument | Caller-saved | Source index for string ops |
| RDI | 1st integer argument | Caller-saved | Destination index for string ops |
| RSP | Stack pointer | Callee-saved | Must always point to valid stack |
| RBP | Base pointer | Callee-saved | Frame pointer (optional with -fomit-frame-pointer) |
| R8 | 5th integer argument | Caller-saved | |
| R9 | 6th integer argument | Caller-saved | |
| R10 | Temporary | Caller-saved | Static chain pointer in nested functions |
| R11 | Temporary | Caller-saved | |
| R12-R15 | General purpose | Callee-saved | Preserved across function calls |
Each 64-bit register can be accessed at smaller widths:
RAX [63..0] full 64-bit register
EAX [31..0] lower 32 bits (writing EAX zero-extends to RAX)
AX [15..0] lower 16 bits (does NOT zero-extend)
AH [15..8] bits 15-8 (legacy; cannot be encoded in an instruction that uses a REX prefix)
AL [7..0] lower 8 bits (does NOT zero-extend)
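The difference between 32-bit writes and narrower writes is easy to misread in a disassembly, so here is a minimal sketch of it in C. The `write_*` helpers are hypothetical names introduced for illustration; they model the bit-level effect of each write on a 64-bit value and do not touch real registers.

```c
#include <stdint.h>

/* Pure C model of partial-register writes. These helpers are hypothetical
   illustrations; they compute what RAX would hold after each write. */

/* mov eax, v : a 32-bit write clears bits 63..32 of RAX. */
uint64_t write_eax(uint64_t rax, uint32_t v) {
    (void)rax;                      /* the old value is discarded entirely */
    return (uint64_t)v;
}

/* mov ax, v : a 16-bit write leaves bits 63..16 untouched. */
uint64_t write_ax(uint64_t rax, uint16_t v) {
    return (rax & ~UINT64_C(0xFFFF)) | v;
}

/* mov ah, v : only bits 15..8 change. */
uint64_t write_ah(uint64_t rax, uint8_t v) {
    return (rax & ~UINT64_C(0xFF00)) | ((uint64_t)v << 8);
}

/* mov al, v : only bits 7..0 change. */
uint64_t write_al(uint64_t rax, uint8_t v) {
    return (rax & ~UINT64_C(0xFF)) | v;
}
```

Starting from `0x1122334455667788`, the modeled `mov ax, 0xBEEF` keeps the upper 48 bits, while the modeled `mov eax, 0xAABBCCDD` wipes the upper 32. This is why compilers can use 32-bit instructions freely without leaving stale high bits behind.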
Common Instructions
; ── Data movement ──
mov rax, rbx ; rax = rbx (register to register)
mov rax, [rbx] ; rax = *(rbx) (load from memory)
mov [rbx], rax ; *(rbx) = rax (store to memory)
mov rax, [rbx+rcx*8+16] ; rax = *(rbx + rcx*8 + 16) (scaled index)
movzx eax, byte [rsi] ; zero-extend byte to 32 bits
movsxd rax, dword [rsi] ; sign-extend 32-bit to 64 bits (32-to-64 uses movsxd, not movsx)
; ── Address calculation ──
lea rax, [rbx+rcx*4+8] ; rax = rbx + rcx*4 + 8 (NO memory access)
; LEA is used for address arithmetic AND general math
; ── Arithmetic ──
add rax, rbx ; rax += rbx
sub rax, 42 ; rax -= 42
imul rax, rbx ; rax *= rbx (signed multiply)
imul rax, rbx, 10 ; rax = rbx * 10 (three-operand form)
inc rcx ; rcx++
neg rax ; rax = -rax
xor eax, eax ; rax = 0 (the 32-bit write zero-extends; the standard zeroing idiom)
; ── Bitwise ──
and rax, 0xFF ; mask lower byte
or rax, rbx ; bitwise OR
shl rax, 3 ; left shift by 3 (multiply by 8)
shr rax, 1 ; logical right shift by 1 (unsigned divide by 2)
sar rax, 1 ; arithmetic right shift by 1 (signed divide by 2, rounding toward negative infinity)
; ── Control flow ──
cmp rax, rbx ; set flags based on rax - rbx (result discarded)
test rax, rax ; set flags based on rax & rax (check if zero)
je label ; jump if equal (ZF=1)
jne label ; jump if not equal (ZF=0)
jl label ; jump if less (signed)
jb label ; jump if below (unsigned)
jg label ; jump if greater (signed)
call function ; push return address, jump to function
ret ; pop return address, jump back
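The "divide by 2" comments on `shr` and `sar` hide one detail worth seeing in code: an arithmetic shift rounds toward negative infinity, while C's `/` truncates toward zero. The sketch below models both shifts in portable C and shows the bias step compilers typically add when lowering a signed `x / 2`. The function names are hypothetical, and the bias sequence is an assumption about typical output, not a guarantee for any particular compiler.

```c
#include <stdint.h>

/* Model of `shr rax, 1`: logical shift, the top bit is filled with zero. */
uint64_t shr1(uint64_t x) {
    return x >> 1;
}

/* Model of `sar rax, 1`: arithmetic shift, the sign bit is replicated.
   Done on unsigned bits so the result does not depend on how the C
   implementation defines >> for negative values. */
int64_t sar1(int64_t x) {
    uint64_t bits = (uint64_t)x;
    uint64_t sign = bits & (UINT64_C(1) << 63);
    return (int64_t)((bits >> 1) | sign);
}

/* Signed division by 2 with C semantics (truncate toward zero):
   add 1 to negative inputs first, then shift arithmetically. */
int64_t div2_trunc(int64_t x) {
    int64_t bias = (int64_t)((uint64_t)x >> 63);   /* 1 if x < 0, else 0 */
    return sar1(x + bias);
}
```

For `x = -7`, the modeled `sar` gives -4, whereas `-7 / 2` in C is -3. When you see an extra `shr` by 63 and an `add` before a `sar`, that is usually this correction.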
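Operands can be registers, immediates, or memory references. Memory references follow the general form [base + index*scale + displacement], where scale is 1, 2, 4, or 8:
| Mode | Syntax | Example | Common Use |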
|---|---|---|---|
| Register | reg | mov rax, rbx | Variable in register |
| Immediate | imm | mov rax, 42 | Constants |
| Direct | [addr] | mov rax, [0x601000] | Global variables |
| Register indirect | [reg] | mov rax, [rbx] | Pointer dereference |
| Base + displacement | [reg+disp] | mov rax, [rbp-8] | Local variables on stack |
| Base + index | [reg+reg] | mov rax, [rbx+rcx] | Array with byte-sized elements |
| Scaled index | [reg+reg*s+disp] | mov rax, [rbx+rcx*8+16] | Array of structs, matrix rows |
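To connect the scaled-index row to source code, here is a small sketch in C. The `struct vec` layout (a 16-byte header followed by 8-byte elements) is a hypothetical example chosen so that element access matches `[rbx+rcx*8+16]`; `scaled_index_addr` spells out the same address computation by hand.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 16 bytes of header, then 8-byte elements. */
struct vec {
    uint64_t len;       /* offset 0  */
    uint64_t cap;       /* offset 8  */
    int64_t  data[8];   /* offset 16 */
};

/* The address formed by [base + index*8 + 16], computed explicitly. */
const int64_t *scaled_index_addr(const struct vec *v, size_t i) {
    const unsigned char *base = (const unsigned char *)v;
    return (const int64_t *)(base + i * sizeof(int64_t)
                                  + offsetof(struct vec, data));
}

/* The ordinary C expression. With v in RDI and i in RSI, a compiler would
   typically lower this to a single load such as mov rax, [rdi+rsi*8+16]. */
int64_t vec_get(const struct vec *v, size_t i) {
    return v->data[i];
}
```

Both functions reach the same element; the second simply leaves the arithmetic to the addressing mode.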
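; LEA computes an address without accessing memory or modifying flags, so compilers use it for cheap arithmetic:
lea rax, [rdi+rdi*2] ; rax = rdi * 3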
lea rax, [rdi*8] ; rax = rdi * 8
lea rax, [rdi+rdi*4] ; rax = rdi * 5
lea rax, [rdi+rsi] ; rax = rdi + rsi (without clobbering either)
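The LEA patterns above all come from one formula, which can be sketched as a C function. The `lea` helper and its wrappers are hypothetical names for illustration; the model assumes 64-bit wraparound arithmetic and a sign-extended 32-bit displacement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of `lea dst, [base + index*scale + disp]`.
   The hardware only encodes scale factors of 1, 2, 4, and 8. */
uint64_t lea(uint64_t base, uint64_t index, unsigned scale, int32_t disp) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + (uint64_t)(int64_t)disp;
}

/* The multiplication idioms, expressed through the model. */
uint64_t times3(uint64_t x) { return lea(x, x, 2, 0); }  /* [rdi+rdi*2] */
uint64_t times5(uint64_t x) { return lea(x, x, 4, 0); }  /* [rdi+rdi*4] */
uint64_t times8(uint64_t x) { return lea(0, x, 8, 0); }  /* [rdi*8]     */

/* Three-operand add: the sum lands in a new register. */
uint64_t add_keep(uint64_t a, uint64_t b) { return lea(a, b, 1, 0); }
```

Multipliers of the form 2, 3, 4, 5, 8, and 9 fit a single LEA; when a compiler needs another constant it may chain two LEAs or fall back to `imul`.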
Reading Compiler Output
Use Compiler Explorer (godbolt.org) or objdump to see what the compiler generates:
# Compile to assembly (AT&T syntax, default)
gcc -S -O2 -o output.s source.c
# Compile to assembly (Intel syntax, more readable)
gcc -S -O2 -masm=intel -o output.s source.c
# Disassemble a compiled binary
objdump -d -M intel program | less
# Show only a specific function
objdump -d -M intel program | awk '/^[0-9a-f]+ <my_func>:$/,/^$/'
// C code
int add(int a, int b) {
return a + b;
}
; Generated assembly (gcc -O2, Intel syntax)
add:
lea eax, [rdi+rsi] ; eax = edi + esi (int arguments arrive in EDI and ESI, the low halves of RDI and RSI)
ret ; return value in EAX
// C code
float dot(const float* restrict a, const float* restrict b, int n) {
float sum = 0.0f;
for (int i = 0; i < n; i++) {
sum += a[i] * b[i];
}
return sum;
}
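; Key portion of gcc -O3 -mavx2 -mfma -ffast-math output (simplified):
; -mfma makes vfmadd231ps available; -ffast-math permits reordering the float additions, which GCC generally needs before it vectorizes the sum
; The compiler auto-vectorizes with AVX2 and FMA, processing 8 floats per iteration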
.L3:
vmovups ymm1, [rdi+rax] ; load 8 floats from a
vfmadd231ps ymm0, ymm1, [rsi+rax] ; ymm0 += a[i:i+8] * b[i:i+8]
add rax, 32 ; advance by 8 floats (32 bytes)
cmp rax, rcx
jne .L3
; Followed by horizontal reduction of ymm0 to a single float
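To make the vector loop easier to follow, the sketch below writes out in scalar C what the assembly does: eight independent partial sums (the lanes of `ymm0`), a horizontal reduction, and a scalar tail for leftover elements. `dot_lanes` and `LANES` are hypothetical names, and the model ignores the single-rounding behavior of a real fused multiply-add. It also shows why the compiler needs permission to reassociate floating-point addition (for example `-ffast-math`): the additions happen in a different order than in the source loop.

```c
#include <assert.h>

#define LANES 8   /* one 256-bit YMM register holds 8 floats */

/* Scalar model of the vectorized dot product. */
float dot_lanes(const float *a, const float *b, int n) {
    assert(n >= 0);
    float acc[LANES] = {0};                 /* models ymm0 */
    int i = 0;

    /* Models the .L3 loop: each iteration consumes 8 floats (32 bytes). */
    for (; i + LANES <= n; i += LANES)
        for (int lane = 0; lane < LANES; lane++)
            acc[lane] += a[i + lane] * b[i + lane];   /* vfmadd231ps, per lane */

    /* Horizontal reduction of the 8 partial sums to one float. */
    float sum = 0.0f;
    for (int lane = 0; lane < LANES; lane++)
        sum += acc[lane];

    /* Scalar tail for the n % 8 remaining elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

With inputs that are small integers the result matches the sequential loop exactly; with general floats the two can differ in the last bits because of the changed summation order.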
Calling Convention (System V AMD64)
The calling convention defines how functions pass arguments and return values. All x86-64 Linux, macOS, and FreeBSD systems use the System V AMD64 ABI:
Integer/pointer arguments (in order): RDI, RSI, RDX, RCX, R8, R9
Floating-point arguments (in order): XMM0, XMM1, ..., XMM7
Return value: RAX (integer/pointer), XMM0 (floating-point)
RDX:RAX for 128-bit integer returns
Caller-saved (volatile): RAX, RCX, RDX, RSI, RDI, R8-R11, XMM0-XMM15
Callee-saved (non-volatile): RBX, RBP, RSP, R12-R15
Stack alignment: RSP is 16-byte aligned immediately BEFORE the CALL instruction (so RSP+8 is 16-byte aligned at function entry)
Red zone: 128 bytes below RSP (leaf functions can use without adjusting RSP)
// This function call:
result = foo(1, 2, 3, 4, 5, 6, 7, 8.0);
// Becomes:
// RDI=1, RSI=2, RDX=3, RCX=4, R8=5, R9=6
// 7 pushed onto stack (7th integer arg and beyond go on stack)
// XMM0=8.0 (first floating-point argument)
// CALL foo
// Result in RAX (if integer) or XMM0 (if float)
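As a concrete reference for the mapping above, here is a sketch of a callee with the same argument shape, annotated with where each parameter arrives under System V AMD64. The bodies of `foo` and `scale` are hypothetical; only the register assignments in the comments reflect the convention. The second function shows that integer and floating-point registers are counted independently.

```c
/* Hypothetical callee for foo(1, 2, 3, 4, 5, 6, 7, 8.0). */
long foo(long a,    /* RDI  */
         long b,    /* RSI  */
         long c,    /* RDX  */
         long d,    /* RCX  */
         long e,    /* R8   */
         long f,    /* R9   */
         long g,    /* stack: the 7th integer argument */
         double h)  /* XMM0: the first floating-point argument */
{
    return a + b + c + d + e + f + g + (long)h;   /* returned in RAX */
}

/* x is the first FP argument (XMM0) and k is the first integer argument
   (RDI), regardless of their order in the parameter list. */
double scale(double x, long k) {
    return x * (double)k;                          /* returned in XMM0 */
}
```

Compiling this file with `gcc -S -O2 -masm=intel` and looking for `[rsp+8]` in `foo` is a quick way to confirm that the seventh integer argument is read from the stack.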
| Property | System V AMD64 | Microsoft x64 |
|---|---|---|
| Integer args | RDI, RSI, RDX, RCX, R8, R9 | RCX, RDX, R8, R9 |
| Float args | XMM0-XMM7 | XMM0-XMM3 (each of the four argument slots uses either the integer or the XMM register) |
| Shadow space | None (red zone instead) | 32 bytes reserved by caller |
| Callee-saved | RBX, RBP, R12-R15 | RBX, RBP, RDI, RSI, R12-R15, XMM6-XMM15 |
FLAGS Register
The FLAGS register (RFLAGS in 64-bit mode) records the outcome of the most recent flag-setting instruction, such as add, sub, cmp, or test; mov and lea leave the flags unchanged. Conditional jumps read these flags:
| Flag | Name | Set When | Used By |
|---|---|---|---|
| ZF | Zero Flag | Result is zero | je/jz (jump if zero) |
| SF | Sign Flag | Result is negative | js (jump if sign) |
| CF | Carry Flag | Unsigned overflow (carry out of an add, borrow in a sub or cmp) | jb/jc (jump if carry) |
| OF | Overflow Flag | Signed overflow | jo (jump if overflow) |
The compiler prefers test rax, rax over cmp rax, 0 because the encoding is shorter (3 bytes instead of 4) and both set ZF and SF the same way.
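The relationship between the flags and the signed and unsigned jumps can be sketched in C. `cmp64` below is a hypothetical model of `cmp a, b` on 64-bit operands; the `*_taken` helpers encode the conditions each jump tests (`jl` is SF != OF, `jb` is CF, `jg` is ZF = 0 and SF = OF).

```c
#include <stdbool.h>
#include <stdint.h>

struct flags { bool zf, sf, cf, of; };

/* Flags produced by `cmp a, b`, which computes a - b and discards it. */
struct flags cmp64(uint64_t a, uint64_t b) {
    uint64_t r = a - b;                      /* wraps modulo 2^64 */
    struct flags f;
    f.zf = (r == 0);                         /* result is zero */
    f.sf = (r >> 63) != 0;                   /* top bit of the result */
    f.cf = a < b;                            /* borrow: unsigned a below b */
    /* Signed overflow: the operands differ in sign and the result's
       sign differs from a's sign. */
    f.of = (((a ^ b) & (a ^ r)) >> 63) != 0;
    return f;
}

bool je_taken(struct flags f) { return f.zf; }
bool jl_taken(struct flags f) { return f.sf != f.of; }           /* signed <   */
bool jg_taken(struct flags f) { return !f.zf && f.sf == f.of; }  /* signed >   */
bool jb_taken(struct flags f) { return f.cf; }                   /* unsigned < */
```

Comparing the bit pattern `0xFFFFFFFFFFFFFFFF` with 1 shows the split: read as signed it is -1, so `jl` is taken, but read as unsigned it is the largest value, so `jb` is not. The same `cmp` feeds both; only the jump that follows tells you which interpretation the compiler intended.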