LLM VRAM usage with llama.cpp

While trying to get some of Qwen’s latest models up and running on my AMD iGPU, I ran into crashes.

The error messages were misleading, but in the end they turned out to be out-of-memory errors, so I started to think about how much memory the different components of an LLM use.

The data in the following table were taken from llama.cpp output (example model was unsloth/Qwen3.6-27B):

| Model | Precision | VRAM Model | Context Size | Layers | VRAM KV | Total VRAM |
|---|---|---|---|---|---|---|
| unsloth/Qwen3.6-27B-GGUF:Q4_K_S | 4 bit | 14430 MiB | 16384 | 16 | 1024 MiB | ~ 16 GiB |
| | | | 131072 | | 8192 MiB | ~ 22 GiB |
| | | | 262144 | | 16384 MiB | ~ 31 GiB |

So a (very rough) calculation looks like this:

VRAM_{total} = VRAM_{model} + VRAM_{kvcache}

The (V)RAM required for the model depends on its size and on the precision used:

VRAM_{model} = ModelSize * Precision

and the (V)RAM required for the KV cache depends on the context size, the model’s number of layers, the per-layer KV width (the number of KV heads times the head dimension; the table above implies a width of 1024 for this model), and the precision of the KV cache (commonly 16 bit), multiplied by 2 because both a key and a value are stored:

VRAM_{kvcache} = ContextSize * Layers * KVWidth * Precision * 2
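
To make the arithmetic concrete, here is a minimal Python sketch of both formulas. The function names, the KV width parameter, and the 16-bit default are my own choices for illustration; this is not llama.cpp’s API.

```python
def vram_model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Weights: parameter count times the quantized precision."""
    return n_params * bits_per_weight / 8


def vram_kv_bytes(context_size: int, n_layers: int, kv_width: int,
                  kv_bits: int = 16) -> float:
    """KV cache: one key and one value per token, per layer, per KV channel."""
    return context_size * n_layers * kv_width * (kv_bits / 8) * 2
```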

So for the above example this means:

VRAM_{model} = 27 Billion * 4 Bit = 13.5 GByte
VRAM_{kvcache} = 131072 * 16 * 1024 * 2 Byte * 2 = 8 GiByte
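
Plugging the example numbers into the sketch above (again assuming the KV width of 1024 derived from the table):

```python
GIB = 1024 ** 3

model = vram_model_bytes(27e9, 4)      # 13.5e9 bytes; the actual Q4_K_S file is 14430 MiB
kv = vram_kv_bytes(131072, 16, 1024)   # 8 GiB, matching the 8192 MiB row in the table

# Prints roughly: model ~12.6 GiB, kv 8 GiB, total ~20.6 GiB.
# llama.cpp reports ~22 GiB because the real file is larger than the naive estimate.
print(f"model ~{model / GIB:.1f} GiB, kv {kv / GIB:.0f} GiB, "
      f"total ~{(model + kv) / GIB:.1f} GiB")
```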

As mentioned above, this is only a rough estimate that leaves out quite a few details. But it should give you a first impression of the amount of memory required.
