LLM VRAM usage with llama.cpp

While trying to get some of Qwen’s latest models up and running on my AMD iGPU, I ran into crashes.

The error messages were misleading, but in the end they turned out to be out-of-memory errors, so I started to think about how much memory the different components of an LLM use.

The data in the following table were taken from llama.cpp output (example model was unsloth/Qwen3.6-27B):

| Model | Precision | VRAM Model | Context Size | Layers | VRAM KV | Total VRAM |
|---|---|---|---|---|---|---|
| unsloth/Qwen3.6-27B-GGUF:Q4_K_S | 4 bit | 14430 MiB | 16384 | 16 | 1024 MiB | ~ 16 GiB |
| | | | 131072 | | 8192 MiB | ~ 22 GiB |
| | | | 262144 | | 16384 MiB | ~ 31 GiB |

So a (very rough) calculation looks like this:

VRAM_{total} = VRAM_{model} + VRAM_{kvcache}

The (V)RAM required for the model depends on its size and on the precision used:

VRAM_{model} = ModelSize * Precision

and the (V)RAM required for the KV cache depends on the context size, the model’s number of layers, the per-layer KV width (the number of KV heads times the head dimension; the table above implies a width of 1024 for this model), and the precision of the KV cache (commonly 16 bit), multiplied by 2 because both a key and a value are stored:

VRAM_{kvcache} = ContextSize * Layers * KVWidth * Precision * 2
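
To make the arithmetic concrete, here is a minimal Python sketch of both formulas. The function names, the KV width parameter, and the 16-bit default are my own choices for illustration; this is not llama.cpp’s API.

```python
def vram_model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Weights: parameter count times the quantized precision."""
    return n_params * bits_per_weight / 8


def vram_kv_bytes(context_size: int, n_layers: int, kv_width: int,
                  kv_bits: int = 16) -> float:
    """KV cache: one key and one value per token, per layer, per KV channel."""
    return context_size * n_layers * kv_width * (kv_bits / 8) * 2
```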

So for the above example this means:

VRAM_{model} = 27 Billion * 4 Bit = 13.5 GByte
VRAM_{kvcache} = 131072 * 16 * 1024 * 2 Byte * 2 = 8 GiByte
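
Plugging the example numbers into the sketch above (again assuming the KV width of 1024 derived from the table):

```python
GIB = 1024 ** 3

model = vram_model_bytes(27e9, 4)      # 13.5e9 bytes; the actual Q4_K_S file is 14430 MiB
kv = vram_kv_bytes(131072, 16, 1024)   # 8 GiB, matching the 8192 MiB row in the table

# Prints roughly: model ~12.6 GiB, kv 8 GiB, total ~20.6 GiB.
# llama.cpp reports ~22 GiB because the real file is larger than the naive estimate.
print(f"model ~{model / GIB:.1f} GiB, kv {kv / GIB:.0f} GiB, "
      f"total ~{(model + kv) / GIB:.1f} GiB")
```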

As mentioned above, this is only a rough estimate that leaves out quite a few details. But it should give you a first impression of the amount of memory required.
