While trying to get some of Qwen’s latest models up and running on my AMD iGPU, I ran into crashes. The error messages were misleading, but in the end they turned out to be out-of-memory errors, which got me thinking about how much memory the different components of an LLM actually use.
The data in the following table were taken from llama.cpp output (example model was unsloth/Qwen3.6-27B):
| Model | Precision | VRAM (Model) | Context Size | Layers | VRAM (KV Cache) | Total VRAM |
|---|---|---|---|---|---|---|
| unsloth/Qwen3.6-27B-GGUF:Q4_K_S | 4 bit | 14430 MiB | 16384 | 16 | 1024 MiB | ~ 16 GiB |
| | | | 131072 | | 8192 MiB | ~ 22 GiB |
| | | | 262144 | | 16384 MiB | ~ 31 GiB |
So a (very rough) calculation looks like this:
The (V)RAM required for the model weights depends on the number of parameters and the precision used:
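As a back-of-envelope formula:

$$ \text{VRAM}_{\text{model}} \approx n_{\text{params}} \times \frac{\text{bits per weight}}{8}\ \text{bytes} $$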
and the (V)RAM required for the KV cache depends on the context size, the model’s number of layers, the per-token width of the keys and values (number of KV heads times head dimension), and the precision of the KV cache (commonly 16 bit), multiplied by 2 because both keys and values are stored:
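$$ \text{VRAM}_{\text{KV}} \approx 2 \times n_{\text{layers}} \times n_{\text{ctx}} \times d_{\text{kv}} \times \text{bytes per value} $$

where $d_{\text{kv}}$ is the number of KV heads times the head dimension.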
So for the above example this means:
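Working backwards from the table (the per-token KV width below is inferred from the reported numbers, not read directly from the llama.cpp output): with 16 layers, a context of 16384 and a 16-bit (2-byte) KV cache, the reported 1024 MiB implies $d_{\text{kv}} = 1024$ (for example 8 KV heads with a head dimension of 128):

$$ 2 \times 16 \times 16384 \times 1024 \times 2\ \text{bytes} = 2^{30}\ \text{bytes} = 1024\ \text{MiB} $$

Since the KV cache grows linearly with the context size, the 131072 and 262144 rows land at 8192 MiB and 16384 MiB respectively. For the weights, 14430 MiB for roughly 27 billion parameters works out to an effective ≈ 4.5 bits per weight — slightly above the nominal 4 bit, because Q4_K_S keeps some tensors at higher precision.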
As mentioned above, this is only a rough estimate that leaves out quite a few details, but it should give you a first impression of how much memory is required.
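If you want to play with the numbers yourself, here is a minimal Python sketch of the same back-of-envelope math. The parameter count, the effective bits per weight and the per-token KV width are assumptions inferred from the table above, not values read from llama.cpp:

```python
# Back-of-envelope VRAM estimator -- a sketch of the calculation above,
# not llama.cpp's exact accounting (compute buffers etc. are ignored,
# which is why real totals come out a bit higher).

GIB = 1024**3

def model_vram_bytes(n_params: float, bits_per_weight: float) -> float:
    """Weight memory: parameters times (effective) bits per weight."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_ctx: int, d_kv: int, bytes_per_value: int = 2) -> int:
    """KV cache: 2 (key + value) x layers x context x per-token KV width x precision."""
    return 2 * n_layers * n_ctx * d_kv * bytes_per_value

# Assumed figures for unsloth/Qwen3.6-27B-GGUF:Q4_K_S, inferred from the
# table above: ~27e9 parameters at an effective ~4.5 bits/weight,
# 16 layers, and a per-token KV width of 1024.
weights = model_vram_bytes(27e9, 4.5)
for n_ctx in (16384, 131072, 262144):
    kv = kv_cache_bytes(n_layers=16, n_ctx=n_ctx, d_kv=1024)
    print(f"ctx={n_ctx:>6}: model ~{weights / GIB:.1f} GiB, "
          f"KV ~{kv / GIB:.1f} GiB, total ~{(weights + kv) / GIB:.1f} GiB")
```

Running it reproduces the table within the expected slack: roughly 15, 22 and 30 GiB, with the remaining gap to the measured totals coming from buffers this sketch ignores.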
