
Running LLMs with llama.cpp using Vulkan

Some time ago I tried to use my AMD iGPUs (not supported by AMD's ROCm) for LLMs. However, I didn't succeed.

Recently I read some benchmarks of the newest AMD Strix Halo systems performing quite well in some LLM tasks using Vulkan instead of ROCm.

Building Vulkan-enabled llama.cpp (using Docker)

Building llama.cpp using Docker is quite easy (assuming you already have an up-and-running Docker installation):

linux # git clone https://github.com/ggml-org/llama.cpp.git
linux # docker build -t llama-cpp-vulkan -f .devops/vulkan.Dockerfile .

So we just got the source from GitHub and created a new image called llama-cpp-vulkan using the build recipe provided in vulkan.Dockerfile. By default the above command creates a Docker image of the server version (no CLI for easy testing). However, there are several targets available in this Dockerfile, so we'll also create a CLI version (target name: "light") to ease testing:

linux # docker build -t llama-cpp-cli -f .devops/vulkan.Dockerfile --target light .
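
If you want to see which other targets the Dockerfile provides, you can simply list its build stages (the stage names may of course change between llama.cpp versions):

linux # grep -iE '^FROM .+ AS ' .devops/vulkan.Dockerfile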

First try – small LLM
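
Before starting the container it's worth checking which DRI devices actually exist on your host, since the names passed via --device below (renderD128, card0) may differ on your system:

linux # ls -l /dev/dri/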

linux # docker run -it --rm -v "models:/root/.cache/llama.cpp:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 llama-cpp-cli -hf ggml-org/gemma-3-1b-it-GGUF
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 680M (RADV REMBRANDT) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /app/libggml-vulkan.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
<...>
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon 680M (RADV REMBRANDT)) (0000:e7:00.0) - 43003 MiB free
<...>
load_tensors: loading model tensors, this can take a while… (mmap = true)
load_tensors: offloading 26 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 27/27 layers to GPU
load_tensors: Vulkan0 model buffer size = 762.49 MiB
load_tensors: CPU_Mapped model buffer size = 306.00 MiB
<...>

So loading the (1b) Gemma 3 model worked like a charm, and it uses less than 1 GB of RAM. The answer speed is good (but I didn't run real benchmarks, I only copied the data displayed by llama.cpp itself, see below).

Looks like my iGPU can use up to 40 GB of VRAM, so the next try will be the bigger (27b) Gemma 3 model.

Different systems seem to allocate different amounts of VRAM, so your system may provide much less by default. If you’re looking for a way to change this (independent of your BIOS settings), have a look here.
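
To see what the amdgpu driver currently reports on the host, you can peek into sysfs (replace card0 with your iGPU's card index; both values are in bytes). On UMA iGPUs most of the usable memory is GTT, i.e. system RAM mapped for the GPU, so one possible knob (besides the BIOS) is the amdgpu.gttsize kernel parameter (value in MiB). Treat this as a rough sketch and check your distribution's documentation before touching kernel parameters:

linux # cat /sys/class/drm/card0/device/mem_info_vram_total
linux # cat /sys/class/drm/card0/device/mem_info_gtt_total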

Second try – bigger LLM

I already did some experiments with the bigger (27b) model on Nvidia L40 hardware, but until now I couldn't get bigger models to run on my iGPU hardware. That changed with llama.cpp:

linux # docker run -it --rm -v "models:/root/.cache/llama.cpp:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 llama-cpp-cli -hf ggml-org/gemma-3-27b-it-qat-GGUF
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 680M (RADV REMBRANDT) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /app/libggml-vulkan.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
<...>
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon 680M (RADV REMBRANDT)) (0000:e7:00.0) - 43003 MiB free
<...>

OK, while loading the model worked without problems, the resulting performance is quite limited. Let's put it this way: in a typing contest I'd most likely win 😉. Nonetheless, for a single user (with some time to spare) these models can run on this very modest hardware!

Server mode

The above calls always used the CLI version; now it's time to switch to server mode:

linux # docker run -it --rm -p 8080:8080 -v "models:/root/.cache/llama.cpp:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 llama-cpp-vulkan -hf ggml-org/gemma-3-1b-it-GGUF
<...>
main: server is listening on http://0.0.0.0:8080 - starting the main loop
<...>

To do that we basically only have to change the Docker image from llama-cpp-cli to llama-cpp-vulkan.

Make sure to publish the internal port (default: 8080) to the outside world when using Docker ('-p 8080:8080').

If your machine is accessible to others (and you don't want them to use your new AI power), make sure to specify an API key using the --api-key option.
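
For example (the key itself is of course just a placeholder):

linux # docker run -it --rm -p 8080:8080 -v "models:/root/.cache/llama.cpp:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 llama-cpp-vulkan -hf ggml-org/gemma-3-1b-it-GGUF --api-key "my-secret-key"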

You should now be able to access the web interface in a browser using your host's IP address and the published port:

http://192.168.0.10:8080
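
Besides the web interface the server also exposes an OpenAI-compatible API, so you can query it from the command line as well. A minimal example using curl (adjust IP, port and key to your setup; the Authorization header is only needed if the server was started with --api-key):

linux # curl http://192.168.0.10:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer my-secret-key" \
    -d '{"messages": [{"role": "user", "content": "Name all US states"}]}'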

Benchmarks

OK, it's not a real benchmark … I just wrote down the values I got while experimenting (with some easy questions like "Name all US presidents/states").

However it should be enough to give you a first impression about what to expect:

model                                prompt [tokens/s]   eval [tokens/s]
ggml-org/gemma-3-1b-it-GGUF          ~150                ~60
ggml-org/gemma-3-12b-it-qat-GGUF     ~37                 ~8
ggml-org/gemma-3-27b-it-qat-GGUF     ~16                 2.2

CPU only (8 cores, AMD Ryzen 7 7735HS):
ggml-org/gemma-3-12b-it-qat-GGUF     ~9                  ~5

I added the 12b model on CPU only for speed comparison. The more notable difference, however, was the noise level: with all 8 CPU cores active the cooling fan started working almost instantly, while in GPU mode the system stayed quiet (at least for my short test time frame).
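
If you want real numbers instead of my rough notes, llama.cpp also ships a dedicated llama-bench tool. One possible way to run it from the same Dockerfile is to build the "full" target and override the entrypoint; this is only a sketch, as the available targets, the binary location inside the image and the GGUF file name inside the models volume may differ on your setup (check with ls first):

linux # docker build -t llama-cpp-full -f .devops/vulkan.Dockerfile --target full .
linux # docker run -it --rm -v "models:/root/.cache/llama.cpp:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 --entrypoint /app/llama-bench llama-cpp-full -m /root/.cache/llama.cpp/<downloaded-model>.gguf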
