While investigating whether my AMD APU is somewhat usable for running LLMs, this is what I found.
Preparations
Installation of the amdgpu driver and ROCm is explained here.
Hardware/device information
During boot the amdgpu driver logs some information about the available graphics memory: the reserved VRAM and the GTT (graphics translation table) memory, which can be used either as graphics memory (if needed/allocated) or as system memory.
Machine #1 with a 680M APU reports:
linux # dmesg | grep amdgpu
<...>
[ 3.845996] [drm] amdgpu: 512M of VRAM memory ready
[ 3.846004] [drm] amdgpu: 64021M of GTT memory ready.
<...>
Machine #2 with a 780M APU:
linux # dmesg | grep amdgpu
<...>
[ 2.458868] [drm] amdgpu: 4096M of VRAM memory ready
[ 2.458871] [drm] amdgpu: 13945M of GTT memory ready.
<...>
Increase GTT memory
According to some postings, the GTT size can be modified via the amdgpu module’s parameter gttsize:
linux # modinfo amdgpu | grep gttsize
parm: gttsize:Size of the GTT userspace domain in megabytes (-1 = auto) (int)
So I booted the system with the extra kernel option "amdgpu.gttsize=16384" (size in MB) to allocate 16GB of GTT memory. That did the trick, well – at least kind of:
linux # dmesg
<...>
[ 2.507767] [drm] amdgpu: 4096M of VRAM memory ready
[ 2.507772] amdgpu 0000:c6:00.0: amdgpu: [drm] Configuring gttsize via module parameter is deprecated, please use ttm.pages_limit
[ 2.507777] amdgpu 0000:c6:00.0: amdgpu: [drm] GTT size has been set as 17179869184 but TTM size has been set as 14622654464, this is unusual
[ 2.507781] [drm] amdgpu: 16384M of GTT memory ready.
<...>
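By the way, whether a boot option actually reached the kernel can be verified via /proc/cmdline (a quick sanity check of my own; output abbreviated and illustrative):
linux # cat /proc/cmdline
<...> amdgpu.gttsize=16384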
Ok, so this value should now be set differently (btw: the kernel version is 6.8.0). Let’s check out how by getting information about the parameters of the ttm kernel module ("TTM" stands for "translation table maps"):
linux # modinfo ttm
<...>
description: TTM memory manager subsystem (for DRM device)
<...>
parm: page_pool_size:Number of pages in the WC/UC/DMA pool (ulong)
parm: pages_limit:Limit for the allocated pages (ulong)
parm: dma32_pages_limit:Limit for the allocated DMA32 pages (ulong)
While trying to get the current values used by that module I found that the ttm module isn’t even loaded; instead there is an active module amdttm, which seems to ship with the amdgpu drivers and to replace the standard ttm module.
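This can be confirmed with a quick lsmod check (my own addition; output abbreviated, exact sizes and usage counts will differ):
linux # lsmod | grep ttm
amdttm                <...>
<...>
Its parameters look identical to those of the standard ttm module: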
linux # modinfo amdttm
<...>
description: TTM memory manager subsystem (for DRM device)
<...>
parm: page_pool_size:Number of pages in the WC/UC/DMA pool (ulong)
parm: pages_limit:Limit for the allocated pages (ulong)
parm: dma32_pages_limit:Limit for the allocated DMA32 pages (ulong)
Let’s see what values were used by the module on the 780M machine that complained about the memory mismatch:
linux # cat /sys/module/amdttm/parameters/page_pool_size
3569984
linux # cat /sys/module/amdttm/parameters/pages_limit
3569984
linux # cat /sys/module/amdttm/parameters/dma32_pages_limit
524288
With a default page size of 4k (4096 bytes) this is:
3569984 * 4096 = 14622654464
which is exactly the amount of memory mentioned during boot.
So the next try is to set amdttm.pages_limit and amdttm.page_pool_size to the desired value (which is specified in 4k pages), so for 16GB that means:
16 * 1024 * 1024 * 1024 / 4096 = 4194304
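For reference, the page count can be computed in the shell, and the options can be made persistent via the bootloader. A minimal sketch, assuming a GRUB-based system with update-grub (file contents abbreviated; the exact mechanism differs per distribution):
linux # echo $((16 * 1024 * 1024 * 1024 / 4096))
4194304
linux # grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="<...> amdttm.pages_limit=4194304 amdttm.page_pool_size=4194304"
linux # update-grub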
Booting up machine #2 (780M) with these options results in 16GB of GTT memory like before, but this time without complaints from the kernel:
linux # dmesg
<...>
[ 2.431176] [drm] amdgpu: 4096M of VRAM memory ready
[ 2.431180] [drm] amdgpu: 16384M of GTT memory ready.
<...>
System tools
Let’s first have a look at our hardware/software stack:
linux # rocm-smi
==== ROCm System Management Interface ====
==== Concise Info ====
Device  Node  IDs              Temp    Power     Partitions          SCLK  MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
=================================================
0       1     0x1681,   50563  43.0°C  8.168W    N/A, N/A, 0         N/A   2400Mhz  0%   auto  N/A     4%     0%
=================================================
==== End of ROCm SMI Log ====
linux # rocm-smi --showproductname
==== ROCm System Management Interface ====
==== Product Info ====
GPU[0] : Card Series: AMD Radeon Graphics
GPU[0] : Card Model: 0x1681
GPU[0] : Card Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: REMBRANDT
GPU[0] : Subsystem ID: -0x776b
GPU[0] : Device Rev: 0x0a
GPU[0] : Node ID: 1
GPU[0] : GUID: 50563
GPU[0] : GFX Version: gfx1035
==== End of ROCm SMI Log ====
linux # rocm-smi --showhw
==== ROCm System Management Interface ====
==== Concise Hardware Info ====
GPU  NODE  DID     GUID   GFX VER  GFX RAS  SDMA RAS  UMC RAS  VBIOS              BUS           PARTITION ID
0    1     0x1681  50563  gfx1035  N/A      N/A       N/A      113-REMBRANDT-X37  0000:E7:00.0  0
=================================================
==== End of ROCm SMI Log ====
linux # rocm-smi --showdriverversion
==== ROCm System Management Interface ====
==== Version of System Component ====
Driver version: 6.12.12
=================================================
==== End of ROCm SMI Log ====
linux # rocm-smi --showfwinfo
==== ROCm System Management Interface ====
==== Firmware Information ====
GPU[0] : ASD firmware version: 0x210000eb
GPU[0] : CE firmware version: 37
GPU[0] : ME firmware version: 64
GPU[0] : MEC firmware version: 122
GPU[0] : MEC2 firmware version: 122
GPU[0] : PFP firmware version: 104
GPU[0] : RLC firmware version: 83
GPU[0] : RLC SRLC firmware version: 1
GPU[0] : RLC SRLG firmware version: 1
GPU[0] : RLC SRLS firmware version: 1
GPU[0] : SDMA firmware version: 47
GPU[0] : SMC firmware version: 04.69.63.105
GPU[0] : VCN firmware version: 0x04121003
==== End of ROCm SMI Log ====
Or just use "rocm-smi -a" to show all available information.
Running vllm with ROCm support on 680M/780M
While official ROCm releases do not support these cards, there’s an unofficial project adding support for them! And even better: Docker images containing a pre-built vllm are also available!
So for now all we have to do is choose the right Docker image according to the graphics architecture generation:
RDNA1/2 image for 680M:
linux # docker pull lamikr/rocm_sdk_builder:612_01_rdna1_rdna2
linux # docker run -it --device=/dev/kfd --device=/dev/dri -p 8000:8000 docker.io/lamikr/rocm_sdk_builder:612_01_rdna1_rdna2 bash
RDNA3 image for 780M:
linux # docker pull lamikr/rocm_sdk_builder:612_01_rdna3
linux # docker run -it --device=/dev/kfd --device=/dev/dri -p 8000:8000 docker.io/lamikr/rocm_sdk_builder:612_01_rdna3 bash
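Once inside the container, it may be worth checking that the GPU is actually visible to the ROCm stack before starting vllm; a sanity check of my own using rocminfo (output abbreviated, the gfx version depends on the APU, e.g. gfx1035 for the 680M):
linux # rocminfo | grep -i gfx
<...>
  Name:                    gfx1035
<...>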
Now let’s start vllm with a standard model:
linux # vllm serve <your model here> --api-key token-abc123
Wait for the model preparation … and hope for the best.
If startup is successful you can now access the model using the OpenAI-compatible API URL: http://<your-ip>:8000/v1.
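A quick way to test the endpoint is the OpenAI-compatible REST API itself, e.g. with curl (model name and API key as used above; host and port depend on your setup):
linux # curl http://localhost:8000/v1/models -H "Authorization: Bearer token-abc123"
linux # curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer token-abc123" \
    -d '{"model": "<your model here>", "messages": [{"role": "user", "content": "Hello!"}]}'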
Conclusion
While some simple LLMs worked as described above, the majority either took too long to even start up, didn’t start up at all (producing a random bunch of error messages), or even crashed the system (ok, only the Wayland server, but that’s bad enough).
So in the end it still looks like AMD GPUs (especially the cheap APU ones) are not supported by the official ROCm releases for a reason (but maybe the same applies to the bigger GPUs … I don’t know …).