Motivation
I’m currently playing around with some LLMs for different projects. One project aims to add textual descriptions to a large image gallery. Those descriptions can be used both for better image organization and as alternative texts when publishing the images on websites.
My basic setup consists of ollama for running the LLMs and some Python code that locates the images, uploads them to ollama and saves the returned descriptions in the gallery’s database.
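The core of such a script is small. As a sketch (the model name, prompt and ollama address are assumptions, not the author’s actual code), sending an image to ollama’s /api/generate endpoint looks like this – images are passed base64-encoded, and stream=False returns a single JSON object:

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # default ollama port; adjust as needed
MODEL = "llava"  # assumed vision-capable model

def build_describe_payload(image_bytes: bytes, model: str = MODEL) -> dict:
    """Build the JSON payload for ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": "Describe this image in one or two sentences.",
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def describe_image(path: str) -> str:
    """Send one image to ollama and return the generated description."""
    with open(path, "rb") as fh:
        payload = build_describe_payload(fh.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

The returned description can then be written to the gallery’s database however that is organized.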
While testing I found that I could run 4 instances of ollama on 2 Nvidia cards to get the most out of the available hardware. However, I’d need some kind of load balancing to make use of this parallelism.
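For reference, one way to run several instances (a sketch, not necessarily the exact setup used here) is to bind each ollama instance to its own port via OLLAMA_HOST and pin it to a card via CUDA_VISIBLE_DEVICES, two instances per GPU:

```shell
# Sketch: ports and GPU assignment are assumptions matching the haproxy config
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11436 ollama serve &
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11437 ollama serve &
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11438 ollama serve &
```

In a production setup you’d wrap these in systemd units or similar instead of backgrounded shell jobs.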
Load balancing
So I researched and tried several ollama-specific load balancing solutions (like ollamaflow or ollama_proxy_server), however none of them provided up-to-date and working installation instructions or – if they did so – the configured system just failed for whatever reason (and I was unwilling to spent much time on further investigation).
So I decided to go with haproxy for now. As I had used it on several occasions before, the basic load balancing setup was easy:
linux # cat /etc/haproxy/conf.d/ollama.cfg
defaults
    mode http
    balance leastconn

frontend ft_ollama
    bind :80
    default_backend be_ollama
    timeout client 60s

backend be_ollama
    mode http
    balance leastconn
    option httpchk GET /
    option forwardfor
    timeout server 60s
    timeout connect 60s
    http-request set-header X-Real-IP %[src]
    http-request set-header X-Forwarded-Proto https if { ssl_fc }
    server ollama1 127.0.0.1:11435 check maxconn 2 fall 3 rise 2 inter 10s downinter 10s
    server ollama2 127.0.0.1:11436 check maxconn 2 fall 3 rise 2 inter 10s downinter 10s
    server ollama3 127.0.0.1:11437 check maxconn 2 fall 3 rise 2 inter 10s downinter 10s
    server ollama4 127.0.0.1:11438 check maxconn 2 fall 3 rise 2 inter 10s downinter 10s
So basically we get a frontend listening on port 80 that load balances across the 4 ollama instances running on localhost (ports 11435–11438). With "balance leastconn" and "maxconn 2", each instance handles at most 2 concurrent requests, and new requests go to the least busy instance.
Adding bearer authentication
However, what’s missing with plain ollama is authentication: everyone who can establish a network connection can use these systems. Some of the tools (that I couldn’t get to work) had bearer authentication built in, so I was wondering whether this feature could also be implemented with haproxy.
Searching for “haproxy bearer authentication” didn’t turn up an easy solution, however there were some pointers in the right direction (see here).
After some testing I ended up with this prototype:
linux # cat /etc/haproxy/conf.d/ollama.cfg
<...>
frontend ft_ollama
    bind :80
    default_backend be_ollama
    timeout client 60s

    acl valid_bearer var(txn.bearer) -m str 'sk_mysupersecure_bearer_token1'
    acl valid_bearer var(txn.bearer) -m str 'sk_mysupersecure_bearer_token2'

    # extract the token from the "Authorization: Bearer <token>" header
    http-request set-var(txn.bearer) http_auth_bearer
    http-request deny content-type 'text/html' string 'Missing Authorization HTTP header' unless { req.hdr(authorization) -m found }
    http-request deny content-type 'text/html' string 'Not authorized' unless valid_bearer
<...>
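With authentication in place, every client has to present one of the configured tokens. As a sketch (the balancer address is a placeholder and the token is the example value from the config above), the Python code mentioned at the beginning could attach it like this:

```python
import urllib.request

# Placeholder values – adjust to your balancer address and configured token
BALANCER_URL = "http://10.10.0.1:80/api/generate"
TOKEN = "sk_mysupersecure_bearer_token1"

def authorized_request(url: str, data: bytes) -> urllib.request.Request:
    """Build a POST request that carries the bearer token haproxy checks for."""
    return urllib.request.Request(
        url,
        data=data,
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
    )
```

Requests built this way pass both deny rules; everything else is rejected before it ever reaches an ollama backend.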
Testing
A simple test using curl shows the desired effect:
linux # curl http://10.10.0.1:80/
Missing Authorization HTTP header
linux # curl -H 'Authorization: Bearer some-invalid-token' http://10.10.0.1:80/
Not authorized
linux # curl -H 'Authorization: Bearer sk_mysupersecure_bearer_token1' http://10.10.0.1:80/
Ollama is running
Conclusion
For a basic load balancing solution (even with rudimentary authentication) haproxy
will to fine. Of course other solutions could make use of internal ollama knowledge to achieve better results (something like choosing the right node that has the desired LLM already in memory instead of choosing a random one that may need to unload an existing LLM and load the new one and so on…).
You’d also want additional SSL protection for a production system, but there’s plenty of documentation on how to achieve that with haproxy, so I decided to keep it simple here 🙂 .