
Running multiple ollama instances on one machine

While ollama does handle multiple different LLMs quite nicely (loading/unloading on demand), there are situations where you may want to run multiple instances of the same model at the same time (e.g. to increase throughput).

Here’s how you can do so with minimal changes to your zero-effort ollama installation.

Let’s assume you just did a fresh ollama installation.

Here’s what you need to change in order to run multiple instances of the same LLM:

Required changes

First of all we copy the systemd service file provided by the ollama installation:

linux # cp /etc/systemd/system/ollama.service /etc/systemd/system/ollama@.service

And now we apply some changes to convert it into a template service file (identified by the additional “@” in its name):

linux # vi /etc/systemd/system/ollama@.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
EnvironmentFile=/etc/default/ollama-%i
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"

[Install]
WantedBy=default.target

So what did we do? Basically we can now provide a separate config file (/etc/default/ollama-%i) for each ollama instance. The “%i” will be replaced by whatever follows the “@” in the service instance name:

For example, ollama@instance1.service will look for its config in a file named /etc/default/ollama-instance1.

Let’s not forget to make sure systemd knows about the changes:

linux # systemctl daemon-reload
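
Optionally, you can check how the placeholder is resolved for a given instance by asking systemd for the expanded property; for our example instance name “instance1” the output should contain the path /etc/default/ollama-instance1:

linux # systemctl show -p EnvironmentFiles ollama@instance1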

Preparing instances

First of all we’ll disable the default ollama instance in order to prevent conflicts with our upcoming instances:

linux # systemctl disable --now ollama
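
To confirm that the default instance is really stopped and port 11434 is free again, a quick check could look like this (the grep should come back empty; ss is part of iproute2 and available on a standard Ubuntu installation):

linux # systemctl is-active ollama
inactive
linux # ss -ltn | grep 11434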

Then we provide each ollama instance with its individual settings (such as a unique listening port and the GPU to use; the available options are documented here):

linux # vi /etc/default/ollama-instance1
OLLAMA_HOST=127.0.0.1:11435
CUDA_VISIBLE_DEVICES=0

# see https://github.com/ollama/ollama/issues/9054
OLLAMA_KEEP_ALIVE=-1
OLLAMA_LOAD_TIMEOUT=-5m
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_NUM_PARALLEL=2
OLLAMA_MAX_QUEUE=64
OLLAMA_FLASH_ATTENTION=1
OLLAMA_DEBUG=0

By selecting unique listening ports and/or CUDA devices we can now run multiple instances of ollama. The first instance is enabled and started with:

linux # systemctl enable --now ollama@instance1
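
The second instance used for load balancing further below gets its own config file with a unique port. CUDA_VISIBLE_DEVICES=1 assumes a second GPU is present; on a single-GPU machine you can keep CUDA_VISIBLE_DEVICES=0 for both instances as long as there is enough VRAM for two copies of the model:

linux # vi /etc/default/ollama-instance2
OLLAMA_HOST=127.0.0.1:11436
CUDA_VISIBLE_DEVICES=1

# see https://github.com/ollama/ollama/issues/9054
OLLAMA_KEEP_ALIVE=-1
OLLAMA_LOAD_TIMEOUT=-5m
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_NUM_PARALLEL=2
OLLAMA_MAX_QUEUE=64
OLLAMA_FLASH_ATTENTION=1
OLLAMA_DEBUG=0

linux # systemctl enable --now ollama@instance2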

Running ollama commands

In order to talk to a specific instance, make sure the OLLAMA_HOST variable is set to that instance’s listening address (like in its configuration file):

linux # OLLAMA_HOST=127.0.0.1:11435 ollama ps
NAME    ID    SIZE    PROCESSOR    CONTEXT    UNTIL
linux # OLLAMA_HOST=127.0.0.1:11435 ollama pull gemma3:27b
pulling manifest 
pulling e796792eba26: 100%
<...>
verifying sha256 digest 
writing manifest 
success

Be aware that it may take some time after startup until the ollama service responds!
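
A quick way to check whether an instance is already up is its version endpoint; each of the following calls (using the two ports from our example configs) should return a small JSON document with the installed ollama version:

linux # curl -s http://127.0.0.1:11435/api/version
linux # curl -s http://127.0.0.1:11436/api/version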

Load-balancing with litellm

Install litellm (as described here):

linux # apt install python3 python3-venv
linux # mkdir litellm
linux # cd litellm
linux # python3 -m venv venv
linux # source venv/bin/activate
linux # pip install 'litellm[proxy]'

Prepare the config file

So for now we’ll run 2 instances of ollama, listening on ports 11435 and 11436:

linux # vi litellm_config.yaml 
model_list:
  - model_name: gemma3:27b
    litellm_params:
      model: ollama/gemma3:27b
      api_base: "http://127.0.0.1:11435"
      
  - model_name: gemma3:27b
    litellm_params:
      model: ollama/gemma3:27b
      api_base: "http://127.0.0.1:11436"
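
Since both entries share the same model_name, litellm already distributes requests between the two backends with its default routing strategy. If you want to tweak that behaviour, router settings can be appended to the same file; the exact keys and strategy names should be double-checked against the litellm documentation, the following is just a sketch:

router_settings:
  routing_strategy: "least-busy"   # default is "simple-shuffle"
  num_retries: 2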

Starting litellm

linux # litellm --config litellm_config.yaml --host 127.0.0.1 --port 80
LiteLLM: Proxy initialized with Config, Set models:
    gemma3:27b
    gemma3:27b

Let’s do a short test:

linux # curl -s -X POST 'http://localhost:80/chat/completions' -H 'Content-Type: application/json' -d '{
  "model": "gemma3:27b",
  "messages": [
        {"role": "user", "content": "What is 5+3?"}
    ],
    "mock_testing_rate_limit_error": true
}' | jq
{
  "id": "chatcmpl-616f33ac-11ab-4a6a-9475-bff8154e7d10",
  "created": 1766324471,
  "model": "ollama/gemma3:27b",
  "object": "chat.completion",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "5 + 3 = 8\n",
        "role": "assistant"
      }
    }
  ],
  "usage": {
    "completion_tokens": 9,
    "prompt_tokens": 21,
    "total_tokens": 30
  }
}
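
To see the load balancing in action, you can check after a few requests which instances actually have the model loaded, or simply follow the logs of both services:

linux # OLLAMA_HOST=127.0.0.1:11435 ollama ps
linux # OLLAMA_HOST=127.0.0.1:11436 ollama ps
linux # journalctl -f -u ollama@instance1 -u ollama@instance2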

Some more information about API keys and litellm can be found here.
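
If the proxy itself should survive reboots as well, a small systemd unit along the following lines should do. This is just a sketch: it assumes the litellm directory created above lives in /opt/litellm and, like the example above, runs the proxy as root on port 80, so adjust paths, user and port to your setup:

linux # vi /etc/systemd/system/litellm.service
[Unit]
Description=LiteLLM Proxy
After=network-online.target ollama@instance1.service ollama@instance2.service

[Service]
# litellm binary from the venv created during the installation (assumed path)
ExecStart=/opt/litellm/venv/bin/litellm --config /opt/litellm/litellm_config.yaml --host 127.0.0.1 --port 80
WorkingDirectory=/opt/litellm
Restart=always
RestartSec=3

[Install]
WantedBy=default.target

linux # systemctl daemon-reload
linux # systemctl enable --now litellm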
