While ollama handles multiple different LLMs quite nicely (loading and unloading them on demand), there are situations where you may want to run several instances of the same model at the same time (e.g. to increase throughput).
Here’s how you can do so with minimal changes to your zero-effort ollama installation.
Let’s assume you just did a fresh ollama installation.
Here’s what you need to change in order to run multiple instances of the same LLM:
Required changes
First of all, we copy the systemd service file provided by the ollama installation:
linux # cp /etc/systemd/system/ollama.service /etc/systemd/system/ollama@.service
And now we apply some changes to convert it into a template service file (identified by the additional “@” in its name):
linux # vi /etc/systemd/system/ollama@.service
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
EnvironmentFile=/etc/default/ollama-%i
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
[Install]
WantedBy=default.target
So what did we do? Basically, we can now provide a separate config file (/etc/default/ollama-%i) for each ollama instance. The “%i” will be replaced by whatever follows the “@” in the service instance name:
For example, ollama@instance1.service will look for its config in a file named /etc/default/ollama-instance1.
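You can also let systemd show which unit file it resolves for a concrete instance name, e.g.:
linux # systemctl cat ollama@instance1.service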
Let’s not forget to make sure systemd knows about the changes:
linux # systemctl daemon-reload
Preparing instances
First of all, we disable the default ollama instance to prevent conflicts with our upcoming instances:
linux # systemctl disable --now ollama
Then we provide each ollama instance with its individual settings, such as a unique listening port and the GPU to use (the available options are documented here):
linux # vi /etc/default/ollama-instance1
OLLAMA_HOST=127.0.0.1:11435
CUDA_VISIBLE_DEVICES=0
# s. https://github.com/ollama/ollama/issues/9054
OLLAMA_KEEP_ALIVE=-1
OLLAMA_LOAD_TIMEOUT=-5m
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_NUM_PARALLEL=2
OLLAMA_MAX_QUEUE=64
OLLAMA_FLASH_ATTENTION=1
OLLAMA_DEBUG=0
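For the load-balancing setup below we’ll need a second instance, so let’s sketch its config file as well. Only the values that differ from instance1 are shown here; the port matches what litellm will expect later, and CUDA_VISIBLE_DEVICES=1 assumes you have a second GPU (if both instances share a single GPU, keep the value at 0):
linux # vi /etc/default/ollama-instance2
OLLAMA_HOST=127.0.0.1:11436
CUDA_VISIBLE_DEVICES=1
# remaining settings identical to /etc/default/ollama-instance1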
By selecting unique listening ports and/or CUDA devices, we can now run multiple instances of ollama side by side. Let’s enable and start the first one:
linux # systemctl enable --now ollama@instance1
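Assuming you created /etc/default/ollama-instance2 as sketched above, the second instance is enabled the same way, and systemctl accepts a glob pattern to check on all instances at once:
linux # systemctl enable --now ollama@instance2
linux # systemctl status 'ollama@*'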
Running ollama commands
To run ollama commands against a specific instance, make sure the OLLAMA_HOST variable points to that instance’s address (as set in its configuration file):
linux # OLLAMA_HOST=127.0.0.1:11435 ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
linux # OLLAMA_HOST=127.0.0.1:11435 ollama pull gemma3:27b
pulling manifest
pulling e796792eba26: 100%
<...>
verifying sha256 digest
writing manifest
success
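Note that both instances run as the same ollama user and don’t override OLLAMA_MODELS, so they share the default model store: a model pulled via one instance is available to the other one as well. And if you don’t want to prefix every single command, simply export the variable in your current shell:
linux # export OLLAMA_HOST=127.0.0.1:11435
linux # ollama ps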
Be aware that it may take some time after startup before an ollama instance responds!
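A simple way to wait for an instance is to poll its version endpoint until it answers; this is just a sketch for instance1 on port 11435:
linux # until curl -sf http://127.0.0.1:11435/api/version >/dev/null; do sleep 1; done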
Load-balancing with litellm
Install litellm (as described here):
linux # apt install python3 python3-venv
linux # mkdir litellm
linux # cd litellm
linux # python3 -m venv venv
linux # source venv/bin/activate
linux # pip install 'litellm[proxy]'
Prepare the config file
For now we’ll run two instances of ollama, listening on ports 11435 and 11436:
linux # vi litellm_config.yaml
model_list:
  - model_name: gemma3:27b
    litellm_params:
      model: ollama/gemma3:27b
      api_base: "http://127.0.0.1:11435"
  - model_name: gemma3:27b
    litellm_params:
      model: ollama/gemma3:27b
      api_base: "http://127.0.0.1:11436"
Starting litellm
linux # litellm --config litellm_config.yaml --host 127.0.0.1 --port 80
LiteLLM: Proxy initialized with Config, Set models:
gemma3:27b
gemma3:27b
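If you want litellm to start at boot as well, a small service unit along these lines can help. This is only a sketch: adjust WorkingDirectory and the venv path to wherever you created the virtual environment above (assumed here to be /opt/litellm):
linux # vi /etc/systemd/system/litellm.service
[Unit]
Description=LiteLLM Proxy
After=network-online.target ollama@instance1.service ollama@instance2.service
[Service]
# Paths below assume the venv from the installation step lives in /opt/litellm
WorkingDirectory=/opt/litellm
ExecStart=/opt/litellm/venv/bin/litellm --config /opt/litellm/litellm_config.yaml --host 127.0.0.1 --port 80
Restart=always
RestartSec=3
[Install]
WantedBy=default.target
linux # systemctl daemon-reload
linux # systemctl enable --now litellm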
Let’s do a short test:
linux # curl -s -X POST 'http://localhost:80/chat/completions' -H 'Content-Type: application/json' -d '{
  "model": "gemma3:27b",
  "messages": [
    {"role": "user", "content": "What is 5+3?"}
  ],
  "mock_testing_rate_limit_error": true
}' | jq
{
  "id": "chatcmpl-616f33ac-11ab-4a6a-9475-bff8154e7d10",
  "created": 1766324471,
  "model": "ollama/gemma3:27b",
  "object": "chat.completion",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "5 + 3 = 8\n",
        "role": "assistant"
      }
    }
  ],
  "usage": {
    "completion_tokens": 9,
    "prompt_tokens": 21,
    "total_tokens": 30
  }
}
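After sending a few requests through the proxy, you can check on each ollama instance whether the model has been loaded there:
linux # OLLAMA_HOST=127.0.0.1:11435 ollama ps
linux # OLLAMA_HOST=127.0.0.1:11436 ollama ps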
Some more information about API keys and litellm can be found here.
