Your own LLM (Ollama)

Ollama runs open language models right on your own server: your data never goes to the cloud, and you don’t need API keys or subscriptions. You install the engine, download a model, and chat with it through the console or over a local API from your own apps.

Commands are current as of writing. Before installing, check the official site ollama.com and the repository github.com/ollama/ollama, since the syntax and the model list change over time.

The reality about hardware

An LLM loves memory and is slow at crunching numbers on a CPU. Without a graphics card, go for modest models:

Model size	Where it runs	What to expect
1B–3B (e.g. `llama3.2`, `gemma3:1b`)	CPU + a few GB of RAM	it answers, but not instantly
7B–8B	CPU + lots of RAM, noticeably slower	tolerable for experiments
13B and up	you really need a GPU	not an option on a regular VPS

The main limiter is RAM. The model loads into RAM in full; if there isn’t enough, the OOM-killer terminates the process. Choose a model sized to your server and keep some memory in reserve. On a small plan, a swap file helps — but swap is slow; it’s a safety net, not a replacement for RAM.
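One way to check headroom before downloading anything is to compare the model's size with the memory Linux reports as available. The sketch below is illustrative and not part of Ollama: the function name `fits_in_ram` and the 1024 MB default reserve are assumptions. It also uses download size as a rough stand-in for RAM use, although real usage grows with the conversation context.

```shell
# Hypothetical helper: does a model of MODEL_MB fit in available RAM
# with RESERVE_MB left over? Reads MemAvailable (in kB) from a
# meminfo-style file, /proc/meminfo by default.
fits_in_ram() {
  model_mb=$1
  reserve_mb=${2:-1024}            # assumed safety margin, adjust to taste
  meminfo=${3:-/proc/meminfo}
  avail_kb=$(awk '/^MemAvailable:/ { print $2 }' "$meminfo")
  if [ -z "$avail_kb" ]; then
    echo "MemAvailable not found in $meminfo" >&2
    return 2
  fi
  avail_mb=$((avail_kb / 1024))
  if [ $((model_mb + reserve_mb)) -le "$avail_mb" ]; then
    echo "ok: ${avail_mb} MB available, ${model_mb} MB model + ${reserve_mb} MB reserve"
  else
    echo "too tight: ${avail_mb} MB available, ${model_mb} MB model + ${reserve_mb} MB reserve"
    return 1
  fi
}
```

For example, `fits_in_ram 2000` checks a model of roughly 2 GB against the current machine, keeping the default reserve.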

Installation

Install Ollama

The official script installs Ollama and brings up a background service:

curl -fsSL https://ollama.com/install.sh | sh
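After the script finishes, a quick check can confirm that the binary is on PATH and the background service is running. This sketch assumes a systemd-based distribution where the installer registered a service named `ollama`; `have_cmd` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd ollama; then
  ollama --version                        # print the installed version
  if have_cmd systemctl; then
    systemctl is-active ollama || true    # prints "active" when the service is up
  fi
else
  echo "ollama not found on PATH"
fi
```

If the service is reported as inactive, `systemctl status ollama` usually shows why.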

Run a model

Download and start a conversation right away with one command:

ollama run llama3.2

Without a tag, the command pulls the 3B variant (~2 GB). On small servers (2–4 GB of RAM), grab the lightweight variant explicitly:

ollama run llama3.2:1b

The :1b and :3b tags set the model size. The first run downloads the model (from hundreds of MB to several GB); after that it loads from disk. To exit the conversation, type /bye.

Managing models

Download a model ahead of time

ollama pull gemma3

The list of available models and their sizes is at ollama.com/library.

List downloaded models

ollama list

Remove a model and free up disk

ollama rm gemma3
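In scripts, it can be handy to pull a model only when it is missing. The sketch below parses the first column of `ollama list` output, which shows names together with their tag (for example `gemma3:latest`). `has_model` and `ensure_model` are hypothetical helper names, and the column layout is assumed from current releases.

```shell
# Hypothetical helper: read `ollama list` output on stdin and succeed
# if MODEL is in the NAME column. A name without a tag is treated as ":latest".
has_model() {
  case $1 in
    *:*) want=$1 ;;
    *)   want=$1:latest ;;
  esac
  # Skip the header row, compare the first column exactly.
  awk -v want="$want" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}

# Pull MODEL only if it is not already on disk.
ensure_model() {
  if ollama list | has_model "$1"; then
    echo "$1 already downloaded"
  else
    ollama pull "$1"
  fi
}
```

Calling `ensure_model gemma3` twice downloads the model the first time and only prints a message the second time.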

API for your apps

After installation, Ollama serves a local HTTP API at http://localhost:11434. This is how your code talks to it:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{ "role": "user", "content": "Hello" }],
  "stream": false
}'
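The request above can be wrapped in a small shell function so that the prompt comes from a variable. This is a minimal sketch: `build_chat_payload` and `ask` are hypothetical names, and the escaping covers only backslashes and double quotes, so prompts with newlines or control characters need a proper JSON tool.

```shell
# Hypothetical helper: build a /api/chat request body for one user message.
# Escapes backslashes and double quotes only.
build_chat_payload() {
  model=$1
  prompt=$(printf '%s' "$2" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"stream":false}' \
    "$model" "$prompt"
}

# Send the payload to the local API; curl reads the body from stdin via "-d @-".
ask() {
  build_chat_payload "$1" "$2" | curl -fsS http://localhost:11434/api/chat -d @-
}
```

With `"stream": false` the reply arrives as a single JSON object, with the answer text under `message.content`. If `jq` happens to be installed, `ask llama3.2 'Hello' | jq -r '.message.content'` prints just that text.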

Want to chat through a browser instead of the console? Install the Open WebUI web interface in a container. It’s a handy chat on top of Ollama. For Docker setup, see the Docker on your server guide.

Don’t expose port 11434 to the internet “as is”: an open API lets anyone run models on your server’s resources. Keep it reachable locally only (that’s the default), or close it off with a firewall and let traffic in only through a secured proxy with authentication.
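To see how the port is actually bound, the listening sockets can be inspected with `ss -ltn` (from iproute2). The sketch below classifies the result; `check_bind` is a hypothetical helper, and it assumes the usual `ss` column layout where the fourth field is the local address and port. The bind address is controlled by the `OLLAMA_HOST` environment variable, so an "EXPOSED" result usually means that variable was changed from its default.

```shell
# Hypothetical helper: read `ss -ltn` output on stdin and report
# whether port 11434 is bound to loopback only.
check_bind() {
  awk '
    $4 ~ /:11434$/ {
      found = 1
      addr = $4
      sub(/:11434$/, "", addr)              # strip the port, keep the address
      if (addr == "127.0.0.1" || addr == "[::1]") print "local only: " $4
      else print "EXPOSED: " $4
    }
    END { if (!found) print "nothing listening on 11434" }
  '
}
```

Run it as `ss -ltn | check_bind`. Even with a "local only" result, a firewall rule for the port is a reasonable second layer.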

Lumi handles the server and network; software setup is on you. For network or port issues, write to @lumisup_robot.

Where to next

Swap

Docker