By NetworkThinkTank | April 23, 2026
Introduction
I got tired of sending my data to cloud AI services. Every prompt I typed into ChatGPT or Claude left my network and could be stored, analyzed, or used for training. For personal questions, code snippets with API keys, and private brainstorming sessions, that never sat well with me.
So I built my own. A fully self-hosted AI assistant running on my Proxmox homelab, powered by Ollama for local LLM inference and Open WebUI for a polished ChatGPT-like interface. The models run on my own NVIDIA GPU, the data stays on my NAS, and nothing leaves my network.
This guide walks you through exactly how I did it – from VM creation to pulling your first model and chatting with it through a clean web interface. If you have a Proxmox server and a spare GPU, you can have this running in an afternoon.
Prerequisites and Hardware Requirements
Here is what you need before starting:
Hardware
- Proxmox VE host (version 7.x or 8.x)
- NVIDIA GPU with at least 8GB VRAM (I use an RTX 3060 Ti 8GB)
- Minimum 16GB RAM allocated to the AI VM (32GB recommended)
- NAS with NFS or SMB shares available (Synology, TrueNAS, etc.)
- At least 50GB free storage for models (100GB+ recommended)
Software
- Proxmox VE installed and running
- Ubuntu Server 22.04 or 24.04 LTS ISO
- Docker and Docker Compose
- NVIDIA drivers (535+ recommended)
- NVIDIA Container Toolkit
Network
- Static IP or DHCP reservation for the AI VM
- Access to your NAS from the VM subnet
- Optional: domain name for reverse proxy
Architecture Overview
The stack looks like this:
Browser → Open WebUI (port 3000) → Ollama API (port 11434) → NVIDIA GPU passed through to the VM, with model files and conversation data stored on the NAS.
All AI processing happens locally on the GPU inside the VM. Open WebUI provides the browser-based chat interface and connects to Ollama’s API on the backend. The NAS stores all model files and conversation data so nothing is lost if the VM needs rebuilding.
Step 1: Preparing the Proxmox VM
First, create a new VM in Proxmox optimized for AI workloads.
VM Configuration
- VM ID: 205
- Name: ollama-gpu
- OS: Ubuntu Server 22.04 LTS
- CPU: host type, 8 cores
- RAM: 16GB minimum (I use 32GB)
- Disk: 100GB on local-lvm (SSD preferred)
- Network: vmbr0, bridge mode
- BIOS: OVMF (UEFI) for GPU passthrough
Enabling IOMMU on the Proxmox Host
Edit GRUB configuration:
nano /etc/default/grub
For AMD CPUs, set:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
For Intel CPUs, set:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
Update GRUB and reboot:
update-grub
reboot
Add VFIO modules to /etc/modules:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
Blacklist the NVIDIA drivers on the host:
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.confecho "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
Find your GPU’s PCI IDs:
lspci -nn | grep NVIDIA
Example output:
27:00.0 VGA compatible controller [0300]: NVIDIA [10de:2489]
27:00.1 Audio device [0403]: NVIDIA [10de:228b]
Bind to VFIO:
echo "options vfio-pci ids=10de:2489,10de:228b" > /etc/modprobe.d/vfio.conf
Update initramfs and reboot:
update-initramfs -u -k all
reboot
Add GPU to the VM
In the Proxmox web UI:
VM 205 > Hardware > Add > PCI Device
- Select your NVIDIA GPU
- Check "All Functions"
- Check "ROM-Bar"
- Check "PCI-Express"
- Set "Primary GPU" if this VM has no other display
Verify GPU Inside the VM
After booting the VM, run:
lspci | grep -i nvidia
You should see your GPU listed. If not, check IOMMU groups and VFIO binding on the Proxmox host.
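If you need to dig further, two host-side checks (run on the Proxmox host, not inside the VM) usually reveal the problem. The 27:00 PCI address below is the example address used earlier in this guide; substitute your own.
dmesg | grep -e DMAR -e IOMMU     # confirm IOMMU is active on the host
lspci -nnk -s 27:00               # confirm the GPU is bound to vfio-pci, not nouveau/nvidia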
Step 2: Installing NVIDIA Drivers and Docker
With the GPU visible inside the VM, install the required software.
Update the System
sudo apt update && sudo apt upgrade -ysudo reboot
Install NVIDIA Drivers
sudo apt install -y nvidia-driver-535 nvidia-utils-535
sudo reboot
Verify GPU Access
nvidia-smi
Expected output shows your RTX 3060 Ti with driver version and CUDA version. If nvidia-smi fails, check that the GPU passthrough is configured correctly on the Proxmox host.
Install Docker
sudo apt install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
  sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) \
  signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo $VERSION_CODENAME) stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io \
  docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify GPU in Docker
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
If this shows your GPU info inside the container, you are ready to deploy Ollama.
Step 3: Deploying Ollama
Create a project directory:
mkdir -p ~/ai-stack
cd ~/ai-stack
Create docker-compose.yml:
version: "3.8"services: ollama: image: ollama/ollama:latest container_name: ollama restart: unless-stopped ports: - "11434:11434" volumes: - /mnt/nas/ollama-models:/root/.ollama environment: - NVIDIA_VISIBLE_DEVICES=all deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] open-webui: image: ghcr.io/open-webui/open-webui:main container_name: open-webui restart: unless-stopped ports: - "3000:8080" volumes: - /mnt/nas/openwebui-data:/app/backend/data environment: - OLLAMA_BASE_URL=http://ollama:11434 depends_on: - ollama
Start the stack:
docker compose up -d
Pull your first model:
docker exec -it ollama ollama pull llama3.1:8b
This downloads the Llama 3.1 8B parameter model, which is an excellent starting point for an 8GB GPU. The download is roughly 4.7GB and will be stored on your NAS mount.
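You can confirm the model is registered with Ollama and actually landed on the NAS-backed volume with a couple of quick checks:
docker exec -it ollama ollama list     # list models Ollama knows about
du -sh /mnt/nas/ollama-models          # check disk usage on the NAS share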
Other Recommended Models for 8GB VRAM
ollama pull mistral:7b             # Great for general tasks
ollama pull codellama:7b           # Optimized for coding
ollama pull llama3.1:8b-instruct   # Best for chat interactions
ollama pull phi3:mini              # Microsoft's compact model
ollama pull gemma2:9b              # Google's open model
Test the Ollama API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Hello, how are you?",
  "stream": false
}'
If you get a JSON response with generated text, Ollama is working.
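You can also chat with the model straight from the terminal, which is a handy sanity check before Open WebUI is running:
docker exec -it ollama ollama run llama3.1:8b   # interactive chat; type /bye to exit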
Step 4: Configuring Open WebUI
Open your browser and navigate to:
http://<VM-IP>:3000
First-Time Setup
- Create an admin account (first user automatically becomes admin)
- Set a strong password – this is your AI assistant gateway
- Open WebUI will auto-detect Ollama at the configured URL
Connecting to Ollama
Open WebUI should automatically connect to Ollama using the OLLAMA_BASE_URL environment variable we set in Docker Compose. Verify by clicking Settings > Connections and confirming the Ollama URL shows http://ollama:11434 with a green status.
Key Settings to Configure
- Settings > General: Set default model to llama3.1:8b
- Settings > Interface: Enable chat history, code highlighting
- Settings > Models: View and manage downloaded models
- Settings > Audio: Enable speech-to-text if desired
- Settings > Images: Configure image generation if using a compatible model
Creating Custom Modelfiles
You can create specialized assistants using Ollama Modelfiles. Example – A coding assistant:
FROM codellama:7b
SYSTEM """You are an expert programmer. You write clean, efficient
code with clear comments. When asked about code, provide working
examples with explanations."""
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
Save this as coding-assistant.modelfile and create it:
docker exec -it ollama ollama create coding-assistant \
  -f /path/to/coding-assistant.modelfile
This model then appears in Open WebUI as a selectable assistant.
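One gotcha: the -f path in the create command must exist inside the Ollama container, not just on the VM. A simple workaround (a sketch; the /tmp destination is an arbitrary choice) is to copy the Modelfile in first:
docker cp coding-assistant.modelfile ollama:/tmp/coding-assistant.modelfile   # copy into the container
docker exec -it ollama ollama create coding-assistant \
  -f /tmp/coding-assistant.modelfile                                          # build from inside the container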
Step 5: NAS Storage Integration
Storing models and data on your NAS ensures persistence and makes backups straightforward.
Mount NFS Shares on the VM
Install NFS client:
sudo apt install -y nfs-common
Create mount points:
sudo mkdir -p /mnt/nas/ollama-models
sudo mkdir -p /mnt/nas/openwebui-data
Add to /etc/fstab for persistent mounts:
192.168.1.100:/volume1/ai-models /mnt/nas/ollama-models nfs defaults,_netdev 0 0
192.168.1.100:/volume1/ai-data /mnt/nas/openwebui-data nfs defaults,_netdev 0 0
Mount everything:
sudo mount -a
Verify mounts:
df -h | grep nas
Replace 192.168.1.100 with your NAS IP and adjust the share paths to match your NAS configuration (Synology, TrueNAS, etc.).
Important: Make sure Docker containers have write permissions to these mount points. Set ownership if needed:
sudo chown -R 1000:1000 /mnt/nas/ollama-models
sudo chown -R 1000:1000 /mnt/nas/openwebui-data
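To confirm writes actually work over NFS (squash settings on the NAS can silently block them), a quick test as UID 1000, the UID used above, is:
sudo -u '#1000' touch /mnt/nas/ollama-models/.write-test && \
  sudo -u '#1000' rm /mnt/nas/ollama-models/.write-test && echo "ollama-models is writable"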
Backup Strategy
- NAS snapshots protect model data and conversations
- Export Open WebUI settings periodically from the admin panel
- Keep docker-compose.yml in a Git repository
- Document your Modelfile customizations
Step 6: Networking and Remote Access
Expose Services on Your LAN
By default, the services are accessible at:
- Ollama API: http://<VM-IP>:11434
- Open WebUI: http://<VM-IP>:3000
To make Ollama accessible to other machines on your network, ensure the Ollama container binds to 0.0.0.0 (default in our Docker Compose config).
Reverse Proxy with Nginx
Install Nginx on the VM (or use a dedicated reverse proxy VM):
sudo apt install -y nginx
Create /etc/nginx/sites-available/ai-assistant:
server {
    listen 80;
    server_name ai.homelab.local;

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
Enable the site:
sudo ln -s /etc/nginx/sites-available/ai-assistant \
  /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
Add ai.homelab.local to your DNS server or local hosts file pointing to the VM IP address.
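If you do not run a local DNS server, a hosts-file entry on each client machine works for testing. The 192.168.1.205 address below is an assumed VM IP; substitute your own:
192.168.1.205   ai.homelab.local   # add to /etc/hosts on each client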
SSL with Let’s Encrypt (if publicly accessible)
sudo apt install -y certbot python3-certbot-nginx
sudo certbot --nginx -d ai.yourdomain.com
Firewall Rules
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow 3000/tcp
sudo ufw allow 11434/tcp
sudo ufw enable
Performance Tuning and Optimization
GPU Memory Management
- The RTX 3060 Ti has 8GB VRAM, which limits model size
- Stick to 7B-8B parameter models for best performance
- Use Q4_K_M quantized models for the best speed/quality balance
- Monitor VRAM usage: nvidia-smi -l 1 (updates every second)
Model Quantization Guide
Q4_K_M - Best balance of speed and quality (recommended)
Q5_K_M - Slightly better quality, slightly slower
Q8_0   - Near full quality, uses significantly more VRAM
F16    - Full precision, requires 2x the VRAM (not for 8GB cards)
Context Length vs. Speed
- Default context: 2048 tokens (fast, limited memory)
- Extended context: 4096 tokens (good balance)
- Maximum context: 8192+ tokens (slower, more VRAM usage)
Set in your Modelfile or at runtime:
PARAMETER num_ctx 4096
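At runtime, the same option can be passed per request through the options field of the Ollama API. A sketch against the /api/generate endpoint used earlier (prompt text is arbitrary):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize the benefits of GPU passthrough in Proxmox.",
  "stream": false,
  "options": { "num_ctx": 4096 }
}'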
Monitoring Resource Usage
watch -n 1 nvidia-smi   # GPU monitoring
htop                    # CPU and RAM monitoring
docker stats            # Container resource usage
iostat -x 1             # Disk I/O monitoring
Conclusion
After following this guide, you now have a fully self-hosted AI assistant running on your Proxmox homelab. Your data stays private, your models run locally on your GPU, and you have a clean web interface for interacting with multiple AI models.
The entire stack – Ollama for inference, Open WebUI for the interface, NAS for storage – runs reliably as a set of Docker containers inside a Proxmox VM with GPU passthrough. It survives reboots, updates cleanly, and scales as you add more models.
This is what homelabbing is about: taking control of your own infrastructure and running services that matter to you. A private AI assistant is one of the most practical and rewarding projects you can build today.
Real-World Deployment Tips
- Start with small models first: Pull llama3.1:8b before anything else. It fits comfortably in 8GB VRAM and responds fast. Get everything working before experimenting with larger models.
- Use NAS storage from day one: Do not store models on the VM’s local disk. When you inevitably rebuild the VM, you will lose hours re-downloading models. NAS storage makes rebuilds trivial.
- Pin your Docker image versions: Use specific tags instead of “latest” in production. An unexpected update broke my Open WebUI setup once when the API format changed between versions.
- Set OLLAMA_NUM_PARALLEL=1: On an 8GB card, running multiple concurrent requests causes out-of-memory crashes. Limit Ollama to one request at a time with this environment variable (see the compose snippet after this list).
- Monitor VRAM proactively: Add nvidia-smi -l 5 to a tmux session so you always see GPU memory usage. VRAM exhaustion causes silent failures that are hard to debug.
- Enable Docker restart policies: The "unless-stopped" restart policy in our Docker Compose file means containers recover automatically after host reboots or crashes.
- Test your NFS mounts under load: Some NAS devices throttle NFS under heavy I/O. Run model inference while monitoring NAS performance to catch bottlenecks early.
- Keep a shell alias for quick model pulls: Add to your .bashrc:
alias opull='docker exec -it ollama ollama pull'
Then pulling models is just:
opull mistral:7b
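As referenced in the OLLAMA_NUM_PARALLEL tip above, the variable belongs in the ollama service's environment block of docker-compose.yml. A minimal sketch of just that fragment:
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - OLLAMA_NUM_PARALLEL=1   # serve one request at a time to avoid VRAM exhaustion
Re-run docker compose up -d afterwards so the container is recreated with the new variable.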
Honest Takeaways and Lessons Learned
- Local LLMs are not ChatGPT replacements (yet): The 7B-9B models that fit on an 8GB GPU are impressive but noticeably less capable than GPT-4 or Claude for complex reasoning. They excel at drafting, summarization, code completion, and brainstorming. Manage your expectations accordingly.
- GPU passthrough is the hardest part: Getting IOMMU groups clean, VFIO binding correct, and the GPU visible inside the VM took more troubleshooting than the entire rest of the stack combined. Once it works, it stays working, but expect 2-4 hours of debugging on your first attempt.
- Open WebUI is surprisingly polished: I expected a rough open-source interface. Instead, Open WebUI is genuinely pleasant to use daily. The chat interface, model switching, conversation history, and document upload features rival commercial products.
- Storage adds up fast: Each 7B model is 4-5GB. If you start collecting models (and you will), budget 100-200GB of NAS storage. I currently have 12 models taking up 67GB.
- The privacy benefit is real: Once you start using a local AI for sensitive queries – tax questions, medical research, private code review – you realize how uncomfortable it was sending that data to third-party servers. This alone justifies the project.
- Docker makes everything easier: Without Docker and the NVIDIA Container Toolkit, this setup would involve painful manual dependency management. The containerized approach means clean upgrades and easy rollbacks.
- Community models keep getting better: The open-source LLM ecosystem is evolving rapidly. Models that were state-of-the-art six months ago are now outperformed by newer releases. Check Ollama’s model library regularly for improvements.
Common Pitfalls and How to Avoid Them
Pitfall 1: IOMMU Group Conflicts
Problem: Your GPU shares an IOMMU group with other devices.
Solution: Check groups with:
find /sys/kernel/iommu_groups/ -type l
If your GPU is not in a clean group, you may need an ACS override patch or a different PCIe slot. Move the GPU to a slot that isolates it in its own IOMMU group.
Pitfall 2: NVIDIA Driver Conflicts on Proxmox Host
Problem: The Proxmox host loads NVIDIA drivers before VFIO can claim the GPU.
Solution: Blacklist nouveau and nvidia in /etc/modprobe.d/blacklist.conf and ensure VFIO modules load first. Add softdep nvidia pre: vfio-pci to modprobe configuration.
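A hedged example of how those softdep lines can be added on the Proxmox host (appended here to the vfio.conf created earlier; the exact filename is flexible):
echo "softdep nouveau pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
echo "softdep nvidia pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
update-initramfs -u -k all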
Pitfall 3: Docker Cannot See the GPU
Problem: docker run --gpus all fails with “could not select device driver”.
Solution: The NVIDIA Container Toolkit is not installed or not configured. Run:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Pitfall 4: Open WebUI Cannot Connect to Ollama
Problem: Open WebUI shows “Connection failed” for the Ollama backend.
Solution: Ensure both containers are on the same Docker network (Docker Compose handles this automatically). Verify the OLLAMA_BASE_URL is set to http://ollama:11434 (using the container name, not localhost).
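Two quick checks from the VM help narrow this down (the ai-stack network name assumes the compose project directory from Step 3):
curl http://localhost:11434/api/tags      # confirm Ollama itself answers on the published port
docker network ls | grep ai-stack         # confirm the Compose network both containers join exists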
Pitfall 5: Models Disappear After VM Reboot
Problem: Downloaded models are gone after restarting the VM.
Solution: The NFS/SMB mount is not persisting across reboots. Add the mount to /etc/fstab with the _netdev option and verify with sudo mount -a after reboot.
Pitfall 6: Out of Memory (OOM) Crashes
Problem: Ollama crashes or returns errors during inference.
Solution: You are likely running a model too large for your VRAM. Stick to 7B-8B models on 8GB cards. Set OLLAMA_NUM_PARALLEL=1 to prevent concurrent requests from exceeding VRAM. Monitor with nvidia-smi.
Pitfall 7: Slow Model Loading from NAS
Problem: Models take a very long time to load initially.
Solution: NFS over a 1Gbps connection is the bottleneck. Models are 4-5GB each, so initial load takes 30-40 seconds. Consider 10Gbps networking or storing frequently-used models on local SSD with NAS as backup.
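If you go the local-SSD route, one simple pattern (a sketch, assuming a local /opt/ollama-models directory replaces the NAS path in the compose volume) is to serve models from the SSD and mirror them to the NAS as a backup:
# In docker-compose.yml, point the ollama volume at local SSD, e.g.:
#   - /opt/ollama-models:/root/.ollama
rsync -a --delete /opt/ollama-models/ /mnt/nas/ollama-models/   # periodic backup to the NAS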
Pitfall 8: GPU Passthrough Breaks After Proxmox Update
Problem: GPU passthrough stops working after a Proxmox kernel update.
Solution: Kernel updates can change IOMMU behavior. After updates, verify VFIO binding:
lspci -nnk -s 27:00
dmesg | grep -i vfio
update-initramfs -u -k all
Always test GPU passthrough after host kernel updates before relying on the AI assistant for important work.
LinkedIn Version
BUILDING A PRIVATE AI ASSISTANT ON MY HOMELAB
I built a self-hosted AI assistant using Ollama and Open WebUI, running on my Proxmox homelab with an NVIDIA RTX 3060 Ti.
Why? Privacy. Control. Learning.
Every prompt I type stays on my network. My models run on my GPU. My conversations are stored on my NAS. Nothing goes to the cloud.
The stack:
- Proxmox VE for virtualization
- Ubuntu VM with GPU passthrough (PCIe/IOMMU)
- Ollama for local LLM inference
- Open WebUI for a ChatGPT-like interface
- NAS integration for persistent model storage
What surprised me:
- Open WebUI is genuinely polished – rivals commercial AI interfaces
- GPU passthrough was the hardest part (expect 2-4 hours first time)
- 7B/8B models on an 8GB GPU are great for daily tasks
- The privacy benefit is more significant than I expected
The open-source AI ecosystem has matured to the point where running your own AI assistant is not just possible – it is practical.
If you have a homelab with a spare GPU, this is one of the most rewarding projects you can build right now.
Full setup guide on my blog: NetworkThinkTank.blog
#homelab #AI #selfhosted #Ollama #OpenWebUI #Proxmox #privacy #LLM #artificialintelligence #homelabbing
Social Media Teaser
Just deployed a fully self-hosted AI assistant on my Proxmox homelab using Ollama and Open WebUI – complete with GPU passthrough and NAS storage integration. Every prompt stays private, every model runs locally, and the web interface rivals ChatGPT. Full build guide with Docker configs and real deployment tips on the blog.
Follow-Up Article Ideas
- “Scaling Up: Adding a Second GPU to Your Ollama Homelab for Larger Language Models” – Covers multi-GPU passthrough in Proxmox, running 13B and 70B parameter models across multiple GPUs, VRAM pooling strategies, and benchmarking multi-GPU vs. single-GPU inference performance.
- “Building a RAG Pipeline: Teaching Your Self-Hosted AI About Your Own Documents” – Covers Retrieval Augmented Generation (RAG) setup with Open WebUI’s document upload feature, embedding models, vector databases (ChromaDB), indexing your personal knowledge base, and making your AI assistant an expert on your own files.
- “Hardening Your Self-Hosted AI: Security Best Practices for Homelab LLM Deployments” – Covers network segmentation for AI services, authentication and access control in Open WebUI, SSL/TLS configuration, firewall rules, monitoring for unauthorized access, Docker security hardening, and safely exposing your AI assistant outside your home network with VPN or Cloudflare Tunnel.