Date: May 10, 2026 | Tags: OpenClaw, Health Monitoring, Proxmox, systemd, Discord, Automation
Overview
The OpenClaw Health Monitor is a systemd-based health checking system that runs on the Proxmox host (10.10.20.252). It monitors critical OpenClaw services every 15 minutes and sends alerts to Discord when issues are detected. This post covers the architecture, components, and management of the health monitoring system.
How It Works
The health monitoring system uses a systemd timer (openclaw-health-alerts.timer) that triggers a Python health check script every 15 minutes. The script checks each monitored service via HTTP requests and reports any failures to Discord through a relay server.
The alert flow is as follows:
1. Timer triggers the service every 15 minutes
2. Health check script checks each monitored service
3. If any service is CRITICAL, an alert payload is built
4. Alert is sent via HTTP POST to the relay server
5. Relay forwards the payload to Discord webhook
6. Success is indicated by HTTP 204 response from Discord
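For reference, a systemd timer/service pair for this kind of setup looks roughly like the sketch below. Only the timer name and the 15-minute cadence come from the deployment described above; the service unit contents and the script path are illustrative.
# /etc/systemd/system/openclaw-health-alerts.timer
[Unit]
Description=Run the OpenClaw health check every 15 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=15min
Unit=openclaw-health-alerts.service

[Install]
WantedBy=timers.target

# /etc/systemd/system/openclaw-health-alerts.service (script path is an assumption)
[Unit]
Description=OpenClaw health check and Discord alerting

[Service]
Type=oneshot
ExecStart=/usr/bin/python3 /opt/openclaw/health_check.py
Enable it with systemctl enable --now openclaw-health-alerts.timer and confirm the schedule with systemctl list-timers.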
Successfully completed the migration of both Ollama model data and OpenWebUI application data from local VM storage to a QNAP NAS (TS-453D) via NFS. This involved diagnosing and fixing a broken NAS mount configuration, correcting NFS export issues on the QNAP, and performing live data migrations with minimal downtime.
Background
The Ollama AI stack (Ollama + Open WebUI) runs on VM 205 (IP: 10.10.20.39) inside a Proxmox VE cluster (host IP: 10.10.20.252). The stack was originally configured to store both model weights and OpenWebUI application data on a NAS via NFS mounts, but at some point the NAS mounts broke, causing the stack to fail. A previous workaround switched the Docker volumes to local paths to restore service, but the goal was always to move the data back to the NAS for centralized storage and backup.
Phase 1: NAS Mount Investigation & Diagnosis
The Problem
The NFS mounts on VM 205 were broken. The original /etc/fstab had two problems:
Wrong NAS IP in fstab: The fstab pointed to 10.10.20.6, but the actual NAS IP is 10.10.20.4. Pinging 10.10.20.6 returned 100% packet loss. Pinging 10.10.20.4 succeeded with <1ms latency.
Wrong path format: The fstab used Synology-style paths (/volume1/...) but the NAS is a QNAP TS-453D running QTS 5.2.3. QNAP uses different NFS export paths.
NAS Discovery
Model: QNAP TS-453D
Firmware: QTS 5.2.3.3451
CPU: Intel Celeron J4125 (4C/4T, 2.7GHz)
RAM: 4 GB
IP: 10.10.20.4
Hostname: NAS5F20EF
MAC: 24:5A:8E:5F:20:F0
Open Ports on NAS
22 (SSH): open
111 (rpcbind): open
443 (HTTPS): open
445 (SMB): open
2049 (NFS): open
8080 (HTTP): open
NFS Export Issue
Even though the NFS service was running (confirmed via rpcinfo showing NFS v2/3/4 on TCP/UDP port 2049), showmount -e 10.10.20.4 returned an empty export list. SSH into the QNAP revealed the /etc/exports file contained an invalid NFS option read-wr instead of rw, causing exportfs to silently fail with no active exports.
Phase 2: NFS Export Fix
SSH into QNAP NAS: ssh jczaldivar@10.10.20.4
Backup original exports: sudo cp /etc/exports /etc/exports.bak
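For reference, the corrected exports look roughly like this. Only the rw fix and the share paths come from this migration; the client scope and the extra options are assumptions:
"/share/CACHEDEV1_DATA/ollama-models" 10.10.20.0/24(rw,async,no_subtree_check,insecure,no_root_squash)
"/share/CACHEDEV1_DATA/openwebui-data" 10.10.20.0/24(rw,async,no_subtree_check,insecure,no_root_squash)
After editing, re-export and confirm the shares are actually published:
sudo exportfs -ra
showmount -e 10.10.20.4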
Phase 3: Ollama Model Migration
The NAS already contained a full set of Ollama models from a previous configuration (9.8 GB):
nomic-embed-text:latest – 274 MB (ID 0a195f422b47)
qwen2.5-coder:7b – 4.7 GB (ID dae161a2)
llama3.1:8b – 4.9 GB (ID 46e0d1c039e)
tinyllama:latest – 637 MB (ID 26449156de35)
The NFS share was mounted on VM 205 at /mnt/nas/ollama-models, docker-compose.yml was updated to use the NAS path, and the container was recreated, immediately picking up all 4 models from the NAS.
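For reference, the mount and the compose change were along these lines (mount options trimmed):
sudo mkdir -p /mnt/nas/ollama-models
sudo mount -t nfs 10.10.20.4:/ollama-models /mnt/nas/ollama-models
And in docker-compose.yml, the ollama service volume becomes:
    volumes:
      - /mnt/nas/ollama-models:/root/.ollama
A docker compose up -d --force-recreate then brings the container back up against the NAS-backed model store.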
Phase 4: OpenWebUI Data Migration
Two copies of the OpenWebUI data existed:
NAS (pre-existing): 890 MB – older copy from the previous config
Local VM (active): 934 MB – the current working copy
Mounted NFS share: mount -t nfs 10.10.20.4:/openwebui-data /mnt/nas/openwebui-data
Stopped open-webui container for data consistency
Rsync’d local data to NAS: rsync -av --delete – transferred 934 MB at ~110 MB/sec
Updated docker-compose.yml volume to NAS path
Updated /etc/fstab with correct NAS IP (10.10.20.4) and paths
Restarted all containers – both recreated with NAS mounts and running healthy
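For reference, the corrected /etc/fstab entries on VM 205 look roughly like this (mount options are an assumption; _netdev keeps the NFS mounts from racing the network at boot):
10.10.20.4:/ollama-models    /mnt/nas/ollama-models    nfs  defaults,_netdev  0  0
10.10.20.4:/openwebui-data   /mnt/nas/openwebui-data   nfs  defaults,_netdev  0  0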
Storage path: Ollama reads/writes model data from /root/.ollama inside the container, which maps to /mnt/nas/ollama-models on VM 205, which is an NFS mount to 10.10.20.4:/ollama-models on the QNAP NAS (internal path: /share/CACHEDEV1_DATA/ollama-models).
WebUI data path: Open WebUI reads/writes app data from /app/backend/data inside the container, which maps to /mnt/nas/openwebui-data on VM 205, which is an NFS mount to 10.10.20.4:/openwebui-data on the QNAP NAS (internal path: /share/CACHEDEV1_DATA/openwebui-data).
Always verify the NAS IP: The fstab had a stale IP (10.10.20.6) that no longer existed. The actual NAS was at 10.10.20.4.
QNAP != Synology paths: Synology uses /volume1/share-name, QNAP uses /share/CACHEDEV1_DATA/share-name internally and exports as /share-name.
Check /etc/exports directly: The QNAP web UI showed NFS enabled, but the underlying exports file had an invalid option (read-wr instead of rw), causing silent export failures.
showmount -e is your friend: Quick way to verify NFS exports are actually published.
Keep local backups: The docker-compose.yml.bak file preserved the original NAS-based config for easy restoration.
Status: Complete
Both Ollama models (9.8 GB, 4 models) and OpenWebUI data (890 MB) are now running from the QNAP NAS via NFS. The stack is fully operational with all 4 models accessible and OpenWebUI responding healthy on port 3000. Migration completed April 30, 2026.
This guide covers the essential steps for troubleshooting Proxmox boot failures and network recovery issues. Whether you’re dealing with a Proxmox VE node that won’t boot, network interfaces that fail to initialize, or connectivity problems after an update, this visual reference provides a structured approach to diagnosing and resolving common Proxmox infrastructure issues.
Today I completed a BIOS update on my MSI MAG X570S Tomahawk Max WiFi motherboard, upgrading from the original BIOS Version 1.00 (dated 07/06/2021) to the latest Version 1.D1 (7D54v1D1, dated 09/19/2025). This update was performed using MSI’s M-FLASH utility as part of my Proxmox homelab infrastructure maintenance.
The visual guide above outlines the complete BIOS update process using MSI’s M-FLASH utility. Here’s a summary of the key steps performed:
Downloaded BIOS version 7D54v1D1 from MSI official support page
Extracted BIOS file (E7D54AMS.1D1) to a FAT32-formatted USB drive
Stopped all Proxmox VMs and containers before rebooting
Rebooted into BIOS and used M-FLASH to flash the new BIOS
Confirmed BIOS updated to Version 1.D1 (Release Date: 09/19/2025) via dmidecode
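The post-flash verification is a one-liner from the Proxmox shell (output trimmed to the relevant fields):
sudo dmidecode -t bios | grep -E "Version|Release Date"
        Version: 1.D1
        Release Date: 09/19/2025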
For detailed step-by-step documentation of this BIOS update process, visit the NetworkThinkTank-Labs GitHub repository. This motherboard serves as the foundation for my Proxmox homelab running GPU passthrough with an NVIDIA RTX 3060 Ti for AI workloads.
Some days in the homelab are quiet — a config tweak here, a firmware update there. And then there are days like today. April 29, 2026, turned into a full-blown infrastructure marathon: eight distinct projects spanning networking, virtualization, AI deployment, storage management, and documentation. Here is a complete rundown of everything that got done.
1. GitHub Documentation — 565 Lines of Technical Writing
Documentation is the backbone of any serious homelab. Today I pushed 565 lines of new documentation across multiple GitHub repositories. This included updated READMEs, configuration guides, topology diagrams, and step-by-step walkthroughs. Every project in my lab now has proper technical documentation that anyone can follow to replicate the setup. If it is not documented, it did not happen — and today, it all got documented.
2. EVE-NG CCNA Lab Updates — 29 Nodes
My EVE-NG CCNA lab got a major overhaul. The lab now contains 29 active nodes, covering routing, switching, and network services. This includes Cisco IOS routers and switches configured for OSPF, EIGRP, BGP, VLANs, STP, ACLs, NAT, and DHCP. The lab features API troubleshooting support, Proxmox migration readiness via Terraform and Ansible, and qcow2 image management. Whether you are studying for the CCNA or just want a robust network simulation environment, this lab has you covered.
3. Blog Post Publishing — 2,943 Words
Earlier this month, I published a comprehensive 2,943-word blog post on the Network ThinkTank blog covering how to self-host AI on a Proxmox homelab with Ollama and Open WebUI. Today’s writing adds to that momentum. Consistent publishing is key to building a knowledge base that helps both myself and the broader homelab community.
4. Ollama Models Deployment — 4 Models
Local AI is the future of privacy-conscious computing. Today I deployed four Ollama models on my Proxmox homelab, running inference entirely on local hardware. The models are served through Open WebUI, giving me a polished ChatGPT-like interface without any data leaving my network. No API keys, no cloud dependency, no privacy concerns — just pure local LLM power. The models cover different use cases from general conversation to code generation and technical assistance.
5. OpenClaw AI Agent Deployment
OpenClaw, an AI agent framework, was deployed and configured in the homelab today. This adds autonomous AI agent capabilities to the infrastructure, enabling task automation and intelligent workflows. The deployment involved setting up the agent runtime, configuring API endpoints, and testing basic agent interactions. This is a step toward building a more intelligent, self-managing homelab environment.
6. Windows Server VM Build
A fresh Windows Server virtual machine was built from scratch today. This VM will serve as a core infrastructure component for Active Directory, DNS, DHCP, and Group Policy management. The build process included creating the VM in the hypervisor, installing the OS, applying initial configurations, and setting up remote management. Having a Windows Server in the lab opens up enterprise-grade identity and access management capabilities.
7. NAS Storage Cleanup — ~20GB Freed
Storage hygiene is critical in any homelab environment. Today’s cleanup operation freed approximately 20GB of space on the NAS. This involved removing outdated VM snapshots, clearing old ISO images, purging stale Docker volumes, and archiving completed project files. A clean NAS is a happy NAS — and with 20GB reclaimed, there is plenty of room for new projects.
8. UniFi Network Server Installation
The UniFi Network Server was installed and configured today, bringing enterprise-grade network management to the homelab. This provides centralized control over UniFi access points, switches, and security gateways. The installation included setting up the controller software, adopting network devices, configuring wireless networks, and establishing monitoring dashboards. With UniFi in place, the entire network infrastructure can be managed from a single pane of glass.
Wrapping Up
Eight projects. One day. From AI deployments to network labs, from storage cleanup to documentation — today was a masterclass in homelab productivity. Every one of these projects builds on the others, creating a more capable, better-documented, and more resilient infrastructure.
The key takeaway? Documentation makes everything better. By writing things down — both in GitHub repos and blog posts — I am building a knowledge base that pays dividends every time I need to troubleshoot, replicate, or expand my setup.
If you are running a homelab, I encourage you to document your work, share your configs, and keep building. The community is stronger when we share what we learn.
Until next time — keep labbing.
Follow the Network ThinkTank blog for more homelab guides, networking tutorials, and infrastructure deep-dives. Check out the companion GitHub repositories at github.com/jczaldivar71 for configs, scripts, and technical documentation.
I got tired of sending my data to cloud AI services. Every prompt I typed into ChatGPT or Claude was being stored, analyzed, and used for training. For personal questions, code snippets with API keys, and private brainstorming sessions, that never sat well with me.
So I built my own. A fully self-hosted AI assistant running on my Proxmox homelab, powered by Ollama for local LLM inference and Open WebUI for a polished ChatGPT-like interface. The models run on my own NVIDIA GPU, the data stays on my NAS, and nothing leaves my network.
This guide walks you through exactly how I did it – from VM creation to pulling your first model and chatting with it through a clean web interface. If you have a Proxmox server and a spare GPU, you can have this running in an afternoon.
Prerequisites and Hardware Requirements
Here is what you need before starting:
Hardware
Proxmox VE host (version 7.x or 8.x)
NVIDIA GPU with at least 8GB VRAM (I use an RTX 3060 Ti 8GB)
Minimum 16GB RAM allocated to the AI VM (32GB recommended)
NAS with NFS or SMB shares available (Synology, TrueNAS, etc.)
At least 50GB free storage for models (100GB+ recommended)
Software
Proxmox VE installed and running
Ubuntu Server 22.04 or 24.04 LTS ISO
Docker and Docker Compose
NVIDIA drivers (535+ recommended)
NVIDIA Container Toolkit
Network
Static IP or DHCP reservation for the AI VM
Access to your NAS from the VM subnet
Optional: domain name for reverse proxy
Architecture Overview
The stack looks like this:
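(A rough text sketch; ports and NAS paths match the Docker Compose file later in this guide.)
Browser
  -> Open WebUI (port 3000), app data on /mnt/nas/openwebui-data (NFS to the NAS)
       -> Ollama API (port 11434), model files on /mnt/nas/ollama-models (NFS to the NAS)
            -> NVIDIA RTX 3060 Ti passed through to the Ubuntu VM on Proxmox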
All AI processing happens locally on the GPU inside the VM. Open WebUI provides the browser-based chat interface and connects to Ollama’s API on the backend. The NAS stores all model files and conversation data so nothing is lost if the VM needs rebuilding.
Step 1: Preparing the Proxmox VM
First, create a new VM in Proxmox optimized for AI workloads: Ubuntu Server 22.04 or 24.04, the RAM and storage from the prerequisites above, and the GPU passed through over PCIe.
Step 2: Verifying the GPU and Docker GPU Support
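With Ubuntu installed and the NVIDIA drivers (535+) in place inside the VM, confirm the card is visible:
nvidia-smi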
Expected output shows your RTX 3060 Ti with driver version and CUDA version. If nvidia-smi fails, check that the GPU passthrough is configured correctly on the Proxmox host.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
If this shows your GPU info inside the container, you are ready to deploy Ollama.
Step 3: Deploying Ollama
Create a project directory:
mkdir -p ~/ai-stack
cd ~/ai-stack
Create docker-compose.yml:
version: "3.8"

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - /mnt/nas/ollama-models:/root/.ollama
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - /mnt/nas/openwebui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
Start the stack:
docker compose up -d
Pull your first model:
docker exec -it ollama ollama pull llama3.1:8b
This downloads the Llama 3.1 8B parameter model, which is an excellent starting point for an 8GB GPU. The download is roughly 4.7GB and will be stored on your NAS mount.
Other Recommended Models for 8GB VRAM
ollama pull mistral:7b # Great for general tasks
ollama pull codellama:7b # Optimized for coding
ollama pull llama3.1:8b-instruct # Best for chat interactions
ollama pull phi3:mini # Microsoft's compact model
ollama pull gemma2:9b # Google's open model
Test the Ollama API
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Hello, how are you?",
"stream": false
}'
If you get a JSON response with generated text, Ollama is working.
Step 4: Configuring Open WebUI
Open your browser and navigate to:
http://<VM-IP>:3000
First-Time Setup
Create an admin account (first user automatically becomes admin)
Set a strong password – this is your AI assistant gateway
Open WebUI will auto-detect Ollama at the configured URL
Connecting to Ollama
Open WebUI should automatically connect to Ollama using the OLLAMA_BASE_URL environment variable we set in Docker Compose. Verify by clicking Settings > Connections and confirming the Ollama URL shows http://ollama:11434 with a green status.
Key Settings to Configure
Settings > General: Set default model to llama3.1:8b
Maximum context: 8192+ tokens (slower, more VRAM usage)
Set in your Modelfile or at runtime:
PARAMETER num_ctx 4096
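As a concrete example, a tiny Modelfile that builds a 4K-context variant (the custom model name is just an illustration):
FROM llama3.1:8b
PARAMETER num_ctx 4096
Build it with docker exec -it ollama ollama create llama31-4k -f /root/.ollama/Modelfile (drop the Modelfile into the NAS-backed models directory so the container can see it), then select the new model in Open WebUI.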
Monitoring Resource Usage
watch -n 1 nvidia-smi # GPU monitoring
htop # CPU and RAM monitoring
docker stats # Container resource usage
iostat -x 1 # Disk I/O monitoring
Conclusion
After following this guide, you now have a fully self-hosted AI assistant running on your Proxmox homelab. Your data stays private, your models run locally on your GPU, and you have a clean web interface for interacting with multiple AI models.
The entire stack – Ollama for inference, Open WebUI for the interface, NAS for storage – runs reliably as a set of Docker containers inside a Proxmox VM with GPU passthrough. It survives reboots, updates cleanly, and scales as you add more models.
This is what homelabbing is about: taking control of your own infrastructure and running services that matter to you. A private AI assistant is one of the most practical and rewarding projects you can build today.
Real-World Deployment Tips
Start with small models first: Pull llama3.1:8b before anything else. It fits comfortably in 8GB VRAM and responds fast. Get everything working before experimenting with larger models.
Use NAS storage from day one: Do not store models on the VM’s local disk. When you inevitably rebuild the VM, you will lose hours re-downloading models. NAS storage makes rebuilds trivial.
Pin your Docker image versions: Use specific tags instead of “latest” in production. An unexpected update broke my Open WebUI setup once when the API format changed between versions.
Set OLLAMA_NUM_PARALLEL=1: On an 8GB card, running multiple concurrent requests causes out-of-memory crashes. Limit Ollama to one request at a time with this environment variable.
Monitor VRAM proactively: Add nvidia-smi -l 5 to a tmux session so you always see GPU memory usage. VRAM exhaustion causes silent failures that are hard to debug.
Enable Docker restart policies: The “unless-stopped” restart policy in our Docker Compose file means containers recover automatically after host reboots or crashes.
Test your NFS mounts under load: Some NAS devices throttle NFS under heavy I/O. Run model inference while monitoring NAS performance to catch bottlenecks early.
Keep a shell alias for quick model pulls: Add to your .bashrc:
alias opull='docker exec -it ollama ollama pull'
Then pulling models is just opull mistral:7b.
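To wire the OLLAMA_NUM_PARALLEL tip above into the stack, the ollama service's environment block in docker-compose.yml simply gains one line:
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - OLLAMA_NUM_PARALLEL=1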
Honest Takeaways and Lessons Learned
Local LLMs are not ChatGPT replacements (yet): The 7B-9B models that fit on an 8GB GPU are impressive but noticeably less capable than GPT-4 or Claude for complex reasoning. They excel at drafting, summarization, code completion, and brainstorming. Manage your expectations accordingly.
GPU passthrough is the hardest part: Getting IOMMU groups clean, VFIO binding correct, and the GPU visible inside the VM took more troubleshooting than the entire rest of the stack combined. Once it works, it stays working, but expect 2-4 hours of debugging on your first attempt.
Open WebUI is surprisingly polished: I expected a rough open-source interface. Instead, Open WebUI is genuinely pleasant to use daily. The chat interface, model switching, conversation history, and document upload features rival commercial products.
Storage adds up fast: Each 7B model is 4-5GB. If you start collecting models (and you will), budget 100-200GB of NAS storage. I currently have 12 models taking up 67GB.
The privacy benefit is real: Once you start using a local AI for sensitive queries – tax questions, medical research, private code review – you realize how uncomfortable it was sending that data to third-party servers. This alone justifies the project.
Docker makes everything easier: Without Docker and the NVIDIA Container Toolkit, this setup would involve painful manual dependency management. The containerized approach means clean upgrades and easy rollbacks.
Community models keep getting better: The open-source LLM ecosystem is evolving rapidly. Models that were state-of-the-art six months ago are now outperformed by newer releases. Check Ollama’s model library regularly for improvements.
Common Pitfalls and How to Avoid Them
Pitfall 1: IOMMU Group Conflicts
Problem: Your GPU shares an IOMMU group with other devices. Solution: Check groups with:
find /sys/kernel/iommu_groups/ -type l
If your GPU is not in a clean group, you may need an ACS override patch or a different PCIe slot. Move the GPU to a slot that isolates it in its own IOMMU group.
Pitfall 2: NVIDIA Driver Conflicts on Proxmox Host
Problem: The Proxmox host loads NVIDIA drivers before VFIO can claim the GPU. Solution: Blacklist nouveau and nvidia in /etc/modprobe.d/blacklist.conf and ensure VFIO modules load first. Add softdep nvidia pre: vfio-pci to modprobe configuration.
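Roughly what that configuration looks like on the Proxmox host (file names under /etc/modprobe.d/ vary; the blacklist entries and the softdep line are the ones called out above):
# /etc/modprobe.d/blacklist.conf
blacklist nouveau
blacklist nvidia

# /etc/modprobe.d/vfio.conf
softdep nvidia pre: vfio-pci
# options vfio-pci ids=<vendor:device IDs of your GPU and its audio function>
Rebuild the initramfs with update-initramfs -u -k all and reboot for the change to take effect.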
Pitfall 3: Docker Cannot See the GPU
Problem: docker run --gpus all fails with “could not select device driver”. Solution: The NVIDIA Container Toolkit is not installed or not configured. Run:
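(A typical install-and-register sequence on Ubuntu, assuming NVIDIA's apt repository has already been added:)
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker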
Pitfall 4: Open WebUI Cannot Connect to Ollama
Problem: Open WebUI shows “Connection failed” for the Ollama backend. Solution: Ensure both containers are on the same Docker network (Docker Compose handles this automatically). Verify the OLLAMA_BASE_URL is set to http://ollama:11434 (using the container name, not localhost).
Pitfall 5: Models Disappear After VM Reboot
Problem: Downloaded models are gone after restarting the VM. Solution: The NFS/SMB mount is not persisting across reboots. Add the mount to /etc/fstab with the _netdev option and verify with sudo mount -a after reboot.
Pitfall 6: Out of Memory (OOM) Crashes
Problem: Ollama crashes or returns errors during inference. Solution: You are likely running a model too large for your VRAM. Stick to 7B-8B models on 8GB cards. Set OLLAMA_NUM_PARALLEL=1 to prevent concurrent requests from exceeding VRAM. Monitor with nvidia-smi.
Pitfall 7: Slow Model Loading from NAS
Problem: Models take a very long time to load initially. Solution: NFS over a 1Gbps connection is the bottleneck. Models are 4-5GB each, so initial load takes 30-40 seconds. Consider 10Gbps networking or storing frequently-used models on local SSD with NAS as backup.
Pitfall 8: GPU Passthrough Breaks After Proxmox Update
Problem: GPU passthrough stops working after a Proxmox kernel update. Solution: Kernel updates can change IOMMU behavior. After updates, verify VFIO binding:
lspci -nnk -s 27:00
dmesg | grep -i vfio
update-initramfs -u -k all
Always test GPU passthrough after host kernel updates before relying on the AI assistant for important work.
LinkedIn Version
BUILDING A PRIVATE AI ASSISTANT ON MY HOMELAB
I built a self-hosted AI assistant using Ollama and Open WebUI, running on my Proxmox homelab with an NVIDIA RTX 3060 Ti.
Why? Privacy. Control. Learning.
Every prompt I type stays on my network. My models run on my GPU. My conversations are stored on my NAS. Nothing goes to the cloud.
The stack:
Proxmox VE for virtualization
Ubuntu VM with GPU passthrough (PCIe/IOMMU)
Ollama for local LLM inference
Open WebUI for a ChatGPT-like interface
NAS integration for persistent model storage
What surprised me:
Open WebUI is genuinely polished – rivals commercial AI interfaces
GPU passthrough was the hardest part (expect 2-4 hours first time)
7B/8B models on an 8GB GPU are great for daily tasks
The privacy benefit is more significant than I expected
The open-source AI ecosystem has matured to the point where running your own AI assistant is not just possible – it is practical.
If you have a homelab with a spare GPU, this is one of the most rewarding projects you can build right now.
Full setup guide on my blog: NetworkThinkTank.blog
Just deployed a fully self-hosted AI assistant on my Proxmox homelab using Ollama and Open WebUI – complete with GPU passthrough and NAS storage integration. Every prompt stays private, every model runs locally, and the web interface rivals ChatGPT. Full build guide with Docker configs and real deployment tips on the blog.
Follow-Up Article Ideas
“Scaling Up: Adding a Second GPU to Your Ollama Homelab for Larger Language Models” – Covers multi-GPU passthrough in Proxmox, running 13B and 70B parameter models across multiple GPUs, VRAM pooling strategies, and benchmarking multi-GPU vs. single-GPU inference performance.
“Building a RAG Pipeline: Teaching Your Self-Hosted AI About Your Own Documents” – Covers Retrieval Augmented Generation (RAG) setup with Open WebUI’s document upload feature, embedding models, vector databases (ChromaDB), indexing your personal knowledge base, and making your AI assistant an expert on your own files.
“Hardening Your Self-Hosted AI: Security Best Practices for Homelab LLM Deployments” – Covers network segmentation for AI services, authentication and access control in Open WebUI, SSL/TLS configuration, firewall rules, monitoring for unauthorized access, Docker security hardening, and safely exposing your AI assistant outside your home network with VPN or Cloudflare Tunnel.
After weeks of building, troubleshooting, and optimizing my CCNA lab environment, I am excited to share the entire project — now fully documented and open-sourced on GitHub. This post walks through the journey from an initial EVE-NG deployment to a fully automated Proxmox-based lab using Terraform, Ansible, and custom shell scripts.
The EVE-NG CCNA Lab project started as a straightforward network simulation environment for CCNA study. It quickly evolved into a full infrastructure-as-code project covering:
EVE-NG lab deployment with API-driven automation
Migration from EVE-NG to Proxmox for better performance and scalability
Custom shell scripts for image management, licensing, and node orchestration
A Python script (generate_readme.py) to auto-generate comprehensive documentation
qcow2 disk image optimization achieving a 39% storage reduction
Terraform and Ansible playbooks for reproducible infrastructure deployment
GitHub Documentation and the generate_readme.py Script
One of the key pieces of this project is the generate_readme.py Python script. Rather than manually maintaining a README that would inevitably fall out of sync with the actual project structure, I wrote a script that scans the repository and automatically generates a comprehensive README.md file.
The script inspects every directory — configs/, scripts/, terraform-ansible/, topology/, and images/ — and produces a fully formatted document with a table of contents, script references, setup instructions, and troubleshooting tips. Running it is as simple as:
cd scripts/
python generate_readme.py
The generated README covers 13 sections including Overview, Project Structure, Prerequisites, Quick Start, Lab Topology, Scripts Reference, qcow2 Image Management, EVE-NG API Usage, Proxmox Deployment, Configuration Files, Troubleshooting, Known Limitations, and License information. At 340 lines, it serves as a complete guide for anyone wanting to replicate or build upon this lab.
EVE-NG to Proxmox Migration
EVE-NG is a fantastic network emulation platform, but I ran into limitations around resource management and integration with modern IaC tools. The decision to migrate to Proxmox was driven by several factors:
Better resource control: Proxmox provides fine-grained CPU, memory, and storage allocation through its API
Terraform integration: The Proxmox Terraform provider enables declarative infrastructure definitions
Thin provisioning: Proxmox handles thin-provisioned qcow2 images natively, which was critical for storage optimization
Ansible compatibility: Post-deployment configuration is seamless with Ansible playbooks targeting Proxmox VMs
The migration involved exporting router and switch images from EVE-NG, converting and optimizing the qcow2 disk images, and then redeploying them on Proxmox using Terraform. The entire workflow is captured in the terraform-ansible/ directory of the repository.
Automation Scripts
The scripts/ directory contains six purpose-built shell scripts that automate every aspect of lab management:
eve-ng-api-auth.sh: Handles cookie-based API authentication with EVE-NG, exporting session tokens for use in subsequent API calls. Includes examples for listing labs, getting node details, and starting all nodes.
start-lab-nodes.sh: Automates the process of starting all lab nodes through the EVE-NG REST API with proper sequencing and health checks.
scp-upload-images.sh: Securely transfers qcow2 images to the EVE-NG or Proxmox host via SCP with progress tracking and integrity verification.
qcow2-optimize.sh: The image optimization workhorse — converts, compresses, and thin-provisions qcow2 disk images (more on this below).
fix-permissions.sh: Ensures correct file ownership and permissions on EVE-NG image directories, a common source of lab startup failures.
iol-license-fix.sh: Generates and applies the proper IOL (IOS on Linux) license file, which is required for Cisco IOL images to boot correctly.
Each script is documented with usage instructions and can be run independently or chained together for a complete deployment workflow.
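For context, the cookie-based login that eve-ng-api-auth.sh wraps looks roughly like this (host and credentials are placeholders, and payload fields may vary by EVE-NG version):
# Log in and save the session cookie
curl -s -c /tmp/eve-cookie -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"eve","html5":"-1"}' \
  http://<eve-ng-host>/api/auth/login

# Reuse the cookie for later calls, e.g. listing folders and labs
curl -s -b /tmp/eve-cookie http://<eve-ng-host>/api/folders/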
qcow2 Image Optimization
One of the most impactful parts of this project was optimizing the qcow2 disk images. Network appliance images (Cisco IOSv, IOSvL2, CSR1000v, etc.) often ship with significant wasted space — preallocated but unused disk blocks that consume real storage.
The qcow2-optimize.sh script automates a multi-step optimization pipeline:
Sparsification: Uses virt-sparsify to zero out unused blocks within the guest filesystem
Compression: Applies qcow2 internal compression via qemu-img convert -c
Thin provisioning: Ensures metadata is set for thin-provisioned allocation on the hypervisor
Integrity check: Runs qemu-img check to verify image health post-optimization
The results were significant: total image storage dropped from 30GB to 18.3GB — a 39% reduction. This is especially meaningful in a home lab where storage is often limited. The optimized images boot identically to the originals but consume far less disk space on the Proxmox host.
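Per image, the pipeline boils down to something like this (file names are placeholders):
virt-sparsify ios-router.qcow2 ios-router.sparse.qcow2                        # zero out unused guest blocks
qemu-img convert -O qcow2 -c ios-router.sparse.qcow2 ios-router.opt.qcow2     # recompress the image
qemu-img check ios-router.opt.qcow2                                           # verify integrity before deploying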
Terraform and Ansible Deployment
The final piece of the puzzle is fully automated deployment using Terraform and Ansible. The terraform-ansible/ directory contains everything needed to stand up the lab from scratch:
Terraform handles the infrastructure provisioning:
VM creation on Proxmox with defined CPU, memory, and disk parameters
Network interface configuration with VLAN tagging
Cloud-init integration for initial bootstrapping
State management for tracking deployed resources
Ansible manages the post-deployment configuration:
init-proxmox.yml: Initializes the Proxmox host with required packages, storage configuration, and network bridges
deploy-vm.yml: Deploys individual VMs with their specific configurations
remove-gateway.yml: Cleans up default gateway routes that can interfere with lab routing exercises
Configuration variables are stored in group_vars/all.yml (with a .sample template provided), and the hosts inventory file defines the Proxmox target. The ansible.cfg sets sensible defaults for host key checking and privilege escalation.
With this setup, spinning up a complete CCNA lab goes from a manual multi-hour process to a single command:
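(Roughly speaking – the repo's exact wrapper may differ, but it boils down to one Terraform apply plus the Ansible playbook run:)
cd terraform-ansible/
terraform init && terraform apply -auto-approve   # provision the lab VMs on Proxmox
ansible-playbook -i hosts deploy-vm.yml           # post-deployment configuration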
This project is a living repository — I plan to continue adding to it as I progress through my CCNA studies and expand the lab. Future additions may include:
Additional topology configurations for specific CCNA exam topics
Integration with network monitoring tools
CI/CD pipeline for automated lab testing
Support for additional platforms (VIRL, GNS3)
If you are studying for the CCNA or building your own home lab, feel free to fork the repository and adapt it to your needs. Contributions and feedback are always welcome.
Hey everyone! If you have been following my blog, you know I love combining Python with network engineering. From automating backups with Netmiko to monitoring IP SLAs with DNA Center, I am always looking for ways to make our lives as network engineers easier. Today, I am excited to walk you through my latest project: an AI-Powered Network Health Checker. Don’t worry — this is totally beginner friendly. If you can write a basic Python script, you can follow along!
What Does This Tool Do?
In a nutshell, this tool pulls real-time data from your network devices (think CPU usage, memory utilization, interface errors, etc.), feeds that data into a simple machine learning model, and tells you whether each device is healthy or if there might be an issue. The output is super straightforward — you will see messages like “Device is healthy” or “Potential issue detected.” No PhD in data science required!
Step 1: Pulling Device Data with Python
Just like in my previous posts on network automation, we start by connecting to our devices and grabbing the data we need. I used the Netmiko library to SSH into each device and pull key metrics. Here is a simplified version of the script:
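(Credentials here are placeholders and the parsing is deliberately simple; treat this as a sketch of the approach rather than the exact script.)
import re
from netmiko import ConnectHandler

device = {
    "device_type": "cisco_ios",
    "host": "192.168.1.1",      # replace with your device
    "username": "admin",
    "password": "password",
}

def get_metrics(dev):
    conn = ConnectHandler(**dev)
    cpu_out = conn.send_command("show processes cpu | include CPU utilization")
    mem_out = conn.send_command("show processes memory | include Processor")
    err_out = conn.send_command("show interfaces | include errors")
    conn.disconnect()

    # Five-second CPU percentage, processor memory in use, and total input errors
    cpu = int(re.search(r"five seconds: (\d+)%", cpu_out).group(1))
    total, used = (int(x) for x in re.search(r"Total:\s*(\d+)\s*Used:\s*(\d+)", mem_out).groups())
    mem_pct = round(100 * used / total, 1)
    input_errors = sum(int(n) for n in re.findall(r"(\d+) input errors", err_out))
    return [cpu, mem_pct, input_errors]

print(get_metrics(device))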
This script connects to a Cisco IOS device, grabs CPU usage, memory utilization, and interface error counts. You can easily expand this to loop through multiple devices from an inventory file — just like we did in the backup config script project.
Step 2: Building a Simple ML Model for Anomaly Detection
Here is where the AI magic comes in — but I promise it is simpler than it sounds. We are using scikit-learn’s Isolation Forest algorithm, which is perfect for anomaly detection. It learns what “normal” looks like from your data and flags anything that seems off.
The Isolation Forest works by randomly partitioning data points. Anomalies are isolated faster because they are different from the majority of the data. The contamination parameter tells the model roughly what percentage of data points are expected to be anomalies — I set it to 0.1 (10%) as a starting point, but you can tune this for your environment.
Step 3: Putting It All Together
Now let us combine everything into a single script that loops through your devices, pulls the data, and runs it through the model:
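(The feature rows below are hard-coded stand-ins for what the get_metrics() collection in Step 1 returns, and contamination=0.1 is the 10% starting point mentioned in Step 2.)
from sklearn.ensemble import IsolationForest

# [cpu %, memory %, input errors] per device, as collected in Step 1
devices = ["core-sw1", "dist-rtr1", "edge-rtr1", "edge-rtr2"]
samples = [
    [12, 45, 0],
    [15, 47, 1],
    [11, 44, 0],
    [93, 88, 240],   # this one should stand out
]

model = IsolationForest(contamination=0.1, random_state=42)
labels = model.fit_predict(samples)   # 1 = normal, -1 = anomaly

for name, label in zip(devices, labels):
    if label == -1:
        print(f"{name}: Potential issue detected")
    else:
        print(f"{name}: Device is healthy")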
If you are a network engineer who is curious about AI and machine learning, this is a great beginner project to get your feet wet. You don’t need to understand every detail of how Isolation Forest works under the hood — just know that it is a tool that can help you spot problems before they become outages.
As always, if you have questions or want to share how you have customized this for your own network, drop a comment below or reach out to me. Happy automating!
Backing up network device configurations is one of the most critical tasks in network administration. A missed backup could mean hours of manual reconfiguration after a failure. In this post, we will walk through a Python script that automates this process by connecting to a router or switch via SSH and saving the running-config to a local file. We will use Netmiko, a popular Python library that simplifies SSH connections to network devices. Whether you manage a handful of switches or hundreds of routers, this script gives you a repeatable, automated way to capture configurations on demand.
Prerequisites
Before getting started, make sure you have:
Python 3.8 or higher installed
SSH access to your target network device
Device credentials (username and password)
The Netmiko library installed
To install Netmiko, run:
pip install netmiko
You can also clone the full project repository and install from its requirements file.
The core of our script uses Netmiko’s ConnectHandler to establish an SSH session. You provide the device type, hostname or IP address, and your credentials. Netmiko handles the SSH negotiation and drops you into an authenticated session.
Netmiko supports a wide range of device types including cisco_ios, cisco_nxos, arista_eos, and juniper_junos. You simply pass the appropriate device type string and Netmiko adapts its behavior accordingly.
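A minimal connection looks like this (host and credentials are placeholders):
from netmiko import ConnectHandler

device = {
    "device_type": "cisco_ios",
    "host": "192.168.1.1",
    "username": "admin",
    "password": "password",
    "port": 22,
}

net_connect = ConnectHandler(**device)
print(net_connect.find_prompt())   # prints the device prompt, confirming the session is live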
Retrieving the Running Configuration
Once connected, pulling the running configuration is a single method call. We use send_command to execute show running-config on the device and capture the output as a string:
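(This continues the net_connect session from the connection sketch above.)
running_config = net_connect.send_command("show running-config")
print(f"Retrieved {len(running_config)} characters of configuration")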
Netmiko handles paging automatically, so even if your configuration is long, you will get the complete output without needing to send space or press Enter to page through it.
Saving the Configuration to a File
With the configuration captured in memory, the next step is writing it to a file. Our script creates a backups directory automatically and saves each configuration with a timestamped filename so you never overwrite a previous backup:
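(A sketch that matches the timestamped filename pattern shown in the sample output below; the backups directory is created automatically if it does not exist.)
import os
from datetime import datetime

os.makedirs("backups", exist_ok=True)
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
safe_host = device["host"].replace(".", "_")
filename = f"backups/{safe_host}_running-config_{timestamp}.txt"

with open(filename, "w") as f:
    f.write(running_config)

net_connect.disconnect()
print(f"[+] Configuration saved to {filename}")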
When you run the script, you will see output like this:
[*] Connecting to 192.168.1.1:22 (cisco_ios)…
[+] Successfully connected to 192.168.1.1
[*] Retrieving running-config…
[+] Retrieved 15234 characters of configuration
[+] Configuration saved to backups/192_168_1_1_running-config_2026-04-10_14-30-00.txt
[+] Disconnected. Backup complete!
What is Next
This script provides a solid foundation for network configuration backups. Here are some ideas for extending it:
Loop through a list of devices from a CSV or YAML inventory file to back up your entire network in one run
Schedule the script with cron (Linux) or Task Scheduler (Windows) for automatic daily backups
Add email or webhook notifications on success or failure
Compare configurations between backups to detect unauthorized changes using difflib
Integrate with Git to version-control your configurations automatically
Wrapping Up
Automating network device backups does not have to be complicated. With Python and Netmiko, you can connect to any router or switch, pull the running configuration, and save it to a timestamped file in just a few lines of code.
If you found this useful, check out my other posts on network automation including Monitoring IP SLAs with Python, DNA Center, and NetBox and Build a Home Lab Like a Pro. Stay tuned for more content from the NetworkThinkTank!
If you manage a network of any size, you know that keeping tabs on performance metrics like latency, jitter, and packet loss is critical. Cisco IP SLA (Service Level Agreement) operations are the go-to feature for probing network paths and measuring these metrics directly from your routers and switches. But manually checking IP SLA statistics across dozens or hundreds of devices? That does not scale.
In this post, I will walk you through a Python-based tool I built that pulls IP SLA data from Cisco DNA Center via its REST API, enriches it with device metadata from NetBox, and generates automated performance reports. Whether you are running a handful of branch routers or a large enterprise campus, this approach gives you a scalable, repeatable way to monitor network performance.
What is IP SLA?
Cisco IP SLA is a built-in feature on Cisco IOS and IOS-XE devices that allows you to generate synthetic traffic to measure network performance. You can configure operations like ICMP echo (ping), UDP jitter, HTTP GET, and more. Each operation continuously measures metrics such as round-trip time (RTT), latency, jitter, packet loss, and availability. These metrics are essential for validating SLA compliance, troubleshooting performance issues, and capacity planning.
The Tools
This project brings together three key components. First, Python does the heavy lifting for API calls, data parsing, and report generation. Second, Cisco DNA Center provides a centralized REST API for pulling device inventory and running CLI commands across your entire network without SSH-ing into each device individually. Third, NetBox acts as our network source of truth, storing device metadata like site assignments, roles, platforms, and IP addresses that we use to enrich the raw SLA data.
How It Works
The IP SLA Monitor tool follows a simple three-step workflow:
1. Authenticate with DNA Center and pull IP SLA operation statistics from all monitored devices using the command-runner API.
2. Query NetBox for each device to enrich the data with site name, device role, platform, and management IP.
3. Evaluate each SLA operation against configurable thresholds for latency, jitter, and packet loss, then generate JSON and CSV reports.
The tool can run as a one-shot collection or in a continuous monitoring loop with a configurable polling interval. Alerts are logged to the console when any operation exceeds your defined thresholds.
The Python Code
The project is organized into three main scripts:
ip_sla_monitor.py is the main orchestration script that ties everything together. It loads configuration from a .env file, initializes the DNA Center and NetBox clients, collects SLA data, enriches it, evaluates thresholds, and saves reports.
dnac_integration.py handles all communication with the Cisco DNA Center REST API including authentication, device inventory retrieval, and IP SLA data collection via the command-runner API.
netbox_integration.py connects to the NetBox API to look up device metadata by hostname, returning site assignments, device roles, platform types, and IP addresses.
Getting Started
Getting up and running is straightforward. Clone the repository, set up a virtual environment, install the dependencies from requirements.txt, and configure your .env file with your DNA Center and NetBox credentials. The only Python packages required are requests for HTTP API calls, python-dotenv for environment variable management, and urllib3. No complex frameworks or heavy dependencies. Full setup instructions are in the GitHub repo README.
Threshold Alerting
One of the most useful features is configurable threshold alerting. You define your acceptable limits for latency, jitter, and packet loss in the .env file, and the tool flags any SLA operation that exceeds those limits. For example, with default thresholds of 100ms latency, 30ms jitter, and 1% packet loss, a branch router showing 115ms latency and 2.1% packet loss would be immediately flagged as an alert in the console output and reports.
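A sketch of the relevant .env entries (variable names here are illustrative; check the repo's sample file for the exact keys):
DNAC_HOST=https://dnac.example.local
DNAC_USERNAME=admin
DNAC_PASSWORD=********
NETBOX_URL=https://netbox.example.local
NETBOX_TOKEN=********
LATENCY_THRESHOLD_MS=100
JITTER_THRESHOLD_MS=30
PACKET_LOSS_THRESHOLD_PCT=1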
Sample Output
The tool generates both JSON and CSV reports. The JSON report includes a summary section with total operations, passing and failing counts, and average latency, followed by detailed per-operation data enriched with NetBox metadata. The CSV report provides the same data in a tabular format that you can easily import into Excel or feed into other monitoring tools. Sample output files are included in the GitHub repository under the output directory.
What is Next
This project is a solid foundation, but there is plenty of room to extend it. Some ideas for future enhancements include adding webhook or email notifications for alerts, integrating with Grafana for real-time dashboards, storing historical data in a time-series database like InfluxDB, and expanding the command-runner integration to pull live SLA statistics directly from devices.
Wrapping Up
Python automation combined with APIs from DNA Center and NetBox gives network engineers a powerful toolkit for monitoring IP SLAs at scale. Instead of manually checking IP SLA stats on individual devices, you can automate the entire workflow and get enriched reports in minutes.
If you found this useful, check out my other posts on network automation including Automating Network Device Backups with Python and Netmiko and Build a Home Lab Like a Pro. Stay tuned for more content from the NetworkThinkTank!