How I Built an AI Network Monitoring Tool (Beginner Friendly)

Hey everyone! If you have been following my blog, you know I love combining Python with network engineering. From automating backups with Netmiko to monitoring IP SLAs with DNA Center, I am always looking for ways to make our lives as network engineers easier. Today, I am excited to walk you through my latest project: an AI-Powered Network Health Checker. Don’t worry — this is totally beginner friendly. If you can write a basic Python script, you can follow along!

What Does This Tool Do?

In a nutshell, this tool pulls real-time data from your network devices (think CPU usage, memory utilization, interface errors, etc.), feeds that data into a simple machine learning model, and tells you whether each device is healthy or if there might be an issue. The output is super straightforward — you will see messages like “Device is healthy” or “Potential issue detected.” No PhD in data science required!

Step 1: Pulling Device Data with Python

Just like in my previous posts on network automation, we start by connecting to our devices and grabbing the data we need. I used the Netmiko library to SSH into each device and pull key metrics. Here is a simplified version of the script:

from netmiko import ConnectHandler
import re
# Connection details for a single Cisco IOS device (replace with your own)
device = {
    'device_type': 'cisco_ios',
    'host': '192.168.1.1',
    'username': 'admin',
    'password': 'yourpassword',
}
connection = ConnectHandler(**device)
# CPU: parse the five-second utilization figure
cpu_output = connection.send_command('show processes cpu')
cpu_match = re.search(r'CPU utilization for five seconds: (\d+)%', cpu_output)
cpu_usage = int(cpu_match.group(1)) if cpu_match else 0
# Memory: percentage of the processor pool in use
mem_output = connection.send_command('show processes memory')
mem_match = re.search(r'Processor Pool Total:\s+(\d+)\s+Used:\s+(\d+)', mem_output)
if mem_match:
    mem_total = int(mem_match.group(1))
    mem_used = int(mem_match.group(2))
    mem_usage = (mem_used / mem_total) * 100
else:
    mem_usage = 0
# Interface errors: sum the input error counters across all interfaces
intf_output = connection.send_command('show interfaces')
error_matches = re.findall(r'(\d+) input errors', intf_output)
total_errors = sum(int(e) for e in error_matches)
print(f"CPU Usage: {cpu_usage}%")
print(f"Memory Usage: {mem_usage:.1f}%")
print(f"Total Interface Errors: {total_errors}")
connection.disconnect()

This script connects to a Cisco IOS device and grabs CPU usage, memory utilization, and interface input error counts. If a pattern does not match the command output, the metric falls back to 0. You can easily expand this to loop through multiple devices from an inventory file, just like we did in the backup config script project.
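The parsing is the part most likely to break between IOS versions, so it can help to test the regular expressions offline before pointing the script at a live device. Here is a minimal sketch of that idea. The function name parse_metrics is made up for this example, and the sample strings only imitate the general shape of IOS output, so compare them against what your own devices return.

```python
import re


def parse_metrics(cpu_output, mem_output, intf_output):
    """Return (cpu_percent, mem_percent, total_input_errors) from raw CLI text.

    Hypothetical helper for illustration. A missing match falls back to 0,
    the same way the script above does.
    """
    cpu_match = re.search(r'CPU utilization for five seconds: (\d+)%', cpu_output)
    cpu_usage = int(cpu_match.group(1)) if cpu_match else 0

    mem_match = re.search(r'Processor Pool Total:\s+(\d+)\s+Used:\s+(\d+)', mem_output)
    if mem_match:
        mem_usage = int(mem_match.group(2)) / int(mem_match.group(1)) * 100
    else:
        mem_usage = 0

    total_errors = sum(int(e) for e in re.findall(r'(\d+) input errors', intf_output))
    return cpu_usage, mem_usage, total_errors


# Illustrative sample text (not captured from a real device)
sample_cpu = "CPU utilization for five seconds: 18%/2%; one minute: 15%; five minutes: 14%"
sample_mem = "Processor Pool Total:  1000000 Used:   432000 Free:   568000"
sample_intf = (
    "GigabitEthernet0/0 is up, line protocol is up\n"
    "     3 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored\n"
    "GigabitEthernet0/1 is up, line protocol is up\n"
    "     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored\n"
)

cpu, mem, errors = parse_metrics(sample_cpu, sample_mem, sample_intf)
print(f"CPU: {cpu}% | Memory: {mem:.1f}% | Errors: {errors}")
```

Keeping the parsing in its own function also means the same code can be reused once you loop over several devices.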

Step 2: Building a Simple ML Model for Anomaly Detection

Here is where the AI magic comes in — but I promise it is simpler than it sounds. We are using scikit-learn’s Isolation Forest algorithm, which is perfect for anomaly detection. It learns what “normal” looks like from your data and flags anything that seems off.

import numpy as np
from sklearn.ensemble import IsolationForest
# Baseline samples, one row per reading: [cpu %, memory %, interface errors]
training_data = np.array([
    [15, 40, 0], [20, 45, 1], [18, 42, 0],
    [22, 50, 2], [17, 38, 0], [19, 44, 1],
    [21, 47, 0], [16, 41, 1], [20, 43, 0], [18, 46, 2],
])
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(training_data)
# cpu_usage, mem_usage, and total_errors come from the Step 1 script
new_device_data = np.array([[cpu_usage, mem_usage, total_errors]])
prediction = model.predict(new_device_data)
# predict() returns 1 for inliers and -1 for anomalies
if prediction[0] == 1:
    print("Device is healthy")
else:
    print("Potential issue detected")

Isolation Forest works by randomly partitioning the data: it repeatedly picks a feature and a split value at random. Anomalies are isolated in fewer splits because they sit far from the majority of the data. The contamination parameter tells the model roughly what proportion of data points to expect as anomalies. I set it to 0.1 (10%) as a starting point, but you can tune this for your environment. The ten training rows here are placeholders; for real use, train on baseline metrics collected from your own devices.
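If you want more than a healthy/unhealthy label, you can ask the model for a score. The sketch below is a minimal example that reuses the same ten placeholder training rows. decision_function is part of scikit-learn's IsolationForest, and predict returns -1 exactly when that score is negative. The two test readings are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Same placeholder baseline as above: [cpu %, memory %, interface errors]
training_data = np.array([
    [15, 40, 0], [20, 45, 1], [18, 42, 0],
    [22, 50, 2], [17, 38, 0], [19, 44, 1],
    [21, 47, 0], [16, 41, 1], [20, 43, 0], [18, 46, 2],
])

model = IsolationForest(contamination=0.1, random_state=42)
model.fit(training_data)

# Two made-up readings: one close to the baseline, one far outside it
typical = np.array([[19, 44, 1]])
overloaded = np.array([[85, 92, 47]])

# The lower the score, the more anomalous the sample;
# predict() reports -1 when this score is negative
typical_score = model.decision_function(typical)[0]
overloaded_score = model.decision_function(overloaded)[0]

print(f"Typical device score:    {typical_score:.3f}")
print(f"Overloaded device score: {overloaded_score:.3f}")
```

Logging the score next to the label makes it easier to see how close a device is to the cutoff when you tune contamination.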

Step 3: Putting It All Together

Now let us combine everything into a single script that loops through your devices, pulls the data, and runs it through the model:

from netmiko import ConnectHandler
from sklearn.ensemble import IsolationForest
import numpy as np
import re
# Inventory: one dictionary per device
devices = [
    {'device_type': 'cisco_ios', 'host': '192.168.1.1', 'username': 'admin', 'password': 'yourpassword'},
    {'device_type': 'cisco_ios', 'host': '192.168.1.2', 'username': 'admin', 'password': 'yourpassword'},
]
# Baseline samples, one row per reading: [cpu %, memory %, interface errors]
training_data = np.array([
    [15, 40, 0], [20, 45, 1], [18, 42, 0],
    [22, 50, 2], [17, 38, 0], [19, 44, 1],
    [21, 47, 0], [16, 41, 1], [20, 43, 0], [18, 46, 2],
])
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(training_data)
def get_device_metrics(device):
    """Connect to one device and return [cpu %, memory %, input errors]."""
    connection = ConnectHandler(**device)
    cpu_output = connection.send_command('show processes cpu')
    cpu_match = re.search(r'CPU utilization for five seconds: (\d+)%', cpu_output)
    cpu_usage = int(cpu_match.group(1)) if cpu_match else 0
    mem_output = connection.send_command('show processes memory')
    mem_match = re.search(r'Processor Pool Total:\s+(\d+)\s+Used:\s+(\d+)', mem_output)
    mem_usage = (int(mem_match.group(2)) / int(mem_match.group(1))) * 100 if mem_match else 0
    intf_output = connection.send_command('show interfaces')
    error_matches = re.findall(r'(\d+) input errors', intf_output)
    total_errors = sum(int(e) for e in error_matches)
    connection.disconnect()
    return [cpu_usage, mem_usage, total_errors]
# Check every device in the inventory against the model
for device in devices:
    print(f"Checking device: {device['host']}")
    metrics = get_device_metrics(device)
    print(f" CPU: {metrics[0]}% | Memory: {metrics[1]:.1f}% | Errors: {metrics[2]}")
    prediction = model.predict(np.array([metrics]))
    if prediction[0] == 1:
        print(" Status: Device is healthy")
    else:
        print(" Status: Potential issue detected")

Here is what the output looks like:

Checking device: 192.168.1.1
CPU: 18% | Memory: 43.2% | Errors: 0
Status: Device is healthy
Checking device: 192.168.1.2
CPU: 85% | Memory: 92.1% | Errors: 47
Status: Potential issue detected

What’s Next?

This is just the starting point. Here are some ideas to take it further:

  • Add more metrics like uplink bandwidth utilization or BGP neighbor status
  • Save your model to a file using joblib so you don’t retrain every time
  • Set up a cron job or scheduled task to run the script at regular intervals
  • Send alerts via email or Slack when an issue is detected
  • Build a simple dashboard with Flask to visualize device health
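As an example of the joblib idea from the list above, here is a minimal sketch of saving a fitted model and loading it back. joblib.dump and joblib.load are real joblib functions. The file name health_model.joblib is hypothetical, and the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

# Same placeholder baseline as in the post: [cpu %, memory %, interface errors]
training_data = np.array([
    [15, 40, 0], [20, 45, 1], [18, 42, 0],
    [22, 50, 2], [17, 38, 0], [19, 44, 1],
    [21, 47, 0], [16, 41, 1], [20, 43, 0], [18, 46, 2],
])

model = IsolationForest(contamination=0.1, random_state=42)
model.fit(training_data)

# Hypothetical file name; in practice, pick a permanent location
model_path = os.path.join(tempfile.mkdtemp(), "health_model.joblib")
joblib.dump(model, model_path)

# Later (for example, in a scheduled run) load the model instead of refitting
loaded_model = joblib.load(model_path)

sample = np.array([[18, 43.2, 0]])
print(loaded_model.predict(sample))
```

Only load model files you created yourself, since joblib files are pickle-based and can run code when loaded.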

Get the Code

I have uploaded the full project to my GitHub repo. Feel free to clone it, play around with it, and make it your own:

GitHub: https://github.com/NetworkThinkTank-Labs/ai-network-health-checker

Final Thoughts

If you are a network engineer who is curious about AI and machine learning, this is a great beginner project to get your feet wet. You don’t need to understand every detail of how Isolation Forest works under the hood — just know that it is a tool that can help you spot problems before they become outages.

As always, if you have questions or want to share how you have customized this for your own network, drop a comment below or reach out to me. Happy automating!

Automating Network Device Backups with Python and Netmiko

Backing up network device configurations is one of the most critical tasks in network administration. A missed backup could mean hours of manual reconfiguration after a failure. In this post, we will walk through a Python script that automates this process by connecting to a router or switch via SSH and saving the running-config to a local file.
We will use Netmiko, a popular Python library that simplifies SSH connections to network devices. Whether you manage a handful of switches or hundreds of routers, this script gives you a repeatable, automated way to capture configurations on demand.

Prerequisites

Before getting started, make sure you have:

  • Python 3.8 or higher installed
  • SSH access to your target network device
  • Device credentials (username and password)
  • The Netmiko library installed

To install Netmiko, run:

pip install netmiko

You can also clone the full project repository and install from the requirements file:

git clone https://github.com/NetworkThinkTank-Labs/backup-config-script.git
cd backup-config-script
pip install -r requirements.txt

Connecting to a Router or Switch

The core of our script uses Netmiko’s ConnectHandler to establish an SSH session. You provide the device type, hostname or IP address, and your credentials. Netmiko handles the SSH negotiation and drops you into an authenticated session.

Here is the connection function from our script:

from netmiko import ConnectHandler
def connect_to_device(host, username, password, device_type, port=22, enable_secret=None):
    """Open an SSH session and enter enable mode if a secret is supplied."""
    device = {
        'device_type': device_type,
        'host': host,
        'username': username,
        'password': password,
        'port': port,
    }
    if enable_secret:
        device['secret'] = enable_secret
    connection = ConnectHandler(**device)
    if enable_secret:
        connection.enable()
    return connection

Netmiko supports a wide range of device types including cisco_ios, cisco_nxos, arista_eos, and juniper_junos. You simply pass the appropriate device type string and Netmiko adapts its behavior accordingly.

Retrieving the Running Configuration

Once connected, pulling the running configuration is a single method call. We use send_command to execute show running-config on the device and capture the output as a string:

def backup_running_config(connection):
    running_config = connection.send_command('show running-config')
    return running_config

Netmiko handles paging automatically, so even if your configuration is long, you will get the complete output without needing to send space or press Enter to page through it.

Saving the Configuration to a File

With the configuration captured in memory, the next step is writing it to a file. Our script creates a backups directory automatically and saves each configuration with a timestamped filename so you never overwrite a previous backup:

import os
from datetime import datetime
def save_config_to_file(config, hostname, output_dir='backups'):
    os.makedirs(output_dir, exist_ok=True)
    timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
    # Replace dots and colons so IP addresses are safe to use in filenames
    safe_hostname = hostname.replace('.', '_').replace(':', '_')
    filename = f'{safe_hostname}_running-config_{timestamp}.txt'
    filepath = os.path.join(output_dir, filename)
    with open(filepath, 'w') as f:
        f.write(config)
    return filepath

This produces backup files like:

backups/192_168_1_1_running-config_2026-04-10_14-30-00.txt

Running the Script

The script accepts command-line arguments so you can target any device without editing the code:

python backup_config.py --host 192.168.1.1 --username admin --password mypassword

You can also specify a different device type, SSH port, output directory, or enable password:

python backup_config.py --host 10.0.0.1 --username admin --password mypassword --device-type cisco_nxos --output-dir /var/backups/network

When you run the script, you will see output like this:

[*] Connecting to 192.168.1.1:22 (cisco_ios)...
[+] Successfully connected to 192.168.1.1
[*] Retrieving running-config...
[+] Retrieved 15234 characters of configuration
[+] Configuration saved to backups/192_168_1_1_running-config_2026-04-10_14-30-00.txt
[+] Disconnected. Backup complete!
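For readers curious how command-line flags like the ones above can be wired up, here is a minimal argparse sketch. It is not the script's actual parser. The flags --host, --username, --password, --device-type, and --output-dir come from the commands shown earlier, and the cisco_ios and port 22 defaults are inferred from the sample output. The flag names --port and --enable-secret are hypothetical, since the post mentions those options without spelling them out.

```python
import argparse


def build_parser():
    """Build a parser for the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Back up a device's running-config")
    parser.add_argument("--host", required=True, help="Hostname or IP address of the device")
    parser.add_argument("--username", required=True)
    parser.add_argument("--password", required=True)
    # Defaults inferred from the sample output; treat them as assumptions
    parser.add_argument("--device-type", default="cisco_ios", help="Netmiko device type")
    parser.add_argument("--output-dir", default="backups")
    # Hypothetical flag names for the SSH port and enable password
    parser.add_argument("--port", type=int, default=22)
    parser.add_argument("--enable-secret", default=None)
    return parser


# Parse an explicit list instead of sys.argv so the example runs anywhere
args = build_parser().parse_args([
    "--host", "10.0.0.1",
    "--username", "admin",
    "--password", "mypassword",
    "--device-type", "cisco_nxos",
    "--output-dir", "/var/backups/network",
])
print(args.host, args.device_type, args.output_dir, args.port)
```

A password typed on the command line ends up in shell history, so prompting with getpass.getpass() is a common alternative.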

What is Next

This script provides a solid foundation for network configuration backups. Here are some ideas for extending it:

  • Loop through a list of devices from a CSV or YAML inventory file to back up your entire network in one run
  • Schedule the script with cron (Linux) or Task Scheduler (Windows) for automatic daily backups
  • Add email or webhook notifications on success or failure
  • Compare configurations between backups to detect unauthorized changes using difflib
  • Integrate with Git to version-control your configurations automatically
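As an example of the difflib idea from the list above, here is a minimal sketch that compares two configuration strings. difflib.unified_diff is part of the Python standard library. The function name diff_configs and the two configuration snippets are made up for illustration.

```python
import difflib


def diff_configs(old_config, new_config, old_name="previous", new_name="current"):
    """Return a unified diff between two configuration strings as a list of lines."""
    return list(difflib.unified_diff(
        old_config.splitlines(),
        new_config.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    ))


# Made-up configuration snippets for illustration
old = "hostname R1\ninterface Gi0/0\n description uplink\n"
new = "hostname R1\ninterface Gi0/0\n description uplink to core\n"

changes = diff_configs(old, new)
for line in changes:
    print(line)
```

An empty list means the two backups are identical, which makes it easy to alert only when something has actually changed.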

Wrapping Up

Automating network device backups does not have to be complicated. With Python and Netmiko, you can connect to any router or switch, pull the running configuration, and save it to a timestamped file in just a few lines of code.

Check out the full source code and setup instructions in the GitHub repository: https://github.com/NetworkThinkTank-Labs/backup-config-script

If you found this useful, check out my other posts on network automation, including “Monitoring IP SLAs with Python, DNA Center, and NetBox” and “Build a Home Lab Like a Pro.” Stay tuned for more content from the NetworkThinkTank!

Monitoring IP SLAs with Python, DNA Center, and NetBox

Automate IP SLA monitoring across your network using Python, Cisco DNA Center APIs, and NetBox as your source of truth.

GitHub Repo: https://github.com/NetworkThinkTank-Labs/ip-sla-monitor

Introduction

If you manage a network of any size, you know that keeping tabs on performance metrics like latency, jitter, and packet loss is critical. Cisco IP SLA (Service Level Agreement) operations are the go-to feature for probing network paths and measuring these metrics directly from your routers and switches. But manually checking IP SLA statistics across dozens or hundreds of devices? That does not scale.

In this post, I will walk you through a Python-based tool I built that pulls IP SLA data from Cisco DNA Center via its REST API, enriches it with device metadata from NetBox, and generates automated performance reports. Whether you are running a handful of branch routers or a large enterprise campus, this approach gives you a scalable, repeatable way to monitor network performance.

What is IP SLA?

Cisco IP SLA is a built-in feature on Cisco IOS and IOS-XE devices that allows you to generate synthetic traffic to measure network performance. You can configure operations like ICMP echo (ping), UDP jitter, HTTP GET, and more. Each operation continuously measures metrics such as round-trip time (RTT), latency, jitter, packet loss, and availability. These metrics are essential for validating SLA compliance, troubleshooting performance issues, and capacity planning.

The Tools

This project brings together three key components. First, Python does the heavy lifting for API calls, data parsing, and report generation. Second, Cisco DNA Center provides a centralized REST API for pulling device inventory and running CLI commands across your entire network without SSH-ing into each device individually. Third, NetBox acts as our network source of truth, storing device metadata like site assignments, roles, platforms, and IP addresses that we use to enrich the raw SLA data.

How It Works

The IP SLA Monitor tool follows a simple three-step workflow:

  1. Authenticate with DNA Center and pull IP SLA operation statistics from all monitored devices using the command-runner API.
  2. Query NetBox for each device to enrich the data with site name, device role, platform, and management IP.
  3. Evaluate each SLA operation against configurable thresholds for latency, jitter, and packet loss, then generate JSON and CSV reports.

The tool can run as a one-shot collection or in a continuous monitoring loop with a configurable polling interval. Alerts are logged to the console when any operation exceeds your defined thresholds.

The Python Code

The project is organized into three main scripts:

ip_sla_monitor.py is the main orchestration script that ties everything together. It loads configuration from a .env file, initializes the DNA Center and NetBox clients, collects SLA data, enriches it, evaluates thresholds, and saves reports.

dnac_integration.py handles all communication with the Cisco DNA Center REST API including authentication, device inventory retrieval, and IP SLA data collection via the command-runner API.

netbox_integration.py connects to the NetBox API to look up device metadata by hostname, returning site assignments, device roles, platform types, and IP addresses.
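To give a feel for the enrichment step, here is a minimal sketch that flattens one NetBox device record into the fields mentioned above. It is not the code from netbox_integration.py. The function name extract_device_metadata is hypothetical, the sample record is hand-written, and the field layout (nested objects with a name key, primary_ip with an address key, and the role under either role or device_role depending on the NetBox version) is an assumption to check against your own NetBox API responses.

```python
def extract_device_metadata(device):
    """Flatten one NetBox device record (a dict) into the fields used for enrichment.

    Missing or null fields come back as None.
    """
    def nested(key, field="name"):
        value = device.get(key)
        return value.get(field) if isinstance(value, dict) else None

    return {
        "hostname": device.get("name"),
        "site": nested("site"),
        # Assumption: the role sits under "role" or, on older releases, "device_role"
        "role": nested("role") or nested("device_role"),
        "platform": nested("platform"),
        "management_ip": nested("primary_ip", "address"),
    }


# Hand-written record for illustration, not real NetBox output
sample_device = {
    "name": "branch-rtr-01",
    "site": {"name": "Branch Office 1"},
    "role": {"name": "Router"},
    "platform": None,
    "primary_ip": {"address": "10.10.1.1/24"},
}
print(extract_device_metadata(sample_device))
```

Returning None for missing fields keeps the report generation from crashing on devices that are only partly documented in NetBox.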

Getting Started

Getting up and running is straightforward. Clone the repository, set up a virtual environment, install the dependencies from requirements.txt, and configure your .env file with your DNA Center and NetBox credentials. The only Python packages required are requests for HTTP API calls, python-dotenv for environment variable management, and urllib3. No complex frameworks or heavy dependencies. Full setup instructions are in the GitHub repo README.

Threshold Alerting

One of the most useful features is configurable threshold alerting. You define your acceptable limits for latency, jitter, and packet loss in the .env file, and the tool flags any SLA operation that exceeds those limits. For example, with default thresholds of 100ms latency, 30ms jitter, and 1% packet loss, a branch router showing 115ms latency and 2.1% packet loss would be immediately flagged as an alert in the console output and reports.
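The threshold check itself can be sketched in a few lines. This is a minimal example, not the tool's actual code. The default limits are the ones quoted above, but the environment variable names, dictionary keys, and function names are hypothetical, and the jitter value in the sample is made up.

```python
import os


def load_thresholds(env=None):
    """Read limits from environment-style settings, falling back to the defaults."""
    env = os.environ if env is None else env
    return {
        # Hypothetical variable names; use whatever your .env file defines
        "latency_ms": float(env.get("LATENCY_THRESHOLD_MS", 100)),
        "jitter_ms": float(env.get("JITTER_THRESHOLD_MS", 30)),
        "packet_loss_pct": float(env.get("PACKET_LOSS_THRESHOLD_PCT", 1)),
    }


def find_violations(operation, thresholds):
    """Return the names of the metrics that exceed their limit."""
    return [
        metric for metric, limit in thresholds.items()
        if operation.get(metric, 0) > limit
    ]


# Pass an empty mapping so the example always uses the defaults
thresholds = load_thresholds({})

# The branch router from the example above (jitter value invented)
branch_router = {"latency_ms": 115.0, "jitter_ms": 12.0, "packet_loss_pct": 2.1}

violations = find_violations(branch_router, thresholds)
if violations:
    print("ALERT:", ", ".join(violations))
else:
    print("OK")
```

In this sketch a value exactly at the limit does not count as a violation; switch the comparison to >= if you prefer the stricter reading.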

Sample Output

The tool generates both JSON and CSV reports. The JSON report includes a summary section with total operations, passing and failing counts, and average latency, followed by detailed per-operation data enriched with NetBox metadata. The CSV report provides the same data in a tabular format that you can easily import into Excel or feed into other monitoring tools. Sample output files are included in the GitHub repository under the output directory.
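Here is a minimal sketch of how a summary plus JSON and CSV output could be produced with only the standard library. It is an illustration, not the tool's real report format: the field names, the two sample rows, and the function names are all hypothetical.

```python
import csv
import io
import json

# Hypothetical column names for this sketch
FIELDS = ["device", "site", "latency_ms", "passed"]

# Made-up results for illustration
results = [
    {"device": "branch-rtr-01", "site": "Branch Office 1", "latency_ms": 115.0, "passed": False},
    {"device": "core-rtr-01", "site": "HQ", "latency_ms": 12.0, "passed": True},
]


def build_summary(results):
    """Count passing and failing operations and average their latency."""
    passing = sum(1 for r in results if r["passed"])
    latencies = [r["latency_ms"] for r in results]
    return {
        "total_operations": len(results),
        "passing": passing,
        "failing": len(results) - passing,
        "average_latency_ms": sum(latencies) / len(latencies) if latencies else 0.0,
    }


def to_json(results):
    """Return the summary and per-operation data as a JSON string."""
    return json.dumps({"summary": build_summary(results), "operations": results}, indent=2)


def to_csv(results):
    """Return the per-operation data as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


report_json = to_json(results)
report_csv = to_csv(results)
print(report_json)
print(report_csv)
```

Writing to strings first keeps the functions easy to test; saving to disk is then just a matter of writing report_json and report_csv to files.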

What is Next

This project is a solid foundation, but there is plenty of room to extend it. Some ideas for future enhancements include adding webhook or email notifications for alerts, integrating with Grafana for real-time dashboards, storing historical data in a time-series database like InfluxDB, and expanding the command-runner integration to pull live SLA statistics directly from devices.

Wrapping Up

Python automation combined with APIs from DNA Center and NetBox gives network engineers a powerful toolkit for monitoring IP SLAs at scale. Instead of manually checking IP SLA stats on individual devices, you can automate the entire workflow and get enriched reports in minutes.

Check out the full source code, sample outputs, and setup instructions in the GitHub repository: https://github.com/NetworkThinkTank-Labs/ip-sla-monitor

If you found this useful, check out my other posts on network automation, including “Automating Network Device Backups with Python and Netmiko” and “Build a Home Lab Like a Pro.” Stay tuned for more content from the NetworkThinkTank!

AI-Powered Networking: How Artificial Intelligence is Transforming Network Management in 2026

The networking landscape is undergoing a seismic shift. Artificial Intelligence (AI) is no longer a futuristic concept — it’s actively reshaping how networks are designed, monitored, and managed. From self-healing infrastructure to predictive threat detection, AI-powered networking is the hottest trend in IT right now.

What is AI-Powered Networking?

AI-powered networking refers to the use of machine learning (ML), deep learning, and intelligent automation to manage, optimize, and secure networks. Unlike traditional networks that rely on manual configurations and reactive troubleshooting, AI-driven networks are proactive, adaptive, and self-optimizing.

Key Trends Driving AI in Networking in 2026

1. Intent-Based Networking (IBN)

Intent-Based Networking allows administrators to define desired network outcomes in plain language, and the AI automatically translates those intentions into configurations. Cisco’s DNA Center and similar platforms are leading this revolution, dramatically reducing human error and configuration time.

2. AIOps for Network Operations

AIOps (Artificial Intelligence for IT Operations) platforms are now mainstream in large enterprises. These tools correlate data from multiple sources, detect anomalies before they cause outages, and even recommend or automatically apply fixes. Tools like Moogsoft, Splunk, and Cisco ThousandEyes are at the forefront of this trend.

3. AI-Driven Network Security

Cybersecurity threats are evolving faster than human analysts can respond. AI-powered security tools like Darktrace, CrowdStrike, and Palo Alto’s Cortex XDR use behavioral analytics and machine learning to detect zero-day threats, insider attacks, and advanced persistent threats (APTs) in real time.

4. Smart SD-WAN with AI Optimization

SD-WAN has been a hot topic for years, but in 2026, AI is taking it to the next level. AI-enhanced SD-WAN solutions dynamically route traffic based on real-time application performance data, automatically shifting workloads between MPLS, broadband, and 5G links to guarantee optimal user experience.

5. Autonomous Networks (Zero-Touch Provisioning)

Zero-touch provisioning powered by AI enables network devices to be automatically configured and deployed without manual intervention. This is critical for the massive scale of IoT deployments, edge computing, and 5G infrastructure rollouts happening globally.

Real-World Benefits of AI in Networking

  • Reduced downtime: Predictive analytics identify potential failures hours or days before they occur.
  • Faster troubleshooting: AI reduces Mean Time to Resolution (MTTR) by up to 90% in some deployments.
  • Enhanced security posture: Continuous behavioral monitoring catches threats that signature-based tools miss.
  • Operational cost savings: Automation reduces the need for manual intervention, lowering OpEx significantly.
  • Improved user experience: Dynamic traffic shaping ensures applications always have the bandwidth they need.

Challenges and Considerations

While the promise of AI networking is immense, it’s not without challenges. Data privacy concerns, the need for large volumes of quality training data, the risk of AI model bias, and the shortage of skilled professionals who understand both networking and AI are all hurdles that organizations must navigate carefully.

Conclusion

AI-powered networking is no longer optional for organizations that want to stay competitive. Whether you’re a network engineer looking to upskill, an IT manager evaluating new solutions, or a business leader planning digital transformation, understanding AI’s role in networking is essential. The future of networking is intelligent, autonomous, and AI-driven — and that future is already here.