How I Built an AI Network Monitoring Tool (Beginner Friendly)

Hey everyone! If you have been following my blog, you know I love combining Python with network engineering. From automating backups with Netmiko to monitoring IP SLAs with DNA Center, I am always looking for ways to make our lives as network engineers easier. Today, I am excited to walk you through my latest project: an AI-Powered Network Health Checker. Don’t worry — this is totally beginner friendly. If you can write a basic Python script, you can follow along!

What Does This Tool Do?

In a nutshell, this tool pulls real-time data from your network devices (think CPU usage, memory utilization, interface errors, etc.), feeds that data into a simple machine learning model, and tells you whether each device is healthy or if there might be an issue. The output is super straightforward — you will see messages like “Device is healthy” or “Potential issue detected.” No PhD in data science required!

Step 1: Pulling Device Data with Python

Just like in my previous posts on network automation, we start by connecting to our devices and grabbing the data we need. I used the Netmiko library to SSH into each device and pull key metrics. Here is a simplified version of the script:

from netmiko import ConnectHandler
import re
# Connection details for one device (replace with your own values)
device = {
    'device_type': 'cisco_ios',
    'host': '192.168.1.1',
    'username': 'admin',
    'password': 'yourpassword',
}
connection = ConnectHandler(**device)
# CPU: read the five-second utilization (falls back to 0 if the line is not found)
cpu_output = connection.send_command('show processes cpu')
cpu_match = re.search(r'CPU utilization for five seconds: (\d+)%', cpu_output)
cpu_usage = int(cpu_match.group(1)) if cpu_match else 0
# Memory: convert the processor pool's used/total byte counts into a percentage
mem_output = connection.send_command('show processes memory')
mem_match = re.search(r'Processor Pool Total:\s+(\d+)\s+Used:\s+(\d+)', mem_output)
if mem_match:
    mem_total = int(mem_match.group(1))
    mem_used = int(mem_match.group(2))
    mem_usage = (mem_used / mem_total) * 100
else:
    mem_usage = 0
# Interfaces: add up the "input errors" counter across every interface
intf_output = connection.send_command('show interfaces')
error_matches = re.findall(r'(\d+) input errors', intf_output)
total_errors = sum(int(e) for e in error_matches)
print(f"CPU Usage: {cpu_usage}%")
print(f"Memory Usage: {mem_usage:.1f}%")
print(f"Total Interface Errors: {total_errors}")
connection.disconnect()

This script connects to a Cisco IOS device and grabs CPU usage, memory utilization, and interface error counts. You can easily expand this to loop through multiple devices from an inventory file, just like we did in the backup config script project.
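
If you want a starting point for the inventory idea, here is a minimal sketch. It assumes a CSV file with the columns host, device_type, username, and password; that layout and the load_inventory helper are my own illustration, not something Netmiko requires. The sample data is inlined so the snippet runs without a file on disk.

```python
import csv
import io

def load_inventory(csv_file):
    """Read device rows from an open CSV file into Netmiko-style dicts.

    Hypothetical helper: the column names are an assumed layout.
    """
    devices = []
    for row in csv.DictReader(csv_file):
        # Skip rows that do not name a host
        host = (row.get("host") or "").strip()
        if not host:
            continue
        devices.append({
            # Default to cisco_ios when the column is missing or empty
            "device_type": (row.get("device_type") or "cisco_ios").strip(),
            "host": host,
            "username": (row.get("username") or "").strip(),
            "password": (row.get("password") or "").strip(),
        })
    return devices

# Inline sample so the sketch runs as-is. With a real file you would use:
#   with open("inventory.csv", newline="") as f:
#       devices = load_inventory(f)
sample_csv = io.StringIO(
    "host,device_type,username,password\n"
    "192.168.1.1,cisco_ios,admin,yourpassword\n"
    "192.168.1.2,cisco_ios,admin,yourpassword\n"
)
devices = load_inventory(sample_csv)
for device in devices:
    print(f"Would check {device['host']} ({device['device_type']})")
```

Each dict has the same keys as the device dictionary above, so it can be passed straight to ConnectHandler(**device). For anything beyond a lab, you would probably not want passwords sitting in a plain CSV file.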

Step 2: Building a Simple ML Model for Anomaly Detection

Here is where the AI magic comes in, but I promise it is simpler than it sounds. We are using scikit-learn’s Isolation Forest algorithm, an unsupervised method that is well suited to anomaly detection. It learns what “normal” looks like from your data and flags anything that seems off, so you do not need labeled examples of failures.

import numpy as np
from sklearn.ensemble import IsolationForest
# Each row is one sample: [CPU %, memory %, interface errors]
training_data = np.array([
    [15, 40, 0], [20, 45, 1], [18, 42, 0],
    [22, 50, 2], [17, 38, 0], [19, 44, 1],
    [21, 47, 0], [16, 41, 1], [20, 43, 0], [18, 46, 2],
])
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(training_data)
# cpu_usage, mem_usage and total_errors come from the Step 1 script
new_device_data = np.array([[cpu_usage, mem_usage, total_errors]])
# predict() returns 1 for an inlier and -1 for an anomaly
prediction = model.predict(new_device_data)
if prediction[0] == 1:
    print("Device is healthy")
else:
    print("Potential issue detected")

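The Isolation Forest works by building many random trees, each of which repeatedly splits the data on a randomly chosen feature and threshold. Anomalies are isolated in fewer splits because they sit far from the majority of the data. The contamination parameter tells the model roughly what fraction of data points are expected to be anomalies. scikit-learn uses it to set the score threshold, so about that share of the training rows will themselves be flagged. I set it to 0.1 (10%) as a starting point, but you can tune this for your environment.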
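
To make the "isolated faster" idea concrete, here is a toy, pure-Python sketch. It is not scikit-learn's implementation, just a simplified illustration of the same principle: keep making random splits and count how many it takes until a point is alone. The function names are my own, and the data is the training set from above plus the unhealthy reading from the sample output later in this post.

```python
import random

def isolation_depth(point, data, rng, max_tries=200):
    """Count the random splits needed until `point` is alone in its partition."""
    subset = list(data)
    depth = 0
    for _ in range(max_tries):
        if len(subset) <= 1:
            break
        # Pick a random feature and a random threshold within its current range
        feature = rng.randrange(len(point))
        values = [row[feature] for row in subset]
        low, high = min(values), max(values)
        if low == high:
            continue  # this feature cannot separate anything here, pick again
        threshold = rng.uniform(low, high)
        # Keep only the rows that fall on the same side of the split as `point`
        if point[feature] < threshold:
            subset = [row for row in subset if row[feature] < threshold]
        else:
            subset = [row for row in subset if row[feature] >= threshold]
        depth += 1
    return depth

def average_depth(point, data, trials=500, seed=42):
    """Average the isolation depth over many random runs."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(trials)) / trials

typical = [19, 44, 1]      # a row from the normal cluster
outlier = [85, 92.1, 47]   # the unhealthy reading: [CPU %, memory %, errors]
data = [
    [15, 40, 0], [20, 45, 1], [18, 42, 0],
    [22, 50, 2], [17, 38, 0], typical,
    [21, 47, 0], [16, 41, 1], [20, 43, 0], [18, 46, 2],
    outlier,
]
print(f"Average splits to isolate typical row: {average_depth(typical, data):.2f}")
print(f"Average splits to isolate outlier row: {average_depth(outlier, data):.2f}")
```

The outlier should need noticeably fewer splits on average than the typical row. The real algorithm turns that average path length into an anomaly score, and contamination decides where the cutoff between "healthy" and "issue" falls.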

Step 3: Putting It All Together

Now let us combine everything into a single script that loops through your devices, pulls the data, and runs it through the model:

from netmiko import ConnectHandler
from sklearn.ensemble import IsolationForest
import numpy as np
import re
# Devices to check (replace with your own inventory)
devices = [
    {'device_type': 'cisco_ios', 'host': '192.168.1.1', 'username': 'admin', 'password': 'yourpassword'},
    {'device_type': 'cisco_ios', 'host': '192.168.1.2', 'username': 'admin', 'password': 'yourpassword'},
]
# Each row is one sample: [CPU %, memory %, interface errors]
training_data = np.array([
    [15, 40, 0], [20, 45, 1], [18, 42, 0],
    [22, 50, 2], [17, 38, 0], [19, 44, 1],
    [21, 47, 0], [16, 41, 1], [20, 43, 0], [18, 46, 2],
])
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(training_data)
def get_device_metrics(device):
    """Return [cpu %, memory %, total input errors] for one device."""
    connection = ConnectHandler(**device)
    # CPU: five-second utilization (0 if the line is not found)
    cpu_output = connection.send_command('show processes cpu')
    cpu_match = re.search(r'CPU utilization for five seconds: (\d+)%', cpu_output)
    cpu_usage = int(cpu_match.group(1)) if cpu_match else 0
    # Memory: used / total of the processor pool, as a percentage
    mem_output = connection.send_command('show processes memory')
    mem_match = re.search(r'Processor Pool Total:\s+(\d+)\s+Used:\s+(\d+)', mem_output)
    mem_usage = (int(mem_match.group(2)) / int(mem_match.group(1))) * 100 if mem_match else 0
    # Interfaces: sum of "input errors" across all interfaces
    intf_output = connection.send_command('show interfaces')
    error_matches = re.findall(r'(\d+) input errors', intf_output)
    total_errors = sum(int(e) for e in error_matches)
    connection.disconnect()
    return [cpu_usage, mem_usage, total_errors]
for device in devices:
    print(f"Checking device: {device['host']}")
    metrics = get_device_metrics(device)
    print(f" CPU: {metrics[0]}% | Memory: {metrics[1]:.1f}% | Errors: {metrics[2]}")
    # predict() returns 1 for an inlier and -1 for an anomaly
    prediction = model.predict(np.array([metrics]))
    if prediction[0] == 1:
        print(" Status: Device is healthy")
    else:
        print(" Status: Potential issue detected")

Here is what the output looks like:

Checking device: 192.168.1.1
 CPU: 18% | Memory: 43.2% | Errors: 0
 Status: Device is healthy
Checking device: 192.168.1.2
 CPU: 85% | Memory: 92.1% | Errors: 47
 Status: Potential issue detected

What’s Next?

This is just the starting point. Here are some ideas to take it further:

  • Add more metrics like uplink bandwidth utilization or BGP neighbor status
  • Save your model to a file using joblib so you don’t retrain every time
  • Set up a cron job or scheduled task to run the script at regular intervals
  • Send alerts via email or Slack when an issue is detected
  • Build a simple dashboard with Flask to visualize device health
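
As one example from the list above, here is a rough sketch of the Slack alert idea using only the standard library. It assumes you have created a Slack incoming webhook, which accepts a JSON body with a text field. The function names and the message wording are my own, and the webhook URL is a placeholder you would replace with yours.

```python
import json
import urllib.request

def build_alert_payload(host, cpu, memory, errors):
    """Build the JSON body for a Slack incoming-webhook message."""
    text = (
        f"Potential issue detected on {host}: "
        f"CPU {cpu}% | Memory {memory:.1f}% | Errors {errors}"
    )
    return {"text": text}

def send_slack_alert(webhook_url, payload, timeout=10):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status

# Build the message for the unhealthy device from the sample output
payload = build_alert_payload("192.168.1.2", 85, 92.1, 47)
print(json.dumps(payload))

# Uncomment and fill in your own webhook URL to actually send it:
# send_slack_alert("YOUR_WEBHOOK_URL", payload)
```

In the combined script, you could call build_alert_payload and send_slack_alert inside the else branch where "Potential issue detected" is printed.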

Get the Code

I have uploaded the full project to my GitHub repo. Feel free to clone it, play around with it, and make it your own:

GitHub: https://github.com/NetworkThinkTank-Labs/ai-network-health-checker

Final Thoughts

If you are a network engineer who is curious about AI and machine learning, this is a great beginner project to get your feet wet. You don’t need to understand every detail of how Isolation Forest works under the hood — just know that it is a tool that can help you spot problems before they become outages.

As always, if you have questions or want to share how you have customized this for your own network, drop a comment below or reach out to me. Happy automating!
