This policy sets out the standards and procedures for monitoring network devices within our organization. It aims to ensure the continuous availability, performance, and security of the network infrastructure, minimize network downtime, and facilitate timely and effective responses to network incidents.
This policy applies to all network devices owned or operated by the organization, including switches, routers, firewalls, load balancers, and wireless access points.
Availability Monitoring: Network devices should be continuously monitored for uptime. Any unexpected downtime or non-responsive device should trigger an immediate alert to the Network Operations Center (NOC) or designated IT staff.
Performance Monitoring: Network performance should be continuously monitored. Key performance indicators (KPIs) including bandwidth utilization, network latency, and packet loss should be routinely logged and assessed. Any degradation in network performance should trigger an alert.
Capacity Monitoring: Network devices should be monitored for resource utilization. This includes metrics such as CPU usage, memory usage, and storage space. Alerts should be generated when predefined thresholds are exceeded.
Error and Anomaly Monitoring: All network devices should be monitored for error messages, interface errors, hardware errors, and any other anomalies. All such errors should be logged, and alerts should be generated according to the severity of the error.
Configuration Monitoring: All changes to network device configurations should be logged and monitored. Any unauthorized changes should trigger an immediate alert.
Security Monitoring: Network devices should be monitored for signs of security incidents. This includes unauthorized access attempts, intrusion detection/prevention system (IDS/IPS) alerts, and other suspicious activities.
Device Health Monitoring: The physical and operational health of all network devices should be monitored. This includes temperature, power status, fan status, and any hardware diagnostics that the device can provide.
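As an illustration of the Configuration Monitoring requirement above, the following is a minimal sketch of one possible approach: comparing a fingerprint of a device's running configuration against a fingerprint of the last approved configuration. It assumes that "unauthorized" can be approximated as "differs from the approved baseline", which this policy does not itself define. The function names and the configuration snippets are hypothetical, and retrieving the configuration from a device is out of scope here.

```python
import hashlib


def config_fingerprint(config_text: str) -> str:
    """Return a SHA-256 fingerprint of a configuration.

    Trailing whitespace and blank lines are ignored so that purely
    cosmetic differences do not register as changes.
    """
    normalized = "\n".join(
        line.rstrip() for line in config_text.splitlines() if line.strip()
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_unapproved_change(running_config: str, approved_fingerprint: str) -> bool:
    """True when the running configuration differs from the approved baseline."""
    return config_fingerprint(running_config) != approved_fingerprint


# Hypothetical configuration snippets, for illustration only.
approved = "hostname core-sw-01\ninterface Gi0/1\n description uplink\n"
baseline = config_fingerprint(approved)

drifted = approved + "snmp-server community public RO\n"

print(is_unapproved_change(approved, baseline))  # False
print(is_unapproved_change(drifted, baseline))   # True
```

In practice the approved fingerprint would be updated as part of the change-approval process, so that only changes made outside that process raise an alert.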
The Network Operations Center (NOC) or designated IT staff are responsible for monitoring network devices as per this policy.
The NOC or designated IT staff should respond promptly to all alerts generated by the monitoring system.
IT staff should perform periodic reviews of this policy to ensure it continues to align with organizational needs and industry best practices.
Review and Changes to the Policy:
This policy shall be reviewed at least annually or as required by changes in network infrastructure or business needs. All changes to this policy must be approved by the IT Director or designated authority.
Non-compliance with this policy can lead to disciplinary action, up to and including termination. Any suspected non-compliance should be reported to the IT Director or designated authority immediately.
Any exceptions to this policy must be approved in advance by the IT Director or designated authority. An exception request should include a valid business reason for the exception and any compensating controls that will be implemented to mitigate the associated risks.
Setting thresholds is a crucial part of network device monitoring, and thresholds often need to be tailored to the specific requirements and capabilities of the organization’s network infrastructure. The following general thresholds are offered as a starting point:
Availability Monitoring: Any network device that becomes non-responsive or goes offline should immediately trigger an alert.
Bandwidth Utilization: If bandwidth utilization exceeds 80% of the total capacity for an extended period (e.g., 15 minutes), an alert should be generated. This threshold can help prevent saturation.
Latency: Acceptable average latency varies with the network’s geographical span. For a LAN, an alert should be generated if latency consistently exceeds 50 ms.
Packet Loss: Any observed packet loss greater than 1% should trigger an alert.
CPU Usage: An alert should be generated if CPU utilization on a network device exceeds 80% for a sustained period (e.g., 5 minutes).
Memory Usage: As with CPU usage, an alert should be generated if memory utilization exceeds 80% for a sustained period.
Error and Anomaly Monitoring: Any anomaly or error message should be logged. Severity levels can be set for different types of errors. High-severity errors, such as hardware failures, should immediately trigger an alert.
Configuration Monitoring: Any configuration change should be logged. Unauthorized changes should immediately trigger an alert.
Security Monitoring: Any suspected unauthorized access or security incident should immediately trigger an alert.
Device Health Monitoring:
Temperature: Most devices have a recommended operating temperature range provided by the manufacturer. If a device’s temperature exceeds the upper limit of that range, an alert should be triggered.
Power Status: Any power failure or fluctuation should immediately trigger an alert.
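The sustained-period thresholds above (bandwidth, CPU, and memory utilization) differ from the immediate ones in that a single high reading should not raise an alert. The following is a minimal sketch of one way to evaluate such a rule, assuming samples arrive as (timestamp in seconds, percent) pairs. It treats "sustained" as "every sample in the window exceeds the limit, and the history reaches back at least as far as the window start"; that interpretation, the class and function names, and the sample data are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SustainedThreshold:
    metric: str
    limit_percent: float
    window_seconds: int


def is_sustained_breach(samples, threshold, now):
    """Return True when the metric stayed above the limit for the whole window.

    samples: iterable of (timestamp_seconds, value) pairs.
    now: timestamp marking the end of the evaluation window.
    """
    window_start = now - threshold.window_seconds
    in_window = [value for ts, value in samples if window_start <= ts <= now]
    # Require history reaching back to the window start, so a device that
    # has only just started reporting does not alert on a partial window.
    covers_window = any(ts <= window_start for ts, _ in samples)
    return covers_window and bool(in_window) and all(
        value > threshold.limit_percent for value in in_window
    )


# Rule taken from the guideline above: CPU over 80% for 5 minutes.
CPU_RULE = SustainedThreshold("cpu_percent", 80.0, 5 * 60)

# Hypothetical samples polled every 60 seconds.
steady_high = [(t, 85.0) for t in range(0, 301, 60)]
brief_spike = [(0, 40.0), (60, 45.0), (120, 95.0),
               (180, 50.0), (240, 42.0), (300, 41.0)]

print(is_sustained_breach(steady_high, CPU_RULE, now=300))  # True
print(is_sustained_breach(brief_spike, CPU_RULE, now=300))  # False
```

The same function could be reused for the bandwidth guideline by constructing a rule with a 15-minute window.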
These are general guidelines. Actual thresholds should be set based on the specific network infrastructure, the capabilities of the network devices, business needs, and acceptable levels of risk. These thresholds should be reviewed and adjusted regularly to reflect changes in the network environment and business requirements.
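One way to keep thresholds adjustable per device is to hold the general guideline values as defaults and layer device-specific overrides on top. The sketch below assumes this layout; the device names and override values are hypothetical and are not recommendations.

```python
# Default values taken from the general guidelines in this policy.
DEFAULT_THRESHOLDS = {
    "bandwidth_percent": 80.0,
    "latency_ms": 50.0,
    "packet_loss_percent": 1.0,
    "cpu_percent": 80.0,
    "memory_percent": 80.0,
}

# Hypothetical per-device overrides, for illustration only.
DEVICE_OVERRIDES = {
    "wan-edge-01": {"latency_ms": 150.0},  # WAN link, higher latency expected
    "core-fw-01": {"cpu_percent": 70.0},   # stricter limit on a critical device
}


def thresholds_for(device_name: str) -> dict:
    """Return the effective thresholds for a device: defaults plus overrides."""
    effective = dict(DEFAULT_THRESHOLDS)  # copy so the defaults stay untouched
    effective.update(DEVICE_OVERRIDES.get(device_name, {}))
    return effective


print(thresholds_for("wan-edge-01")["latency_ms"])   # 150.0
print(thresholds_for("access-sw-07")["latency_ms"])  # 50.0
```

Keeping overrides separate from defaults makes each deviation from the general guidelines explicit and easy to revisit during periodic reviews.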