Architect – Support Management (Help Desk / Service Desk)

Building out a proper help desk with different levels and delegated alerting is critical for several reasons:

Efficient Use of Resources: By delegating alerts based on their level of urgency and the expertise required to handle them, you ensure that more experienced (and typically higher-paid) staff members aren’t spending their time on issues that could be handled by less experienced staff. This allows your team to handle more issues overall, increasing the efficiency of your operations.

Reduced Resolution Time: When alerts are correctly delegated, they are likely to be resolved faster. Lower-level issues can be addressed quickly by junior staff, and higher-level issues are immediately directed to the most experienced staff who are best equipped to resolve them.

Improved Learning Opportunities: By including different levels in the help desk, junior staff members can learn from experience with real-world issues, preparing them to handle more complex problems in the future.

Better Service Levels: Having a well-structured help desk with delegated alerting ensures a more consistent level of service. Users can expect lower priority issues to be addressed within a predictable timeframe, and can be assured that critical issues will be escalated and resolved quickly.

Reduced Alert Fatigue: Alert fatigue happens when individuals receive so many alerts that they can no longer effectively pay attention to each one. By properly delegating and prioritizing alerts, you ensure that individuals only receive the alerts that are most relevant to their role and expertise, reducing the risk of alert fatigue.

Enhanced System Reliability: With this setup, your team can more effectively monitor and maintain your systems, leading to higher overall system reliability and availability.

In summary, a properly structured help desk with delegated alerting can lead to significant improvements in the efficiency and effectiveness of your operations, the development of your team, the service level you provide to users, and the reliability of your systems.

 

ISSUES WITH TOO MANY ALERTS

“Alert fatigue” is a common problem in industries where monitoring systems are crucial, such as IT, healthcare, and manufacturing. It happens when so many alerts are generated that the volume becomes overwhelming, and potentially important alerts are ignored or missed.

Here are some best practices to manage alert fatigue and improve your alerting systems:

Prioritize alerts:

Prioritize alerts based on their impact and urgency. Not all alerts are created equal. Some might indicate critical issues that need immediate attention, while others might be low-priority or informational alerts that can be addressed later.

  • Critical alerts: These need immediate attention as they may cause serious issues, such as system outages.
  • Warning alerts: These aren’t as severe but indicate potential issues that should be addressed soon.
  • Informational alerts: These simply provide information and don’t need immediate action.
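The three categories above can be sketched as an ordered scale. This is a minimal illustration, not the behavior of any particular monitoring product: the `Severity` enum, the `triage` and `requires_immediate_action` helpers, and the alert dictionaries are all hypothetical names chosen for this example.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; a lower value means more urgent."""
    CRITICAL = 0
    WARNING = 1
    INFO = 2


def triage(alerts):
    """Return alerts ordered most urgent first.

    sorted() is stable, so alerts of equal severity keep their arrival order.
    """
    return sorted(alerts, key=lambda alert: alert["severity"])


def requires_immediate_action(alert):
    """Only critical alerts interrupt someone; the rest wait in the queue."""
    return alert["severity"] is Severity.CRITICAL


# Made-up alerts for illustration.
alerts = [
    {"name": "backup finished", "severity": Severity.INFO},
    {"name": "site unreachable", "severity": Severity.CRITICAL},
    {"name": "disk 85% full", "severity": Severity.WARNING},
]
ordered = triage(alerts)
```

Here `ordered` starts with the outage, and only that alert would justify interrupting someone right away.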

Group alerts:

Group similar alerts together. If multiple alerts are caused by the same underlying issue, they should be grouped together to avoid overwhelming the team with redundant information.
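One simple way to approximate grouping is to bundle alerts that share a check name and arrive within the same time window. The sketch below assumes each alert is a dictionary with hypothetical `check`, `host`, and `timestamp` (seconds) fields, and the five-minute window is an arbitrary choice. Real tools often group on richer signals than this.

```python
from collections import defaultdict


def group_alerts(alerts, window_seconds=300):
    """Bundle alerts that share a check name and fall in the same time bucket."""
    groups = defaultdict(list)
    for alert in alerts:
        # Fixed buckets: alerts straddling a bucket boundary land in two groups.
        bucket = alert["timestamp"] // window_seconds
        groups[(alert["check"], bucket)].append(alert)
    return [
        {
            "check": check,
            "hosts": sorted({a["host"] for a in members}),
            "count": len(members),
        }
        for (check, _bucket), members in groups.items()
    ]


# Made-up alerts for illustration.
raw = [
    {"check": "high_cpu", "host": "web-1", "timestamp": 1000},
    {"check": "high_cpu", "host": "web-2", "timestamp": 1030},
    {"check": "high_cpu", "host": "web-3", "timestamp": 1090},
    {"check": "disk_full", "host": "db-1", "timestamp": 1100},
]
grouped = group_alerts(raw)
```

The three `high_cpu` alerts collapse into one grouped entry listing all affected hosts, so the team sees one notification instead of three.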

Automate alerts:

Automation is key to managing a large volume of alerts. This can involve automatically resolving known issues, reducing the volume of alerts through better thresholds, or using AI to help identify and group related alerts.

Set appropriate thresholds:

A lot of noise in alerting comes from thresholds that are too sensitive. By adjusting these to levels that are indicative of actual issues, you can reduce unnecessary alerts.

Make alerts actionable:

Every alert should have clear next steps. If it’s not clear what needs to be done when an alert is received, it could lead to delays in resolving the issue.

Regularly review alert triggers:

Over time, the system may change, so the triggers for some alerts may no longer be relevant. Regularly reviewing and updating these can help keep the volume of alerts to a minimum.

Train your team:

Make sure everyone on the team understands the alert system, what each alert means, and what they need to do when they receive one. This can reduce confusion and ensure that critical alerts are dealt with promptly.

Use alert routing:

Alerts should be sent to the relevant teams or individuals who are equipped to deal with them. Not everyone needs to receive every alert. Routing alerts can help ensure they’re seen by the people best equipped to handle them.

Remember, the goal of your alert system is to highlight critical issues that need immediate attention, not to bombard your team with constant notifications. With these strategies, you can reduce alert fatigue and ensure your team is focusing on the alerts that matter most.

 

PRIORITY LEVELS BASED ON EXPERTISE/JOB TITLE

It can be very effective to differentiate alerts based on the level of expertise required to handle them. However, it’s essential to approach this with care to ensure critical alerts are handled promptly, and team members are utilized effectively.

Here’s a simple way to incorporate expertise levels into your alert system:

Level 1 (L1) – Novice: These alerts can be handled by less experienced team members or those in a learning phase. Examples might include simple server restarts, password resets, or other routine tasks that can be completed with a basic understanding and a standard procedure.

Level 2 (L2) – Intermediate: These alerts are more complex and might require troubleshooting skills or specific knowledge about system configuration or certain applications. L2 support might handle issues like network connectivity problems, minor hardware issues, or software bugs.

Level 3 (L3) – Expert: These alerts are the most complex and usually require deep system expertise or development skills. This might include critical system outages, data loss, complex software bugs, or security breaches.
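The three levels can be sketched as a lookup table with an escalation step. The issue type names below are hypothetical labels derived from the examples above. Sending unknown issue types to the highest level is a cautious default assumed for this sketch, not a rule from any framework.

```python
# Hypothetical mapping from issue type to support level.
TIER_BY_ISSUE = {
    "password_reset": 1,
    "server_restart": 1,
    "network_connectivity": 2,
    "minor_hardware": 2,
    "system_outage": 3,
    "data_loss": 3,
    "security_breach": 3,
}
HIGHEST_TIER = 3


def assign_tier(issue_type):
    """Look up the level for a known issue type.

    Unknown issue types go to the highest level (an assumption of this sketch).
    """
    return TIER_BY_ISSUE.get(issue_type, HIGHEST_TIER)


def escalate(current_tier):
    """Move an unresolved issue up one level, stopping at the highest."""
    return min(current_tier + 1, HIGHEST_TIER)
```

For example, `assign_tier("password_reset")` returns 1, and an L1 engineer who cannot resolve the issue would call `escalate(1)` to hand it to L2.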

This approach can help ensure that alerts are handled by the most appropriate team members. Junior team members gain experience by handling real but low-risk issues, while senior team members can focus on more complex problems. It also helps with capacity planning and improves overall system resilience.

However, this model assumes that the expertise of your team members is accurately reflected in their titles or roles, which may not always be the case. It’s essential to continually reassess the distribution of tasks to ensure it aligns with actual capabilities and skills, rather than solely relying on job titles.

 

BEST PRACTICES AND RECOMMENDATIONS WHEN USING DIFFERENT LEVELS

Proper use of priority levels in alerting is crucial to ensure that the most urgent and impactful issues are dealt with promptly, and that less critical issues do not overwhelm your team. Here are some best practices to follow:

Define the priority levels clearly:

Each priority level should have a clear, universally understood definition. These definitions should be based on the potential impact of the issue, the urgency with which it needs to be resolved, and the expertise needed to address it. The definition should be specific enough that there is no ambiguity about which level an alert falls into.

Use a small, consistent number of levels:

Having too many priority levels can make the system overly complicated and lead to confusion. A three- or four-level system (such as High/Critical, Medium, Low, and Informational) is often enough to differentiate between the most and least critical issues.

Assign appropriate response times:

Each priority level should have an associated response time that reflects its urgency. For example, high priority issues might require immediate attention, medium priority issues could be resolved within a few hours, and low priority issues might be addressed within a day.
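Response times like these can be written down as a small table and checked mechanically. In the sketch below, the 15-minute and 4-hour figures are assumed stand-ins for “immediate” and “a few hours”, and the function names are hypothetical. Your own targets should come from your service level agreements.

```python
from datetime import datetime, timedelta

# Assumed targets; replace with the values your team has agreed on.
RESPONSE_TARGETS = {
    "high": timedelta(minutes=15),
    "medium": timedelta(hours=4),
    "low": timedelta(days=1),
}


def response_deadline(priority, opened_at):
    """Return the time by which an issue of this priority should be handled."""
    return opened_at + RESPONSE_TARGETS[priority]


def is_overdue(priority, opened_at, now):
    """True when the response target for this priority has already passed."""
    return now > response_deadline(priority, opened_at)


opened = datetime(2024, 1, 1, 9, 0)
medium_deadline = response_deadline("medium", opened)
```

A periodic job could call `is_overdue` on every open issue and escalate the ones that have passed their target.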

Regularly review and update priority levels:

Your business, and the systems you use, will change over time. Regularly reviewing and updating your priority levels ensures that they continue to reflect the current reality of your business. This should involve seeking feedback from the team members who use the system.

Train your team:

Everyone who uses the alert system needs to understand the priority levels and how to use them. This includes understanding the criteria for each level, the expected response times, and the process for escalating issues if necessary.

Keep the system flexible:

While it’s important to have clear definitions and rules, there should also be some flexibility to handle unique situations. For example, there might be times when an ordinarily low-priority issue becomes high priority due to other factors, such as a high-traffic event on your website.
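One way to build in that flexibility is to compute an effective priority from the base priority plus context. The sketch below raises the priority by one level during a high-traffic event. The level names, the single-step bump, and the `high_traffic_event` flag are assumptions for illustration. A real policy might jump straight to the highest level or consider other factors.

```python
# Ordered from least to most urgent (assumed level names).
PRIORITY_ORDER = ["informational", "low", "medium", "high"]


def effective_priority(base_priority, high_traffic_event=False):
    """Raise the priority by one level while a high-traffic event is active."""
    index = PRIORITY_ORDER.index(base_priority)
    if high_traffic_event:
        # Cap at the highest level so "high" stays "high".
        index = min(index + 1, len(PRIORITY_ORDER) - 1)
    return PRIORITY_ORDER[index]
```

Keeping the override in one function leaves the base definitions untouched, so the normal rules still apply once the event is over.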

By properly managing the priority levels of your alerts, you can help to ensure that your team focuses on the most impactful issues, reduces alert fatigue, and maintains a high level of system performance and reliability.

 

HELP DESK vs SERVICE DESK

The terms “Help Desk” and “Service Desk” are both used in the IT industry, often interchangeably, but they do have slightly different connotations based on the ITIL (Information Technology Infrastructure Library) framework.

Help Desk: This term is often used to refer to a resource intended to provide the customer or end user with information and support related to the company’s or institution’s products and services. The purpose of a help desk is usually to troubleshoot problems or provide guidance about products such as computers, electronic equipment, or software.

Service Desk: A Service Desk, in the ITIL context, is the single point of contact (SPOC) between the service provider (IT) and users for day-to-day activities. It’s a place for users to place requests, report issues, and ask for general guidance. A service desk has a broader scope than a traditional help desk and aims to deliver a holistic approach to IT service management (ITSM).

The delegation and prioritization of alerts, issues, or tickets can occur in either a help desk or service desk context. However, the term “service desk” is more commonly associated with the ITIL framework, which describes practices such as incident management, problem management, and request fulfillment.

Furthermore, the structure of the help desk or service desk team often incorporates a tiered support level system. For instance, Tier 1 (first-line support) handles basic customer issues, while Tier 2 and Tier 3 (second and third-line support) handle more complex issues requiring specific expertise.

To sum up, while the terms “Help Desk” and “Service Desk” are often used interchangeably in practice, “Service Desk” generally implies a broader and more holistic approach to IT service management as per the ITIL guidelines.

 

EXAMPLE

Here’s an example for an IT operations team in a medium to large company:

Let’s assume that your company has a complex infrastructure with several applications running on multiple servers, both on-premises and in the cloud. You are using a monitoring system that is capable of sending alerts when certain thresholds are crossed.

Example Implementation of Best Practices:

Prioritize alerts: You can set up your monitoring system to sort alerts into three categories based on severity:

  • Critical alerts: Outages, resource saturation (CPU, memory, disk), security threats
  • Warning alerts: Approaching capacity limits, erratic system behavior, software errors
  • Informational alerts: Successful backups, system updates, user login/logout records

Group alerts: If multiple servers are reporting high CPU usage at the same time, this could be grouped into a single alert indicating that there might be an infrastructure-wide issue causing this.

Automate alerts: Implement automation tools like Ansible, Puppet, or Chef to automatically resolve recurring or known issues. For instance, if disk space is running low on a server, a script can automatically clear temp files and generate an informational alert.
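The disk-space remediation could look roughly like the following. This is plain Python rather than an Ansible, Puppet, or Chef task, and the function name, the age cutoff, and the shape of the returned alert are all hypothetical. It only removes regular files directly inside the given directory.

```python
import time
from pathlib import Path


def clear_old_temp_files(temp_dir, max_age_seconds, now=None):
    """Delete regular files in temp_dir older than max_age_seconds.

    Returns a dictionary standing in for the informational alert.
    """
    now = time.time() if now is None else now
    removed = []
    for path in Path(temp_dir).iterdir():
        # Skip subdirectories; only files past the age cutoff are removed.
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path.name)
    return {
        "severity": "informational",
        "message": f"Removed {len(removed)} temp file(s) from {temp_dir}",
        "removed": sorted(removed),
    }
```

Returning an alert instead of acting silently keeps a record of what the automation did, which is useful when reviewing whether the underlying disk problem keeps recurring.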

Set appropriate thresholds: You might currently be getting alerts when CPU usage goes above 70%. However, if this is a common occurrence and it doesn’t impact the system’s functionality, consider raising this to 80% or 90% to reduce alert volume.
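Before changing a threshold, it can help to replay historical readings and count how many alerts each candidate value would have produced. The sketch below does this with made-up CPU samples. The function names are hypothetical, and it treats every sample above the threshold as one alert, which is a simplification.

```python
def count_breaches(samples, threshold):
    """Count samples strictly above the threshold; each one would raise an alert."""
    return sum(1 for value in samples if value > threshold)


def compare_thresholds(samples, thresholds):
    """Map each candidate threshold to the number of alerts it would produce."""
    return {threshold: count_breaches(samples, threshold) for threshold in thresholds}


# Made-up CPU usage percentages for illustration.
cpu_samples = [55, 72, 68, 91, 75, 83, 60, 71, 95, 78]
volume = compare_thresholds(cpu_samples, [70, 80, 90])
```

With these samples, a 70% threshold fires seven times, 80% fires three times, and 90% fires twice, which makes the trade-off between noise and sensitivity concrete.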

Make alerts actionable: Each alert should have an associated runbook or playbook that provides step-by-step instructions for diagnosing and resolving the issue.
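A lightweight check can flag alert definitions that lack a runbook. The sketch below assumes alert definitions are dictionaries with hypothetical `name` and `runbook` fields. The runbook paths shown are invented placeholders.

```python
def alerts_missing_runbook(alert_definitions):
    """Return the names of alert definitions with no runbook reference."""
    return [
        definition["name"]
        for definition in alert_definitions
        # Treat a missing key, None, or blank text as "no runbook".
        if not (definition.get("runbook") or "").strip()
    ]


# Made-up definitions for illustration.
definitions = [
    {"name": "disk_full", "runbook": "runbooks/disk-full.md"},
    {"name": "high_cpu", "runbook": ""},
    {"name": "db_replication_lag"},
]
missing = alerts_missing_runbook(definitions)
```

Running a check like this whenever alert definitions change would stop new alerts from going live without documented next steps.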

Regularly review alert triggers: Have a quarterly review of your alerting system to update thresholds and conditions based on the evolving system performance and business needs.

Train your team: Ensure that your team members understand each type of alert and know what action to take when they see one. Regularly conduct training sessions to keep everyone updated.

Use alert routing: Send database-related alerts to your DBA team, network issues to your network team, and application issues to your dev team. Use an incident management platform like PagerDuty or OpsGenie to manage on-call schedules and route alerts to the right people at the right time.
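At its simplest, the routing logic is a lookup from alert category to owning team. The sketch below is not the PagerDuty or OpsGenie API, since those platforms have their own routing configuration. The category names, team identifiers, and catch-all route are hypothetical.

```python
# Hypothetical routing table: alert category -> owning team.
ROUTES = {
    "database": "dba-team",
    "network": "network-team",
    "application": "dev-team",
}
DEFAULT_ROUTE = "ops-on-call"  # assumed catch-all for uncategorized alerts


def route_alert(alert):
    """Return the team that should receive this alert."""
    return ROUTES.get(alert.get("category"), DEFAULT_ROUTE)
```

A catch-all route matters here: an alert with a missing or unexpected category still reaches someone instead of being dropped.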

Remember that this setup will evolve and improve over time, reducing noise and focusing attention on the critical issues that impact your business. It’s a dynamic process, and regular review and updating are crucial for success.