When observing the notification flow from firewalls in the infrastructures of the MSP clients I manage, I frequently encounter a specific situation: a separate email alert for every successful SSL VPN connection. At first this can look security-focused, a way of “knowing everything,” but the approach eventually creates a significant operational burden and a security risk of its own.
These notifications land in the same mailbox as critical alerts (IPsec tunnel down, backup failed), other routine informational messages, vendor license/transaction emails, and weekly summaries. As a result, a situation that truly requires action can easily get lost in the daily email bombardment. As an MSP operator, I have always treated this “alarm noise” as a priority problem.
Alarm Noise and Blinded Operators
Alarm noise, simply put, is when the volume of notifications from a system exceeds an operator’s capacity to meaningfully process them. This leads to alarm fatigue and ultimately causes critical events to be overlooked. Imagine hundreds, or even thousands, of emails landing in your inbox every day when working with multiple clients on the MSP side. A significant portion of these, like “VPN connection established,” are actually just informational records that don’t require immediate action.
Amidst this information overload, truly urgent events, such as an IPsec tunnel going down or a backup job failing, get lost among the routine VPN connection notifications. Faced with a constantly ringing alarm system, an operator gradually becomes desensitized and tends to ignore a genuine threat signal, thinking, “Is it another unnecessary notification?” This is one of the most dangerous scenarios in cybersecurity and general IT operations.
For example, in the notification stream from Sophos XGS firewalls, a critical tunnel-down alarm and a routine “user connected to VPN” message can appear consecutively in the same channel and in a similar format. An overlooked IPSEC_TUNNEL_DOWN notification can lead to hours of undetected downtime or data leakage risk, while SSL_VPN_CONNECTION_ESTABLISHED notifications merely fill up the inbox.
Subject: [Firewall Alert] VPN Connection Established - User: ali.can
To: it_alerts@domain.com
From: sophos-firewall@domain.com
Date: Tue, 14 May 2024 10:05:12 +0300
Body: User ali.can connected to SSL VPN from 192.0.2.10.

Subject: [Firewall Alert] IPsec Tunnel Status - Tunnel Down: Site-to-Site VPN
To: it_alerts@domain.com
From: sophos-firewall@domain.com
Date: Tue, 14 May 2024 10:07:30 +0300
Body: IPsec tunnel 'BranchOffice_VPN' to 203.0.113.5 is down.

Subject: [Firewall Alert] VPN Connection Established - User: ayse.yilmaz
To: it_alerts@domain.com
From: sophos-firewall@domain.com
Date: Tue, 14 May 2024 10:08:55 +0300
Body: User ayse.yilmaz connected to SSL VPN from 192.0.2.11.
In the representative email flow above, the IPsec Tunnel Status - Tunnel Down alert can easily get lost between two unnecessary VPN connection notifications. Therefore, the “notify everything” approach, rather than providing security, becomes an anti-pattern that kills the signal-to-noise ratio.
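To make the signal-to-noise idea concrete, here is a minimal Python sketch that measures what fraction of a mailbox sample actually calls for action. The keyword list and the function names are assumptions made for illustration; a real setup would classify on structured event types rather than on subject text.

```python
# Hypothetical keywords that mark a subject line as actionable.
ACTIONABLE_KEYWORDS = ("tunnel down", "backup failed")


def is_actionable(subject: str) -> bool:
    """Return True if the subject contains one of the actionable keywords."""
    lowered = subject.lower()
    return any(keyword in lowered for keyword in ACTIONABLE_KEYWORDS)


def signal_ratio(subjects: list[str]) -> float:
    """Fraction of alerts in the sample that call for action."""
    if not subjects:
        return 0.0
    actionable = sum(1 for subject in subjects if is_actionable(subject))
    return actionable / len(subjects)


# The three sample subjects from the email flow above.
subjects = [
    "[Firewall Alert] VPN Connection Established - User: ali.can",
    "[Firewall Alert] IPsec Tunnel Status - Tunnel Down: Site-to-Site VPN",
    "[Firewall Alert] VPN Connection Established - User: ayse.yilmaz",
]
print(f"{signal_ratio(subjects):.2f}")  # prints 0.33
```

Tracking this ratio per mailbox over time gives a rough, measurable indicator of whether notification tuning is helping.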
The Root of the Problem: The “Notify Everything” Approach
So, why is this “notify everything” approach so common? It usually starts with good intentions, with the thought of “let’s not miss anything.” However, there’s a big difference between the ease of accessing information and the meaningful processing of that information. Many security and network devices have the ability to log and report every event by default. This capability, if not configured correctly, turns into a curse rather than a blessing.
Especially in MSP operations, when multiple clients’ firewalls and other infrastructure components send notifications to the same operator, this volume multiplies with every client and device added. When each device reports every action of every user via email, the mailbox eventually becomes unreadable. The cost is not limited to full inboxes, because system resources are consumed as well: a firewall that is constantly sending emails, or a log server that is constantly processing the same type of logs, can eventually suffer performance degradation.
The fundamental flaw of this approach is treating all events with the same priority and channel. A user connecting to a VPN does not have the same urgency as a cyberattack attempt or a server’s disks filling up. However, the “notify everything” logic doesn’t make this distinction and presents all events as equally “urgent.” This is similar to a “fire alarm!” constantly ringing in the operator’s mind; after a while, no one takes the alarm seriously.
{
  "timestamp": "2024-05-14T10:05:12Z",
  "event_type": "SSL_VPN_CONNECTION_ESTABLISHED",
  "severity": "INFO",
  "user": "ali.can",
  "source_ip": "192.0.2.10",
  "destination_ip": "10.0.0.1",
  "device_id": "sophos-fw-01"
}
A log record like the one above clearly shows that a VPN connection has been established. This information can be valuable for anomaly detection or auditing, but it’s not an “emergency” that needs to land in the operator’s inbox every time it occurs. Events at the INFO severity level, like this one, should be stored in log collection systems rather than sent by email, and should be queryable when needed. Correctly configuring firewall logging mechanisms is one of the first steps to reduce this noise. For example, in firewalls like Sophos XGS, it’s possible to treat logging and reporting settings separately and limit email notifications to specific critical events.
Classifying and Prioritizing Notification Channels
One of the most effective ways to combat alarm noise is to separate notifications into different channels and priorities based on their importance. Instead of every event arriving through the same channel in the same way, we should create different response mechanisms for different severity levels. This increases efficiency in MSP operations and allows us to focus on real threats.
A simple classification scheme could be:
- Critical Alarms Requiring Action (P1 - High Priority): Situations requiring immediate human intervention.
  - Examples: IPsec tunnel outage, backup failure, critical service down, ransomware detection, high CPU/memory usage (above threshold).
  - Notification Channel: Email (can be supported by SMS or a paging system), a specific Slack/Teams channel, automatic ticket creation.
  - Expected Response: Review and intervention within 15-30 minutes.
- Routine Information for Record-Keeping (P2 - Medium Priority): Events that don’t require immediate intervention but need to be monitored and analyzed later.
  - Examples: VPN connection established/disconnected, user logins/logouts, successful patching operations, slight changes in capacity trends.
  - Notification Channel: Log management system (SIEM), central dashboard (Grafana), weekly/monthly summary reports (batch digest via email).
  - Expected Response: Periodic review, background analysis for anomaly detection.
- General Announcements for Information (P3 - Low Priority): Information that generally reflects system health and has no operational impact.
  - Examples: License renewal warnings, vendor announcements, system performance summaries.
  - Notification Channel: A separate email folder, internal wiki, announcement boards.
  - Expected Response: Information gathering.
This classification reduces the operator’s mental load and ensures that each notification is evaluated in the correct context. When a critical alarm arrives, the operator knows it’s a real problem and needs to intervene quickly.
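# Representative notification routing logic
# send_email, send_sms, create_ticket, and log_to_siem are placeholders
# for whatever the automation platform provides.
def route_notification(event):
    if event["severity"] == "CRITICAL":
        # Needs a human right away: interrupt and open a ticket
        send_email(event, to="ops_team@domain.com")
        send_sms(event, to="on_call_engineer")
        create_ticket(event)
    elif event["severity"] == "WARNING":
        send_email(event, to="ops_team@domain.com")
        log_to_siem(event)
    elif event["severity"] == "INFO":
        log_to_siem(event)
        # Maybe add to a daily digest for review
    else:
        # Unknown severities are recorded but never interrupt anyone
        log_to_siem(event)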
This representative Python code shows a simple logic for how notifications can be routed in an automation platform. Different actions are triggered based on the severity of each event. This is a topic I focused on heavily, especially when developing my own automation platform.
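The three-tier scheme described earlier can also be kept as data rather than as branching logic, which makes it easier to review with a client. The following is only a sketch under assumptions: `BACKUP_FAILED` and `LICENSE_RENEWAL_WARNING` are hypothetical event names, the channel labels are placeholders, and defaulting unknown events to P2 is a design choice of this example, not a rule.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class PriorityClass:
    name: str
    channels: tuple[str, ...]
    # None means "periodic review", i.e. no response deadline.
    respond_within: Optional[timedelta]


P1 = PriorityClass("P1", ("email", "sms", "ticket"), timedelta(minutes=30))
P2 = PriorityClass("P2", ("siem", "weekly_digest"), None)
P3 = PriorityClass("P3", ("announcements_folder",), None)

# Hypothetical mapping; real devices use their own event identifiers.
EVENT_PRIORITY = {
    "IPSEC_TUNNEL_DOWN": P1,
    "BACKUP_FAILED": P1,
    "SSL_VPN_CONNECTION_ESTABLISHED": P2,
    "LICENSE_RENEWAL_WARNING": P3,
}


def classify(event_type: str) -> PriorityClass:
    """Look up the priority class; unknown events fall back to P2
    so they are recorded for review but never page anyone."""
    return EVENT_PRIORITY.get(event_type, P2)


print(classify("IPSEC_TUNNEL_DOWN").channels)
print(classify("SSL_VPN_CONNECTION_ESTABLISHED").name)
```

Keeping the mapping in one table means a change of priority for an event type is a one-line edit instead of a change to routing code.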
The Right Place for Routine Information: Log Management and SIEM
Routine events like “VPN connection established” are certainly not worthless. On the contrary, they are quite valuable as raw data for anomaly detection. What’s wrong is for this raw data to land in a human’s mailbox every time. The right place for such events is a log management system or a SIEM (Security Information and Event Management) platform.
A log management system collects, indexes, and makes queryable all logs from various sources (firewalls, servers, applications, etc.) in a central location. For example, in my own system, I use Loki for log collection and InfluxDB for storing time-series data. Grafana is an indispensable tool for visualizing this data and setting up alarms on it.
When information about a user connecting to a VPN is sent to the SIEM, it is stored there along with all other network activities. This allows for the detection of anomalous situations, such as whether a specific user connected to the VPN at unusual hours or from an unusual IP address. This is a much more powerful and scalable approach than tracking every connection notification via email.
For example, a representative VPN connection log might look like this and should be processed by a SIEM:
<13>May 14 10:05:12 sophos-fw-01 user: ali.can, event: SSL_VPN_CONNECT, src_ip: 192.0.2.10, dst_ip: 10.0.0.1, duration: 0, status: success
This log record can be sent to the SIEM via a syslog collector or directly from the firewall. Then, in a tool like Grafana, specific queries can be run on these logs to check, for example, if there have been more than ten VPN connections from the same user in the last hour. If such a situation is detected, it can be considered an anomaly, and that’s when an email or a more critical notification can be triggered.
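As an illustration of that check outside any particular tool, here is a minimal Python sketch that parses the representative line above and counts connections per user in the last hour. The regular expression only fits this sample format, and `parse_line`, `busy_users`, and the `year` parameter are assumptions of the sketch (the timestamp in the line carries no year).

```python
import re
from collections import Counter
from datetime import datetime, timedelta
from typing import Optional

# Matches "<PRI>Mon DD HH:MM:SS host key: value, key: value, ..."
LINE_RE = re.compile(
    r"^<\d+>(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<fields>.*)$"
)


def parse_line(line: str, year: int) -> Optional[dict]:
    """Parse one log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    # Split "key: value" pairs; pairs without a ": " separator are skipped.
    event = dict(
        pair.split(": ", 1) for pair in match["fields"].split(", ") if ": " in pair
    )
    event["host"] = match["host"]
    # The timestamp carries no year, so the caller supplies one.
    event["timestamp"] = datetime.strptime(
        f"{year} {match['ts']}", "%Y %b %d %H:%M:%S"
    )
    return event


def busy_users(events, now, window=timedelta(hours=1), limit=10):
    """Return {user: count} for users with more than `limit` connections in the window."""
    cutoff = now - window
    counts = Counter(
        e["user"]
        for e in events
        if e.get("event") == "SSL_VPN_CONNECT"
        and "user" in e
        and cutoff <= e["timestamp"] <= now
    )
    return {user: n for user, n in counts.items() if n > limit}


line = ("<13>May 14 10:05:12 sophos-fw-01 user: ali.can, event: SSL_VPN_CONNECT, "
        "src_ip: 192.0.2.10, dst_ip: 10.0.0.1, duration: 0, status: success")
event = parse_line(line, year=2024)
print(event["user"], event["src_ip"])  # prints: ali.can 192.0.2.10
```

Only the users returned by `busy_users` would justify a notification; every other connection stays in the log store.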
Threshold and Anomaly-Based Alarm Design
At the core of an effective monitoring and observability strategy lies intelligent alarm design. Instead of reporting every event, we should generate alarms when specific thresholds are exceeded or when an unusual situation (anomaly) is detected. This is key to increasing the “signal-to-noise” ratio.
For example, for VPN connections, instead of “every connection,” we can focus on:
- Threshold-Based Alarm: A certain number (e.g., more than 5) of failed VPN login attempts from the same user within a specific time frame (e.g., 5 minutes). This could indicate a brute-force attack or an account compromise attempt.
- Anomaly-Based Alarm: A user successfully connecting to the VPN outside their usual connection hours or from a geographical location they have never connected from before. This could indicate the use of stolen credentials.
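The anomaly-based case can be sketched in a few lines of Python. This is a deliberately naive baseline under stated assumptions: `UserBaseline` and `check_login` are hypothetical names, and the `country` value is assumed to have been resolved from the source IP by some upstream enrichment step that is not shown here.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class UserBaseline:
    """What has been observed for one user so far."""
    usual_hours: set[int] = field(default_factory=set)  # hours of day, 0-23
    countries: set[str] = field(default_factory=set)    # country codes seen before


def check_login(baseline: UserBaseline, when: datetime, country: str) -> list[str]:
    """Return the reasons a successful login looks unusual (empty list if none)."""
    reasons = []
    # An empty baseline means there is nothing to compare against yet.
    if baseline.usual_hours and when.hour not in baseline.usual_hours:
        reasons.append(f"login at unusual hour {when.hour:02d}:00")
    if baseline.countries and country not in baseline.countries:
        reasons.append(f"first login from {country}")
    return reasons


def learn(baseline: UserBaseline, when: datetime, country: str) -> None:
    """Fold an accepted login into the baseline."""
    baseline.usual_hours.add(when.hour)
    baseline.countries.add(country)


baseline = UserBaseline(usual_hours=set(range(8, 19)), countries={"TR"})
print(check_login(baseline, datetime(2024, 5, 14, 10, 5), "TR"))
# prints []
print(check_login(baseline, datetime(2024, 5, 15, 3, 12), "DE"))
# prints ['login at unusual hour 03:00', 'first login from DE']
```

A notification would be raised only when the returned list is non-empty, so ordinary daytime logins from a known location never reach the operator.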
Such alarms transform raw data into meaningful security intelligence. Log management systems and visualization tools (like Grafana) allow us to easily configure these threshold and anomaly-based alarms. For example, in Grafana, I can trigger an alarm when certain conditions occur using an InfluxDB or Loki query.
# Grafana Alert Rule (pseudo-code)
alert_rule:
  name: "Multiple Failed VPN Logins for a User"
  for: "5m"
  query: |
    sum by (user) (
      increase(vpn_login_attempts_failed_total{type="ssl_vpn"}[5m])
    ) > 5
  annotations:
    summary: "Multiple failed VPN login attempts detected for {{ $labels.user }}"
    # The query aggregates by user only, so no source_ip label is available here
    description: "User {{ $labels.user }} has failed to log in to VPN more than 5 times in the last 5 minutes"
  labels:
    severity: "critical"
    alertgroup: "security"
  channels:
    - "email_ops_team"
    - "slack_security_channel"
The representative Grafana alert rule above raises a critical alarm when a specific user has more than 5 failed VPN login attempts within a 5-minute window and, because of the for: "5m" setting, that condition keeps holding for a further 5 minutes. This is a much more valuable and action-oriented alert than emails reporting every successful VPN connection. Such a configuration reduces the false-positive rate and helps draw the operator’s attention to truly important events. Correctly setting threshold values and anomaly detection rules requires continuous trial and error and operational experience.
Practical Steps and Implementation Tips
There are concrete steps that can be taken to reduce alarm noise and improve the signal-to-noise ratio in MSP operations. These steps include not only technical configurations but also operational processes.
- Review Firewall Logging and Notification Settings:
  - In firewalls like Sophos XGS, configure email notifications only for CRITICAL and WARNING level events. Route INFO level events to a log server instead of email.
  - Disable email notifications for events like VPN connection established/disconnected. Instead, send these events to a log management system using Syslog or SNMP traps.
- Implement a Centralized Log Management Solution:
  - Collect logs from all firewalls, servers (Windows Server/AD, Linux systemd/journald), network devices, and applications into a central platform (e.g., Loki, Splunk, ELK Stack).
  - Indexing and querying logs in this platform forms the basis for anomaly detection.
- Create Monitoring and Visualization Dashboards:
  - Visualize collected log data using a tool like Grafana. Create dedicated dashboards for events such as VPN connections, failed login attempts, and firewall policy violations.
  - These dashboards are much more effective than email for real-time status tracking and trend analysis.
- Define Threshold and Anomaly-Based Alert Rules:
  - In your log management system or visualization tools like Grafana, create alerts that trigger when specific thresholds are exceeded or anomalies are detected.
  - Route these alerts to email, Slack, or a paging system only for critical events.
- Periodic Reporting and Auditing:
  - For routine events (e.g., VPN connections), generate monthly or weekly summary reports, aggregating them on a per-client basis. These reports contain important information for auditing and compliance but do not interrupt the operator’s daily workflow.
  - Monitor changes and access controls on Active Directory and file servers with tools like Netwrix to automate compliance reporting. This is critical for detecting issues like “privilege creep.”
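The per-client summary report from the last step can be produced with very little code. The sketch below assumes each routine event has already been tagged with a `client` name and an `event_type`; both field names, and the `build_digest` function, are assumptions for illustration.

```python
from collections import Counter, defaultdict


def build_digest(events: list[dict]) -> str:
    """Aggregate routine events into a plain-text summary grouped by client."""
    per_client: dict[str, Counter] = defaultdict(Counter)
    for event in events:
        per_client[event["client"]][event["event_type"]] += 1

    lines = []
    for client in sorted(per_client):
        lines.append(f"{client}:")
        for event_type, count in sorted(per_client[client].items()):
            lines.append(f"  {event_type}: {count}")
    return "\n".join(lines)


# Hypothetical week of routine events for two clients.
events = [
    {"client": "acme", "event_type": "SSL_VPN_CONNECTION_ESTABLISHED"},
    {"client": "acme", "event_type": "SSL_VPN_CONNECTION_ESTABLISHED"},
    {"client": "acme", "event_type": "USER_LOGIN"},
    {"client": "beta", "event_type": "SSL_VPN_CONNECTION_ESTABLISHED"},
]
print(build_digest(events))
```

One such digest per week replaces what would otherwise be one email per event, while the underlying records remain available in the log platform for audits.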
By implementing these steps, I provide a cleaner, more action-oriented monitoring environment in the infrastructures of the clients I manage as an MSP. My main goal is not just to collect more data, but to deliver the right data at the right time, to the right channel.
Conclusion
The alarm noise we encounter in MSP operations is not a simple annoyance; it’s a problem with serious implications for security and operational efficiency. Seemingly innocent notifications, such as emails for every successful VPN connection, can eventually lead to critical signals being lost and operators missing important threats.
To deal with this situation, we must abandon the “notify everything” mindset and classify notifications by importance and urgency. Interrupting channels (email, SMS) should be reserved for critical events that require immediate action, while routine information belongs in log management systems (SIEM) and visualization dashboards. With threshold and anomaly-based alarm design, we can transform raw data into meaningful security intelligence, drawing operators’ attention to what truly matters.
Let’s remember that more data doesn’t always mean better security. What’s important is to present the right data at the right time, to the right person, and in the right context. This approach will both reduce the fatigue of our teams and make our clients’ infrastructures more secure and uninterrupted. For me, this is a matter of balance that we constantly need to optimize.