When your phone rings at midnight, or the same alarm keeps hitting your inbox, your first reaction is usually to look directly at what the alarm is pointing to. If it says “backup failed,” you rush to the backup software. But even when the alarm itself is correct, it doesn’t always tell the whole story. The real problem might be hidden much deeper, in a different layer.
The bad news is, I encounter these kinds of scenarios frequently. An incident we experienced with a manufacturing client, which at first glance seemed like just a backup problem, was a great example of how critical inter-system correlation is. One night, suddenly, “Backup failed” alerts started pouring in from Acronis.
A Night Between Alarm Noise and the Real Problem
The error message from the Acronis alerts that night was quite clear: “Cannot connect to the machine where network share … is located. The machine may be unavailable.” This meant that the SMB network share, which we had defined as the backup target, was unreachable. This was a critical part of the backup plan; the nightly backup of the database server on our Hyper-V host was being written as a .tibx archive to this remote SMB share.
My first thought, naturally, was that there might be a problem with the server hosting the network share. Perhaps it was offline, or its network connection was down. However, just as I was preparing to check the connections, another email arrived from Sophos Central within the same few minutes. This was an alert that the IPsec VPN tunnel on the firewall was “down.”
It was at this exact point, despite my fatigue, that a light bulb went on in my head. The two alarms had to have something in common. This IPsec tunnel, as its name suggested, was dedicated specifically to backup traffic. This meant that the only way to reach the remote SMB share, which was the backup target, was through this tunnel.
Uncovering the Root Cause: Making the Connection
The picture began to clear. The chain of events was as follows:
- The IPsec tunnel went down. This was the first event in the chain, reported by Sophos Central, even though its email reached me after the Acronis alerts.
- The remote SMB target became inaccessible. Because the tunnel was down, the path to the network share, which was the backup target, was blocked.
- Acronis reported an error: “cannot connect to the share.” Unable to reach the SMB target, Acronis naturally could not complete the backup operation and generated a “Backup failed” error.
As you can see, there was no problem with the backup software itself. Acronis tried to do its job but failed because a fundamental network dependency had disappeared. If the Sophos Central alert hadn’t arrived, or if I hadn’t correlated the two alarms, I would probably have spent hours needlessly debugging the backup server and the machine hosting the SMB share.
This situation showed me once again that no matter how many alarms your monitoring systems generate, if you cannot establish a meaningful correlation between these alarms, you are essentially just generating noise. Finding the root cause can become like looking for a needle in a haystack.
Lessons and Pragmatic Approaches
There are several important lessons to be learned from this incident, and all of them align with my field experience:
Independent Monitoring of Dependencies is Essential
Putting backup traffic into a separate IPsec tunnel was a good design decision in terms of security and network segmentation. This separated backup traffic from other business traffic, reducing potential security risks. However, this design came with a cost: that tunnel had now become the backup’s “sole dependency.” If the tunnel wasn’t monitored, it meant the backup wasn’t truly being monitored either. The fundamental infrastructure components required for a successful backup plan (network connectivity, storage access, etc.) must be monitored directly and independently.
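As a minimal sketch of what “monitored directly and independently” could look like, the snippet below probes the backup target over TCP without involving the backup software at all. The dependency name and the address in the usage comment are hypothetical placeholders, not values from this incident; the only fixed detail is that SMB normally listens on TCP port 445. A successful TCP connect only shows that the path through the tunnel is open, not that the share itself is healthy.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def check_dependencies(deps: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Probe each named dependency and report whether it is reachable."""
    return {name: tcp_reachable(host, port) for name, (host, port) in deps.items()}


# Example usage (hypothetical name and address; 445 is the standard SMB port):
#   check_dependencies({"backup-smb-share": ("192.0.2.10", 445)})
```

Run on a schedule from the backup source side, a probe like this would raise its own alert when the path to the share disappears, independently of whether a backup job happens to be running.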
Alarm Correlation Must Be Automatic or Human-Driven
When alarms from multiple monitoring sources (firewall, backup, hypervisor, storage, etc.) scream independently, the root cause often becomes invisible. In such cases, correlation must either be established in the mind of an operator like me, which is a tiring and error-prone process, or it must be done automatically by a purpose-built triage layer or a SIEM-like solution. In small and medium-sized business environments, it’s not always possible to implement a comprehensive SIEM solution, so the ability to correlate manually and a sound monitoring strategy become even more important.
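A lightweight triage layer does not have to be a full SIEM. One possible approach, sketched here under my own assumptions, is to keep a small hand-maintained dependency map and, whenever several components are alerting at once, treat as likely root causes only those that have no alerting dependency upstream of them. The component names are hypothetical labels, and nothing here talks to Acronis or Sophos Central; it only reasons over names you feed it.

```python
def upstream(component: str, depends_on: dict[str, set[str]]) -> set[str]:
    """Return every direct and transitive dependency of a component."""
    seen: set[str] = set()
    stack = list(depends_on.get(component, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, ()))
    return seen


def find_root_causes(
    alerting: set[str], depends_on: dict[str, set[str]]
) -> dict[str, set[str]]:
    """Map each likely root cause to the alerting components that depend on it.

    A component counts as a likely root cause when it is alerting and none of
    its own dependencies are alerting.
    """
    alerting_upstream = {c: upstream(c, depends_on) & alerting for c in alerting}
    roots: dict[str, set[str]] = {
        c: set() for c, ups in alerting_upstream.items() if not ups
    }
    for component, ups in alerting_upstream.items():
        for dep in ups:
            if dep in roots:
                roots[dep].add(component)
    return roots


# Hypothetical dependency map modeled on the incident described above.
DEPENDS_ON = {
    "acronis-backup": {"smb-share"},
    "smb-share": {"ipsec-backup-tunnel"},
}
```

With this map, `find_root_causes({"acronis-backup", "ipsec-backup-tunnel"}, DEPENDS_ON)` points at the tunnel and lists the backup job as its symptom, which is the same conclusion I had to reach by hand that night.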
A “Backup Failed” Alarm Can Be Misleading
Focusing solely on the “backup failed” alarm can lead you to misidentify the source of the problem. In the case I experienced, the problem was not with the backup software itself, but with a fundamental dependency of the backup process. Therefore, when triaging backup errors, it’s necessary to look not only at the error message but also at events occurring in other layers of the system (network, storage, server) within the same timeframe.
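The “same timeframe” check can be made mechanical. The sketch below assumes alerts from every source have already been collected into one list of simple records; the `Event` fields, the layer labels, and the ten-minute default window are all illustrative choices of mine, not something any of the products involved provides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    source: str      # e.g. "Acronis" or "Sophos Central"
    layer: str       # e.g. "backup", "network", "storage"
    message: str
    at: datetime


def events_near(
    failure: Event,
    events: list[Event],
    window: timedelta = timedelta(minutes=10),
) -> list[Event]:
    """Return events from other layers within `window` of the failure, oldest first."""
    nearby = [
        e
        for e in events
        if e.layer != failure.layer and abs(e.at - failure.at) <= window
    ]
    return sorted(nearby, key=lambda e: e.at)
```

Given the backup failure as `failure`, a tunnel-down event from the network layer a few minutes earlier would come back at the top of the list, while older or same-layer events are filtered out.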
These kinds of experiences remind me that IT operations require not only technical knowledge but also attention to detail and the ability to establish relationships between systems. Sometimes, behind the simplest-looking problems, there can be a complex chain affecting multiple layers.
Do you have your own stories of “alarm noise” and root-cause hunting? What was your most interesting correlation discovery? Share it in the comments and let’s learn from each other.