SSH access was suddenly cut off. There was no panic, just an unexpected disruption in an otherwise working system, but situations like this can be stressful even for experienced system administrators. Every such challenge carries important lessons, though. A few weeks ago, an incident that occurred while we were working on the OpenSSH service configuration on our Windows servers painfully demonstrated how a simple mistake can lead to significant consequences.
In this post, I will recount firsthand how a single line added to the sshd_config file completely cut off SSH access on multiple servers, how we fell into this error, and most importantly, the lasting lessons we learned from it. This is not just about a technical solution, but also a story of how we made our operational processes more robust.
Unexpected Obstacles: The Hidden Danger of a New ListenAddress Line
It all started when we edited the sshd_config file to bind incoming SSH connections to our servers to a specific IP address. Following standard procedures, we added a new ListenAddress line to the end of the file. However, a detail we were unaware of at the time would lead to a major outage in the hours that followed.
# Example sshd_config content (simplified)
...
# Existing settings
...
Match Group administrators
# ListenAddress is invalid within this Match block!
ListenAddress 192.168.1.100
PasswordAuthentication yes
...
In sshd_config, only a subset of directives is valid inside a Match block, and ListenAddress is not one of them. A Match block also has no closing marker: it extends to the next Match line or to the end of the file. Because our file ended with a Match Group administrators block, the ListenAddress line we appended last landed inside that block. This simple placement error prevented the SSH daemon (sshd) from starting at all, and every attempt to connect to the servers via SSH failed.
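The failure mode above can be caught before deployment with a small lint step. The following is a minimal sketch, not the check we actually ran: the function name `find_misplaced_directives` and the `GLOBAL_ONLY` set are illustrative assumptions, and the set is deliberately non-exhaustive (the sshd_config manual lists which keywords are allowed after Match).

```python
import re

# Illustrative, non-exhaustive set of keywords that sshd accepts only in the
# global section (assumption: extend it from the sshd_config manual).
GLOBAL_ONLY = {"listenaddress", "port", "addressfamily", "hostkey"}


def find_misplaced_directives(config_text):
    """Return (line_number, keyword, match_line) for global-only keywords after a Match line."""
    problems = []
    current_match = None  # text of the Match line we are currently under, if any
    for number, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no directive
        # Keyword and argument are separated by whitespace or an "=" sign.
        token = re.split(r"[\s=]+", line, maxsplit=1)[0]
        if token.lower() == "match":
            # A Match block runs until the next Match line or the end of the file.
            current_match = line
        elif current_match is not None and token.lower() in GLOBAL_ONLY:
            problems.append((number, token, current_match))
    return problems


SAMPLE = """\
PasswordAuthentication no
Match Group administrators
    ListenAddress 192.168.1.100
    PasswordAuthentication yes
"""

for number, keyword, match_line in find_misplaced_directives(SAMPLE):
    print(f"line {number}: {keyword} is inside '{match_line}'")
```

A check like this only complements sshd's own configuration test; it exists to give a clearer message about where the offending line sits.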
Analysis Instead of Panic: Monitoring the Wrong Service
When our access was cut off, our first reaction was to ask, “Why isn’t SSH working?” We began investigating with a colleague who had physical or remote console access to the server. During the initial checks, we saw that the ssh-agent service was up. For a moment, this gave us the impression that everything was fine.
However, ssh-agent and sshd (the SSH daemon) are different services. ssh-agent holds private keys for client-side authentication, while sshd is the service that listens for incoming SSH connections. A running ssh-agent says nothing about whether sshd is healthy. This confusion was the second important factor that delayed the diagnosis. Because we were looking at the wrong service, we overlooked the fact that sshd itself was not starting.
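To make the distinction concrete, here is a minimal sketch of a health check that looks only at sshd. It assumes the services are registered under the names `sshd` and `ssh-agent`, that `sc query` prints its usual English-language `STATE` line, and that the helper names (`parse_sc_state`, `service_state`, `sshd_is_healthy`) are hypothetical.

```python
import re
import subprocess


def parse_sc_state(sc_output):
    """Extract the state name (e.g. RUNNING, STOPPED) from `sc query` output, or None."""
    # Assumption: English-locale output with a line such as
    # "        STATE              : 4  RUNNING"
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None


def service_state(name):
    """Ask the Windows service controller for one service's state (Windows only)."""
    result = subprocess.run(["sc", "query", name], capture_output=True, text=True)
    return parse_sc_state(result.stdout)


def sshd_is_healthy(state_of=service_state):
    """Report inbound SSH as healthy only when the sshd service itself is running."""
    # ssh-agent is deliberately not consulted: it can be RUNNING while sshd is down.
    return state_of("sshd") == "RUNNING"
```

On a server, `sshd_is_healthy()` would query the real service controller; passing a different `state_of` function makes the logic easy to exercise without Windows.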
Lessons Learned and Permanent Solutions
The lessons we learned from this incident enabled us to fundamentally change our operational processes to prevent similar errors in the future. To avoid encountering such a situation again, we standardized the following rules:
- Configuration Priority: Any changes made to critical service configuration files like sshd_config should always be made at the beginning of the file, before the first Match block. This eliminates the risk of accidentally adding lines inside Match blocks.
- Mandatory Configuration Test: After making changes to any service’s configuration file, a configuration test must be performed using sshd.exe -T (or the relevant service’s test command) before restarting the service. If the test is unsuccessful, the restart operation will not be performed until the changes are reverted. This eliminates the risk of the service failing to start.
- Correct Service Monitoring: In system health checks, it must be clearly defined which service is being monitored. By monitoring the status of the sshd service instead of ssh-agent, we can accurately determine what is actually running and what is not.
- Encoding Control: Especially in a Windows environment, care must be taken with character encoding when saving configuration files. A compatible encoding like ASCII should be used. Different encodings can prevent services from reading the file.
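The Mandatory Configuration Test rule can be sketched as a small gate that refuses to restart the service when the test fails. This is an illustration under assumptions, not our production code: the function name `validate_then_restart` is hypothetical, `sshd.exe` is assumed to be on the PATH, and the service is assumed to be named `sshd`. The sketch uses `-t` (test mode); `-T`, the extended test mode, additionally prints the effective configuration and works as a gate in the same way.

```python
import subprocess


def validate_then_restart(config_path, sshd_exe="sshd.exe", run=subprocess.run):
    """Restart sshd only if the configuration passes sshd's own test mode.

    Returns (True, "") after a restart, or (False, error_text) if the test failed.
    """
    # `-t` only checks the configuration for validity; `-f` names the file to check.
    check = run([sshd_exe, "-t", "-f", config_path], capture_output=True, text=True)
    if check.returncode != 0:
        # Leave the running service untouched and surface sshd's error message.
        return False, check.stderr.strip()
    # Assumption: the Windows service is registered under the name "sshd".
    run(["powershell", "-NoProfile", "-Command", "Restart-Service sshd"], check=True)
    return True, ""
```

The `run` parameter exists so that the gate can be exercised with a stand-in for `subprocess.run`; in normal use it is left at its default.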
We not only adopted these rules but also built the checks into our automation code. Configuration changes are now submitted as a pull request (PR) and deployed only after passing automated tests. This both reduces the risk of human error and provides a transparent record of every change.
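As one example of a check that fits into such a pipeline, the encoding rule is easy to automate. The sketch below is an assumption about how it could look rather than a copy of our pipeline; `encoding_problem` and `BOMS` are hypothetical names. It treats the file as acceptable only when it is plain ASCII without a byte order mark.

```python
import codecs

# Byte order marks that indicate the file was not saved as plain ASCII.
BOMS = {
    codecs.BOM_UTF8: "UTF-8 with BOM",
    codecs.BOM_UTF16_LE: "UTF-16 LE",
    codecs.BOM_UTF16_BE: "UTF-16 BE",
}


def encoding_problem(data):
    """Describe the encoding problem in `data` (bytes), or return None for plain ASCII."""
    for bom, label in BOMS.items():
        if data.startswith(bom):
            return f"file starts with a {label} byte order mark"
    try:
        data.decode("ascii")
    except UnicodeDecodeError as error:
        # error.start is the offset of the first byte that is not ASCII.
        return f"non-ASCII byte at offset {error.start}"
    return None
```

In a pipeline step, the function would receive the raw bytes of the file, for example `encoding_problem(pathlib.Path("sshd_config").read_bytes())`, and any non-None result would fail the build.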
Conclusion: A Simple Mistake, A Profound Lesson
A single line added to the sshd_config file completely cutting off SSH access might initially seem like a minor technical glitch. However, this incident reminded us once again how critical every “simple” detail can be and how robust our operational processes need to be.
After this experience, we became more careful with configuration management, made testing a mandatory step, and started monitoring our systems more intelligently. We must remember that in the IT world, the biggest lessons are often learned in the most unexpected moments and from the simplest mistakes. Have you had similar experiences? What was your most expensive mistake, and what did you learn from it? You can enrich this conversation by sharing in the comments.