Windows Server Domain Health Checklist

A significant portion of insidious errors encountered in Active Directory (AD) environments stem not from sudden critical crashes, but from replication, DNS, and synchronization irregularities accumulating in the background over months. When a domain controller (DC) goes offline or a Group Policy Object (GPO) in the SYSVOL folder fails to reach clients, the problem often didn’t start that day, but months earlier due to a faulty DNS configuration or an incomplete replication.

At ITWISE, in the system infrastructures we design and manage, we approach Active Directory health (Domain Health) not as a reactive problem-solving process, but as a proactive routine. In this guide, I present an in-depth checklist covering six critical areas that form the backbone of the Windows Server Active Directory architecture, including command-line and PowerShell-based verifications that you can directly apply in the field.

1. DNS Integration and Name Resolution Verification

The heart of the Active Directory architecture is the DNS service. Even a brief disruption in DNS resolution can leave DCs unable to find each other, prevent clients from joining the domain, and bring replication to a complete halt. In AD-integrated DNS zones, each DC should also use its own DNS service for name resolution, but the position of the loopback address (127.0.0.1) in the DNS server list has to be chosen carefully.

One of the most common mistakes is entering 127.0.0.1 as the primary DNS server on the DC’s network interface card (NIC) and leaving the secondary DNS field blank. During server startup, in the seconds before the DNS service is up, the DC then has no server to query, which delays the start of AD services. A DC that resolves names only through itself can also become a DNS “island” that cannot locate its replication partners. The correct practice in a multi-DC environment is to enter the IP address of another reliable DC as the primary DNS server and define the loopback address as the secondary.

The first step in verifying DNS health is to check that the critical SRV records exist in DNS. Clients find domain controllers via these SRV records. You can verify that the LDAP and Kerberos service records resolve with the following nslookup commands:

nslookup -type=SRV _ldap._tcp.dc._msdcs.itwise.local
nslookup -type=SRV _kerberos._tcp.dc._msdcs.itwise.local

The output should contain an SRV record for every active Domain Controller in your environment, together with each DC’s hostname and IP address (A record). Missing records, or stale records left behind by DCs that were removed long ago but never cleaned out of DNS, indicate a permission or synchronization issue with DNS dynamic updates (Secure Dynamic Updates).
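The comparison between the SRV records and the DCs you expect can be automated. The following is a minimal Python sketch, assuming the nslookup output has been captured as text and that it uses the `svr hostname = ...` lines printed by the Windows nslookup client. The sample text, the host names, and the function names `srv_targets` and `compare_srv` are illustrative assumptions, not part of any tool.

```python
import re

# Illustrative nslookup output; the host names are placeholders.
SAMPLE_NSLOOKUP = """\
_ldap._tcp.dc._msdcs.itwise.local   SRV service location:
          priority       = 0
          weight         = 100
          port           = 389
          svr hostname   = dc01.itwise.local
_ldap._tcp.dc._msdcs.itwise.local   SRV service location:
          priority       = 0
          weight         = 100
          port           = 389
          svr hostname   = dc03.itwise.local
"""

SRV_HOST = re.compile(r"^\s*svr hostname\s*=\s*(\S+)", re.IGNORECASE | re.MULTILINE)


def srv_targets(nslookup_output):
    """Return the lower-cased host names found on 'svr hostname' lines."""
    return {m.group(1).rstrip(".").lower() for m in SRV_HOST.finditer(nslookup_output)}


def compare_srv(nslookup_output, expected_dcs):
    """Report DCs without an SRV record and SRV records without a known DC."""
    found = srv_targets(nslookup_output)
    expected = {name.lower() for name in expected_dcs}
    return {
        "missing": sorted(expected - found),      # DC exists, record does not
        "unexpected": sorted(found - expected),   # record exists, DC unknown (possibly stale)
    }


if __name__ == "__main__":
    print(compare_srv(SAMPLE_NSLOOKUP, ["dc01.itwise.local", "dc02.itwise.local"]))
```

In this sample, dc02 is reported as missing and dc03 as unexpected, which are the two cases worth investigating: a DC that failed to register its records, and a record that outlived its DC.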

2. Replication Status and Synchronization Between DCs

In environments with multiple Domain Controllers, the contents of the Active Directory database (NTDS.dit) must be replicated consistently among all DCs. The replication architecture relies on a topology generated automatically by the KCC (Knowledge Consistency Checker). However, interruptions in intersite or intrasite replication can lead to database inconsistencies over time.

The most powerful tool we have to check replication health is the repadmin command. Open the command prompt as an administrator on any DC and use the following command to get a general replication summary:

repadmin /replsummary

The output summarizes replication attempts for every DC, once as a source and once as a destination: the number of failed attempts, the total number of attempts, and the longest time elapsed since the last successful replication (largest delta). A healthy output looks like this:

Source DSA              largest delta    fails/total %%   error
 DC01                      18m:42s         0 /   5    0
 DC02                      12m:10s         0 /   5    0

Destination DSA         largest delta    fails/total %%   error
 DC01                      12m:10s         0 /   5    0
 DC02                      18m:42s         0 /   5    0

If you see a non-zero value in the fails column, you should run the following command to identify detailed replication errors and which replication partner has the issue:

repadmin /showrepl

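The two most common errors point in different directions. RPC Server is Unavailable (Error Code: 1722) usually stems from name resolution problems or from firewall rules between DCs that block the RPC endpoint mapper (TCP 135) or the RPC dynamic port range (TCP 49152-65535). Access Denied (Error Code: 5) usually indicates a broken secure channel, for example when the DC’s computer account password is no longer in sync.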
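The summary table lends itself to automated checking. Below is a minimal Python sketch, assuming the `repadmin /replsummary` output has been captured as text and that delta values use the `NNh:NNm:NNs` style shown above. The sample rows (including the failing ones), the 30-minute threshold, and the function names are illustrative assumptions. Rows whose delta is printed in another form are simply skipped by this sketch.

```python
import re

# Illustrative output; the failing rows are made up for the example.
SAMPLE = """\
Source DSA          largest delta    fails/total %%   error
 DC01                      18m:42s    0 /   5    0
 DC02                  01h:12m:10s    2 /   5   40  (1722) The RPC server is unavailable.

Destination DSA     largest delta    fails/total %%   error
 DC01                  01h:12m:10s    2 /   5   40  (1722) The RPC server is unavailable.
 DC02                      18m:42s    0 /   5    0
"""

ROW = re.compile(r"^\s*(?P<dc>\S+)\s+(?P<delta>\S+)\s+(?P<fails>\d+)\s*/\s*(?P<total>\d+)\s+\d+")
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def delta_seconds(text):
    """Convert a delta such as '01h:12m:10s' into seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([dhms])", text))


def parse_replsummary(output):
    """Return one dict per DC row, tagged with the table it came from."""
    rows, direction = [], None
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Source"):
            direction = "source"
        elif stripped.startswith("Destination"):
            direction = "destination"
        else:
            m = ROW.match(line)
            if m and direction:
                rows.append({
                    "direction": direction,
                    "dc": m.group("dc"),
                    "delta_s": delta_seconds(m.group("delta")),
                    "fails": int(m.group("fails")),
                    "total": int(m.group("total")),
                })
    return rows


def problem_rows(rows, max_delta_s=1800):
    """Rows with failed attempts or a delta above the threshold (30 minutes here)."""
    return [r for r in rows if r["fails"] > 0 or r["delta_s"] > max_delta_s]


if __name__ == "__main__":
    for row in problem_rows(parse_replsummary(SAMPLE)):
        print(row["direction"], row["dc"], row["delta_s"], row["fails"])
```

Any row this reports is a candidate for a closer look with `repadmin /showrepl` on the DC in question.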

3. FSMO Role Distribution and Verification

Although Active Directory has a multi-master database structure, some critical operations must be managed by a single server. These roles are called FSMO (Flexible Single Master Operations) roles. There are a total of five roles: two forest-wide (Schema Master, Domain Naming Master) and three domain-wide (RID Master, PDC Emulator, Infrastructure Master).

Knowing which DCs host the FSMO roles and verifying that they are reachable is essential for system stability. The health of the PDC Emulator role is particularly important because it is the root of time synchronization (NTP) in the domain. Clients obtain their time from the DC that authenticates them, and the other DCs in the domain obtain their time from the DC holding the PDC Emulator role.

We can use PowerShell or the traditional netdom tool to quickly check the distribution of FSMO roles:

netdom query fsmo

This command lists which DCs hold the five roles. Make sure that the DC hosting the PDC Emulator role synchronizes its time from a reliable external NTP server (e.g., pool.ntp.org); all other DCs should pull their time from this PDC through the domain hierarchy. To verify the time source on the PDC, run the following command:

w32tm /query /source

If the output shows Local CMOS Clock or Free-running System Clock, the PDC Emulator is running on its own hardware clock and is not synchronizing with any external source. As that clock drifts, the time difference between machines can exceed the default Kerberos tolerance of 5 minutes, at which point Kerberos tickets are rejected and authentication errors follow.
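Both checks in this section are easy to express in code. The following is a minimal Python sketch, assuming the text printed by `w32tm /query /source` has already been captured and that two clock readings are available for comparison. The function names and the sample source string `pool.ntp.org,0x8` are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Source strings that mean "no external time source", as described above.
UNSYNCED_SOURCES = ("local cmos clock", "free-running system clock")

# Default Kerberos tolerance for clock skew.
MAX_KERBEROS_SKEW = timedelta(minutes=5)


def is_unsynced_source(w32tm_source_output):
    """True if the reported source is the local clock rather than an NTP peer."""
    text = w32tm_source_output.strip().lower()
    return any(marker in text for marker in UNSYNCED_SOURCES)


def skew_exceeds_limit(dc_time, reference_time, limit=MAX_KERBEROS_SKEW):
    """True if the two clocks differ by more than the allowed skew."""
    return abs(dc_time - reference_time) > limit


if __name__ == "__main__":
    print(is_unsynced_source("Local CMOS Clock"))       # local clock only
    print(is_unsynced_source("pool.ntp.org,0x8"))       # external NTP peer
    reference = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
    drifted = reference + timedelta(minutes=6)
    print(skew_exceeds_limit(drifted, reference))
```

The skew check uses an absolute difference, since a DC that runs behind is as much of a problem for Kerberos as one that runs ahead.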

4. SYSVOL and GPO Replication Health (DFSR)

The SYSVOL folder is a shared directory that must be present on every DC; Group Policies (GPOs) and logon scripts are distributed to clients from it. Since Windows Server 2008, DFSR (Distributed File System Replication) has been used for SYSVOL replication instead of the older FRS (File Replication Service). If your infrastructure was upgraded from an older domain and still uses FRS, you should treat the migration to DFSR (performed with the dfsrmig tool) as a priority.

Checking the status of DFSR replication and whether there are any files waiting in the queue (backlog) is the most definitive way to understand if GPO changes are being reflected on all DCs. To check DFSR health, we can use the Get-DfsrBacklog command in PowerShell or the dfsrdiag tool from the command line:

Get-DfsrBacklog -SourceComputerName "DC01" -DestinationComputerName "DC02" -GroupName "Domain System Volume"

If this command produces no output (i.e., the backlog is zero), your SYSVOL replication is completely up-to-date. If there are files accumulating in the queue, the DFSR database may be corrupted, or the service may be paused due to insufficient disk space.
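For scheduled checks, the text output of `dfsrdiag backlog` is often easier to capture than PowerShell objects. The following is a minimal Python sketch that extracts the backlog count from such output. The two message formats in the sample strings are an assumption about how dfsrdiag words its result and should be compared against the output in your own environment; the function name `backlog_count` is hypothetical.

```python
import re

# Assumed wording of dfsrdiag backlog results; verify against real output.
SAMPLE_IN_SYNC = "No Backlog - member <DC02> is in sync with partner <DC01>"
SAMPLE_BACKLOG = """\
Member <DC02> Backlog File Count: 12
Backlog File Names (first 12 files)
"""

COUNT = re.compile(r"Backlog File Count:\s*(\d+)", re.IGNORECASE)


def backlog_count(dfsrdiag_output):
    """Return the backlog size, 0 when in sync, or None if the text is unrecognized."""
    match = COUNT.search(dfsrdiag_output)
    if match:
        return int(match.group(1))
    if "no backlog" in dfsrdiag_output.lower():
        return 0
    return None  # treat as "unknown" rather than guessing


if __name__ == "__main__":
    print(backlog_count(SAMPLE_IN_SYNC))
    print(backlog_count(SAMPLE_BACKLOG))
    print(backlog_count("unexpected text"))
```

Returning `None` for unrecognized text is deliberate: a monitoring job should report "could not determine" separately from "backlog is zero", otherwise a failing command looks like a healthy replica.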

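Examine the DFSR service event logs (Event Viewer -> Applications and Services Logs -> DFS Replication) for Event ID 2213 or Event ID 4012 errors. Event ID 4012 indicates that replication of the folder has stopped because the member was disconnected from its partners for too long. Event ID 2213 indicates that DFSR has put itself into a protected state after an unexpected power outage or server shutdown. In that case, you need to resume replication manually for the affected volume; the volume GUID to use is given in the text of the 2213 event (shown here as a placeholder):

wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig where volumeGuid="<volume GUID from Event 2213>" call ResumeReplication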

5. Active Directory Database (NTDS.dit) and Disk Health

The physical database file of Active Directory, ntds.dit, and its transaction logs are stored by default under the C:\Windows\NTDS directory. If the disk containing this directory fills up, it can lead to sudden database corruption and a complete halt of AD services (Active Directory Domain Services - NTDS).
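Free space on the volume that holds NTDS.dit is simple to watch. The following is a minimal Python sketch using the standard library; the 20% warning and 10% critical thresholds and the function names are illustrative assumptions, not recommended values from any vendor.

```python
import os
import shutil


def free_percent(total_bytes, free_bytes):
    """Free space as a percentage of the volume size."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def disk_status(total_bytes, free_bytes, warn_below=20.0, crit_below=10.0):
    """Classify free space as 'ok', 'warning', or 'critical' (thresholds are assumptions)."""
    pct = free_percent(total_bytes, free_bytes)
    if pct < crit_below:
        return "critical"
    if pct < warn_below:
        return "warning"
    return "ok"


if __name__ == "__main__":
    path = r"C:\Windows\NTDS"
    if not os.path.exists(path):
        path = "."  # fall back so the sketch also runs on a machine that is not a DC
    usage = shutil.disk_usage(path)
    print(path, round(free_percent(usage.total, usage.free), 1), disk_status(usage.total, usage.free))
```

Because `shutil.disk_usage` reports on the whole volume that contains the path, the check also covers the transaction logs when they are stored alongside the database.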

One of the biggest risks in DCs running on virtual platforms (VMware ESXi, Hyper-V) is errors made during disk expansion or snapshot management. Taking a snapshot of a DC at the virtualization layer and reverting to that snapshot can lead to disaster scenarios called USN Rollback, which completely breaks replication. Although modern Windows Server versions (Server 2012 and later) reduce this risk with VM-GenerationID support, using snapshots on DC servers is still not recommended.

The ntdsutil tool is used to check database integrity and, if necessary, to perform offline defragmentation. Remember that the AD service must be stopped before performing these operations. You can follow these steps to verify the low-level integrity of the database file:

Step	Command / Operation	Description
1	`Stop-Service NTDS -Force`	Stop Active Directory Domain Services (service name: NTDS).
2	`ntdsutil`	Start the database management tool.
3	`activate instance ntds`	Select the AD DS database instance.
4	`files`	Switch to the file management submodule.
5	`integrity`	Check the physical integrity of the NTDS.dit file.
6	`quit`	Exit the file management submodule.
7	`quit`	Exit ntdsutil.
8	`Start-Service NTDS`	Restart the service.

If these verifications detect any errors in the database (Jet Database Error), you have two options. You can boot the server into Directory Services Restore Mode (DSRM) and restore from a system state backup taken with backup software such as Acronis. Alternatively, you can remove the DC from the environment completely (including metadata cleanup) and rebuild it with a clean installation.

6. Security Hardening, Auditing, and Privilege Creep

Domain health is not just about technical replication and service status; the security of the directory tree and the protection of the permission hierarchy are also integral parts of this health. Over time, “privileged users” or “inactive computer accounts” created for temporary projects but not deleted after their work is done become the biggest open doors for attackers.

Specifically, the AdminSDHolder object and the SDProp process protect the permissions of highly privileged groups (Domain Admins, Enterprise Admins, etc.) and their members within Active Directory. Even if the permissions on a user who is a member of these groups are changed manually, the system resets them to the AdminSDHolder template every 60 minutes by default. Monitoring whether this mechanism works correctly is critical.

To prevent privilege creep in the environment and clean up inactive accounts, you should run PowerShell queries periodically. For example, the following command identifies computer accounts that have not logged on to the domain in the last 90 days:

Search-ADAccount -AccountInactive -TimeSpan 90.00:00:00 -ComputersOnly | Select-Object Name, LastLogonDate

Review this list regularly; unused computer accounts should first be disabled and only later deleted. Similarly, tighten your audit policies to detect service accounts whose passwords have expired or have never been changed, and log every permission change in the domain with Active Directory change auditing tools such as Netwrix.
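If the query results are exported for review, the 90-day rule can be re-applied outside PowerShell. The following is a minimal Python sketch, assuming a CSV export with the columns `Name` and `LastLogonDate` and assuming the dates were written in ISO 8601 form; both the sample data and the function name `inactive_accounts` are hypothetical. Accounts with an empty date are treated as inactive here, which is a choice made for this sketch.

```python
import csv
import io
from datetime import datetime, timedelta

# Hypothetical export; dates are assumed to be ISO 8601.
SAMPLE_CSV = """\
Name,LastLogonDate
PC-001,2024-01-10T08:15:00
PC-002,2024-05-28T17:40:00
PC-003,
"""


def inactive_accounts(csv_text, now, days=90):
    """Return names whose last logon is older than `days`, or missing entirely."""
    cutoff = now - timedelta(days=days)
    stale = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = (row.get("LastLogonDate") or "").strip()
        last_logon = datetime.fromisoformat(raw) if raw else None
        if last_logon is None or last_logon < cutoff:
            stale.append(row["Name"])
    return stale


if __name__ == "__main__":
    print(inactive_accounts(SAMPLE_CSV, now=datetime(2024, 6, 1)))
```

Passing `now` explicitly keeps the function deterministic, so the same export always produces the same list during a review.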

Next Step: Automated Reporting and Monitoring Infrastructure

Manually performing the steps in this checklist is a great way to understand your system’s current state initially. However, in a sustainable system architecture, these controls need to be automated and connected to a monitoring platform.

As a next step, you can write PowerShell scripts that parse the output of the repadmin and dfsrdiag commands mentioned here, run these scripts daily with Windows Task Scheduler, and send the results as metrics to a time-series database such as InfluxDB, visualized on a central dashboard such as Grafana. In ITWISE operations, we prioritize automations that alert our on-call system engineers immediately when these metrics exceed their thresholds (e.g., replication delay exceeding 30 minutes). The health of your infrastructure is only as good as the quality of the proactive measures you take.
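To make the metric step concrete, here is a minimal Python sketch that formats one replication result as an InfluxDB line protocol record and applies the 30-minute threshold mentioned above. The measurement name `ad_replication`, the tag and field names, and the function names are hypothetical, and only integer fields are handled. Sending the line to a server is left out.

```python
def escape_tag(value):
    """Escape the characters that line protocol treats specially in tag values."""
    for ch in (",", " ", "="):
        value = value.replace(ch, "\\" + ch)
    return value


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build 'measurement,tag=value field=1i timestamp' with integer fields only."""
    tag_part = ",".join(f"{key}={escape_tag(str(val))}" for key, val in sorted(tags.items()))
    field_part = ",".join(f"{key}={int(val)}i" for key, val in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


def needs_alert(largest_delta_s, fails, max_delta_s=30 * 60):
    """Alert on any failed attempt or on a delay above the threshold."""
    return fails > 0 or largest_delta_s > max_delta_s


if __name__ == "__main__":
    line = to_line_protocol(
        "ad_replication",
        {"dc": "DC01", "direction": "source"},
        {"largest_delta_s": 1122, "fails": 0},
        1717200000000000000,
    )
    print(line)
    print(needs_alert(largest_delta_s=1122, fails=0))
    print(needs_alert(largest_delta_s=4330, fails=0))
```

Keeping the threshold check separate from the formatting means the same rule can drive an alert even when the metrics backend is unreachable.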
