I had heard through the grapevine a while ago that MSFT officially, or at least semi-officially, did not support AD lag sites. Lag sites, if you recall, are a way of configuring the AD replication topology to create one or more DCs that maintain replicas that are out of date with respect to the rest of the domain for the purposes of data recovery. Lag sites basically use AD replication as a sort of hot backup.
Guido and I incorporated lag sites in our AD disaster recovery workshop, and I know several large companies that use lag sites in their production forests. So I was surprised to hear that MSFT didn't support them. In fact, because lag sites simply use a specific (but entirely valid) kind of site topology and replication schedule, I didn't see how MSFT could not support lag sites.
I've since heard from other MSFT people that in fact MSFT does not support lag sites as a data recovery or disaster recovery strategy. This is not to say that lag sites don't work as advertised; they do. The problem is that lag sites leave a lot of holes in your DR strategy.
- Because lag site DCs replicate on a schedule, there is no guarantee that a change made on a production DC will not replicate to a lag site DC almost immediately. It depends on when the change occurs relative to the lag site replication schedule. To mitigate this problem, people sometimes implement two or more lag sites with offset schedules. This requires an additional DC in each domain, which requires additional management, etc.
- There is a substantial security liability with lag site DCs. Account deletions/disablements will not replicate immediately to the lag site DCs, which means they can be used to authenticate a user after the user’s account has been deleted. Even if the lag site DCs are properly configured so that they don’t publish SRV records, a user or administrator with a little knowledge can still use the lag site DCs to authenticate, even after their account has been deleted.
- There is no way to prevent someone from forcing replication to a lag site. For instance, the command-line tool REPADMIN used with the /FORCE option will force immediate replication to all lag site DCs. This completely invalidates lag site DCs as a reliable backup mechanism.
- Most people use virtual DCs when implementing a lag site to avoid the additional hardware costs. Running DCs on VMs has its own set of potential problems, many of which are described in "Running Domain Controllers in Virtual Server 2005" and "Considerations when Hosting Active Directory Domain Controllers in Virtual Hosting Environments".
- Even when the lag site works properly, recovering deleted objects from it is not simply a matter of reanimating them or authoritatively restoring them. There is still the matter of recovering linked attributes such as group memberships. The data on the lag site DCs can be used to help recover these linked attributes, but it is still a time-consuming proposition.
- The only supported way to restore AD to a previous point in time is by restoring a DC from a system state backup (from KB 888794): “To roll back the contents of Active Directory to a previous point in time, restore a valid system state backup.” This makes the lag site DCs redundant from a service recovery point of view: you have to maintain system state backups anyway.
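To make the scheduling problem from the first bullet concrete, here is a rough sketch of how much time you actually get before a bad change reaches a lag site. It assumes a weekly replication schedule measured in whole hours, with each lag site replicating once per week at a fixed hour; the function names and the example hours are made up for illustration and are not taken from any real configuration.

```python
WEEK_HOURS = 7 * 24  # assumption: each lag site replicates once per week


def hours_until_next_replication(change_hour, replication_hour, period=WEEK_HOURS):
    """Hours between a change and the next scheduled replication to one lag site.

    Both arguments are whole hours since the start of the period.
    A change made in the same hour as the replication counts as 0 hours.
    """
    return (replication_hour - change_hour) % period


def best_lag(change_hour, replication_hours, period=WEEK_HOURS):
    """With several lag sites on offset schedules, the old data survives
    longest on the site whose next replication is furthest away."""
    return max(
        hours_until_next_replication(change_hour, r, period)
        for r in replication_hours
    )


def worst_case_lag(replication_hours, period=WEEK_HOURS):
    """Smallest recovery window over every possible hour a change could occur."""
    return min(best_lag(h, replication_hours, period) for h in range(period))


# One lag site replicating at hour 0; an object is deleted at hour 167.
print(hours_until_next_replication(167, 0))  # 1

# Add a second lag site offset by half a week (hour 84).
print(best_lag(167, [0, 84]))                # 85

# Guaranteed minimum window for each design.
print(worst_case_lag([0]))                   # 0
print(worst_case_lag([0, 84]))               # 84
```

With a single lag site the guaranteed window is effectively zero, which is the hole described above. A second site on an offset schedule raises the floor, at the cost of another DC per domain, and it still does nothing about someone forcing replication by hand.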
Do lag sites work? Sure, but the problem is that you can't really count on them as a backup mechanism. And we all know that the ONE time you need your backups to work is the one time they're going to fail.
So what should you do with respect to AD backup and recovery? Well, nothing is more reliable than a couple of good system state backups for each domain. And unless you have a completely automated way of provisioning all the data in AD from external sources (e.g. your HR system), system state backups are your only real solution for performing a domain or forest recovery.
The downside of using system state backups for recovering deleted objects is that finding the right backup image and restoring it to a recovery DC is a giant PITA that can take hours. And the implementation of system state backups in WS08 makes using them even more painful. Using the snapshot feature in WS08 is one way to maintain a fairly fine-grained backup of your DIT, but you still have to figure out which image to restore from, and you still have to recover all of the linked attributes for the recovered objects. That's why products like RestoreADmin make sense. RestoreADmin keeps fine-grained online backups of your AD objects and makes the entire object recovery exercise almost completely trivial. It's the easiest way to mitigate the risks of accidentally deleted AD objects.
We'll talk about lag sites, object recovery, and domain/forest recovery at the AD DR Birds-of-a-Feather session at TechEd. I hope to see you there.