Quite a while ago, Brent Harman and I delivered the world-famous Guido and Gil's Masters of Disaster AD Disaster Recovery Workshop in Dallas. Things went pretty smoothly except for one mysterious behavior with authoritative restore. We figured it out after an hour or so, but I never wrote the problem up until now.
The lab setup included two domains, with two DCs in each domain. The first DC in the root domain was the sole DNS name server. The first DC in each domain was also a GC. Each domain contained OUs with groups and users in them, and there were domain local as well as universal groups containing members from both domains. Domains and forest were WS 2003 functional level.
The lab included two auth restore exercises. Each DC had a current system state backup. The first exercise had the student delete a user from an OU in the child domain, and then use auth restore to recover the user in the child domain, and non-auth restore and LDIFDE to recover the domain local group memberships in the root domain. The second exercise was essentially the same, except the goal was to delete and auth restore an entire OU, including the group memberships of the users in the OU. Pretty basic, if tedious, stuff.
To refresh your memory, the auth restore process in a multi-domain environment looks something like this:
- Boot DC in child domain into DSRM
- Use NTBACKUP to perform system state (non-auth) restore of DIT
- Use NTDSUTIL to perform auth restore of objects to be recovered. This increases the version number on the object's attributes by 10000 for each day between the date of the backup and the date of the restore, which causes the object to replicate out to the other DCs, rather than being overwritten by the tombstoned object from the other DCs. The auth restore process also creates LDIF files containing forward link information (e.g. group memberships) and a text file containing the GUID of the restored object for use in other domains.
- Boot the child DC into normal mode.
- Use LDIFDE to recreate the group memberships for the restored object in the child domain (not necessary in our case since we were in WS2K3 DFL/FFL)
- Boot a DC in the root domain into DSRM
- Use NTBACKUP to perform system state (non-auth) restore of DIT
- Use NTDSUTIL to create LDIF files for the root domain group memberships of the restored object, using the text file produced by the auth restore in the child domain (step 3) as input.
- Boot the root DC into normal mode.
- Use LDIFDE to recreate the group memberships of the restored object in the root domain
There were no significant problems with the first exercise. Pretty much everyone was able to restore the deleted user, along with the user's group memberships in both domains, with no trouble. The problem showed up in the second exercise, where we recovered the OU.
After restoring the OU (which happened to contain the user they recovered in the first exercise), several students discovered that the user they had restored in the first exercise was gone! Even worse, when they tried to auth restore the user again, the user would appear, and then magically disappear again within a few minutes.
After making sure the students were actually doing the auth restore properly, we pondered the problem for a while until we understood what was happening. A local Dallas MSFT fellow who was sitting in on the class figured it out.
Basically, the problem occurs when you auth restore the same object from the same backup more than once in a single day. Let's just look at a single attribute of the object, say distinguishedName. The version number of the attribute is initially 1. This is the value of the version number that is stored in the system state backup as well. After the object is deleted, the version number is set to 2, and the new distinguishedName value replicates out from the originating DC. When you use NTBACKUP to non-authoritatively restore the entire DIT, the version number of the distinguishedName attribute is again 1. If you were to restart the DC in normal mode, the other DCs in the domain would overwrite the distinguishedName attribute because their version number is higher. That's why you have to use NTDSUTIL to perform an authoritative restore of the object before restarting the DC in normal mode. The authoritative restore process increments the version number of all of the object's attributes by 10000 per day since the backup, which results in a version number larger than the corresponding version numbers on the other DCs in the domain. The attributes from the authoritatively restored object replicate out to the DC's replication partners, and overwrite the values there.
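The version comparison described above can be sketched in a few lines of Python. This is a toy model, not how the directory service is actually implemented: it assumes conflict resolution is simply "higher version number wins" (real AD replication also has tie-breakers that are ignored here), it uses the 10000-per-day figure from this article, and it assumes a same-day restore still gets one full increment, which is what the 1 to 10001 example implies. The function names are made up for illustration.

```python
# Toy model of attribute version numbers during restore.
# Assumption: the only conflict rule modeled is "higher version wins".
PER_DAY_INCREMENT = 10000  # per-day figure used in this article


def auth_restore_version(backup_version, days_since_backup):
    """Version number after an authoritative restore.

    Assumption: a restore on the same day as the backup still gets
    one increment, matching the 1 -> 10001 example in the text.
    """
    days = max(1, days_since_backup)
    return backup_version + PER_DAY_INCREMENT * days


def replication_winner(restored_version, partner_version):
    """Return which copy survives replication in this simplified model."""
    if restored_version > partner_version:
        return "restored"
    if partner_version > restored_version:
        return "partner"
    return "tie"


backup_version = 1   # value captured in the system state backup
deleted_version = 2  # value on the other DCs after the deletion

# Non-authoritative restore only: the restored DC holds version 1.
print(replication_winner(backup_version, deleted_version))  # partner

# Authoritative restore on the same day as the backup.
restored = auth_restore_version(backup_version, 0)
print(restored)                                             # 10001
print(replication_winner(restored, deleted_version))        # restored
```

The point of the sketch is only that the restored copy has to end up with a strictly higher number than whatever the other DCs currently hold.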
When the students went through the same process with the OU, they auth restored the OU subtree, including all the objects contained in the OU. Because the date of the backup and the date of the restore were unchanged, the version numbers from the backup were all incremented by the same 10000 again. The attributes of the OU and its contained objects all replicated out, overwriting the corresponding attributes on the other DCs, except for the attributes of the originally deleted user object. Why? Let's trace through what happened to the originally deleted user object and see.
When the object was originally created, the version number on its attributes was set to, say, 1. When it was deleted, the version numbers were incremented to 2. When the object was non-authoritatively restored from backup, the version numbers were restored to 1. When the object was authoritatively restored, the version numbers went to 10001, and that's what replicated out to the other DCs. When the OU (and the objects contained in it) was deleted for the second exercise, the version numbers went to 10002. The students did the non-authoritative restore from backup, which returned the version numbers to 1. The authoritative restore process then set the version numbers for all the restored objects to 10001. This caused the attributes for all the restored objects to replicate out to the other DCs, except for the attributes of the originally deleted object. Its version numbers on the other DCs were 10002, which is still greater than 10001. So the originally deleted object was restored, but inbound replication overwrote it, even though it had been authoritatively restored!
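Here is the same trace as a small Python sketch, following one attribute through both exercises. It is a simplified model under the same assumptions as the walkthrough: a deletion bumps the version by 1, an authoritative restore from the same-day backup adds 10000 to the backed-up value, and the higher number wins. The function and variable names are hypothetical.

```python
INCREMENT = 10000  # per-day figure used in this article


def second_restore_survives(restored_in_exercise_1, backup_version=1):
    """Trace one attribute through both lab exercises.

    Returns (timeline, survives), where timeline lists the version
    held by the other DCs after each event.
    """
    timeline = []
    partners = backup_version  # version held by the other DCs

    if restored_in_exercise_1:
        partners += 1                           # user deleted
        timeline.append(("ex1 delete", partners))
        partners = backup_version + INCREMENT   # auth restore replicates out
        timeline.append(("ex1 auth restore", partners))

    partners += 1                               # OU deleted in exercise 2
    timeline.append(("ex2 delete", partners))

    restored = backup_version + INCREMENT       # same backup, same day
    timeline.append(("ex2 auth restore", restored))
    return timeline, restored > partners


# The user who was already restored once: 10001 loses to 10002.
print(second_restore_survives(True))
# Any other object in the OU: 10001 beats 2.
print(second_restore_survives(False))
```

Only the object that went through the first exercise ends up with a partner version above 10001, which is why everything else in the OU came back normally.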
There are a couple of ways around this problem. One is to use the VERINC option of NTDSUTIL to increase the version numbers by more than 10000 per day. The other is to perform another backup immediately after the original authoritative restore. Or in the case of this workshop, just delete and restore a different OU :).
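To see why either workaround helps, here is a rough Python check using the numbers from the trace above. It does not call NTDSUTIL or model its exact behavior; it only compares version numbers under the "higher wins" assumption, and the helper names are invented for this example.

```python
def restore_wins(backup_version, partner_version, increment):
    """True if the auth-restored version beats the partners' version."""
    return backup_version + increment > partner_version


def minimum_increment(backup_version, partner_version):
    """Smallest increment that makes the restored copy win."""
    return partner_version - backup_version + 1


# The failing case: original backup (version 1), partners at 10002.
print(restore_wins(1, 10002, 10000))      # False

# Workaround 1: a larger increment, as you would request with VERINC.
print(minimum_increment(1, 10002))        # 10002
print(restore_wins(1, 10002, 20000))      # True

# Workaround 2: a fresh backup taken after the first auth restore
# captures version 10001, so the default increment is enough.
print(restore_wins(10001, 10002, 10000))  # True
```

In practice you would pick a comfortably large increment rather than the bare minimum, since you usually don't know the exact version numbers on the other DCs.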