Mark S. Rasmussen improve.dk
Mar 02
2008

I spend a lot of time architecting for scalability, availability and security during my daily work. Currently I’ve got a distributed system consisting of several windows services communicating across machines using WCF and authenticating through Active Directory.

In such a situation, if the Active Directory Domain Controller (let’s just call it DC from now on) dies, everything more or less dies as no clients/servers are able to authenticate incoming requests anymore. Security is paramount, so the services are not allowed to simply cache the domain logon, thus the logon has the occur at each service call – requiring a fully working DC.

In this post I’ll attempt to implement a secondary failover DC and investigate how it affects a downtime situation. I’ll be using a couple of simple WCF based applications to test the DC. I will be using three virtual machines. Luxor is the primary DC. MGM is part of the domain, and this is the machine hosting the WCF server. Later on I’ll add the third virtual machine, Excalibur, being the failover DC. The WCF client will be running from my own machine. Note that I will not show how to install the primary DC, there are plenty other great guides on how to setup the primary DC. I also won’t be going into how to install the primary DNS server.

This is the WCF server interface:

using System;
using System.Security.Principal;
using System.ServiceModel;
using System.ServiceModel.Channels;
using System.ServiceModel.Description;
 
namespace Contracts
{
	public class WCFHelper
	{
		/// <summary>
		/// Creates an instance of the specified interface type by channeling to the service host.
		/// </summary>
		/// <typeparam name="T">The interface type to create.</typeparam>
		/// <param name="binding">The binding protocol to use.</param>
		/// <param name="endpointAddress">The complete address for the endpoint.</param>
		/// <returns>An instance of type T.</returns>
		public static T CreateChannel<T>(Binding binding, string endpointAddress)
		{
			// Create an endpoint for the specified binding & address
			ServiceEndpoint endpoint = new ServiceEndpoint(ContractDescription.GetContract(typeof(T)), binding, new EndpointAddress(endpointAddress));
			binding.SendTimeout = new TimeSpan(0, 0, 1);
			binding.ReceiveTimeout = new TimeSpan(0, 0, 1);
			binding.OpenTimeout = new TimeSpan(0, 0, 1);
			binding.CloseTimeout = new TimeSpan(0, 0, 1);
 
			// Create a channel factory of type T
			ChannelFactory<T> factory = new ChannelFactory<T>(endpoint);
			factory.Credentials.Windows.ClientCredential.UserName = "RedundancyCheck";
			factory.Credentials.Windows.ClientCredential.Password = "RedundancyCheck";
			factory.Credentials.Windows.ClientCredential.Domain = "IPAPER";
			factory.Credentials.Windows.AllowedImpersonationLevel = TokenImpersonationLevel.Impersonation;
 
			// Return the created channel
			return factory.CreateChannel();
		}
 
		/// <summary>
		/// Creates a ServiceHost hosting the specific implementation TImplementationType of interface TInterfaceType at the specified endpointAddress.
		/// </summary>
		/// <typeparam name="TInterfaceType">The interface type to host.</typeparam>
		/// <typeparam name="TImplentationType">The implementation type to host.</typeparam>
		/// <param name="binding">The binding protocol to use.</param>
		/// <param name="endpointAddress">The endpoint where the service should be hosted at.</param>
		/// <param name="mexEndpointAddress">The endpoint where the service metadata should be hosted at.</param>
		/// <returns>An instance of ServiceHost.</returns>
		public static ServiceHost CreateServiceHost<TInterfaceType, TImplentationType>(Binding binding, string endpointAddress, string mexEndpointAddress)
		{
			// Create new service host
			ServiceHost host = new ServiceHost(typeof(TImplentationType));
 
			// Create endpoints
			host.AddServiceEndpoint(typeof(TInterfaceType), binding, endpointAddress);
 
			// Create metadata endpoint if it doesn't exist
			ServiceMetadataBehavior smb = host.Description.Behaviors.Find<ServiceMetadataBehavior>();
			if (smb == null)
			{
				smb = new ServiceMetadataBehavior();
				smb.MetadataExporter.PolicyVersion = PolicyVersion.Policy15;
				host.Description.Behaviors.Add(smb);
			}
			if (binding is NetTcpBinding)
				host.AddServiceEndpoint(ServiceMetadataBehavior.MexContractName, MetadataExchangeBindings.CreateMexTcpBinding(), mexEndpointAddress);
			else if (binding is BasicHttpBinding)
				host.AddServiceEndpoint(ServiceMetadataBehavior.MexContractName, MetadataExchangeBindings.CreateMexHttpBinding(), mexEndpointAddress);
			else
				throw new ArgumentOutOfRangeException("Invalid binding: " + binding);
 
			return host;
		}
	}
}

Notice that I’ve hardcoded the user “IPAPERRedundancyCheck” with a password of “RedundancyCheck” – this is for test purposes only, don’t even bother commenting on password security :) Also notice that I’ve set a timeout of 1 second – when the DC fails, I don’t want to spend 30 seconds before knowing if it’s down, I want to know about it right away. And since all machines are running locally, 1 second is plenty.

This is the server that’ll be running on the MGM machine on port 8000:

using System;
using System.Security;
using System.Security.Permissions;
using System.ServiceModel;
using Contracts;
 
namespace WcfServer
{
	class Program
	{
		static void Main(string[] args)
		{
			// Create service host
			ServiceHost host = WCFHelper.CreateServiceHost&lt;IServer, Server&gt;(new NetTcpBinding(SecurityMode.Transport), "net.tcp://localhost:8000", "net.tcp://localhost:8000/mex");
 
			// Open host
			host.Open();
 
			Console.Read();
		}
	}
 
	[ServiceBehavior(IncludeExceptionDetailInFaults = true)]
	public class Server : IServer
	{
		public string Ping()
		{
			// Security check
			try
			{
				new PrincipalPermission(null, "ADTest").Demand();
			}
			catch
			{
				throw new SecurityException();
			}
 
			// Let caller know that we're alive
			return "Pong!";
		}
	}
}

We’ve got an implementation of the IServer interface with the single method “Ping”. We test that the user is part of the “ADTest” role by demanding it on the current principal. If something fails we throw a SecurityException which will let the client now authentication failed. If the client is authenticated, we return a pong.

And finally we have the client that’ll be running on my own machine:

using System;
using System.ServiceModel;
using System.Threading;
using Contracts;
 
namespace WcfClient
{
	class Program
	{
		static void Main(string[] args)
		{
			while (true)
			{
				try
				{
					IServer server = WCFHelper.CreateChannel(new NetTcpBinding(SecurityMode.Transport), "net.tcp://192.168.0.35:8000");
 
					using (server as IDisposable)
					{
						try
						{
							Console.WriteLine(server.Ping());
						}
						catch (Exception ex)
						{
							Console.WriteLine(ex.Message);
						}
					}
				}
				catch (CommunicationObjectFaultedException)
				{ }
 
				Thread.Sleep(250);
			}
		}
	}
}

We create a channel to the MGM machine (static IP of 192.168.0.35). We’ll continue to call the IServer service every 250ms, writing either the result of the Ping function, or a message explaining any problems that have occurred. Note that in real life situations we’d not create a new channel each time, but in this case we have to, so we really do authenticate on each call (so we’re affected immediatly when the DC dies).

The following clip will show what happens when we run the server & client while the DC goes down (by pausing the virtual machine). Note that we’re receiving a timeout exception, not a SecurityException. This is because I’m using SecurityMode.Transport which requires us to authenticate before even reaching the service, thus the method is never invoked, and the PrincipPermission.Demand() call is NOT the one failing us, it’s the WCF security layer trying to open a TCP transport. As soon as the DC (running on LUXOR) fails, we lose connectivity with our service (running on MGM).

The goal obviously is to prevent this from happening, we cannot have all our services brought to a standstill if the DC fails. The first step in installing a failover AD DC is to get a DNS secondary server up and running (on the soon to be secondary DC machine) so we have redundant DNS functionality.

A quick recap of the servers:

LUXOR = Primary DC, primary DNS
EXCALIBUR = To be secondary DC, secondary DNS
MGM = Client server

Now that we’ve got the secondary DNS set up, we’re ready to install Active Directory on the secondary AD server (EXCALIBUR). The following video shows how easy it is to install a failover DC:

That’s it! After the server reboots, it now functions as a failover DC in case the primary one kicks the bucket. I’ll end this post post by running my WcfServer application on the MGM server whilst both LUXOR and EXCALIBUR are running. You’ll se a fluent stream of “Pong”s returning. After shutting down LUXOR, the WCF client will immediately start reporting connection problems, but after a short while it automatically starts returning Pongs again – it got a hold of the second DC! Now, if I shut down the second DC aswell, we’ll get errors in our client again. If I then restart the primary DC, after a short while, the client starts Ponging again – we got a hold of the primary DC. So we haven’t eliminated downtime completely, but we’ve reduced it to a 5-30 sec period before everything automatically switches over to the failover DC.

This is my first blog post utilizing videos – does it work? Do you prefer seeing live video like this, or a long series of screenshots? I know what I prefer :)

Mark S. Rasmussen
I'm the Technical Lead at iPaper where I cuddle with databases, mold code and maintain the overall technical responsibility. I'm an avid speaker at user groups & conferences. I love life, motorcycles, photography and all things technical. Say hi on Twitter, write me an email or look me up on LinkedIn.
Comments
  • Kristian Kristensen March 2, 2008

    Good stuff :-)

  • Sam Magee May 5, 2008

    The article was great. Love to use of videos and Virtual PC, it makes everything very easy to follow.

  • Mike Bright November 21, 2008

    Very good! The video really adds a lot of information that would have to be typed out otherwise. Now, if you could combine the screenshots and video or zoom in and pause for a sec on the parts where we need to focus or need to see specific details it would be perfect!

  • GiusMan March 10, 2009

    Very cool.
    … got a question hope u can answer.. once the secondary DC is set, how do u keep it up to date… menaing.. if u add a new user on DC1, how does it get to DC2 ?

    thx

  • Mark S. Rasmussen March 10, 2009

    As long as they’re connected, they’ll automatically synchronize any changes. This goes for both AD and DNS changes.

  • Ryan August 12, 2009

    In our environment we have a hung DC every now and then at which point our monitoring system reports some services are down.

    Example of our Environment:
    Domain Controller called DC1 and DC2
    Member Server called: Member
    Member server doing monitoring called: Monitor

    So DC1 will become hung on restart at which point the server called "Monitor" will start reporting that services are down on server "Member". We use a domain account which is a member of the admin group on all servers to use to check for down services. I know this is a kerberos/netlogon issue but can’t explain why DC1 being hung would affect the server Monitor from being able to successfully login to Memeber. The netlogon service should notice DC1 is hung and failover to DC2….any insight would be much appreciated

  • GiusMan August 17, 2009

    Mark,
    it is safe to say that u can repeat this procedure to add a 3rd failover DC ? what make the difference between primary and secondary DCs?
    the reason I am asking is that i want to replace the DC1 and i thought this would be a great way to do it: create DC3 … decommission DC1… what do u think ?

  • Ryan September 9, 2009

    I think a key point to make is that you are instantly "pausing" your virtual machine. Does pausing make the server un pingable?

    Basically what i’m trying to understand is what check does the netlogon service do to verify the DC it’s authenticating against is really able to process authentication requests. I think it only pings..and if the DC responds to ping, then it assums the DC is up and functioning. Thus if a DC hangs and is still pingable, client workstation/member server’s netlogon service will not try and discover a new DC. Or if a DC is just slow to reboot…..

  • tasmanian_devil October 19, 2009

    [ServiceContract(Namespace = "http://test.dk/Server&quot;)] is Win 2003 Server with AD ?

  • KNSas February 19, 2010

    I found this article very useful. I have one question:

    do you think there will be any issues when many applications are dependent on the DC and the PDC is down for a long time. This demo shows that if a DC (FSMO-PDC I hope it is) in same site is down , the app picks up the next availale DC in 30 seconds. Did you try this across sites? will there be any issues with authentication if the PDC is down for a long time.

Leave a Comment