I'm a big fan of strongly typed database querying as well as returning strongly typed results. Due to the nature of static languages, you'll get compile-time checking of all your tables and columns. You can easily rename columns as you can be sure all your (internal) references are accounted for.
Returning strongly typed lists of objects instead of DataReaders, DataTables or any other loosely typed containers will also make it easier to pass data through your layers, as you're always certain of what's available for you to read and what's not.
But it comes at a cost. Performance.
I set out to test various ways we could query our database and generate a strongly typed List with the results. I'm using the standard AdventureWorks 2005 database for my testing. For the actual profiling, I'll be using my CodeProfiler class from a previous blog post.
I’ll be using two different entities, Product and CompleteProduct:
They both map to the same table, Production.Product. CompleteProduct covers all columns, Product just covers the ones I’m interested in.
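The classes aren't reproduced here, but a rough sketch of the partial Product entity could look like the following - the property selection is illustrative, though the column names are real Production.Product columns. CompleteProduct would be the LINQ to SQL-mapped counterpart covering every column of the table.

```csharp
// Illustrative sketch - a plain class covering just the columns I care about.
// CompleteProduct is the LINQ to SQL-mapped entity covering all columns (omitted here).
public class Product
{
    public int ProductID { get; set; }
    public string Name { get; set; }
    public string ProductNumber { get; set; }
    public decimal ListPrice { get; set; }
}
```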
Method #1 - Manually mapping from DataReader
Pros: Fast
Cons: Has to be written for each object model
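A rough sketch of what this might look like (connection string and column selection are placeholders):

```csharp
List<Product> products = new List<Product>();

using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(
    "SELECT ProductID, Name, ProductNumber, ListPrice FROM Production.Product", conn))
{
    conn.Open();

    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // Each column is read explicitly and assigned to the matching property.
            products.Add(new Product
            {
                ProductID = reader.GetInt32(0),
                Name = reader.GetString(1),
                ProductNumber = reader.GetString(2),
                ListPrice = reader.GetDecimal(3)
            });
        }
    }
}
```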
Method #2 - Manually mapping from DataTable
Pros: Pretty fast, easy row access, disconnected from database
Cons: Has to be written for each object model
Comment: I often use a getDT()'ish way of accessing my tables. It's easier than using DataReaders since DataTables are disconnected - you can traverse them back and forth multiple times.
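A sketch along those lines - fill a DataTable through a SqlDataAdapter and map each row by hand:

```csharp
DataTable dt = new DataTable();

using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlDataAdapter adapter = new SqlDataAdapter(
    "SELECT ProductID, Name, ProductNumber, ListPrice FROM Production.Product", conn))
{
    // Fill opens and closes the connection for us, leaving a disconnected DataTable.
    adapter.Fill(dt);
}

List<Product> products = new List<Product>();
foreach (DataRow row in dt.Rows)
{
    products.Add(new Product
    {
        ProductID = (int)row["ProductID"],
        Name = (string)row["Name"],
        ProductNumber = (string)row["ProductNumber"],
        ListPrice = (decimal)row["ListPrice"]
    });
}
```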
Method #3 - Automatic mapping using DataContext.Translate
Pros: Works for all object models, easy to use
Cons: Slow
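A minimal sketch of the technique - DataContext.Translate maps reader columns to properties by name, so the Product class needs no mapping attributes:

```csharp
using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(
    "SELECT ProductID, Name, ProductNumber, ListPrice FROM Production.Product", conn))
using (DataContext dc = new DataContext(conn))
{
    conn.Open();

    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        // Translate materializes the reader rows into Product instances.
        List<Product> products = dc.Translate<Product>(reader).ToList();
    }
}
```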
Method #4 - Linq to SQL query for complete entity
Pros: Could it be easier? Works for all object models
Cons: Slow
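A sketch, assuming CompleteProduct is the attribute- or designer-mapped LINQ to SQL entity:

```csharp
using (DataContext dc = new DataContext(connectionString))
{
    // Pull the complete entities straight out of the table.
    List<CompleteProduct> products = dc.GetTable<CompleteProduct>().ToList();
}
```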
Method #5 - Linq to SQL query for partial entity
Pros: Easy, works for all object models
Cons: Slow
Comments: In the previous method I retrieved complete Product entities. This time I’m retrieving only some columns of the CompleteProduct entity, mapping them over into a new Product. I’ve included this to see if there’s any performance difference in comparison to loading complete entities.
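A sketch of that approach - query the mapped CompleteProduct entity, but only project the wanted columns into new Product instances:

```csharp
using (DataContext dc = new DataContext(connectionString))
{
    List<Product> products =
        (from cp in dc.GetTable<CompleteProduct>()
         select new Product
         {
             ProductID = cp.ProductID,
             Name = cp.Name,
             ProductNumber = cp.ProductNumber,
             ListPrice = cp.ListPrice
         }).ToList();
}
```

Since Product is a plain unmapped class, LINQ to SQL happily constructs it in the projection, and the generated SQL only selects the four columns.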
Method #6 - Automatically mapping from DataReader
Pros: Simple to use, works for all object models
Cons: Slow, reflection based
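A naive reflection-based mapper along these lines (my own sketch, not the code that was actually benchmarked); property names are assumed to match the selected column names:

```csharp
public static List<T> MapAll<T>(IDataReader reader) where T : new()
{
    PropertyInfo[] properties = typeof(T).GetProperties();
    List<T> result = new List<T>();

    while (reader.Read())
    {
        T item = new T();
        foreach (PropertyInfo property in properties)
        {
            // Look the column up by property name and assign it through reflection.
            object value = reader[property.Name];
            if (value != DBNull.Value)
                property.SetValue(item, value, null);
        }
        result.Add(item);
    }

    return result;
}
```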
Method #7 - Enhanced automatic mapping from DataReader
Pros: ???
Cons: ???
Comments: This is an enhanced version of the previous method that I’ve made. Explaining the inner workings is outside the scope of this particular topic so I’ll have to explain it in my next post. For now, just imagine something very cool and sexy. UPDATE: you can read about the method here.
Method #8 - SubSonic SqlQuery
Pros: Works for all object models
Cons: Slow
Comments: I’ve never used SubSonic before, so I may have overlooked some obvious performance enhancements, thus, take my numbers with a grain of salt.
Method #9 - NHibernate CreateCriteria
Pros: Easy, works for all object models, concise querying
Cons: Slow
Comments: I’ve never used NHibernate before, so I may have overlooked some obvious performance enhancements, thus, take my numbers with a grain of salt.
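For reference, the criteria query itself is about as concise as it gets - a sketch assuming an already configured ISessionFactory and a mapping for Product:

```csharp
using (ISession session = sessionFactory.OpenSession())
{
    IList<Product> products = session
        .CreateCriteria(typeof(Product))
        .List<Product>();
}
```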
Method #10 - Compiled LinqQuery
Pros: Among the fastest, no matter the row count; works for all object models
Cons: Requires a bit more setting up and helper functionality to store the compiled queries
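A sketch of the plumbing - the compiled delegate is created once and stored in a static field so later calls skip the expression-to-SQL compilation (CompleteProduct again being the mapped entity):

```csharp
static readonly Func<DataContext, IQueryable<CompleteProduct>> AllProducts =
    CompiledQuery.Compile((DataContext dc) =>
        from p in dc.GetTable<CompleteProduct>() select p);

// Usage - only the first invocation pays the compilation cost.
using (DataContext dc = new DataContext(connectionString))
{
    List<CompleteProduct> products = AllProducts(dc).ToList();
}
```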
The profiling
I profiled each of the test cases by returning 1, 10, 100, 1,000, 10,000 and 100,000 rows. The numbers are the total execution time of 50 iterations of each method. I'm on a quad core machine, so I set the max degree of parallelization to three, thus saving myself a core for running SQL Server and Windows.
The results
The following two graphs show the total runtime of 50 iterations for each method, as well as the total runtime divided by the number of rows. Runtime in seconds along the Y-axis, number of rows along the X-axis. They basically show that all LINQ to SQL flavors suffer from initial setup cost, but they scale extremely well, ending up outperforming all other contestants by a fair margin. One could argue that it's rarely relevant for more than a thousand rows, as we'll rarely pull out that much data at once. Nevertheless, I find it interesting - it's something I'll have to look under the hood to find an explanation for. Ignoring some variance due to the low row counts, the DataReader, DataTable and my Enhanced Automapped DataReader methods outperform most contestants.
UPDATE:
It seems the reason the LINQ flavors are so speedy is because their object mapper methods are being cached in the local thread cache. And since the CodeProfiler will execute all iterations on the same thread, each iteration (except the first) will simply reuse the mapper methods. See System.Data.Linq.SqlClient.ObjectReaderCompiler.Compile() for reference.
UPDATE 2:
As Kris Vandermotten duly commented, I should have tested a compiled LINQ query as well as that really is a separate scenario. I’ve added a tenth method as well as a new line in the graphs. The compiled LINQ query, although requiring a bit more plumbing code, really blows the competition away. It’s fast for even single row queries since we no longer suffer from the compilation time. For larger rowsets the advantage diminishes as the saved compilation time is spread out over a large amount of rows.
UPDATE 3:
As a friend of mine, Jakob Andersen, points out, I really should be a bit more specific about the purpose of these comparisons.
I am in no way claiming that it's unreasonable for an OR/M framework to hurt performance. Neither am I claiming that my comparisons between LINQ to SQL, SubSonic and NHibernate are fair. Really, to make a fair comparison, one should compare the very specific features of the frameworks, not a feature set as general as what I'm testing here.
The thing is, some of these frameworks (NHibernate, I'm looking at you especially) just offer so much more functionality, and that's a burden they have to live with. While this extra functionality might cost a bit when we're just measuring raw object translation performance, it might gain you a whole lot more in real life scenarios. For instance, if you utilize NHibernate's caching, you might save the database lookups entirely - saving you much more than the actual translation process costs in most cases.
So, as mentioned in a couple of tests - take these results with a grain of salt. They’re rather out-of-the-box simple tests of direct querying & result translation performance. We’re ignoring all of the extra OR/M features that we have available, and we’re ignoring that these different frameworks offer very different functionality.
Total runtime
Runtime per row
Remember, even though there seems to be somewhat of a speed difference, the fastest method takes only 0.000074 seconds, while the slowest takes 0.00085 seconds. No matter what, you shouldn't consider refactoring your solutions for performance reasons unless you really have a high hit count, or unless it's for valid architectural reasons. Otherwise you might just risk having Ted Dziuba shove his foot up your ass.
I often do code profiling, usually involving a stopwatch and some boilerplate code. I decided to make a quick’n’simple class that’ll help me get rid of some of that code and concentrate on the actual code being profiled.
There are just four functions in the class, all overload variations. In a nutshell, they'll allow you to profile a single action with & without warmup, multiple iterations, and multiple iterations run in parallel. The code is more or less self-explanatory so I'll just throw it out there:
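To give an idea of the shape, here's a minimal sketch of what such a helper might look like - the exact signatures of the real CodeProfiler may differ:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

public static class CodeProfiler
{
    // Profile a single action, optionally running it once first to warm up (JIT, caches).
    public static TimeSpan Profile(Action action, bool warmup)
    {
        if (warmup)
            action();

        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }

    // Profile a number of iterations, optionally spread across multiple threads.
    public static TimeSpan Profile(Action action, int iterations, int threads)
    {
        int iterationsPerThread = iterations / threads;
        Stopwatch sw = Stopwatch.StartNew();

        List<Thread> workers = new List<Thread>();
        for (int i = 0; i < threads; i++)
        {
            Thread worker = new Thread(() =>
            {
                for (int j = 0; j < iterationsPerThread; j++)
                    action();
            });
            workers.Add(worker);
            worker.Start();
        }
        workers.ForEach(w => w.Join());

        sw.Stop();
        return sw.Elapsed;
    }
}
```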
Using the CodeProfiler class, it's easy for us to do code profiling, including measuring the effects of parallelization. Here's an example of calculating the square root of the numbers 1 through 10 million, 100 times, using 1-16 simultaneous threads. FYI, I'm running on a quad core machine.
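Using the sketched API above (the real class's signatures may differ), the test could look like this:

```csharp
for (int threads = 1; threads <= 16; threads++)
{
    TimeSpan elapsed = CodeProfiler.Profile(() =>
    {
        // One iteration: square roots of 1 through 10 million.
        for (int i = 1; i <= 10000000; i++)
            Math.Sqrt(i);
    }, 100, threads);

    Console.WriteLine(threads + " thread(s): " + elapsed.TotalSeconds + "s");
}
```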
And the result is as expected: performance levels out when threads >= cores. Using more threads than there are cores will usually not result in added performance, just increased context switching, which is not desirable. Here are the resulting execution times with seconds on the Y axis and number of threads on the X axis.
Time flies fast. Back in 2006, during my time as an MSP, I made a series of presentations on securing intellectual property in .NET code, resulting in my Securing .NET Code article. Although it’s about two years old, most points are still valid today, unfortunately.
I recorded a screencast of this article sometime in 2007, but I never really got it published, except for a link on the Microsoft Denmark site. It was my first screencast and unfortunately I made some mistakes, one of the biggest being the click sounds from the mouse. An even bigger mistake was deleting the original Camtasia recording files, so I can't create a new version without them.
Nevertheless, if you prefer watching to reading, it might be interesting to look through.
In my earlier blog post about the dangers of using Response.TransmitFile, I gave an example of a workaround involving spawning a new thread in the ASP.NET page. While this does solve the issue at hand, it presents us with a new way to kill our application even quicker than before.
Usually when an uncaught exception occurs in an ASP.NET application, we will be presented with a “friendly” error message like the one below:
While there is an overhead to throwing exceptions, they're not directly dangerous and will at worst affect scalability (ignoring the actual reason the exception was thrown). The problem is that ASP.NET will only catch exceptions on the processing thread. That means that if you spawn a new thread and an exception is thrown (and not caught inside the thread itself), it will propagate and eventually crash the w3wp.exe process.
Safe
Unsafe, will crash w3wp.exe
There are several repercussions of w3wp.exe crashing. There's a major overhead in spawning a new w3wp.exe process on the next request, and you will lose all session (if you're using in-process session storage), application and cache state. If you have error reporting turned on, you will furthermore see the “DW20.exe” process running and taking up 100% CPU for a significant amount of time (depending on w3wp.exe memory usage, etc.) - if this happens often, you might have a large number of DW20.exe error reporting processes running, effectively crashing your server.
So how do we avoid this? Simple: make sure all code in spawned threads handles its own exceptions:
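A sketch of the pattern - LogRequest and LogError are hypothetical placeholders for whatever background work and error handling you need:

```csharp
new Thread(() =>
{
    try
    {
        LogRequest(); // the actual background work
    }
    catch (Exception ex)
    {
        // Never let the exception escape the thread - that would take down w3wp.exe.
        LogError(ex);
    }
}).Start();
```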
If you’re experiencing this issue, you will see errors in the System event log like this one:
A process serving application pool ‘DefaultAppPool’ terminated unexpectedly. The process id was ‘708’. The process exit code was ‘0xe0434f4d’.
Before attempting to optimize code or fix any kind of load issue, you should first gather data and find out what bottlenecks you're experiencing. A great way to do this is through the Performance Monitor application. Recently I tried monitoring my ASP.NET applications, but all my counters had a value of 0. It's a simple problem, as I initially suspected, but the solution was not easily found.
In some cases it might be due to a lack of permissions on the performance counter registry keys.
In my case it was because I was running Server 2003 x64, but my IIS was running in 32 bit mode (due to a couple of reasons, mainly lack of x64 support in some 3rd party components). When you run the IIS worker processes in 32 bit mode, the performance counters that are used are part of the SysWOW64 scheme. The problem with this is that the usual Performance Monitor application will not read these 32 bit performance counters, and as a result you will see them all with a value of 0.
The fix is simple: open up C:\Windows\SysWOW64\perfmon.exe instead of the usual Performance Monitor link in the Administrative Tools directory. This version of perfmon is the good old 32 bit version that will read the 32 bit ASP.NET performance counters. This trick applies to all x64 versions of Windows.
I got quite a lot of comments for my XmlDocument fluent interface, and I’m very glad I did. I’m always open towards new ways to solve problems, and I got a couple of suggestions to my post that I afterwards experimented with. One of those is using the XmlSerializer to serialize strongly typed classes (or structs - performance is the same) into XML. Jon von Gillern originally suggested it, but Kris Vandermotten made me want to test it out.
There are two aspects to these solutions: one is readability & maintainability, the other is pure performance. I said that my XmlDocument wrapper would be a lot faster than the serialization approach using reflection, but Kris wasn't so sure. Admittedly, I hadn't tested it, so I thought I might actually be wrong in that assumption. Let the testing commence.
I’ll be using my User XML snippet as an example. This is how the XML is generated using my API:
Note that I just retrieve the complete XML as a string - I don't print or save it, it's just to get a valid comparison point. This is how we'll generate the same XML using the XmlSerializer:
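A sketch of what the XmlSerializer approach looks like - the User type here is a hypothetical stand-in shaped roughly like the XML in question:

```csharp
public class User
{
    [XmlAttribute]
    public int Id { get; set; }

    public string Username { get; set; }
    public string RealName { get; set; }
}

// Serialize an instance and grab the resulting XML as a string.
XmlSerializer serializer = new XmlSerializer(typeof(User));
using (StringWriter sw = new StringWriter())
{
    serializer.Serialize(sw, new User { Id = 1, Username = "jdoe", RealName = "John Doe" });
    string xml = sw.ToString();
}
```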
Note that only the last code snippet is the one being looped; the other two are simply one-time helpers to actually create the XML. I have run the tests for a number of iteration counts to get a total code time; furthermore, I've run each of the iteration tests 10 times to calculate the average execution time. This is the basic code to run the tests:
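The harness itself is essentially just a Stopwatch around a loop - something along these lines:

```csharp
static double Time(Action generateXml, int iterations)
{
    Stopwatch sw = Stopwatch.StartNew();

    for (int i = 0; i < iterations; i++)
        generateXml();

    sw.Stop();
    return sw.Elapsed.TotalMilliseconds;
}
```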
And finally, the results (times in ms on a base 10 logarithmic scale):
As expected, the XmlSerializer is somewhat slower at the low iteration counts; this is due to the initial code emission XmlSerializer does, as Kris also mentioned. This is also the reason XmlSerializer actually speeds up as the iterations go up: the initial compilation matters less and less. XmlOutput has a rather linear use of time. Nevertheless, the initial compilation time is negligible as it's only the first request that takes this performance hit (and we could use sgen followed by ngen to avoid it). Thus, if we simply reset the timer after the first iteration, this is the new graph we get (note that we can't plot the 1st iteration, as a value of 0 cannot be plotted on the logarithmic scale):
This time XmlSerializer behaves a lot more linearly, like XmlOutput, but it's still several factors slower. In conclusion, speed does not seem to be the advantage of XmlSerializer. Depending on your scenario, using strongly typed classes might be more appropriate, but I really believe this is scenario dependent and thus I'll leave it out of the discussion.
I misread Kris' comment about sgen - I read it as ngen - and I've removed my comment regarding this. To be fair, I've redone the performance tests, using sgen on the assembly during compilation. And I must say, it certainly does improve the serializer's performance somewhat, though still not enough to compete with XmlOutput/XmlDocument.
Just before last weekend I noticed that a website I'm responsible for started spitting out “Server too busy” messages - not something you want to see on a website with millions of visitors per day. The quick fix was to recycle the application pool, and thus I treated the symptoms by setting a 15 minute recycle interval on all the application pools. Not exactly optimal, but sometimes pissing your pants is the way to go.
The first step in analyzing what was causing this was the Performance Monitor tool. We weren't experiencing above average traffic, so that couldn't explain it. What first struck me was that the “ASP.NET\Requests Queued” counter was 0, not 5000+ as I'd expected! That meant the requests were not being queued, so the server didn't have trouble handling the requests themselves. The reason was to be found in the “ASP.NET\Requests Current” counter. This was constantly rising even though the CPU, memory and disk counters looked fine. It obviously didn't look like a performance problem, more like a configuration issue. So I increased the appRequestQueueLimit to 20k and set the recycle interval to 30 minutes; at most, “ASP.NET\Requests Current” went to about 10k before the pool was recycled and the counter reset to 0.
Now, that didn't fix the problem, just the symptom. We hadn't experienced this issue previously, so I thought back to what changes had been made in the latest release. The primary functionality of the system is to serve images, thus we have an Image.ashx file with the responsibility of serving the images as well as logging various parameters of the request. The previous system version had functionality like so:
Find image path
Response.TransmitFile()
Logging
The disadvantage of doing it that way is that the client will not have the image served until the statistics have been logged, even though that's purely a server-side concern. I wanted the client to receive the image as quickly as possible, and then let the server continue its job afterwards. The obvious solution is to spawn a new thread to do the logging, but with the amount of requests we've got, I really don't want to spawn any more threads than absolutely necessary; excessive context switching will have a negative impact when the thread count gets high enough. So the new version functioned like this:
Find image path
Response.TransmitFile()
Response.Flush()
Response.Close()
Logging
This had the great advantage that the client receives the image immediately while the server continues logging afterwards. We use only a single thread, the actual request thread. A friend of mine pointed out I might want to move the logging out of the ASP.NET worker process so as to not block incoming requests. The thing is, this would require spawning new threads, and I really don't mind blocking a worker process as we can easily tune the number of concurrent worker processes, and the “Server too busy” functionality is actually there for a reason - I don't want to end up in a situation where the server is running a million logging threads but still accepting new connections willingly - in that case, I'd really like the server to block new requests.
Anyways, although this looked good, it was the sole reason for the “Server too busy” errors we were experiencing! After some testing I discovered that if you call Response.TransmitFile() and then afterwards call Response.Close(), the request is stuck! It will simply keep on living, and thus the “ASP.NET\Requests Current” counter will keep increasing. The request will not be removed until a pool recycle event is fired! This does not happen if you use Response.WriteFile, Response.BinaryWrite or if you manually stream the file - only if you use TransmitFile!
This will kill your application:
But this won’t:
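A sketch of the two variants described above (imagePath is a placeholder):

```csharp
// This will kill your application - TransmitFile followed by Close leaves the request stuck:
Response.ContentType = "image/jpeg";
Response.TransmitFile(imagePath);
Response.Flush();
Response.Close();

// But this won't - WriteFile (or BinaryWrite / manual streaming) completes as expected:
Response.ContentType = "image/jpeg";
Response.WriteFile(imagePath);
Response.Flush();
Response.Close();
```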
I can replicate the exact same errors on Server 2003 with IIS running in 32 bit mode, Vista x64 and Server 2003 in x64 mode. It does not matter whether you're running ASPX pages or ASHX HttpHandlers - same problem.
I used this code snippet to get a list of the current active requests in IIS (to verify that the “ASP.NET\Requests Current” and “W3SVC_W3WP\Active Requests” counters are not lying):
So obviously the requests are there, they’re just stale.
If we take a look at an IISTrace trace, we can see all of the requests in the “Send data” state. They have all sent all the data and no further data is being sent, but they’re still stuck in the “Send data” state:
For all the other ways to send the file, the request exits the Send data state as soon as all processing is done (that is, not directly after Response.Close). Calling Response.End has no influence.
Symptoms
You may be experiencing this problem without knowing it. Unless you have some load on your site, chances are you will never actually see it. While the requests will go stale and continue to live, a recycle event will kill them off as the process is closed. But you will see this in your System log:
A process serving application pool ‘Classic .NET AppPool’ exceeded time limits during shut down. The process id was ‘13304’.
Since the requests continue living, recycling the pool will time out, forcing the process to shut down and thereby generating the above event. This may lead to increased memory usage depending on your recycle settings. So unless you hit the request queue limit setting on your application pool within the recycle period, you will not notice this problem.
Fix
The easiest way to get around this problem (bug?) is to simply spawn a new thread to do the logging so the main thread completes right after TransmitFile. In most cases the logging operation will be rather fast, so the threads will be short-lived and thus not create too many concurrent threading operations.
Bonus code
Jonathan Gilbert posted a couple of great comments regarding spawning your own threads in the process and the possibility of extracting the actual logging process into a separate service. Since my blog's comments suck in regards to posting code, here are his code parts:
In this post I’ll take a practical approach at talking about what SQL Server Mirroring is, the advantages and considerations that follows.
Availability, not scalability
SQL Server Mirroring aims to increase database availability, not scalability. Boiled down, a mirrored database consists of a principal database on SQL Server instance X and an exact mirror of that database on instance Y. Every time a transaction occurs on X, it is executed on Y as well. While this is happening, the database on instance Y is in recovery mode, meaning you cannot query it directly, and thus you cannot use it as a secondary read-only database to achieve scalability.
While you can run mirroring on different SQL Server instances on the same machine, this defeats the purpose as most errors will be hardware/system based, and usually these will affect all instances on the same physical server. Trying out mirroring across instances is a good way to test it, however. In my demos I will be using virtual PCs, each loaded with SQL Server 2005 Enterprise SP1.
High performance (asynchronous)
As the name implies, maintaining high performance is the key priority in this mode. Whenever a transaction is completed on the principal, the log is sent to the mirror, but the principal does not wait for this to complete. Thus, if the mirror were to die or throw an error during execution, the databases would become out of sync. The mirror could also trail behind due to a difference in computing power or other external factors. If the principal fails, you will have to do a manual failover to the mirror.
High safety (synchronous) - also known as “high protection”
As with the high performance mode, each time a transaction occurs on the principal, it is sent to the mirror. The principal will not commit the transaction until the mirror has committed it as well, so you will never risk your databases being out of sync. The downside is that your mirrored setup will be no faster than the slowest machine that is part of the mirror, plus the implicit overhead of the chatter between the servers. As with the high performance mode, you will have to make a manual failover in case of principal failure.
High safety with automatic failover (synchronous)
This mode involves a third instance besides the principal and mirror, known as the witness. The witness instance constantly monitors the principal for availability, and if a problem is detected, it will automatically perform a failover. At this point, the database as a whole will still be available, but you should manually get the failed mirror up and running again and reinitiate mirroring.
Service pack 1
Make sure all instances have been upgraded to service pack 1. If you do not have service pack 1, you will receive the following warning when trying to start the mirror:
Unified SQL Server service account
If you’re using Active Directory, make sure the SQL Server service accounts are running as the same user. Local System will not work as it does not have any network credentials. If you’re not using Active Directory, just make sure the services are running on an account with the same name & password on each machine. Make sure you change the service account through the SQL Server Configuration application and not the services console. Alternatively you can specify user accounts that should be used for mirror replication, but having the servers run on the same domain account is the easiest way.
Full recovery model
Make sure the database you want to mirror is set up to use the full recovery model, otherwise there'll be no log to ship to the mirror instance and mirroring will not be possible.
Licensing
Mirroring is supported in the SQL Server Standard and Enterprise editions; neither the Express nor the Workgroup edition will work as a mirroring partner. Enterprise supports the high performance operating mode, Standard only supports the two high safety modes. You can, however, use the free Express edition for the witness server. Note that you do NOT need an extra SQL Server license for the mirror server, provided that it does nothing else but maintain the mirrored database - take note of the 30 days clause.
Test setup
My demos will be using three servers, all running Windows Server 2003 Standard (fully updated) through Virtual PC. All three have SQL Server 2005 Enterprise installed. I will be using the Microsoft sample database AdventureWorks, which you can download at CodePlex.
The three servers are RIO, MANDALAY and MGM (yes, I like Vegas). MGM will only be used for setting up a witness, RIO and MANDALAY will both host the actual databases, MANDALAY being the initial principal and RIO being the initial mirror. All servers are completely fresh installations using SQL Server authentication.
I will be connecting to the instances from SQL Server Management Studio running on my desktop computer.
Initial setup of the databases
The first step is to restore the AdventureWorks .bak backup file to both servers. On the principal (MANDALAY) we should make a normal restore (RESTORE WITH RECOVERY) so the database is online. On the mirror (RIO), we should restore into the recovering state so no changes can be made (RESTORE WITH NORECOVERY). You can watch how it’s done here, or skip on to the next section.
Mirroring configuration
Now that we've got both databases set up, we're ready to configure the actual mirror. A couple of notes on what I'm doing during the setup: in the first demo, I'll set up a synchronous high safety mirror with a witness. As all the virtual PCs are running on the same machine, I'll have to use different ports for the endpoints. Whether you want to use encryption for the endpoint communication is scenario specific; encryption has an overhead - albeit a minor one - so it's up to you to determine if it's necessary. As our SQL Server services are running on the same account across the machines, we do not have to specify any custom service account names during the setup.
For some reason, SQL Server needs a fully qualified domain name for the instance addresses. If you're setting this up on a computer that is part of the domain, you should simply use the domain name, [Computer].[domain]:port. In this example my desktop is not part of the Active Directory domain, and thus it'll use addresses like TCP://COMPUTER:PORT, which is not accepted. I'll fix it by simply writing the machine IP addresses manually instead. The IP for MANDALAY is 192.168.0.31 and for RIO it's 192.168.0.33. Note that you should ALWAYS use FQDNs; using IPs is not recommended as it may result in configuration as well as runtime issues. See Adam Machanic's blog post on the same issue, which I ran into a couple of times.
Testing the mirror
Besides my DBA excesses, I'm a developer. And what's more natural than to whip together a small application that tests the mirror's availability?
It continuously attempts to connect to the databases using three different connection strings:
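Roughly, the three connection strings look like this - credentials and database name are placeholders, the important part being the Failover Partner keyword in the last one:

```csharp
// 1) Directly to the principal
string principalOnly = "Data Source=MANDALAY;Initial Catalog=AdventureWorks;User Id=sa;Password=...";

// 2) Directly to the mirror
string mirrorOnly = "Data Source=RIO;Initial Catalog=AdventureWorks;User Id=sa;Password=...";

// 3) The mirror-aware connection combining the two
string mirrored = "Data Source=MANDALAY;Failover Partner=RIO;Initial Catalog=AdventureWorks;User Id=sa;Password=...";
```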
The first connects directly to MANDALAY, the principal database. The second one goes to RIO, the mirror. And the last one is the mirror-enabled connection string that combines the two. The principal should respond and act like any other normal database. The mirror will throw an exception as we cannot interact with a database in recovery mode. The combined connection will automatically connect to the current principal database, whether that be MANDALAY or RIO.
To detect a broken connection quickly, I connect to the databases every 250ms and display a green bar if the connection succeeded (and an UPDATE & SELECT went well), and red if any kind of exception arose. To detect a connection timeout in a timely fashion, I’m using my QuickOpen functionality. The SUM(SafetyStockLevel) is the result of a simple SELECT query being done on the database (the UPDATE modifies the same table, hence the changing values), just to make sure we’re actually talking to a live database.
The following test is a bit more complicated to follow. I've got two SQL Server Profiler windows open; the left one is monitoring the MANDALAY server, the right one is monitoring the RIO server. The windows are so small you cannot see what actually gets logged, but that is the point. The only relevant data in this case is the bottom rightmost value, Rows: X, which displays an increasing row count when the server is active.
I will start out by starting up the mirror testing application. We should see a red bar for the mirror database (RIO) as we cannot connect to it directly, while the principal (MANDALAY) and the mirrored connection should both show green bars. The MANDALAY profiler should also show activity, whilst the RIO profiler should not show any activity.
When it’s running, I’ll initiate a manual mirror failover. A failover means we’re switching roles, thus RIO will become the new principal and MANDALAY will become the mirror. Effectively this should mean the combined connection still shows a green bar, MANDALAY shows red and RIO switches to green.
The TCP/IP connection retry algorithm
The failover went perfectly. There's a short amount of downtime as the actual failover takes place, but shortly thereafter we get green bars again - but one too many. When we started out, we got a red bar when trying to connect to the mirror, RIO. Shouldn't we be getting a red bar when trying to connect to MANDALAY now that we've switched the roles and MANDALAY has become the new mirror? As you can see in the profilers, only RIO is being queried, so although MANDALAY is not responding, the connection string pointing to MANDALAY is succeeding. And what's more confusing is that a new instance of the testing application showed the expected result - a green bar for RIO and red for MANDALAY - at the same time as the existing one showed all greens.
The explanation lies in the connection retry algorithm for TCP/IP connections. When we have a mirrored connection, the partner names are cached when used. Although we couldn't query RIO before, the partner name was cached. Thus, when we make the failover and MANDALAY loses the connection, the client will automatically make a retry attempt against the mirror partner, RIO. When the database comes up again, RIO is responding to both connections successfully. So although the connection string specifies MANDALAY as the endpoint, we're actually talking to RIO directly.
Now, when the cache times out, or if we start a new application (the cache is tied to the SqlClient within a specific AppDomain), the partner name has not been cached and a retry will not be attempted, and that’s why the new instance shows the expected result alongside the old instance.
When a database dies
This is the scenario we've been preparing for. But what happens when one of the databases dies? In high safety mode, a transaction has to be committed on both the principal and the mirror before it's declared successful, but if the mirror dies (whether due to the service stopping, the physical hardware crashing or something else), the principal will enter a disconnected state, still offering full availability. When you get the mirror database up and running again, it will automatically synchronize with the principal and the mirror will continue as if nothing had happened. High performance mode will also continue unaffected by a dead mirror, and it will likewise automatically resync when the mirror comes back online.
Here’s a quick video demonstrating the availability of the principal when the mirror goes down (the short red bars are caused by Virtual PC pausing all VPCs when the menu bar is open).
If the principal dies, we need to promote the mirror to the principal role. As soon as the mirror has taken on the principal role, we have access to our database again. This can be done safely in the synchronous high safety operating mode, as we do not risk any data loss - all transactions are committed in both databases. In the high performance mode, though, we cannot do this, as there could potentially be transactions that have not yet been transferred to the mirror, which would result in data loss. In this case we have to get the principal back online - or accept possible data loss, depending on what's acceptable.
In the following video I'll simulate a dead principal. I'll show the availability tester application running, using MANDALAY as the principal and RIO as the mirror. I'll then pause the MANDALAY server, effectively simulating it dropping completely off the network. You'll then notice all red bars in the tester application, as expected. To get the database up again, we have to do a manual failover to the mirror server, making it the new principal. We do that by executing the query:
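The query in question is the forced failover statement, run against the mirror - shown here wrapped in a bit of C# for consistency with the other snippets (the connection string is a placeholder; the important part is the ALTER DATABASE statement):

```csharp
using (SqlConnection conn = new SqlConnection("Data Source=RIO;Initial Catalog=master;User Id=sa;Password=..."))
using (SqlCommand cmd = new SqlCommand(
    "ALTER DATABASE AdventureWorks SET PARTNER FORCE_SERVICE_ALLOW_DATA_LOSS", conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();
}
```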
There is no GUI that'll execute this query. Soon after I've executed it, we get green bars again. Notice that all bars are green; this is due to the connection retry algorithm explained earlier - the MANDALAY server is still offline. As I refresh the database list on RIO, you'll notice that the database is now listed as “Principal, Disconnected”, confirming that the RIO database has taken on the principal role while disconnected from the mirror. I'll then resume MANDALAY, and as I refresh the MANDALAY database list, you'll notice that the database automatically changed state to “Mirror, Suspended / Restoring” - it picked up on the role change and is now awaiting further commands. It will not automatically resync, as mirroring is suspended when we force the role change through the FORCE_SERVICE_ALLOW_DATA_LOSS option. We first have to resume the mirroring session. After having resumed mirroring, I'll wait for the databases to sync up completely, after which I'll do a manual failover so MANDALAY becomes the principal again. There's a short downtime as the failover takes place, but after that we've got green bars and MANDALAY returns as the principal, with RIO changing to mirror.
And there we go, we just experienced a principal going down with minimal service unavailability.
High safety with a witness
The restoration process we just went through can be automated if we choose to utilize a third server known as the witness. The witness server continually monitors the principal and will initiate a failover in case the principal dies, as well as restoring the mirror's functionality when it returns (that is, converting the previous principal into the new mirror). It requires a third server, MGM (IP 192.168.0.35), for the witness part. Setup is more or less as usual, we just need to include the FQDN for the witness server.
In the last video I'll show how to set up the mirroring including a witness server. I will then start the availability testing application and pause the principal server afterwards. This will immediately result in red boxes all over, but after a short while the RIO server (the new principal) becomes green, and a moment later the mirrored connection itself also becomes green. The MANDALAY box is also green, but again, this is due to the retry mechanism explained earlier. You'll then see that the previous mirror database has changed status to “Principal, Disconnected”, proving that it has taken over the principal responsibility and that it has lost connection to the mirror. I'll then show how the MANDALAY database has changed status to mirror, and how we can do a manual failover so everything goes back to normal. This is the point where FQDNs became necessary: using IPs resulted in the mirror not being able to make the failover. As stated earlier, using IPs is bad practice; you should always use FQDNs. I've added the AD hostnames to my hosts file so I can use them from my non-AD member desktop machine.
Conclusion
SQL Server Mirroring is a great way to increase your availability in case of database failures. You need to understand precisely where mirroring will help and where it won't make a difference. It won't help when you fire off a TRUNCATE [WRONG_TABLE] statement, since that will just be replicated on the mirror - for that you'll still have to roll back via the logs. It will help in the case of a server crashing due to hardware, software or network failures (depending on your network setup) and so forth. It'll also enable you to do rolling upgrades.
While configuration is rather straight forward, mirroring will add complexity to your setup and errors may be harder to track down. You also have to consider the licensing requirements depending on the level of mirroring you’re planning to use.
Switches are rarely nice from an architectural standpoint, but they are often required nonetheless. One of the ways we can reduce the risk of errors as well as increase readability is to use enumeration values instead of constants. Unfortunately this only works for numeric types; we cannot create a string enumeration. Here's a workaround. This is a typical console application, taking in an input value (stored in the input variable) and switching on its content:
The first step is to define the enumeration of values we need to have in our switch statement:
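Something along these lines - the actual values are of course whatever strings your switch needs to handle:

```csharp
public enum Input
{
    Yes,
    No,
    Maybe
}
```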
We cannot convert from strings to the Input enumeration type directly, so we’ll have to use a magic function like this:
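A generic sketch of such a function, built on Enum.Parse:

```csharp
public static T ParseEnum<T>(string value) where T : struct
{
    // The third parameter makes the parse case insensitive - see the note below.
    return (T)Enum.Parse(typeof(T), value, true);
}
```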
Using the above function, we can refactor our initial code like so:
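With the enumeration and helper in place, the switch could end up looking like this (the cases are illustrative):

```csharp
switch (ParseEnum<Input>(input))
{
    case Input.Yes:
        Console.WriteLine("You said yes.");
        break;
    case Input.No:
        Console.WriteLine("You said no.");
        break;
    case Input.Maybe:
        Console.WriteLine("Make up your mind.");
        break;
}
```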
Note that I'm passing in true as the third parameter of the Enum.Parse method; this means the conversion will not be case sensitive. You can change this parameter as needed, or maybe refactor it into a parameter of the function. If the conversion fails - that is, if a matching enumeration value does not exist - an ArgumentException is thrown.
When performing queries against a SQL Server database, there are a couple of methods readily available. However, an option is missing.
The primary timeout value is that of SqlConnection.ConnectionTimeout. This specifies how long the SQL Server service has to respond to a connection attempt. You cannot set this value directly; you'll have to set it as part of the connection string:
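For example, using the Connect Timeout keyword (server and database names are placeholders):

```csharp
string connectionString =
    "Data Source=myServer;Initial Catalog=myDatabase;Integrated Security=SSPI;Connect Timeout=10";
```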
Note that the value is expressed in seconds, not milliseconds. The default value is 30 seconds. Secondly, we can use the SqlCommand.CommandTimeout value. This sets the timeout value of a specific query running on SQL Server. The problem with these two is that we’re missing a point in the pipeline, which goes:
TCP Connection to SQL Server -> SqlConnection.Open -> SqlCommand.Execute
The last two are covered, but if for some reason the SQL Server is dead, taken off the network or totally overloaded, we may get a timeout at the TCP level - and that could take a while. We currently have no way of controlling this timeout besides a server-wide network level setting. Often it's not desirable to have your application potentially spend several minutes waiting for a TCP timeout - or sometimes simply wait indefinitely. We need some way to control this.
What I present below is an example of a SqlConnection extension method called QuickOpen (for lack of a better name - it isn't quicker, it simply fails quicker). It takes a timeout parameter in milliseconds, after which it'll throw a simple Exception. You can modify this to a more appropriate exception type; this is just to show the point. Overall, using this method will introduce a slight delay (a couple of ms), so use it only when necessary, or when a couple of ms per SqlConnection.Open doesn't matter.
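A sketch of how such an extension method could be implemented - open the connection on a background thread and give up if it hasn't completed within the timeout. The original implementation may differ in the details:

```csharp
using System;
using System.Data.SqlClient;
using System.Threading;

public static class SqlConnectionExtensions
{
    public static void QuickOpen(this SqlConnection connection, int timeoutMs)
    {
        Exception openException = null;

        Thread opener = new Thread(() =>
        {
            try { connection.Open(); }
            catch (Exception ex) { openException = ex; }
        });
        opener.IsBackground = true;
        opener.Start();

        // Wait for the open to complete, but no longer than the caller allows.
        if (!opener.Join(timeoutMs))
            throw new Exception("Connection could not be opened within " + timeoutMs + " ms.");

        if (openException != null)
            throw openException;
    }
}
```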