Lesson Learned #1:
The owner of a database or a job should never, ever, neverever be a Windows authenticated user. Especially in a business environment.
Always create an SQL authenticated user specifically for dba management and have all databases, jobs, etc. owned by that user. Jobs should log in as that user.
Lesson Learned #2:
SQL replication from database “X” to database “Y” involves a third database: “Y_distribution“. This database gets created behind your back, along with the jobs to utilize it.
But woe betide you if you think you don’t need to be concerned about it. Monitor that sucker and the jobs that it uses.
Lesson Learned #3:
The Internet is your friend. Google search is your friend. People who put up helpful posts on their blogs are your friends. You are not alone.
How I learned all these lessons:
Items in brackets [] indicate how to locate information in the MSSMS Object Explorer. -> indicates a tree node, ** indicates a right-click
First, let me say that I am not a DBA – I know little about MS SQL Server and what I do know is only because of the occasional pile of doggie-do that drops from the sky and lands in my lap in the form of an OMG emergency (like the one I am about to relate).
Part way through my workday, the super-urgent emergency that had interrupted my high-priority task was itself interrupted by a developing panic-in-the-streets emergency. I was informed that a database I had never heard of was filling up all the drive space on a drive that does not hold any of our databases. Never mind that it didn’t make sense, the disk space situation was now so critical that the server was in danger of crashing within hours.
In our case, we are replicating database “A” on Server1 to database “A” on Server2, where Server2 is also the replication distributor. But I had never heard of the “A_distribution” that was filling up.
It took some searching, but I found the database. This “A_distribution” database showed up in MSSMS on Server2 under the list of System Databases [Server2->Databases->System Databases]. But what was it and where did it come from?
A quick search on the web (see Lesson Learned #3) for SQL _distribution informed me that this is a database used by the replication process. Changes made to “A” on Server1 generate commands which are held in “A_distribution” until they can be executed against “A” on Server2. Depending on certain settings in the replication configuration, these entries can hang around for a while before two jobs (“Agent history clean up: A_distribution” and “Distribution clean up: A_distribution” [Server2->SQL Server Agent->Jobs]) get around to deleting them.
Our problem was that the replication had been set up by a person who left the company several months ago. When the replication was set up, the clean up jobs where assigned his Windows user as their owner and/or login – this includes the clean up jobs mentioned [Server1->SQL Server Agent->Jobs->jobname**properties->General]. Once that person left, his Windows user eventually expired and the clean-up jobs were failing [Server1->SQL Server Agent->Jobs->jobname**View History].
Everything looked okay; replication was working just fine. But the jobs that should have been removing old commands could not, and “A_distribution” was getting bigger and bigger.
Add to that that the “A_distribution” database was created, by default, on the C drive of Server2. We typically do not put databases on the C drives of any server, preferring to reserve C for operating systems and system applications. So our C drives tend to be small. Because of the small size of the C drive, by the time our monitoring software detected the low volume of free space, it was too late. The database had grown to such a size that we had only hours before it would gobble up the last few free sectors.
Now that I realized the cause of the problem, I changed the owner of the jobs (see Lesson Learned #1). I re-ran the 2 jobs [Server1->SQL Server Agent->Jobs->jobname**Start Job at Step…], and though the first one completed successfully, the second one died, reporting that the log file for “A_distribution” was full and could not be expanded due to – you guessed it – low disk space.
We needed to move the database to another drive. But it didn’t seem like it would be easy to do. The usual procedure would involve detaching the database [Server2->Databases->database**Tasks->Detach…]. However this option is not available for System Databases. Back to the Internet.
Many many thanks to Aravind Lakshminarayanan for this post which details how to move a replicated database and its distribution database to another drive without having to remove and rebuild replication. This saved me a lot of aggravation and general stumbling-around-in-the-dark. I didn’t need to move the replicated database (as that was already on Server2‘s D drive, but I did follow his steps for moving the “A_distribution” database and logfile.
Once that was done, the clean up jobs ran flawlessly, the sun came out to shine on me (even though it was past 2200h) and world-wide peace and order were finally established.
Why this post is here:
The lessons seem obvious and trivial. The story is boring to everyone but me. So why post it?
As a reminder to myself that the obvious and trivial require the utmost attention and are critical. I went through a lot of adrenaline and cortisol while this was going on. It was not a pleasant feeling to know that you have to diffuse the bomb with 30 seconds left and there is no one around who knows which wire to cut.
All this could have been avoided if Lesson #1 was not only learned, but inculcated as a rule… a law… even better, a habit.