In order to keep your SQL Server up and running smoothly you need to constantly be performing routine maintenance and monitoring work. If you do not keep a watchful eye over your SQL Server instances performance and stability might suffer. Or worse yet you might not be able to recover your server should you have a total server melt down. In this article I will be discussing some of the daily tasks a DBA should be performing. Additionally I will be providing a few scripts and suggestions to help minimize the amount of time you have to spend performing these daily tasks. Keep in mind every environment is a little different and requires different sets of daily tasks. The tasks covered in this article are those common tasks that are applicable to most, if not all SQL Server Installations, but by no means consider the list here complete.
Verifying Services Availability
You need to make sure your critical SQL Server services are available. One of the first tasks you should perform in the morning when arriving at work is to determine if all your SQL Server services are up and running. It is better to be proactive, then reactive. There is nothing worse than having a customer call you and tell you one of the services you support is not running.
Ideally I would implement a monitoring tool that will automatically email you when one of your critical services is not working. This way as soon as a server becomes unavailable you can react to bring it back up. You could build a homegrown tool to do this or purchase an off the shelf product to perform your monitoring. If you don’t have an automated tool you could use the Registered Servers tool inside of SQL Server Management Studio to send a single T-SQL statement to all your servers. Verifying that each of your SQL Server instances are working and responding to simple T-SQL statement allows you to rest easy knowing each instance is responding as expected. By monitoring your instances you might eliminate most of those awkward calls asking whether or not a particular server is up and running.
It is especially important to verify your services are running after OS patches or Service Packs have been installed. A number of times I have found that our servers have not come back online properly due to some system maintenance work. Do not rely on your operations staff to verify your services. Normally when they are applying critical update patches they have 100’s or 1000’s of servers to patch. Because of this they might have overlooked one of your servers. Therefore it is best that you also check your servers after OS patches or Service Packs have been installed.
Monitoring SQL Agent Jobs:
Mostly likely you are running at least one nightly SQL Agent job on each SQL Server instance you manage. You need to be checking these jobs daily to make sure they ran successfully. Of course you can put notifications in the SQL Server jobs so you get emails you when a job fails. But I find sometimes that emails get buried in with a bunch of other emails and can go unnoticed. Therefore you should do a quick scan first thing every morning to make sure that all your nightly SQL Agent jobs have run successfully.
There are a number of ways to quickly review your SQL Server Agent jobs; just search the web and you will find a few options. I personally have written a Reporting Services report that provides me with a list of all failed SQL Server Agent jobs, or job steps across all of the instances that I manage. I have created a subscription to this report so that every morning when I arrive at work I have a report in my inbox that I can review to see if any jobs failed during the night. Having this one single report that covers all my instances allows me to quickly identify those SQL Server agent jobs that failed overnight. If I find one that failed, many times I can fix the problem and re-run it before anyone even notices. The code in Listing 1 is the code behind my Reporting Services reports that runs on each instance I am monitoring.
SET NOCOUNT ON DECLARE @c INT -- Table of failed jobs DECLARE @failed_jobs TABLE ( job_id uniqueidentifier, step_id int, step_name sysname, sql_message_id int, sql_severity int, message nvarchar(1024), run_status int, FailureDate datetime, server nvarchar(30) ) INSERT @failed_jobs( job_id, step_id, step_name, sql_message_id, sql_severity, message, run_status, FailureDate, server ) SELECT job_id, step_id, step_name, sql_message_id, sql_severity, message, run_status, FailureDate, server FROM ( SELECT job_id, step_id, step_name, sql_message_id, sql_severity, message, run_status, ( DATEADD(ss, CAST(SUBSTRING(CAST( jh.run_duration + 1000000 AS char(7)), 6, 2) AS int), DATEADD(mi, CAST(SUBSTRING(CAST( jh.run_duration + 1000000 AS char(7)), 4, 2) AS int), DATEADD(hh, CAST(SUBSTRING(CAST( jh.run_duration + 1000000 AS char(7)), 2, 2) AS int), DATEADD(ss, CAST(SUBSTRING(CAST( jh.run_time + 1000000 AS char(7)), 6, 2) AS int), DATEADD(mi, CAST(SUBSTRING(CAST( jh.run_time + 1000000 AS char(7)), 4, 2) AS int), DATEADD(hh, CAST(SUBSTRING(CAST( jh.run_time + 1000000 AS char(7)), 2, 2) AS int), CONVERT(datetime, CAST(jh.run_date AS char(8))) )))))) ) AS FailureDate, server FROM msdb.dbo.sysjobhistory AS jh ) AS jh WHERE (GETDATE() > jh.FailureDate) AND (jh.run_status = 0) -- Identify how many days to go back and look for failures AND (DATEADD(dd, -1 , GETDATE()) < jh.FailureDate) SELECT @c=count(*) FROM @failed_jobs IF @c > 0 BEGIN SELECT SUBSTRING(j.name, 1, 50) AS JobName, SUBSTRING(jh.step_name, 1, 50) AS StepName, message, jh.FailureDate AS FailureDate FROM @failed_jobs jh INNER JOIN msdb.dbo.sysjobs j ON jh.job_id = j.job_id END ELSE SELECT 'No Failed Jobs For Reporting Period' JobName, ' ' StepName, ' ' FailureDate
Listing 1: List jobs that have failed in the last 24 hours
When you are monitoring SQL Server agent jobs you should also review the job history. Occasionally something might happen in your environment that makes a job run longer than normal. If you know the average time a SQL Server agent jobs runs, and then you see it take less or more than then norm, this might indicate a problem worth looking into.
Occasionally I see SQL Server agent jobs that appear to be hung. When this happens the next scheduled run will not occur if it is hung and still running. In order to avoid missing the next run cycle of a job you need to make sure hung jobs are stopped, and the problem that caused the job to hang is resolved prior to the next run. Therefore you need to monitor for long running jobs that might have hung. There are a number of ways to do this. If you search the web for “long running SQL agent jobs” you will find a number of solutions you can use to identify your long running jobs.
Monitoring Database Backups
Knowing whether you have backups of your databases is critical. In fact this should be one of your top priorities as a DBA. Without database backups you will not be able to recovery your databases. You need to check daily to make sure your backup strategy is working, and backing up all of your databases as expected. Therefore every morning you should verify that you have a set of backups. One way to do that is to run a script, similar to the one I have provided in Listing 2:
-- Find DBs that have not been backed up in last 24 hour SELECT name AS database_name , backup_finish_date , coalesce(type,'NO BACKUP') AS last_backup_type FROM (SELECT database_name , backup_finish_date , CASE WHEN type = 'D' THEN 'Full' WHEN type = 'I' THEN 'Differential' WHEN type = 'L' THEN 'Transaction Log' WHEN type = 'F' THEN 'File' WHEN type = 'G' THEN 'Differential File' WHEN type = 'P' THEN 'Partial' WHEN type = 'Q' THEN 'Differential partial' END AS type FROM msdb.dbo.backupset x WHERE backup_finish_date = (SELECT max(backup_finish_date) FROM msdb.dbo.backupset WHERE database_name = x.database_name ) ) a RIGHT OUTER JOIN sys.databases b ON a.database_name = b.name WHERE b.name <> 'tempdb' -- Exclude tempdb AND (backup_finish_date < dateadd(d,-1,getdate())
or backup_finish_date is null)
Listing 2: Script to display all databases that have not been backed up in the last 24 hours
The script in Listing 2 lists each database that does not have a backup that has finished in the last 24 hours. It reports the time the last backup was run and the type of backup. You can use the results returned from this script to verify each of your databases have had a backup in the last 24 hours.
Check the Event Log
Computer hardware is always prone to having problems. Some of these problems, like disk drive failures, might come on quickly or might slowly degrade before eventually having a total melt down. The system Event logs sometimes will have messages that indicate that hardware is having intermittent problems.
Additionally SQL Server, as well as other applications, write messages to the event log. Some of these messages are informational, while others have warning messages and error information. These application type errors may not cause a problem with applications and therefore go unnoticed. By reviewing the event log daily for these warning and errors type messages you have a better chance at resolving issues that might eventually turn into an outage of your server, or SQL Server instance.
You should consider looking at the event logs daily. Make sure you review all the different event logs, like system, application, etc. Checking these logs daily provides you with confidence that your servers are functioning normal and no abnormal or rogue applications are running on your machine.
Check the SQL Server ERRORLOG file
SQL Server maintains an ERRORLOG file. The ERRORLOG file contains both informational, warning and errors messages that happen. Some of the events written to the ERROR log might indicate some problems with your SQL Server instance, like corruption in a database, or slow response to an I/O requests. You should be reviewing the current ERRORLOG file for any messages that are out of the ordinary. By doing this you might have time to resolve a problem before it escalates into a critical problem, or outage of your SQL Server instance. You should get into the habit of looking at the ERRORLOG daily.
The ERRORLOG file can easily be read by using the undocumented extended stored procedure named xp_readerrorlog. Browse the web to find out more information about how to use this undocumented stored procedures to read your error logs. You can also use SQL Server Management Studio, or your favorite text editor to review the ERRORLOG.
Disk Space Availability
As your databases grow over time they will eat up more and more of your available disk space. Unless you have turned off the auto grow options on all of your databases, your databases will grow automatically whenever they need more disk space. These auto growth events could take up much or all of your free disk space quickly if some large rogue data import process was run.
It is important to make sure you always have enough space for your databases to grow. In order to ensure that your drives have appropriate free space and auto growth events have not taken lots of your free disk space you need to monitor available drive space. You can use the undocumented xp_fixeddrrives extended stored procedure to show the amount of free space on each drive. By building a process around xp_fixeddrives output, you can easily be alerted when your drive free space falls below a defined threshold.
Be Proactive, Monitor Daily
DBAs should be proactive and monitor each SQL Server environment daily. By monitoring daily you can quickly identifying problems or potential problems and solve them before they escalate into bigger problems. The listings I provided here and the scripts are by no means complete. You should consider what you think is important to monitor daily for your environment. Once you have developed your list, I encourage you to determine ways to automate your morning monitor tasks to help streamline your monitoring process.