One DBA's Ongoing Search for Clarity in the Middle of Nowhere


*or*

Yet Another Andy Writing About SQL Server

Tuesday, December 22, 2015

The transaction log for database 'ABC' is full due to 'ACTIVE_BACKUP_OR_RESTORE'

I recently had a client whose processes were failing each morning with this message in their SQL Server Error Log:
Date       12/01/2015 5:58:48 AM
Log        SQL Server (Current - 12/01/2015 1:00:00 PM)
Source     spid118
Message    Error: 9002, Severity: 17, State: 3.

--

Date       12/01/2015 5:58:48 AM
Log        SQL Server (Current - 12/01/2015 1:00:00 PM)
Source     spid118
Message    The transaction log for database 'ABC' is full due to 'ACTIVE_BACKUP_OR_RESTORE'.
Their question was how their log could be full since their database is in SIMPLE recovery.   

Their particular situation is that database ABC is a 650GB OLTP database, and the backup to the remote location was taking 7-8 hours each night over the time in question:

server_name  database_name  backup_start_date  backup_finish_date  physical_device_name          type  BackupSizeGB
-----------  -------------  -----------------  ------------------  ----------------------------  ----  ------------
Server1      ABC            12/4/2015 23:30    12/5/2015 7:11      VNBU0-29756-4248-1450758615   D     630
Server1      ABC            12/3/2015 0:00     12/3/2015 7:46      VNBU0-33644-31396-1450499421  D     630
Server1      ABC            12/1/2015 23:30    12/2/2015 6:22      VNBU0-30500-35052-1450413013  D     424
Server1      ABC            11/30/2015 23:30   12/1/2015 6:37      VNBU0-18236-33032-1450326613  D     468
Server1      ABC            10/30/2015 23:30   10/31/2015 4:51     VNBU0-5696-14276-1447734610   D     386
Server1      ABC            10/29/2015 0:05    10/29/2015 5:27     VNBU0-14976-21580-1447475427  D     378
Server1      ABC            10/27/2015 23:31   10/28/2015 4:59     VNBU0-18040-27960-1447389025  D     367
Server1      ABC            10/26/2015 23:31   10/27/2015 4:34     VNBU0-20180-26980-1447302625  D     356
Server1      ABC            10/25/2015 23:31   10/26/2015 5:00     VNBU0-22808-28180-1447216223  D     372
Server1      ABC            10/24/2015 23:31   10/25/2015 5:21     VNBU0-6160-15336-1447129821   D     372
Server1      ABC            10/23/2015 23:31   10/24/2015 5:01     VNBU0-5396-24044-1447043425   D     359
Server1      ABC            10/22/2015 23:31   10/23/2015 7:47     VNBU0-8796-18884-1446957027   D     375
Server1      ABC            10/21/2015 23:37   10/22/2015 5:29     VNBU0-18032-28004-1446870623  D     364
Server1      ABC            10/20/2015 23:31   10/21/2015 5:00     VNBU0-8692-19836-1446784216   D     371

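A backup history like the table above can be pulled straight from msdb. A sketch, assuming the standard msdb backup-history tables (adjust the filters to taste):

```sql
-- Recent FULL backup history for database ABC, newest first
-- (backup_size is in bytes; type 'D' = full/database backup)
SELECT bs.server_name,
       bs.database_name,
       bs.backup_start_date,
       bs.backup_finish_date,
       bmf.physical_device_name,
       bs.type,
       CAST(bs.backup_size / 1024.0 / 1024.0 / 1024.0 AS DECIMAL(10, 0)) AS BackupSizeGB
FROM msdb.dbo.backupset bs
INNER JOIN msdb.dbo.backupmediafamily bmf
    ON bs.media_set_id = bmf.media_set_id
WHERE bs.database_name = 'ABC'
    AND bs.type = 'D'    -- FULL backups only
ORDER BY bs.backup_start_date DESC;
```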

The way FULL backups work in SQL Server, the transaction log is not released for re-use during a FULL backup, even if regular LOG backups are occurring or the database is in SIMPLE recovery.  This is because the portion of the LOG used during the FULL has to be persisted for the duration of the FULL so that it can be backed up at the end – that is, the FULL backup includes the data as of the start of the FULL (23:30) *plus* the LOG generated until the end of the FULL (in the case of the 12/04-12/05 backup, the LOG used from 23:30 to 07:11).  This is the meaning of the ACTIVE_BACKUP_OR_RESTORE message – the LOG is waiting for the end of the active FULL backup before it can be released for re-use, which in this case was causing the LOG/LDF file to grow until it filled its mount point.

What this means is that the LOG file has to be large enough (or be able to auto-grow large enough) to hold all of the work done during the FULL backup.  In this system (like many others) there is maintenance done overnight (the largest on this system being a purge job which runs for 7-8 hours each night against database ABC almost exactly during this same time window).  This maintenance was generating hundreds of GB of LOG each night during the FULL backup, resulting in the errors shown above.
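While a FULL backup is running, you can watch what is holding up log truncation via sys.databases - during the window in question it would have shown ACTIVE_BACKUP_OR_RESTORE even with the database in SIMPLE recovery. A sketch:

```sql
-- What is preventing transaction log reuse, per database?
SELECT name,
       recovery_model_desc,
       log_reuse_wait_desc
FROM sys.databases
WHERE name = 'ABC';
```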

For this client, the problem turned out to be that the NetBackup process performing these particular backups had been replaced by a snapshot style backup and disabled at the end of October but had recently been accidentally re-enabled.  Shutting off this NetBackup schedule killed these backups, directly resolving the issue.

The alternative would be to have a sufficiently large LOG/LDF file to hold the work done during the FULL backup - even in SIMPLE recovery this is required.  Another option would be to examine the nightly maintenance and schedule it to a different window, away from the FULL backup to minimize the amount of LOG/LDF that needs to persist during the FULL.
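If you take the bigger-LDF route, pre-size the file rather than relying on auto-grow in the middle of the backup window. A sketch - the logical file name and the sizes here are hypothetical placeholders; size from the LOG usage you actually observe during the FULL:

```sql
-- Pre-size the log so it can absorb all work done during the FULL backup
-- ('ABC_log', 200GB, and 8GB are placeholders, not recommendations
ALTER DATABASE ABC
MODIFY FILE (NAME = N'ABC_log', SIZE = 200GB, FILEGROWTH = 8GB);
```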

Many people don't consider this situation, and it doesn't come up often - usually only in this specific combination of a relatively large database that is also busy enough to fill the LOG during the FULL backup.

Hope this helps!

--

NOTE: as I was finishing polishing this for release I noticed that Erik Darling (blog/@dumb_chauffeur) released a similar post earlier today "What happens to transaction log backups during full backups?" on the Brent Ozar Unlimited blog.  His post has good detail on proving the situation's existence via the Log Sequence Numbers (LSNs) stored in the FULL and LOG backups.




Tuesday, December 8, 2015

T-SQL Tuesday #73 - Not Running CHECKDB is Naughty!


This month T-SQL Tuesday is hosted by Bradley Ball (blog/@SQLBalls) and his topic of choice is “Naughty or Nice.”

(If you don’t know about T-SQL Tuesday, check out the information here – each month someone hosts a new topic, and it is a great excuse to write regularly (or to start writing!) because the topic is already chosen for you.)

--

Probably the number one problem (Naughty, Naughty!) I find when I run across a new server is DBCC CHECKDB - or rather, the complete lack of CHECKDB.



I have heard any number of excuses over the years, but they are almost all some variant of this:

It takes too much time/CPU/RAM/IO/Wheatons to run CHECKDB on my databases.

To this I say one thing - <BLEEP!>

At the end of the day you can't afford *not* to run CHECKDB - you just can't.  Your database objects are only as reliable as your last clean CHECKDB (meaning your last CHECKALLOC + CHECKCATALOG + individual CHECKTABLEs on all of your tables - but more on that in a moment).

If you are not running regular CHECKDB, you may have unknown corruption lurking across your databases, and you won't find out until someone tries to access the data.
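One quick check worth running in the meantime: SQL Server records pages it has already stumbled across as damaged (during reads, backups WITH CHECKSUM, etc.) in msdb. A sketch:

```sql
-- Pages SQL Server has already flagged as suspect
-- (event_type 1-3 indicate checksum/torn-page/other errors)
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages;
```

An empty result here is necessary but not sufficient - it only lists corruption that has already been encountered, which is exactly why you still need CHECKDB.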

Ask yourself - wouldn't you rather find out about data corruption from a regular maintenance process than from the Senior Vice President when she tries to run her Management Big Bonuses Report and it fails?

My recommendation is always to run CHECKDB as often as is practical, up to a sensible maximum of once per day (there is no need to run CHECKDB every minute just because you can).

If you can't run daily CHECKDB's, then run it weekly (or even monthly), but run it.
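If you aren't sure when (or whether) CHECKDB last completed cleanly on a database, the last-known-good date is stored in the database boot page. A sketch using the undocumented but widely used DBCC DBINFO (database name is illustrative):

```sql
-- Look for the dbi_dbccLastKnownGood row in the output;
-- a value of 1900-01-01 means CHECKDB has never completed cleanly
DBCC DBINFO (N'BigDatabase01') WITH TABLERESULTS;
```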

One thing I have done at multiple clients that have not had "sufficient" resources to run full CHECKDB on a regular basis is to split the job up into its component parts mentioned above.  This comes from an idea originally given by Paul Randal (blog/@PaulRandal) and works out like this:

First, analyze the size of your tables to create groupings of comparable size.  I like a personal variant of the script at http://stackoverflow.com/questions/15896564/get-table-and-index-storage-size-in-sql-server from "marc_s" that breaks out the objects:
SELECT
    s.name AS SchemaName,
    t.name AS TableName,
    p.rows AS RowCounts,
    SUM(a.total_pages) * 8 AS TotalSpaceKB,
    SUM(a.used_pages) * 8 AS UsedSpaceKB,
    (SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
FROM sys.tables t
INNER JOIN sys.schemas s
    ON s.schema_id = t.schema_id
INNER JOIN sys.indexes i
    ON t.object_id = i.object_id
INNER JOIN sys.partitions p
    ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a
    ON p.partition_id = a.container_id
WHERE t.name NOT LIKE 'dt%'    /* filter out system tables for diagramming */
    AND t.is_ms_shipped = 0
    AND i.object_id > 255
GROUP BY t.name, s.name, p.rows
ORDER BY TotalSpaceKB DESC;
Next, after you have divided the tables into two to four groups, schedule jobs running CHECKTABLE against each group, with a final job running a catch-all step for all tables not explicitly checked.  Something like this (I like Ola Hallengren's maintenance solution, so this syntax is from there, but the concept should be clear even if you aren't familiar with his code):


Job 1 - run CHECKTABLE on three big tables in BigDatabase01:
EXECUTE [dbo].[DatabaseIntegrityCheck]
    @Databases = 'BigDatabase01',
    @CheckCommands = 'CHECKTABLE',
    @Objects = 'BigDatabase01.dbo.BigTable01, BigDatabase01.dbo.BigTable02, BigDatabase01.dbo.BigTable03',
    @LogToTable = 'Y'

Job 2 - run CHECKTABLE on the next two big tables in BigDatabase01:
EXECUTE [dbo].[DatabaseIntegrityCheck]
    @Databases = 'BigDatabase01',
    @CheckCommands = 'CHECKTABLE',
    @Objects = 'BigDatabase01.dbo.BigTable04, BigDatabase01.dbo.BigTable05',
    @LogToTable = 'Y'

Job 3 - Step 1 - run CHECKTABLE on all tables in BigDatabase01 except the five big tables from Jobs 1 and 2.  This "catch-all" step is very important because you need to make sure new tables added to the database get checked - if all you have are steps running against individually named tables, new tables never get checked!
EXECUTE [dbo].[DatabaseIntegrityCheck]
    @Databases = 'BigDatabase01',
    @CheckCommands = 'CHECKTABLE',
    @Objects = 'ALL_OBJECTS, -BigDatabase01.dbo.BigTable01, -BigDatabase01.dbo.BigTable02, -BigDatabase01.dbo.BigTable03, -BigDatabase01.dbo.BigTable04, -BigDatabase01.dbo.BigTable05',
    @LogToTable = 'Y'

Job 3 - Step 2 - run CHECKTABLE on all tables in all databases except BigDatabase01.  The same "catch-all" logic applies here - if you run checks against individually named databases (rather than all databases except BigDatabase01), then new databases don't get picked up:
EXECUTE [dbo].[DatabaseIntegrityCheck]
    @Databases = 'USER_DATABASES, -BigDatabase01',
    @CheckCommands = 'CHECKTABLE',
    @Objects = 'ALL_OBJECTS',
    @LogToTable = 'Y'

Job 3 - Step 3 - run CHECKALLOC on all databases:
EXECUTE [dbo].[DatabaseIntegrityCheck]
    @Databases = 'USER_DATABASES',
    @CheckCommands = 'CHECKALLOC',
    @LogToTable = 'Y'

Job 3 - Step 4 - run CHECKCATALOG on all databases:
EXECUTE [dbo].[DatabaseIntegrityCheck]
    @Databases = 'USER_DATABASES',
    @CheckCommands = 'CHECKCATALOG',
    @LogToTable = 'Y'

Over the course of three jobs (usually scheduled over three nights), this plan runs CHECKTABLE on all tables in all databases, runs CHECKALLOC on all databases, and runs CHECKCATALOG on all databases - BINGO! CHECKDB!

This is a relatively simple example with a server that has one large database with multiple large tables. (Of course, this is a fairly common situation on many of these "can't CHECKDB" servers.)  The same principle - divide the CHECKDB operation into its component parts - has allowed me to schedule CHECKDB on multiple systems where a client said "oh we can *never* run CHECKDB here."

This same principle of exclusion is useful for starting to run CHECKDB (or its components) on systems with large databases or objects, even if you can't yet include the offending objects.  Usually when I find one of these servers, there are no CHECKDB's in place on *any* databases even though the problem is really just ReallyBigTable01 in BiggestDatabase01.  A great first step is to set up CHECKDB or its components on everything except the offending object(s) - as above, set up CHECKTABLE on everything on the server except ReallyBigTable01, plus CHECKCATALOG and CHECKALLOC everywhere, and you have much more protection while you figure out what to do about ReallyBigTable01.

--

At the end of the day - stay off the naughty list - 
RUN CHECKDB!







Thursday, December 3, 2015

What Do You Mean There is No Current Database Backup?!?

Yet another late night/early morning tale...

The LOG backups for multiple databases on a server failed overnight (not all databases, but many of them), and looking in the output file from the LiteSpeed Maintenance plan (the client's tool of choice) I found entries like this:

Database Database01: Transaction Log Backup...
Destination: "I:\MSSQL\Transaction_Log_Backups\Database01\Database01_tlog_201511211012.TRN"

SQL Litespeed™ Version 4.8.4.00086
Copyright (C) 2004-2006, Quest Software Inc.
Quest Software.
Registered Name: ServerA

Msg 62309, Level 19, State 1, Line 0
SQL Server has returned a failure message to LiteSpeed which has prevented the operation from succeeding.

The following message is not a LiteSpeed message. Please refer to SQL Server books online or Microsoft technical support for a solution:
BACKUP LOG is terminating abnormally.
BACKUP LOG cannot be performed because there is no current database backup.

I checked the FULL backup job, and it had completed normally that night.

Looking back in the SQL Server Error Log I saw that the problems started at or around 9 pm local server time the previous night – when I looked at that time I found paired messages like these for *all* of the databases - not just the offending ones:

Date       11/20/2015 9:08:18 PM
Log        SQL Server (Current - 11/21/2015 10:16:00 AM)
Source     spid110
Message    Setting database option RECOVERY to SIMPLE for database Database01

--

Date       11/20/2015 9:08:22 PM
Log        SQL Server (Current - 11/21/2015 10:16:00 AM)
Source     spid111
Message    Setting database option RECOVERY to FULL for database Database01.

These messages showed that a process of some kind ran just after 9 pm that switched the databases from FULL recovery to SIMPLE and then back again.  This broke the LOG recovery chain and required new FULL backups before any LOG backups could succeed, which is why the LOG backup job was failing.
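Messages like those can be cross-checked against each database's current state; a sketch:

```sql
-- Current recovery model per database
SELECT name, recovery_model_desc
FROM sys.databases
ORDER BY name;
```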

The regular nightly FULL backup job on this server runs at 6 pm and normally takes 3-4 hours to complete (that night it took 4.25).  The databases whose FULL backups completed prior to the FULL>>SIMPLE>>FULL switch were the ones failing; the databases that weren't backed up until after the switch were still OK because they had a FULL *after* the switch to re-start their LOG recovery chain.

The fix in this case was to run new FULL backups of the impacted databases, which thankfully I was able to do without much impact during the morning. (Although DIFF backups could have worked too, as discussed here by Paul Randal (blog/@PaulRandal).)
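In sketch form, either statement re-establishes a restorable base for LOG backups; the DIFF is usually much smaller and faster (database name and paths here are illustrative):

```sql
-- Option 1: new FULL backup (re-starts the LOG chain)
BACKUP DATABASE Database01
TO DISK = N'I:\MSSQL\Backups\Database01_full.bak'
WITH CHECKSUM, STATS = 10;

-- Option 2: DIFF backup - valid as long as an earlier FULL backup still exists
BACKUP DATABASE Database01
TO DISK = N'I:\MSSQL\Backups\Database01_diff.bak'
WITH DIFFERENTIAL, CHECKSUM, STATS = 10;
```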

I never could get the users to admit who had made a change (although mysteriously it hasn't happened again :))

--

The cautionary tale part of this relates to the recovery switch.  It is not uncommon to see recommendations to switch the recovery model prior to a large data load, or even a nightly import/export process, as part of transaction LOG management (although, depending on your operations, a switch to BULK_LOGGED can be just as effective without breaking the LOG chain - though you lose point-in-time recovery for periods *during* any minimally logged operations).  I am not a fan of switching like this, but it is out there.
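The BULK_LOGGED alternative looks like this in sketch form - the LOG chain stays intact, at the cost of point-in-time recovery during the minimally logged work (database name and path are illustrative):

```sql
-- Before the large load: minimally log bulk operations without breaking the LOG chain
ALTER DATABASE Database01 SET RECOVERY BULK_LOGGED;

-- ... bulk load / index rebuild here ...

-- Afterward: back to FULL, then a LOG backup to bound the exposure window
ALTER DATABASE Database01 SET RECOVERY FULL;
BACKUP LOG Database01
TO DISK = N'I:\MSSQL\Transaction_Log_Backups\Database01.trn';
```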

The takeaway is that you need to plan your backup/restore schedule accordingly - if you use a recovery model switch mechanism make sure to schedule your FULL backups *after* the recovery switch so that this problem doesn't bite you.

Also - always scan vendor-provided scripts for recovery model changes - we all know it is good practice to back up databases before and after major changes (we know that, right?) but it also often doesn't happen for one reason or another.  If a script changes the recovery model to SIMPLE and back, you will have this problem if you don't take a new FULL or DIFF backup after the script runs.

Hope this helps!




Tuesday, November 24, 2015

We Are All Responsible


Today we all received an email from PASS's president Tom LaRock (blog/@SQLRockstar):


PASS Anti-Harassment Policy Reminder
It is unfortunate that I have to write this letter, but it has become necessary.
An Anti-Harassment Policy (AHP) was implemented in 2012 to help provide a harassment-free conference experience. Everyone attending PASS Summit is expected to follow the AHP at all times, including at PASS-sponsored social events. This includes, but is not limited to, PASS staff, exhibitors, speakers, attendees, and anyone affiliated with the event.
This year at PASS Summit I served on the Anti-Harassment Review Committee. As such, it was my responsibility to help review complaints and incidents reported during Summit. The PASS Summit experience should be enjoyable, exciting, and safe for everyone at all times. However, I am disappointed to say that PASS Summit was a negative experience for a few members of our #SQLFamily this year.
I expect Summit attendees and PASS members to treat others with respect at all times, whether that is inside a session room, the conference hallway, a PASS sponsored event, or even at a local coffee shop.
On a positive note, there were people actively using the policy this year and supporting one another onsite as well. I am proud to see that our community has embraced having this policy. It is wonderful to know that our #SQLFamily will not put up with these types of incidents.
If you have concerns or want to report an incident within the guidelines of the AHP, I encourage you to contact governance@sqlpass.org.
Thomas LaRock
PASS President

--


It is sad to me that we still live in a world where it is necessary to remind people of things like this.  I am not deluded that events like those in this post by Wendy Pastrick (blog/@wendy_dance) will never happen, but it is still disappointing every time I hear about one.

No one - regardless of gender/ethnicity/anything else - should have to suffer from an uncomfortable environment or even worse outright physical or mental abuse, especially in a professional setting.  When something like this does happen it is on all of us to speak up to try to end the situation and to prevent it from recurring.

I commit to personally working harder at this - when we see something unacceptable happen, Speak Up!

At the end of the day, we are all responsible for ourselves *and* for each other in our #sqlfamily.

As always, my $.02

--

Also - read Erin Stellato's (blog/@erinstellato) excellent post here giving *her* $.02.


Monday, November 16, 2015

CHECKDB - The database could not be exclusively locked to perform the operation

…so the job failed….again…you know the one – that darn Integrity Check job: 
Executing query "DECLARE @Guid UNIQUEIDENTIFIER EXECUTE msdb..sp_ma...".: 100% complete  End Progress  Error: 2015-11-15 03:27:52.43     Code: 0xC002F210     Source: User Databases Integrity Check Execute SQL Task     Description: Executing the query "DBCC CHECKDB(N'master')  WITH NO_INFOMSGS  " failed with the following error: "The database could not be exclusively locked to perform the operation.  Check statement aborted. The database could not be checked as a database snapshot could not be created and the database or table could not be locked. See Books Online for details of when this behavior is expected and what workarounds exist. Also see previous errors for more details."

This particular SQL Server 2014 Standard server is relatively new at a client that still uses old style maintenance plans (haven’t yet been able to convince them to #GoOla).  My first instinct in these cases is always that there is a problem related to the maintenance plan itself rather than the database or the integrity check command.  Since the failure is on the small yet unbelievably important master database I decided to just run the command from the query window to find out… 
DBCC CHECKDB(N'master') WITH NO_INFOMSGS

Msg 5030, Level 16, State 12, Line 1
The database could not be exclusively locked to perform the operation.
Msg 7926, Level 16, State 1, Line 1
Check statement aborted. The database could not be checked as a database snapshot could not be created and the database or table could not be locked. See Books Online for details of when this behavior is expected and what workarounds exist. Also see previous errors for more details.
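As an aside, the workaround Books Online alludes to is running the check WITH TABLOCK, which uses locks instead of an internal snapshot - though it limits the checks performed (CHECKCATALOG and Service Broker validation are skipped) and it would not have addressed the root cause here. A sketch against a user database (name is illustrative):

```sql
-- Snapshot-free alternative: takes locks instead of creating a snapshot;
-- skips CHECKCATALOG and Service Broker validation
DBCC CHECKDB (N'Ntirety') WITH NO_INFOMSGS, TABLOCK;
```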

I have a small private user database on the system so I tried CHECKDB there: 
DBCC CHECKDB(N'Ntirety') WITH NO_INFOMSGS

Command(s) completed successfully.

OK….now what?

A Google search for “CHECKDB master could not be exclusively locked” brought back a short list of possibles, which turned into a strand of spaghetti through the usual suspects – ServerFault, StackExchange, support.microsoft.com – before I found a question on SimpleTalk that sounded spot on and referenced the most reliable source on CHECKDB I know, Paul Randal (blog/@PaulRandal).

Paul’s blog post “Issues around DBCC CHECKDB and the use of hidden database snapshots” discusses the need to have certain permissions to be able to create the snapshot CHECKDB uses.  I checked the DATA directory and the SQL Server default path and found that the service account did have Full Control to those locations.

What happened next ultimately resolved my issue, and it reflects something I constantly tell people when they ask me how I research things relatively quickly (most of the time anyway :)) – whenever you read a blog post or article about a subject, MAKE SURE TO READ THE FOLLOW-UP COMMENTS!  Sometimes they are nothing beyond “Great Article!” but quite often there are questions and answers between readers and the author that add important extra information to the topic, or just “Don’t Forget This!” style comments that add more detail.

In this case, one of the commenters said this: 
Brandon Tucker says:
August 27, 2013 at 2:19 pm
Ran into some issues and did some more debugging. Check
https://connect.microsoft.com/SQLServer/feedback/details/798675/2008-r2-engine-dbcc-checkdb-fails-against-databases-on-drives-that-dont-have-certain-root-permissions 

BINGO!

The Connect item talks about how the questioner was trying to use minimum permissions for their SQL Server service account and found that it broke CHECKDB for their system databases in the same fashion as what I was seeing.  The agreed-upon fix was to add READ permission (just READ, not WRITE or anything higher) to the root level of the drive - not the folders, but the actual root of the drive.  Most of the respondents talked about adding the permission for ServerName\Users, while one person mentioned just adding it for the service account.

I checked the root directory of the DATA drive (where the data files for both the system and user databases reside) and found that it had been stripped down to Administrators and a couple of system logins (Windows Server 2012 R2).

I added the read permission to the root of the drive for my SQL Server service account, and: 
DBCC CHECKDB(N'master') WITH NO_INFOMSGS

Command(s) completed successfully.

…and all was right with the world.

A subsequent attempt to run the offending Integrity Check job succeeded cleanly as well.

--

This is not the first time I have seen something like this and I always have to question the level of zeal with which some security admins/teams pursue this.  In my experience most shops (by far) have insufficient security, while the remaining shops have way too much, functionally breaking the systems by keeping them so secure that *nobody* can access them.

There are only a few people out there that seem to have that happy balance figured out where data is secure but people can still do their jobs and customers can use the systems.

In general my best experience is not to mess with permissions of the SQL Server account for files/folders/drives that are relevant to SQL Server.  If a folder is in one of the MSSQL hierarchies, let the install set the permissions and then LEAVE THEM ALONE!

Your SQL Server service account doesn’t need to be a Windows administrator, or even have that elevated of general permissions – but if the file or folder is created or managed by SQL Server, leave the service account alone.


As always, my $.02 – hope this helps!