Another case of SCOM databases and log files going down and growing fast

Well, I have spent the last two days fighting SQL to get a customer's SCOM environment running again. A few weeks ago their SCOM environment broke due to the databases going down. It seemed something went wrong with backups, and the log files filled the disk fast. Some buttons were clicked and some things were tried, and in the end I got called in to see what I could do about it.

Basic Environment:

4 management servers, a number of gateways, a load of agents and network devices, and a SQL Always On availability group cluster hosted in Azure.

Problem situation:

When I logged into the environment I found that the two main SCOM databases were still part of the availability group, but one had a sync problem (as in: not syncing) and the other was simply broken. Also, one of the management servers had broken down and had already been removed from the management group. In order to fix things in SCOM, we needed the databases online, synchronized and in the availability group, because the management servers talk to the AG listener.

SQL database work part 1:

The OperationsManager database was still reachable, because the AG was primary on the server where that database was still running; on the other server it was broken. So I started with the data warehouse. First I removed it from the availability group. On one server it refused to come back online, so I removed it there. On the other server I went on to get it online again. From there it needed fresh backups, also in order to flush the gigantic, 100% full log file. This took a while! It was good that I asked a SQL DBA friend of mine for advice before even starting to look at the situation, since I had already heard from the customer what had happened so far (thank you Danny de Haan for your advice, it saved me a load of time in this stage!).
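For anyone facing the same situation, the steps above look roughly like this. A sketch only: the AG name SCOMAG and the backup paths are placeholders for whatever your environment uses.

```sql
-- Run on the primary replica: take the broken database out of the availability group.
-- "SCOMAG" is a placeholder AG name, adjust to your environment.
ALTER AVAILABILITY GROUP [SCOMAG] REMOVE DATABASE [OperationsManagerDW];

-- With the database online again in FULL recovery, a fresh full backup
-- followed by a log backup allows SQL Server to truncate the bloated log file.
BACKUP DATABASE [OperationsManagerDW] TO DISK = N'D:\Backup\OperationsManagerDW.bak';
BACKUP LOG [OperationsManagerDW] TO DISK = N'D:\Backup\OperationsManagerDW.trn';
```

Note that the log backup only marks the space inside the log file as reusable; shrinking the physical file afterwards is a separate step.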

After the backups I checked the disk space and got the database back into the availability group. Next was the OperationsManager database, where I followed the same procedure. Lots of backups and waiting for the initial copy and synchronisation.
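Getting a database back into the AG follows the standard seeding pattern: restore the fresh backups on the secondary with NORECOVERY and then join it. Again a sketch with the same assumed names:

```sql
-- On the secondary replica: restore the fresh backups, leaving the database
-- in a restoring state so it can join the availability group.
RESTORE DATABASE [OperationsManagerDW]
    FROM DISK = N'D:\Backup\OperationsManagerDW.bak' WITH NORECOVERY;
RESTORE LOG [OperationsManagerDW]
    FROM DISK = N'D:\Backup\OperationsManagerDW.trn' WITH NORECOVERY;

-- Join the restored database to the availability group (assumed name "SCOMAG").
ALTER DATABASE [OperationsManagerDW] SET HADR AVAILABILITY GROUP = [SCOMAG];
```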

All of this took a seriously long time. Meanwhile we started setting up fresh backups. The log files, especially of the data warehouse database, were again growing at an alarming rate at this point: about 70 GB an hour on average! Once we got the backups working the way we wanted, we were taking a backup every hour, and each one took 45 minutes to finish and clear the log. This also slowed the machine down in the meantime.
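While the hourly backups were running, a quick way to keep an eye on how fast the logs were filling up again is the classic log space check:

```sql
-- Shows log file size and percentage used for every database on the instance.
DBCC SQLPERF(LOGSPACE);
```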

SQL database work part 2:

So now we had a working SCOM environment. We added the 4th management server back in. But we still had that huge amount of activity going on on our SQL server, which made it impossible to work from the SCOM side of things. So now, finally, came the investigation of what was happening. Where was a log file growth of 2x the size of the whole database PER HOUR coming from? This is not an amount that can be explained by saying that all the management servers, gateways and Windows agents could now flush their caches into the databases again. That would be a lot of data, but not this much! And it had been going on for about 36 hours by now.

We started by putting most of the Windows agents and all of the network devices in maintenance mode for a few days to weeks, so we could stop most of the data flow from SCOM to SQL. This did not slow the growth of the log file, but it would help with investigating the issue, and it would also prevent some runaway management pack from blasting the database full of data, for instance.

On the first page of a Google search I ended up finding my own blog post from 3 years ago, where we also ran into a fast-growing database and log file:
http://www.bictt.com/blogs/bictt.php/2014/10/10/case-of-the-fast-growing

I can tell you, the steps I walked through at that time, after some advice from another SQL friend of mine, Jacky van Hogen, helped me solve the issue back then. So let's try again and see if I can find more information about what is happening, who is doing what, which queries… you get the point.

First I started with the query:

SELECT * FROM sys.dm_exec_requests;

This shows the sessions, to see if I could discover something there.
I thought I had found a session that interested me. Upon further inspection the session ID changed, but it caught my attention again. So maybe this was something stuck in a loop? Let's see.

From the previous blog article I pulled the query that shows what the sessions are doing in the transaction logs, to see if I could find the one using a lot of the log file. And sure enough, it was the ID that had already caught my eye, and it was using a lot of “database_transaction_log_bytes_used”. So now I needed to find out why… what was the query?
Of course at this point the session ID disappeared, so I had to do the same again and found it under a different number.
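For reference, the kind of DMV query that exposes log usage per session looks roughly like this (a sketch built on the standard transaction DMVs, not necessarily the exact query from the old post):

```sql
-- Which sessions hold open transactions, and how much transaction log
-- each of them has written in which database.
SELECT st.session_id,
       DB_NAME(dt.database_id) AS database_name,
       dt.database_transaction_log_bytes_used
FROM sys.dm_tran_database_transactions AS dt
JOIN sys.dm_tran_session_transactions AS st
    ON dt.transaction_id = st.transaction_id
ORDER BY dt.database_transaction_log_bytes_used DESC;
```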

So back to my previous blog post again, where there is a longer SQL query that shows what these session IDs are running as a query, and what their parent query is. This might give you an idea of what is causing it! I ran the query and there we go…
DELETE FROM VeeamBackupMP.SessionTaskStaging WHERE blablabla
and the parent query looked the same.
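A simplified version of that kind of lookup, for a single suspicious session ID, can be sketched like this (the session ID 123 is just a placeholder):

```sql
-- Shows the statement text plus the host machine and login behind a session ID.
SELECT r.session_id,
       s.host_name,
       s.login_name,
       t.text AS running_sql
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id = 123;  -- replace with the session ID you found
```

The host_name and login_name columns are worth including from the start; as you will see below, they answer the "who is doing this" question immediately.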
In my blog article from 4 years ago it was a default SCOM staging table, used to move SCOM alerts around for processing, that got stuck with huge amounts of data. This time the same seemed to be happening in another staging-type table, used by one of the vendor management packs.

Let's see: go and find the SQL queries post from Kevin Holman; the first query there is the one for big tables in our SCOM databases. I ran it against the OperationsManagerDW and bingo… the first entry was this same SessionTaskStaging table. The funny thing was that the number of rows did not seem that big. While I was refreshing and investigating, we noticed the number of rows go from 10 thousand to 40 thousand and back again, and so forth. But the size remained around 17 GB of data! I speculated whether the table was storing whole videos per row or something like that, and of course speculating on a Friday afternoon with a few people gets funny.
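If you do not have that post at hand, a rough equivalent of such a "largest tables" check can be built from the catalog views. A sketch, not Kevin's exact query:

```sql
-- Largest tables by allocated space, with their row counts.
SELECT t.name AS table_name,
       SUM(p.rows) AS row_count,
       SUM(a.total_pages) * 8 / 1024 AS total_mb
FROM sys.tables AS t
JOIN sys.partitions AS p
    ON p.object_id = t.object_id AND p.index_id IN (0, 1)
JOIN sys.allocation_units AS a
    ON a.container_id = p.partition_id
GROUP BY t.name
ORDER BY total_mb DESC;
```

A table with a modest row count but a huge allocated size, like the 17 GB seen here, is exactly the pattern this query makes visible.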

So something was running a DELETE statement again and again, probably failing and starting over within only a few minutes, and in the meantime blasting the log file full at a very high rate, 2 GB per minute by that time! I decided to quickly ring Danny again for a word to get me pointed in the right direction. His first question was which server and which account were running this delete statement. Stupid me for not scrolling further to the right in my query results, because there it was. Thanks mate, a few seconds of question and advice got me going again.

In the meantime the customer's SCOM admin was already removing the Veeam Backup management pack and cleaning up the dependency references found everywhere. Sigh. He was done, yet the same queries kept coming again and again. So I found the queries and sessions again, and the server running the query turned out to be one of the management servers. Going onto that machine, I wanted to see if it still had the management packs. No packs of that name on the local disk anymore. Probably stuck in memory then. I restarted the SCOM service, and yes! The query was stopped now. It needed the management pack removed AND the workflows killed from memory.

From here I truncated that big table, since we did not need the contents anyway. The SQL guy told me that the TRUNCATE command is only minimally logged, so it would not blow the log file up all over again (that was my fear). Good.
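The truncate itself is a one-liner, using the table name from the DELETE statement we caught earlier. Only do this when you are sure the staging contents are disposable:

```sql
-- TRUNCATE deallocates whole pages instead of logging each deleted row,
-- so it barely touches the transaction log compared to a DELETE.
TRUNCATE TABLE VeeamBackupMP.SessionTaskStaging;
```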

So now it was a matter of waiting for the next log file backup to run and finish. And the one after that.

Sure enough, there was already a big difference. Now I could also remove some old backup files from the backup disk, which had been filling up with huge backups all the time.

The next morning the backups were smaller, including the full backups. Upon inspection, the data grooming in the OpsDB and OpsDW seemed to have worked through its usual workload. We seemed to be back in business. Now the SCOM admin will bring the agents and network devices back out of maintenance mode in small groups, to avoid a huge config churn from all the state changes at once. Small steps.

I am so glad that I wrote that article back in the day, and that we can sometimes ask the advice of a specialist in an area that we touch upon a lot but are simply not experts in. A few minutes of their time goes a long way towards pushing you in the right direction, as was now proven for the second time. Teamwork rules!

Happy monitoring!
Bob Cornelissen