|« Microsoft Ignite||SCOM 2012 SP1 to R2 failed and how to fix failed management servers »|
Case of the fast growing SCOM datawarehouse db and logsSCOM, SQL, System Center, SCOM Tricks, SCOM 2012
This is a post in continuation of my previous post on the topic of upgrading a SCOM 2012 SP1 to a SCOM 2012 R2, which went wrong and how I was fixing it. After fixing the SCOM instance everything seemed alright for half a day. The very next morning we however saw something was wrong with the SCOM datawarehouse database. Below are some of my lessons learned and some SQL stuff It is all a long story. So sorry. I put it all down because it was a learning experience and it just happened to unfold this way. And also here there are no screenshots available anymore so will have to go for a load of tekst.
So what happened?
First message we got was that the log file beloning to the datawarehouse database had grown to the size of the disk and the disk and log file were full. Second message was that the database was in recovery mode currently.
So I went into SQL Management Studio on the machine and right there it said that the database was in recovery mode. Nothing else could be done with it. No reports to be run (I like the Disk Space report in SQL). No properties of the database to be opened. Looking at the eventvieuwer Application log I saw event 9002 "The transaction log for database 'OperationsManagerDW' is full due to 'ACTIVE_TRANSACTION'." and event 3619 "Could not write a checkpoint record in database OperationsManagerDW because the log is out of space". So next on the agenda was to first increase the size of the disk the log file was sitting on with 5 GB.
This prompted the log file to start growing again in its set increments and it filled the increased space right up. Hmmm ALright, lets do that again with 10 GB more space. And I even found an old file sitting on the disk of 5 GB which didnt belong there. SO it had 15 GB to work with.
Log file growing again. Database was out of recovery mode I thought. Great. But wait. Log file growing still. End of disk. Recovery mode turned on again! Ahhhhh.
Database in Recovery mode
When the database is recovering it will log Application log entry 3450 with a decription like this:
"Recovery of database 'OperationsManagerDW' (8) is 0%% complete (approximately 71975 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required."
It needs to get through all three phases of the recovery and in my case the 20 hour estimate ended up being close enough. So best would be to let this recovery job finish and clear my log file! Gave the disk 150 GB more space to work with. And recovery was underway for the next 20 hours and meanwhile the log was growing nicely and I could not play with the datawarehouse database in any way. So lets wait.
Now the thing you need to know about log files in SQL databases is there are three kinds of Recovery Model for a database. Full or Simple or BulkLogged. Open up a SQL management studio, go to a database. Right click Properties. Go to the Options tab and on the second line it states the Recovery Model. When it is set to Full the log file will fill with all transactions happening on the database and it will stay there until you make a backup of the database. The backup will flush the log file at the end and you have an empty log file again. If it is set to Simple mode the basic method is that a transaction happens on the database and it written to the database and at the end of the transaction it cleans the log file. This is the simple explanation mind you, however it is a bit more complicated as I discovered. Will come back later.
The SCOM databases are set to Simple mode Always. But because the log file was growing so fast I was starting to believe something went wrong with that. Right click properties on the database. Error. Sigh. Now it starts to be clear to me that I am a clicking kind of guy. I am used to clicking my way through options and settings in most applications. And now I needed some SQL queries.
Asking for advice!
At this point I called a friend of mine who knows all about SQL, Jacky van Hogen. For some advice on how to find out what was happening and for some guidance on how to proceed when the recovery was finished. I think I actually waited for the recovery to finish first. At that point I still could not open the properties of the database, but as we found out we could simply run queries against it Thank you so much for handing me some queries
First a check on the recovery mode and on the status of the databases:
select * from sys.databases
Gives you a list of the database and one of the columns tells you if it is online and another one tells you recovery model. Database was now online and recovery model Simple.
Next we want to see what is going on. So directed this query against the OperationsManagerDW database:
select * from sys.dm_exec_requests
Check the sessions with an ID above 50.
You will find your own query requesting these data as well using your own session ID. I did not know this but at the top bar of your SQL Management Studio is your query session ID between brackets for each query session. Nice, so ignore that session and focus on the rest with numbers above 50.
Sure enough there were a few of those. A few were waiting for a session 117 in my case and it was an INSERT statement. Over the hours that followed I saw it go from Running to Suspended to Runnable to Running again all the time. And meanwhile the log file was still growing.
Now lets look at the oldest open transaction:
Sure enough, there was session ID 117 and it was running since a number of hours by that time. Actually after recovery was succesful I had restarted SQL Services, hoping that would make it stop its transactions and flush the log file. Doesn't quite work that easy But at least we could see what happened.
Few things I learned during a short discussion with Jacky were:
- This oldest open transaction needs to finish its work first. Next it will clean the log file. Any jobs running alongside will also write to the log file and these will not be cleaned out. This is because there might be a chance that the currently running transaction might still need those data in case we need to stop that transaction and it will do a rollback. Aha, so there is slightly more to the Simple recovery model than I thought.
- We can kill the transaction simply with the command Kill 117. It will do a rollback of all its actions and in the end clear the log file. Or it should. This takes a while for something that has filled up over 200 GB in log space by now. However there is the biggest chance that the job will just start again and from the start and take the same amount of data again and more.
- Best thing in this case would be to give it the space it needs and let it finish its work. After that shrink the log file and clean up.
So we decided to give it some time and meanwhile keep an eye on it.
Checking the log file
She asked me to check out the log file contents.
Now I might be saying this wrong or use the wrong terms but this basically gives you the Virtual log files within the log file (or log files) for that database. I think I will just see it as the pages in a book. You can have 1 or more log files (books) for a database. Each log file has a number of pages to fill up with data. Normally the log file writes sequentially (so it writes like in a book from beginning towards the end), but when bits and pieces get cleared out it could be that most of the log file is empty but still has some pages written at the end. This is usually the reason why sometimes you can not shrink a log file on the first try. It will clean out pages and find it can not clean one in the end so it can not make the log file smaller. Repeating it a few times makes the last change jump to the front of the log file again and the end becomes cleaned up, so we can shrink the log file. By the way we can write a checkpoint in the log by simply giving the command Checkpoint in a query.
Well in my case we were looking first at if the pages were all in use. Check the Status column. If it says 0 it is empty and if it sas 2 it is full with data. In my case most of the file was full. SO not much to gain by trying to move daat around inside the log file and shrinking it because the transaction had it all clearly in use.
Also we found there were way too many virtual log files (pages in my example) in the log file. Probably caused by the many auto-grow events. An interesting article forwarded to me by Jacky is http://blogs.msdn.com/b/saponsqlserver/archive/2012/02/22/too-many-virtual-log-files-vlfs-can-cause-slow-database-recovery.aspx
Watching the transaction do its work
Also interesting was to see how the transaction 117 went through the whole Running - Suspended - Runnable - Running status changes while running the "select * from sys.dm_exec_requests" command. This was due to the autogrow of the log file each time among others. Waiting for the file extension to be created, waiting for the disk (thats while it was suspended) and next it will go to the runnable status and waits for open threads to get processor time and jump to running status. Again this is the short and simple way of saying it I guess.
Also Jacky sent me a query to check if this transaction was using so much of the log space:
select * from sys.dm_exec_requests r
on t.transaction_id = r.transaction_id
and t.database_id = r.database_id
And check for the field database_transaction_log_bytes_used .
Sure enough it was transaction 117 using a few hundred GB of log file space.
Creating additional log files
Another thing which worried me was if I could keep expanding the log disk like that. There will come an end to the storage lun at some point right? So alternative would be to create additional log files for this database on other disks. Go to the database and right click to open the properties and to add a log file right? Wrong, could not open properties of the database still at this point, so had to use the TSQL again for it. I had done this once before for another customer. http://technet.microsoft.com/en-us/library/bb522469(v=sql.105).aspx
AN example perhaps:
ALTER DATABASE OperationsManagerDW
ADD LOG FILE
Name = OpsDW,
FILENAME = 'D:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\opsdwlog2.ldf',
SIZE = 100MB,
MAXSIZE = 10000MB,
FILEGROWTH = 100MB
And yes in hindsight I should have ignored the autogrow setting and just made it fixed. It would be a temporary file anyway. In the end I could add space to the disk where the big log file resided anyway.
Give up your secret mister transaction
All of this was really bugging me and I was trying to figure things out as they came along. So I went out and tried to find out more about the query which was running. Our illusive number 117. What are you doing mister 117?
I found a query somewhere on the internet. Sorry, I did not record where I found it. It is an extention of the command I used before to check what it was dong. I will paste it below:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
er.session_Id AS [Spid]
, DATEDIFF(SS,er.start_time,GETDATE()) as [Age Seconds]
, SUBSTRING (qt.text, (er.statement_start_offset/2) + 1,
((CASE WHEN er.statement_end_offset = -1
THEN LEN(CONVERT(NVARCHAR(MAX), qt.text)) * 2
END - er.statement_start_offset)/2) + 1) AS [Individual Query]
, qt.text AS [Parent Query]
FROM sys.dm_exec_requests er
INNER JOIN sys.sysprocesses sp ON er.session_id = sp.spid
CROSS APPLY sys.dm_exec_sql_text(er.sql_handle)as qt
WHERE session_Id > 50
AND session_Id NOT IN (@@SPID)
ORDER BY session_Id, ecid
Alright, so this one uses the same kinds of commands, filters out only sessions above 50 and gives some info on the query running and the parent query. Now what I saw in both cases was Alert Staging.
Now lets get back to SCOM again, because that sounds familiar!
The basic way that SCOM works with this stuff is that the Agents send data to the management server. The management server inserts data into the Datawarehouse database. Next the data must be aggregated. It does this into a number of staging tables. There are staging tables for Alert, performance, event and state. Next inside the SCOM workflows there are some rules which belong to SCOM itself which kick off stored procedures in the datawarehouse. These process the data from/through the staging tables and put it in the historical tables and do some stuff with it and next clean the staging area up again.
So what this seemed to be telling me is that one of the management servers kicked off a rule which kicked off the stored procedure to handle the alerts in or through the alertstaging table. By the way I could see which of the management servers this was as well. But these kind of jobs run often and only have a rows of data to work through! Lets have a look.
SELECT count(*) from Alert.AlertStage
Uhmmm, 300 milion rows ?!?!?!
Now thats an alert storm of some kind and probably programatically. As there is no way we have that many alerts in a short amount of time. Now I know that this transaction will never get through this amount of data and what is the point? These are not normal alerts and there is no value in retaining them. Somebody wrote a nice blog post about another issue where the data from the alert staging was not written to the normal tables: http://bsuresh1.wordpress.com/2014/03/18/alert-data-is-not-being-inserted-into-scom-data-warehouse/. In my case I was not interested in the alerts, so I did not go for temporarily moving them to another table and running through them manually with that stored procedure.
I opted to clean out the table.
TRUNCATE TABLE Alert.AlertStage
Sit back and wait for 300 milion rows to be removed. I had the hope that once the transaction 117 realized there was no more data to process that it was done and ready and could clean up after itself and thus the log file. Guess I was not that Lucky because 2 hours afterwards it was still running. So I was done with this and killed session 117 (Kill 117). Of course this caused a rollback and a few hours later...
Next run a check on the four staging tables:
SELECT count(*) from Alert.AlertStage
SELECT count(*) from Event.EventStage
SELECT count(*) from Perf.PerformanceStage
SELECT count(*) from State.StateStage
All normal with low numbers.
It also cleaned up a few hundred GB of space in the database itself
And from this point I could finally open the properties if the database too!
So ran the task of shrinking the file. Right clik the database - task - shrink - file.
Make sure you select the log file of course! it shows the amount of available empty space inside (which was a lot). SO went ahead and shrunk it down. If it doesnt work open a query window and type the command "checkpoint" and try it again. This can be the case when there is something still writing around the end of the file. As soon as it wraps and starts writing at the start of the file the end will be clear and shrinking will work.
So what was the cause of all this mess? I can't say, but we saw this happening within hours after the bodged update of SCOM from 2012 SP1 to 2012 R2 and UR3. And yes I did run those SQL scripts belonging to UR3 as well. It is possible that the upgrade wizard which killed my first management server and the operational database might have touched the datawarehouse as well before it even reached that step in the wizard? I do not know. Next time just go for sure and restore not only the operational database but also the datawarehouse database. Even if it is over a TB in size. All the stuff that happened in the story described above took a load of time as well.
But it was a nice learning experience on some points as well.
Just happy that the whole system is running well again.
A big thanks to Jacky van Hogen for her advice on the SQL pieces over the phone on her lunch break! Just a few minutes of good advice and some pointers in the right direction from an expert in her field was such a big time saver and reduces stress.
It also sparked an idea, which I will get back to later.