Last night Microsoft released new update rollups for SCOM: one for SCOM 2012 SP1 and one for SCOM 2012 R2. I will list them below.
For SCOM 2012 SP1 this is Update Rollup 9, described in KB3023167. The description does not include much at the moment except the mention of Operational Insights and support for SUSE Linux 12. The article can be found here: http://support.microsoft.com/kb/3023167
For SCOM 2012 R2 this is Update Rollup 5, described in KB3023138. The description lists several items, and the Unix/Linux management packs get a separate update as well. The article can be found here: http://support.microsoft.com/kb/3023138
Items in the UR5 rollup for SCOM 2012 R2:
- MonitoringHost process crashes because of bind failures against Active Directory
- RunAs accounts cannot be edited because of "Specified cast is invalid" exception
- Application crashes when a search is finished without filter criteria on Distributed application designer
- MonitoringHost crashes with the exception - System.OverflowException: Value was either too large or too small for an Int32
- Support to troubleshoot PowerShell scripts
- Operational Insights through Operations Manager
- "Reset the Baseline," "Pause the baseline," and "Resume the baseline" tasks fail when they are run against an Optimized performance collection rule
- An exception occurs when you edit subscriptions that contain deleted monitors or objects
- You can't set Widget column width
- Event 4506: Data was dropped because of too much outstanding data
- Support for handling Datawarehouse time-outs
- Can't continue scan with NOLOCK because of data movement
- $ScriptContext.Context does not persist the value in PowerShell widgets
- Evaluation version alert
And the Unix/Linux pack:
- Updating the SCX Agent causes deep monitoring status for the JEE Application Server to be reset.
- By default, the Rpcimap monitor for Red Hat Enterprise Linux 5 is now disabled.
- Monitoring in UTC +13 causes Unix Process Monitoring template to fail.
- Using scxadmin -log-rotate causes logging to stop after log rotation.
As always, please test the update rollups, the supporting management packs and any newly released management packs in a test environment first!
Alerts may not get forwarded as expected via a connector in Operations Manager
When using connectors to forward alerts in System Center Operations Manager 2007 (OpsMgr 2007) or System Center 2012 Operations Manager (OpsMgr 2012, SP1 and OpsMgr 2012 R2), in certain situations such as an alert storm (defined as a large number of alerts being generated in a very short period of time) there may be alerts that are not forwarded via a connector. When this occurs, these alerts will never be forwarded and will remain in a "New" state.
Cause and potential fix can be found in the KB article here:
Want to know about Monitoring SQL with SCOM and Operational Insights?
Then you have to check out this webinar hosted by the WMUG NL user group and presented by fellow MVP and friend Simon Skinner. Mark your calendars for 28 January 2015 at 20:00 CET (GMT+1). For more information and to register for this webinar please follow this link, and do not be shy to spread the word:
All are welcome to join!
Last week we got a fresh database server at a customer site to put the SCOM and Orchestrator databases on. Actually two servers in an AlwaysOn setup. Should be a lot better and faster than the "temporary" servers we had been using before.
So, last week we started with the Orchestrator database, simply because it is smaller and because there are fewer people looking in that console.
We had a few lessons learned and one of them is listed as the subject of this post. But I will go through it in a few steps. Don't try to follow the steps exactly as described, because I am listing them in the order it happened... So lessons learned are in between!
First of all I used the following page on TechNet:
How to Change the Orchestrator Database
This page talks about what to do when moving the database and it looked really easy.
So I made a backup of the Orchestrator database and restored it on one of the two new nodes. Looked nice. Next was adding it to the AlwaysOn availability group. There is a wizard for that, so I clicked through that thing. It needed a share that both servers could reach.
After investigation it turned out that the service account the SQL database engine is running under needs access to the share from both machines, not my own account. Lesson learned.
I ran through the wizard again to put the database in the availability group. What it does is create backups of the database and the transaction log, move them to the second server, restore them there and get the two copies in sync. Then it places the database in the availability group.
How is this possible? The backups seemed to have worked. However, the second database was somehow stuck in a restoring state. It looked like the synchronization did not start. We had our DBA check this with us and we found that it had to be connection related. So we checked the firewall settings and it turned out that the endpoint ports used for synchronization in the cluster were not listed there. We added the firewall ports and checked that the one for SQL (1433) was there as well. Lesson learned.
We turned back, removed the second copy and the entry in the availability group, and tried again. This time it worked.
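A quick way to see which port the firewall has to allow is to ask each replica for its database mirroring endpoint, which is what AlwaysOn uses for synchronization. The queries below are a sketch using only standard catalog views and DMVs. The port 5022 mentioned in the comment is the value commonly used for this endpoint, not something confirmed for this environment.

```sql
-- Run on every replica: shows the endpoint AlwaysOn uses to ship log records.
-- The port listed here (often 5022, but check) must be open between the replicas.
SELECT e.name,
       e.state_desc,      -- should be STARTED
       e.role_desc,
       te.port
FROM sys.database_mirroring_endpoints AS e
JOIN sys.tcp_endpoints AS te
  ON te.endpoint_id = e.endpoint_id;

-- Once the database has joined, this shows whether synchronization really started.
SELECT DB_NAME(drs.database_id) AS database_name,
       ar.replica_server_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id;
```

A secondary that stays in a restoring state with a synchronization state of NOT SYNCHRONIZING usually points at connectivity between the endpoints rather than at the backups.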
By the way, if you copy the database to the secondary server manually to make it highly available, do not forget to restore not only the full backup of the database but ALSO a transaction log backup on the secondary. Otherwise it will not work.
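For the manual route, the order of the steps could look roughly like the sketch below. The share path \\fileserver\agseed and the availability group name AG_SystemCenter are made up for illustration, and the database name Orchestrator is assumed to be the default one. If the drive layout differs between the servers, the RESTORE DATABASE statement also needs MOVE clauses.

```sql
-- 1. On the primary replica: full backup plus a log backup
--    (the database must use the FULL recovery model).
BACKUP DATABASE [Orchestrator]
    TO DISK = N'\\fileserver\agseed\Orchestrator_full.bak' WITH INIT;
BACKUP LOG [Orchestrator]
    TO DISK = N'\\fileserver\agseed\Orchestrator_log.trn' WITH INIT;

-- 2. Still on the primary: add the database to the availability group.
ALTER AVAILABILITY GROUP [AG_SystemCenter] ADD DATABASE [Orchestrator];

-- 3. On the secondary replica: restore both backups WITH NORECOVERY,
--    so the copy stays in the restoring state and can accept log records.
RESTORE DATABASE [Orchestrator]
    FROM DISK = N'\\fileserver\agseed\Orchestrator_full.bak' WITH NORECOVERY;
RESTORE LOG [Orchestrator]
    FROM DISK = N'\\fileserver\agseed\Orchestrator_log.trn' WITH NORECOVERY;

-- 4. On the secondary replica: join the restored copy to the group.
ALTER DATABASE [Orchestrator] SET HADR AVAILABILITY GROUP = [AG_SystemCenter];
```

The NORECOVERY option is the important part: a copy that has been brought online with RECOVERY can no longer be joined to the group.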
Alright, now the database was running and highly available. We tried a failover to the other node and that worked.
Meanwhile we had Orchestrator itself stopped.
So I ran the steps in the TechNet article and started the services again. Started the Runbook Designer.
The following error appeared:
The license for System Center 2012 Orchestrator has expired or is invalid. Enter Product Key.
It asks me to enter the Orchestrator license. Huh? Actually when doing that and pressing Enter it also did not want to continue.
This was very irritating. In the end I think we managed to cancel through it and get to the runbooks, most of which would not start. Or would start and within 5 seconds stop again.
Checking for the errors in the Runbook Designer, it turned out that connections could not be made to, for instance, the SCOM server.
We investigated and finally I found this article on the TechNet site:
Migrate Orchestrator Between Environments
Among the very first things I saw was the first step involved in bold:
Back up SQL Server service master key in environment A
Now at first I did not know what this was or what it does, but I am pretty sure this is what our problem relates to!
It did irritate me that those TechNet pages did not reference each other though! Also as you will see they do not take into account what happens when you have multiple target servers such as when using Always On functionality.
First thing I did was think that it was too late already to get that thing moved over. So I went ahead and created a fresh key. Next thing to do was enter the encrypted info again. So go find the connection settings for connecting to SCOM and e-mail and SCCM and so on and re-enter passwords. I opened up all runbooks and re-selected those connections to be sure. Also the System Center license key.
Restart the Orchestrator management service and we were back up and running.
However, hold on!
A little later this went wrong again and it was due to a manual fail over of the availability group the database was living in. And again we had runbooks failing and again the enter your license key error!
Turns out I missed something important there!
Alright this is how it works:
- The data you enter, like passwords and so on, is stored encrypted in the Orchestrator database.
- A key from the database is used for this.
- However, this key can only be read or decrypted using the SQL Server service master key!
Yeah, and this service master key is not inside the database but is stored in the SQL system tables. Meaning it is related to the instance and not the database. Now in a normal failover cluster we would not have seen Orchestrator go nuts a second time. However, in this AlwaysOn cluster there are two machines, each with their own SQL instance. The databases are kept in sync, but there are two instances!
If the service master key is not the same (which it isn't), the data cannot be decrypted after a failover. So I should have used the key from the original server in the first place, or just have replaced it after seeing the error. I did not know it could still be done afterwards. And also did not know that in this case
So what to do:
Open up SQL Management Studio and run this command to check the master key:
SELECT * FROM sys.symmetric_keys
(By the way, you see here that it is not database related but instance related, because it is asking the master database for this info, and of course it is taking it from the system tables.)
You will see the service master key and the GUID belonging to that key.
And of course this was different between the two servers and would have been different from the original database server as well.
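To make the comparison between the servers easier, the check can be narrowed down to just the service master key row. This is a small sketch using the standard sys.symmetric_keys catalog view. Run it on each instance and put the key_guid values side by side.

```sql
-- Run on each SQL instance and compare the key_guid values.
-- The service master key lives in master and always has this fixed name.
SELECT name,
       key_guid,
       create_date,
       modify_date
FROM master.sys.symmetric_keys
WHERE name = N'##MS_ServiceMasterKey##';
```

If the GUIDs differ between the replicas, data encrypted under one instance's key hierarchy cannot be opened automatically on the other, which matches the behaviour seen after the failover.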
So to fix it we need to create a backup of the first key and restore it onto the second server.
First create a directory on the first SQL server to store the backup of the key in. Next run a backup command to backup the key. I have changed the password a bit in this post.
BACKUP SERVICE MASTER KEY
TO FILE = 'C:\backup\service_master_key'
ENCRYPTION BY PASSWORD = '3dH85Hhk004GHk2597ghefj5';
Next I created the same directory on the second server and copied the key backup file to it. Open a SQL query on the second server and import the key:
RESTORE SERVICE MASTER KEY
FROM FILE = 'C:\backup\service_master_key'
DECRYPTION BY PASSWORD = '3dH85Hhk004GHk2597ghefj5' FORCE
By the way this last command uses the FORCE option because it will find a database which is already using encrypted data, so we need to force it to use this master key. Make sure you take the key from the working machine and put it on the not working machine though!
Now we check the master key again and see that it is the same on both servers. I restarted the SQL Server service.
Then I restarted the Orchestrator services on the Orchestrator management server and runbook servers.
So there were a few lessons learned. Sure wish I had read the second article earlier as it could have prevented part of this.
A few of these lessons learned came back when we did the SCOM operational database, but that is a story for next time.
Hi all, I am back from my trip to the states for the MVP summit and the MMS. It was great to see everybody again and hear a lot of stuff. Totally worth the trip.
Last few days I had some issues with this blog. Basically it is running PHP-based blogging software with a MySQL backend. Something went wrong with MySQL and it wasn't easy to fix. Somehow it is working again, but I do not trust it enough. This means I will be transferring the whole blog to a freshly installed server and trying to move the software and databases over, so everything remains the same. Probably with higher versions of both software parts and of course also a higher Windows version and so on. If it is down for a number of hours or even a day, it will probably be because I'm moving it at that time.
There was somebody who sent me an email about a page he was trying to get while it was down. That's great! Feel free to do so if that happens.
Since Microsoft stopped their Microsoft Management Summit and integrated it into their TechEd it still felt like a void was left there in the System Center space. Of course the MMS in the old days started as a community thing and in the end it ended up as a Microsoft conference with some community input. And of course now it was gone.
Well, the community decided to step back in! (Yeah, I better not include too many smileys here.)
User groups, MVPs and other community leaders and experts are jumping in, and now the Midwest Management Summit (so also MMS for short) has been born. It is filled with great speakers, many from the community and MVP sides, from all over the world. It is gearing up to be a great event, specialized in the field of System Center and also its connections with PowerShell, clouds and so on.
Have a look at this great lineup of sessions in the schedule: http://mms2014.sched.org/
Next I can inform you that I will also be speaking there in two sessions, along with my long-time friend, SCOM specialist and MVP Cameron Fuller.
Check out our sessions here: http://mms2014.sched.org/speaker/bob.cornelissen
I hope to see you there. If you are attending the MMS, please feel free to catch me there and have a chat! That is also what these community based events are all about. I am very much looking forward to meeting as many of you as possible during the upcoming events in November. I will be attending the MVP Summit and this MMS of course!
Today I investigated a case where SCOM had an alert with the following name and contents:
Data Warehouse configuration synchronization process failed to write data
Data Warehouse configuration synchronization process failed to write data to the Data Warehouse database. Cannot store data in the data warehouse.
Exception SqlException: Sql execution failed. Error 2627, Level 14, State 1, Procedure ManagementPackInstall, Line 2879, Message: Violation of UNIQUE KEY constraint 'UN_ManagementGroupManagementPackVersion_ManagementGroupRowIdManagementPackVersionRowId'. Cannot insert duplicate key in object 'dbo.ManagementGroupManagementPackVersion'. The duplicate key value is (1, 2020, Jun 18 2014 3:15PM).
There are event log entries in the Operations Manager event log with ID 31552 with the same kind of contents.
Now I want to give a big shout out to the guys in the SCOM Support team, who are writing Support Tip entries on their blog for common issues and solutions. I found their Support Tip quickly and the content is very clear on how it probably happened, what not to do next time and how to solve the issue at hand.
And yes, in my case we did decide during some earlier problems not to restore the DW database, because it was huge and so on. And yes, that probably was the cause. It was during an upgrade of SCOM: the upgrade wizard failed, killed the management server and touched the operational database, so we decided to restore the OpsDB, because, well, the wizard wouldn't have touched the data warehouse database yet, right? Wrong, it does. So lesson learned for sure: when restoring one database, restore the other one to the same point in time as well.
So I ran the SQL script provided, pasted the correct key value pair into it and ran it. Sure enough it was the Notifications Internal Library. Exported the pack, increased the version number. Imported the pack. And a few minutes later the 31554 event popped up in the event log.
Thanks again to the SCOM Engineering Blog and the escalation engineers behind it for publishing these kinds of support tips.
Just now Microsoft announced its wave of main events for next year. That includes the expected event which brings together several tech events, like TechEd, the Management Summit and the Exchange/Lync/SharePoint/Project conferences, into one big event. Well, here it is, and it is called Microsoft Ignite. It is scheduled for May 4 to May 8 in Chicago. Read more about it on this page:
This page also lists some of the other conferences in the year like Convergence, Build and WPC.
This post continues my previous post about upgrading SCOM 2012 SP1 to SCOM 2012 R2, which went wrong, and how I fixed it. After fixing the SCOM instance everything seemed alright for half a day. The very next morning, however, we saw something was wrong with the SCOM data warehouse database. Below are some of my lessons learned and some SQL stuff. It is all a long story. So sorry. I put it all down because it was a learning experience and it just happened to unfold this way. And here too there are no screenshots available anymore, so we will have to go for a load of text.
So what happened?
The first message we got was that the log file belonging to the data warehouse database had grown to the size of the disk, so the disk and log file were full. The second message was that the database was currently in recovery mode.
So I went into SQL Management Studio on the machine and right there it said that the database was in recovery mode. Nothing else could be done with it. No reports to be run (I like the Disk Space report in SQL). No properties of the database to be opened. Looking at the Event Viewer Application log I saw event 9002 "The transaction log for database 'OperationsManagerDW' is full due to 'ACTIVE_TRANSACTION'." and event 3619 "Could not write a checkpoint record in database OperationsManagerDW because the log is out of space". So next on the agenda was to increase the size of the disk the log file was sitting on by 5 GB.
This prompted the log file to start growing again in its set increments and it filled the increased space right up. Hmmm. Alright, let's do that again with 10 GB more space. And I even found an old 5 GB file sitting on the disk which didn't belong there. So it had 15 GB to work with.
Log file growing again. Database was out of recovery mode I thought. Great. But wait. Log file growing still. End of disk. Recovery mode turned on again! Ahhhhh.
Database in Recovery mode
When the database is recovering it will log Application log entry 3450 with a description like this:
"Recovery of database 'OperationsManagerDW' (8) is 0% complete (approximately 71975 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required."
It needs to get through all three phases of the recovery and in my case the 20 hour estimate ended up being close enough. So the best option was to let this recovery finish and clear my log file! I gave the disk 150 GB more space to work with. Recovery was underway for the next 20 hours; meanwhile the log was growing nicely and I could not play with the data warehouse database in any way. So let's wait.
Now the thing you need to know about log files in SQL databases is that there are three kinds of recovery model for a database: Full, Simple or Bulk-logged. Open up SQL Management Studio, go to a database, right-click and choose Properties. Go to the Options tab and on the second line it states the Recovery Model. When it is set to Full, the log file will fill with all transactions happening on the database and they will stay there until you make a transaction log backup (a full database backup alone does not clear the log). The log backup marks the backed-up part of the log as reusable, so the log file has room again. If it is set to Simple, the basic idea is that a transaction happens on the database, gets written to the database, and once the transaction has ended its part of the log can be cleared at the next checkpoint. This is the simple explanation mind you; it is a bit more complicated, as I discovered. I will come back to that later.
The SCOM databases are always set to Simple. But because the log file was growing so fast I was starting to believe something had gone wrong with that. Right-click, Properties on the database. Error. Sigh. Now it starts to become clear to me that I am a clicking kind of guy. I am used to clicking my way through options and settings in most applications. And now I needed some SQL queries.
Asking for advice!
At this point I called a friend of mine who knows all about SQL, Jacky van Hogen, for some advice on how to find out what was happening and for some guidance on how to proceed when the recovery was finished. I think I actually waited for the recovery to finish first. At that point I still could not open the properties of the database, but as we found out we could simply run queries against it. Thank you so much for handing me some queries!
First a check on the recovery mode and on the status of the databases:
select * from sys.databases
This gives you a list of the databases; one of the columns tells you if a database is online and another one tells you its recovery model. The database was now online and the recovery model was Simple.
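Since sys.databases returns a lot of columns, a narrowed-down variant may be easier to read. The sketch below picks the columns that matter for this story and adds log_reuse_wait_desc, which states why the log cannot be cleared right now. DBCC SQLPERF (LOGSPACE) is a standard command that shows how full each log file is.

```sql
-- State, recovery model and the reason the log cannot be reused yet.
-- With a long-running transaction, log_reuse_wait_desc would be expected
-- to show ACTIVE_TRANSACTION, matching the text of event 9002.
SELECT name,
       state_desc,
       recovery_model_desc,
       log_reuse_wait_desc
FROM sys.databases
WHERE name = N'OperationsManagerDW';

-- Log size and percentage in use for every database on the instance.
DBCC SQLPERF (LOGSPACE);
```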
Next we want to see what is going on. So directed this query against the OperationsManagerDW database:
select * from sys.dm_exec_requests
Check the sessions with an ID above 50.
You will find your own query requesting these data as well using your own session ID. I did not know this but at the top bar of your SQL Management Studio is your query session ID between brackets for each query session. Nice, so ignore that session and focus on the rest with numbers above 50.
Sure enough there were a few of those. A few were waiting for a session 117 in my case and it was an INSERT statement. Over the hours that followed I saw it go from Running to Suspended to Runnable to Running again all the time. And meanwhile the log file was still growing.
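To avoid scanning the full output by eye, the same DMV can be filtered down to the user sessions working in the data warehouse database. This is a sketch of one way to do it. All columns are standard columns of sys.dm_exec_requests.

```sql
-- User requests (session IDs above 50) in the data warehouse database,
-- excluding this query's own session.
SELECT session_id,
       status,               -- running / runnable / suspended
       command,              -- e.g. INSERT
       blocking_session_id,  -- who this session is waiting for, 0 if nobody
       wait_type,
       wait_time,            -- milliseconds spent in the current wait
       start_time,
       total_elapsed_time    -- milliseconds since the request started
FROM sys.dm_exec_requests
WHERE session_id > 50
  AND session_id <> @@SPID
  AND database_id = DB_ID(N'OperationsManagerDW')
ORDER BY start_time;
```

Sessions that list the same blocking_session_id are the ones queued up behind it, which is how the waiters on session 117 show up.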
Now let's look at the oldest open transaction:
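The command itself did not survive in this post. The standard way to ask SQL Server for the oldest open transaction in a database is DBCC OPENTRAN, so that is presumably what was used here. Treat the lines below as a reconstruction, not as the exact query that was run.

```sql
-- Reports the oldest active transaction in the database,
-- including its session ID (SPID) and start time.
DBCC OPENTRAN (N'OperationsManagerDW');

-- The same information as a result set instead of message text.
DBCC OPENTRAN (N'OperationsManagerDW') WITH TABLERESULTS, NO_INFOMSGS;
```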
Sure enough, there was session ID 117 and it had been running for a number of hours by that time. Actually, after the recovery was successful I had restarted the SQL services, hoping that would make it stop its transactions and flush the log file. It doesn't quite work that easily. But at least we could see what happened.
Few things I learned during a short discussion with Jacky were:
- This oldest open transaction needs to finish its work first. Only then will the log file be cleared. Any jobs running alongside also write to the log file, and their entries will not be cleaned out either, because the log can only be cleared up to the start of the oldest open transaction, which may still need it if we have to stop that transaction and it does a rollback. Aha, so there is slightly more to the Simple recovery model than I thought.
- We can kill the transaction simply with the command KILL 117. It will roll back all its actions and in the end clear the log file. Or it should. This takes a while for something that has filled up over 200 GB of log space by now. However, chances are that the job will just start again from the beginning, take the same amount of data again, and more.
- Best thing in this case would be to give it the space it needs and let it finish its work. After that shrink the log file and clean up.
So we decided to give it some time and meanwhile keep an eye on it.
Checking the log file
She asked me to check out the log file contents.
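The command is missing from this post. Given the description that follows (virtual log files and a Status column showing 0 or 2), it was presumably DBCC LOGINFO, an undocumented but widely used command. The sketch below is a reconstruction under that assumption. It returns one row per virtual log file, so the number of rows is also the number of virtual log files.

```sql
USE OperationsManagerDW;

-- One row per virtual log file (VLF) in the transaction log.
-- Status 0 = reusable (empty), Status 2 = active (still in use).
DBCC LOGINFO;
```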
Now I might be saying this wrong or using the wrong terms, but this basically gives you the virtual log files within the log file (or log files) for that database. I think I will just see them as the pages in a book. You can have one or more log files (books) for a database. Each log file has a number of pages to fill up with data. Normally the log file is written sequentially (so like a book, from the beginning towards the end), but when bits and pieces get cleared out, it could be that most of the log file is empty but still has some pages written at the end. This is usually the reason why you sometimes cannot shrink a log file on the first try. It will clean out pages and find it cannot clean one at the end, so it cannot make the log file smaller. Repeating it a few times makes the last change jump to the front of the log file again and the end becomes cleaned up, so we can shrink the log file. By the way, we can write a checkpoint in the log by simply giving the command CHECKPOINT in a query.
Well, in my case we first looked at whether the pages were all in use. Check the Status column. If it says 0 the page is empty and if it says 2 it is in use. In my case most of the file was in use. So there was not much to gain by trying to move data around inside the log file and shrinking it, because the transaction clearly had it all in use.
Also we found there were way too many virtual log files (pages in my example) in the log file. Probably caused by the many auto-grow events. An interesting article forwarded to me by Jacky is http://blogs.msdn.com/b/saponsqlserver/archive/2012/02/22/too-many-virtual-log-files-vlfs-can-cause-slow-database-recovery.aspx
Watching the transaction do its work
Also interesting was to see how session 117 went through the whole Running - Suspended - Runnable - Running cycle of status changes while I kept running the "select * from sys.dm_exec_requests" command. This was due to, among other things, the autogrow of the log file each time. It waits for the file extension to be created and waits for the disk (that's while it was suspended), next it goes to the runnable status and waits for a free thread to get processor time, and then it jumps to running status. Again, this is the short and simple way of saying it, I guess.
Also Jacky sent me a query to check if this transaction was using so much of the log space:
select * from sys.dm_exec_requests r
join sys.dm_tran_database_transactions t
on t.transaction_id = r.transaction_id
and t.database_id = r.database_id
And check the field database_transaction_log_bytes_used.
Sure enough it was session 117 using a few hundred GB of log file space.
Creating additional log files
Another thing which worried me was whether I could keep expanding the log disk like that. There will come an end to the storage LUN at some point, right? So the alternative would be to create additional log files for this database on other disks. Go to the database and right-click to open the properties and add a log file, right? Wrong, I still could not open the properties of the database at this point, so I had to use T-SQL again for it. I had done this once before for another customer. http://technet.microsoft.com/en-us/library/bb522469(v=sql.105).aspx
An example perhaps:
ALTER DATABASE OperationsManagerDW
ADD LOG FILE
(
NAME = OpsDW,
FILENAME = 'D:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\opsdwlog2.ldf',
SIZE = 100MB,
MAXSIZE = 10000MB,
FILEGROWTH = 100MB
);
And yes in hindsight I should have ignored the autogrow setting and just made it fixed. It would be a temporary file anyway. In the end I could add space to the disk where the big log file resided anyway.
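Because the extra log file is only meant to be temporary, it can be removed again once the big transaction is gone and the file no longer holds active log. The lines below are a sketch, assuming the logical file name OpsDW from the example above. REMOVE FILE only succeeds when the file is empty, so it may take a checkpoint and a retry.

```sql
USE OperationsManagerDW;

-- List the log files of the database with their size in MB
-- (size is stored in 8 KB pages, so dividing by 128 gives MB).
SELECT file_id, name, physical_name, size / 128 AS size_mb
FROM sys.database_files
WHERE type_desc = N'LOG';

-- Push the active part of the log forward, then drop the temporary file.
CHECKPOINT;
ALTER DATABASE OperationsManagerDW REMOVE FILE OpsDW;
```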
Give up your secret mister transaction
All of this was really bugging me and I was trying to figure things out as they came along. So I went out and tried to find out more about the query which was running. Our elusive number 117. What are you doing, mister 117?
I found a query somewhere on the internet. Sorry, I did not record where I found it. It is an extension of the command I used before to check what it was doing. I will paste it below:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SELECT
er.session_Id AS [Spid]
, DATEDIFF(SS,er.start_time,GETDATE()) as [Age Seconds]
, SUBSTRING (qt.text, (er.statement_start_offset/2) + 1,
((CASE WHEN er.statement_end_offset = -1
THEN LEN(CONVERT(NVARCHAR(MAX), qt.text)) * 2
ELSE er.statement_end_offset
END - er.statement_start_offset)/2) + 1) AS [Individual Query]
, qt.text AS [Parent Query]
FROM sys.dm_exec_requests er
INNER JOIN sys.sysprocesses sp ON er.session_id = sp.spid
CROSS APPLY sys.dm_exec_sql_text(er.sql_handle) as qt
WHERE session_Id > 50
AND session_Id NOT IN (@@SPID)
ORDER BY session_Id, ecid
Alright, so this one uses the same kinds of commands, filters out only sessions above 50 and gives some info on the query running and the parent query. Now what I saw in both cases was Alert Staging.
Now lets get back to SCOM again, because that sounds familiar!
The basic way that SCOM works with this stuff is that the Agents send data to the management server. The management server inserts data into the Datawarehouse database. Next the data must be aggregated. It does this into a number of staging tables. There are staging tables for Alert, performance, event and state. Next inside the SCOM workflows there are some rules which belong to SCOM itself which kick off stored procedures in the datawarehouse. These process the data from/through the staging tables and put it in the historical tables and do some stuff with it and next clean the staging area up again.
So what this seemed to be telling me is that one of the management servers kicked off a rule which kicked off the stored procedure to handle the alerts in or through the alert staging table. By the way, I could see which of the management servers this was as well. But these kinds of jobs run often and only have a few rows of data to work through! Let's have a look.
SELECT count(*) from Alert.AlertStage
Uhmmm, 300 million rows?!?!?!
Now that's an alert storm of some kind, and probably generated programmatically, as there is no way we have that many alerts in a short amount of time. Now I know that this transaction will never get through this amount of data, and what is the point? These are not normal alerts and there is no value in retaining them. Somebody wrote a nice blog post about another issue where the data from the alert staging was not written to the normal tables: http://bsuresh1.wordpress.com/2014/03/18/alert-data-is-not-being-inserted-into-scom-data-warehouse/. In my case I was not interested in the alerts, so I did not go for temporarily moving them to another table and running through them manually with that stored procedure.
I opted to clean out the table.
TRUNCATE TABLE Alert.AlertStage
Sit back and wait for 300 million rows to be removed. I had the hope that once session 117 realized there was no more data to process, it would be done and could clean up after itself and thus the log file. Guess I was not that lucky, because 2 hours afterwards it was still running. So I was done with this and killed session 117 (KILL 117). Of course this caused a rollback and a few hours later...
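While waiting for such a rollback, KILL can also report how far along it is. A minimal sketch, using the session ID 117 from this story:

```sql
-- Ends the session; its open transaction is rolled back.
KILL 117;

-- Run afterwards, as often as you like: reports the rollback progress
-- (percentage complete and estimated time remaining) without killing anything.
KILL 117 WITH STATUSONLY;
```

The estimate is only reported while a rollback is actually in progress. Once the session is gone, the command just returns an error saying the session ID is not active.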
Next run a check on the four staging tables:
SELECT count(*) from Alert.AlertStage
SELECT count(*) from Event.EventStage
SELECT count(*) from Perf.PerformanceStage
SELECT count(*) from State.StateStage
All normal with low numbers.
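On a table with hundreds of millions of rows, count(*) has to scan everything and can get stuck behind a blocking transaction. As an alternative sketch, the row counts can be read from the partition statistics, which is fast and approximate. The four table names are the staging tables listed above, and the query assumes it is run in the data warehouse database.

```sql
USE OperationsManagerDW;

-- Approximate row counts for the four staging tables, read from metadata
-- instead of scanning the tables. index_id 0 or 1 = heap or clustered index.
SELECT s.name + N'.' + t.name AS staging_table,
       SUM(p.row_count)       AS approx_rows
FROM sys.dm_db_partition_stats AS p
JOIN sys.tables  AS t ON t.object_id = p.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE p.index_id IN (0, 1)
  AND t.name IN (N'AlertStage', N'EventStage', N'PerformanceStage', N'StateStage')
GROUP BY s.name, t.name
ORDER BY approx_rows DESC;
```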
It also cleaned up a few hundred GB of space in the database itself
And from this point I could finally open the properties of the database too!
So I ran the task of shrinking the file. Right-click the database - Tasks - Shrink - Files.
Make sure you select the log file of course! It shows the amount of available empty space inside (which was a lot). So I went ahead and shrunk it down. If it doesn't work, open a query window, type the command "checkpoint" and try again. This can be the case when there is something still writing around the end of the file. As soon as it wraps around and starts writing at the start of the file, the end will be clear and shrinking will work.
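The same shrink can be done from a query window, which helps when the GUI refuses to open. This is a sketch only: the logical log file name MOM_LOG and the 10 GB target are assumptions, so look up the real logical name with the first query before running the shrink.

```sql
USE OperationsManagerDW;

-- Find the logical name and current size (in MB) of the log file.
SELECT name, type_desc, size / 128 AS size_mb
FROM sys.database_files;

-- Force a checkpoint so inactive log can be marked reusable,
-- then shrink the log file to a target size given in MB.
CHECKPOINT;
DBCC SHRINKFILE (N'MOM_LOG', 10240);  -- assumed logical name, 10 GB target
```

If the file does not get smaller, repeating the CHECKPOINT and the shrink a little later usually works once the active part of the log has wrapped around to the start of the file.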
So what was the cause of all this mess? I can't say, but we saw this happening within hours after the bodged update of SCOM from 2012 SP1 to 2012 R2 and UR3. And yes, I did run the SQL scripts belonging to UR3 as well. It is possible that the upgrade wizard which killed my first management server and the operational database touched the data warehouse as well, before it even reached that step in the wizard. I do not know. Next time, just play it safe and restore not only the operational database but also the data warehouse database. Even if it is over a TB in size. All the stuff that happened in the story described above took a load of time as well.
But it was a nice learning experience on some points as well.
Just happy that the whole system is running well again.
A big thanks to Jacky van Hogen for her advice on the SQL pieces over the phone on her lunch break! Just a few minutes of good advice and some pointers in the right direction from an expert in her field was such a big time saver and reduces stress.
It also sparked an idea, which I will get back to later.
A few weeks ago I did an upgrade of SCOM 2012 SP1 UR6 to SCOM 2012 R2 (and later to UR3) for a customer. Of course they have a test environment with SCOM in it and I went through the whole process and everything looked fine there. Next I was allowed to do it on the production environment. This has 6 management servers and some bits and pieces in it.
The story below contains a recovery command for a dead management server when you still do have the SCOM database, so if that is your case just read on through as well.
This post does not contain pictures as I can't bring those back anymore. It also does not contain any sound, which is good because some of the things I had to say about the upgrade that day were not that positive. By the way, the most important thing that day was to remain calm. Whatever happens during the upgrade or when you see stuff breaking, keep calm and take it step by step. I managed to fix it in the end, got a good upgrade and brought dead SCOM servers back to life. So can you.
First of all read the following story for the procedure to upgrade from 2012 SP1 to 2012 R2:
This is the page where you do some checks and run a script and such things and the rest of the steps for the upgrade are listed in the other pages on the left-hand side of that page.
Now the step you will want to do first is to back up your databases! Please do this. I did and I was happy for it, as you will soon find out. Keep in mind to back up all databases involved!
First lesson learned here is that you need to install prerequisite software whose version changed between SCOM 2012 SP1 and 2012 R2. This is ReportViewer 2012 (instead of 2010), and that one in turn has a prerequisite on the SQL CLR Types. Go to this page and find the prerequisites for the Operations Console. It lists the ReportViewer redistributable and, in a box below it, the CLR Types prerequisite:
First run the CLR Types prerequisite installer, and if it asks for a reboot, please do so! Next install ReportViewer 2012. The SCOM setup wizard described below will check for the prerequisite software and for any pending reboots.
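A minimal sketch of what "all databases involved" can look like in T-SQL, again as a small Python helper that only builds the statements. The datawarehouse database name and the backup folder below are assumptions (hypothetical values for illustration); use the names and paths from your own environment, and include any other database your setup touches.

```python
def backup_statements(databases: list, backup_dir: str) -> list:
    """Build one full BACKUP DATABASE statement per database.

    backup_dir is a folder on the SQL Server itself (placeholder value here).
    """
    folder = backup_dir.rstrip("\\")
    statements = []
    for name in databases:
        db = name.replace("]", "]]")
        path = f"{folder}\\{name}_pre_upgrade.bak".replace("'", "''")
        # COPY_ONLY leaves the regular backup chain untouched;
        # CHECKSUM verifies page checksums while the backup is written.
        statements.append(
            f"BACKUP DATABASE [{db}] TO DISK = N'{path}' WITH COPY_ONLY, CHECKSUM;"
        )
    return statements


# Hypothetical database names and folder, for illustration only.
for statement in backup_statements(["OperationsManager", "OperationsManagerDW"], "D:\\Backup"):
    print(statement)
```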
One of the lessons learned much earlier with these upgrades of SCOM is that you run the SCOM setup on only ONE management server at a time. Do not try to start up and run through the first half of the wizard on each box because you think you can win 30 seconds. It WILL break your SCOM if you do that. The explanation in very short: the first thing the wizard does is check whether the SCOM database has been upgraded. If more than one server concludes that it has not, each of them will run the database upgrade anyway, and that breaks the stuff. So do not touch the SCOM setup except on one management server at first.
So on a management server, with the correct rights (all of them...), you start the setup from the SCOM 2012 R2 install media. The setup should quickly see that you are doing an upgrade. If you do not see that, something is wrong. If you do see it, that's fine. Walk through the wizard and enter the service accounts to be used. Pasting passwords into these boxes can be troublesome, and sometimes you need to copy a second time and paste into Notepad or the search box on the machine to confirm you are pasting the right string. Just saying, I ran into this a few times and was wondering why it would not validate the accounts I typed.
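To make that explanation a bit more concrete, here is a toy model of it in Python. This is not the real wizard logic or the real database schema; the class and function names are made up for illustration. It only shows why "check first, upgrade later" goes wrong when two wizards do their check before either has done the upgrade.

```python
class OperationalDb:
    """Toy stand-in for the SCOM operational database (not the real schema)."""

    def __init__(self):
        self.upgraded = False
        self.upgrade_runs = 0


def wizard_first_half(db: OperationalDb) -> bool:
    """Hypothetical first half: decide whether the database still needs upgrading."""
    return not db.upgraded


def wizard_second_half(db: OperationalDb, needs_upgrade: bool) -> None:
    """Hypothetical second half: act on the decision made earlier."""
    if needs_upgrade:
        db.upgrade_runs += 1
        db.upgraded = True


def two_wizards_started_together() -> int:
    db = OperationalDb()
    # Both wizards are clicked through their first half before either upgrades.
    check_a = wizard_first_half(db)
    check_b = wizard_first_half(db)
    wizard_second_half(db, check_a)
    wizard_second_half(db, check_b)
    return db.upgrade_runs


def two_wizards_one_at_a_time() -> int:
    db = OperationalDb()
    # The second wizard only starts after the first one has finished.
    wizard_second_half(db, wizard_first_half(db))
    wizard_second_half(db, wizard_first_half(db))
    return db.upgrade_runs


print(two_wizards_started_together())  # database upgrade attempted twice
print(two_wizards_one_at_a_time())     # database upgrade attempted once
```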
If everything has gone right you should now be ready to run the upgrade wizard and it should upgrade your management server + databases + management packs + local console and if you have more functions installed on that machine also those.
In my case the wizard ran for a few minutes and next gave me an error on the second step (the database upgrade phase). So click OK and try again? Well...
Problem with this upgrade wizard is that if it runs into a problem it will NOT do a rollback of anything it did.
The first step looks like a preparation step in this wizard, but what it actually does is remove the bits of your management server! Next it went to the second step, and the first thing that step does is touch the database to let it know it is in upgrade mode.
So... my management server was dead and gone!
I thought, well, perhaps it is just that management server, I have a few more. Let's try the second one and I will fix the first MS later. Nice try, but the upgrade wizard immediately sees in the database that there is already an upgrade going on, so it cancels out.
I searched for and found the log file of the upgrade wizard, but it gave no real clue about what had gone wrong.
Next thing I did was to restore the SCOM database to the state right before the upgrade of the first machine started. After that, the database doesn't know that any upgrade business has been going on, and it believes that the first management server should be there and working (which it wasn't). In hindsight I should have restored the Datawarehouse database as well, so if you run into something like this, restore all components!
So now I started the upgrade on the second management server, acting as if nothing had happened. Ran through the upgrade wizard and pressed the Upgrade button. First step done, database step... taking a long while... and I see that it is doing stuff like importing management packs for a long time. This is expected. The first management server to go through the upgrade takes a while because it also upgrades the databases and management packs. The rest of them later only need to upgrade their executables and such, so they will be much faster! And next I see it step through the whole wizard until the end. Upgrade done! Yeah!!!
So next up... 4 more living management servers to go. I started to upgrade them one by one as well now. And only the last one also failed on the second step somewhere and got killed due to no rollback.
Alright, so two management servers to be restored. First, let's quickly do the web/report server. Start the upgrade wizard... Error. This time it had checked and found that the management server it was pointing to had not been upgraded yet (yes, you guessed it, it was pointing to the first dead server). So to keep the stuff moving I went into fixing the two dead management servers first. I will write that down below in a minute.
I can tell you that after that the rest of the components upgraded smoothly after I fixed those two boxes.
The funniest thing was of course that at the end I also upgraded the Linux/Unix agents. You need to know that from the 2012 SP1 version to the 2012 R2 version the cross platform agent was completely changed and rebuilt from the ground up. It no longer uses OpenPegasus, but OMI instead. And to my surprise this was a select-all, upgrade agent, yes use the default account (that one had the rights in my case) and go go go. Two minutes later all the Linux SCOM agents were upgraded and functioning without any error. I guess I will be hearing from the Linux admins for some time that of course their part of the upgrade went fine and fast.
Repairing the management servers
So what can we do for a dead SCOM server when we know the database is still working fine and still thinks there is a management server there? In that case we can run the SCOM setup using a recovery option.
First thing to check is:
Add/Remove programs. Yes SCOM is not there anymore. Next in case you have Savision I would say remove the console extension parts. We will install those again after the SCOM console is installed again. Remember also that in my case the server died when it was a SCOM 2012 SP1 box, and in 2012 R2 the file paths change.
What I used in this case was the SCOM 2012 R2 setup media directly. I know the box died when it was 2012 SP1, but meanwhile all the other management servers and the database had been upgraded to R2 already. So make sure you can access the SCOM setup media, in my case the 2012 R2 version.
I used the command line version to run this recovery. As you will see it is largely the same as a clean install, except that it has the /recover switch in the command.
Open up Task Manager by the way, so you can see setup.exe running and the second installation process as well. It takes a few minutes, then they disappear and the installation should be finished. Of course we assume all prerequisite software as mentioned before has already been installed.
The recovery command below will put the management server component back. It will not install the SCOM console. You can install that separately through the normal setup wizard after this is done. Next come the Savision console extensions, in my case. And the Update Rollup stuff comes well after this story of course.
Here is the command I used and I changed the accounts and passwords. Due to the way this blog displays it I have to enter line feeds in this to display it correctly. Keep in mind that this is ONE command on ONE line. When you copy and paste please first paste it in a notepad and make sure it returns to being one line with only a space between the parameter switches!!
setup.exe /silent /AcceptEndUserLicenseAgreement /recover /EnableErrorReporting:Always
/SendCEIPReports:1 /UseMicrosoftUpdate:1 /DatabaseName:OperationsManager
So change the values according to your environment and use it.
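If you do not trust your copy and paste, here is a minimal sketch in Python that folds the wrapped text back into one line and lists the switches so you can eyeball them. The helper names are my own, and the sample text is just the command as displayed above. One assumption to watch: the helper treats every run of whitespace as a single space, so it is not suitable for a value (such as a password) that itself contains spaces.

```python
def join_wrapped_command(pasted: str) -> str:
    """Collapse line feeds and repeated spaces into single spaces."""
    return " ".join(pasted.split())


def list_switches(command: str) -> list:
    """Return every token that starts with '/', in the order it appears."""
    return [token for token in command.split() if token.startswith("/")]


# The command exactly as this blog displays it, wrapped over two lines.
wrapped = """setup.exe /silent /AcceptEndUserLicenseAgreement /recover /EnableErrorReporting:Always
/SendCEIPReports:1 /UseMicrosoftUpdate:1 /DatabaseName:OperationsManager"""

one_line = join_wrapped_command(wrapped)
print(one_line)
for switch in list_switches(one_line):
    print(switch)
```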
Just to continue the story of the upgrade process for this specific environment... after fixing all servers and upgrading the web server as well, we continued with the Update Rollup 3 upgrade including all steps (keep in mind there is also a step in there with SQL scripts to be run against the databases). The UNIX/Linux agent upgrades were also downloaded and run, and the management packs imported. Give all of this time as there is a LOT to synchronize.
Upgrade all the agents. Windows agents, cross platform agents and a number of left-over consoles.
So was this it?
First of all we ran into the changed code signing certificate for the web console components when run from desktops with users without rights (see http://www.bictt.com/blogs/bictt.php/2014/10/02/scom-2012-web-console-configuration ).
Second thing is the day after we discovered that the SCOM datawarehouse database was going nuts! I will write about that very soon on what appeared to have happened, how to diagnose, what to look for and how we eventually fixed that. One of the coming days.