SCOM 2012 SP1 to R2 upgrade failed and how to fix failed management servers

A few weeks ago I did an upgrade of SCOM 2012 SP1 UR6 to SCOM 2012 R2 (and later to UR3) for a customer. Of course they have a test environment with SCOM in it and I went through the whole process and everything looked fine there. Next I was allowed to do it on the production environment. This has 6 management servers and some bits and pieces in it.
The story below contains a recovery command for a dead management server when you still have the SCOM database, so if that is your case just read on as well.

This post does not contain pictures as I can't bring those back anymore. It also does not contain any sound, which is good because some of the things I had to say about the upgrade that day were not that positive. By the way, the most important thing that day was to remain calm. Whatever happens during the upgrade or when you see stuff breaking, keep calm and take it step by step. I managed to fix it in the end, complete the upgrade and bring dead SCOM servers back to life. So can you.

First of all read the following article for the procedure to upgrade from 2012 SP1 to 2012 R2:
http://technet.microsoft.com/en-us/library/dn249704.aspx

That page covers the pre-upgrade checks and a script to run. The rest of the steps for the upgrade are listed in the other pages on the left-hand side of that page.

Now the first step you will want to do is to back up your databases! Please do this. I did and I was happy for it as you will soon find out. Keep in mind to back up all databases involved!
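As a rough illustration of what "all databases involved" can look like, here is a minimal Python sketch that only generates T-SQL backup statements as text. It does not run them. The database names OperationsManager and OperationsManagerDW are the ones used in the recovery command later in this post; the backup folder D:\Backup, the file name suffix and the helper name backup_statement are my own assumptions for the example. Check the statements with your DBA before running them in SQL Server Management Studio.

```python
def backup_statement(database, backup_dir):
    """Build a T-SQL full backup statement for one database (text only)."""
    # Hypothetical naming scheme: <folder>\<database>_before_R2_upgrade.bak
    folder = backup_dir.rstrip("\\")
    path = folder + "\\" + database + "_before_R2_upgrade.bak"
    # Escape characters that would otherwise end the T-SQL identifier or string.
    quoted_name = database.replace("]", "]]")
    quoted_path = path.replace("'", "''")
    # COPY_ONLY leaves the regular backup chain alone, INIT overwrites an
    # existing file with the same name, CHECKSUM verifies pages while writing.
    return "BACKUP DATABASE [{}] TO DISK = N'{}' WITH COPY_ONLY, INIT, CHECKSUM;".format(
        quoted_name, quoted_path
    )


# Assumed default database names; add any other databases your setup uses.
DATABASES = ["OperationsManager", "OperationsManagerDW"]

if __name__ == "__main__":
    for name in DATABASES:
        print(backup_statement(name, "D:\\Backup"))
```

The point of generating both statements together is simply that you do not forget the data warehouse when you are in a hurry.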

First lesson learned here is that you need to install prerequisite software whose version changed between SCOM 2012 SP1 and 2012 R2. This is ReportViewer 2012 (instead of 2010), and that one in turn has the SQL CLR Types as a prerequisite. Go to this page and find the prerequisites for the Operations Console. It lists the ReportViewer redistributable and, in a box below it, the CLR Types prerequisite:
http://technet.microsoft.com/en-us/library/dn249696.aspx

First run the CLR Types prerequisite installer and if it asks to do a reboot, please do so! Next install the ReportViewer 2012. The setup wizard below will check for the prerequisite software and for any pending reboots 💡

One of the lessons learned much earlier with these upgrades of SCOM is that you run the SCOM setup only on ONE management server at a time. Do not try to start up and run through the first half of the wizard on each box because you think you can win 30 seconds. It WILL break your SCOM if you do that. The explanation in very short: the first thing the wizard does is check whether the SCOM database has been upgraded. If more than one server concludes that it has not, they will each try to upgrade the database and you break the stuff. So do not touch the SCOM setup except on one management server at first.

So on a management server, with the correct rights (all of them…) you start the setup from the SCOM 2012 R2 install media. The setup should quickly see that you are doing an upgrade. If you do not see that, something is wrong. If you do see it, that's fine. Walk through the wizard and enter the service accounts to be used. Pasting passwords into these boxes can be troublesome, and sometimes you need to copy a second time and paste into Notepad or the search box on the machine first to confirm you will be pasting the right string. Just saying, I ran into this a few times and was wondering why it would not validate the accounts I typed.

If everything has gone right you should now be ready to run the upgrade wizard. It should upgrade your management server + databases + management packs + local console, and any other SCOM roles you have installed on that machine.

In my case the wizard ran for a few minutes and next gave me an error on the second step (the database upgrade phase). So click OK and try again? Well…

The problem with this upgrade wizard is that if it runs into a problem it will NOT do a rollback of anything it did.
The first step looks like a preparation step in this wizard but what it actually does is remove the bits of your management server! And next it went to the second step, where the first thing it does is touch the database to let it know it is in upgrade mode.
So… my management server was dead and gone!

I thought, well perhaps it is just that management server, I have a few more. Let's try the second one and I will fix the first MS later. Well, nice try, but the upgrade wizard immediately sees in the database that there is already an upgrade going on, so it will cancel out.
I searched for and found the log file of the upgrade wizard, but it gave no real good clue about what had gone wrong :-/

Next thing I did was to restore the SCOM database to the state right before the upgrade of the first machine started. The result is that the database doesn't know any upgrade business had been going on, and it believes that the first management server should be there and working (which it wasn't). In hindsight I should have restored the Datawarehouse database as well, so if you run into something like this restore all components!

So now I started the upgrade on the second management server. Acting as if nothing happened. Ran through the upgrade wizard and pressed the Upgrade button. First step done, database step… taking a long while… and I see that it is doing stuff like importing management packs for a long time. This is expected. The management server which does the upgrade does take a while to go through all its steps. The rest of them later only need to upgrade their executables and such so will be much faster! And next I see it step through the whole wizard until the end. Upgrade done! Yeah!!! B)

So next up… 4 more living management servers to go. I started to upgrade them one by one as well now. Only the last one also failed somewhere in the second step and got killed because there is no rollback.

Alright, so two management servers to be restored. First let's quickly do the web/report server. Start the upgrade wizard… Error :> And this time it had checked that the management server it was pointing to had not been upgraded yet (yes you guessed it, it was pointing to the first dead server). So to keep the stuff moving I went into fixing the two dead management servers first. I will write that down below in a minute.

I can tell you that the rest of the components upgraded smoothly after I fixed those two boxes.

The funniest thing was of course that at the end I also upgraded the Linux/Unix agents. You need to know that from the 2012 SP1 version to the 2012 R2 version the cross platform agent was completely changed and rebuilt from the ground up. It is now not using OpenPegasus anymore, but OMI instead. And to my surprise this was a select-all, upgrade agent, yes use the default account (that one had the rights in my case) and go go go. 2 minutes later all the Linux SCOM agents were upgraded and functioning without any error. I guess I will be hearing from the Linux admins for some time that of course their part of the upgrade went fine and fast.

Repairing the management servers

So what can we do for a dead SCOM server where we know the database is still working fine and it still thinks there is a management server there? In that case we can run the SCOM setup using a recovery option.

First thing to check is:
Add/Remove programs. Yes SCOM is not there anymore. Next in case you have Savision I would say remove the console extension parts. We will install those again after the SCOM console is installed again. Remember also that in my case the server died when it was a SCOM 2012 SP1 box, and in 2012 R2 the file paths change.

What I used in this case was the setup from the SCOM 2012 R2 media directly. I know the box died when it was 2012 SP1, but by now all other management servers and the database had been upgraded to R2 already. So make sure you can access the SCOM setup media. In my case the 2012 R2 version.

I used the command-line version of setup to run this recovery. As you will see it is largely the same as a clean install, except that it has the /recover switch in the command.

Open up Task Manager by the way so you can see setup.exe running and the second installation process as well. It takes a few minutes and next they disappear and the installation should be finished. Of course we assume all prerequisite software as mentioned before was already installed.

The recovery command below will put the management server component back. It will not install the SCOM console. You can install that separately through the normal setup wizard after this is done. Next come the Savision console extensions in my case. And the Update Rollup stuff comes well after this story of course.

Here is the command I used, with the accounts and passwords changed. Due to the way this blog displays it I have to enter line feeds in this to display it correctly. Keep in mind that this is ONE command on ONE line. When you copy and paste please first paste it in a notepad and make sure it returns to being one line with only a space between the parameter switches!!


setup.exe /silent /AcceptEndUserLicenseAgreement /recover /EnableErrorReporting:Always
/SendCEIPReports:1 /UseMicrosoftUpdate:1 /DatabaseName:OperationsManager
/SQLServerInstance:THESQLSERVER1 /DWDatabaseName:OperationsManagerDW
/DWSQLServerInstance:THESQLSERVER1 /DASAccountUser:CONTOSO\scomsdk
/DASAccountPassword:TooDiff1cult /DataReaderUser:CONTOSO\scomdra
/DataReaderPassword:TooDiff1cult /DataWriterUser:CONTOSO\scomdwa
/DataWriterPassword:TooDiff1cult /ActionAccountUser:CONTOSO\scommsa
/ActionAccountPassword:TooDiff1cult

So change the values according to your environment and use it.
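If you do not trust your notepad skills, the rejoining can also be scripted. This is only a small Python sketch of that idea, not part of the SCOM tooling: the helper names join_wrapped_command and switch_names are made up for this example, and it assumes that none of your values (passwords included) contain spaces. It collapses the wrapped text into one line and complains about any piece that does not start with a slash, which usually means a stray line break ended up in the middle of a switch.

```python
def join_wrapped_command(text):
    """Collapse a command that was wrapped over several lines into one line."""
    # str.split() without arguments splits on any run of whitespace,
    # including the line feeds that were added for display.
    return " ".join(text.split())


def switch_names(command):
    """Return the switch names (without '/' and value) in order of appearance."""
    names = []
    for token in command.split()[1:]:  # skip the executable name
        if not token.startswith("/"):
            raise ValueError("unexpected piece, check for a stray break: %r" % token)
        names.append(token[1:].split(":", 1)[0])
    return names


if __name__ == "__main__":
    # Shortened example; paste your own full command between the quotes.
    wrapped = """setup.exe /silent /AcceptEndUserLicenseAgreement /recover
/DatabaseName:OperationsManager
/SQLServerInstance:THESQLSERVER1"""
    one_line = join_wrapped_command(wrapped)
    print(one_line)
    if "recover" not in switch_names(one_line):
        print("Warning: the /recover switch is missing!")
```

Run it, copy the printed line, and use that in an elevated command prompt from the folder with the setup media.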

Just to continue the story of the upgrade process for this specific environment… after fixing all servers and upgrading the web server as well we continued with the Update Rollup 3 upgrade including all steps (keep in mind there is also a step in there with SQL scripts to be run against the databases). Also the UNIX/Linux agent updates were downloaded and run and the management packs imported. Give all of this time as there is a LOT to synchronize.

Upgrade all the agents. Windows agents, cross platform agents and a number of left-over consoles.

So was this it? 😉
Well, no. :> 88|

First of all we ran into the changed code signing certificate for the web console components when run from desktops with users without rights (see http://www.bictt.com/blogs/bictt.php/2014/10/02/scom-2012-web-console-configuration ).

Second thing is that the day after we discovered that the SCOM datawarehouse database was going nuts! I will write about that one of the coming days: what appeared to have happened, how to diagnose it, what to look for and how we eventually fixed it.

Good luck!
Bob Cornelissen