Savision Live Maps Service Health Index

SCOM, System Center, SCOM 2012, SCOM 2016

Starting with version 8.5, Savision Live Maps includes a new feature called Service Health Index.
Let us investigate what it does.

Those who have been using Live Maps for the last few years know about Services monitoring: a single application, service or distributed app is defined and split into 3 parts: Infrastructure, Application and User. We place items like Operating System and Disks in the Infrastructure part. Specific server roles like Web Server and Domain Controller, and monitored items such as a website, database or Windows service, go in the Application layer. The user checks go in the User side. This way we can display the state of the Service as a whole, but also its main parts and the effect the users might see.

However, if one of the items in one of the 3 main maps goes red, this usually makes that map and the whole service go red as well. Things can be overridden with custom health rollups, but there are still the usual Green-Yellow-Red colors and the rollup to the top. There have been several requests to be able to specify which parts of an application are more important than others. For instance, imagine a web farm. Let's say this farm has 3 web servers and 1 database. Now, if 1 web server goes down this will make the application go red, but the website is still up. The user side website check would still show a green state as well, but the health rollup makes no such distinction and rolls the Application map up to the Service state.

Now imagine the database going down, and assume for a second that any high availability solutions for this database have failed too. Without the backend database the website will not work. This is again rolled up to a red state for the Application side and up to the total Service health. Depending on how the user side web checks are set up, this could make that check go red as well, as a User impact. However, in both imaginary situations the Service went into a red state, and we could not see much difference in how important this red state was to the service.

Bring in the new feature Service Health Index!

Quite simply, we have a list of items we are monitoring in the Infrastructure/Application/User maps, and we define how important each one is to the working of the Service on a scale from 1 to 5, with 5 being the worst impact.

What does this look like? Let's open up the Savision Live Maps Authoring Console and open up one of the Services. In this case I am opening up the SCOM service. There is now a tab called Health Index.

From this screen you can Enable the Health Index and set it to update its health index indication every x minutes. I set it to 15 minutes at first.
There is the option to set which states have an impact on the Health Index:

So I added Warning in this case as an example.

Next you will see a list of all objects currently added to the 3 maps (Infrastructure/Application/User), each placed in one of the levels. You can now drag them to the level that matches the effect they would have on your Service.

So over here I have been dragging some components of the SCOM Service up to the higher impact levels.
The SCOM operational database, the main Resource Pool and the Data Access Service were placed in the Catastrophic level (level 5) in this case. Next, move down and place the other components according to their expected impact on the working of the service.
Then save the result. Give it the number of minutes you specified before the health index is calculated for the first time.

If we now go to the All Services Dashboard we see the following:

Luckily the SCOM service is still green. On the other service (Exchange) you can see the Health Index of 4, which means this red is quite red, but not catastrophic yet.

So now we have a combination of the health state rollups of the 3 main components of every Service and an additional Health Index indicating the resulting effect and priority of handling the situation!
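
How Live Maps calculates the index internally is not described in this post, so the following is only a rough sketch of the idea in Python. It assumes the index is simply the highest impact level among the components that are currently in one of the states you selected to count. The function name, the component names and the data layout are all made up for illustration.

```python
# Hypothetical sketch, NOT Savision's actual algorithm.
# Assumption: the index is the highest impact level (1-5) among components
# whose current state is one of the states selected to count.

def health_index(components, counted_states=("Critical",)):
    """Return the highest impact level among affected components, or 0 if none."""
    levels = [
        component["impact"]
        for component in components
        if component["state"] in counted_states
    ]
    return max(levels, default=0)


# The web farm example from earlier: 3 web servers and 1 database.
web_farm = [
    {"name": "web01", "impact": 2, "state": "Critical"},
    {"name": "web02", "impact": 2, "state": "Healthy"},
    {"name": "web03", "impact": 2, "state": "Healthy"},
    {"name": "database", "impact": 5, "state": "Healthy"},
]

print(health_index(web_farm))      # only one web server is down

web_farm[3]["state"] = "Critical"  # now the database goes down too
print(health_index(web_farm))
```

Both situations would roll up to red in the classic health model, but under this assumption the index separates a low impact failure from a catastrophic one.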

Enjoy your monitoring and pass on the value of monitoring to the whole organization by displaying the state of company services and its impact to all stakeholders!
Bob Cornelissen

NiCE DB2 Management Pack updated

SCOM, System Center, SCOM 2012, SCOM 2016

The new version 4.20 of NiCE DB2 Management Pack has been released!

New with this release
• Feature: Support of DB2 BLU Acceleration
• Feature: Monitoring of InDoubt Transactions
• Security: Support of DB2 restrictive databases
• Security: Support of non-root setup and operation
• Security: DB2 Instance attach extensions for user and password options
• Platform: New platform support for IBM AIX 7.2
• Platform: Support of non-standard paths for both installation path and instance user home directory

If you are interested in learning more you can click on the Nice logo to the right of this screen.

Happy monitoring!
Bob Cornelissen

SCOM Web Console Application Pool crashing every 15 minutes

SCOM, System Center, SCOM 2012, SCOM 2016

Recently I had a customer where the SCOM web console application pool would be crashing every 15 minutes (2 servers in this case). This was on a SCOM 2016 instance on a Windows 2012 R2 server.

The error message we got was (the process id is a different number each time):

A process serving application pool 'OperationsManagerMonitoringView' terminated unexpectedly. The process id was '1111'. The process exit code was '0xc0000005'.

This exit code, 0xc0000005, is the generic Windows access violation code, so it does not tell you much by itself.
Looking at the application pool that kept crashing, we saw it was running under the security context of "ApplicationPoolIdentity".
Several policies are in effect in this environment, and these were probably preventing this generic placeholder account from accessing some registry key or local path.
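
If you collect these events from several servers, a small helper can pull the pool name, process id and exit code out of the message text. This is only a sketch: it assumes the message has exactly the wording shown above, and the function name and the small lookup table of status codes are my own additions for illustration.

```python
import re

# A few well-known NTSTATUS values; extend as needed.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000022: "STATUS_ACCESS_DENIED",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}

# Assumes the exact message wording quoted in this post.
EVENT_PATTERN = re.compile(
    r"application pool '(?P<pool>[^']+)' terminated unexpectedly\. "
    r"The process id was '(?P<pid>\d+)'\. "
    r"The process exit code was '(?P<code>0x[0-9a-fA-F]+)'\."
)


def parse_crash_event(message):
    """Return pool, pid and exit code from the event text, or None if it does not match."""
    match = EVENT_PATTERN.search(message)
    if match is None:
        return None
    code = int(match.group("code"), 16)
    return {
        "pool": match.group("pool"),
        "pid": int(match.group("pid")),
        "exit_code": code,
        "exit_name": NTSTATUS_NAMES.get(code, "unknown"),
    }


sample = (
    "A process serving application pool 'OperationsManagerMonitoringView' "
    "terminated unexpectedly. The process id was '1111'. "
    "The process exit code was '0xc0000005'."
)
print(parse_crash_event(sample))
```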

We changed the application pool identity to LocalSystem: open IIS Manager, find the application pool, click Advanced Settings on the right, find Identity and use the dropdown to select LocalSystem. We could also have used another account that was already in use for another application pool on the server, but went with this one first.
Recycle the application pool after this.

The crashes stopped happening from here. The SCOM web console was reachable.

Hope it helps somebody sometime.
Bob Cornelissen

SCOM agent for Linux and root squash

SCOM, System Center, SCOM Tricks, SCOM 2012, SCOM 2016

At one of my customers they had a problem deploying SCOM agents through a script on Linux servers. On a number of Red Hat 6 servers all went well. On the Red Hat 7 servers, however, the agent refused to install, also when pushing the agent from the console. It seemed to stop around the file copy stage, where the rpm file gets copied to the server and is then run for installation.

It turned out a feature called "root squash" was causing the issue. It restricts the rights of root on NFS shared volumes by mapping root to an unprivileged user, so root can not simply access or run commands from any directory on such a volume, for instance under /home. When they turned off this feature the agent installed immediately.
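
Root squash itself is configured on the NFS server, so you can not read it directly from the client. What you can check on the Linux machine is whether the directory the installer is copied to lives on an NFS mount at all. Below is a minimal sketch in Python that takes the text of /proc/mounts and a path. The function names and the sample paths are made up for illustration, and it only recognizes the file system types "nfs" and "nfs4".

```python
import posixpath


def filesystem_type_for(path, mounts_text):
    """Return the file system type of the longest mount point containing path."""
    path = posixpath.normpath(path)
    best_point, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # The longest matching mount point is the one that applies.
            if len(mount_point) >= len(best_point):
                best_point, best_type = mount_point, fs_type
    return best_type


def is_on_nfs(path, mounts_text):
    """True if path lives on a mount of type nfs or nfs4."""
    return filesystem_type_for(path, mounts_text) in ("nfs", "nfs4")


# Hypothetical sample in the format of /proc/mounts.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
fileserver:/export/home /home nfs4 rw,relatime 0 0
"""

print(is_on_nfs("/home/scomuser", SAMPLE_MOUNTS))
print(is_on_nfs("/tmp", SAMPLE_MOUNTS))

# On a real machine you would read the file instead of using the sample:
# with open("/proc/mounts") as handle:
#     print(is_on_nfs("/home/scomuser", handle.read()))
```

If the target directory turns out to be on NFS, root squash on the export is one of the things worth asking the storage or Linux team about.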

Just writing this down because I am sure I will run into this again somewhere.

Happy agent deployment!
Bob Cornelissen

Test your knowledge on SCOM/OMS/Azure and more

SCOM, System Center, SCOM 2012, SCOM 2016, Windows 2016, OMS

Now test your knowledge on SCOM/OMS/Azure and more through this quiz for fun and to win a Band as well :D

You can take the quiz by clicking on the picture or by this link:
Test your knowledge on SCOM/OMS/Azure and more

Have fun!
Bob Cornelissen

OMS - Antimalware Assessment example

SCOM, OMS

As you may know I have been playing with OMS for a while, especially on the Log Analytics side and some security items. One of the solutions I added quickly was the Antimalware Assessment solution.

What the Antimalware Assessment does is first of all check if you are protected at all. It will find some antivirus products, and it will also see if a machine has nothing recognized apart from the last run of the Malicious Software Removal Tool, which comes with Windows Updates every month. And for System Center Endpoint Protection, for instance, it can pick up on threats.

Today I had a chance to also see that part in action :>

So I got the following email:

This does also name which machine is involved and such.

So I went to my OMS workspace and went into the Antimalware Assessment to find this:

From here we can see which machine was affected and also that the threat has been quarantined already. The second blade tells me what item was found and at what time.

If you click on the threat or the machine you will get to see the log entries leading to this. It features things like which files in which path were found and quarantined.

So let me have a look at the machine giving the alert and sure enough there it is:


So this gave me a possibility to confirm this does not belong there and remove it permanently. And of course make sure to run a full scan just to be sure.

So there you have it. Immediate value add by the OMS solution on top of what you have already.

Have fun and stay safe!
Bob Cornelissen

STOP 0x00000050 PAGE_FAULT_IN_NONPAGED_AREA on a Windows 2008 server

Windows 2008, Windows 2012

I was working with an old Windows 2008 R2 server last night. It needed a "few" updates!
So I will first admit to several of my own mistakes. I did not give myself time to update this machine regularly enough in the past, and of course we should always install Windows Updates on time. Waiting an extra month, so that any problems introduced by one month's fixes can be fixed the next, is something we all understand. But this was many months' worth of updates. I went the lazy way, which bit me as you will see below.

I was first interested in getting 1 specific update on the machine. So I selected that update and a random few other smaller updates. Now this is a mistake! It installed the updates and wanted a reboot. OK. The next thing that happened is that the machine started up straight into a Blue Screen with code STOP 0x00000050 PAGE_FAULT_IN_NONPAGED_AREA, or in short a code 0x50. There was no way around this, for instance into safe mode. The only thing which popped up was the System Recovery Options shown below:

By the way, before you get to this screen it asks you for the Local Administrator password. Turns out even I did not remember, but I got it in the end. Managing admin accounts, including local administrator accounts, is important. Watch Paula Januszkiewicz give you an example of why it is important in one of the CQURE academy sessions about passing the hash.

I felt a little panic coming up at that point, because data loss, or at least a lot of time spent fixing things, can follow from this. It did not look like I could do much from here either. I did have backups of the data, so in time I would have restored it.
Another reason for the panic is that I was doing two systems at the same time and in the same way.... and you guessed it... both with the same result!

A lot of googling followed, and there are a lot of videos explaining how to fix this FROM Windows! The problem is I was stuck in this System Recovery Options screen. The memory check did not show anything, by the way.

Well somewhere hidden in a comment of one of the threads (I can not find it!) was the suggestion that some previous hotfix might have hit one file and removing that file solved it for a few people.

In the picture above you can see a command prompt. Open that.
Next you need to find out which drive letter contains your Windows installation. System Recovery takes a drive letter for itself and assigns other letters to the other drives. So I typed C: and Enter, ran DIR, and knew this was not the drive. So I went to D: and ran DIR again. Nope. I continued until I got it.

CD Windows\system32
Dir *cache*

The file I am looking for is fntcache.dat

This is the font cache file. Do NOT touch the DLL file there. The DAT file is a cache and will be rebuilt by Windows after a restart.

del fntcache.dat

Now I exited the command prompt and restarted the server. It started up into Windows again, just as I hoped.

Next I still needed to do a select-all on the rest of the updates and install them all the same.

So keep in mind to update regularly + do not select half the updates but go for them all because there are fixes in there which fix issues created (or surfaced) by other fixes.

Now I can continue with actually replacing these servers, which was the plan to start with!

Good luck!
Bob Cornelissen

Error 500.19 after installing Savision LiveMaps Unity Portal

SCOM, System Center, SCOM Tricks, SCOM 2012, SCOM 2016

Today I was doing a quick installation of the Savision 8.2 Live Maps Unity Portal. Downloaded the self-extracting executable from the website and of course arranged a license key. While running the installer I selected the Express setup which just pushes the web portal onto the machine and not the other components available in the Advanced installation option. The installation ran in 2 minutes on a slow machine, and this is including the extracting of the files and running checks.

After installation the web page automatically opens up and I was greeted with the following error:

HTTP Error 500.19 - Internal Server Error
Module: WindowsAuthenticationModule

In the error description there is talk of a configuration section being locked at parent level.

Screenshot of the error:

What happened is that Windows Authentication is turned off at the server level and that this configuration is locked for the whole machine. The Live Maps Portal tries to read its own Authentication settings from its configuration file, and because this section is locked at a higher level, IIS throws an error.

How to fix it:

Open IIS Manager
In the left menu select your server name
In the middle of the screen select Configuration Editor

Near the top of the Configuration Editor is a selection box for which section you want to see and edit.
Go to system.webServer/security/authentication/windowsAuthentication

In the right hand menu you will find a link to Unlock Section. Click it to unlock this configuration item.

Now any lower level (Sites or Applications within a site) can have their own configuration for Windows Authentication.
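
If you want to see the lock without clicking through IIS Manager, you can look at how the section is declared in applicationHost.config. The sketch below is in Python and only reads the overrideModeDefault attribute from the configSections declaration. It assumes the lock is expressed there; a location element with its own overrideMode can change the effective behavior, which this sketch ignores. The function name and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def section_override_default(config_xml, section_path):
    """Return the declared overrideModeDefault of a section (None if not set).

    Raises LookupError when the section is not declared at all.
    """
    node = ET.fromstring(config_xml).find("configSections")
    *groups, section_name = section_path.split("/")
    for group in groups:
        if node is None:
            break
        # Walk down the nested sectionGroup elements, one per path segment.
        node = next(
            (g for g in node.findall("sectionGroup") if g.get("name") == group),
            None,
        )
    if node is not None:
        for section in node.findall("section"):
            if section.get("name") == section_name:
                return section.get("overrideModeDefault")
    raise LookupError("section not declared: " + section_path)


# Trimmed, hypothetical sample of the relevant part of applicationHost.config.
SAMPLE_CONFIG = """\
<configuration>
  <configSections>
    <sectionGroup name="system.webServer">
      <sectionGroup name="security">
        <sectionGroup name="authentication">
          <section name="windowsAuthentication" overrideModeDefault="Deny" />
        </sectionGroup>
      </sectionGroup>
    </sectionGroup>
  </configSections>
</configuration>
"""

print(section_override_default(
    SAMPLE_CONFIG,
    "system.webServer/security/authentication/windowsAuthentication",
))
```

A value of Deny means lower levels are not allowed to override the section, which matches the locked behavior described above.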

I refreshed the error page and the Live Maps Unity Portal came up fine!

Happy dashboarding!
Bob Cornelissen

SCOM: DMZ or workgroup machines refusing to connect to SCOM

SCOM, SCOM Tricks, SCOM 2012

Ran into a customer issue today with a nice clean SCOM 2012 R2 installation with URs applied. Certificates were arranged and momcertimport had been run. On the agent machines in the DMZ we had the agent installed with the UR on it, the root certificate imported, and the certificate meant for the computer imported. momcertimport had been run to select the correct certificate. Yet there was no communication at all between agent and server. This is what I found:

So the first checks are:

  1. does the agent machine have the certificate for the name of the server (which in workgroup can be the short name and in a dmz domain a fully qualified name)? Yes
  2. does the agent machine trust the CA which issued the certificate? (in this case a customer own CA, so the root chain cert was imported). Yes
  3. can the agent resolve the SCOM server name you used while configuring the agent? Yes
  4. Is the management group name we used in configuring the agent correct (case sensitive!)? Yes
  5. Is there a firewall blocking TCP 5723 from agent to SCOM server? Yes! OK this was fixed quickly, and verified with telnet. Still no communication! Moving on.
  6. On the SCOM server did we import the CA root chain as trusted and did momcertimport run on the correct machine certificate with the correct FQDN for that server? Yes
  7. restart healthservice on both sides... Yes. No effect
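
The firewall check in step 5 was done with telnet. If telnet is not installed, a few lines of Python do the same job. This is a minimal sketch: the function name is made up, and the host name in the usage comment is a placeholder for your own management server.

```python
import socket


def can_connect(host, port=5723, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Usage, run from the agent machine (placeholder host name):
# print(can_connect("scomservername"))
```

Keep in mind this only proves the TCP port is reachable. It says nothing about certificates or authentication, which is where the rest of this story goes.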

Man, usually it's name resolution, firewall and routing, a certificate with the wrong name, no certificate, or an untrusted certificate. Pffff.

Something must be wrong with the SCOM server, I'm sure of it.

Next step: let's check if all our SPNs are correct.

setspn -L scomservername

Hey, wait a second, I see an entry like this:


Now this SCOM server is installed with the setting that the SDK service is running using a domain account. So this SPN should not be registered to the server itself but to the service account in the domain.

setspn -L domain\sdkserviceaccount

Sure enough, the MSOMSdkSvc entry for the mentioned server is not there on this service account.

Alright, we can not place the correct SPN until we remove the wrong one, so we first delete the wrong ones.

setspn -d MSOMSdkSvc/scomservername scomservername
setspn -d MSOMSdkSvc/ scomservername

Next we enter the SPNs on the service account:

setspn -s MSOMSdkSvc/scomservername domain\serviceaccount
setspn -s MSOMSdkSvc/ domain\serviceaccount

And we check our results again with the setspn -L command.
Looks fine now.
Try again.
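
When you have to compare the setspn -L output for several accounts, filtering it for one service class saves some squinting. The Python sketch below assumes the usual output layout of a header line followed by one indented SPN per line. The function name and all names in the sample text are made up for illustration.

```python
def spns_for_service(setspn_output, service_class="MSOMSdkSvc"):
    """Return the SPNs of one service class from `setspn -L` output text."""
    prefix = service_class.lower() + "/"
    return [
        line.strip()
        for line in setspn_output.splitlines()
        if line.strip().lower().startswith(prefix)
    ]


# Hypothetical sample output; the account and server names are placeholders.
SAMPLE_OUTPUT = """\
Registered ServicePrincipalNames for CN=sdkserviceaccount,OU=Service,DC=example,DC=local:
        MSOMSdkSvc/scomservername
        MSOMSdkSvc/scomservername.example.local
        HTTP/someotherservice
"""

for spn in spns_for_service(SAMPLE_OUTPUT):
    print(spn)
```

Run it against the output for the computer account and for the service account: the MSOMSdkSvc entries should show up on only one of them.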

It must be the certificate somehow.
Open MMC Certificates and check the computer certificate. Is it valid, is it trusted, is it for the right purposes, does it have the correct name... Yes.
momcertimport it again... only 1 certificate to choose from and it's the same one. Restart the Microsoft Monitoring Agent service afterwards.


Wait a second. Let me check in the registry for this certificate. What momcertimport does is not that difficult. It grabs two properties of the certificate and creates two registry values from them for SCOM to use.

Aha! NO registry values!

Looking in this key there must be two entries relating to the certificate:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings

Alright, so I will create them manually!
What you do is open the properties of the certificate. You need the Thumbprint and the SerialNumber.

Create a New -> String Value
Name it: ChannelCertificateHash
Copy and paste the Thumbprint contents into it and remove the spaces in between

Create a New -> Binary Value
Name it: ChannelCertificateSerialNumber
Now go to the properties of the certificate and click the Serial Number. It's again a string of numbers and letters in pairs of 2. What you need to do is fill in the pairs of 2 in the registry Binary value IN REVERSE.
Original serial number in certificate = 68 00 AB CD 69 00 23
What you enter in Binary field = 23 00 69 CD AB 00 68
So the pair of 2 characters stays the same, but the order of the pairs in the total string is reversed.
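
Reversing the pairs by hand is easy to get wrong, so here is a small Python sketch that prepares both values from what you copy out of the certificate properties. The function names are made up, and the thumbprint in the example is a dummy value. Keeping only the hex digits removes the spaces and also any invisible character that sometimes comes along when copying from the certificate dialog.

```python
import string


def thumbprint_to_hash(thumbprint):
    """Value for ChannelCertificateHash: the thumbprint without spaces."""
    return "".join(ch for ch in thumbprint if ch in string.hexdigits)


def serial_to_registry_bytes(serial):
    """Bytes for ChannelCertificateSerialNumber: the serial number pairs in reverse order."""
    digits = "".join(ch for ch in serial if ch in string.hexdigits)
    return bytes.fromhex(digits)[::-1]


def as_pairs(data):
    """Format bytes as space separated pairs, as shown in the registry editor."""
    return " ".join(f"{byte:02X}" for byte in data)


# The serial number example from this post.
print(as_pairs(serial_to_registry_bytes("68 00 AB CD 69 00 23")))

# Dummy thumbprint, only to show the spaces being removed.
print(thumbprint_to_hash("ab cd ef 01 23 45"))
```

The helpers only prepare the values. Typing them into the registry, or scripting that part, is still up to you.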

Next I restarted the SCOM services.

Within the minute it started saying that: A device which is not part of this management group has attempted to access this Health Service.
Those were the DMZ machines which just keep trying again and again!


In the end it will have been the certificate rather than the SPN record which messed it up, but at least I could show what things I checked. When the SPN came up I just fixed it as well. In the end it WAS the certificate, even though I felt that it was alright. Well, when in doubt and ALL untrusted agents refuse to talk to this machine while all trusted ones have no issue... triple-check the certificate and whether SCOM is actually using it!

Have fun monitoring!
Bob Cornelissen

How to make a SCOM implementation project successful

SCOM, System Center, SCOM Tricks, SCOM 2012, SCOM 2016

I thought I would take a different approach to thinking about how to make a SCOM monitoring project a success. It is not about technical details or designs this time, but about a way to bring business and IT together into monitoring business related services and being in control of those processes. In a short blog post below I am touching upon some of those items.

Enjoy!
Bob Cornelissen

©2017 by Bob Cornelissen.