Recently I had a customer where the SCOM web console application pool kept crashing every 15 minutes (on 2 servers in this case). This was a SCOM 2016 instance on Windows Server 2012 R2.
The error message we got was (the process id is a different number each time):
A process serving application pool 'OperationsManagerMonitoringView' terminated unexpectedly. The process id was '1111'. The process exit code was '0xc0000005'.
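This is a bit of a generic error code: 0xc0000005 is an access violation (STATUS_ACCESS_VIOLATION), which by itself does not tell you much about the cause.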
Looking at the application pool that kept crashing, we saw it was running under the security context of "ApplicationPoolIdentity".
In this environment there are several policies in effect, and these were probably preventing this generic placeholder account from accessing some registry key or local path.
We changed the application pool identity to LocalSystem: open IIS Manager -> find the application pool -> click Advanced Settings on the right -> find Identity and use the dropdown to select LocalSystem. We could also have used an account that was already in use for another application pool on the server, but we went with this one first.
Recycle the application pool after this.
The crashes stopped happening from here. The SCOM web console was reachable.
Hope it helps somebody sometime.
One of my customers had a problem deploying SCOM agents to Linux servers through a script. They had a number of Red Hat 6 servers where all went well. On the Red Hat 7 servers, however, the agent refused to install, and pushing the agent from the console failed as well. It seemed to stop around the file copy stage, where the rpm file gets copied to the server and is then run for installation.
It turned out a feature called "root squash" was causing the issue. On NFS shared volumes it maps the root user of the client to an unprivileged user, so root cannot simply access files or run commands from every directory on such a volume, for instance under /home. When they turned off this feature the agent installed immediately.
Just writing this down because I am sure I will run into this again somewhere.
Happy agent deployment!
Now test your knowledge on SCOM/OMS/Azure and more through this quiz, for fun and for a chance to win a Band as well.
You can take the quiz by clicking on the picture or by using this link:
Test your knowledge on SCOM/OMS/Azure and more
As you may know I have been playing with OMS for a while, especially on the Log Analytics side and some security items. One of the solutions I added quickly was the Antimalware Assessment solution.
What the Antimalware Assessment does is first of all check if you are protected at all. It recognizes a number of antivirus products, and it also sees when a machine has nothing recognized apart from the last run of the Malicious Software Removal Tool, which comes with Windows Updates every month. And for System Center Endpoint Protection, for instance, it can pick up on threats.
Today I had a chance to also see that part in action
So I got the following email:
This does also name which machine is involved and such.
So I went to my OMS workspace and went into the Antimalware Assessment to find this:
From here we can see which machine was affected and also that the threat has been quarantined already. The second blade tells me what item was found and at what time.
If you click on the threat or the machine you will get to see the log entries leading to this. It features things like which files in which path were found and quarantined.
So I had a look at the machine giving the alert and sure enough, there it is:
So this gave me a possibility to confirm this does not belong there and remove it permanently. And of course make sure to run a full scan just to be sure.
So there you have it. Immediate value add by the OMS solution on top of what you have already.
Have fun and stay safe!
I was working with an old Windows 2008 R2 server last night. It needed a "few" updates!
So I will first admit to several of my own mistakes. I did not give myself time to update this machine regularly enough in the past, and of course we always have to install the Windows Updates on time. Waiting an extra month so that any problems introduced by one month's fixes are fixed in the next is something we all understand. But this was many months' worth of updates. I went the lazy way, which bit me as you will see below.
I was first interested in getting 1 specific update on the machine. So I selected that update and a random few other smaller updates. Now this is a mistake! It installed the updates and wanted a reboot. OK. Next thing that happens is that the machine starts up straight into a Blue Screen with code STOP 0x00000050 PAGE_FAULT_IN_NONPAGED_AREA, or in short a code 0x50. There was no way around this, for instance into safe mode. The only thing which popped up was the System Recovery Options shown below:
By the way, before you get to this screen it asks you for the Local Administrator password. Turns out even I did not remember it, but I got it in the end. Managing admin accounts, including local administrator accounts, is important to do. Watch Paula Januszkiewicz give you an example of why it is important here at one of the CQURE academy sessions about passing the hash.
Felt a little panic coming up at that point, because data loss or at least a lot of time fixing things can follow this action. Did not look like I could do much from here either. I did have backups of the data, so in time I would have restored it.
Another reason for the panic is that I was doing two systems at the same time and in the same way... and you guessed it... both with the same result!
A lot of googling followed, and there are a lot of videos explaining how to fix this FROM Windows! Problem is I was stuck in this System Recovery Options screen. The memory check did not show anything, by the way.
Well somewhere hidden in a comment of one of the threads (I can not find it!) was the suggestion that some previous hotfix might have hit one file and removing that file solved it for a few people.
In the picture above you can see a command prompt. Open that.
Next you need to find out which drive letter contains your Windows installation. System Recovery takes a drive letter for itself and moves the other drives to other letters. So I typed C: and Enter, ran DIR and knew this was not the drive. So I went to D: and did DIR again. Nope. Continued until I got it.
The file I am looking for is fntcache.dat in the Windows\System32 folder. This is the font cache file, and it is the file to delete. Do NOT touch the DLL file there. The DAT file is a cache and will be rebuilt by Windows after restart.
Now I exited the command prompt and restarted the server. It started into Windows again, just as I had hoped.
Next I still needed to do a select-all on the rest of the updates and install them all the same.
So keep in mind to update regularly + do not select half the updates but go for them all because there are fixes in there which fix issues created (or surfaced) by other fixes.
Now I can continue with actually replacing these servers, which was the plan to start with!
Today I was doing a quick installation of the Savision 8.2 Live Maps Unity Portal. Downloaded the self-extracting executable from the website and of course arranged a license key. While running the installer I selected the Express setup which just pushes the web portal onto the machine and not the other components available in the Advanced installation option. The installation ran in 2 minutes on a slow machine, and this is including the extracting of the files and running checks.
After installation the web page automatically opens up and I was greeted with the following error:
HTTP Error 500.19 - Internal Server Error
In the error description there is talk of a configuration section being locked at parent level.
Screenshot of the error:
What happened is that Windows Authentication is turned off at the server level and this configuration section is locked for the whole machine. The Live Maps Portal tries to read Authentication settings from its own configuration file, and because that section is locked at a higher level, IIS throws an error.
How to fix it:
Open IIS Manager
In the left menu select your server name
In the middle of the screen select Configuration Editor
Near the top of the Configuration Editor is a selection box for which section you want to see and edit.
Go to system.webServer/security/authentication/windowsAuthentication
In the right-hand menu you will find a link to Unlock Section. Click it to unlock this configuration item.
Now any lower level (Sites or Applications within a site) can have their own configuration for Windows Authentication.
Refresh the error page and the Live Maps Unity Portal came up fine!
Ran into a customer issue today with a nice clean SCOM 2012 R2 installation with URs applied. Certificates were arranged and momcertimport had been run. On the agent machines in the DMZ we had the agent installed with the UR on it, the root certificate imported and the computer certificate imported, and momcertimport had been run to select the correct certificate. Yet there was no communication at all between agent and server. This is what I found:
So first checks are:
- does the agent machine have the certificate for the name of the server (which in workgroup can be the short name and in a dmz domain a fully qualified name)? Yes
- does the agent machine trust the CA which issued the certificate? (in this case a customer own CA, so the root chain cert was imported). Yes
- can the agent resolve the SCOM server name you used while configuring the agent? Yes
- Is the management group name we used in configuring the agent correct (case sensitive!)? Yes
- Is there a firewall blocking TCP 5723 from agent to SCOM server? Yes! OK this was fixed quickly, and verified with telnet. Still no communication! Moving on.
- On the SCOM server did we import the CA root chain as trusted and did momcertimport run on the correct machine certificate with the correct FQDN for that server? Yes
- restart healthservice on both sides... Yes. No effect
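The telnet check for TCP 5723 in the list above can also be scripted, which is handy when you have a batch of DMZ agents to verify. A minimal sketch in Python, assuming Python is available on the agent side; the function name `can_connect` and the host name in the comment are placeholders, not anything SCOM ships with. It only proves that a TCP session can be opened, not that the certificates are fine.

```python
import socket


def can_connect(host, port=5723, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    5723 is the port mentioned in the checklist; host is whatever
    name the agent was configured with.
    """
    try:
        # create_connection resolves the name and opens the socket,
        # so a DNS failure also ends up as False here.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call (placeholder server name):
# print(can_connect("scomservername.domain.com"))
```

A False result can mean a firewall, a routing problem or a name that does not resolve, so it covers the name resolution check from the list at the same time.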
Man, usually it's name resolving, firewall and routing, certificate with wrong name, no certificate, or not trusted certificate. Pffff.
Something must be wrong with the SCOM server, I'm sure of it.
Next step, let's check if all our SPNs are correct.
setspn -L scomservername
Hey, wait a second, I see an entry like this:
Now this SCOM server is installed with the setting that the SDK service is running using a domain account. So this SPN should not be registered to the server itself but to the service account in the domain.
setspn -L domain\sdkserviceaccount
Sure enough, the MSOMSdkSvc entry for the mentioned server is not there on this service account.
Alright, we cannot place the correct SPN until we remove the wrong one, so we first delete the wrong ones.
setspn -d MSOMSdkSvc/scomservername scomservername
setspn -d MSOMSdkSvc/scomservername.domain.com scomservername
Next we enter the SPNs on the service account:
setspn -s MSOMSdkSvc/scomservername domain\serviceaccount
setspn -s MSOMSdkSvc/scomservername.domain.com domain\serviceaccount
And we check our results again with the setspn -L command.
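If you need to do this for more than one server, the delete and add commands above follow a fixed pattern. A small sketch in Python that only builds the command lines as text, so you can review them before running anything; `spn_fix_commands` and its parameter names are made up for this example.

```python
def spn_fix_commands(short_name, fqdn, service_account,
                     service_class="MSOMSdkSvc"):
    """Build the setspn commands to move the SDK SPNs.

    The SPNs are first deleted from the computer account and then
    registered on the service account, for both the short name and
    the FQDN. Nothing is executed here.
    """
    spns = [f"{service_class}/{short_name}", f"{service_class}/{fqdn}"]
    deletes = [f"setspn -d {spn} {short_name}" for spn in spns]
    adds = [f"setspn -s {spn} {service_account}" for spn in spns]
    return deletes + adds


for line in spn_fix_commands("scomservername",
                             "scomservername.domain.com",
                             r"domain\serviceaccount"):
    print(line)
```

Printing instead of executing is deliberate: SPN changes are worth a second pair of eyes before they go into the domain.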
Looks fine now.
It must be the certificate somehow.
Open MMC Certificates, check the computer certificate. Is it valid, is it trusted, is it for the right purposes, does it have the correct name... Yes.
momcertimport it again... only 1 certificate to choose from and it's the same one. Restart the Microsoft Monitoring Agent service afterwards.
Wait a second. Let me check in the registry for this certificate. What Momcertimport does is not that difficult. It grabs two properties of the certificate and creates two registry keys for it for SCOM to use.
Aha! NO registry values!
Looking in this key there must be two entries relating to the certificate:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings
Alright, so I will create them manually!
What you do is open the properties of the certificate. You need the Thumbprint and the SerialNumber.
Create a New -> String Value
Name it: ChannelCertificateHash
Copy and paste the Thumbprint contents into it and remove the spaces in between
Create a New -> Binary Value
Name it: ChannelCertificateSerialNumber
Now go to the properties of the certificate and click the Serial Number. It's again a string of numbers and letters in pairs of 2. What you need to do is fill in the pairs of 2 in the registry Binary value IN REVERSE.
Original serial number in certificate = 68 00 AB CD 69 00 23
What you enter in Binary field = 23 00 69 CD AB 00 68
So the pair of 2 characters stays the same, but the order of the pairs in the total string is reversed.
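Reversing the pairs by hand is easy to get wrong on a long serial number, so here is a minimal sketch in Python that prepares both values. The function names are my own and it only produces the text to type or paste into the registry editor; it does not write to the registry.

```python
import string


def thumbprint_to_hash(thumbprint):
    """Value for ChannelCertificateHash: the thumbprint without spaces.

    Any other non-hex character that tags along when copying from the
    certificate dialog is dropped as well.
    """
    return "".join(c for c in thumbprint if c in string.hexdigits)


def serial_to_registry_pairs(serial):
    """Value for ChannelCertificateSerialNumber: byte pairs in reverse.

    bytes.fromhex ignores the spaces between the pairs; [::-1] reverses
    the order of the bytes while each pair itself stays the same.
    """
    reversed_bytes = bytes.fromhex(serial)[::-1]
    return " ".join(f"{b:02X}" for b in reversed_bytes)


print(serial_to_registry_pairs("68 00 AB CD 69 00 23"))
# prints: 23 00 69 CD AB 00 68
```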
Next I restarted the SCOM services.
Within the minute it started saying that: A device which is not part of this management group has attempted to access this Health Service.
Those were the DMZ machines which just keep trying again and again!
In the end it will have been the certificate rather than the SPN record which messed it up, but at least I could show what things I checked. When the SPN came up I just fixed it as well. So it WAS the certificate, even though I felt that it was alright. When in doubt, and ALL untrusted agents refuse to talk to this machine while all trusted ones have no issue... triple-check the certificate and whether SCOM is actually using it!
Have fun monitoring!
I thought I would take a different approach to thinking about how to make a SCOM monitoring project a success. It is not about technical details or designs this time, but about a way to bring business and IT together into monitoring business related services and being in control of those processes. In a short blog post below I am touching upon some of those items.
Last week I installed a fresh WSUS server for a customer of mine and because it needed to download lots of files after the approvals were done we left it for a few days. Today I came in and opened the WSUS console only to notice it refused to connect. Got an error like this one:
The WSUS administration console was unable to connect to the WSUS Server via the remote API.
Verify that the Update Services service, IIS and SQL are running on the server. If the problem persists, try restarting IIS, SQL, and the Update Services Service.
The WSUS administration console has encountered an unexpected error. This may be a transient error; try restarting the administration console. If this error persists, Try removing the persisted preferences for the console by deleting the wsus file under %appdata%\Microsoft\MMC\.
System.IO.IOException -- The handshake failed due to an unexpected packet format.
After checking that the required services were running, the investigation started. There are lots of blog and forum posts, from long ago to recent, all with different solutions.
I came across a post from 6 weeks or so ago which talks about update KB3148812, which causes this behavior and also causes an additional error where clients cannot scan against WSUS.
Now I could not find this KB patch installed on my system, however it mentioned manual steps to be done after applying the hotfix and those manual steps solved it indeed. Keep reading.
A little more research showed that KB3148812 has since been withdrawn and that KB3159706 came in its place.
This article describes what is going on and it contains manual steps to be followed! The first step solved the console not being able to connect. The second step is for HTTP Activation. And if you have SSL turned on there are a few more steps to follow.
In my previous post which introduced SCOM 2016 Features - Network Monitoring MP Generator I have shown you how to use the command syntax of the tool and why it was created. Now it is time for an example.
Have fun monitoring some network device and see how the principles of the input XML file work.
I have also been doing a few presentations with a SCOMosaur theme, where we combine a little SCOM with a little dinosaur madness. You will see a few references to that here and there.
Mind that I am using a simulated device which may not be a perfect fit for this purpose. The reason is that the default simulated devices in the Jalasoft SNMP Device Simulator are all CERTIFIED, and we are of course creating monitoring for non-certified devices. The OIDs in the example below are from an APC UPS device, but for now it serves clearly enough as an example.
- First of all I am using SCOM 2016 TP5 here, which is the first version to include this feature.
- I am using Jalasoft SNMP Device Simulator on another machine to simulate a few network devices of different types.
- Of course make sure both sides can reach each other with ping (ICMP) and SNMP.
- I am using iReasoning MIB Browser to browse the SNMP tree on the device selected to determine we actually have data there and the right OID's.
Next on the list is to discover the devices in SCOM by creating a Device Discovery and adding the device IP addresses and SNMP community string to it and letting SCOM discover the devices.
The XML input file
Actually the idea here is relatively the same as a simple management pack setup.
- A manifest with management pack name and version
- A Device definition
- A Device discovery
- Device Components
- Device Component Discovery
- Rules (these are collection rules)
Starting the Manifest
First we are going to define the start to the input file by the Root tag.
Next we define the Display Name and Version for the management pack.
Name and Version are mandatory and an optional tag is KeyToken.
Device Definition and Discovery
The next thing to do is create an entry for each type of device and to make a device discovery for it.
First we define a name for the device.
Next we jump into a discovery for it.
The discovery covers the SysObjId tag which points to the unique device identifier for the device type.
Next we have to specify a device type. The following types are supported for now: Switch, Router, Firewall, LoadBalancer.
Next fill out the Vendor and Model.
Components and Discovery
Now it is time to look into the components of the device. For example Processors or Fans. After we discover those we can target monitors and rules to those components in order to monitor them.
We are opening the Components tag here, and it will be closed all the way at the end of the story.
Next we define our first component.
There are a few component types supported at this moment: Processor, Memory, Fan, Voltage Sensor, Power Supply, Temperature Sensor.
And we give it a name of course.
Now we define the OIDs we are interested in. These OIDs will have to be there for each instance of the Component we define. One of these will be used in the discovery of the component and the same one and/or others we can use for rules and monitors. At least we have defined all of them here and given them original names.
We do not have to enter the index number of each component instance. For example...
fan1 = 1.3.6.1
fan2 = 1.3.6.2
fan3 = 1.3.6.3
In the very short OID example above you can see the last number is the index number for each fan. So we only need to specify 1.3.6 in this case and the discoveries will find each instance for you.
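To make the base OID versus index idea concrete, here is a small sketch in Python. It is not how the generated discovery works internally; it just splits instance OIDs into the part you put in the input file and the index that is found for you. The function names are my own, and the OIDs are the shortened illustration values from the text, not real device OIDs.

```python
from typing import Dict, List, Tuple


def split_instance_oid(oid: str) -> Tuple[str, int]:
    """Split an instance OID into (base OID, index).

    The index is the last number; everything before it is the base.
    """
    base, _, index = oid.rpartition(".")
    return base, int(index)


def group_instances(oids: List[str]) -> Dict[str, List[int]]:
    """Group instance OIDs by base OID and list the indexes found."""
    groups: Dict[str, List[int]] = {}
    for oid in oids:
        base, index = split_instance_oid(oid)
        groups.setdefault(base, []).append(index)
    return {base: sorted(indexes) for base, indexes in groups.items()}


print(group_instances(["1.3.6.1", "1.3.6.2", "1.3.6.3"]))
# prints: {'1.3.6': [1, 2, 3]}
```

One base OID with three indexes means three fan instances, which is why only the base goes into the input file.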
In this case I named the component the Tricera Environment and gave it a Processor type, just because it needs to conform to the default types at this moment.
The 3 OIDs used are a Temperature OID, a Usage OID (which happens to be the percentage of battery left for the UPS), and an overall state indicator OID for this component.
For the step coming after this, it means we have two performance counters we can collect (but I will collect all three in the example), and also we can create state monitors based on the values.
Lastly the ComponentDiscovery is a pointer to which of the already defined OIDs is a component indicator. In this case I use the state indicator OID. If that one is there (with an index number behind it) an instance of the component will be created or as many as needed.
Monitoring and Rules
Alright now the monitoring needs to start for the component we are still at.
For starters we set the Monitoring tag. We will close that tag later after we have defined all rules and monitors.
Next we start with the rules:
We open the Rules tag and next define the performance collection rules as you see here. I used short names for it and pointed each rule to the name of the OID we defined already. See how easy that part is?
Lets go to the monitors now...
First again we start it off with the Monitors tag which we will close off after the last monitor we add.
Alright, first UnitMonitor. We give it a name. In this case Triceratops Environment Status.
It is a two state monitor so we define two expressions.
Both of them point (in black letters in the middle here) to the name of the OID containing the state indication.
The first expression is for success (green state) and uses 2 or less. And the second expression uses anything higher than 2 to set it to an error state.
So I repeated that two more times. The second monitor is for the Temperature and uses 30 degrees as the maximum acceptable value, otherwise our dino gets sunburn.
And the third monitor is using the TriEnvUsage OID to determine if it is at 100 or below.
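The logic of the three two-state monitors boils down to one comparison per OID. A minimal sketch in Python of that logic only; this is not the input XML and not what the generated management pack looks like. `TriEnvUsage` is the OID name used above, while `TriEnvState` and `TriEnvTemp` are names I made up for the other two, and I am assuming "at or below the limit" is the healthy side for all three.

```python
# Highest value that still counts as healthy, per OID name.
# TriEnvState and TriEnvTemp are hypothetical names for this sketch.
HEALTHY_MAX = {
    "TriEnvState": 2,    # state indicator: 2 or less is fine
    "TriEnvTemp": 30,    # temperature: 30 degrees is the maximum
    "TriEnvUsage": 100,  # usage: 100 or below
}


def two_state(value, healthy_max):
    """Return the state of a two-state monitor for one sampled value."""
    return "Healthy" if value <= healthy_max else "Error"


def evaluate(samples):
    """Evaluate every sampled OID value against its threshold."""
    return {name: two_state(value, HEALTHY_MAX[name])
            for name, value in samples.items()}


print(evaluate({"TriEnvState": 2, "TriEnvTemp": 31, "TriEnvUsage": 100}))
# prints: {'TriEnvState': 'Healthy', 'TriEnvTemp': 'Error', 'TriEnvUsage': 'Healthy'}
```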
And now as promised we close the whole load of tags off:
The conversion process
Alright we now have an XML input file with all the stuff we need. Now we need to use the Network Monitoring MP Generator tool to convert the input file to a management pack XML file.
Open a command prompt and go to
%ProgramFiles%\Microsoft System Center 2016\Operations Manager\
I placed my input file in the folder C:\SCOMosaur with file name dinos.xml and I will allow the output file to be written to that folder as well.
I run the command:
NetMonMPGenerator.exe -InputFile "C:\SCOMosaur\dinos.xml" -OutputDir "C:\SCOMosaur"
The program will let you know if there are any errors and it will confirm if it finished creating the management pack file.
From here you simply import the management pack and as usual wait a little bit.
Well it is a lot easier to create this input file with the basics we need to be monitoring the custom device. The total input XML file was about 60 lines if we take away the empty lines. The resulting management pack was 690 lines long.
There will be a complete example coming from the product team very soon now, including comments in the file and such. This is just a quick starter to help you play with this feature.
This is meant to get NOT Certified devices in a more complete monitoring state as if it were CERTIFIED. As you have seen the device types and component types are for the moment a limited set.
My idea around this feature is that the possibilities might still expand in due time to be more and more flexible. It would also be nice to see a graphical interface to build up the input XML, which could then immediately build the management pack as well. However, those kinds of things take a lot of time to build. I consider the current solution a nice go-between.
Back to the SCOM 2016 Features - Overview post!
Hope you all have fun!