Ran into a customer issue today whereby there was a nice clean SCOM 2012 R2 installation with UR’s. Certificates arranged and momcertimport ran. On the agent machines in DMZ we had the agent installed, UR on it, certificate root imported, certificate meant for computer imported. momcertimport ran to get the correct certficate running. Yet no communication at all between agent and server. This is what I found:
So first checks are:
- does the agent machine have the certificate for the name of the server (which in workgroup can be the short name and in a dmz domain a fully qualified name)? Yes
- does the agent machine trust the CA which issued the certificate? (in this case a customer own CA, so the root chain cert was imported). Yes
- can the agent resolve the SCOM server name you used while configuring the agent? Yes
- Is the management group name we used in configuring the agent correct (case sensitive!)? Yes
- Is there a firewall blocking TCP 5723 from agent to SCOM server? Yes! OK this was fixed quickly, and verified with telnet. Still no communication! Moving on.
- On the SCOM server did we import the CA root chain as trusted and did momcertimport run on the correct machine certificate with the correct FQDN for that server? Yes
- restart healthservice on both sides… Yes. No effect
Man usually its name resolving, firewall and routing, certificate with wrong name, no certificate, or not trusted certificate. Pffff.
Something must be wrong with the SCOM server, I’m sure of it.
Next step, lets check out if all our SPN’s are correct.
setspn -L scomservername
He wait a second, I see an entry like this:
Now this SCOM server is installed with the setting that the SDK service is running using a domain account. So this SPN should not be registered to the server itself but to the service account in the domain.
setspn -L domain\sdkserviceaccount
Sure enough the entry is not here for MSOMSdkSvc on this service for the mentioned server.
ALright, now we can not place thie correct SPN for this until we remove the wrong one. so we first delete the wrong ones.
setspn -d MSOMSdkSvc/scomservername scomservername
setspn -d MSOMSdkSvc/scomservername.domain.com scomservername
Next we enter the SPNs on the service account:
setspn -s MSOMSdkSvc/scomservername domain\serviceaccount
setspn -s MSOMSdkSvc/scomservername.domain.com domain\serviceaccount
And we check our results again with the setspn -L command.
Looks fine now.
It must be the certificate somehow.
Open MMC Certificates, check the computer certificate. Is it valid, is it trusted, is it for the right purposes, does it have the correct name… Yes.
momcertimport it again.. only 1 certificate to chose from and its the same one. Restart the Microsoft Management Agent service afterwards.
Wait a second. Let me check in the registry for this certificate. What Momcertimport does is not that difficult. It grabs two properties of the certificate and creates two registry keys for it for SCOM to use.
Aha! NO registry values!
Looking in this key there must be two entries relating to the certificate:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings
Alright, so I will create them manually!
What you do is open the properties of the certificate. You need the Thumbprint and the SerialNumber.
Create a New -> String Value
Name it: ChannelCertificateHash
Copy and paste the Thumbprint contents into it and remove the spaces in between
Create a New -> Binary Value
Name it: ChannelCertificateSerialNumber
Now go to the properties of the certificate and click the Serial Number. Its again a string of numbers and letters in pairs of 2. What you need to do is fill in the pairs of 2 in the registry Binary value IN REVERSE.
Original serial number in certificate = 68 00 AB CD 69 00 23
What you enter in Binary field = 23 00 69 CD AB 00 68
So the pair of 2 characters stays the same, but the order of the pairs in the total string is reversed.
Next I restarted the SCOM services.
Within the minute it started saying that: A device which is not part of this management group has attempted to access this Health Service.
Those were the DMZ machines which just keep trying again and again!
In the end it will have been the certificate rather than the SPN record which messed it up, but at least I could show what things I checked. When the SPN came up I just fixed it as well. In the end it WAS the certificate eventhough I felt that it was alright. Well when in doubt and ALL untrusted agents refuse to talk to this machine, and all trusted ones have no issue… triple-check the certificate and if SCO is actually using it!
Have fun monitoring!