SCOM Discovery Wizard error while deploying Redhat agent

 
Last week I had some trouble deploying a SCOM cross-platform agent to a Redhat 4 machine. During discovery of the machine I got an error indicating that the file GetOSVersion.sh could not be transferred. Below I describe what happened, my understanding of this part of the process, and what the solution turned out to be in our case.
I will not dive into all the specifics of the discovery process, partly because our environment needed some specific steps to make discovery and agent deployment work.
Normally, the first step is to check the prerequisites in Appendix A of the Operations Manager 2007 R2 Operations Administrator’s Guide. In our case these were met. There might also be an underlying dependency on lsb (Linux Standard Base), as David Allen found here.
Next we created two accounts, one normal user and one privileged user. Keep in mind that these need to be added to the Accounts in SCOM and, under Profiles, to the Unix Action Account and Unix Privileged Account profiles.
While running the discovery wizard we used these accounts and a root password, and we selected SSH based discovery in the wizard (otherwise we could not get it to work, as we had also found with the Solaris boxes before). During the deployment process it uses su to elevate for some commands.
What the discovery wizard does at this point is try to resolve the host name of the Linux box to an IP address. It then tries to connect to the box over SSH using the credentials given. Next it needs to find out what kind of agent is needed, so it tries to deploy a script to the box. To do this it first deletes a temporary folder if it exists and right after that (re)creates it. The folder is named /tmp/scx-username, where the username part is what you entered in the discovery wizard. This command is run to do that:
rm -rf /tmp/scx-username; mkdir -m 755 /tmp/scx-username
(of course the username part is replaced by what you entered as user name).
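To repeat this step by hand, the same command can be sent over SSH from any machine with an OpenSSH client. The following is only a minimal sketch: the helper name scx_prepare_cmd, the user monuser and the host linuxbox are made up for illustration, and it assumes the user name contains no characters that need shell quoting.

```shell
#!/bin/sh
# Build the cleanup-and-create command described above for a given user name.
# scx_prepare_cmd is a hypothetical helper name, not something shipped with SCOM.
scx_prepare_cmd() {
    user="$1"
    printf 'rm -rf /tmp/scx-%s; mkdir -m 755 /tmp/scx-%s\n' "$user" "$user"
}

# Example use (needs network access to the target, so it is left commented out;
# monuser and linuxbox are placeholder names):
# ssh monuser@linuxbox "$(scx_prepare_cmd monuser)"
```

If this command fails when run manually with the monitoring account, the problem is in the SSH login or in permissions on /tmp rather than in the file transfer.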
Now the SCOM management server tries to push the file GetOSVersion.sh from its local location at “C:\Program Files\System Center Operations Manager 2007\AgentManagement\UnixAgents” to the newly created /tmp/scx-username directory. It uses SFTP for this.
Right after this step it verifies that the file exists on the Linux box, because SFTP does not give an acknowledgement that the file was transferred successfully.
At this point we got an error in the discovery wizard. The file was not there to be verified, so it had not been transferred to the temporary folder on the Linux box.
In troubleshooting we used several methods.
One very nice option is to enable logging for the cross-platform modules by creating an empty file named “EnableOpsmgrModuleLogging” in the system's temp directory (usually C:\Windows\Temp); log files will then start appearing there.
For instance, during the initial discovery you see SCXNameResolverProbe.log, which shows the name resolution process. Next, DeployFile.vbs.log told us that the deployment of the GetOSVersion.sh file was failing:
[2/18/2010 2:17:03 PM] Executing command: rm -rf /tmp/scx-username; mkdir -m 755 /tmp/scx-username
[2/18/2010 2:17:04 PM]
[2/18/2010 2:17:04 PM] Transferring file: C:\Program Files\System Center Operations Manager 2007\AgentManagement\UnixAgents\GetOSVersion.sh to location: /tmp/scx-username/
[2/18/2010 2:17:04 PM] Verifying that file: GetOSVersion.sh was transferred properly
[2/18/2010 2:17:05 PM] find: /tmp/scx-username/GetOSVersion.sh: No such file or directory
[2/18/2010 2:17:05 PM]
This was about the same message we got in the GUI (when clicking Details).
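If you collect these logs from several discovery attempts, a quick text search shows which ones hit this failure. A minimal sketch, assuming the log file has been copied to a machine with a POSIX shell; transfer_failed is a hypothetical helper name and the search string is simply taken from the log excerpt above:

```shell
#!/bin/sh
# Succeeds (exit status 0) if the given log contains the failed verification
# of GetOSVersion.sh, as seen in the DeployFile.vbs.log excerpt above.
# transfer_failed is a hypothetical helper; the log path is passed as $1.
transfer_failed() {
    grep -qF 'GetOSVersion.sh: No such file or directory' "$1"
}

# Example use (the file name is the one from this post):
# if transfer_failed DeployFile.vbs.log; then
#     echo "GetOSVersion.sh did not arrive on the target"
# fi
```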
At this point you would want it to find that file and run it in order to determine the Linux/Unix version of the machine. This is where the wizard shows you the machines, which operating system it has detected on each, and whether it has an agent for it, and then asks if you want to deploy the actual agent.
We also used the terrific DebugView tool from Sysinternals, which gave us good information about where in the process things stopped.
So if this is not working, what else can you test?
Well, you can try to use WinSCP to connect to the Linux box from the management server. In our case this worked like a charm. It should simply connect to OpenSSH on the server side, using SFTP on port 22.
I also resorted to contacting Robert Hearn, who by the way has created a very good blog post about the discovery troubleshooting process where he encountered networking related problems during discovery while the target had several network cards. We actually also had this situation, but it turned out not to be our problem.
Robert gave some insights into the discovery wizard process which can be found in the Unix Library management pack.
After a while I decided to try another SFTP client and used PuTTY's SFTP client. And to our surprise… it failed! The connection is made and the username and password are accepted, and then it fails immediately. So the problem was SFTP on the Linux side. SFTP was not turned on in the SSH config!
In /etc/ssh/sshd_config remove the comment marker (#) before the Subsystem line in order to enable it, like so:
# override default of no subsystems
Subsystem sftp /usr/libexec/openssh/sftp-server

After this run: service sshd restart
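To check a box for this setting before running discovery, you can search sshd_config for an uncommented Subsystem line. A minimal sketch: sftp_enabled is a hypothetical helper name, and it only matches the usual capitalization of the keyword as shown above.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the given sshd_config contains an uncommented
# "Subsystem sftp ..." line. A line starting with # does not match.
# sftp_enabled is a hypothetical helper name; the path defaults to the
# usual location of the file.
sftp_enabled() {
    config="${1:-/etc/ssh/sshd_config}"
    grep -Eq '^[[:space:]]*Subsystem[[:space:]]+sftp[[:space:]]' "$config"
}

# Example use on the Linux box itself:
# if sftp_enabled; then echo "sftp subsystem enabled"; else echo "sftp subsystem missing"; fi
```

Note that this only looks at the file on disk; sshd still has to be restarted before a change takes effect.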
One quick check in WinSCP revealed that it has a fallback connection option, checked by default, for when SFTP fails: it falls back to SCP. When creating a new connection in WinSCP you will see the checkbox “Allow SCP Fallback”; uncheck it to force WinSCP to use SFTP no matter what.

Now it worked like a charm: the wizard transferred the file, discovered what flavor of Linux it was and offered to install the agent.
So some tips (not all are needed in all cases):

  • Make sure to create separate accounts for monitoring purposes
  • Make sure you define them in SCOM and add them to the Unix Action Account and Unix Privileged Account profiles
  • A good check is to log in directly to the Linux machine with the SCOM privileged action account, using PuTTY or some other means, and to do a “su - root” with the root password to see if it works. This is also a good check of the passwords.
  • We used both the WinSCP and PuTTY SFTP clients to check if SFTP worked. But beware that WinSCP falls back to SCP if SFTP does not work. Uncheck the checkbox for fallback to SCP.
  • If SFTP does not work, enable it in /etc/ssh/sshd_config
  • Make sure name resolution works. Both machines should be able to resolve each other's names
  • In most cases discovery errors occur due to networking (name resolution, routes, firewalls) or credentials

A big thank you also to my Linux troubleshooting friends at the customer site (Jan, Oleksandr and Gregory) and to Robert Hearn for getting things clear!
Bob Cornelissen