Troubleshooting Dynamic Agent down or unlinked issues

1/15/2019

Although Dynamic Agents have been around since TWS 8.6, the procedures and checks for troubleshooting them are still not clear to the majority of Workload Automation administrators.

This blog post provides a detailed background on how to approach troubleshooting Dynamic Agent down or unlinked issues.

A Dynamic Agent communicates with the Master Domain Manager (MDM), or with a Remote Gateway if one is configured, through HTTP GET and POST requests made with curl. The Dynamic Agent sends resource information to the Resource Advisor URL and receives job submit commands either directly from the broker (direct connection) or from the MDM through the Remote Gateway.

The Agent posts job status information to the JobManagerGWURL or directly to the broker.

Functions of a Gateway

Get actions from the broker at <JobManagerGW URL>/actions/GWID
Send resource information to <Resource Advisor URL>
Perform component commands on <Agent's JobManagerURL>/ComponentCommand
Send Gateway action queue information to <JobManagerGW URL>/result/TenantPrefixGWID-<Job ID>

Functions of an Agent

Send resource advisor information to <Resource Advisor URL> (either Broker direct or Remote Gateway)
Send job status notification to <JobManagerGWURL> in case of Remote GW
Send job status notification to <IHS Server>:JM Port/ita/JobManagerGW/JobManagerRESTWeb/JobScheduler/job in case of Local GW
Perform submit action on <Agent's URL>:JM Port/ita/JobManager/job

Troubleshooting Agent Unlink or Agent Down Issues
When a Dynamic Agent unlinks or goes down, the first check is on the Agent side: the IWS administrator runs ps -ef | grep "<TWSUser>" and looks for the JobManager, Agent and JobManagerGW processes.

If there is more than one Agent, JobManager or JobManagerGW process, and some of them carry an older time stamp, the previous shutdown and start of the Agent was not clean.

In this case, the best course of action is to restart the Agent. On Windows, look for the same processes (JobManager, JobManagerGW and Agent) in Task Manager.

If any of these processes is down or shows an older time stamp, the previous shutdown and start of the Agent was not clean, so again the best course of action is to restart the Agent.
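On Unix, the duplicate-process check described above can be scripted. The following is only a minimal sketch: the helper name count_procs_named is hypothetical, and it assumes the executables show up in the process list as JobManager, JobManagerGW and agent, which you should confirm on your own system. It reads "PID ELAPSED COMMAND" lines, so the elapsed time of each process stays visible when you inspect the raw listing.

```shell
#!/bin/sh
# Hypothetical helper: read "PID ELAPSED COMMAND" lines on stdin and print
# how many of them have a command whose base name equals the first argument.
count_procs_named() {
    awk -v n="$1" '
        {
            exe = $3
            sub(/.*\//, "", exe)   # drop any leading directory path
            if (exe == n) c++
        }
        END { print c + 0 }        # print 0 when nothing matched
    '
}
```

A possible use is ps -u "<TWSUser>" -o pid= -o etime= -o comm= | count_procs_named JobManager. A result greater than 1 would suggest leftover processes from an unclean shutdown.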

Unix

cd <TWSHome>
./ShutDownLwa
Run ps -ef | grep <TWSUser>. If there is a leftover process, issue kill -9 on the process ID in question.
Issue ./StartUpLwa to start the Agent cleanly.
Run ps -ef | grep <TWSUser> to verify that all processes came up cleanly.

Windows

Open a Windows command prompt in administrative mode.
cd <TWSHome>
Issue ShutDownLwa.cmd on the Agent.
Open Task Manager and ensure all processes are stopped: JobManager, JobManagerGW and Agent Processes.
Issue StartUpLwa.cmd on the Agent.
Open Task Manager and ensure all processes have started cleanly: JobManager, JobManagerGW and Agent Processes.

Agent up locally but not reflecting on MDM
In many cases, the JobManager, Agent and JobManagerGW processes may be up locally while the Agent is not shown as up on the MDM.

In such cases, review the JobManager_message*log and JobManagerGW_message*log files located under the stdlist/JM directory on the Agent.

If a remote Gateway is involved, also review the JobManager_message*log and JobManagerGW_message*log files located under the stdlist/JM directory on the GW.
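To narrow down a long message log, you can filter for error-level messages. This is a rough sketch under an assumption: it treats any message ID of the form "AWS", letters, digits, then a trailing "E" (like the AWSITA104E message quoted later in this post) as an error. The helper name find_error_msgs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print lines that carry an error-level message ID.
# Assumption: error IDs look like AWSITA104E (AWS + letters + digits + E).
# With no file arguments, grep reads standard input.
find_error_msgs() {
    grep -E 'AWS[A-Z]+[0-9]+E' "$@"
}
```

For example, from <TWSHome> you could run find_error_msgs stdlist/JM/JobManager_message*log stdlist/JM/JobManagerGW_message*log and read the matching lines first.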

Review the communication from Agent to Gateway and from Gateway to MDM carefully.

If no remote Gateway is involved, review the communication from Agent to MDM. A good test is to submit an ad hoc job to the Agent. The job log of the ad hoc job shows whether the job was submitted to the Agent and whether it started executing. If the job was submitted but did not start executing, there may be a communication problem from Agent to MDM or from GW to MDM, depending on whether a Remote Gateway is involved.

In this case, the job hangs in the WAIT+ state and does not progress.

You can also use a curl command from the Agent to verify that it can connect to the GW (when a remote GW is involved):

curl https://RemoteGW:JMPort/JobManagerRESTWeb/JobSchedulerGW/resourceGW -u user:password

If there are problems between GW and MDM, the communication from GW to MDM can be tested in the same way:

curl -k https://MDM:JMPort/JobManagerRESTWeb/JobSchedulerGW/resourceGW -u user:password

The communication from the GW to the Agent can also be tested as follows:

curl -k https://Agent:JMPort/ita/JobManagerGW/JobManagerRESTWeb/JobScheduler/job -u user:password

If the curl command fails, note the curl error and pursue further troubleshooting:

curl error 7 (failed to connect) can indicate an issue with the bi-directional communication between MDM and GW. If a firewall is involved, this translates to a need for a firewall rule change.

If the communication is good but you neither pass the certificate nor use -k, you will still get curl error 60 (certificate verification failed).

curl error 28 (operation timed out) can indicate a network problem, as it shows up when there is a network timeout.

curl error 18 (partial transfer) indicates that the amount of data delivered did not match the amount announced. This situation might again indicate a network problem.

curl error 52 (empty reply) indicates that nothing was returned from the server, again indicating a network problem.
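The error codes in this section can be collected into a small lookup. The sketch below is illustrative only: the function names explain_curl_exit and probe_url are hypothetical, the hint texts simply restate this post, and curl error 35 is included because it is covered right after this.

```shell
#!/bin/sh
# Hypothetical helper: map a curl exit code to the hint given in this post.
explain_curl_exit() {
    case "$1" in
        0)  echo "OK: curl completed without error" ;;
        7)  echo "failed to connect: check firewall rules between the hosts" ;;
        18) echo "partial transfer: possible network problem" ;;
        28) echo "timeout: possible network problem" ;;
        35) echo "SSL connect error: check the certificates" ;;
        52) echo "empty reply from server: possible network problem" ;;
        60) echo "certificate not verified: host is reachable" ;;
        *)  echo "unlisted curl exit code: $1" ;;
    esac
}

# Hypothetical wrapper: probe a URL quietly and report the exit code.
# $1 is the URL, $2 is the user:password pair passed to curl -u.
probe_url() {
    curl -k -s -o /dev/null -u "$2" "$1"
    rc=$?
    echo "curl exit $rc: $(explain_curl_exit "$rc")"
    return "$rc"
}
```

For example, probe_url https://RemoteGW:JMPort/JobManagerRESTWeb/JobSchedulerGW/resourceGW user:password prints the exit code next to its hint. Because the wrapper passes -k, it will not report error 60.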

For curl error 35 (SSL connect error), the following causes could be relevant:

Certificate issue on the DA. The key file containing the certificates used to communicate with the MDM or GW is named in the ita.ini file by the key_db_name parameter. The default value is TWSClientKeyStore. The file is found here: TWA/TWS/ITA/cpa/ita/cert/TWSClientKeyStore.kdb. To evaluate the kdb file, use gsk8capicmd.

$ cd <TWAHome>/TWS
$ . ./tws_env.sh
$ cd ITA/cpa/ita/cert
$ gsk8capicmd -cert -list all -db TWSClientKeyStore.kdb -pw default
# The output from that command returned saasclient.
# To list details of the active certificates, use the following commands:
$ gsk8capicmd -cert -details -db TWSClientKeyStore.kdb -pw default -label client
$ gsk8capicmd -cert -details -db TWSClientKeyStore.kdb -pw default -label server

If there is a problem with the client certificate, there will be a problem with SSL communication from the MDM or GW to the DA. There will be entries in the log file TWA/TWS/stdlist/JM/ITA_trace.log like the following:

9. net_server: net_accept_ssl: error 127507: Error while accepting SSL connection
or
10. net_server: net_accept_ssl: error 127503: Error while initializing SSL context

Also, compare the certificates on the Dynamic Agent and the GW. If they do not match, export the certificates from the GW to the Agent and then try to link the Agent again. You can also verify the certificate using the md5sum or cksum command.

If the command output matches on the Remote GW and the Agent, the certificate is good. If not, and the Agent alone is unlinked with curl error 35 while the GW is fine, export the certificate from the Gateway to the Agent.
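If you have a copy of both files on one host, the cksum comparison can be sketched as below. This is an assumption-laden example: it assumes the file being compared is the key database (for instance TWSClientKeyStore.kdb from the Agent and a copy taken from the GW), and the helper name same_cksum is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when two files have the same
# cksum CRC and size, fail with 1 when they differ, and with 2 when
# either file cannot be read.
same_cksum() {
    a=$(cksum < "$1") || return 2
    b=$(cksum < "$2") || return 2
    [ "$a" = "$b" ]
}
```

For example, same_cksum TWSClientKeyStore.kdb /tmp/gw_TWSClientKeyStore.kdb && echo "match", where the second path is a placeholder for wherever you copied the GW file.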

Also check that the permissions on TWSClientKeyStore.kdb are 755.
Check also that the owner of the file TWSClientKeyStore.kdb is TWSUser and TWSGroup.
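Both checks can be combined in one test. The following is a minimal sketch, assuming a POSIX find is available. The helper name check_kdb_perms is hypothetical, and the owner and group you pass in are whatever TWSUser and TWSGroup are on your system.

```shell
#!/bin/sh
# Hypothetical helper: succeed when FILE has mode 755 exactly and is owned
# by the given user and group. find prints the path only if every test
# matches, so an empty result means at least one check failed.
check_kdb_perms() {
    file="$1"
    owner="$2"
    group="$3"
    [ -n "$(find "$file" -perm 755 -user "$owner" -group "$group" -print 2>/dev/null)" ]
}
```

For example, check_kdb_perms TWSClientKeyStore.kdb "<TWSUser>" "<TWSGroup>" || echo "fix permissions or ownership".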

The Agent has to be restarted after applying the above procedure to fix curl error 35.

Agent not coming up at all locally
If the Agent does not come up at all locally, analyze the errors in the JobManager_message*log and JobManager*trace.log files.

If the problem lies with the scanner (wscanhw) process, you may see a message similar to the following:

AWSITA104E Unable to perform the system resources scan. The error is "operation failed with error code -6".

This could indicate a permissions issue with the scanner, in which case you could check:

The other possibility is a problem with the scanner not completing, so you could run the following and re-check:

If the scanner does not complete, run ps -ef | grep "wscanhw" to check the scanner processes that were triggered. If there are many defunct or leftover scanner processes, go to /etc/cit and list the content of cit.ini. If the scanner version is 2.8.0.0001, you need an upgrade to 2.8.0.0005:

Contact PMR L2 Support or SaaS L2 Support for CIT upgrade instructions and for the CIT component package.

In case the agent does not come up locally and you are not able to locate the problem, you can also enable Dynamic Agent traces in order to troubleshoot further.

To enable tracing, modify the JobManager.ini and JobManagerGW.ini files located under <TWSHome>/ITA/cpa/config, updating the parameters below to the newer values shown:

After enabling Dynamic Agent traces, shut down the Agent by issuing ./ShutDownLwa or ShutDownLwa.cmd and start it again by issuing ./StartUpLwa or StartUpLwa.cmd.

After restarting the Dynamic Agent, you will end up with multiple trace files: up to 10 JobManager*trace* files and up to 10 JobManagerGW*trace* files, each 3072000 bytes (about 3 MB) in size. You can review these traces with PMR L2 Support or SaaS L2 Support for further assistance.

Sriram V.
Senior Technical Lead

Sriram has been working on Workload Automation for the last 10 years in various capacities like IWS Administrator, SME, India-SME, and now currently with the HCL product team supporting Workload Automation on SaaS.
