WORKLOAD AUTOMATION COMMUNITY

Troubleshooting Dynamic Agent down or unlinked issues

1/15/2019

Although Dynamic Agents have been around since TWS 8.6, the procedures and checks for troubleshooting them are still unclear to most Workload Automation administrators.

This Blog aims to provide a detailed background on how to approach troubleshooting Dynamic Agent down or unlinked issues. 

A Dynamic Agent communicates with the Master Domain Manager (MDM), or with a Remote Gateway if one is configured, through curl GET and POST requests. The Dynamic Agent sends Resource Advisor information to the Resource Advisor URL and receives Job Submit commands either directly from the Broker (in case of a direct connection) or from the MDM through the Remote Gateway.

The Agent posts job status information to the JobManagerGWURL or directly to the Broker.

Functions of a Gateway 
  • Get actions from Broker <JobManagerGW URL>/actions/GWID 
  • Send resource information to <Resource Advisor URL> 
  • Perform component command on the <Agent's JobManagerURL>/ComponentCommand 
  • Send Gateway action queue information to <JobManagerGW URL>/result/TenantPrefixGWID-<Job ID> 

Functions of an Agent 
  • Send resource advisor information to <Resource Advisor URL> (either Broker direct or Remote Gateway) 
  • Send job status notification to <JobManagerGWURL> in case of Remote GW 
  • Send job status notification to <IHS Server>:JM Port/ita/JobManagerGW/JobManagerRESTWeb/JobScheduler/job in case of Local GW 
  • Perform submit action on <Agent's URL>:JM Port/ita/JobManager/job 

Troubleshooting Agent Unlink or Agent Down Issues 
When a Dynamic Agent unlinks or goes down, the first check is on the Agent side: the IWS Administrator runs ps -ef | grep "<TWSUser>" and looks for the JobManager process, the Agent process, and the JobManagerGW process.


If more than one Agent, JobManager, or JobManagerGW process is running, with some carrying an older time stamp, the termination during the previous shutdown and start of the Agent was not clean.
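The duplicate-process check described above can be sketched in shell. This is only a minimal sketch: the helper count_procs is hypothetical and not part of the product, and the command names JobManager, JobManagerGW and agent are assumptions (the last one in particular), so adjust them to what ps reports on your system. The helper expects one command name per line, as produced by ps -u <TWSUser> -o comm=.

```shell
# count_procs NAME
# Reads command names (one per line, optionally with a leading path) on stdin
# and prints how many of them are exactly NAME. An exact match is used so that
# JobManager does not also count JobManagerGW.
count_procs() {
    awk -v name="$1" '
        { cmd = $1; sub(/.*\//, "", cmd); if (cmd == name) n++ }
        END { print n + 0 }
    '
}

# Example (not run here): flag any component with more than one process.
# TWS_USER and the process names are assumptions for illustration.
#   for p in JobManager JobManagerGW agent; do
#       n=$(ps -u "$TWS_USER" -o comm= | count_procs "$p")
#       [ "$n" -gt 1 ] && echo "$p: $n processes, consider restarting the Agent"
#   done
```

A count above one for any of the names suggests that an earlier shutdown left a process behind.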


In this case, the best course of action is to restart the Agent. On Windows, check for the same processes (JobManager, JobManagerGW and Agent) in Task Manager.

If any of these processes is down or shows an older time stamp, the termination during the previous shutdown and start of the Agent was not clean, so the best course of action is again to restart the Agent.

Unix 
  1. cd <TWSHome> 
  2. ./ShutDownLwa 
  3. Run ps -ef | grep <TWSUser>. If there’s a leftover process, issue kill -9 on the process ID in question. 
  4. Issue ./StartUpLwa to start the Agent clean. 
  5. Run ps -ef | grep <TWSUser> to verify that all processes came up cleanly. 

Windows 
  1. Open a Windows CLI in Administrative Mode 
  2. cd <TWSHome> 
  3. Issue ShutDownLwa.cmd on the Agent. 
  4. Open Task Manager and ensure all processes are stopped: JobManager, JobManagerGW and Agent Processes. 
  5. Issue StartUpLwa.cmd on the Agent. 
  6. Open Task Manager and ensure all processes have started cleanly: JobManager, JobManagerGW and Agent Processes. 
 
Agent up locally but not reflecting on MDM 
In many cases, the JobManager, Agent and JobManagerGW processes may be up locally but not reflecting on MDM. 

In such cases, it is always advisable to review the JobManager_message*log and JobManagerGW_message*log files located under the stdlist/JM directory on the Agent.

If a remote Gateway is involved, also review the JobManager_message*log and JobManagerGW_message*log files located under the stdlist/JM directory on the GW.

The communication from Agent to Gateway and Gateway to MDM is to be reviewed carefully. 

If no remote Gateway is involved, review the communication from Agent to MDM. A very good test is to submit an ad hoc job to the Agent. The joblog of the ad hoc job shows whether the job was submitted to the Agent and whether it started executing. If the job was submitted but execution did not start, there may be a communication problem from Agent to MDM, or from GW to MDM, depending on whether a Remote Gateway is involved.

In this case the job hangs in the WAIT+ state and does not progress.

You can also verify this with a curl command from the Agent, to check whether you are able to connect to the GW (in case a remote GW is involved): 

curl https://RemoteGW:JMPort/JobManagerRESTWeb/JobSchedulerGW/resourceGW -u user:password 

The communication from GW to MDM can also be tested in the following way in case of problems from GW to MDM: 

curl -k https://MDM:JMPort/JobManagerRESTWeb/JobSchedulerGW/resourceGW -u user:password 

The communication from the GW to the Agent can also be tested as follows:  

curl -k https://Agent:JMPort/ita/JobManagerGW/JobManagerRESTWeb/JobScheduler/job -u user:password 
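
When running these checks repeatedly, it can help to capture curl's exit code explicitly, since that code drives the rest of the troubleshooting. The following is a minimal sketch under stated assumptions: the helper names resource_gw_url and probe_url are hypothetical, the host, port and credentials are placeholders, and the 30 second limit is an arbitrary choice.

```shell
# resource_gw_url HOST PORT
# Builds the resourceGW URL used in the curl checks above.
resource_gw_url() {
    printf 'https://%s:%s/JobManagerRESTWeb/JobSchedulerGW/resourceGW\n' "$1" "$2"
}

# probe_url URL USER:PASSWORD
# Runs curl quietly and prints its exit code (0 means the request completed).
# -k skips certificate verification, as in the -k commands above.
probe_url() {
    rc=0
    curl -k -s -o /dev/null --max-time 30 -u "$2" "$1" || rc=$?
    echo "$rc"
}

# Example (not run here; RemoteGW, JMPort and the credentials are placeholders):
#   probe_url "$(resource_gw_url RemoteGW JMPort)" user:password
```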

If the curl command fails, note the curl error and pursue further troubleshooting: 

curl 7 (failed to connect) can indicate an issue with the bi-directional communication between MDM and GW. If a firewall is involved, this translates to a need for a firewall rule change. 

Even if the communication is good, you will still get a curl 60 error (the certificate could not be verified) if you are not passing the certificate in the attempt. 

curl 28 (operation timed out) can indicate a network problem, as it shows up when there is a network timeout. 

curl 18 (partial file) indicates that the amount of data delivered did not match the size that was announced. This situation might again indicate a network problem. 

curl 52 indicates that nothing was returned from the server, again indicating a network problem. 
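
The exit codes discussed in this section can be collected into a small lookup for use in scripts. This is only a sketch: the function name curl_hint is hypothetical, and the hints paraphrase this article rather than any official mapping.

```shell
# curl_hint CODE
# Prints a short troubleshooting hint for a curl exit code.
curl_hint() {
    case "$1" in
        0)  echo "OK: request completed" ;;
        7)  echo "connect failed: check firewall rules between the hosts" ;;
        18) echo "partial transfer: possible network problem" ;;
        28) echo "timeout: possible network problem" ;;
        35) echo "SSL connect error: check the certificates" ;;
        52) echo "empty reply from server: possible network problem" ;;
        60) echo "certificate not verified: expected if no certificate is passed" ;;
        *)  echo "unlisted curl exit code $1: see the curl manual" ;;
    esac
}
```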

In case of curl 35 (SSL connect error), the following causes could be relevant: 

Certificate issue on the Dynamic Agent (DA). The key file containing the certificates used to communicate with the MDM or GW is named in the ita.ini file by the key_db_name parameter. The default value is TWSClientKeyStore. The file is found here: TWA/TWS/ITA/cpa/ita/cert/TWSClientKeyStore.kdb. To evaluate the kdb file, use gsk8capicmd.  
  1. $ cd <TWAHome>/TWS  
  2. $ . ./tws_env.sh  
  3. $ cd ITA/cpa/ita/cert  
  4. $ gsk8capicmd -cert -list all -db TWSClientKeyStore.kdb -pw default  
  5. \\ The output from that command returned saasclient. 
  6. \\ To list details of the active certificates, use the following commands:  
  7. $ gsk8capicmd -cert -details -db TWSClientKeyStore.kdb -pw default -label client  
  8. $ gsk8capicmd -cert -details -db TWSClientKeyStore.kdb -pw default -label server  

If there is a problem with the client certificate, there will be a problem with SSL communication from the MDM or GW to the DA. There will be entries in the log file TWA/TWS/stdlist/JM/ITA_trace.log like the following: 

net_server: net_accept_ssl: error 127507: Error while accepting SSL connection
or
net_server: net_accept_ssl: error 127503: Error while initializing SSL context
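
To check quickly whether the trace log contains either of these entries, a grep-based helper along the following lines could be used. This is a minimal sketch; the function name count_ssl_accept_errors is hypothetical, and the pattern simply matches the two messages quoted above.

```shell
# count_ssl_accept_errors FILE
# Prints how many lines in FILE report error 127507 or 127503 from
# net_accept_ssl. Prints 0 when there are no such lines or FILE is unreadable.
count_ssl_accept_errors() {
    n=$(grep -c -E 'net_accept_ssl: error (127507|127503):' "$1" 2>/dev/null || true)
    echo "${n:-0}"
}

# Example (not run here):
#   count_ssl_accept_errors TWA/TWS/stdlist/JM/ITA_trace.log
```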
 
Also, try to match the certificates on the Dynamic Agent and the GW. If they do not match, export the certificates from the GW to the Agent and then try to link the Agent again. 
You can also verify the certificate by running the md5sum or cksum command against the key file on both hosts. 

If the command output matches on the Remote GW and the Agent, then the certificate is good. If not, and the Agent alone is unlinked with the curl 35 error while the GW is fine, you would have to export the certificate from the Gateway to the Agent. 
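
The checksum comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names file_sum and sums_match are hypothetical, cksum is used because it is available on most Unix systems, and how the two checksum strings are brought together (copy and paste, ssh, and so on) is left open.

```shell
# file_sum FILE
# Prints the cksum checksum and byte count of FILE, without the file name,
# so the output can be compared across hosts.
file_sum() {
    cksum < "$1"
}

# sums_match SUM_A SUM_B
# Succeeds when the two checksum strings are non-empty and identical.
sums_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example (not run here): run file_sum on TWSClientKeyStore.kdb on the Agent
# and on the Remote GW, then compare the two strings:
#   sums_match "$agent_sum" "$gw_sum" && echo "certificates match"
```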
 
Also check that the permissions on TWSClientKeyStore.kdb are set to 755. 
Check as well that the file TWSClientKeyStore.kdb is owned by TWSUser and TWSGroup. 
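
These two checks can be scripted with find, which can test the mode and ownership of a single file. This is a minimal sketch; the helper names kdb_mode_ok and kdb_owner_ok are hypothetical, and the user and group passed in stand for your own TWSUser and TWSGroup.

```shell
# kdb_mode_ok FILE
# Succeeds when FILE has exactly mode 755.
kdb_mode_ok() {
    [ -n "$(find "$1" -prune -perm 755 2>/dev/null)" ]
}

# kdb_owner_ok FILE USER GROUP
# Succeeds when FILE is owned by USER and belongs to GROUP.
kdb_owner_ok() {
    [ -n "$(find "$1" -prune -user "$2" -group "$3" 2>/dev/null)" ]
}

# Example (not run here; twsuser and twsgroup are placeholders):
#   kdb_mode_ok TWSClientKeyStore.kdb || echo "mode is not 755"
#   kdb_owner_ok TWSClientKeyStore.kdb twsuser twsgroup || echo "unexpected owner or group"
```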
 
The Agent has to be restarted after applying the above procedure for fixing curl 35. 
 
Agent not coming up at all locally 
In case the Agent is not coming up at all locally, analyze the errors in the JobManager_message*log and JobManager*trace.log files. 
 
If any of these errors point to a problem with the scanner process or wscanhw process, then you could get a message similar to the following: 
 
AWSITA104E Unable to perform the system resources scan. The error is "operation failed with error code -6". 
 
This could indicate a permissions issue with the scanner, in which case you could check:
Run data

    
The other possibility is a problem with the scanner not completing, so you could run the following and re-check: ​
Run the following and re-check

    
If the scanner does not complete, run ps -ef | grep "wscanhw" to check which scanner processes were triggered. If there are many defunct or leftover scanner processes, go to /etc/cit and list the content of cit.ini. If the scanner version is 2.8.0.0001, you would need an upgrade to 2.8.0.0005: 
Run data

    
Contact PMR L2 Support or SaaS L2 Support for CIT upgrade instructions and for Package for CIT Component. 
 
In case the agent does not come up locally and you are not able to locate the problem, you can also enable Dynamic Agent traces in order to troubleshoot further. 
  
To enable tracing, modify the JobManager.ini and JobManagerGW.ini files located under <TWSHome>/ITA/cpa/config, updating the parameters below to the new values shown:
JobManager

    
After enabling Dynamic Agent traces, shut down the Agent by issuing ./ShutDownLwa or ShutDownLwa.cmd and start the Agent by issuing ./StartUpLwa or StartUpLwa.cmd. 
 
After restarting the Dynamic Agent, you will end up with multiple trace files: up to 10 JobManager*trace* files and up to 10 JobManagerGW*trace* files, each 3072000 bytes (about 3 MB) in size. You can review these traces with PMR L2 Support or SaaS L2 Support for further assistance.
Sriram V.
​Senior Technical Lead


Sriram has been working on Workload Automation for the last 10 years in various capacities, including IWS Administrator, SME, and India-SME, and currently works with the HCL product team supporting Workload Automation on SaaS.

www.hcltechsw.com
About HCL Software 
HCL Software is a division of HCL Technologies (HCL) that operates its primary software business. It develops, markets, sells, and supports over 20 product families in the areas of DevSecOps, Automation, Digital Solutions, Data Management, Marketing and Commerce, and Mainframes. HCL Software has offices and labs around the world to serve thousands of customers. Its mission is to drive ultimate customer success with their IT investments through relentless innovation of its products. For more information, please visit www.hcltechsw.com. Copyright © 2019 HCL Technologies Limited