
Workload Automation 9.5: How to survive disasters

6/5/2020

If you want to avoid a potential business disruption in your Workload Automation environment, you should leverage the Master/Backup Master configuration. But what happens if the RDBMS connected to Workload Automation crashes?

In this article, we describe how to manage both the Workload Automation components and DB2 HADR to allow business continuity during a disaster event.
Scenario 

To avoid possible disasters in a Workload Automation production environment, you must configure your environment for high availability.

In Figure 1, you can see a Workload Automation environment with both Master and Backup Master configured with DB2 HADR.
Figure 1. Workload Automation environment with Master, Backup Master, and DB2 HADR
In the following sections, we will describe:
  • How to set up the DB2 HADR for Workload Automation 
  • How to configure WebSphere Liberty to manage DB2 HADR 
  • How to recover from disaster 
  • How to troubleshoot DB2 HADR issues

How to set up DB2 HADR for Workload Automation
This configuration is composed of two nodes (MyMaster and MyBackup), each hosting its Workload Automation component (the MDM on MyMaster and the BKM on MyBackup) and its own DB2 node configured in HADR.
DB2 HADR itself consists of two nodes: a primary node that is active, and a standby node that synchronizes data with the primary.
 
DB2 HADR configuration 
To configure the Workload Automation database in HADR, we have to set up DB2 as follows on both nodes.
In the following commands, TWS is the database name. 
 
Set up the database properties:

1. The first configuration concerns the DB alternate server name and port, on both nodes:
db2 update alternate server for database TWS using hostname <other machine> port <db_port>

2. Now we have to set all the DB HADR properties on both nodes:
db2 update db cfg for TWS using HADR_LOCAL_HOST <mymaster|mybackup>
This parameter specifies the hostname of the local database.

db2 update db cfg for TWS using HADR_REMOTE_HOST <mymaster|mybackup>
This parameter specifies the hostname of the remote database.

db2 update db cfg for TWS using HADR_LOCAL_SVC <local service name>
This parameter specifies the local DB2 service name.

db2 update db cfg for TWS using HADR_REMOTE_SVC <remote service name>
This parameter specifies the remote DB2 service name.

db2 update db cfg for TWS using HADR_REMOTE_INST <remote instance name>
This parameter specifies the remote instance name.

db2 update db cfg for TWS using HADR_TIMEOUT <peer timeout>
This parameter specifies after how much time DB2 considers a node offline.

db2 update db cfg for TWS using HADR_TARGET_LIST <peer nodes list>
This parameter specifies the list of HADR nodes to look up.

db2 update db cfg for TWS using HADR_SYNCMODE <sync mode>
This parameter specifies the transaction log synchronization mode. It should be set depending on various factors, such as the network speed between the nodes. For a detailed explanation, refer to the IBM Knowledge Center: https://www.ibm.com/support/knowledgecenter/SSEPGG_9.5.0/com.ibm.db2.luw.admin.config.doc/doc/r0011445.html

db2 update db cfg for TWS using HADR_REPLAY_DELAY <delay limit>
This parameter specifies the number of seconds that must pass from the time a transaction is committed on the primary database to the time it is committed on the standby database.
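Note that before HADR can be started, the database must use archive logging, and the standby database must be initialized from a backup of the primary. A minimal sketch of this preparation, assuming DISK-based log archiving and /tmp/TWS_backup as the backup location (both are placeholder choices to adapt to your environment):

On MyMaster:
db2 update db cfg for TWS using LOGARCHMETH1 DISK:/db2/archlogs
db2 backup db TWS to /tmp/TWS_backup

Copy the backup image to MyBackup, then on MyBackup:
db2 restore db TWS from /tmp/TWS_backup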
 
Start HADR on both nodes
Now that HADR is configured, we have to start it in a fixed order: first the standby node, then the primary one.
On MyBackup, issue the following command:
db2 start hadr on db TWS as standby
On MyMaster, issue the following command:
db2 start hadr on db TWS as primary
 
How to configure WebSphere Liberty to manage DB2 HADR
After configuring DB2 in HADR, we have to configure the TWS datasource of Liberty so that it points to HADR instead of a single DB node.
This way, even though Liberty does not know where the database is physically active, it is still able to reach the TWS database.
To configure the TWS datasource properties, edit the file <TWA_HOME>/<DATADIR>/usr/servers/engineServer/configDropins/overrides/datasource.xml:
  • Add the following client reroute parameters to the properties section:
<properties.db2.jcc
serverName="MyMaster"
portNumber="50003"
databaseName="TWS"
user="db2inst1"
password="{xor}xxxxxxxxxxxxxxxxxx"
clientRerouteAlternateServerName="MyBackup"
clientRerouteAlternatePortNumber="50003"
retryIntervalForClientReroute="3000"
maxRetriesForClientReroute="100"
/>
  • The new configuration will be automatically reloaded.  
An example of the entire datasource.xml file using variables is: 
<server description="datasourceDefDB2"> 
<variable name="db.driverType" value="4"/> 
<variable name="db.serverName" value="MyMaster"/> 
<variable name="db.portNumber" value="50003"/> 
<variable name="db.databaseName" value="TWS"/> 
<variable name="db.user" value="db2inst1"/> 
<variable name="db.password" value="{xor}xxxxxxxxxxxxxxxxxx"/> 
<variable name="db.driver.path" value="/opt/wa/TWS/jdbcdrivers/db2"/> 
<variable name="db.sslConnection" value="true"/> 
<variable name="db.clientRerouteAlternateServerName" value="MyBackup"/> 
<variable name="db.clientRerouteAlternatePortNumber" value="50003"/> 
<variable name="db.retryIntervalForClientReroute" value="3000"/> 
<variable name="db.maxRetriesForClientReroute" value="100"/> 
 
<jndiEntry value="DB2" jndiName="db.type" /> 
 
<jndiEntry value="jdbc:db2://${db.serverName}:${db.portNumber}/${db.databaseName}" jndiName="db.url"/> 
<jndiEntry value="${db.user}" jndiName="db.user"/> 
 
<!--  DB2 DRIVER jars Path -> db2jcc4.jar db2jcc_license_cisu.jar --> 
<library id="DBDriverLibs"> 
<fileset dir="${db.driver.path}" includes="*" scanInterval="5s"/> 
</library> 
 
<dataSource id="db2" jndiName="jdbc/twsdb" statementCacheSize="400" isolationLevel="TRANSACTION_READ_COMMITTED"> 
<jdbcDriver libraryRef="DBDriverLibs"/> 
<connectionManager connectionTimeout="180s" maxPoolSize="300" minPoolSize="0" reapTime="180s" purgePolicy="EntirePool"/> 
<properties.db2.jcc 
driverType="${db.driverType}" 
serverName="${db.serverName}" 
portNumber="${db.portNumber}" 
sslConnection="${db.sslConnection}" 
databaseName="${db.databaseName}" 
user="${db.user}" 
password="${db.password}" 
clientRerouteAlternateServerName="${db.clientRerouteAlternateServerName}"
clientRerouteAlternatePortNumber="${db.clientRerouteAlternatePortNumber}"
retryIntervalForClientReroute="${db.retryIntervalForClientReroute}" 
maxRetriesForClientReroute="${db.maxRetriesForClientReroute}" 
/> 
</dataSource> 
</server> 
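The {xor} value in the password variable is a Liberty-encoded password. If you need to generate one, WebSphere Liberty ships the securityUtility tool; a minimal sketch, where <LIBERTY_HOME> is a placeholder for your Liberty installation directory and myDb2Password is a sample value:

<LIBERTY_HOME>/bin/securityUtility encode --encoding=xor myDb2Password

Paste the resulting {xor}... string into the db.password variable.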
 
How to recover from disaster
To recover from a disaster scenario, for example if the primary node crashes, we can leverage the multi-node environment to allow business continuity.

Follow these steps to recover the Workload Automation environment.

Take over the database on the standby node

We have to take over the database on the secondary node.
On MyBackup, issue the following command:
db2 takeover hadr on db TWS
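If the primary machine is completely down, a standard takeover may fail because the standby cannot reach it. In that case DB2 supports a forced takeover; use it only when you are sure the old primary is no longer processing transactions, to avoid a split-brain situation:

db2 takeover hadr on db TWS by force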
 
Switch Workload Automation components
After the database switches to the secondary node, we have to switch all the Workload Automation components as well.
  • Export the master workstation definition into the file 'file1':
composer create file1 from ws=S_MDM
where S_MDM is the master workstation.
  • Export the backup master workstation definition into the file 'file2':
composer create file2 from ws=S_BKM
where S_BKM is the backup master workstation.
  • From both files, create two new files ('file3' and 'file4') that swap the MANAGER and FTA roles. For example, using the sed Linux command (an illustrative definition fragment follows this procedure):
sed 's/MANAGER/fta/Ig' < file1 > file3
sed 's/fta/MANAGER/Ig' < file2 > file4
  • Switch the event processor from master to backup master: 
conman "switchevtproc S_BKM" 
  • Switch the manager from master to backup master: 
conman "switchmgr masterdm;S_BKM" 
  • Switch the Broker application from master to backup master. 
On master node: 
<TWA_HOME>/wastools/stopBrokerApplication.sh 
On backup master node: 
<TWA_HOME>/wastools/startBrokerApplication.sh 
  • Import the new workstation definitions to make the switch between MANAGER and FTA (backup master) permanent:
composer replace file3 
composer replace file4 
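To make the substitution concrete, here is an illustrative fragment of a workstation definition as exported by composer; the node name and option values are hypothetical placeholders:

CPUNAME S_MDM
  OS UNIX
  NODE mymaster.mycompany.com
  DOMAIN MASTERDM
  FOR MAESTRO
    TYPE MANAGER
    AUTOLINK ON
    FULLSTATUS ON
  END

After the first sed command, the TYPE MANAGER line in file3 becomes TYPE FTA, demoting the old master to a fault-tolerant agent; the second command performs the opposite substitution for the backup master.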
 
Now both the middleware and the Workload Automation components are on the MyBackup machine, and we can continue to work on this secondary node without any disruption.
Troubleshooting 
 
How to check HADR health
To check the HADR status, issue the following command on both nodes:
db2pd -hadr -db TWS

where TWS is the database name.

In the command output, the key parameters that describe HADR health are:
  • HADR_ROLE: on the primary node it must be PRIMARY (and STANDBY on the secondary node)
  • HADR_STATE: must be PEER
  • HADR_CONNECT_STATUS: must be CONNECTED
  • The LOG_TIME parameters describe the latest transaction log on each node: the date and time must be synchronized and up to date
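To check just these fields from the command line, the db2pd output can be filtered; a minimal sketch using grep:

db2pd -hadr -db TWS | grep -E "HADR_ROLE|HADR_STATE|HADR_CONNECT_STATUS|LOG_TIME"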
How to fix HADR issues
If one of the parameters described in the previous section is not in the expected state, it means that HADR is not working correctly and immediate action should be taken.

Let's go through the most common errors and the recovery actions to perform.

First, try to start HADR on the node that is not working:

db2 start hadr on db TWS as standby|primary

If the incorrect status does not change after a few minutes, it means that HADR is broken.
The node on which HADR is not working probably has either a corrupted database or a missing/corrupt transaction log, so the recovery strategy is:

1. Takeover HADR on the working node (in our example MyBackup):
db2 takeover hadr on db TWS

2. Back up the database online on the working node:
mkdir /tmp/TWS_backup
db2 "backup db TWS ONLINE to /tmp/TWS_backup INCLUDE LOGS"

3. Copy the TWS_backup directory to the corrupted node.

4. Drop and restore the database on the corrupted node:
db2 drop db TWS
db2 "restore db TWS from /tmp/TWS_backup"

5. Reconfigure and start HADR on the corrupted node (in our example MyMaster). The restored database carries the HADR configuration of the backup source, so the local and remote values must be set again:
db2 "update alternate server for db TWS using hostname MyBackup port 50003"
db2 "update db cfg for TWS using HADR_LOCAL_HOST MyMaster"
db2 "update db cfg for TWS using HADR_REMOTE_HOST MyBackup"
db2 "update db cfg for TWS using HADR_TARGET_LIST MyBackup:DB2_HADR_TWS"
db2 "start hadr on db TWS as standby"

6. Takeover HADR on the corrupted node (in our example MyMaster), to restore the original roles:
db2 takeover hadr on db TWS
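Before issuing the takeover in step 6, wait until the restored standby has caught up with the new primary: a non-forced takeover succeeds only when the pair is back in PEER state. A quick check, reusing the health command from the previous section:

db2pd -hadr -db TWS | grep HADR_STATE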
 
Conclusion
This article provides a simple way to add high availability to your Workload Automation environment, protecting it from disasters on either the middleware or the database side.
Do not hesitate to contact us if you have any questions.
 
References 
If your Workload Automation version is 9.4 or earlier, you can refer to this post:
http://www.workloadautomation-community.com/blogs/workload-automation-how-to-survive-disasters# 

Author's BIO
Eliana Cerasaro, Technical Lead, HCL Technologies 

Eliana Cerasaro has worked in the Workload Automation area since 2006. In 2016, she moved from IBM to HCL Technologies and is currently part of the distributed development team of Workload Scheduler as a Technical Lead. She specializes in the design and development of backend applications and databases.
