WORKLOAD AUTOMATION COMMUNITY
  • Home
  • Blogs
  • Forum
  • Resources
  • Events
    • IWA 9.5 Roadshows
  • About
  • Contact
  • What's new

Workload Automation in Full High Availability with Automatic Master Failover

5/6/2020

0 Comments

 
Picture
Nowadays, the global marketplace requires systems that can react to fluctuating loads and unpredictable failures, with the end-level agreement (SLA) goal of being available 99.999% of the time. Mitigating the negative impacts of failures, disasters and outages cannot even be contemplated in today’s competitive world. A disaster recovery strategy is too error-prone and resource-consuming: businesses need to operate following an “always on” model. ​
This requirement makes no exception for Workload Automation that manages your business critical batch and real-time workload in a downtime-free environment. 
 
Workload Automation provides the following high availability options to help you meet the SLA goal:  
               - DWC Replica in Active/Active configuration 
               - Master Replica in Active/Passive configuration 

With Workload Automation V9.5 Fix Pack 2, the high availability configuration offers a new Automatic Failover feature, without the need for additional third-party software. These options, coupled with support for zero downtime upgrade methods for agents and the entire workload environment, ensures continuous business operations and protection against data loss. 

Want to see how you can configure High Availability in a single domain network, with scalable and auto-recoverable masters? Read on to find out how to leverage the new Automatic Failover feature to reduce, to a minimum, the impact on workload scheduling, and more in general, on the responsiveness of the overall system when scenarios of unplanned outages of one or more master components occur. 

A simple Workload Automation configuration: master not in high availability

​In a configuration with a single master, with both dynamic and fault-tolerant agent (FTA) workstations defined on it (Figure 1), you might already be aware of the various ways components auto-recover from a failure, or how they continue to perform their job without an active master:  
  • If there are communication problems with the domain manager, the FTA can run its jobs locally. 
  • If a TCP/IP connection is not available, the store-and-forward message mechanism, queues messages to prevent the loss of job status updates coming from the FTA and dynamic agent. 
  • ​If the WebSphere Liberty Profile server process goes down, the watchdog service attempts to recover by restarting it.
Picture
Figure 1. Single master configuration with dynamic and FTA agent workstations

So, even with a simple configuration, you can still achieve a certain level of fault tolerance.  

And let’s not forget that you can already scale up/down your dynamic agents. For example, depending on how heavy the workload is on your workstations, you can choose to define a pool of dynamic agents and schedule your jobs on the defined pool so that jobs are balanced across the agents in the pool and are reassigned to available agents should an agent become unavailable (see this blog for more details on an elastic scale solution using Docker to configure a list of pools to which an agent can be associated). 

However, as you know, an environment that can only continue to orchestrate the jobs in the current plan at the FTA level, doesn’t quite cut it in the face of failure. You still rely on the availability of the master to schedule on dynamic agents, to extend your plan, and to solve dependencies between jobs running on different agents or domains. 
​

 Workload Automation configuration: master in high availability ​

The next incremental step toward high availability is to replicate the master in our network (Figure 2). Because the WA cluster configuration is active-passive, any replicated master instances will be backup replicas.  
Picture
Figure 2 Environment with a master and several backup masters
If an unplanned outage occurs, you can simply run the commands, conman switchmgrmasterdm;BKMi and conman    
switchevtproc BKM (to switch the master and event processor to a backup), from any of your backup masters (oh and by the​ way, since V9.5, you don’t need to manually stop/start the broker application anymore), and switch the workstations’ definition types FTA<->MASTER to make the switch permanent (the full procedure in this blog here). Moreover, to be able to schedule the FINAL job stream independently from what is the current master, you can define an  XA (extended agent) workstation with $MASTER  current host workstation and that is, the host of the current master).
The FTAs and dynamic agents can link automatically to the newly selected master by reading the Symphony file and by checking the configured Resource Advisor URL list (the list of the master’s endpoints).

With this, your workload resumes enabling business continuity. 

Ok, but this requires some kind of a monitoring system that would alert IT teams of failures and, after analyzing the issue, would run the switch manager procedure. Alternatively, high availability can be achieved automatically by using an external supervisor system suitably configured, such as TSAMP (IBM Tivoli System Automation for Multiplatforms).
So you ask, how can you further minimize system unavailability? Keep reading.

Master in high availability with automatic failover enabled 

Starting with Version 9.5 Fix Pack 2, you can leverage the new Automatic Failover feature: the first backup engine to automatically verify and detect if the active master is unavailable starts the long-term permanent switchmgr to [FM1] itself; Similarly, the same applies to the event processor. The backup event processor(s) automatically verifies and detects that the active event processor (that can be different from the current master workstation) is unavailable, and then starts the long-term switchevtproc to itself.  

And not only that, but the masters (active or passive) practice self-awareness: they can check the local FTA status and, if any of the processes go down, they are able to automatically start the recovery of the FTA.

To better understand how the automatic failover feature works, let’s look at some details about the role each component plays to detect a failure or recover from it:
  1. Each backup monitors the status of the active master.
  2. Each master (active or backup) monitors their own FTA status. If any of their own FTAs go down (mailman/batchman/jobman), automatic recovery is attempted by the master (max 3 attempts).
  3. If the WebSphere Liberty Profile server goes down, the watchdog process attempts to recover it. 
  4. If the active master cannot be automatically restored in 5 min (this is the threshold set after the master is declared to be unavailable) because either:
  • The FTA and/or Liberty Server are still down
  • The engine is unable to communicate with the database
Then, a permanent switch to a backup is automatically triggered by any of the backup candidates.

Nice, right? This feature is automatically enabled and the XA workstation hosted by the $MASTER hostname with FINAL and FINALPOSTREPORTS job streams in a UNIX environment is automatically created at installation time with a WA fresh installation. Otherwise , if you are coming from a product update or upgrade, after you have migrated your backups and your master, you can enable the feature via the optman command, optman chg enAutomaticFailover = yes and  change the FINAL and FINALPOSTREPORTS job streams to move the job streams and all of the jobs from the master to the XA workstation. 

Let's suppose that you have multiple backup masters and you want to have greater control over which one of them should be considered first as the candidate master for the switch operation, well, for your convenience, you can use the following new optman options:
  • workstationMasterListInAutomaticFailover
  • workstationEventMgrListInAutomaticFailover  

These are two separate lists of workstations, each list containing an ordered comma-separated list of workstations that serve as backups for the master and event processor, respectively. If a workstation is not included in the list, then it will never be considered as a backup. The switch is first attempted by the first workstation in the list and, if it fails, an attempt is made by the second in line, and so on. 
If no workstation is specified in this list, then all backup master domain managers in the domain are considered eligible backups. This gives you an extra level of control over your backups.

For further granularity, you can choose to use only a subset of one list in the other list, or, choose to use two completely different lists. You might have a backup that can serve as the event manager backup, but you don’t want it to be considered as a potential master domain manager backup. Also, if the event manager fails, but the master domain manager is running fine, then only the event manager switches to a backup manager defined in the list of potential backups. Another example where these two distinct lists can be useful is if you have a dedicated environment for job orchestration and a different one for the event processor. You can continue to keep this separation of duties by enabling automatic failover and configuring the new optman options accordingly (Figure 3): 
Picture
​Figure 3 Separate backup lists for MDM and event processor
If you have a Dynamic Workload Console (DWC) and/or other client application based on the exposed REST API of the master, you might be worried about how to replace the new  master connection info when an automatic switch occurs. The best way you can do this is with a load balancer sitting in front of your masters and behind the DWC, and by specifying the public hostname of the load balancer as the endpoint of your engine connections in the DWC or in your client apps. In this way, you won’t have to know what the hostname of the current active master is to interact with Workload Automation features ( Figure 4). This becomes possible with another feature introduced in Fix Pack 2 that enables any backup master to satisfy any HTTP request, even in the event of a request that can be satisfied only by the active master (i.e. requests on the Workload Service Assurance) by proxying the request to/from the active master itself . 
Picture
Figure 4 Load balancer placed behind the MDM and BKMs
​DWC, Master and RDBMS in high availability: WA in full high availability 

At this point, to have a fully high available WA environment, the only thing missing is configuring the RDBMS and DWC in high availability. 
​
If your RDBMS has the HADR feature and you have enabled it (a Db2 example here), you can configure the liberty server’s datasource.xml file of the master and backup components, adding the failover properties,  whose key-value pairs depend on the specific RDBMS vendor. Db2’s datasource can be configured with this set of properties in the XML element named properties.db2.jcc:
<properties.db2.jcc
            databaseName="TWS"
            user="…"
            password="…"
            serverName="MyMaster"
            portNumber="50000"
            clientRerouteAlternateServerName="MyBackup"
            clientRerouteAlternatePortNumber="50000"
            retryIntervalForClientReroute="3000"
            maxRetriesForClientReroute="100"
         />
​For the DWC, instead, you just need to ensure that it is connected to an external DB (not the embedded one), replicate it, and link it to a load balancer that supports session affinity, to dispatch the request related to the same user session to the same DWC instance.     
​
In Figure 5, the load balancers are depicted as two distinct ones, the most general case possible, but you can also use the same component for balancing the request to the DWC and to the masters:
Picture
Figure 5 Full high availability WA environment
​At this point, congratulations, you have reached a full WA environment in high availability!

Author's Bio
Picture
​Paolo Pisa, Senior Software Developer 
​

Paolo joined IBM as Software Engineering working on using Full Stack technologies  and covering technical leader roles.   Starting from 2017 he works in HCL as Technical leader of WA’s Deployment Team with main focus in  Dockerizing  javaEE-based applications.  
He has a strong background in computer science technologies, data structures, algorithms and distributed computing. He is the subject-matter expert of IBM Liberty server runtime environment.
View my profile on LinkedIn
Picture
Louisa Ientile, Information Developer

Louisa works as an Information Developer planning, designing, developing, and maintaining customer-facing technical documentation and multimedia assets.  Louisa completed her degree at University of Toronto, Canada, and currently lives in Rome, Italy with a passion for food, wine, beaches, and la dolce vita. 
View my profile on LinkedIn
0 Comments

Your comment will be posted after it is approved.


Leave a Reply.

    Archives

    March 2023
    February 2023
    January 2023
    December 2022
    September 2022
    August 2022
    July 2022
    June 2022
    May 2022
    April 2022
    March 2022
    February 2022
    January 2022
    December 2021
    October 2021
    September 2021
    August 2021
    July 2021
    June 2021
    May 2021
    April 2021
    March 2021
    February 2021
    January 2021
    December 2020
    November 2020
    October 2020
    September 2020
    August 2020
    July 2020
    June 2020
    May 2020
    April 2020
    March 2020
    January 2020
    December 2019
    November 2019
    October 2019
    August 2019
    July 2019
    June 2019
    May 2019
    April 2019
    March 2019
    February 2019
    January 2019
    December 2018
    November 2018
    October 2018
    September 2018
    August 2018
    July 2018
    June 2018
    May 2018
    April 2018
    March 2018
    February 2018
    January 2018
    December 2017
    November 2017
    October 2017
    September 2017
    August 2017
    July 2017
    June 2017
    May 2017

    Categories

    All
    Analytics
    Azure
    Business Applications
    Cloud
    Data Storage
    DevOps
    Monitoring & Reporting

    RSS Feed

www.hcltechsw.com
About HCL Software 
HCL Software is a division of HCL Technologies (HCL) that operates its primary software business. It develops, markets, sells, and supports over 20 product families in the areas of DevSecOps, Automation, Digital Solutions, Data Management, Marketing and Commerce, and Mainframes. HCL Software has offices and labs around the world to serve thousands of customers. Its mission is to drive ultimate customer success with their IT investments through relentless innovation of its products. For more information, To know more  please visit www.hcltechsw.com.  Copyright © 2019 HCL Technologies Limited
  • Home
  • Blogs
  • Forum
  • Resources
  • Events
    • IWA 9.5 Roadshows
  • About
  • Contact
  • What's new