What if an ordinary day with Workload Automation turns into an intriguing mystery clue game? How can you find out what went wrong? Which is the weapon, and in which room did the crime take place? Let’s put on the detective hat and start the game!
Everything was going as scheduled: all the workstations in your environment were up and running, the entire workload was running happily and smoothly, and all the deadlines were going to be met. You were just starting to enjoy your coffee while looking at the greenest dashboard you have ever seen… And then, it happened. AGAIN. A job shows up in your dashboard at potential risk!
Stay cool! The clock is ticking but you need to keep calm and focused to solve this mystery crime case and save the entire workload (and your day).
Let’s start looking for clues… With a slightly shaking hand, you click on the “potential risk” bar to step into the crime scene.
Here it is: your most important job is at risk… but who is responsible for that?
You launch the what-if-analysis and highlight the “critical path” to find out who is impacting your precious deadline.
And then you find it! Here is Mister Job, at the beginning of the chain, blocking your entire workload…
Now it is time to hold back your tears and start investigating with the right questions.
Who is blocking your job? Who is the culprit?
We know, the first to blame is always the butler. So, your first click goes to the workstation.
Same old story: you are expecting to find your agent not running (or your agent without the J flag, if you are checking it with the conman sc command).
This happens when the server sets the agent down because it has received no heartbeat from your agent. You need to check for a message like the following one in the WebSphere Application Server log room:
Note that those messages are found in SystemOut.log in the WebSphere Application Server log path (/opt/IBM/TWA/WAS/TWSProfile/logs/server1) for releases earlier than 9.5, or in the messages.log file in the Liberty log path (/opt/wa/server_root/TWSDATA/stdlist/appserver/engineServer/logs) starting from 9.5.
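The version-dependent log location can be captured in a small helper. This is only a sketch: the two paths are the defaults quoted above (with messages.log assumed as the Liberty log file name), the function names are made up for illustration, and the search string is whatever you expect to see in the log, for example your workstation name.

```python
from pathlib import Path

# Default locations quoted in the text; adjust them to your installation.
WAS_LOG = Path("/opt/IBM/TWA/WAS/TWSProfile/logs/server1/SystemOut.log")
LIBERTY_LOG = Path(
    "/opt/wa/server_root/TWSDATA/stdlist/appserver/engineServer/logs/messages.log"
)


def appserver_log(version: str) -> Path:
    """Pick the application server log for a version such as '9.4' or '9.5.0.2'."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    # Liberty replaced WebSphere Application Server starting from 9.5.
    return LIBERTY_LOG if (major, minor) >= (9, 5) else WAS_LOG


def matching_lines(log_file: Path, needle: str) -> list:
    """Return the lines of log_file that contain needle, ignoring case."""
    wanted = needle.lower()
    with log_file.open(encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if wanted in line.lower()]
```

For example, `matching_lines(appserver_log("9.5"), "MY_AGENT")` would list every line of the Liberty log that mentions a (hypothetical) workstation called MY_AGENT.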
If this happens, a StartUpLwa command on your agent will be enough to restore it. But… this is not your case. Your agent is running… Then why is your job stuck?
Job in INTRO
This time you try Monitor jobs (or conman sj if you prefer the command line). Surely this will tell you what’s going on… And in fact, there is Mister Job, lying in INTRO status without going to READY.
You know, now you need more clues… but where to find them?
You decide to proceed with a systematic approach to avoid getting lost forever in the twists and turns of Workload Automation Mansion.
The first room you enter is the WebSphere Application Server log room, to find out whether the broker and the agent are talking to each other.
Mr. Broker has already sent your Mr. Job to Mr. Agent… but no response has been received from Mr. Agent. Is he responding or not? Who is blocking your messages and trying to mislead your investigation?
You will need to move to the agent log room to find these answers!
You enter the stdlist/JM/jobmessage.log and suddenly find out that the agent is unable to send resources to the broker:
This is the evidence you were looking for. Now you can accuse Mr. Agent! Otherwise you would have found a message like the following one:
Still no signs of the crime scene weapons.
You write down a quick list of all the possible weapons that are preventing Mr. Agent from communicating with Mr. Broker.
Is the DNS preventing your agent from resolving the broker address? If yes, you need to check and fix the hosts file.
Is the firewall blocking your messages? If yes, you need to ask your network admin to open the port.
Is the agent trying to reach the broker on a port different from the default one (31116)? The jobmanager.ini configuration needs to be fixed in this case.
Is the CIT just sending resource status updates while the server is waiting for the full resource status? In this case, restarting the agent will solve the case and save your day!
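The first two suspects on this list can be questioned with a few lines of Python run from the agent machine. This is a minimal sketch using only the standard socket module: the function names are invented for illustration, the broker host name is something you supply yourself, and 31116 is simply the default port mentioned above. It cannot tell a firewall from a wrong port, and it says nothing about the CIT case.

```python
import socket

DEFAULT_BROKER_PORT = 31116  # default broker port quoted in the text


def resolves(host: str) -> bool:
    """True if the agent machine can resolve the broker host name."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def port_reachable(host: str, port: int = DEFAULT_BROKER_PORT,
                   timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_suspect(host: str, port: int = DEFAULT_BROKER_PORT) -> str:
    """Name the first weapon on the list that matches, or 'none'."""
    if not resolves(host):
        return "dns"
    if not port_reachable(host, port):
        return "firewall or wrong port"
    return "none"
```

Calling `first_suspect("broker.example.com")` (a placeholder host name) returns "dns" when the name does not resolve, "firewall or wrong port" when the connection cannot be opened, and "none" when both checks pass, in which case the investigation moves on to the remaining weapons.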
Job in READY
Let’s picture another case with a completely different investigation scenario. Going back to Monitor jobs, what do you need to look at if Mr. Job is READY but not starting? Multiple weapons are possible in this case! Usually, it’s a matter of dependencies: is Mr. Job waiting for a prompt, a resource or a time dependency? Or is there a job or job stream dependency that is preventing Mr. Job from starting? But this is not the only option: it can be a limit or fence fault! In those cases, going to the Job Stream View and asking “why a job does not start?” can help your investigation.
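The reasoning behind this checklist can be sketched as a toy model. It is deliberately simplified and is not the product's actual scheduling logic: the class and field names are hypothetical, the fence rule assumes a job starts only when its priority is higher than the workstation fence, the limit rule assumes no new job starts once the number of running jobs reaches the limit, and special priorities are ignored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    priority: int
    # Unresolved prompts, resources, time, job or job stream dependencies.
    open_dependencies: List[str] = field(default_factory=list)


@dataclass
class Workstation:
    limit: int         # maximum number of jobs running at the same time
    fence: int         # priority a job must exceed in order to start
    running_jobs: int  # jobs currently running on the workstation


def why_not_starting(job: Job, ws: Workstation) -> List[str]:
    """List every reason (in this simplified model) that keeps a job in READY."""
    reasons = [f"waiting on dependency: {dep}" for dep in job.open_dependencies]
    if job.priority <= ws.fence:
        reasons.append(f"priority {job.priority} does not exceed fence {ws.fence}")
    if ws.running_jobs >= ws.limit:
        reasons.append(f"limit {ws.limit} reached ({ws.running_jobs} running)")
    return reasons
```

An empty list means none of these weapons is involved; anything else is the list of suspects to check one by one in the Job Stream View.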
Job in FAIL
What should you check instead if Mr. Job went directly to FAIL status just after being in EXEC status? In this case the weapon is easy to spot: the job failed to start because of the wrong user. You need to go to the Workload Designer room to fix the job definition!
The final accusation
Regardless of the route you have followed so far, you can be proud of yourself: you MADE IT!
Sherlock Holmes would be proud of you. You stayed calm and saved Workload Automation by investigating the crime case. Like all respectable superheroes, you moved unnoticed until you found the perfect match between your hypothesis and the available data. You can relax a little bit… but be ready to play another mystery clue game soon!