Autosys dependency on mainframe job - scheduling

We have an Autosys job (let's call it job_a) that has a 3am time dependency and is also supposed to await successful completion of a mainframe job (job_m, which in our case is always successful). Job_m is run via the OPC scheduler on the mainframe, which communicates job completion to Autosys. It can run any time between 2am and 6am.
My understanding of how Autosys works is that it writes an entry into a table in its database when job_m completes, and when job_a checks its dependencies, it looks in this table to see the status of job_m. This status is not automatically cleared. As a result, the job dependency will always be met after the first ever successful run of job_m, even though we are only interested in job_m runs on the same day.
Day 1 4am: job_m completes
Day 1 4:01am: job_a runs, since Day 1 4am run of job_m was successful
Day 2 3am: job_a runs, since Day 1 4am run of job_m was successful
Day 2 5am: job_m completes
Our current proposed workaround is to have a job (job_c) that periodically checks the table and only completes if the status of job_m has changed in the last 6 hours.
Day 1 3am: job_c starts, sees no status change for job_m within the last 6 hours
Day 1 4am: job_m completes
Day 1 4:01am: job_c completes
Day 1 4:02am: job_a runs following completion of job_c
Day 2 3am: job_c starts, sees no status change for job_m within the last 6 hours
Day 2 5am: job_m completes
Day 2 5:01am: job_c completes
Day 2 5:02am: job_a runs following completion of job_c
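The check that job_c performs in the timeline above can be sketched as follows. This is only an illustration of the freshness test, written in Python; the function name is hypothetical, the dates are arbitrary, and how the last status-change time is actually read (from the Autosys database or a status report) is site-specific and not shown.

```python
from datetime import datetime, timedelta

# Hypothetical helper: how job_c could decide whether job_m's last
# status change is recent enough to count as the current day's run.
def changed_recently(last_change: datetime, now: datetime,
                     window: timedelta = timedelta(hours=6)) -> bool:
    """Return True if last_change falls within `window` before `now`."""
    return timedelta(0) <= now - last_change <= window

# Day 2, 3am: only the Day 1 4am completion is on record, so keep waiting.
day1_4am = datetime(2024, 1, 1, 4, 0)   # arbitrary example dates
day2_3am = datetime(2024, 1, 2, 3, 0)
stale = changed_recently(day1_4am, day2_3am)

# Day 2, 5:01am: job_m completed at 5am, so job_c may complete.
day2_5am = datetime(2024, 1, 2, 5, 0)
day2_501am = datetime(2024, 1, 2, 5, 1)
fresh = changed_recently(day2_5am, day2_501am)

print(stale, fresh)
```

The 6-hour window is taken from the workaround described above; it has to be longer than the 2am to 6am range in which job_m can finish, and shorter than the gap to the previous day's run.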
Is there an Autosys command that can be used to reset the status of job_m in the table? If not, is there a better method of enforcing this dependency than the one outlined above?

The solution depends on the version of Autosys you are using. If it is R11, the newest version, you can set look-back dependencies on job_a so that it only runs if job_c has run to SUCCESS within the last X hours.
In earlier versions you can run a job on the success of job_a that changes the status of job_c to INACTIVE. While job_c is INACTIVE, job_a sees that starting condition as FALSE, but job_c will still run the next time its own starting conditions are met.
The command is sendevent -E CHANGE_STATUS -s INACTIVE -J job_c. This command has to be run as the Autosys superuser account, and your Autosys admins may not allow this. It is also best practice to run sendevent commands on the Autosys Event Processor server, so that if you are running dual-server high availability and the system rolls over to single-server mode, the sendevent command still works after the rollover.
Example
insert_job: job_a job_type: c
command: do_something
machine: machine1
owner: my_id@machine1
condition: s(job_c)
date_conditions: 1
days_of_week: all
start_times: "03:00"

insert_job: job_c job_type: c
command: do_something_else
machine: machine1
owner: mainframe@machine1
description: "This is the mainframe job"

insert_job: job_d job_type: c
command: sendevent -E CHANGE_STATUS -s INACTIVE -J job_c
owner: superuser@autosys_server
machine: autosys_server
condition: s(job_a) and s(job_c)
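To see why the reset matters, here is a toy Python model of the status table. It is not Autosys code: the dictionary and function names are made up for illustration, and the real scheduler evaluates conditions in its own way. It only shows that s(job_c) stays true across days until something sets job_c back to INACTIVE.

```python
# Toy model of the job status table; names are illustrative only.
status = {"job_a": "INACTIVE", "job_c": "INACTIVE"}

def condition_met(job: str) -> bool:
    """Mimic s(job): true only while the recorded status is SUCCESS."""
    return status[job] == "SUCCESS"

def reset(job: str) -> None:
    """Stand-in for: sendevent -E CHANGE_STATUS -s INACTIVE -J <job>."""
    status[job] = "INACTIVE"

# Day 1: job_c succeeds, so job_a may start.
status["job_c"] = "SUCCESS"
day1_can_run = condition_met("job_c")

# Without a reset, the Day 1 success still satisfies the condition on Day 2.
day2_without_reset = condition_met("job_c")

# job_d runs after job_a succeeds and resets job_c.
reset("job_c")
day2_with_reset = condition_met("job_c")

print(day1_can_run, day2_without_reset, day2_with_reset)
```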

Kill APscheduler add_job based on id

We have a Flask script get_logs.py that uses APScheduler and contains the following job:
scheduler.add_job(id="create_recommendation_entries", trigger="interval", seconds=60*10, func=create_entries)
Someone ran the script, and the logs show that it is still running at 10-minute intervals even after it was terminated.
The process ID is not listed, grep does not find it, and we don't know whether it was started with nohup or gunicorn.
How do I kill this job based on id="create_recommendation_entries"? I don't know any of its details (port, PID, etc.).
Rerunning the script creates a different thread that stops on Ctrl+C, but the previous one is still running.

How to start a Snakemake workflow on AWS and detach?

I am trying to execute a Snakemake workflow on AWS, and have succeeded in executing my workflow using the command:
snakemake --tibanna --use-conda --default-remote-prefix=mybucket/myproject
and it works successfully. So far, so good. Unfortunately snakemake keeps running in the foreground in the terminal until the workflow ends. Using Ctrl-C on it ends the run. This is problematic for me when I want to run a pipeline that takes a few days.
Is there a way to run pipelines using snakemake --tibanna and detach and poll the results later?
I believe tibanna has the capability: tibanna run_workflow runs the workflow and detaches, and you can check the status later using tibanna stat. I just can't get snakemake to exit while leaving the processes scheduled in the cloud.

Missing log lines when writing to cloudwatch from ECS Docker containers

(Docker container on AWS-ECS exits before all the logs are printed to CloudWatch Logs)
Why are some streams of a CloudWatch Logs Group incomplete (i.e., the Fargate Docker Container exits successfully but the logs stop being updated abruptly)? Seeing this intermittently, in almost all log groups, however, not on every log stream/task run. I'm running on version 1.3.0
Description:
A Dockerfile runs node.js or Python scripts using the CMD command.
These are not servers/long-running processes, and my use case requires the containers to exit when the task completes.
Sample Dockerfile:
FROM node:6
WORKDIR /path/to/app/
COPY package*.json ./
RUN npm install
COPY . .
CMD [ "node", "run-this-script.js" ]
All the logs are printed correctly to my terminal's stdout/stderr when this command is run on the terminal locally with docker run.
To run these as ECS Tasks on Fargate, the log driver is set to awslogs from a CloudFormation template.
...
LogConfiguration:
  LogDriver: 'awslogs'
  Options:
    awslogs-group: !Sub '/ecs/ecs-tasks-${TaskName}'
    awslogs-region: !Ref AWS::Region
    awslogs-stream-prefix: ecs
...
Seeing that the CloudWatch Logs output is sometimes incomplete, I have run tests and checked every limit listed under CW Logs Limits, and I am certain the problem is not there.
I initially thought this was an issue with Node.js exiting before console.log() output is flushed, or with the process exiting too soon, but the same problem occurs when I use a different language as well. That makes me believe this is not an issue with the code, but rather with CloudWatch specifically.
Inducing delays in the code by adding a sleep timer has not worked for me.
It's possible that, since the Docker container exits immediately after the task completes, the logs don't get enough time to be written to CW Logs. But there must be a way to ensure that this doesn't happen?
sample logs:
incomplete stream:
{ "message": "configs to run", "data": {"dailyConfigs":["filename.json"]}}
running for filename
completed log stream:
{ "message": "configs to run", "data": {"dailyConfigs":["filename.json"]}}
running for filename
stdout: entered query_script
... <more log lines>
stderr:
real 0m23.394s
user 0m0.008s
sys 0m0.004s
(node:1) DeprecationWarning: PG.end is deprecated - please see the upgrade guide at https://node-postgres.com/guides/upgrading
UPDATE: This now appears to be fixed, so there is no need to implement the workaround described below
I've seen the same behaviour when using ECS Fargate containers to run Python scripts - and had the same resulting frustration!
I think it's due to CloudWatch Logs Agent publishing log events in batches:
How are log events batched?
A batch becomes full and is published when any of the following conditions are met:
The buffer_duration amount of time has passed since the first log event was added.
Less than batch_size of log events have been accumulated but adding the new log event exceeds the batch_size.
The number of log events has reached batch_count.
Log events from the batch don't span more than 24 hours, but adding the new log event exceeds the 24 hours constraint.
(Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html)
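If that explanation is right, the effect can be illustrated with a simplified Python model of the batching rules quoted above. This is a sketch, not the agent's actual code: the class, its methods, and the small limits used in the demo are made up for illustration, and the 24-hour rule is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LogBatcher:
    """Simplified model of the quoted batching rules (illustrative only)."""
    buffer_duration: float   # seconds to wait after the first pending event
    batch_count: int         # max number of events per batch
    batch_size: int          # max total bytes per batch
    pending: List[str] = field(default_factory=list)
    published: List[str] = field(default_factory=list)
    first_event_at: Optional[float] = None

    def add(self, message: str, now: float) -> None:
        size = len(message.encode("utf-8"))
        pending_bytes = sum(len(m.encode("utf-8")) for m in self.pending)
        # Publish first if adding this event would exceed batch_size.
        if self.pending and pending_bytes + size > self.batch_size:
            self.flush()
        if not self.pending:
            self.first_event_at = now
        self.pending.append(message)
        # Publish once the number of events reaches batch_count.
        if len(self.pending) >= self.batch_count:
            self.flush()

    def tick(self, now: float) -> None:
        # Publish once buffer_duration has passed since the first pending event.
        if self.pending and now - self.first_event_at >= self.buffer_duration:
            self.flush()

    def flush(self) -> None:
        self.published.extend(self.pending)
        self.pending.clear()
        self.first_event_at = None

batcher = LogBatcher(buffer_duration=5.0, batch_count=3, batch_size=1024)
batcher.add("configs to run", now=0.0)
batcher.tick(now=6.0)   # 6 s after the first event: the batch is published
batcher.add("running for filename", now=6.5)
batcher.tick(now=7.0)   # the container exits here, 0.5 s after the last event

print(batcher.published)
print(batcher.pending)
```

In this model the second message is still sitting in the buffer when the process ends, which matches the symptom of a log stream that stops abruptly.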
So a possible explanation is that log events are buffered by the agent but not yet published when the ECS task is stopped. (And if so, that seems like an ECS issue - any AWS ECS engineers willing to give their perspective on this...?)
There doesn't seem to be a direct way to ensure the logs are published, but it does suggest one could wait at least buffer_duration seconds (by default, 5 seconds), and any prior logs should be published.
With a bit of testing that I'll describe below, here's a workaround I landed on. A shell script run_then_wait.sh wraps the command to trigger the Python script, to add a sleep after the script completes.
Dockerfile
FROM python:3.7-alpine
ADD run_then_wait.sh .
ADD main.py .
# The original command
# ENTRYPOINT ["python", "main.py"]
# To run the original command and then wait
ENTRYPOINT ["sh", "run_then_wait.sh", "python", "main.py"]
run_then_wait.sh
#!/bin/sh
set -e
# Wait 10 seconds on exit: twice the `buffer_duration` default of 5 seconds
trap 'echo "Waiting for logs to flush to CloudWatch Logs..."; sleep 10' EXIT
# Run the given command
"$@"
main.py
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

if __name__ == "__main__":
    # After testing some random values, had most luck to induce the
    # issue by sleeping 9 seconds here; would occur ~30% of the time
    time.sleep(9)
    logger.info("Hello world")
Hopefully the approach can be adapted to your situation. You could also implement the sleep inside your script, but it can be trickier to ensure it happens regardless of how it terminates.
It's hard to prove that the proposed explanation is accurate, so I used the above code to test whether the workaround was effective. The test was the original command vs. with run_then_wait.sh, 30 runs each. The results were that the issue was observed 30% of the time, vs 0% of the time, respectively. Hope this is similarly effective for you!
Just contacted AWS support about this issue and here is their response:
...
Based on that case, I can see that this occurs for containers in a
Fargate Task that exit quickly after outputting to stdout/stderr. It
seems to be related to how the awslogs driver works, and how Docker in
Fargate communicates to the CW endpoint.
Looking at our internal tickets for the same, I can see that our
service team are still working to get a permanent resolution for this
reported bug. Unfortunately, there is no ETA shared for when the fix
will be deployed. However, I've taken this opportunity to add this
case to the internal ticket to inform the team of the similar and try
to expedite the process
In the meantime, this can be avoided by extending the lifetime of the
exiting container by adding a delay (~>10 seconds) between the logging
output of the application and the exit of the process (exit of the
container).
...
Update:
Contacted AWS around August 1st, 2019, they say this issue has been fixed.
I observed this as well. It must be an ECS bug?
My workaround (Python 3.7):
import atexit
import logging
from time import sleep

logger = logging.getLogger()

def finalizer():
    logger.info("All tasks have finished. Exiting.")
    # Workaround:
    # Fargate will exit and final batch of CloudWatch logs will be lost
    sleep(10)

# Register only after finalizer has been defined
atexit.register(finalizer)
I had the same problem with flushing logs to CloudWatch.
Following asavoy's answer I switched from exec form to shell form of the ENTRYPOINT and added a 10 sec sleep at the end.
Before:
ENTRYPOINT ["java","-jar","/app.jar"]
After:
ENTRYPOINT java -jar /app.jar; sleep 10

All jobs failing in C COMPSs execution

I have downloaded COMPSs 1.4 and some test programs from http://www.bsc.es/computer-sciences/grid-computing/comp-superscalar/downloads-and-documentation and I am trying to test them. Java executions went fine; however, I am having problems with C.
I am currently trying to execute the Simple application. The Readme states that I only need two commands:
buildapp simple
runcompss --lang=c master/simple 1
The app builds fine, but when executing with this command, I get the following error:
[ERRMGR] - WARNING: Job 1 for running task 1 on worker localhost has failed; resubmitting task to the same worker.
[ERRMGR] - WARNING: Task 1 execution on worker localhost has failed; rescheduling task execution. (changing worker)
[ERRMGR] - WARNING: No task could be scheduled to any of the available resources.
This could end up blocking COMPSs. Will check it again in 20 seconds.
Possible causes:
-Network problems: non-reachable nodes, sshd service not started, etc.
-There isn't any computing resource that fits the defined tasks constraints.
If this happens 2 more times, the runtime will shutdown.
After 3 checks, the execution ends with no results. Is there something I am missing?
When running an application with the C binding, the default project.xml is not valid: you have to define a project.xml that specifies where the worker binaries are deployed on each host.
<Project>
  <Worker Name="localhost">
    <InstallDir>/opt/COMPSs/Runtime/scripts/system/</InstallDir>
    <WorkingDir>[/path/to/dir/used_as_working_dir]</WorkingDir>
    <AppDir>[/path/to/installation]</AppDir>
    <LimitOfTasks>4</LimitOfTasks>
  </Worker>
</Project>

How to use Tivix django-cron app

I have the exact same problem described in this post, but the answer doesn't help at all. In short, I am using Tivix django-cron, and the cron job is not running on a regular basis.
To illustrate the problem, the following cron job class is intended to send an email every minute once the runcrons command has been run. In fact, it sends out only one email and no more. That defeats the purpose of cron... What am I missing?
from django.core.mail import send_mail
from django_cron import CronJobBase, Schedule

class TestCron(CronJobBase):
    schedule = Schedule(run_every_mins=1)
    code = 'test_cron_philip'

    def do(self):
        send_mail('cron test', 'body is test body', 'coach_zhong@163.com',
                  ['admin@dessert.webfactional.com'], fail_silently=False)
Yes, you are missing something: "runcrons" is not a background daemon. From the documentation:
"Now everytime you run the management command python manage.py
runcrons all the crons will run if required. Depending on the
application the management command can be called from the Unix crontab
as often as required. Every 5 minutes usually works for most of my
applications."
That means you have to put the "runcrons" command in your crontab.
Example:
Say you have a CronJob that does something every 30 minutes.
To get it running, you must edit your crontab (Linux, Mac) or Task Scheduler (Windows) to run "python manage.py runcrons" at a shorter interval, let's say every minute.
Once that is in place, your CronJob will be checked every minute and run when necessary (every 30 minutes, or whatever value you have set).
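The "run if required" behaviour can be pictured with a small Python simulation. This is a hypothetical model, not django-cron's actual implementation: is_due and the loop are made up for illustration, with the loop standing in for crontab invoking runcrons once a minute.

```python
from datetime import datetime, timedelta
from typing import Optional

def is_due(last_run: Optional[datetime], now: datetime, run_every_mins: int) -> bool:
    """Hypothetical 'run if required' check: due if never run or interval elapsed."""
    if last_run is None:
        return True
    return now - last_run >= timedelta(minutes=run_every_mins)

# Simulate one hour of crontab calling runcrons every minute,
# for a job scheduled to run every 30 minutes.
start = datetime(2024, 1, 1, 0, 0)   # arbitrary start time
last_run = None
runs = []
for minute in range(60):
    now = start + timedelta(minutes=minute)
    if is_due(last_run, now, run_every_mins=30):
        runs.append(minute)
        last_run = now

print(runs)
```

Calling the check only once, as in the question, corresponds to a single pass through this loop, which is why only one email is sent.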
Hope this helps.