comprehensive retry schedule with different time intervals in Airflow - airflow-scheduler

i need to implement a comprehensive retry schedule in Airflow, say 3x with 5min interval and then 4x with 2h interval. what is the best way to do so? i am thinking to run the same task with the second retry schedule from on failure callback of the task with the first retry schedule, but the problem is that in that case i cannot (or don't know how) to make downstream task(s) depend on the task executed from callback only if it was started in the first place. maybe my idea is totally wrong. any advice?

Related

Schedule Task ECS Fargate at interval time

I have a doubt. I try to use ECS Schedule Task.
Mi first impression is the next... I can't say to Schedule task please run N task for two hours and after that stop them. In the options I have rate or cron but none of that options can't help me because the goal of them is run N task every N (minutes,hours,days, etc) and when the main proccess of the task it's finished, stop.
But I want to run N task for a specific time. Because mi task is an API but I need the API only in specific interval of time.
Does anyone know how to solve this? Or really the task schedules are not the best solution to achieve what I want?
I think you can achieve this using step function and aws lambda with boto3 client.
Stepfunctions have wait until specific time future for your usecase.

Best way to schedule tasks to run in the future dynamically using celery + sqs

So I'm struggling to figure out the optimal way to schedule some events to happen at some point in the future using celery. An example of this is when a new user has registered, we want to send them an email the next day.
We have celery setup and some of you may allude to the eta parameter when calling apply_async. However that won't work for us, as we use SQS which has a visibility timeout that would conflict and generally the eta param shouldn't be used for lengthy periods.
One solution we've implemented at this point is to create events and store them in the database with a 'to-process' timestamp (refers to when to process the event). We use the celery beat scheduler with a task that runs literally every second to see if there are any new events that are ready to process. If there are, we carry out the subsequent tasks.
This solution works, although it doesn't feel great since we're queueing a task every second on SQS. Any thoughts or ideas on this would be great?

How good is lambda functions to hit a REST API after a fixed amount of time?

I am using distributed scheduler 'Chronos'(distributed crontab) to hit a REST API after few minute of job addition(example: Add job at time T to schedule it at T+5minutes).This run on a bigger infrastructure and take care of fault-tolerant and no-data loss, however it has significant cost and I am thinking some alternative to the similar requirement. Please help if it can be done using a lambda function.
Its possible to do invoke a lambda function, block/wait for X seconds and continue execution, but not recommended. You cannot wait for more than 300 seconds though as thats the max timeout legally allowed by Lambda functions.
Moreover, you will hit concurrent execution limits from AWS and will need to keep calling AWS support to increase your concurrent execution limits.
Another approach to solve this problem could be to use Actor based system such as Akka, to create an Actor for each job and do the needful.

Is there a way to set a walltime on AWS Batch jobs?

Is there a way to set a maximum running time for AWS Batch jobs (or queues)? This is a standard setting in most batch managers, which avoids wasting resources when a job hangs for whatever reason.
As of April, 2018, AWS Batch now supports setting a Job Timeout when submitting a Job, or in the job definition.
https://aws.amazon.com/about-aws/whats-new/2018/04/aws-batch-adds-support-for-automatic-termination-with-job-execution-timeout/
You specify an attemptDurationSeconds parameter, which must be at least 60 seconds, either in your job definition, or when you submit the job. When this number of seconds has passed following the job attempt's startedAt timestamp, AWS Batch terminates the job. On the compute resource, your job's container receives a SIGTERM signal to give your application a chance to shut down gracefully; if the container is still running after 30 seconds, a SIGKILL signal is sent to forcefully shut down the container.
Source: https://docs.aws.amazon.com/batch/latest/userguide/job_timeouts.html
POST /v1/submitjob HTTP/1.1
Content-type: application/json
{
...
"timeout": {
"attemptDurationSeconds": number
}
}
AFAIK there is no feature to do this. However, a workaround was suggested in the forum for a similar question.
One idea is to call Batch as an Activity from Step Functions, pingback
back on a schedule (e.g. every minute) from that job. If it stops
responding then you can detect that situation as a Timeout in the
activity and act accordingly (terminate the job etc.). Not an ideal
solution (especially if the job continues to ping back as a "zombie"),
but it's a start. You'd also likely have to store activity tokens in a
database to trace them to Batch job id.
Alternatively, you split that setup into 2 steps, and schedule a Batch
job from a Lambda in the first state, then pass the Batch job id to
the second step which then polls Batch (from another Lambda) for its
state with Retry and IntervalSeconds (e.g. once every minute, or even
with exponential backoff), and MaxAttempts calculated based on your
timeout. This way, you don't need any external state storage
mechanism, long polling or even a "ping back" from the job (it CAN be
a zombie), but the downside is more steps.
There is no option to set timeout on batch job but you can setup a lambda function that triggers every 1 hour or so and deletes jobs created before say 24 hours.
working with aws for some time now and could not find a way to set a maximum running time for batch jobs.
However there are some alternative way which you could utilize.
AWS Forum
Sadly there is no way to set the limit execution time on AWS Batch.
One solution may be to edit the docker's entry point to schedule the execution time limit.

Heroku Scheduler - why enqueue long-running jobs

The Heroku Scheduler documentation says:
Scheduled jobs are meant to execute short running tasks or enqueue longer running tasks into a background job queue. Anything that takes longer than a couple of minutes to complete should use a worker process to run
If the Scheduler starts a new dyno for these jobs and the cost is the same for a dyno vs. a worker, what is the advantage to adding a task to the queue and having a worker process run it?
It is an architectural best practice to only schedule, and not execute, interval tasks on the scheduler task (or your own custom clock process). The motivation for this is explained in the scheduled jobs article but, to summarize, you want your scheduler process/task to be as light-weight as possible since there should only be one of them. When you start overloading scheduling with execution you often run into schedule conflicts and erratic behavior.
Imagine that one interval job hangs, or takes much longer than expected. If your intervals are tight enough this will start causing a backlog and future intervals could be pushed back or skipped all together.
Also, it is just wise to keep component responsibilities as separated as possible - not having a single component be responsible for orthogonal tasks. This is a common design practice which is reflected in the scheduled job use-case by keeping scheduling and execution independent.
Best practices aside, if you're in development or bootstrap mode and understand the consequences stated above you can certainly choose to ignore such advice and run everything within the scheduler task. Just be careful for hard to debug job conflicts or apparent duplication.
Well, I think this is just a recommendation. If you have a task which is ran by Scheduler and you'll run this task manually (in the Heroku administration), you'll get an error - this error is caused by timeout (because each task has limit 30s). But in fact, this task will not be interrupted - the task is gonna be finished correctly.
If you have 1 dyno, so this one dyno use Heroku for your application. If you run some scheduled job, so this dyno gonna be taken be the Scheduler -> if you have long-time running task, your page will be "idle" (not correctly working till the time, when the scheduled job will be finished).