Airflow scheduler doesn't work for monthly jobs schedule - airflow-scheduler

I am trying to schedule a monthly airflow job. I kept start date as
'start_date':datetime(2020,9,23),
which is today's date in the previous month, because of the 'start_date + schedule_interval' rule. I kept my schedule interval as:
schedule_interval="20 9 23 * *"
By this logic the job should run on 2020-10-23 at 09:20 UTC, but I don't know why it's not running or even creating an instance. I believe I did everything right: I set the start date to one month earlier and even tried catchup=True, but it doesn't help.
The job does run if I use a daily schedule, for example:
'start_date':airflow.utils.dates.days_ago(1)
and schedule interval as:
schedule_interval="20 9 * * *"
and it works fine. It ran a job today at 09:20 UTC.
Note: I have run the job manually before, so its last execution date is something else. Can that be the problem? If so, how can I resolve it, or will I have to create a new job?

Changing the schedule_interval can cause problems, and it's recommended to create a new DAG instead. See Common Pitfalls on the Apache Airflow Confluence:
When needing to change your start_date and schedule interval, change the name of the dag (a.k.a. dag_id) - I follow the convention: my_dag_v1, my_dag_v2, my_dag_v3, my_dag_v4, etc...
Changing schedule interval always requires changing the dag_id, because previously run TaskInstances will not align with the new schedule interval
Changing start_date without changing schedule_interval is safe, but changing to an earlier start_date will not create any new DagRuns for the time between the new start_date and the old one, so tasks will not automatically backfill to the new dates. If you manually create DagRuns, tasks will be scheduled, as long as the DagRun date is after both the task start_date and the dag start_date.
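As a rough illustration of the 'start_date + schedule_interval' rule mentioned in the question, here is a minimal sketch in plain Python. It is not Airflow code: monthly_points is a hypothetical helper, and the sketch assumes the classic behaviour where the run for a schedule period is triggered only once that period has ended.

```python
from datetime import datetime

def monthly_points(start, day, hour, minute, count):
    """Return the first `count` monthly schedule points at or after `start`.

    Hypothetical helper for illustration; assumes `day` exists in every
    month (true for the 23rd used in the question).
    """
    points = []
    year, month = start.year, start.month
    while len(points) < count:
        candidate = datetime(year, month, day, hour, minute)
        if candidate >= start:
            points.append(candidate)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return points

# Values from the question: start_date 2020-09-23, cron "20 9 23 * *"
start_date = datetime(2020, 9, 23)
points = monthly_points(start_date, day=23, hour=9, minute=20, count=3)

# The run covering [points[0], points[1]) is triggered at points[1].
first_execution_date, first_trigger_time = points[0], points[1]
print(first_execution_date, first_trigger_time)
# 2020-09-23 09:20:00 2020-10-23 09:20:00
```

Under these assumptions the first scheduled run is expected on 2020-10-23 at 09:20 UTC, with an execution date one month earlier.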

Related

Google Cloud Scheduler trigger last day of month

I want to create a trigger in Google Cloud Scheduler that runs at 9am on the 25th and on the last day of each month (depending on the month, that would be the 28th, 29th, 30th, or 31st).
I assumed something like this might work, but GCP does not understand the L-syntax:
0 9 25,L * *
Any (elegant) ideas how to do it without having multiple triggers?
One trigger, with minimized overhead calls:
0 0 25,28-31 * *
Then, inside the function:
IF is25() OR islastDayOfMonthHelper()
work
ELSE
return
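The guard in the pseudocode above could look roughly like this in Python. The function names are hypothetical stand-ins for is25() and islastDayOfMonthHelper(), and the sketch assumes the function receives (or computes) the current date in the scheduler's time zone.

```python
import calendar
from datetime import date

def is_last_day_of_month(d):
    """True if `d` is the final day of its month (handles leap years)."""
    return d.day == calendar.monthrange(d.year, d.month)[1]

def should_run(d):
    """Mirror of the pseudocode: run on the 25th or on the last day of the month."""
    return d.day == 25 or is_last_day_of_month(d)

# The trigger fires on the 25th and on the 28th-31st; only some of those pass.
for d in (date(2021, 2, 28), date(2020, 2, 28), date(2020, 2, 29), date(2021, 1, 30)):
    print(d, should_run(d))
```

With this check, a trigger firing on the 28th of a 31-day month simply returns without doing any work.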
I just want to mention two alternative options I see for the end of month part of the question.
Simply run the function just past midnight on the 1st of each month. Depending on your use case this may be good enough.
Reschedule the function each month to the specific date which will be the last day of the next month.

Cloud scheduler trigger on the first monday of each month

I'm trying to schedule a job to trigger on the first Monday of each month:
This is the cron expression I got: 0 5 1-7 * 1
(which, as far as I can read Unix cron expressions, triggers at 5:00 am on Monday if it happens to be in the first 7 days of the month)
However, the job is triggered on what seem to be random days at 5:00 am. The job was triggered today, on the 16th of August!
Am I reading the expression completely wrong? BTW, I'm setting the timezone to AEST, if that makes a difference.
You can use the legacy cron syntax to describe the schedule.
For your case, specify something like below:
"first monday of month 05:00"
Do explore the "Custom interval" tab in the provided link to get a better understanding of this.
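The behaviour described in the question is consistent with standard Unix cron, where restricting both the day-of-month and day-of-week fields makes the job run when either field matches, not only when both do. A small Python sketch of that rule follows, together with a guard that could be used inside the job if the unix-cron format is kept. Both function names are hypothetical, and the year 2021 is an assumption (the question gives none), chosen because 16 August 2021 was a Monday.

```python
from datetime import date

def cron_day_matches(d, days_of_month, days_of_week):
    """Standard cron rule when BOTH day fields are restricted: either may match."""
    cron_dow = (d.weekday() + 1) % 7  # cron: 0=Sunday, 1=Monday; Python: 0=Monday
    return d.day in days_of_month or cron_dow in days_of_week

def is_first_monday(d):
    """Guard for the job itself: a Monday that falls within the first 7 days."""
    return d.weekday() == 0 and d.day <= 7

# "0 5 1-7 * 1": days 1-7 OR any Monday
for d in (date(2021, 8, 2), date(2021, 8, 3), date(2021, 8, 16)):
    print(d, cron_day_matches(d, range(1, 8), {1}), is_first_monday(d))
```

Under this reading, "0 5 1-7 * 1" fires on every day from the 1st to the 7th and also on every Monday, which explains a trigger on the 16th.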

Django periodic task in period of time

I want to create complex tasks that let me control when a task starts, when it finishes, and how often it repeats.
Example: I want to create a task whose description is "call 10 clients". The task is assigned to user X with a quantity of 10, it starts on 1/1/2018 and finishes on 30/12/2018, and it repeats weekly.
Hint: by "repeat" I mean that each week I take the quantity the user actually completed, divide it by the quantity of the task, and reset the user's quantity to zero for the next week, and so on.
In detail: user X starts work on Saturday with the target of calling 10 clients by the end of the week. At the end of the week I calculate the number of clients he called, and the next week starts again from scratch with a new task, and so on.
What is the best way to do this in Django? Is there a way to do it with Celery? If so, can it also be done from the Django Admin?
Thank you

What is the difference between 'Interval' and 'Cron' triggers in APScheduler?

I am using APScheduler for my project. I went through the APScheduler documentation, but I am not able to understand what the actual difference is between the 'interval' and 'cron' triggers. The following definitions are given in the docs:
interval: use when you want to run the job at fixed intervals of time
cron: use when you want to run the job periodically at certain time(s) of day
With interval, you can specify that the job should run say every 15 minutes. A fixed amount of time between each run and that's it.
With cron, you can tell it to run on every second Tuesday at 9am, or every day at noon, or on every 1st of January at 7pm. In cron, you define the minute, hour, day of month, month, day of week (e.g. Monday) and year where it should run, and you can assign periodicity to any of those (i.e. every Monday, or every fifth minute).
Anything you can achieve with interval can also be achieved with cron I think, but not the other way around.
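To make the distinction concrete, here is a minimal sketch in plain Python. It does not use or reproduce APScheduler's API; next_interval_run and next_daily_run are hypothetical helpers that only illustrate how the two kinds of trigger decide the next run time.

```python
from datetime import datetime, timedelta

def next_interval_run(previous_run, every):
    """Interval style: the next run is a fixed duration after the previous one."""
    return previous_run + every

def next_daily_run(now, hour, minute):
    """Cron style: the next run is the next wall-clock time matching the fields."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed, use tomorrow's
    return candidate

now = datetime(2024, 1, 1, 10, 7)
print(next_interval_run(now, timedelta(minutes=15)))  # 2024-01-01 10:22:00
print(next_daily_run(now, hour=12, minute=0))         # 2024-01-01 12:00:00
```

The interval result depends on when the job last ran, while the cron-style result depends only on the clock.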

Modifying an existing, timezone-naive scheduler to deal with daylight savings time?

We currently have a timezone-unaware scheduler in pure Python.
It uses a heapq (a python binary heap) of ordered events, containing a time, callback and arguments for the callback. It gets the least-valued time from the heapq, computes the number of seconds until the event is to occur, and sleeps that number of seconds before running the job.
We don't need to worry about computers being suspended; this is to run on a dedicated server, not a laptop.
We'd like to make the scheduler cope well with timezone changes, so we don't have a problem in November like we did recently (we had an important job that had to be adjusted in the database to make it run at 8:15AM instead of 9:15AM - normally it runs at 8:15AM). I'm thinking we could:
Store all times in UTC.
Make the scheduler sleep 1 minute and test, in a loop, recomputing “now” each time, and doing a <= comparison against job datetimes.
Jobs that run more frequently than once an hour should “just run normally”.
Hourly jobs that run between 2:00AM and 2:59AM (inclusive) on a time change day should probably skip an hour for PST->PDT, and run an extra time for PDT->PST.
Jobs that run less often than hourly should probably avoid rerunning in either case on days that have a time change.
Does that sound about right? Where might it be off?
Thanks!
I've written about scheduling a few times before with respect to other programming languages. The concepts are valid for Python as well.
I'll try to address the specific points again, from a Python perspective:
It's important to separate the recurrence pattern from the execution time. The recurrence pattern should store the time as the user would enter it, which is usually a local time. Even if the recurrence pattern is "just one time", that should still be stored as local time. Scheduling is one of a handful of use cases where the common advice of "always work in UTC" does not hold up!
You will also need to store the time zone identifier. These should be IANA time zones, such as America/Los_Angeles or Europe/London. In Python, you can use the pytz library to work with time zones like these.
The execution time should indeed be based on UTC. The next execution time for any event should be calculated from the local time in the recurrence pattern. You may wish to calculate and store these execution times in advance, such that you can easily determine which are the next events to run.
You should be prepared to recalculate these execution times. You may wish to do it periodically, but at minimum it should be done any time you apply a time zone update to your system. You can (and should) subscribe to tz update announcements from IANA, and then look for corresponding pytz updates on PyPI.
Think of it this way. When you convert a local time to UTC, you're assuming that you know what the time zone rules will be at that point in time, but nobody can predict what governments will do in the future. Time zone rules can change, and they often do. You need to take that into consideration.
You should test for invalid and ambiguous times, and have a plan for dealing with them. These are easy to hit when scheduling, especially with recurring events.
For example, you might schedule a task to run at 2:00 AM every day - but on the day of the spring-forward transition that time doesn't exist. So what should you do? In many cases, you'll want to run at 3:00 AM on that day, since it's the next time after 1:59 AM. But in some (rarer) contexts, you might run at 1:00 AM, or at 1:59 AM, or just skip that day entirely.
Likewise, you might schedule a task to run at 1:00 AM every day, but on the day of the fall-back transition, 1:00 AM occurs twice. So what do you do? In many cases, the first instance (which is the daylight instance) is the right time to fire. In other (rarer) cases, the second instance may be more appropriate, or (even rarer) it might be appropriate to actually run the job twice.
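The points above can be sketched as follows. This is a minimal illustration, not a complete scheduler: it uses the standard-library zoneinfo module (Python 3.9+, and it needs time zone data available on the system) in place of the pytz library mentioned above, local_to_utc is a hypothetical helper, and the policy for invalid and ambiguous times is just the "many cases" choice described above.

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo

def local_to_utc(day, wall_time, tz_name):
    """Convert a local recurrence time to a UTC execution time.

    Returns (utc_datetime, status), where status is "ok", "invalid"
    (the wall time is skipped by spring-forward) or "ambiguous"
    (the wall time occurs twice because of fall-back).
    """
    tz = ZoneInfo(tz_name)
    naive = datetime.combine(day, wall_time)
    first = naive.replace(tzinfo=tz, fold=0)   # earlier interpretation
    second = naive.replace(tzinfo=tz, fold=1)  # later interpretation
    # A wall time that does not exist changes when round-tripped through UTC.
    round_trip = first.astimezone(timezone.utc).astimezone(tz)
    if round_trip.replace(tzinfo=None) != naive:
        status = "invalid"
    elif first.utcoffset() != second.utcoffset():
        status = "ambiguous"
    else:
        status = "ok"
    # fold=0 applies the pre-transition offset: an invalid time is shifted
    # forward by the size of the gap (2:00 AM becomes 3:00 AM), and an
    # ambiguous time resolves to its first (daylight) occurrence.
    return first.astimezone(timezone.utc), status

print(local_to_utc(date(2021, 3, 14), time(2, 0), "America/Los_Angeles"))
print(local_to_utc(date(2021, 11, 7), time(1, 0), "America/Los_Angeles"))
```

Returning the status alongside the UTC time leaves room for the rarer policies, such as skipping the day or running twice.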
With regard to jobs that run on an every X [hours/minutes/seconds] type schedule:
These are easiest to schedule by UTC, and should not be affected by DST changes.
If these are the only types of jobs you are running, you can just base your whole system on UTC. But if you're running a mix of different types of jobs, then you might consider just setting the "local time zone" to be "UTC" in the recurrence pattern.
Alternatively, you could just schedule them by a true local time; just make sure that when the job runs, it calculates the next execution time based on the current execution time, which should already be in UTC.
You shouldn't distinguish between jobs that run more often than hourly and jobs that run less often than hourly. I would expect an hourly job to run 25 times on the day of a fall-back transition, and 23 times on the day of a spring-forward transition.
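The 25 and 23 figures can be checked with a short sketch. It assumes an hourly job stepped in UTC and counts how many runs fall inside one local calendar day; hourly_runs_on_local_day is a hypothetical helper, and zoneinfo (Python 3.9+) stands in for pytz.

```python
from datetime import date, datetime, timedelta, timezone
from zoneinfo import ZoneInfo

def hourly_runs_on_local_day(day, tz_name):
    """List the UTC run times of an hourly job that fall within one local day."""
    tz = ZoneInfo(tz_name)
    next_day = day + timedelta(days=1)
    # Local midnight at both ends of the day, converted to UTC.
    start = datetime(day.year, day.month, day.day, tzinfo=tz).astimezone(timezone.utc)
    end = datetime(next_day.year, next_day.month, next_day.day, tzinfo=tz).astimezone(timezone.utc)
    runs = []
    current = start
    while current < end:
        runs.append(current)
        current += timedelta(hours=1)  # stepping in UTC is unaffected by DST
    return runs

print(len(hourly_runs_on_local_day(date(2021, 11, 7), "America/Los_Angeles")))  # 25
print(len(hourly_runs_on_local_day(date(2021, 3, 14), "America/Los_Angeles")))  # 23
```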
With regard to your plan to sleep and wake up once per minute in a loop - that will probably work, as long as you don't have sub-minute tasks to deal with. It may not necessarily be the most efficient way to deal with it though. If you properly pre-calculate and store the execution times, you could just set a single task to wake up at the next time to run, run everything that needs to run, then set a new task for the next execution time. You don't necessarily have to wake up once per minute.
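A minimal sketch of that "sleep until the next execution time" approach, built on the heapq design described in the question, might look like this. The Scheduler class and its method names are hypothetical; execution times are assumed to be precomputed UTC timestamps in seconds, and the clock and sleep functions are injectable only so the behaviour can be exercised without waiting.

```python
import heapq
import itertools
import time

class Scheduler:
    def __init__(self, clock=time.time, sleep=time.sleep):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker so callbacks are never compared
        self._clock = clock
        self._sleep = sleep

    def add(self, run_at_utc, callback, *args):
        """Queue `callback(*args)` for the given UTC timestamp (seconds)."""
        heapq.heappush(self._heap, (run_at_utc, next(self._counter), callback, args))

    def run_pending(self):
        """Sleep until the earliest event, then run everything that is due."""
        if not self._heap:
            return
        delay = self._heap[0][0] - self._clock()
        if delay > 0:
            self._sleep(delay)
        now = self._clock()
        while self._heap and self._heap[0][0] <= now:
            _, _, callback, args = heapq.heappop(self._heap)
            callback(*args)

# Usage with a fake clock, so the example finishes immediately.
fake_now = [100.0]

def fake_sleep(seconds):
    fake_now[0] += seconds

ran = []
scheduler = Scheduler(clock=lambda: fake_now[0], sleep=fake_sleep)
scheduler.add(105.0, ran.append, "a")
scheduler.add(105.0, ran.append, "b")
scheduler.add(200.0, ran.append, "c")
scheduler.run_pending()
print(ran)  # ['a', 'b']
```

A real loop would call run_pending() repeatedly and would also need to wake early when a new, earlier event is added, which this sketch does not handle.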
You should also think about the resources you will need to run the scheduled jobs. What happens if you schedule 1000 tasks that all need to run at midnight? Well they won't necessarily all be able to run simultaneously on a single computer. You might queue them up to run in batches, or spread out the load into different time slots. In a cloud environment perhaps you spin up additional workers to handle the load.