Airflow: Dag scheduled twice a few seconds apart - concurrency

I am trying to run a DAG only once a day at 00:15:00 (midnight 15 minutes), yet, it's being scheduled twice, a few seconds apart.
dag = DAG(
    'my_dag',
    default_args=default_args,
    start_date=airflow.utils.dates.days_ago(1) - timedelta(minutes=10),
    schedule_interval='15 0 * * * *',
    concurrency=1,
    max_active_runs=1,
    retries=3,
    catchup=False,
)
The main goal of the DAG is to check for new emails, then check for new files in an SFTP directory, and then run a "merger" task to add those new files to a database.
All the jobs are Kubernetes pods:
email_check = KubernetesPodOperator(
    namespace='default',
    image="g.io/email-check:0d334adb",
    name="email-check",
    task_id="email-check",
    get_logs=True,
    dag=dag,
)
sftp_check = KubernetesPodOperator(
    namespace='default',
    image="g.io/sftp-check:0d334adb",
    name="sftp-check",
    task_id="sftp-check",
    get_logs=True,
    dag=dag,
)
my_runner = KubernetesPodOperator(
    namespace='default',
    image="g.io/my-runner:0d334adb",
    name="my-runner",
    task_id="my-runner",
    get_logs=True,
    dag=dag,
)
my_runner.set_upstream([sftp_check, email_check])
So, the issue is that there seems to be two runs of the DAG scheduled a few seconds apart. They do not run concurrently, but as soon as the first one is done, the second one kicks off.
The problem here is that the my_runner job is intended to run only once a day: it tries to create a file with the date as a suffix, and if the file already exists it throws an exception. So that second run always fails, because the file for the day has already been created by the first run.
Since an image (or two) is worth a thousand words, here they are:
You'll see that there's a first run that is scheduled "22 seconds after 00:15" (that's fine... sometimes it varies a couple of seconds here and there) and then there's a second one that always seems to be scheduled "58 seconds after 00:15 UTC" (at least according to the name they get). So the first one runs fine, nothing else seems to be running... And as soon as it finishes the run, a second run (the one scheduled at 00:15:58) starts (and fails).
A "good" one:
A "bad" one:

Can you check the schedule_interval parameter?
schedule_interval='15 0 * * * *'. A standard cron schedule takes only 5 fields, and I see an extra star.
Also, can you use a fixed start_date?
start_date=datetime(2019, 11, 10)
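A quick sanity check for this kind of mistake is to count the fields before deploying; a trivial sketch in plain Python (no Airflow needed):

```python
def cron_field_count(expr: str) -> int:
    """Count the whitespace-separated fields in a cron expression."""
    return len(expr.split())

# Standard cron takes exactly 5 fields (minute, hour, day of month, month, day of week).
assert cron_field_count('15 0 * * * *') == 6  # the schedule from the question: one field too many
assert cron_field_count('15 0 * * *') == 5    # daily at 00:15
```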

It looks like setting the start_date to 2 days ago instead of 1 did the trick
dag = DAG(
    'my_dag',
    ...
    start_date=airflow.utils.dates.days_ago(2),
    ...
)
I don't know why.
I just have a theory. Maaaaaaybe (big maybe) the issue was this: because days_ago(...) sets a UTC datetime with hour/minute/second at 0 and then subtracts the number of days given in the argument, just saying "one day ago" or even "one day and 10 minutes ago" didn't put the start_date past the next period boundary (00:15), and that somehow confused Airflow?
Let’s Repeat That: the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
https://airflow.readthedocs.io/en/stable/scheduler.html#scheduling-triggers
So, the end of the period would be 00:15... If my theory is correct, using airflow.utils.dates.days_ago(1) - timedelta(minutes=16) would probably also work.
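The theory can be checked without Airflow. Here is a sketch that reimplements the documented behaviour of days_ago (midnight UTC, n days back; this is an illustration, not the real function) just to see where those start_date values actually land:

```python
from datetime import datetime, timedelta, timezone

def days_ago(n):
    # Mimics the documented behaviour of airflow.utils.dates.days_ago:
    # midnight UTC, n days back.
    midnight = datetime.now(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=n)

# The original start_date: lands at 23:50 UTC, two days back,
# i.e. only 25 minutes short of the day-old 00:15 schedule boundary.
start = days_ago(1) - timedelta(minutes=10)
assert (start.hour, start.minute) == (23, 50)

# The fix that "did the trick" pushes the start a full day further back.
assert days_ago(2) == days_ago(1) - timedelta(days=1)
```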
This doesn't explain why if I set a date very far in the past, it just doesn't run, though. ¯\_(ツ)_/¯

Related

Understanding Airflow execution_date when property 'catchup=false'

I am trying to see how Airflow sets execution_date for any DAG. I have set catchup=False in the DAG. Here is my DAG:
dag = DAG(
    'child',
    max_active_runs=1,
    description='A sample pipeline run',
    start_date=days_ago(0),
    catchup=False,
    schedule_interval=timedelta(minutes=5)
)
Now, since catchup=False, it should skip the runs prior to the current time. It does, but strangely it is not setting the execution_date right.
Here are the runs' execution times:
Execution time
We can see the runs are scheduled at a frequency of 5 minutes. But why does it append seconds and milliseconds to the time?
This is impacting my sensors later.
Note that the behaviour is fine when catchup=True.
I did some permutations. It seems the execution_date comes out correctly when I specify a cron expression instead of a timedelta.
So, my DAG now is
dag = DAG(
    'child',
    max_active_runs=1,
    description='A sample pipeline run',
    start_date=days_ago(0),
    catchup=False,
    schedule_interval='*/5 * * * *'
)
Hope it will help someone. I have also raised a bug for this; it can be tracked at https://github.com/apache/airflow/issues/11758
Regarding execution_date, you should have a look at the scheduler documentation. It is the beginning of the period, but the run gets triggered at the end of the period.
The scheduler won’t trigger your tasks until the period it covers has ended, e.g. a job with schedule_interval set as @daily runs after the day has ended. This technique makes sure that whatever data is required for that period is fully available before the DAG is executed. In the UI, it appears as if Airflow is running your tasks a day late.
Note
If you run a DAG on a schedule_interval of one day, the run with execution_date 2019-11-21 triggers soon after 2019-11-21T23:59.
Let’s Repeat That: the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
Also the article Scheduling Tasks in Airflow might be worth a read.
You should also avoid setting the start_date to a relative value: this can lead to unexpected behaviour, as the value is interpreted anew every time the DAG file is parsed.
There is a long description within the Airflow FAQ:
We recommend against using dynamic values as start_date, especially datetime.now(), as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now, as now() moves along.
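The quoted rule can be written down directly. A tiny sketch (plain datetime, no Airflow) of how a run's trigger time relates to its execution_date:

```python
from datetime import datetime, timedelta

def trigger_time(execution_date, schedule_interval):
    # A run labelled with execution_date covers the period
    # [execution_date, execution_date + schedule_interval)
    # and is only triggered once that period has ended.
    return execution_date + schedule_interval

# The example from the docs: the run with execution_date 2019-11-21
# triggers soon after 2019-11-21T23:59, i.e. at the start of 2019-11-22.
assert trigger_time(datetime(2019, 11, 21), timedelta(days=1)) == datetime(2019, 11, 22)
```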

Django crontab, running a job every 12 hours

I have a django-crontab job scheduled to run every 12 hours, meaning it should run twice per day; however, it is running more often than that.
Can anyone tell me what's wrong with it?
('* */12 * * *', 'some_method','>>'+os.path.join(BASE_DIR,'log/mail.log'))
Also what changes I need to make if I need it to run every 24 hours?
After every 12 hours you want the job to run at one particular minute (0 to 59), not at every minute of the hour. So it should be (assuming the 0th minute):
('0 */12 * * *', 'some_method','>>'+os.path.join(BASE_DIR,'log/mail.log'))
For once a day or every 24 hours (you can pick any specific hour from 0 to 23; assuming midnight):
('0 0 * * *', 'some_method','>>'+os.path.join(BASE_DIR,'log/mail.log'))
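To see the difference concretely, here is a minimal cron-field matcher (a deliberate simplification: it only handles '*', '*/n', and plain numbers, not ranges or lists) that counts how often each schedule fires per day:

```python
def field_matches(field: str, value: int) -> bool:
    """Match one cron field against a value; supports '*', '*/n', and plain numbers."""
    if field == '*':
        return True
    if field.startswith('*/'):
        return value % int(field[2:]) == 0
    return value == int(field)

def runs_per_day(minute_field: str, hour_field: str) -> int:
    """How many (hour, minute) combinations in a day match the schedule."""
    return sum(field_matches(minute_field, m) and field_matches(hour_field, h)
               for h in range(24) for m in range(60))

assert runs_per_day('*', '*/12') == 120  # '* */12 ...': every minute of hours 0 and 12
assert runs_per_day('0', '*/12') == 2    # '0 */12 ...': 00:00 and 12:00 only
assert runs_per_day('0', '0') == 1       # '0 0 ...': once a day, at midnight
```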

Scheduling reset every 24 hours at midnight

I have a counter "numberOrders" and I want to reset it every day at midnight, to know how many orders I get in one day. What I have right now is this:
val system = akka.actor.ActorSystem("system")
system.scheduler.schedule(86400000 milliseconds, 0 milliseconds){(numberOrders = 0)}
This piece of code is inside a def which is called every time I get a new order, so what it does is reset numberOrders 24 hours after the first order, or after every order; I'm not really sure whether every new order schedules another reset 24 hours later. Either way, that is not what I want: I want to reset the variable every day at midnight. Any ideas? Thanks!
To expand on pushy's answer: since you might not always be sure when the site started, and you want to be certain it runs exactly at midnight, you can do the following
val system = akka.actor.ActorSystem("system")
// time remaining until the next midnight (UTC, since the epoch started at midnight UTC)
val wait = (24 hours).toMillis - (System.currentTimeMillis % (24 hours).toMillis)
system.scheduler.schedule(Duration.apply(wait, MILLISECONDS), 24 hours, orderActor, ResetCounterMessage)
Might not be the tidiest of solutions but it does the job.
As schedule supports repeated executions, you could just set the interval parameter to 24 hours, the initial delay to the amount of time between now and midnight, and start it once at application startup. You also seem to be creating a new ActorSystem every time you get an order, which does not seem right; you would be rid of that as well.
Also, I would suggest using the schedule overload that sends messages to actors instead. That way the actor that processes the orders can keep the count, and when it receives a ResetCounter message it simply resets the counter. You could write:
system.scheduler.schedule(x seconds, 24 hours, orderActor, ResetCounterMessage)
when you start up your actor system initially, and be done with it.
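The "initial delay until midnight, then repeat every 24 hours" computation is the essential piece; here it is sketched in Python for clarity (note this version uses local midnight, whereas the epoch arithmetic in the Scala snippet above yields midnight UTC):

```python
from datetime import datetime, timedelta

def millis_until_midnight(now=None):
    """Initial delay for a scheduler whose first tick should land at local midnight."""
    now = now or datetime.now()
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return int((next_midnight - now).total_seconds() * 1000)

# One minute before midnight -> one minute of initial delay.
assert millis_until_midnight(datetime(2020, 1, 1, 23, 59)) == 60_000
# The delay is always positive and at most a full day.
assert 0 < millis_until_midnight() <= 24 * 60 * 60 * 1000
```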

Unexplained crash while polling systemtime type

I have a program that runs every 5 minutes when the stock market is open, which it does by running once, then entering the following function, which returns once 5 minutes have passed, if the stock market is open.
What I don't understand is that after a period of time, usually about 18 or 19 hours, it crashes with a SIGSEGV error. I have no idea why, as it isn't writing to any memory; although I don't know much about the SYSTEMTIME type, so maybe that's it?
Anyway, any help you could give would be very much appreciated! Thanks in advance!!
void KillTimeUntilNextStockDataReleaseOnWeb()
{
    SYSTEMTIME tLocalTimeNow;

    cout << "\n*****CHECKING IF RUN HAS JUST COMPLETED OR NOT*****\n";
    // Check if a run has just completed. If so, await the next 5-minute mark.
    GetLocalTime(&tLocalTimeNow);
    while ((tLocalTimeNow.wMinute % 5) == 0)
        GetLocalTime(&tLocalTimeNow);

    cout << "\n*****AWAITING 5 MINUTE MARK TO UPDATE STOCK DATA*****\n";
    // Loop through this section, checking the current time, until the
    // 5-minute update. Then proceed.
    GetLocalTime(&tLocalTimeNow);
    while ((tLocalTimeNow.wMinute % 5) != 0)
        GetLocalTime(&tLocalTimeNow);

    cout << "\n*****CHECKING IF MARKET IS OPEN*****\n";
    // Check if the stock market is even open. If not, repeat.
    GetLocalTime(&tLocalTimeNow);
    while ((tLocalTimeNow.wHour < 8) || (tLocalTimeNow.wHour > 17))
        GetLocalTime(&tLocalTimeNow);

    cout << "\n*****PROGRAM CONTINUING*****\n";
    return;
}
If you want to "wait for X seconds", the Windows system call Sleep(x) will sleep for x milliseconds. Note, however, that if you sleep for, say, 300 s after some operation that took 3 seconds, you would drift 3 seconds every 5 minutes. That may not matter, but if it is critical to keep the same timing all the time, you should figure out (based on a time function or similar) how long it is to the next boundary, and then sleep that amount (possibly running a bit short and adding another check-and-sleep if you woke up early). If "every five minutes" is more of an approximate thing, then 300 s is fine.
There are other methods to wait for a given amount of time, but I suspect the above is sufficient.
Instead of using a busy loop, or even Sleep() in a loop, I would suggest using a Waitable Timer. That way the calling thread sleeps efficiently while it is waiting, while still providing a mechanism to "wake up" early if needed.
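The "figure out how long it is to the next boundary" arithmetic is the same in any language; here is a small Python sketch of the idea (on Windows the resulting delay would be handed to Sleep() in milliseconds, or used as a Waitable Timer due time):

```python
import time

def seconds_until_next_boundary(now: float, period: int = 300) -> float:
    """How long to wait so we wake exactly on the next multiple of `period` seconds.

    If `now` is exactly on a boundary, wait a full period; this also avoids
    the double-fire the busy loop above guards against with its first while.
    """
    remainder = now % period
    return period - remainder if remainder else float(period)

# 200 s past a boundary -> wait 100 s to reach the next 5-minute mark.
assert seconds_until_next_boundary(200.0) == 100.0
# Exactly on a boundary -> wait a full period.
assert seconds_until_next_boundary(300.0) == 300.0

# Usage: time.sleep(seconds_until_next_boundary(time.time()))
```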

Limiting the lifetime of a file in Python

Hello there,
I'm looking for a way to limit the lifetime of a file in Python, that is, to make a file which will be auto-deleted 5 minutes after creation.
Problem:
I have a Django based webpage which has a service that generates plots (from user submitted input data) which are showed on the web page as .png images. The images get stored on the disk upon creation.
Image files are created per session basis, and should only be available a limited time after the user has seen them, and should be deleted 5 minutes after they have been created.
Possible solutions:
I've looked at Python's tempfile, but that is not what I need, because the user should be able to return to the page containing the image without waiting for it to be generated again. In other words, it shouldn't be destroyed as soon as it is closed.
The other way that comes to mind is to call some sort of external bash script which would delete files older than 5 minutes.
Does anybody know a preferred way doing this?
Ideas can also include changing the logic of showing/generating the image files.
You should write a Django custom management command to delete old files, which you can then call from cron.
If you want no files older than 5 minutes, then you need to call it every 5 minutes, of course. And yes, it would run unnecessarily when there are no users, but that shouldn't worry you too much.
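The core of such a management command could look like the sketch below (illustrative only: the function name and 5-minute threshold are assumptions, and the Django command wrapper around it is omitted):

```python
import os
import time

def delete_old_files(directory: str, max_age_seconds: int = 300) -> int:
    """Delete regular files older than max_age_seconds; return how many were removed.

    This would be the body of the custom management command that cron
    invokes every 5 minutes.
    """
    deleted = 0
    now = time.time()
    for name in os.listdir(directory):
        path = os.path.join(directory, name)  # listdir returns names, not paths
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            deleted += 1
    return deleted
```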
OK, that might be a good approach, I guess...
You can write a script that checks your directory, deletes outdated files, and picks the oldest of the remaining files. Calculate how much time has passed since that file was created, and from that the remaining time until its deletion. Then call sleep with that remaining time. When the sleep ends and another loop begins, there will be (at least) one file to delete. If there are no files in the directory, set the sleep time to 5 minutes.
That way you ensure each file is deleted roughly 5 minutes after creation, but when lots of files are created simultaneously, the sleep time shrinks and the function starts checking files more and more often. To avoid that, add some latency to the sleep before starting another loop: for example, if the oldest file is 4 minutes old, sleep 60+30 seconds (adding 30 seconds to all time calculations).
An example:
import time
import os

DIRECTORY = '/path/to/your/directory'

def clearDirectory():
    while True:
        _time_list = []
        _now = time.time()
        for _f in os.listdir(DIRECTORY):
            _path = os.path.join(DIRECTORY, _f)  # listdir returns names, not full paths
            if os.path.isfile(_path):
                _f_time = os.path.getmtime(_path)  # file creation/modification time
                if _now - _f_time >= 300:
                    os.remove(_path)  # delete outdated file
                else:
                    _time_list.append(_f_time)  # remember the times of the files kept
        # After checking all files, sleep until the oldest remaining file
        # turns 5 minutes old; if no files are left, sleep 300 seconds.
        _sleep_time = (300 - (_now - min(_time_list))) if _time_list else 300
        time.sleep(_sleep_time)
But as I said, if files are created often, it is better to add a latency to the sleep time:
time.sleep(_sleep_time + 30) # sleep 30 seconds more, so some other files may become outdated during that time too...
Also, it is worth reading the getmtime documentation for details.