Understanding Airflow execution_date when catchup=False is set - airflow-scheduler

I am trying to understand how Airflow sets execution_date for a DAG. I have set catchup=False on the DAG. Here is my DAG definition:
from datetime import timedelta

from airflow import DAG
from airflow.utils.dates import days_ago

dag = DAG(
    'child',
    max_active_runs=1,
    description='A sample pipeline run',
    start_date=days_ago(0),
    catchup=False,
    schedule_interval=timedelta(minutes=5)
)
Now, since catchup=False, it should skip the runs prior to the current time. It does skip them; however, strangely, it does not set the execution_date correctly.
Here are the execution times of the runs:
[screenshot: execution times of the scheduled runs]
We can see the runs are scheduled at a frequency of 5 minutes. But why does it append seconds and milliseconds to the time?
This is impacting my sensors later.
Note that the behaviour is fine when catchup=True.

I tried a few permutations. It seems the execution_date comes out correctly when I specify a cron expression instead of a timedelta.
So, my DAG now is
dag = DAG(
    'child',
    max_active_runs=1,
    description='A sample pipeline run',
    start_date=days_ago(0),
    catchup=False,
    schedule_interval='*/5 * * * *'
)
Hope this helps someone. I have also raised a bug for this, which can be tracked at https://github.com/apache/airflow/issues/11758.

Regarding execution_date, you should have a look at the scheduler documentation. It marks the beginning of the period (starting from the start_date), but the run only gets triggered at the end of the period.
The scheduler won’t trigger your tasks until the period it covers has ended, e.g., a job with schedule_interval set as @daily runs after the day has ended. This technique makes sure that whatever data is required for that period is fully available before the DAG is executed. In the UI, it appears as if Airflow is running your tasks a day late.
Note
If you run a DAG on a schedule_interval of one day, the run with execution_date 2019-11-21 triggers soon after 2019-11-21T23:59.
Let’s Repeat That, the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
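To make that concrete, here is a small sketch (plain datetime arithmetic, not Airflow code) of how the execution_date of a daily run relates to the wall-clock time the run actually starts:
from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)        # equivalent of "@daily"
execution_date = datetime(2019, 11, 21)      # start of the data interval

# The scheduler only creates this run once the whole interval has passed,
# i.e. around execution_date + schedule_interval (shortly after 2019-11-21T23:59).
run_starts_at = execution_date + schedule_interval

print(execution_date)   # 2019-11-21 00:00:00
print(run_starts_at)    # 2019-11-22 00:00:00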
Also the article Scheduling Tasks in Airflow might be worth a read.
You should also avoid setting the start_date to a relative value - this can lead to unexpected behaviour because the value is re-interpreted every time the DAG file is parsed.
There is a long description within the Airflow FAQ:
We recommend against using dynamic values as start_date, especially datetime.now(), as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along.
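A minimal sketch of the recommended pattern, i.e. a static start_date instead of days_ago() or datetime.now() (the DAG arguments are just the ones from the question, and the date itself is a placeholder):
from datetime import datetime, timedelta

from airflow import DAG

dag = DAG(
    'child',
    max_active_runs=1,
    description='A sample pipeline run',
    # Fixed value: interpreted the same way every time the DAG file is parsed.
    start_date=datetime(2020, 10, 1),
    catchup=False,
    schedule_interval=timedelta(minutes=5),
)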

Related

Vertex AI 504 Errors in batch job - How to fix/troubleshoot

We have a Vertex AI model that takes a relatively long time to return a prediction.
When hitting the model endpoint with one instance, things work fine. But batch jobs of size say 1000 instances end up with around 150 504 errors (upstream request timeout). (We actually need to send batches of 65K but I'm troubleshooting with 1000).
I tried increasing the number of replicas assuming that the # of instances handed to the model would be (1000/# of replicas) but that doesn't seem to be the case.
I then read that the default batch size is 64, and so tried decreasing the batch size to 4 like this in the Python code that creates the batch job:
from google.cloud import aiplatform

def run_batch_prediction_job(vertex_config):
    aiplatform.init(
        project=vertex_config.vertex_project, location=vertex_config.location
    )
    model = aiplatform.Model(vertex_config.model_resource_name)
    model_params = dict(batch_size=4)
    batch_params = dict(
        job_display_name=vertex_config.job_display_name,
        gcs_source=vertex_config.gcs_source,
        gcs_destination_prefix=vertex_config.gcs_destination,
        machine_type=vertex_config.machine_type,
        accelerator_count=vertex_config.accelerator_count,
        accelerator_type=vertex_config.accelerator_type,
        starting_replica_count=replica_count,  # replica_count is defined elsewhere in my code
        max_replica_count=replica_count,
        sync=vertex_config.sync,
        model_parameters=model_params
    )
    batch_prediction_job = model.batch_predict(**batch_params)
    batch_prediction_job.wait()
    return batch_prediction_job
I've also tried increasing the machine type to n1-highcpu-16, and that helped somewhat, but I'm not sure I understand how batches are sent to replicas.
Is there another way to decrease the number of instances sent to the model?
Or is there a way to increase the timeout?
Is there log output I can use to help figure this out?
Thanks
Answering your follow-up question above:
Is that timeout for a single instance request or a batch request? Also, is it in seconds?
This is a timeout for the batch job creation request.
The timeout is in seconds. According to create_batch_prediction_job(), timeout refers to the RPC timeout. If we trace the code we end up in the client library and eventually in gapic, where timeout is properly described:
timeout (float): The amount of time in seconds to wait for the RPC
    to complete. Note that if ``retry`` is used, this timeout
    applies to each individual attempt and the overall time it
    takes for this method to complete may be longer. If
    unspecified, the default timeout in the client
    configuration is used. If ``None``, then the RPC method will
    not time out.
What I can suggest is to stick with whatever is working for your prediction model. If adding the timeout improves things, you might as well build on that along with your initial solution of using a machine with a higher spec. You can also try using a machine with more memory, such as the n1-highmem-* family.
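If you do want to pass that RPC timeout explicitly, a rough sketch with the lower-level gapic client could look like the following (the project, model, and bucket names are placeholders, and note this only bounds the job-creation call, not the predictions themselves):
from google.cloud import aiplatform

client = aiplatform.gapic.JobServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
)

batch_prediction_job = {
    "display_name": "batch-with-timeout",  # placeholder
    "model": "projects/my-project/locations/us-central1/models/1234567890",
    "input_config": {
        "instances_format": "jsonl",
        "gcs_source": {"uris": ["gs://my-bucket/instances.jsonl"]},
    },
    "output_config": {
        "predictions_format": "jsonl",
        "gcs_destination": {"output_uri_prefix": "gs://my-bucket/predictions/"},
    },
    "dedicated_resources": {
        "machine_spec": {"machine_type": "n1-highmem-16"},
        "starting_replica_count": 1,
        "max_replica_count": 4,
    },
}

response = client.create_batch_prediction_job(
    parent="projects/my-project/locations/us-central1",
    batch_prediction_job=batch_prediction_job,
    timeout=600.0,  # RPC timeout in seconds for the creation call only
)
print(response.name)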

How can I run specific task(s) from an Airflow DAG

Current state of the Airflow DAG:
ml_processors = [a, b, c, d, e]
abc_task >> ml_processors (all ml models from a to e run in parallel after abc task is successfully completed)
ml_processors >> xyz_task (once a to e all are successful xyz task runs)
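For reference, a minimal sketch of that layout might look like this (the operator types and the DAG name are placeholders, not the real pipeline):
from airflow import DAG
from airflow.operators.dummy import DummyOperator  # EmptyOperator in newer Airflow releases
from airflow.utils.dates import days_ago

with DAG("ml_pipeline_sketch", start_date=days_ago(1), schedule_interval=None) as dag:
    abc_task = DummyOperator(task_id="abc_task")
    xyz_task = DummyOperator(task_id="xyz_task")
    ml_processors = [
        DummyOperator(task_id=f"{name}_processor") for name in ["a", "b", "c", "d", "e"]
    ]

    abc_task >> ml_processors   # a..e run in parallel once abc_task succeeds
    ml_processors >> xyz_task   # xyz_task runs only after all of them succeed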
Problem statement: There are instances when one of the machine learning models (a task in Airflow) gets a new version with better accuracy and we want to reprocess our data. Now let's say c_processor gets a new version and only its data needs to be reprocessed. In that case I would like to run c_processor >> xyz_task only.
What I know/tried
I know that I can go back to successful DAG runs and clear the tasks for a certain period of time to run only specific tasks. But this might not be very efficient when I have, let's say, c_processor and d_classifier to rerun. I would end up doing 2 steps here:
c_processor >> xyz_task
d_processor >> xyz_task
which I would like to avoid.
I read about backfill in Airflow, but it looks like it's more for a whole DAG rather than for specific/selected tasks within a DAG.
Environment/setup
Using a Google Cloud Composer environment.
The DAG is triggered on file upload to GCP storage.
I am interested to know if there are any other ways to rerun only specific tasks from airflow dag.
"clear"1 would also allow you to clear some specific tasks in a DAG with the --task-regex flag. In this case, you can run airflow tasks clear --task-regex "[c|d]_processor" --downstream -s 2021-03-22 -e 2021-03-23 <dag_id>, which clear the states for c and d processors with their downstreams.
One caveat though, this will also clean up the states for the original task runs.

Airflow: Dag scheduled twice a few seconds apart

I am trying to run a DAG only once a day at 00:15:00 (15 minutes past midnight), yet it's being scheduled twice, a few seconds apart.
dag = DAG(
    'my_dag',
    default_args=default_args,
    start_date=airflow.utils.dates.days_ago(1) - timedelta(minutes=10),
    schedule_interval='15 0 * * * *',
    concurrency=1,
    max_active_runs=1,
    retries=3,
    catchup=False,
)
The main goal of that DAG is to check for new emails, then check for new files in an SFTP directory, and then run a "merger" task to add those new files to a database.
All the jobs are Kubernetes pods:
email_check = KubernetesPodOperator(
    namespace='default',
    image="g.io/email-check:0d334adb",
    name="email-check",
    task_id="email-check",
    get_logs=True,
    dag=dag,
)
sftp_check = KubernetesPodOperator(
    namespace='default',
    image="g.io/sftp-check:0d334adb",
    name="sftp-check",
    task_id="sftp-check",
    get_logs=True,
    dag=dag,
)
my_runner = KubernetesPodOperator(
    namespace='default',
    image="g.io/my-runner:0d334adb",
    name="my-runner",
    task_id="my-runner",
    get_logs=True,
    dag=dag,
)
my_runner.set_upstream([sftp_check, email_check])
So, the issue is that there seem to be two runs of the DAG scheduled a few seconds apart. They do not run concurrently, but as soon as the first one is done, the second one kicks off.
The problem here is that the my_runner job is intended to run only once a day: it tries to create a file with the date as a suffix, and if the file already exists, it throws an exception. So the second run always throws an exception (because the file for the day has already been created by the first run).
Since an image (or two) are worth a thousand words, here it goes:
You'll see that there's a first run that is scheduled "22 seconds after 00:15" (that's fine... sometimes it varies a couple of seconds here and there) and then there's a second one that always seems to be scheduled "58 seconds after 00:15 UTC" (at least according to the name they get). So the first one runs fine, nothing else seems to be running... And as soon as it finishes the run, a second run (the one scheduled at 00:15:58) starts (and fails).
A "good" one:
A "bad" one:
Can you check the schedule_interval parameter?
schedule_interval='15 0 * * * *'. A cron schedule takes only 5 fields and I see an extra star.
Also, can you use a fixed start_date?
start_date: datetime(2019, 11, 10)
It looks like setting the start_date to 2 days ago instead of 1 did the trick
dag = DAG(
    'my_dag',
    ...
    start_date=airflow.utils.dates.days_ago(2),
    ...
)
I don't know why.
I just have a theory. Maaaaaaybe (big maybe) the issue was that, because days_ago(...) sets a UTC datetime with hour/minute/second set to 0 and then subtracts the given number of days, just saying "one day ago" or even "one day and 10 minutes ago" didn't put the start_date far enough back before the next period (00:15), and that was somehow confusing Airflow?
Let’s Repeat That The scheduler runs your job one schedule_interval
AFTER the start date, at the END of the period.
https://airflow.readthedocs.io/en/stable/scheduler.html#scheduling-triggers
So, the end of the period would be 00:15... If my theory were correct, using airflow.utils.dates.days_ago(1) - timedelta(minutes=16) would probably also work.
This doesn't explain why if I set a date very far in the past, it just doesn't run, though. ¯\_(ツ)_/¯
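For what it's worth, here is roughly what those two start_date expressions evaluate to (a sketch that just mimics days_ago(), which returns today's UTC midnight minus n days):
from datetime import datetime, timedelta, timezone

def days_ago_like(n):
    # roughly what airflow.utils.dates.days_ago(n) does: today's midnight UTC minus n days
    midnight = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=n)

start_original = days_ago_like(1) - timedelta(minutes=10)  # yesterday 00:00 UTC minus 10 minutes
start_working  = days_ago_like(2)                          # two days ago, 00:00 UTC
first_run_slot = days_ago_like(0) + timedelta(minutes=15)  # today's 00:15 UTC schedule point

print(start_original, start_working, first_run_slot)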

How to set internal wall clock in a Fortran program?

I use Fortran for scientific computation on an HPC system. As we know, when we submit jobs to an HPC job scheduler, we also specify a wall clock time limit for our jobs. However, when the time is up, if the job is still writing output data, it gets terminated, which leaves some 'NUL' values in the data and causes trouble for the post-processing.
So, can we set up an internal mechanism so that our job stops itself gracefully some time before the end of the allowed HPC wall clock time?
Related Question: How to skip reading "NUL" value in MATLAB's textscan function?
After realizing what you are asking, I found that I implemented similar functionality in my program very recently (commit https://bitbucket.org/LadaF/elmm/commits/f10a1b3421a3dd14fdcbe165aa70bf5c5001413f). But I still have to set the time limit manually.
The most important part:
time_stepping%clock_time_limit is the time limit in seconds. Count the number of system clock ticks corresponding to that:
call system_clock(count_rate = timer_rate)
call system_clock(count_max = timer_max_count)
timer_count_time_limit = int( min(time_stepping%clock_time_limit &
                                    * real(timer_rate, knd), &
                                  real(timer_max_count, knd) * 0.999_dbl) &
                              , dbl)
Start the timer
call system_clock(count = time_steps_timer_count_start)
Check the timer and exit the main loop with error_exit set to .true. if the time is up
if (mod(time_step, time_stepping%check_period) == 0) then
  if (master) then
    ! get the current tick count
    call system_clock(count = time_steps_timer_count_2)
    error_exit = time_steps_timer_count_2 - time_steps_timer_count_start > timer_count_time_limit
    if (error_exit) write(*,*) "Maximum clock time exceeded."
  end if
  ! MPI_Bcast the error_exit flag to the other processes
  if (error_exit) exit
end if
Now, you may want to get the time limit from your scheduler automatically. That will vary between different job schedulers. There will be an environment variable like $PBS_WALLTIME. See "Get walltime in a PBS job script", but check your scheduler's manual.
You can read this variable using GET_ENVIRONMENT_VARIABLE()

Delayed Job Overwhelming DB

I have a method which updates all DNS records for an account, with one delayed job for each record. There are a lot of workers and queues, which is great for getting other jobs done quickly, but this particular job completes quickly and overwhelms the database. Because each job requires DNS to resolve, it's difficult to move this to a process which collects the information and then writes once. So I'm instead looking for a way to stagger delayed jobs.
As far as I know, just using sleep(0.1) in the after method should do the trick. I wanted to see if anyone else has specifically dealt with this situation and solved it.
I've created a custom job to test out a few different ideas. Here's some example code:
def update_dns
  Account.active.find_each do |account|
    account.domains.where('processed IS NULL').find_each do |domain|
      begin
        Delayed::Job.enqueue StaggerJob.new(domain.id)
      rescue Exception => e
        self.domain_logger.error "Unable to update DNS for #{domain.name} (id=#{domain.id})"
        self.domain_logger.error e.message
        self.domain_logger.error e.backtrace
      end
    end
  end
end
When a cron job calls Domain.update_dns, the delayed job table floods with tens of thousands of jobs, and the workers start working through them. There are so many workers and queues that even setting the lowest priority overwhelms the database, and other requests suffer.
Here's the StaggerJob class:
class StaggerJob < Struct.new(:domain_id)
  def perform
    domain.fetch_dns_job
  end

  def enqueue(job)
    job.account_id = domain.account_id
    job.owner = domain
    job.priority = 10 # lowest
    job.save
  end

  def after(job)
    # Sleep to avoid overwhelming the DB
    sleep(0.1)
  end

  private

  def domain
    @domain ||= Domain.find self.domain_id
  end
end
This may entirely do the trick, but I wanted to verify if this technique was sensible.
It turned out the priority for these jobs was set to 0 (highest). Setting it to 10 (lowest) helped. Sleeping in the job's after method would work, but there's a better way:
Delayed::Job.enqueue StaggerJob.new(domain.id, :fetch_dns!), run_at: (Time.now + (0.2 * counter).seconds) # stagger by 0.2 seconds per job (counter increments for each domain enqueued)
This ends up pausing outside the job instead of inside. Better!