airflow - how to 'Filling up the DagBag' once only - airflow-scheduler

My dag takes about 50 seconds to parse. I only use external triggers to start dag runs, no schedules. I notice Airflow wants to fill the DagBag a lot: on every trigger_dag command, and in the background it keeps checking the dags folder and creating .pyc files seemingly instantly once a new .py is deployed.
Is there any way I can deploy my cluster and get the DagBag filled once, then for the next 2 weeks have dag runs start instantly on any trigger_dag (right now it takes 50 seconds just to fill the DagBag before starting)? I have no need to update dag definitions within the 2 weeks.

50 seconds is an incredibly long time for DAG instantiation. It looks like you have a big (or simply long-running) piece of code in your DAG file, which is bad practice. The documentation notes:
Note: This means all top level code (i.e. anything that isn't defining the DAG) in a DAG file will get run on each scheduler heartbeat. Try to avoid top level code in your DAG file unless absolutely necessary.
Airflow works exactly as you described. That is why you should treat the Python files in your DAG folder mostly as configuration files (with some programmatic capabilities). You can't change this with any magic config keys or the like; this behaviour is core to Airflow.
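As a rough illustration (not from the original post, and assuming Airflow 2.x import paths), the usual fix is to push the expensive work out of module level and into a task callable, so it only runs at execution time rather than on every parse:

import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def heavy_work():
    # anything slow (API calls, DB queries, large imports) belongs here,
    # so it runs only when the task executes, not every time the file is parsed
    time.sleep(50)

with DAG(
    dag_id="example_cheap_parse",   # hypothetical dag_id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,         # externally triggered only
    catchup=False,
) as dag:
    PythonOperator(task_id="heavy_task", python_callable=heavy_work)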

Related

Serializing DAG into DB on Demand in Airflow 2.0

We currently deploy flows at runtime using Airflow and face constant issues with DAG deployment: the DAGs don't get picked up by the scheduler on time, which delays the user response in our workflow application.
Is there any way we can deploy DAGs into Airflow's database ON DEMAND? If not, how can we make this process well defined?
Since Airflow 2.0.1, each DAG file is parsed every 30 seconds by default.
This is controlled by https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#min-file-process-interval
Change that to 2 seconds or whatever number of seconds you think is appropriate for your case.
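For example, in airflow.cfg (or via the equivalent environment variable):

[scheduler]
# how often, in seconds, each DAG file is re-parsed (30 is the default since 2.0.1)
min_file_process_interval = 2

# or, equivalently:
# AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=2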

How to force fail a dag after x number of time?

I want to force fail a dag after, say, 3 hours have passed.
I have a dag that is scheduled for 2am and one that is scheduled for 6am. I want the 2am dag to stop and give precedence to the one scheduled at 6am.
I have already tried using execution_timeout.
I have also tried dagrun_timeout, but the dag keeps running because there are no other runs of the first dag waiting to start.
NOTE: This is like a cross-dag dependency where I want to give preference to one dag during certain hours.
Programmatically stopping a dag the way I wanted to above is not possible. I solved my issue by giving priority to a dag: I assigned a larger pool to it.
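For reference, a rough sketch of the two knobs discussed above; the pool name is made up (pools themselves are created in the UI or via the CLI), and dagrun_timeout has caveats about when it is actually enforced:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def do_work():
    pass

with DAG(
    dag_id="early_morning_dag",          # hypothetical
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",
    dagrun_timeout=timedelta(hours=3),   # cap on how long a scheduled run may stay active
    catchup=False,
) as dag:
    PythonOperator(
        task_id="work",
        python_callable=do_work,
        pool="small_pool",               # hypothetical pool; the dag you want to favour gets a larger one
    )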

How airflow loads/updates DagBag from dags home folder on google cloud platform?

Please do not downvote my question. If needed I will update and correct my words. I have done my homework and research, but I am a little new, so I am trying to understand this.
I would like to understand how Airflow on Google Cloud Platform picks up changes from the dags home folder and shows them in the UI. Please also help me with my dags setup script. I have read many answers along with books; the book link is here.
I tried figuring out my answer from page 69, which says:
3.11 Scheduling & Triggers: The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.
My understanding from this book is that the scheduler regularly picks up changes from the dags home folder. (Is that correct?)
I also read multiple answers on Stack Overflow; I found this one useful: Link
But that answer still does not explain which process creates/updates the DagBag from script.py in the dag home folder, or how changes are sensed.
Please help me with my dags setup script.
We have created a generic Python script that dynamically creates dags by reading/iterating over config files.
Below is the directory structure:
/dags/workflow/
/dags/workflow/config/dag_a.json
/dags/workflow/config/dag_b.json
/dags/workflow/task_a_with_single_operator.py
/dags/workflow/task_b_with_single_operator.py
/dags/dag_creater.py
The execution flow of dag_creater.py is as follows:
1. Iterate over the dags/workflow/config folder, get each config JSON file and read the dag_id variable.
2. Create Parent_dag = DAG(dag_id=dag_id, start_date=start_date, schedule_interval=schedule_interval, default_args=default_args, catchup=False).
3. Read the tasks and their dependencies for that dag_id from the config JSON file (for example [[a,[]],[b,[a]],[c,[b]]]) and wire them up as task_a >> task_b >> task_c.
This is how each dag gets created; a rough sketch of the pattern is shown below. Everything works: the dags are visible in the UI and run fine.
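A minimal sketch of what such a dag_creater.py typically looks like (this is my reconstruction under Airflow 2.x imports, not the real script; field names like "tasks" are assumptions):

import glob
import json
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

CONFIG_DIR = os.path.join(os.path.dirname(__file__), "workflow", "config")

for config_path in glob.glob(os.path.join(CONFIG_DIR, "*.json")):
    with open(config_path) as f:
        config = json.load(f)

    dag = DAG(
        dag_id=config["dag_id"],
        start_date=datetime(2021, 1, 1),
        schedule_interval=config.get("schedule_interval"),
        catchup=False,
    )

    # assumed config format: "tasks": [["a", []], ["b", ["a"]], ["c", ["b"]]]
    tasks = {}
    for task_id, _ in config["tasks"]:
        tasks[task_id] = DummyOperator(task_id=task_id, dag=dag)
    for task_id, upstream_ids in config["tasks"]:
        for upstream_id in upstream_ids:
            tasks[upstream_id] >> tasks[task_id]

    # exposing the DAG object at module level is what makes the scheduler pick it up,
    # and it is also why this whole file is re-executed on every parse
    globals()[config["dag_id"]] = dag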
But the problem is that my dag creation script runs every time; even in each task's logs I see logs from all the dags. I expected this script to run once, just to fill the entries in the metadata database, and I can't understand why it runs every time.
Please help me understand the process.
I know airflow initdb is run once, when the metadata database is first set up, so that is not what keeps doing this update.
Is it the scheduler heartbeat updating everything?
Is my setup correct?
Please note: I can't post the real code, as that is a restriction from my organization. However, if asked, I will provide more information.
The Airflow scheduler runs continuously in the Airflow runtime environment and is the main component that monitors changes in the DAG folder and triggers the relevant DAG tasks residing in that folder. The main settings for the scheduler service can be found in the airflow.cfg file, essentially the heartbeat intervals, which affect general DAG task maintenance.
However, how a particular task is executed is defined by the executor configured for Airflow.
To store the DAGs available to the Airflow runtime environment, GCP Composer uses Cloud Storage with a specific folder structure: any object with a *.py extension arriving in the /dags folder is synchronized and checked for a DAG definition.
If you want to run your DAG-generating script within the Airflow runtime, then for this particular use case I would advise you to look at PythonOperator, using it in a separate DAG so that your custom generic Python code is invoked and executed on a schedule that guarantees it runs only once. You can check out this Stack thread for implementation details.
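A sketch of that idea (the callable is a placeholder, not your real generator):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def generate_dags_once():
    # placeholder for the generic creation logic driven by the JSON configs
    pass

with DAG(
    dag_id="one_time_dag_setup",        # hypothetical
    start_date=datetime(2021, 1, 1),
    schedule_interval="@once",          # scheduled exactly once
    catchup=False,
) as dag:
    PythonOperator(task_id="setup", python_callable=generate_dags_once)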

Google Cloud DataPrep schedule is spawning multiple DataFlow jobs

I have a schedule which runs my flow twice a day - at 0910 and 1520 BST.
This is spawning a massive number of DataFlow jobs - so far today just the second schedule (1520) has spawned 80 jobs:
$ gcloud dataflow jobs list
JOB_ID NAME TYPE CREATION_TIME STATE REGION
2018-07-29_12_17_06-14876588186269022154 project-name-513008-by-username Batch 2018-07-29 19:17:07 Running us-central1
2018-07-29_12_14_54-6436458673562317581 project-name-512986-by-username Batch 2018-07-29 19:14:55 Cancelled us-central1
2018-07-29_12_13_55-6167618802124600084 project-name-512985-by-username Batch 2018-07-29 19:13:57 Cancelled us-central1
...
(see PasteBin for the full list)
In the days after the DataPrep update last week, I had trouble accessing the run settings url for the flow. I suspect that there's a process as part of the run settings which walks back through the flow (I have 12 flows chained by reference datasets) and sanity checks it - it seems that my flow was just on the cusp of being complex enough to cause the page load to time out, and I had to cut out a couple of steps just to get to the run settings.
I wonder if each time this timed out, it somehow duplicated the schedule or something else in the process - but then again, the number of duplicated jobs is inconsistent.
I recently rebuilt this project after seeing some issues with sampling errors (in that the sample was corrupt, so I couldn't load the transformation UI, but also couldn't build a new sample). After a hefty attempt at resolving the issue, I took the chance to rebuild as a dedicated GCP project with structure improvements, etc. I didn't see this scheduling error before the rebuild.

Django setting up a scheduled task without Cron

I know there are many questions asking about this, especially this one: Django - Set Up A Scheduled Job?
But what I want to understand is: how does a scheduled task inside Django actually work?
My simplistic way of thinking about it is that there's an infinite loop somewhere, something like this (runs every 60 seconds):
import time

interval = 60  # seconds

while True:
    some_method()
    time.sleep(interval)
Question: where do you put this infinite loop? Is there some part of the Django app that just runs in the background alongside the rest of the app?
Thanks!
Django doesn't do scheduled tasks. If you want scheduled tasks, you need a daemon that runs all the time and can launch your task at the appropriate time.
Django only runs when an HTTP request is made. If no one makes an HTTP request for a month, Django doesn't run for a month. If there are 45 HTTP requests this second, Django will run 45 times this second (in the absence of caching).
You can write scripts within the Django framework (called management commands) that get called from some outside service (like cron). That's as close as you'll get to what you want, and if that's what you end up doing, the question/answer you reference is the place to get the how-tos.
On a Unix-y system, cron is probably the simplest outside service to work with. On recent Linux systems, cron has a directory /etc/cron.d into which you can drop your app's cron config file without interfering with any other cron jobs on the system; no editing of existing files necessary.
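A minimal sketch with made-up names, showing a management command plus a cron.d entry that calls it:

# myapp/management/commands/run_scheduled_task.py  (hypothetical app and command name)
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Runs the periodic job"

    def handle(self, *args, **options):
        # call your periodic work here (the some_method() from the question)
        self.stdout.write("scheduled task ran")

and the matching cron.d file (paths and user are placeholders):

# /etc/cron.d/myapp - run every minute as the application user
* * * * *  www-data  /path/to/venv/bin/python /path/to/project/manage.py run_scheduled_task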