I'm exploring Apache Airflow 1.8. Is there any way to pass arguments to DAGs or tasks while backfilling?
Here is the kind of thing I'm looking for:
airflow backfill My_DAG -s some_date -e end_date argument_for_t1 argument_for_t2
Or it could be an array of args.
Is there any way to pass arguments? I've searched a lot but wasn't able to find anything.
Backfill is meant to re-run a failed/missed/misconfigured DAG, and hence there is no provision to give command-line arguments.
However, you can go to the DagBag folder, give params to the DAG object (i.e. change/fix the existing DAG), and then backfill the DAG.
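For example, a minimal sketch of that approach (the param name and value here are made up): hard-code the values as params on the DAG object in the DAG file, read them in the tasks via templating, and then backfill as usual.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='My_DAG',
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',
    params={'argument_for_t1': 'some value'},  # value "fixed" in the DAG file before backfilling
)

t1 = BashOperator(
    task_id='t1',
    bash_command='echo {{ params.argument_for_t1 }}',  # params are available in Jinja templates
    dag=dag,
)

Then backfill as usual: airflow backfill My_DAG -s 2017-01-01 -e 2017-01-07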
I came across a solution: I can set variables using airflow variables -s my_var="some value"
Now I can access my_var anywhere across the application.
For more options, check the CLI documentation.
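As a quick sketch, reading that variable back from any DAG file or task looks like this (the key is the one set above):

from airflow.models import Variable

my_var = Variable.get('my_var')                           # raises if the key does not exist
my_var = Variable.get('my_var', default_var='fallback')   # or fall back to a default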
Related
Currently, I am creating emr_default in the Connections tab for my first DAG. For the other DAG I am creating another emr_default2 and using it there. What is the best way to create emr_default for each DAG? Can we have something like a parameter file and use it in our DAG? Please advise.
Thanks.
Xi
I have to check the status of a workflow, i.e. whether the workflow completed within its scheduled time or not, using a SQL query. I also have to send an email with the workflow status, like 'completed within time' or 'not completed within time'. Please help me out.
You can do it using either Option 1 or Option 2.
Option 1: You need access to the repository metadata database.
Create a post-session shell script. You can pass the workflow name and a benchmark value to the shell script.
Get the workflow run time from the repository metadata database.
SQL you can use:
SELECT WORKFLOW_NAME, (END_TIME - START_TIME) * 24 * 60 * 60 AS diff_seconds
FROM REP_WFLOW_RUN
WHERE WORKFLOW_NAME = 'myWorkflow'
You can then compare the above value with your benchmark value. The shell script can send a mail depending on the outcome.
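As a rough sketch of that comparison step (not the author's script; the benchmark, mail addresses, and SMTP host are placeholders), feeding the diff_seconds value from the query above into a check that sends the mail:

import smtplib
from email.mime.text import MIMEText

BENCHMARK_SECONDS = 3600  # hypothetical benchmark passed to the script

def send_mail(subject, body):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = 'etl@example.com'          # placeholder addresses
    msg['To'] = 'team@example.com'
    with smtplib.SMTP('localhost') as smtp:  # assumes a local mail relay
        smtp.send_message(msg)

def check_workflow(diff_seconds, workflow_name='myWorkflow'):
    # diff_seconds is the value returned by the REP_WFLOW_RUN query above
    if diff_seconds <= BENCHMARK_SECONDS:
        send_mail(workflow_name, 'completed within time')
    else:
        send_mail(workflow_name, 'not completed within time')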
Option 2: You need to create another workflow to check this workflow.
If you do not have access to the metadata database, follow the steps above, except for the metadata SQL.
Use pmcmd GetWorkflowDetails to check the status, start time, and end time of a workflow:
pmcmd GetWorkflowDetails -sv service -d domain -f folder myWorkflow
You can then grep the start and end times from the output and compare them with your benchmark values. The catch is the output format; you need a little bit of scripting here.
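A hedged sketch of that scripting step (the exact labels and date format printed by pmcmd GetWorkflowDetails differ between PowerCenter versions, so the parsing below is an assumption you would adjust to your output):

import subprocess
from datetime import datetime

cmd = ['pmcmd', 'GetWorkflowDetails', '-sv', 'service', '-d', 'domain',
       '-f', 'folder', 'myWorkflow']
output = subprocess.run(cmd, capture_output=True, text=True).stdout

times = {}
for line in output.splitlines():
    for key in ('Start time', 'End time'):
        if line.strip().startswith(key):
            # assumed format: "Start time: [Tue Jan 01 10:00:00 2019]"
            value = line.split(':', 1)[1].strip().strip('[]')
            times[key] = datetime.strptime(value, '%a %b %d %H:%M:%S %Y')

if len(times) == 2:
    diff_seconds = (times['End time'] - times['Start time']).total_seconds()
    print('completed within time' if diff_seconds <= 3600 else 'not completed within time')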
I have tried viewing similar answers to this problem on Stack Overflow; however, my case is slightly different.
I am executing backfill jobs via the Airflow CLI, and the backfilled DAG runs get stuck in a running state, with the first task in the DAG in a queued (grey) state.
The scheduler doesn't seem to ever kick off the first task.
I do not have depends_on_past=True set in my dag_defaults:
from datetime import datetime, timedelta

dag_defaults = {
    "start_date": datetime.today() - timedelta(days=2),
    "on_failure_callback": on_failure_callback,  # defined elsewhere in the module
    "provide_context": True
}
I am forced to run every task manually :( rather than just letting the scheduler take its course and run them automatically.
Note: I am executing the backfill CLI commands via Airflow worker pods on a K8s cluster.
Has anyone else faced a similar issue using the backfill cli commands?
UPDATE:
I realised my backfill runs fall outside the total DAG interval, i.e. before the DAG's start_date, causing a blocking schedule dependency.
While you can still create the run, it will not run automatically, but you can manually run each task.
As a workaround, I would need to change the start_date to be on or before my oldest backfill date.
It would be nice if there were a way to override the backfill command, or a --force option that could mock the start_date for that specific dag_run, rather than being bound to the total interval.
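A minimal sketch of that workaround, reusing the dag_defaults from above: pin start_date to a fixed date on or before the oldest date you want to backfill, instead of a relative one.

from datetime import datetime

dag_defaults = {
    "start_date": datetime(2019, 1, 1),           # on or before the oldest backfill date
    "on_failure_callback": on_failure_callback,   # defined elsewhere, as above
    "provide_context": True
}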
I want to know the status of a DAG, whether it is running, failed, or succeeded. I am triggering the DAG through the CLI (airflow trigger_dag), and after the job executes I want to know the status of the run. I couldn't find any way.
I tried airflow dag_state but it is giving None. What should I do if there is more than one run in a day, to get the status of the latest run through the command line or through Python code?
You can use the list_dag_runs CLI command to list the DAG runs for a given dag ID. The information returned includes the state of each run.
You can also retrieve the information via Python code in a few different ways. One way that I've used in the past is the find method of airflow.models.dagrun.DagRun.
An example with python3 on how to get the state of dag_runs via DagRun.find():
from airflow.models import DagRun

dag_id = 'fake_dag_id'
dag_runs = DagRun.find(dag_id=dag_id)
for dag_run in dag_runs:
    print(dag_run.state)
You can use the following CLI command:
airflow dag_state dag_id execution_date
Example:
airflow dag_state test_dag_id 2019-11-08T18:36:39.628099+00:00
In the above example:
test_dag_id is the actual dag ID
2019-11-08T18:36:39.628099+00:00 is the execution date; you can get this from the Airflow UI for your run.
Another option is to use the Airflow REST API plugin; this is a better option. You can trigger a DAG and also check the status of the DAG:
https://github.com/teamclairvoyant/airflow-rest-api-plugin
I have multiple DAGs created in Airflow, but I want to trigger all of them via some common module or DAG. Can we create a workflow like in Azkaban, which has all the DAGs to be invoked listed in that flow?
You can create a DAG that triggers other DAGs using TriggerDagRunOperator, passing their dag_id and other appropriate args.
The source of the operator can be found in dagrun_operator.py.
Also I suggest you go through the Scheduling and Triggers section of the docs
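A rough sketch of such a "master" DAG (the dag_ids here are made up; note the import path and accepted kwargs differ by version: Airflow 2.x moves the operator to airflow.operators.trigger_dagrun, and very old 1.x releases also required a python_callable argument):

from datetime import datetime
from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator

with DAG(dag_id='master_dag',
         start_date=datetime(2019, 1, 1),
         schedule_interval='@daily') as dag:

    trigger_dag_a = TriggerDagRunOperator(
        task_id='trigger_dag_a',
        trigger_dag_id='dag_a',   # dag_id of the DAG to kick off
    )
    trigger_dag_b = TriggerDagRunOperator(
        task_id='trigger_dag_b',
        trigger_dag_id='dag_b',
    )

    trigger_dag_a >> trigger_dag_b  # chain them; leave unchained to trigger in parallel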
You can use SubdagOperator.
The main DAG sees and manages all the SubDAGs as normal tasks.
The Airflow admin GUI lists only the main DAG in the main DAGs list; it is then possible to "zoom in" to a SubDAG in the Graph View section of the GUI.
We can manage all DAGs through the main DAG, or manage each sub-DAG separately via its zoom-in option.
Here is an example for the operator:
https://github.com/apache/airflow/blob/master/airflow/example_dags/example_subdag_operator.py
I recommend using a common DAG factory to create all sub-DAGs if they all follow the same pattern and workflow.
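As a hedged sketch of that pattern (names and tasks are made up; the import paths are the Airflow 1.x ones): a single factory function builds each sub-DAG, and the main DAG attaches them with SubDagOperator.

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

default_args = {'start_date': datetime(2019, 1, 1)}

def make_subdag(parent_dag_id, child_name, args):
    # sub-DAG ids must follow the "<parent>.<child>" naming convention
    subdag = DAG(dag_id='{}.{}'.format(parent_dag_id, child_name),
                 default_args=args,
                 schedule_interval='@daily')
    DummyOperator(task_id='do_work', dag=subdag)
    return subdag

main_dag = DAG(dag_id='main_dag',
               default_args=default_args,
               schedule_interval='@daily')

for name in ['load_a', 'load_b']:
    SubDagOperator(
        task_id=name,                                       # must match the child name
        subdag=make_subdag('main_dag', name, default_args),
        dag=main_dag,
    )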