Resetting start_date of an existing Airflow DAG - airflow-scheduler

Can I reset the start_date of an existing Airflow DAG without changing its name? I have a DAG currently running in Airflow.

You should never change start_date without changing the DAG name; doing so will lead to unpredictable results. Changing the DAG name is also safe for your operations: as soon as the name changes, the DAG under the old name immediately stops running and the new DAG is picked up within seconds, so the old-name DAG and the new-name DAG never exist simultaneously if you simply change the name and save. It is especially safe if you use git to sync your code and always change the DAG name whenever the start_date or schedule interval changes.
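For illustration, a minimal sketch of that convention, with hypothetical dag names and dates: bump a version suffix in the dag_id whenever the start_date or schedule_interval changes, so the scheduler treats it as a brand-new DAG.
from datetime import datetime
from airflow import DAG

# hypothetical example: the dag was previously "my_pipeline_v1"; renaming it
# makes the scheduler register a fresh dag with the new start_date
dag = DAG(
    dag_id="my_pipeline_v2",
    start_date=datetime(2022, 1, 1),  # the new start_date
    schedule_interval="@daily",
)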

Related

Return Glue Job Status in Airflow

I am using an older version of Airflow (1.10). We are using Python operators to trigger Glue jobs because Glue operators aren't available in this version. We have multiple jobs that need to run in a particular order. When we run the DAG, the first task is marked as succeeded as soon as the Glue job is successfully started.
We are trying to use boto3 to check the status of the job, but we need it to do so continually. Any thoughts on how to check the status continually and only move on to the next Python operator upon success?
Well, you could try to replicate the .job_completion method from the GlueJobSensor. So basically:
import time
import boto3

POKE_INTERVAL = 60  # seconds to wait between status checks

def my_glue_job_that_waits(job_name):
    glue = boto3.client("glue")
    # botocore call that starts the job; keep the run id so we can poll it
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        try:
            # botocore call to retrieve the current job run state
            job_state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
            if job_state == "SUCCEEDED":
                # log statement
                return run_id  # or whatever you want the operator to return
            elif job_state in ("FAILED", "STOPPED", "TIMEOUT"):
                raise RuntimeError("Glue job ended in state %s" % job_state)
            else:
                # log statement
                time.sleep(POKE_INTERVAL)
        except Exception:
            # what you want to happen if the call above fails
            raise
But I highly encourage you to upgrade to Airflow 2 if you can. Long term it will save you a lot of time, both by letting you use new features and by avoiding conflicts with provider packages.
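For completeness, a sketch of how the callable above could be wired into a task on Airflow 1.10 (the task id and Glue job name are hypothetical):
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10 import path

run_and_wait = PythonOperator(
    task_id="run_and_wait_for_glue_job",
    python_callable=my_glue_job_that_waits,
    op_kwargs={"job_name": "my-glue-job"},  # hypothetical Glue job name
    dag=dag,
)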

Trigger Airflow in postman

I want to trigger Airflow DAGs from Postman.
Recently I have been hitting
http://localhost:8080/api/v1/dags/a_pipline/dagRuns/run_id
and it asks for a run_id.
How can I specify the run id for a DAG?
I found the answer after a lot of struggle. It works in Postman against the REST API:
POST: http://localhost:8080/api/v1/dags/a_pipline/dagRuns
Body: select JSON format
{
    "dag_run_id": "manual__2022-10-22T11:05:59.461142+00:00"
}
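Outside Postman, the same call can be made from a script; a minimal sketch with the requests library, assuming the basic-auth API backend is enabled and "admin"/"admin" are valid credentials for your deployment:
import requests

resp = requests.post(
    "http://localhost:8080/api/v1/dags/a_pipline/dagRuns",
    json={"dag_run_id": "manual__2022-10-22T11:05:59.461142+00:00"},
    auth=("admin", "admin"),  # assumption: basic auth is enabled for the API
)
print(resp.status_code, resp.json())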
Coming to Airflow from 15 years of using LSF, the terminology trips me up.
A dag_id is its "name"
The above is a "dag_run_id", which is the id of an instance of a dag that has run or is in the process of running. It is what LSF would call its flow_id.
In LSF these are purely numerical, issued sequentially across all flow instances: 1, 2, 3 and so on.
In Airflow, these are datetimestamps, which theoretically are not unique across the whole system, but should be unique within a dag.
In the Airflow CLI you trigger a dag as follows:
airflow dags trigger name_of_the_dag
(glossing over authentication)
I'm just getting to grips with scripts doing this via the REST API...

Airflow Backfill DAG runs stuck running with first task in queued (grey) state

I have looked at similar answers on Stack Overflow, however my case is slightly different.
I am executing backfill jobs via the Airflow CLI, and the backfilled dag runs get stuck in a running state, with the first task in the dag in a queued (grey) state.
The scheduler doesn't seem to ever kick off the first task.
I do not have depends_on_past=True set in my dag_defaults:
dag_defaults = {
    "start_date": datetime.today() - timedelta(days=2),
    "on_failure_callback": on_failure_callback,
    "provide_context": True
}
I am forced to run every task manually :( rather than just letting the scheduler take its course and run them automatically.
Note: I am executing the backfill cli commands via Airflow worker pods on a K8S cluster.
Has anyone else faced a similar issue using the backfill cli commands?
UPDATE:
I realised my backfill runs fall outside the total dag interval, i.e. before the dag start_date, causing a blocking schedule dependency.
While you can still create the run, it will not run automatically, but you can manually run each task.
As a workaround I would need to change the start_date to be on or before my oldest backfill date.
It would be nice if there were a way to override the backfill command, or a --force option that could mock the start_date for that specific dag_run, rather than being bound to the total interval.
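For illustration, a minimal sketch of that workaround, assuming (hypothetically) that the oldest date to backfill is 2022-01-01:
from datetime import datetime

dag_defaults = {
    "start_date": datetime(2021, 12, 31),  # on or before the oldest backfill date
    "on_failure_callback": on_failure_callback,
    "provide_context": True
}
# then run the backfill CLI command for any window starting on or after 2022-01-01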

How to get dag status like running or success or failure

I want to know the status of a dag: whether it is running, failed, or succeeded. I am triggering the dag through the CLI (airflow trigger), and after the execution of the job I want to know the status of the run, but I couldn't find any way.
I tried airflow dag_state but it gives None. What should I do, if there is more than one run in a day, to get the status of the latest run through the command line or through Python code?
You can use the list_dag_runs command with the CLI to list the dag runs for a given dag ID. The information returned includes the state of each run.
You can also retrieve the information via Python code in a few different ways. One such way that I've used in the past is the find method in airflow.models.dagrun.DagRun.
An example with python3 on how to get the state of dag_runs via DagRun.find():
from airflow.models import DagRun

dag_id = 'fake_dag_id'
dag_runs = DagRun.find(dag_id=dag_id)
for dag_run in dag_runs:
    print(dag_run.state)
You can use the following CLI command:
airflow dag_state dag_id execution_date
Example:
airflow dag_state test_dag_id 2019-11-08T18:36:39.628099+00:00
In the above example:
test_dag_id is the actual dag id
2019-11-08T18:36:39.628099+00:00 is the execution date. You can get this from the Airflow UI for your run.
Another option is to use the Airflow REST API plugin. This is a better option: you can trigger a DAG and also check its status.
https://github.com/teamclairvoyant/airflow-rest-api-plugin
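On newer Airflow versions (2.x), the built-in stable REST API can do the same without a plugin. A minimal sketch, assuming the basic-auth API backend is enabled and "admin"/"admin" are valid credentials for your deployment:
import requests

resp = requests.get(
    "http://localhost:8080/api/v1/dags/test_dag_id/dagRuns",
    auth=("admin", "admin"),  # assumption: basic auth is enabled for the API
)
for run in resp.json()["dag_runs"]:
    print(run["dag_run_id"], run["state"])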

pass argument to DAG, task apache airflow

I'm exploring Apache Airflow 1.8. I'm curious whether there is any way to pass arguments to DAGs or tasks while backfilling.
Here is something like what I'm searching for:
airflow backfill My_DAG -s some_date -e end_date argument_for_t1 argument_for_t2
or it could be an array of args.
Is there any way to pass arguments? I've searched a lot, but wasn't able to find anything.
Backfill is meant to re-run a failed/missed/misconfigured DAG, and hence there is no provision for command-line arguments.
However, you can go to the DagBag folder and give params to the dag object (change/fix the existing dag) and then backfill the DAG, as sketched below.
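For illustration, a minimal sketch of what giving params to the dag object could look like (the dag and param names are taken from the question; everything else is hypothetical):
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

dag = DAG(
    "My_DAG",
    start_date=datetime(2017, 1, 1),  # hypothetical start_date
    params={"argument_for_t1": "some value"},  # edit here before backfilling
)

t1 = BashOperator(
    task_id="t1",
    bash_command="echo {{ params.argument_for_t1 }}",  # dag-level params are available in templated fields
    dag=dag,
)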
I came across the solution that I can set variables using airflow variables -s my_var="some value"
Now I can access my_var anywhere across the application (see the sketch below).
For more options, check the CLI documentation.
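For reference, a minimal sketch of reading that variable back inside a task callable (the variable name matches the CLI example above; the callable name is hypothetical):
from airflow.models import Variable

def my_task_callable(**context):
    # returns the value set via `airflow variables -s my_var="some value"`
    my_var = Variable.get("my_var")
    print(my_var)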