AWS MWAA (Managed Apache Airflow); Programmatically enable DAGs - amazon-web-services

We are using AWS MWAA. We add our DAG.py files to our S3 bucket programmatically. They then show up in the UI; however, they are "OFF", and you must click the "ON" toggle to start them.
EDIT: We may also sometimes want to turn a DAG that is ON back to OFF (programmatically).
I am looking to do this programmatically; however, I cannot figure out how.
The API does not seem to have it:
https://docs.aws.amazon.com/mwaa/latest/userguide/mwaa-actions-resources.html
Boto does not seem to have it:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/mwaa.html
Is it possible to toggle a DAG's status between OFF and ON (or ON and OFF) via an API?

This is not doable via the API, but you can use is_paused_upon_creation. This flag specifies whether the DAG is paused when it is created for the first time; if the DAG already exists, the flag is ignored.
You can set is_paused_upon_creation=False in the DAG constructor.
dag = DAG(
    dag_id='tutorial',
    default_args=default_args,
    is_paused_upon_creation=False,
)
Another option is to do it via the unpause CLI command:
airflow dags unpause [-h] [-S SUBDIR] dag_id
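On MWAA you cannot run the CLI directly on the environment, but the same command can be sent through the environment's CLI-token endpoint. Below is a minimal sketch, assuming an Airflow 2.x environment named my-mwaa-env, boto3 credentials, the requests library, and that dags pause/unpause are among the CLI commands your MWAA version supports:
# Send "dags unpause" to an MWAA environment via its CLI-token endpoint.
import base64

import boto3
import requests

mwaa = boto3.client("mwaa")
token = mwaa.create_cli_token(Name="my-mwaa-env")  # hypothetical environment name

resp = requests.post(
    f"https://{token['WebServerHostname']}/aws_mwaa/cli",
    headers={
        "Authorization": f"Bearer {token['CliToken']}",
        "Content-Type": "text/plain",
    },
    data="dags unpause tutorial",  # use "dags pause tutorial" to turn the DAG OFF
)
result = resp.json()
print(base64.b64decode(result["stdout"]).decode())  # CLI stdout
print(base64.b64decode(result["stderr"]).decode())  # CLI stderr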

Related

Airflow: How to list all the DAGs where the status is either ON/OFF using Cloud Shell?

I am trying to get a list of my DAGs where the status is either ON/OFF using Cloud Shell.
For example, in Cloud Shell, I want to write a command that would return a list of DAGs where the status is either ON/OFF.
Specifically, in the Airflow UI, when you see the list of DAGs, there is a column that indicates "ON" or "OFF" - therefore, I want to list these DAGs in Cloud Shell.
So far, I have used this command to list all of my DAGs:
gcloud composer environments storage dags list --environment=ENVIRONMENT --location=LOCATION
However, if I wanted to list only the DAGs whose status is "ON", how would I do this?
Likewise for listing the DAGs that are "OFF".

Trigger Airflow DAGs in Postman

I want to trigger Airflow DAGs in Postman.
Recently I have been hitting
http://localhost:8080/api/v1/dags/a_pipline/dagRuns/run_id
but it asks for a run_id.
How can I specify the run ID for a DAG run?
I found the answer after a lot of struggle. It works in Postman against the REST API:
POST: http://localhost:8080/api/v1/dags/a_pipline/dagRuns
Body: select JSON format
{
    "dag_run_id": "manual__2022-10-22T11:05:59.461142+00:00"
}
Coming to Airflow from 15 years of using LSF, the terminology trips me up.
A dag_id is the DAG's "name".
The above is a dag_run_id, which is the ID of an instance of a DAG that has run or is in the process of running; it is what LSF would call a flow_id.
In LSF these are purely numerical, issued sequentially across all flow instances: 1, 2, 3, and so on.
In Airflow, they are date-time stamps, which in theory are not unique across the whole system, but should be unique within a DAG.
In the airflow CLI - you trigger a dag as follows:
airflow dags trigger name_of_the_dag
(glossing over authentication)
I'm just getting to grips with scripts doing this via the REST API...
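For reference, here is a minimal sketch of doing the same thing via the stable REST API from Python, assuming the basic_auth API backend with placeholder credentials (airflow/airflow) and the dag_id a_pipline from the question above:
# Trigger a DAG run via POST /api/v1/dags/{dag_id}/dagRuns.
import requests

resp = requests.post(
    "http://localhost:8080/api/v1/dags/a_pipline/dagRuns",
    auth=("airflow", "airflow"),  # assumed basic-auth credentials
    json={"conf": {}},            # dag_run_id is optional; Airflow generates one if omitted
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])  # e.g. manual__2022-10-22T11:05:59.461142+00:00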

How to get dag status like running or success or failure

I want to know the status of a DAG: whether it is running, failed, or succeeded. I am triggering the DAG through the CLI with airflow trigger, and after the job executes I want to know the status of the run. I couldn't find any way to do this.
I tried airflow dag_state but it returns None. If there is more than one run in a day, how do I get the status of the latest run, either through the command line or through Python code?
You can use the list_dag_runs CLI command to list the DAG runs for a given DAG ID. The information returned includes the state of each run.
You can also retrieve the information via Python code in a few different ways. One such way that I've used in the past is the find method in airflow.models.dagrun.DagRun.
An example with Python 3 of how to get the state of DAG runs via DagRun.find():
from airflow.models import DagRun

dag_id = 'fake_dag_id'
dag_runs = DagRun.find(dag_id=dag_id)
for dag_run in dag_runs:
    print(dag_run.state)
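Since the question also asks for the latest run when there is more than one run in a day, here is a minimal sketch that picks the most recent run by execution date (it assumes the code runs somewhere with access to the Airflow metadata database, e.g. on the scheduler host):
from airflow.models import DagRun

dag_runs = DagRun.find(dag_id='fake_dag_id')
if dag_runs:
    # Pick the run with the most recent execution date and print its state.
    latest = max(dag_runs, key=lambda run: run.execution_date)
    print(latest.run_id, latest.state)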
You can use the following CLI command:
airflow dag_state dag_id execution_date
Example:
airflow dag_state test_dag_id 2019-11-08T18:36:39.628099+00:00
In the above example:
test_dag_id is the actual DAG ID
2019-11-08T18:36:39.628099+00:00 is the execution date. You can get this from airflow UI for your run.
Another option is to use the Airflow REST API plugin. This is the better option: you can trigger a DAG and also check the status of the DAG.
https://github.com/teamclairvoyant/airflow-rest-api-plugin

In Airflow, how can I create a workflow with multiple DAGs that can be invoked by executing that workflow?

I have multiple DAGs created in Airflow, but I want to trigger all of them via some common module or DAG. Can we create a workflow, as in Azkaban, that lists and invokes all of those DAGs in one flow?
You can create a DAG that triggers other DAGs using the TriggerDagRunOperator, passing their dag_id and other appropriate arguments.
The source of the operator can be found in dagrun_operator.py.
I also suggest going through the Scheduling and Triggers section of the docs.
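As a minimal sketch (assuming Airflow 2.x import paths; the dag_ids dag_a and dag_b are placeholders for your own DAGs), a "controller" DAG could look like this:
# A controller DAG that triggers two other DAGs with TriggerDagRunOperator.
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="trigger_all",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,  # trigger manually
    catchup=False,
) as dag:
    trigger_a = TriggerDagRunOperator(
        task_id="trigger_dag_a",
        trigger_dag_id="dag_a",  # placeholder dag_id
    )
    trigger_b = TriggerDagRunOperator(
        task_id="trigger_dag_b",
        trigger_dag_id="dag_b",  # placeholder dag_id
    )
    trigger_a >> trigger_b  # fire dag_b's trigger after dag_a's trigger task completes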
You can use the SubDagOperator.
The main DAG sees and manages all the SubDAGs as normal tasks.
The Airflow admin GUI lists only the main DAG in the main DAGs list; it is then possible to "zoom in" to a SubDAG in the Graph View section of the GUI.
You can manage all the DAGs through the main DAG, or each sub-DAG separately via its zoom-in option.
Here is an example for the operator:
https://github.com/apache/airflow/blob/master/airflow/example_dags/example_subdag_operator.py
I recommend using a common DAG factory to create all the sub-DAGs if they all follow the same pattern and workflow, as sketched below.
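A minimal sketch of that factory pattern, assuming Airflow 2.x import paths (the ids parent_dag, section_1/section_2 and the placeholder task are made up; newer Airflow versions favour TaskGroups over SubDAGs):
# A parent DAG whose sections are sub-DAGs built by one common factory function.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.subdag import SubDagOperator

default_args = {"owner": "airflow", "start_date": datetime(2022, 1, 1)}


def make_subdag(parent_dag_id, child_id, args):
    """Factory that builds one sub-DAG following a common pattern."""
    with DAG(
        dag_id=f"{parent_dag_id}.{child_id}",  # SubDAG ids must be parent.child
        default_args=args,
        schedule_interval=None,
    ) as subdag:
        DummyOperator(task_id="do_work")       # placeholder for the real tasks
    return subdag


with DAG(
    dag_id="parent_dag",
    default_args=default_args,
    schedule_interval=None,
    catchup=False,
) as dag:
    for child_id in ("section_1", "section_2"):
        SubDagOperator(
            task_id=child_id,
            subdag=make_subdag("parent_dag", child_id, default_args),
        )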

Get the BigQuery Table creator and Google Storage Bucket Creator Details

I am trying to identify the users who created tables in BigQuery.
Is there any command line or API that would provide this information? I know that audit logs provide it, but I was looking for a command line approach that I could wrap in a shell script and run against all the tables at one time. The same applies to Google Storage buckets. I did try
gsutil iam get gs://my-bkt and looked for the "role": "roles/storage.admin" role, but I do not find the admin role on all buckets. Any help?
This is a use case for audit logs. BigQuery tables don't report metadata about the original resource creator, so scanning via tables.list or inspecting the ACLs doesn't really expose who created the resource, only who currently has access.
What's the use case? You could certainly export the audit logs back into BigQuery and query for table creation events going forward, but that's not exactly the same.
You can find this out using audit logs. You can access them either via the Console's Logs Explorer or using the gcloud tool from the CLI.
The log filter that you're interested in is this one:
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
If you want to run it from the command line, you'd do something like this:
gcloud logging read \
'
resource.type = ("bigquery_project" OR "bigquery_dataset")
logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
protoPayload.resourceName = "projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"
'\
--limit 10
You can then post-process the output to find out who created the table; look for the principalEmail field.
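As an alternative to shelling out, here is a minimal sketch of the same lookup from Python, assuming the google-cloud-logging client library is installed and that YOUR_PROJECT, curb_tracking, and YOUR_TABLE are replaced with real values (the payload keys follow the JSON form of the AuditLog proto):
# Read the BigQuery table-creation audit-log entries and print the creator.
from google.cloud import logging as gcp_logging

client = gcp_logging.Client(project="YOUR_PROJECT")
log_filter = (
    'resource.type = ("bigquery_project" OR "bigquery_dataset") '
    'logName="projects/YOUR_PROJECT/logs/cloudaudit.googleapis.com%2Factivity" '
    'protoPayload.methodName = "google.cloud.bigquery.v2.TableService.InsertTable" '
    'protoPayload.resourceName = '
    '"projects/YOUR_PROJECT/datasets/curb_tracking/tables/YOUR_TABLE"'
)

for entry in client.list_entries(filter_=log_filter, max_results=10):
    payload = entry.payload or {}  # dict form of the AuditLog payload
    print(payload.get("authenticationInfo", {}).get("principalEmail"))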