I want to trigger Airflow DAGs from Postman.
Recently I have been hitting
http://localhost:8080/api/v1/dags/a_pipline/dagRuns/run_id
It's asking for a run_id.
How can I specify the run ID for a DAG?
I found the answer after a lot of struggle. It works in Postman via the REST API:
POST: http://localhost:8080/api/v1/dags/a_pipline/dagRuns
Body (select JSON format):
{
"dag_run_id": "manual__2022-10-22T11:05:59.461142+00:00"
}
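For scripting the same thing outside Postman, here is a minimal sketch using Python requests, assuming the stable REST API with the basic-auth backend enabled; the airflow/airflow credentials are placeholders:
import requests

# Placeholders: adjust the host, DAG id and credentials for your deployment.
resp = requests.post(
    "http://localhost:8080/api/v1/dags/a_pipline/dagRuns",
    auth=("airflow", "airflow"),
    json={
        # Optional: omit dag_run_id and Airflow will generate one for you.
        "dag_run_id": "manual__2022-10-22T11:05:59.461142+00:00",
        "conf": {},
    },
)
resp.raise_for_status()
print(resp.json())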
Coming to Airflow from 15 years of using LSF, the terminology trips me up.
A dag_id is a DAG's "name".
The above is a "dag_run_id", which is the ID of an instance of a DAG that has run or is in the process of running. It is what LSF would call its flow_id.
In LSF these are purely numerical, issued sequentially across all flow instances: 1, 2, 3 and so on.
In Airflow, these are datetime stamps, which theoretically are not unique across the whole system, but should be unique within a DAG.
In the Airflow CLI, you trigger a DAG as follows:
airflow dags trigger name_of_the_dag
(glossing over authentication)
I'm just getting to grips with scripts doing this via the REST API...
We are using AWS MWAA. We add our DAG.py files to our S3 bucket programmatically. They then show up in the UI. However, they are "OFF" and you must click the "ON" button to start them.
EDIT: Also, we may sometimes want to turn a DAG that's ON to OFF (programmatically).
I am looking to do this programmatically; however, I cannot figure out how.
The API does not seem to have it:
https://docs.aws.amazon.com/mwaa/latest/userguide/mwaa-actions-resources.html
Boto does not seem to have it:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/mwaa.html
Is it possible to toggle a DAG's status between ON and OFF via an API?
This is not doable via the API, but you can use is_paused_upon_creation. This flag specifies whether the DAG is paused when created for the first time; if the DAG already exists, the flag is ignored.
You can set is_paused_upon_creation=False in the DAG constructor.
dag = DAG(
    dag_id='tutorial',
    default_args=default_args,
    is_paused_upon_creation=False,
)
Another option is to do it via the unpause CLI command:
airflow dags unpause [-h] [-S SUBDIR] dag_id
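On MWAA specifically, one way to run that same CLI command programmatically (a sketch, assuming MWAA's CLI token endpoint; the environment name and DAG id are placeholders) is:
import base64
import boto3
import requests

# Placeholders: your MWAA environment name and the DAG to unpause.
env_name = "my-mwaa-environment"
dag_id = "tutorial"

mwaa = boto3.client("mwaa")
token = mwaa.create_cli_token(Name=env_name)

resp = requests.post(
    f"https://{token['WebServerHostname']}/aws_mwaa/cli",
    headers={
        "Authorization": f"Bearer {token['CliToken']}",
        "Content-Type": "text/plain",
    },
    data=f"dags unpause {dag_id}",  # the CLI command, without the leading "airflow"
)
resp.raise_for_status()
result = resp.json()
# MWAA returns base64-encoded stdout/stderr from the CLI command.
print(base64.b64decode(result["stdout"]).decode())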
So I have a very simple Python script that writes a txt file to my Google Storage bucket.
I just want to set this job to run each hour, i.e. not based on a trigger. It seems that when using the SDK, it needs to have a --trigger- flag, but I only want it to be "triggered" by the scheduler.
Is that possible?
You can create a Cloud Function with a Pub/Sub trigger and then create a Cloud Scheduler job targeting the topic, which triggers the function.
I did it by following these steps:
Create a Cloud Function with Pub/Sub trigger
Select your topic or create a new one
This is the default code I am using:
exports.helloPubSub = (event, context) => {
  const message = event.data
    ? Buffer.from(event.data, 'base64').toString()
    : 'Hello, World';
  console.log(message);
};
Create a Cloud Scheduler job targeting the same Pub/Sub topic
Check that it is working.
I tried it with the frequency * * * * * (every minute) and it works for me; I can see the logs from the Cloud Function.
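If you would rather create the Scheduler job from code than from the console, here is a rough sketch using the google-cloud-scheduler Python client; the project, region, topic and hourly cron expression are assumptions for this example:
from google.cloud import scheduler_v1

# Placeholders: your project, region and the Pub/Sub topic that triggers the function.
PROJECT = "my-project"
LOCATION = "us-central1"
TOPIC = f"projects/{PROJECT}/topics/hourly-upload-topic"

client = scheduler_v1.CloudSchedulerClient()
parent = f"projects/{PROJECT}/locations/{LOCATION}"

job = scheduler_v1.Job(
    name=f"{parent}/jobs/hourly-upload",
    schedule="0 * * * *",  # every hour, on the hour
    time_zone="Etc/UTC",
    pubsub_target=scheduler_v1.PubsubTarget(
        topic_name=TOPIC,
        data=b"run",  # payload delivered to the function
    ),
)

created = client.create_job(parent=parent, job=job)
print(f"Created job: {created.name}")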
Currently, in order to execute a Cloud Function it needs to be triggered, because once it finishes executing the only way to run it again is through its trigger.
You can also follow the same steps I indicated on this page, where you can find some images for further help.
I have multiple DAGs created in Airflow, but I want to trigger all of them via some common module or DAG. Can we create a workflow, like in Azkaban, that has all the DAGs to be invoked listed in that flow?
You can create a DAG that triggers other DAGs using the TriggerDagRunOperator, passing their dag_id and other appropriate args.
The source of the operator can be found in dagrun_operator.py.
I also suggest you go through the Scheduling and Triggers section of the docs.
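A minimal sketch of such a controller DAG (the dag_ids of the child DAGs are placeholders; note that newer Airflow releases import the operator from airflow.operators.trigger_dagrun rather than dagrun_operator):
from datetime import datetime

from airflow import DAG
# Airflow 2.x import path; older releases use airflow.operators.dagrun_operator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="master_trigger_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One trigger task per downstream DAG; the dag_ids here are placeholders.
    trigger_a = TriggerDagRunOperator(task_id="trigger_dag_a", trigger_dag_id="dag_a")
    trigger_b = TriggerDagRunOperator(task_id="trigger_dag_b", trigger_dag_id="dag_b")

    trigger_a >> trigger_b  # or leave them unchained to run in parallel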
You can use the SubDagOperator.
The main DAG sees and manages all the SubDAGs as normal tasks.
The Airflow admin GUI lists only the main DAG in the main DAGs list; it is then possible to "zoom in" to a SubDAG in the Graph View section of the GUI.
We can manage all DAGs through the main DAG, or each sub-DAG separately by going through its zoom-in option.
Here is an example for the operator:
https://github.com/apache/airflow/blob/master/airflow/example_dags/example_subdag_operator.py
I recommend using a common DAG factory to create all sub-DAGs if they all follow the same pattern and workflow.
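A rough sketch of that factory pattern (names are placeholders; on Airflow 1.10 the import path is airflow.operators.subdag_operator instead):
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
# Airflow 2.x import path; 1.10 uses airflow.operators.subdag_operator
from airflow.operators.subdag import SubDagOperator

DEFAULT_ARGS = {"start_date": datetime(2022, 1, 1)}


def build_subdag(parent_dag_id, child_id, default_args):
    """Factory: the sub-DAG's dag_id must be '<parent_dag_id>.<child_id>'."""
    subdag = DAG(
        dag_id=f"{parent_dag_id}.{child_id}",
        default_args=default_args,
        schedule_interval="@daily",
    )
    BashOperator(task_id="do_work", bash_command=f"echo running {child_id}", dag=subdag)
    return subdag


with DAG(
    dag_id="main_dag",
    default_args=DEFAULT_ARGS,
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each section appears as a single task in main_dag and can be "zoomed into".
    section_1 = SubDagOperator(
        task_id="section_1",
        subdag=build_subdag("main_dag", "section_1", DEFAULT_ARGS),
    )
    section_2 = SubDagOperator(
        task_id="section_2",
        subdag=build_subdag("main_dag", "section_2", DEFAULT_ARGS),
    )

    section_1 >> section_2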
I am having a problem importing data from an Excel sheet into an Amazon DynamoDB table. I have the Excel sheet in an Amazon S3 bucket, and I want to import data from this sheet to a table in DynamoDB.
Currently I am following Import and Export DynamoDB Data Using AWS Data Pipeline, but my pipeline is not working normally.
It gives me a WAITING_FOR_RUNNER status, and after some time the status changes to CANCELED. Please suggest what I am doing wrong, or is there another way to import data from an Excel sheet to a DynamoDB table?
The potential reasons are as follows:
Reason 1:
If your pipeline is in the SCHEDULED state and one or more tasks appear stuck in the WAITING_FOR_RUNNER state, ensure that you set a valid value for either the runsOn or workerGroup fields for those tasks. If both values are empty or missing, the task cannot start because there is no association between the task and a worker to perform the tasks. In this situation, you've defined work but haven't defined what computer will do that work. If applicable, verify that the workerGroup value assigned to the pipeline component is exactly the same name and case as the workerGroup value that you configured for Task Runner.
Reason 2:
Another potential cause of this problem is that the endpoint and access key provided to Task Runner are not the same as the AWS Data Pipeline console or the computer where the AWS Data Pipeline CLI tools are installed. You might have created new pipelines with no visible errors, but Task Runner polls the wrong location due to the difference in credentials, or polls the correct location with insufficient permissions to identify and run the work specified by the pipeline definition.
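For illustration only, here is a fragment (not a complete definition) showing what that runsOn association looks like when pushing objects with boto3; the object ids, fields and pipeline id are placeholders, and a real DynamoDB import would use the EMR resource and activity from the template:
import boto3

# Placeholders throughout; the point is that the activity object names a
# resource via "runsOn" (or carries a "workerGroup" value). Without that
# association the task has nothing to run on and sits in WAITING_FOR_RUNNER.
client = boto3.client("datapipeline")

pipeline_objects = [
    {
        "id": "MyEc2Resource",
        "name": "MyEc2Resource",
        "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "terminateAfter", "stringValue": "2 Hours"},
        ],
    },
    {
        "id": "MyActivity",
        "name": "MyActivity",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "echo hello"},
            # The association this troubleshooting step is about:
            {"key": "runsOn", "refValue": "MyEc2Resource"},
        ],
    },
]

client.put_pipeline_definition(
    pipelineId="df-EXAMPLEID",  # placeholder pipeline id
    pipelineObjects=pipeline_objects,
)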
I know there are APIs to configure notifications when a job fails or finishes.
But what if, say, I run a Hive query that counts the number of rows in a table, and if the returned result is zero I want to send out emails to the concerned parties? How can I do that?
Thanks.
You may want to look at Airflow and Qubole's operator for Airflow. We use Airflow to orchestrate all jobs being run using Qubole and, in some cases, non-Qubole environments. We use the DataDog API to report success/failure of each task (Qubole / non-Qubole). DataDog in this case can be replaced by Airflow's email operator. Airflow also has chat operators (such as for Slack).
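As a minimal sketch of the "fail the task and let Airflow's email operator notify" idea (the get_row_count stub, task ids and recipient address are placeholders; wire in your own Qubole/Hive query to fetch the count):
from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowFailException
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator


def get_row_count():
    """Placeholder: replace with a Qubole/Hive query that returns the row count."""
    return 0


def check_row_count():
    if get_row_count() == 0:
        # Failing this task is what lets the downstream notification fire.
        raise AirflowFailException("Query returned zero rows")


with DAG(
    dag_id="row_count_alert",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    check = PythonOperator(task_id="check_row_count", python_callable=check_row_count)

    notify = EmailOperator(
        task_id="notify_on_empty_table",
        to="concerned-parties@example.com",
        subject="Row count is zero",
        html_content="The count query returned zero rows.",
        trigger_rule="all_failed",  # only runs if the check task failed
    )

    check >> notify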
There is no direct API for triggering a notification based on the results of a query.
However, there is a way to do this using Qubole:
- Create a workflow in Qubole with the following steps:
1. Your query (any query) that writes its output to a particular location on S3.
2. A shell script that reads the result from S3 and fails the job based on any criteria; for instance, in your case, fail the job if the result has 0 rows (a sketch follows below).
- Schedule this workflow using the "Scheduler" API to notify on failure.
You can also use the "sendmail" shell command to send mail based on the results in step 2 above.
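The step-2 check could look something like the following, sketched here in Python rather than a shell script for illustration; the bucket and key are placeholders for wherever step 1 writes its result:
import sys
import boto3

# Placeholders: the bucket/key where step 1 wrote the query result.
BUCKET = "my-query-results"
KEY = "row_count/part-00000"

s3 = boto3.client("s3")
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode().strip()
row_count = int(body or 0)

if row_count == 0:
    # A non-zero exit fails the job, so the scheduler's on-failure notification fires.
    print("Query returned zero rows; failing the job to trigger notification")
    sys.exit(1)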