Return Glue Job Status in Airflow - amazon-web-services

I am using an older version of Airflow (1.10). We are using Python operators to trigger Glue jobs because Glue operators aren't available in this version. We have multiple jobs that need to run in a particular order. When we run the DAG, our first job triggers and the task is immediately marked as succeeded, since the job was successfully started.
We are trying to use boto3 to check the status of the job, but we need it to do so continually. Any thoughts on how to check the status continually and only move on to the next Python operator upon success?

Well, you could try to replicate the .job_completion method from the GlueJobSensor. So basically:
import time
import boto3
from botocore.exceptions import ClientError

POKE_INTERVAL = 60  # seconds between status checks

def my_glue_job_that_waits(job_name):
    glue = boto3.client('glue')
    # boto3 call that starts the job
    run_id = glue.start_job_run(JobName=job_name)['JobRunId']
    while True:
        try:
            # boto3 call to retrieve the job state
            job_state = glue.get_job_run(JobName=job_name, RunId=run_id)['JobRun']['JobRunState']
            if job_state == 'SUCCEEDED':
                # log statement, then return what you want the operator to return
                return run_id
            elif job_state in ('FAILED', 'STOPPED', 'TIMEOUT', 'ERROR'):
                # fail the task instead of polling forever
                raise RuntimeError('Glue job run {} ended in state {}'.format(run_id, job_state))
            else:
                # log statement
                time.sleep(POKE_INTERVAL)
        except ClientError:
            # what you want to happen if the status call above fails
            raise
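A rough sketch of wiring that into your 1.10 DAG with a plain PythonOperator (the task id and Glue job name are placeholders, not something from your setup):
from airflow.operators.python_operator import PythonOperator

run_and_wait_for_glue_job = PythonOperator(
    task_id='run_and_wait_for_glue_job',
    python_callable=my_glue_job_that_waits,
    op_kwargs={'job_name': 'my-glue-job'},  # placeholder Glue job name
    dag=dag,
)
Downstream Python operators then just sit behind this task in the dependency chain, so they only start once the Glue job has actually succeeded.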
But I highly encourage you to upgrade to Airflow 2 if you can. In the long term it will save you a lot of time, both by letting you use new features and by avoiding conflicts with provider packages.

Related

GCP Airflow Operators: BQ LOAD and sensor for job ID

I need to automatically schedule a bq load process that gets Avro files from a GCS bucket, loads them into BigQuery, and waits for completion in order to then execute another task, specifically a task that will read from the above-mentioned table.
As shown here, there is a nice API to run this [command][1], for example:
bq load \
--source_format=AVRO \
mydataset.mytable \
gs://mybucket/mydata.avro
This will give me a job_id
Waiting on bqjob_Rsdadsajobc9sda5dsa17_0sdasda931b47_1 ... (0s) Current status
job_id that I can check with bq show --job=true bqjob_Rsdadsajobc9sda5dsa17_0sdasda931b47_1
And that is nice... I guess under the hood the bq load job is a DataTransfer. I found some operators related to this: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/bigquery_dts.html
Even though the documentation does not specifically cover Avro load configuration, digging through it gave me what I was looking for.
My question is: Is there an easier way of getting the status of the job, given a job_id, similar to the bq show --job=true <job_id> command?
Is there something that might help me avoid going through creating a DataTransfer job, starting it, monitoring it and then deleting it (I don't need it to stay there, since the parameters will change next time)?
Maybe a custom sensor, using the python-sdk-api?
Thank you in advance.
[1]: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
I think you can use the GCSToBigQueryOperator operator and task sequencing with Airflow to solve your issue:
import airflow
from airflow.operators.dummy import DummyOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with airflow.DAG(
        'dag_name',
        default_args=your_args,
        schedule_interval=None) as dag:

    load_avro_to_bq = GCSToBigQueryOperator(
        task_id='load_avro_file_to_bq',
        bucket='{your_bucket}',
        source_objects=['folder/{SOURCE}/*.avro'],
        destination_project_dataset_table='{your_project}:{your_dataset}.{your_table}',
        source_format='AVRO',
        compression=None,
        create_disposition='CREATE_NEVER',
        write_disposition='WRITE_TRUNCATE'
    )

    second_task = DummyOperator(task_id='next_operator', dag=dag)

    load_avro_to_bq >> second_task
The first operator loads the Avro file from GCS into BigQuery.
If this operator succeeds, the second task is executed; otherwise it is not.
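As for checking an existing job's status by job_id (the bq show --job=true <job_id> part of the question), a minimal sketch with the google-cloud-bigquery Python client could look like this (the project, job ID and location below are placeholders):
from google.cloud import bigquery

client = bigquery.Client(project='your_project')  # placeholder project

# Same information as `bq show --job=true <job_id>`
job = client.get_job('bqjob_Rsdadsajobc9sda5dsa17_0sdasda931b47_1', location='US')  # placeholder job ID and location
print(job.job_type, job.state)  # e.g. "load", "DONE"
if job.state == 'DONE' and job.error_result:
    print('Job failed:', job.error_result)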

Return a python object from GKEPodOperator

I have an Airflow DAG that calls a GKEPodOperator, where a Python script is run to process and load data into BigQuery. Can the GKEPodOperator return something like a Python list or dictionary from the executed script back to the DAG, so I can make use of that to write custom emails using DAGOperator?
First, GKEPodOperator is deprecated. You should use GKEStartPodOperator.
Like other operators, you can pass values between tasks with XCom. Note that GKEStartPodOperator inherits from KubernetesPodOperator, so the XCom mechanism is different from other operators: it works by launching a sidecar container. You can read more about it with examples here.
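In practice that means the script running in the pod writes its result as JSON to a well-known path, and the operator is created with do_xcom_push=True so the sidecar can pick it up. A minimal sketch of the pod side (the payload shape here is just an assumption):
import json

result = {'rows_loaded': 123, 'table': 'my_dataset.my_table'}  # hypothetical payload

# The xcom sidecar container reads exactly this path from the shared volume
with open('/airflow/xcom/return.json', 'w') as f:
    json.dump(result, f)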
Now that you have the desired value stored as an XCom (as a string), you want to pull it so you can use it in your custom operator:
Airflow >= 2.1.0 (not yet released)
There is native support for converting XCom directly into a native Python object. See PR. You will need to set render_template_as_native_obj=True. You can read more about it in the docs: Rendering Fields as Native Python Objects
Airflow < 2.1.0:
In the follow-up task you need to pull the XCom value and convert it to whatever Python object you'd like. You can see an example of it here.
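For the pre-2.1 case, a small sketch of what such a follow-up task could look like (this assumes Airflow 2 import paths, an existing dag object, and a placeholder upstream task_id):
import json
from airflow.operators.python import PythonOperator

def build_custom_email(**context):
    raw = context['ti'].xcom_pull(task_ids='gke_pod_task')  # placeholder upstream task_id
    # Depending on version/configuration this may arrive as a JSON string; normalize it
    result = json.loads(raw) if isinstance(raw, str) else raw
    print('Rows loaded:', result.get('rows_loaded'))

build_email = PythonOperator(
    task_id='build_custom_email',
    python_callable=build_custom_email,
    dag=dag,  # assumes an existing DAG object
)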

How to get dag status like running or success or failure

I want to know the status of a DAG, whether it is running, failed, or succeeded. I am triggering the DAG through the CLI command airflow trigger, and after the execution of the job I want to know the status of the run. I couldn't find any way.
I tried airflow dag_state but it is giving None. What should I do if there is more than one run in a day, to get the status of the latest run, either through a command line argument or through Python code?
You can use the list_dag_runs command with the CLI to list the DAG runs for a given DAG ID. The information returned includes the state of each run.
You can also retrieve the information via Python code in a few different ways. One such way that I've used in the past is the 'find' method in airflow.models.dagrun.DagRun.
An example with Python 3 of how to get the state of dag_runs via DagRun.find():
from airflow.models import DagRun

dag_id = 'fake_dag_id'
dag_runs = DagRun.find(dag_id=dag_id)
for dag_run in dag_runs:
    print(dag_run.state)
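If there are multiple runs in a day and you only care about the latest one, a small variation of the same idea (sorting on execution_date) could look like this:
from airflow.models import DagRun

dag_runs = DagRun.find(dag_id='fake_dag_id')
if dag_runs:
    # Sort by execution_date so the most recent run comes first
    latest_run = sorted(dag_runs, key=lambda run: run.execution_date, reverse=True)[0]
    print(latest_run.execution_date, latest_run.state)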
You can use the following CLI command:
airflow dag_state dag_id execution_date
Example:
airflow dag_state test_dag_id 2019-11-08T18:36:39.628099+00:00
In the above example:
test_dag_id is the actual DAG ID
2019-11-08T18:36:39.628099+00:00 is the execution date. You can get this from the Airflow UI for your run.
Another option is to use the Airflow REST API plugin. This is a better option: you can trigger a DAG and also check the status of the DAG.
https://github.com/teamclairvoyant/airflow-rest-api-plugin

Google App Engine, tasks in Task Queue are not executed automatically

My tasks are added to the Task Queue, but nothing is executed automatically. I need to click the "Run now" button to run the tasks; then they are executed without problem. Have I missed some configuration?
I use the default queue configuration, standard App Engine with Python 2.7.
from google.appengine.api import taskqueue

taskqueue.add(
    url='/inserturl',
    params={'name': 'tablename'})
This documentation is for the API you are now mentioning. The idea would be the same: you need to specify the parameter for when you want the task to be executed. In this case, you have different options, such as countdown or eta. Here is the specific documentation for the method you are using to add a task to the queue (taskqueue.add)
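For instance, a minimal sketch of the snippet above with a countdown (the 30-second value is arbitrary):
from google.appengine.api import taskqueue

taskqueue.add(
    url='/inserturl',
    params={'name': 'tablename'},
    countdown=30)  # execute the task roughly 30 seconds from now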
ORIGINAL ANSWER
If you follow this tutorial to create queues and tasks, you will see it is based on the following GitHub repo. If you go to the file where the tasks are created (create_app_engine_queue_task.py), that is where you should specify the time when the task must be executed. In this tutorial, to finally create the task, they use the following command:
python create_app_engine_queue_task.py --project=$PROJECT_ID --location=$LOCATION_ID --queue=$QUEUE_ID --payload=hello
However, it is missing the time when you want to execute it; it should look like this:
python create_app_engine_queue_task.py --project=$PROJECT_ID --location=$LOCATION_ID --queue=$QUEUE_ID --payload=hello --in_seconds=["countdown" for when the task will be executed, in seconds]
Basically, the key is in this part of the code in create_app_engine_queue_task.py:
if in_seconds is not None:
    # Convert "seconds from now" into an rfc3339 datetime string.
    d = datetime.datetime.utcnow() + datetime.timedelta(seconds=in_seconds)
    # Create Timestamp protobuf.
    timestamp = timestamp_pb2.Timestamp()
    timestamp.FromDatetime(d)
    # Add the timestamp to the tasks.
    task['schedule_time'] = timestamp
If you create the task now and go to your console, you will see your task execute and disappear from the queue after the number of seconds you specified.

Is there an api to send notifications based on job outputs?

I know there are APIs to configure notifications when a job fails or finishes.
But what if, say, I run a Hive query that counts the number of rows in a table, and if the returned result is zero I want to send out emails to the concerned parties? How can I do that?
Thanks.
You may want to look at Airflow and Qubole's operator for Airflow. We use Airflow to orchestrate all jobs being run using Qubole and, in some cases, non-Qubole environments. We use the DataDog API to report success/failure of each task (Qubole / non-Qubole). DataDog in this case can be replaced by Airflow's email operator. Airflow also has some chat operators (like Slack).
There is no direct API for triggering a notification based on the results of a query.
However, there is a way to do this using Qubole:
- Create a workflow in Qubole with the following steps:
1. Your query (any query), writing its output to a particular location on S3.
2. A shell script that reads the result from S3 and fails the job based on any criteria. For instance, in your case, fail the job if the result returns 0 rows (a rough sketch of such a check is shown below).
- Schedule this workflow using the "Scheduler" API to notify on failure.
You can also use the "Sendmail" shell command to send mail based on the results in step 2 above.
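A rough sketch of such a check in Python (the bucket, key, and result format are assumptions; the real script should match however your query writes its output):
import sys
import boto3

s3 = boto3.client('s3')
# Placeholder bucket/key where step 1 wrote the query result
body = s3.get_object(Bucket='your-bucket', Key='query-results/part-00000')['Body'].read()
rows = [line for line in body.decode('utf-8').splitlines() if line.strip()]

if not rows:
    # A non-zero exit fails the shell step, which triggers the scheduler's failure notification
    sys.exit(1)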