I have created two Dag's to check the email configuration for Airflow.
Basically I want to get an email alert whenever a job is failed.
I have also gone through the following links but unfortunately, I am not able to resolve the problem.
Link 1
Link 2
DAG One: ( Success Job )
from datetime import datetime
from datetime import timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
default_args = {
'owner': 'Airflow',
'depends_on_past': False,
'start_date': datetime(2015, 6, 1),
'email': ['firstnamelastname#company.com','firstnamelastname#company.com'],
'email_on_failure': True,
'email_on_retry': True,
'retries': 1,
'retry_delay': timedelta(seconds=5),
'email_on_success': True
# 'queue': 'bash_queue',
# 'pool': 'backfill',
# 'priority_weight': 10,
# 'end_date': datetime(2016, 1, 1),
}
def print_hello():
return 'Hello world!'
dag = DAG('success', description='Simple tutorial DAG',
schedule_interval='0 12 * * *',default_args=default_args,
start_date=datetime(2017, 3, 20), catchup=False)
dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)
dummy_operator >> hello_operator
DAG Two : ( Failed Job )
from datetime import datetime
from datetime import timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
default_args = {
'owner': 'Airflow',
'depends_on_past': False,
'start_date': datetime(2015, 6, 1),
'email': ['firstnamelastname#company.com','firstnamelastname#company.com'],
'email_on_failure': True,
'email_on_retry': True,
'retries': 1,
'retry_delay': timedelta(seconds=5),
'email_on_success': True
# 'queue': 'bash_queue',
# 'pool': 'backfill',
# 'priority_weight': 10,
# 'end_date': datetime(2016, 1, 1),
}
def print_hello():
xxxx
return 'Hello world!'
dag = DAG('success', description='Simple tutorial DAG',
schedule_interval='0 12 * * *',default_args=default_args,
start_date=datetime(2017, 3, 20), catchup=False)
dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)
dummy_operator >> hello_operator
I was expecting to get an email for both of the jobs. Since both of the Jobs contains configuration for email_on_success and email_on_failure
But I did not receive any email.
Please have a look at the Job Run Stats :
Here is my SMTP Configuration under airflow.cfg :
smtp_host = email-smtp.ap-south-1.amazonaws.com
smtp_starttls = True
smtp_ssl = False
# Uncomment and set the user/pass settings if you want to use SMTP AUTH
smtp_user = XXXXXXXXXXXXXXXXXXX
smtp_password = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
smtp_port = 587
smtp_mail_from = firstnamelastname#company.com
I have obtained the username and password from the Create My SMTP Credentials under the SES Service. I also have a verified email address. Security Group for my EC2 contains all outbound traffic for all protocol, all port and for destination 0.0.0.0/0
What else I am missing here?
Is it possible to configure/generate logs for the email sending process?
Related
I need to trigger a AWS step function state Machine whenever a file received in AWS s3 location using Airflow File Sensor operator.
Im trying this but its not working.
from airflow.models import DAG
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import S3KeySensor
import boto3
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2022, 2, 22),
'email': ['nic#enye.tech'],
'email_on_failure': False,
'max_active_runs': 1,
'email_on_retry': False,
'retry_delay': timedelta(minutes=5)
}
dg = DAG('cloudwalker_s3_sensor',
schedule_inte`your text`rval='#daily',`your text`
default_args=default_args,
catchup=False
)
s3_buckname = 'demo1-s3-sensor'
s3_locat = 'demo/testfile.txt'
state_machine_arn = 'arn:......'
s3_sensor = S3KeySensor(
task_id='s3_file_check',
poke_interval=60,
timeout=180,
soft_fail=False,
retries=2,
bucket_key=s3_locat,
bucket_name=s3_buckname,
aws_conn_id='customer_demo',
dag=dg)
def processing_func(**kwargs):
print("Reading the file")
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=s3_buckname, Key=s3_locat)
lin = obj['Body'].read().decode("utf-8")
print(lin)
start_execution = StepFunctionStartExecutionOperator(task_id='start_execution', state_machine_arn=state_machine_arn)
s3_sensor >> func_task
pasted in the question
currently, am executing my spark-submit commands in airflow by SSH using BashOperator & BashCommand but our client is not allowing us to do SSH into the cluster, is that possible to execute the Spark-submit command without SSH into cluster from airflow?
You can use DataprocSubmitJobOperator to submit jobs in Airflow. Just make sure to pass correct parameters to the operator. Take note that the job parameter is a dictionary based from Dataproc Job. So you can use this operator to submit different jobs like pyspark, pig, hive, etc.
The code below submits a pyspark job:
import datetime
from airflow import models
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
PROJECT_ID = "my-project"
CLUSTER_NAME = "airflow-cluster" # name of created dataproc cluster
PYSPARK_URI = "gs://dataproc-examples/pyspark/hello-world/hello-world.py" # public sample script
REGION = "us-central1"
PYSPARK_JOB = {
"reference": {"project_id": PROJECT_ID},
"placement": {"cluster_name": CLUSTER_NAME},
"pyspark_job": {"main_python_file_uri": PYSPARK_URI},
}
default_args = {
'owner': 'Composer Example',
'depends_on_past': False,
'email': [''],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': datetime.timedelta(minutes=5),
'start_date': YESTERDAY,
}
with models.DAG(
'submit_dataproc_spark',
catchup=False,
default_args=default_args,
schedule_interval=datetime.timedelta(days=1)) as dag:
submit_dataproc_job = DataprocSubmitJobOperator(
task_id="pyspark_task", job=PYSPARK_JOB, region=REGION, project_id=PROJECT_ID
)
submit_dataproc_job
Airflow run:
Airflow logs:
Dataproc job:
Have a dag like this:
import os
from datetime import timedelta
from xxx import on_failure_opsgenie
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
DAG_ID = os.path.basename(__file__).replace(".py", "")
DEFAULT_ARGS = {
"owner": "airflow",
"depends_on_past": False,
"email": ["airflow#example.com"],
"email_on_failure": False,
"email_on_retry": False,
}
def kaboom(*args, **kwargs):
print("goodbye cruel world")
print(args)
print(kwargs)
assert 1 == 2
with DAG(
dag_id=DAG_ID,
default_args=DEFAULT_ARGS,
description="Print contents of airflow.cfg to logs",
dagrun_timeout=timedelta(hours=2),
start_date=days_ago(1),
schedule_interval=None,
on_failure_callback=on_failure_opsgenie,
) as dag:
get_airflow_cfg_operator = PythonOperator(task_id="gonna_explode", python_callable=kaboom)
The DAG fails as expected, purposefully. However, on_failure_opsgenie is not doing what it should; how do I get the logs or debug a failed on-failure-callback in AWS MWAA?
I'm trying to upload a file from my local machine to GCS and I'm using the LocalFilesystemToGCSOperator. I'm following this howto https://airflow.readthedocs.io/en/latest/howto/operator/google/transfer/local_to_gcs.html#prerequisite-tasks. I've set up a connection to GCP with a path to a json file. This is the DAG code:
import os
from airflow import models
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from airflow.utils import dates
BUCKET_NAME = 'bucket-name'
PATH_TO_UPLOAD_FILE = '...path-to/airflow/dags/example-text.txt'
DESTINATION_FILE_LOCATION = '/test-dir-input/example-text.txt'
with models.DAG(
'example_local_to_gcs',
default_args=dict(start_date=dates.days_ago(1)),
schedule_interval=None,
) as dag:
upload_file = LocalFilesystemToGCSOperator(
gcp_conn_id='custom_gcp_connection',
task_id="upload_file",
src=PATH_TO_UPLOAD_FILE,
dst=DESTINATION_FILE_LOCATION,
bucket=BUCKET_NAME,
mime_type='text/plain'
)
When I trigger the DAG it is marked as a success but the file is not in the bucket
It looks like there's a problem with your path_to_upload and destination_file_location.
To give you an idea, here's a separate post that could also help you. The relevant parameters similar to yours were declared like this for example:
src='/Users/john/Documents/tmp',
dst='gs://constantine-bucket',
bucket='constantine-bucket',
You should remove the ... and make sure that the destination_file_location refers to your bucket name or the folder inside it like this:
BUCKET_NAME = 'bucket-name'
PATH_TO_UPLOAD_FILE = '/path-to/airflow/dags/example-text.txt'
DESTINATION_FILE_LOCATION = 'gs://bucket-name/example-text.txt'
# Or in a folder on your bucket
# DESTINATION_FILE_LOCATION = 'gs://bucket-name/folder/example-text.txt'
The following code did the trick for me.
Please note that the service account used must have storage.objects permissions in the destination-bucket to write the file.
import os
import datetime
from pathlib import Path
from airflow import DAG
from airflow.configuration import conf
from airflow.operators.dummy import DummyOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
comp_home_path = Path(conf.get("core", "dags_folder")).parent.absolute()
comp_bucket_path = "data/uploaded" # <- if your file is within a folder
comp_local_path = os.path.join(comp_home_path, comp_bucket_path)
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime.datetime.today(),
'end_date': None,
'email': ['somename#somecompany.com'],
'email_on_failure': True,
'email_on_retry': True,
'retries': 1,
'retry_delay': datetime.timedelta(minutes=1)
}
sch_interval = None
dag = DAG(
'mv_local_to_GCS',
default_args=default_args,
tags=["example"],
catchup=False,
schedule_interval=sch_interval
)
mv_local_gcs = LocalFilesystemToGCSOperator(
task_id="local_to_gcs",
src=comp_local_path+"/yourfilename.csv",# PATH_TO_UPLOAD_FILE
dst="somefolder/yournewfilename.csv",# BUCKET_FILE_LOCATION
bucket="yourproject",#using NO 'gs://' nor '/' at the end, only the project, folders, if any, in dst
dag=dag
)
start = DummyOperator(task_id='Starting', dag=dag)
start >> mv_local_gcs
I want to receive an email notification for task success, failure and retry in GCP composer using Sendgrid.
Currently, all the tasks in my DAG are running successfully. I want to receive notification in that case.
Also when certain tasks are failing or retrying, I want to get those notifications as well. I did the following steps and didn't receive any notification when I forced a task to fail.
Created GCP Composer environment, added environment variables.
SENDGRID_MAIL_FROM : abc#gmail.com
SENDGRID_API_KEY :
Created following DAG.
import json
from datetime import timedelta, datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator
from airflow.operators.email_operator import EmailOperator
default_args = {
'owner': 'airflow',
'depends_on_past': True,
'start_date': datetime(2020, 3, 30),
'email': ['abc#gmail.com'],
'email_on_failure': True,
'email_on_retry': True,
'retries': 2,
'retry_delay': timedelta(minutes=5),
}
schedule_interval = "05 23 * * *"
dag = DAG(
'DAG_NAME',
default_args=default_args,
schedule_interval=schedule_interval
)
# Config variables
BQ_CONN_ID = ""
BQ_PROJECT = ""
BQ_DATASET = ""
## Task 1
t1 = BigQueryCheckOperator(----)
## Task 2
t2 = BigQueryCheckOperator(----)
## Task 3
t3 = BigQueryOperator(----)
t4 = EmailOperator(
task_id='send_email',
to='abc#gmail.com',
subject='Airflow Alert',
html_content=""" <h3>Email Test</h3> """,
dag=dag
)
# Setting up Dependencies
t1>>t2>>t3>>t4
Am I missing anything? Please tell me what needs to be done, thanks.
Firstly, you need to check which versions of Composer and Sendgrid you are using.
For instance, the latest version of Sendgrid supported on airflow-1.10.3 is v5.6.0. You can refer to the the airflow's setup.py for what dependencies are installed for a specific airflow version.
I recommend you to check the instructions for setting up Sendgrid with Cloud Composer once again. Make sure of a few things:
You set up the environment variables as the guide says, to configure Sendgrid as your email server, you need to obtain your SENDGRID_API_KEY (have you generate it with right permission? At a minimum, the key must have Mail send permissions to send email) and SENDGRID_MAIL_FROM(is the structure correct? noreply-composer#) as environment variables.
In your airflow.cfg file, check if email_backend variable is set to use Sendgrid:
email_backend = airflow.contrib.utils.sendgrid.send_email
Try sending a test DAG, as the guide says, for example you can use this:
from airflow import DAG
from airflow.operators.email_operator import EmailOperator
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago
default_args = {
'owner': 'name.surname',
'start_date': days_ago(1),
'email_on_failure': True,
'email': ['name.surname#company.com'],
}
dag = DAG(
'mail-test',
schedule_interval='#once',
default_args=default_args,
)
send_mail = EmailOperator(
task_id='sendmail',
to='name.surname#company.com',
subject='TEST Mail from Cloud Composer',
html_content='Mail Contents',
dag=dag,
)
failed_bash = BashOperator(
task_id='run_bash',
bash_command='exit 1',
dag=dag,
)
send_mail >> failed_bash
Additionally, please check the spam filter in your email client. If that continues to fail, I'd then start suspecting a firewall rule (if you have edited them) may be causing the issue.
Let me know about the results. I hope it helps.