Currently, I am executing my spark-submit commands from Airflow by SSH-ing into the cluster with a BashOperator, but our client no longer allows SSH into the cluster. Is it possible to execute the spark-submit command from Airflow without SSH-ing into the cluster?
You can use DataprocSubmitJobOperator to submit jobs from Airflow. Just make sure to pass the correct parameters to the operator. Note that the job parameter is a dictionary based on the Dataproc Job resource, so you can use this operator to submit different job types such as PySpark, Pig, Hive, etc.
The code below submits a PySpark job:
import datetime
from airflow import models
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
PROJECT_ID = "my-project"
CLUSTER_NAME = "airflow-cluster" # name of created dataproc cluster
PYSPARK_URI = "gs://dataproc-examples/pyspark/hello-world/hello-world.py" # public sample script
REGION = "us-central1"
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": PYSPARK_URI},
}
default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}
with models.DAG(
        'submit_dataproc_spark',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:
    submit_dataproc_job = DataprocSubmitJobOperator(
        task_id="pyspark_task", job=PYSPARK_JOB, region=REGION, project_id=PROJECT_ID
    )
    submit_dataproc_job
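The same operator handles other job types by swapping the key inside the job dictionary. As a sketch (the query below is a placeholder, not from the original answer), a Hive job would look like:

```python
PROJECT_ID = "my-project"
CLUSTER_NAME = "airflow-cluster"

# Only the job dictionary changes; DataprocSubmitJobOperator is used the same way.
# "hive_job" replaces "pyspark_job", with an inline query list instead of a file URI.
HIVE_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "hive_job": {"query_list": {"queries": ["SHOW DATABASES;"]}},
}
```

Pass this dictionary as `job=HIVE_JOB` to the operator exactly as in the PySpark example.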
Screenshots of the Airflow run, the task logs, and the resulting Dataproc job were attached to the original answer.
I need to trigger an AWS Step Functions state machine whenever a file arrives at an AWS S3 location, using Airflow's file sensor operator.
I'm trying this, but it's not working.
from airflow.models import DAG
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import S3KeySensor
# StepFunctionStartExecutionOperator is used below but was never imported;
# in recent Amazon provider versions it lives here:
from airflow.providers.amazon.aws.operators.step_function import StepFunctionStartExecutionOperator
import boto3
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2022, 2, 22),
    'email': ['nic@enye.tech'],
    'email_on_failure': False,
    'max_active_runs': 1,
    'email_on_retry': False,
    'retry_delay': timedelta(minutes=5)
}
dg = DAG('cloudwalker_s3_sensor',
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False
         )
s3_buckname = 'demo1-s3-sensor'
s3_locat = 'demo/testfile.txt'
state_machine_arn = 'arn:......'
s3_sensor = S3KeySensor(
    task_id='s3_file_check',
    poke_interval=60,
    timeout=180,
    soft_fail=False,
    retries=2,
    bucket_key=s3_locat,
    bucket_name=s3_buckname,
    aws_conn_id='customer_demo',
    dag=dg)
def processing_func(**kwargs):
    print("Reading the file")
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket=s3_buckname, Key=s3_locat)
    lin = obj['Body'].read().decode("utf-8")
    print(lin)
# The original snippet referenced func_task without defining it; it needs a
# PythonOperator wrapping processing_func, and the Step Functions operator
# needs dag=dg and a place in the dependency chain:
func_task = PythonOperator(
    task_id='process_file',
    python_callable=processing_func,
    dag=dg)
start_execution = StepFunctionStartExecutionOperator(
    task_id='start_execution',
    state_machine_arn=state_machine_arn,
    dag=dg)
s3_sensor >> func_task >> start_execution
I want to receive an email notification for task success, failure and retry in GCP composer using Sendgrid.
Currently, all the tasks in my DAG are running successfully. I want to receive notification in that case.
Also when certain tasks are failing or retrying, I want to get those notifications as well. I did the following steps and didn't receive any notification when I forced a task to fail.
Created GCP Composer environment, added environment variables.
SENDGRID_MAIL_FROM : abc@gmail.com
SENDGRID_API_KEY :
Created following DAG.
import json
from datetime import timedelta, datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator
from airflow.operators.email_operator import EmailOperator
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2020, 3, 30),
    'email': ['abc@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}
schedule_interval = "05 23 * * *"
dag = DAG(
    'DAG_NAME',
    default_args=default_args,
    schedule_interval=schedule_interval
)
# Config variables
BQ_CONN_ID = ""
BQ_PROJECT = ""
BQ_DATASET = ""
## Task 1
t1 = BigQueryCheckOperator(----)
## Task 2
t2 = BigQueryCheckOperator(----)
## Task 3
t3 = BigQueryOperator(----)
t4 = EmailOperator(
    task_id='send_email',
    to='abc@gmail.com',
    subject='Airflow Alert',
    html_content=""" <h3>Email Test</h3> """,
    dag=dag
)
# Setting up Dependencies
t1 >> t2 >> t3 >> t4
Am I missing anything? Please tell me what needs to be done, thanks.
Firstly, you need to check which versions of Composer and Sendgrid you are using.
For instance, the latest Sendgrid version supported on airflow-1.10.3 is v5.6.0. You can refer to airflow's setup.py to see which dependency versions are installed for a specific airflow version.
I recommend you to check the instructions for setting up Sendgrid with Cloud Composer once again. Make sure of a few things:
You set up the environment variables as the guide says. To configure Sendgrid as your email server, you need to set SENDGRID_API_KEY (did you generate it with the right permissions? At a minimum, the key must have Mail Send permission) and SENDGRID_MAIL_FROM (is the structure correct, e.g. a noreply-composer@ address?) as environment variables.
In your airflow.cfg file, check if email_backend variable is set to use Sendgrid:
email_backend = airflow.contrib.utils.sendgrid.send_email
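On Cloud Composer you don't edit airflow.cfg by hand; the same override can be applied through gcloud, as a sketch (ENVIRONMENT_NAME and LOCATION are placeholders, not values from the original post):

```shell
# Override the [email] email_backend setting on a Composer environment.
# ENVIRONMENT_NAME and LOCATION are placeholders.
gcloud composer environments update ENVIRONMENT_NAME \
    --location LOCATION \
    --update-airflow-configs=email-email_backend=airflow.contrib.utils.sendgrid.send_email
```

The key uses the section-property format (section name, a hyphen, then the property name).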
Try sending a test DAG, as the guide suggests; for example, you can use this:
from airflow import DAG
from airflow.operators.email_operator import EmailOperator
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago
default_args = {
    'owner': 'name.surname',
    'start_date': days_ago(1),
    'email_on_failure': True,
    'email': ['name.surname@company.com'],
}
dag = DAG(
    'mail-test',
    schedule_interval='@once',
    default_args=default_args,
)
send_mail = EmailOperator(
    task_id='sendmail',
    to='name.surname@company.com',
    subject='TEST Mail from Cloud Composer',
    html_content='Mail Contents',
    dag=dag,
)
failed_bash = BashOperator(
    task_id='run_bash',
    bash_command='exit 1',
    dag=dag,
)
send_mail >> failed_bash
Additionally, please check the spam filter in your email client. If that continues to fail, I'd then start suspecting a firewall rule (if you have edited them) may be causing the issue.
Let me know about the results. I hope it helps.
I have created two DAGs to check the email configuration for Airflow.
Basically, I want to get an email alert whenever a job fails.
I have also gone through the following links, but unfortunately I am not able to resolve the problem.
Link 1
Link 2
DAG One: ( Success Job )
from datetime import datetime
from datetime import timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['firstnamelastname@company.com', 'firstnamelastname@company.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(seconds=5),
    'email_on_success': True
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
def print_hello():
    return 'Hello world!'
dag = DAG('success', description='Simple tutorial DAG',
          schedule_interval='0 12 * * *', default_args=default_args,
          start_date=datetime(2017, 3, 20), catchup=False)
dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)
dummy_operator >> hello_operator
DAG Two : ( Failed Job )
from datetime import datetime
from datetime import timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['firstnamelastname@company.com', 'firstnamelastname@company.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(seconds=5),
    'email_on_success': True
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
def print_hello():
    xxxx  # undefined name, raises NameError so that the task fails
    return 'Hello world!'
dag = DAG('success', description='Simple tutorial DAG',
          schedule_interval='0 12 * * *', default_args=default_args,
          start_date=datetime(2017, 3, 20), catchup=False)
dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)
dummy_operator >> hello_operator
I was expecting to get an email for both of the jobs, since both contain configuration for email_on_success and email_on_failure, but I did not receive any email.
Please have a look at the job run stats:
Here is my SMTP Configuration under airflow.cfg :
smtp_host = email-smtp.ap-south-1.amazonaws.com
smtp_starttls = True
smtp_ssl = False
# Uncomment and set the user/pass settings if you want to use SMTP AUTH
smtp_user = XXXXXXXXXXXXXXXXXXX
smtp_password = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
smtp_port = 587
smtp_mail_from = firstnamelastname@company.com
I obtained the username and password from Create My SMTP Credentials under the SES service. I also have a verified email address, and the security group for my EC2 instance allows all outbound traffic (all protocols, all ports, destination 0.0.0.0/0).
What else am I missing here?
Is it possible to configure/generate logs for the email sending process?
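One way to get logs for the sending process is to exercise the same SMTP settings by hand with Python's smtplib, whose debug mode prints the full SMTP conversation. This is a sketch, not from the original post; the addresses and credentials are placeholders:

```python
import smtplib
from email.mime.text import MIMEText


def build_test_message(sender, recipient):
    # Build a minimal plain-text message mirroring what Airflow would send.
    msg = MIMEText("Airflow SMTP connectivity test")
    msg["Subject"] = "Airflow SMTP test"
    msg["From"] = sender
    msg["To"] = recipient
    return msg


def send_test_mail(host, port, user, password, msg):
    # set_debuglevel(1) prints every SMTP command and response to stderr,
    # which is exactly the "log of the email sending process" asked about.
    with smtplib.SMTP(host, port) as server:
        server.set_debuglevel(1)
        server.starttls()
        server.login(user, password)
        server.send_message(msg)


msg = build_test_message("firstnamelastname@company.com",
                         "firstnamelastname@company.com")
# Uncomment with real credentials to run against the SES endpoint:
# send_test_mail("email-smtp.ap-south-1.amazonaws.com", 587, SMTP_USER, SMTP_PASSWORD, msg)
```

If the hand-rolled send fails the same way, the problem is in the SMTP credentials or network path rather than in Airflow's configuration.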
Below is my simple DAG/Python script that sits inside the DAGs folder on the Google Cloud bucket.
from airflow import DAG
import airflow
from datetime import datetime, timedelta, date
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from generate_csv_feeds import generate_csv
DEFAULT_DAG_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.utcnow(),
    'email_on_failure': False,
    'schedule_interval': '*/5 * * * *'
}
with DAG('DAG_MAIN', default_args=DEFAULT_DAG_ARGS, catchup=False) as dag:
    generate_csv = PythonOperator(
        task_id='generate_mktg_csv',
        python_callable=generate_csv,
        op_args=['get_data.sql', 'feeds_data_airflow.csv']
    )
    csv_generated = BashOperator(
        task_id='csv_generated',
        bash_command='echo CSV Generated Successfully.')
    generate_csv >> csv_generated
The issue is that it does not get triggered automatically at all, nor does it execute if I trigger it externally via the command line. Strangely, it works when I run it from the Airflow UI. I need this to run every 5 minutes. I am not sure if this has anything to do with Google Composer. Any help would be appreciated. Thanks in advance.
I think this is due to your start_date being datetime.utcnow(). Using a moving start_date, especially datetime.utcnow(), is not recommended: a DAG is triggered at start_date + schedule_interval, and since the start_date keeps moving, the DAG is never triggered. See the FAQ: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date.
Try a fixed start_date such as datetime(2019, 8, 4).
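The scheduling rule the FAQ describes can be sketched with plain datetime arithmetic, no Airflow needed:

```python
from datetime import datetime, timedelta

interval = timedelta(minutes=5)

# With a fixed start_date, the first run fires once
# now >= start_date + interval, a condition that is eventually met.
fixed_start = datetime(2019, 8, 4)
print(datetime.utcnow() >= fixed_start + interval)   # True: long past

# With a moving start_date, the DAG file is re-parsed periodically and
# start_date is re-evaluated to "now", so the condition never holds.
moving_start = datetime.utcnow()
print(datetime.utcnow() >= moving_start + interval)  # False, every time
```

The second comparison can never become true, which is why the DAG never triggers on its own.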
For some reason I can't deploy DAG files on Google Composer if I import google.cloud.storage in the DAG. If I try to deploy such a DAG file, it doesn't get added to the DagBag, so it ends up as a non-link entry on the Airflow website and is not usable. At that point there's the usual information icon saying: "This DAG isn't available in the web server's DagBag object. It shows up in this list because the scheduler marked it as active in the metadata database." Unlike an actual syntax error, there is no error message at the top of the page.
I have boiled this down precisely to whether I import google.cloud.storage or not, not even whether I actually use it. For example, this DAG works fine if I comment out the storage import line, and fails to deploy on Composer if I put it back. Would anyone have any clue as to why?
import datetime
from airflow import DAG
from google.cloud import storage
from airflow.operators.python_operator import PythonOperator
default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'email': ['kevin@here.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': datetime.datetime(2017, 1, 1),
}
def ingest_file(**kwargs):
    status = 'OK'
    return status

# Not scheduled, trigger only
dag = DAG('ingest_test', default_args=default_args, schedule_interval=None)
ingest = PythonOperator(task_id='ingest', provide_context=True,
                        python_callable=ingest_file, dag=dag)
If you require PyPI packages in your DAG or custom operators, you don't get an error message; the DAG just doesn't deploy. If this is happening to you, make sure all the packages you need are installed in the Composer environment.
Note that the behaviour of the DAG appearing and then disappearing is still there, but it does settle after a short while.
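As a sketch of the fix (ENVIRONMENT_NAME and LOCATION are placeholders, and the version pin is omitted), the missing package can be added to the Composer environment with gcloud:

```shell
# Install google-cloud-storage into the Composer environment so DAGs
# importing it can be parsed. ENVIRONMENT_NAME and LOCATION are placeholders.
gcloud composer environments update ENVIRONMENT_NAME \
    --location LOCATION \
    --update-pypi-package google-cloud-storage
```

The update triggers an environment rebuild, so expect it to take a while before the DAG parses cleanly.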