I'm using the GCP Composer API (Airflow), and my DAG to scale up the number of workers keeps returning the error below:
Broken DAG: [/home/airflow/gcs/dags/cluster_scale_workers.py] 'module' object has no attribute 'DataProcClusterScaleOperator'
It seems to be something related to the ScaleOperator; however, when I check the Airflow Read the Docs documentation against my code, nothing seems wrong. What am I missing?
Is it related to the GCP Composer Airflow version?
Code:
import datetime
import os
from airflow import models
from airflow.contrib.operators import dataproc_operator
from airflow.utils import trigger_rule
yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    'start_date': yesterday,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': models.Variable.get('gcp_project'),
    'cluster_name': 'hive-cluster'
}

with models.DAG(
        'scale_workers',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    scale_to_6_workers = dataproc_operator.DataprocClusterScaleOperator(
        task_id='scale_dataproc_cluster_6',
        cluster_name='hive-cluster',
        num_workers=6,
        num_preemptible_workers=3,
        dag=dag
    )
I managed to find the issue and sort it out. The comment provided by Ashish Kumar above is correct.
The problem was that the Airflow version I was using (1.9.0) does not support DataprocClusterScaleOperator. I created another Composer environment with the BETA features enabled and chose Airflow version 1.10.0, which fixed my issue.
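If you want to confirm the version mismatch yourself, a minimal sanity check (assuming the 1.10-era contrib module layout used in the DAG above) is:

from airflow.contrib.operators import dataproc_operator

# On Airflow >= 1.10.0 this prints True; on 1.9.0 the attribute is missing,
# which is exactly the "'module' object has no attribute" error above.
print(hasattr(dataproc_operator, 'DataprocClusterScaleOperator'))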
I'm trying to pick a JSON file from a Cloud Storage bucket and dump it into BigQuery using Apache Airflow; however, I'm getting the following error:
Access Denied: BigQuery BigQuery: Missing required OAuth scope. Need BigQuery or Cloud Platform write scope.
This is my code:
from datetime import timedelta, datetime
import json
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    #'start_date': seven_days_ago,
    'start_date': datetime(2022, 11, 1),
    'email': ['uzair.zafar@gmail.pk'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=5),
}

with DAG('checking_airflow',
         default_args=default_args,
         description='dag to start the logging of data in logging table',
         schedule_interval='@daily',
         start_date=datetime(2022, 11, 1),
         ) as dag:

    dump_csv_to_temp_table = GCSToBigQueryOperator(
        task_id='gcs_to_bq_load',
        google_cloud_storage_conn_id='gcp_connection',
        bucket='airflow-dags',
        #filename='users/users.csv',
        source_objects='users/users0.json',
        #schema_object='schemas/users.json',
        source_format='NEWLINE_DELIMITED_JSON',
        create_disposition='CREATE_IF_NEEDED',
        destination_project_dataset_table='project.supply_chain.temporary_users',
        write_disposition='WRITE_TRUNCATE',
        dag=dag,
    )

    dump_csv_to_temp_table
Please assist me in solving this issue.
Could you please share more details? Are you using Cloud Composer to perform this task?
If the environment is running, there should be a default Google connection that you can use. Go to Airflow UI >> Admin >> Connections and you should see google_cloud_default there.
In Composer, if you don't specify the connection, it will use the default one for interacting with Google Cloud resources.
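As a minimal sketch, pointing the operator at the default connection could look like the snippet below. The parameter name is an assumption that depends on your google provider version: recent releases use gcp_conn_id, while older ones split it into google_cloud_storage_conn_id and bigquery_conn_id.

from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Sketch: rely on Composer's default connection, which already carries the
# cloud-platform scope needed for BigQuery writes. gcp_conn_id is assumed here;
# omitting the argument entirely also falls back to google_cloud_default.
dump_csv_to_temp_table = GCSToBigQueryOperator(
    task_id='gcs_to_bq_load',
    gcp_conn_id='google_cloud_default',
    bucket='airflow-dags',
    source_objects=['users/users0.json'],
    source_format='NEWLINE_DELIMITED_JSON',
    create_disposition='CREATE_IF_NEEDED',
    destination_project_dataset_table='project.supply_chain.temporary_users',
    write_disposition='WRITE_TRUNCATE',
)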
I am new to Airflow, but I've defined a DAG to send a basic email every day at 9 AM. My DAG is the following one:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator
from airflow.operators.email_operator import EmailOperator
from airflow.utils.dates import days_ago
date_log = str(datetime.today())
my_email = ''
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(0),
    'email': ['my_email'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'concurrency': 1,
    'max_active_runs': 1
}

with DAG('TEST', default_args=default_args, schedule_interval='0 9 * * *',
         max_active_runs=1, catchup=False) as dag:

    t_teste = EmailOperator(dag=dag, task_id='successful_notification',
                            to='my_email',
                            subject='Airflow Dag ' + date_log,
                            html_content="""""")

    t_teste
I have all the configuration I need, and the webserver and scheduler are running. Also, my DAG is active in the UI. My problem is that the DAG seems to do nothing: it hasn't run for two days, and even when the scheduled time passes, it doesn't run as expected. I have already triggered it manually, and it ran successfully. But if I wait for the scheduled time, nothing happens.
Do you know what I am doing wrong?
Thanks!
Your DAG will never be scheduled. The Airflow scheduler calculates start_date + schedule_interval and schedules the DAG at the END of the interval.
>>> import airflow
>>> from airflow.utils.dates import days_ago
>>> print(days_ago(0))
2021-06-26 00:00:00+00:00
Calculating 2021-06-26 (today) + schedule_interval means the DAG would run on 2021-06-27 09:00; however, when we reach 2021-06-27, the calculation produces 2021-06-28 09:00, and so on, so the DAG never actually runs.
The conclusion is: never use dynamic values in start_date!
To solve your issue simply change:
'start_date': days_ago(0) to some static value like 'start_date': datetime(2021, 6, 25).
Note that if you are running older versions of Airflow, you might also need to change the dag_id.
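For example, a minimal sketch of the corrected arguments (the exact date is only an illustration):

from datetime import datetime, timedelta

# Sketch: a static start_date. With schedule_interval='0 9 * * *',
# the first run is triggered at the END of the first full interval after start_date.
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2021, 6, 25),   # static value instead of days_ago(0)
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}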
I'm new to GCP. I have a sample Python script created in a GCP environment which runs fine. I want to schedule it in Airflow. I copied the file into the DAG folder of the environment (gs://us-west2-*******-6f9ce4ef-bucket/dags), but it's not showing up in the Airflow DAG list.
This is the location in the Airflow config:
dags_folder = /home/airflow/gcs/dags
Please let me know how to get my Python code to show up in Airflow. Do I have to set up anything else? I kept all the defaults.
Thanks in advance.
What you did is already correct: you placed your Python script in gs://auto-generated-bucket/dags/. I'm not sure whether you used the airflow library in your script, but that library is what lets you define and configure the behavior of your DAG so it shows up in Airflow. You can see an example in the Cloud Composer quickstart.
You can check an in-depth tutorial of DAGs here.
Sample DAG (test_dag.py) that prints the dag_run.id:
# test_dag.py #
import datetime
import airflow
from airflow.operators import bash_operator
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}
with airflow.DAG(
        'this_is_the_test_dag',  ## <-- This string will be displayed in the AIRFLOW web interface as the DAG name ##
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    # Print the dag_run id from the Airflow logs
    print_dag_run_conf = bash_operator.BashOperator(
        task_id='print_dag_run_conf', bash_command='echo {{ dag_run.id }}')
(Screenshots in the original post: the test_dag.py file in the gs://auto-generated-bucket/dags/ GCS location, and the DAG listed in the Airflow web server.)
Below is my DAG code; it is not getting triggered at the scheduled time. If I run it manually, it works fine.
I'm not sure where the problem is. I tried two or three cron expressions, but without any luck.
import datetime

import airflow.utils.dates
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
#from datetime import timedelta

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5)
}

dag = DAG(
    'airflow_worker_pod_monitoring',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval='0,00 15 * * *',
    start_date=YESTERDAY,
    dagrun_timeout=datetime.timedelta(minutes=60))

# priority_weight has type int in Airflow DB, uses the maximum.
t1 = BashOperator(
    task_id='monitor_pod',
    bash_command='bash /home/airflow/gcs/data/testscript.sh',
    dag=dag,
    depends_on_past=False,
    priority_weight=2**31-1)
How can I make this work?
There are 4 potential causes of this:
1. schedule_interval provided in default_args
2. Existing DAG's schedule_interval was modified
3. start_date is set to datetime.now()
4. start_date not aligned with schedule_interval
Your issue could be caused by 2. or 4. above since 1. & 3. are properly configured.
The Airflow documentation recommends using a static start_date: https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date. Using a dynamic date (such as airflow.utils.dates.days_ago(0)) as start_date is not advisable and may cause issues, as the DAG gets confused at 00:00 and switches to the next day incorrectly. Set your start_date to a fixed datetime and it should work correctly.
In general, the way Airflow scheduling works is not always straightforward. I recommend this article for going deeper into the topic: https://towardsdatascience.com/airflow-schedule-interval-101-bbdda31cc463
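For instance, a minimal sketch of the fix, keeping the rest of the DAG above unchanged (the fixed date and the cron expression are only illustrations):

import datetime
from airflow import DAG

# Sketch: replace the dynamic days_ago(0)/datetime.now() values with a fixed
# start_date so the scheduler can compute stable intervals.
default_args = {
    'start_date': datetime.datetime(2021, 1, 1),   # fixed date instead of days_ago(0)
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5)
}

dag = DAG(
    'airflow_worker_pod_monitoring',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval='0 15 * * *',                # every day at 15:00 UTC
    dagrun_timeout=datetime.timedelta(minutes=60))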
Similar to previous questions, but none of the answers given worked. I have a DAG:
import datetime
import os
from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow.operators import BashOperator
PROJECT = os.environ['PROJECT']
GCS_BUCKET = os.environ['BUCKET']
API_KEY = os.environ['API_KEY']
default_args = {
    'owner': 'me',
    'start_date': datetime.datetime(2019, 7, 30),
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': datetime.timedelta(hours=1),
    'catchup': False
}

dag = DAG('dag-name',
          schedule_interval=datetime.timedelta(hours=2),
          default_args=default_args,
          max_active_runs=1,
          concurrency=1,
          catchup=False)

DEFAULT_OPTIONS_TEMPLATE = {
    'project': PROJECT,
    'stagingLocation': 'gs://{}/staging'.format(GCS_BUCKET),
    'tempLocation': 'gs://{}/temp'.format(GCS_BUCKET)
}
def backup_bt_to_bq(template_location, name):
    run_time = datetime.datetime.utcnow()
    a_value = run_time.strftime('%Y%m%d%H')

    t1 = DataflowTemplateOperator(
        task_id='{}-task'.format(name),
        template=template_location,
        parameters={'an_argument': a_value},
        dataflow_default_options=DEFAULT_OPTIONS_TEMPLATE,
        poll_sleep=30
    )

    t2 = BashOperator(
        task_id='{}-loader-heartbeat'.format(name),
        bash_command='curl --fail -XGET "[a heartbeat URL]" --header "Authorization: heartbeat_service {1}"'.format(name, API_KEY)
    )

    t1 >> t2

with dag:
    backup_bt_to_bq('gs://[path to gcs]'.format(GCS_BUCKET), 'name')
As you can see, I'm trying very hard to prevent Airflow from backfilling. Yet, when I deploy the DAG (late in the day, on 7/30/2019), it just keeps running the DAG over and over, one run after another.
Since this task moves a fair amount of data around, this is not desirable. How do I get Airflow to respect the "run this every two hours" schedule_interval?
As you can see, I've set catchup: False in both the DAG args AND the default args (just in case; I started with it only in the DAG args). The retry delay is also a long period.
Each DAG run is reported as a success.
I'm running with the following version:
composer-1.5.0-airflow-1.10.1
My next step is a Kubernetes cron job...
I suspect you did not have catchup=False when you first created the DAG. I think Airflow may not recognize changes to the catchup parameter after the initial DAG creation.
Try renaming it and see what happens, e.g. add a v2 to the dag_id and enable it. After enabling it, it will run once even though catchup is False, because there is a valid completed interval (i.e. the current time is >= start_date + schedule_interval), but that is all.
Of course, test with a fake operator that doesn't do anything expensive.
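As a minimal sketch, reusing the imports and default_args from the DAG in the question (the v2 name and the placeholder operator are only illustrations):

from airflow.operators.dummy_operator import DummyOperator

# Sketch: a renamed dag_id so catchup=False is registered from the moment the DAG
# is first created, with a cheap placeholder task for testing instead of the
# Dataflow/heartbeat tasks.
dag = DAG('dag-name-v2',                     # new dag_id, e.g. with a v2 suffix
          schedule_interval=datetime.timedelta(hours=2),
          default_args=default_args,
          max_active_runs=1,
          concurrency=1,
          catchup=False)                     # set before the DAG ever runs

with dag:
    DummyOperator(task_id='noop')            # fake, inexpensive task for the test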