Dataflow job not triggering on cloud from Composer(Airflow) - google-cloud-platform

I am trying to execute an Apache Beam pipeline from Composer, but the job is never triggered on GCP.
Job log (parameterized so as not to reveal company-specific details):
INFO - Running command: java -jar /tmp/dataflow40103bb6-GcsToBqDataIngestion.jar --runner=DataflowRunner --project=<project_id> --zone=northamerica-northeast1-a --stagingLocation=gs:// --maxNumWorkers=1 --tempLocation=<> --region=northamerica-northeast1 --subnetwork=<network_link> --serviceAccount= --usePublicIps=false --pipelineConfig=pipeline_config/pgp_comm_apps.properties --workerMachineType=n1-standard-2 --env=dev --jobName=test-ingestion-7e20f260#-#{"workflow": "fds-test-dataflow", "task-id": "load-data", "execution-date": "2022-08-02T19:31:51.473861+00:00"}
DAG code:
import datetime
from airflow import models
# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowJavaOperator
# Operators; we need this to operate!
from airflow.operators.bash_operator import BashOperator
default_dag_args = {
    # The start_date describes when a DAG is valid / can be run. Set this to a
    # fixed point in time rather than dynamically, since it is evaluated every
    # time a DAG is parsed. See:
    # https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime.datetime(2022, 7, 27),
    'email': ['test@test.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'dataflow_default_options': {
        'project': '<prod_id>',
        #"region": "northamerica-northeast1",
        "zone": "northamerica-northeast1-a",
        'stagingLocation': 'gs://location',
    }
    #'retry_delay': timedelta(minutes=30),
}
# Define a DAG (directed acyclic graph) of tasks.
# Any task you create within the context manager is automatically added to the
# DAG object.
with models.DAG(
        'fds-test-dataflow',
        catchup=False,
        schedule_interval=None,
        #schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:
    task = DataFlowJavaOperator(
        task_id='load-data',
        gcp_conn_id="gcp_connection",
        job_name='test-ingestion',
        jar='gs://path_to_jar',
        delegate_to="<SA>",
        location='northamerica-northeast1',
        options={
            'maxNumWorkers': '1',
            'project': '<proj_id>',
            'tempLocation': 'gs://location/',
            'region': 'northamerica-northeast1',
            "zone": "northamerica-northeast1-a",
            'subnetwork': 'network',
            'serviceAccount': 'SA',
            'usePublicIps': 'false',
            'pipelineConfig': 'pipeline_config/pgp_comm_apps.properties',
            "currentTms": '"2022-06-28 10:00:00"',
            'labels': {},
            'workerMachineType': 'n1-standard-2',
            'env': 'dev'
        },
        dag=dag,)
    task

Related

Access Denied: BigQuery BigQuery: Missing required OAuth scope. Need BigQuery or Cloud Platform write scope

I'm trying to pick a JSON file from a Cloud Storage bucket and dump it into BigQuery using Apache Airflow, however, I'm getting the following error:
Access Denied: BigQuery BigQuery: Missing required OAuth scope. Need BigQuery or Cloud Platform write scope.
This is my code:
from datetime import timedelta, datetime
import json
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    #'start_date': seven_days_ago,
    'start_date': datetime(2022, 11, 1),
    'email': ['uzair.zafar@gmail.pk'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=5),
}
with DAG('checking_airflow',
         default_args=default_args,
         description='dag to start the logging of data in logging table',
         schedule_interval='@daily',
         start_date=datetime(2022, 11, 1),
         ) as dag:
    dump_csv_to_temp_table = GCSToBigQueryOperator(
        task_id='gcs_to_bq_load',
        google_cloud_storage_conn_id='gcp_connection',
        bucket='airflow-dags',
        #filename='users/users.csv',
        source_objects='users/users0.json',
        #schema_object='schemas/users.json',
        source_format='NEWLINE_DELIMITED_JSON',
        create_disposition='CREATE_IF_NEEDED',
        destination_project_dataset_table='project.supply_chain.temporary_users',
        write_disposition='WRITE_TRUNCATE',
        dag=dag,
    )
    dump_csv_to_temp_table
Please help me solve this issue.
Could you please share more details? Are you using Cloud Composer to perform this task?
If the environment is running, there should be a default Google connection that you can use. Go to Airflow UI >> Admin >> Connections and you should see google_cloud_default there.
In Composer, if you don't specify a connection, it will use the default one for interacting with Google Cloud resources.

Google Cloud Storage To Google Cloud SQL (Postgres) Operator in Airflow (or Composer)

I am trying to load data from a CSV file in GCS, but there is no predefined operator that does this in Airflow.
I built a simple operator using a PSQL hook and a GCS file reader, but I'm wondering if there is a better solution. Right now the custom operator loops over the open GCS file row by row, running a series of "INSERT INTO" statements.
You can use CloudSQLImportInstanceOperator for your use case. It imports data from Cloud Storage (a CSV file) into a Cloud SQL instance. See the operator's documentation for an in-depth explanation.
import datetime
from airflow import models
from airflow.operators import bash
from airflow.providers.google.cloud.operators.cloud_sql import CloudSQLImportInstanceOperator
# If you are running Airflow in more than one time zone
# see https://airflow.apache.org/docs/apache-airflow/stable/timezone.html
# for best practices
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
PROJECT_ID = "project-id"
DAG_ID = "cloudsql"
#BUCKET_NAME = f"{DAG_ID}_{ENV_ID}_bucket"
INSTANCE_NAME = "instance-name"
# Request body describing what to import and where to put it
import_body = {
    "importContext": {
        "uri": "gs://bucket/file.csv",
        "fileType": "CSV",
        "csvImportOptions": {
            "table": "table",
            "columns": [
                "column1",
                "column2"
            ]
        },
        "database": "guestbook",
        "importUser": "postgres"
    }
}
default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}
with models.DAG(
        'composer_quickstart',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:
    # Import the CSV file from Cloud Storage into the Cloud SQL instance
    sql_import_task = CloudSQLImportInstanceOperator(
        body=import_body, instance=INSTANCE_NAME, task_id='sql_import_task', project_id=PROJECT_ID)
    sql_import_task
Yes, there is no operator to insert data from GCS into Cloud SQL, but you can use the CloudSqlHook to import the GCS file.
Here you can find an example for body, which is a dict containing your file rows. If your file is too big, you can import it in batches (1k-10k rows), which is much better than a loop with INSERT INTO.
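If several files need importing, the request body from the first answer can be assembled by a small helper. This is only a minimal sketch in plain Python: build_import_body is a hypothetical name, and its keys simply mirror the import_body dict shown above.

```python
def build_import_body(uri, database, table, columns, user="postgres"):
    """Build the dict passed as `body` for a CSV import from Cloud Storage.

    Hypothetical helper: the structure copies the import_body dict above.
    """
    return {
        "importContext": {
            "uri": uri,
            "fileType": "CSV",
            "csvImportOptions": {
                "table": table,
                "columns": list(columns),
            },
            "database": database,
            "importUser": user,
        }
    }


# Same values as the hand-written import_body above.
body = build_import_body(
    uri="gs://bucket/file.csv",
    database="guestbook",
    table="table",
    columns=["column1", "column2"],
)
print(body["importContext"]["csvImportOptions"]["columns"])
```

The resulting dict would be passed as body=... to the operator in the same way as import_body.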

Apache Airflow - Dag doesn't start even with start_date and schedule_interval defined

I am new to Airflow, but I've defined a DAG to send a basic email every day at 9am. My DAG is the following:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator
from airflow.operators.email_operator import EmailOperator
from airflow.utils.dates import days_ago
date_log = str(datetime.today())
my_email = ''
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(0),
    'email': ['my_email'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'concurrency': 1,
    'max_active_runs': 1
}
with DAG('TEST', default_args=default_args, schedule_interval='0 9 * * *', max_active_runs=1, catchup=False) as dag:
    t_teste = EmailOperator(dag=dag, task_id='successful_notification',
                            to='my_email',
                            subject='Airflow Dag ' + date_log,
                            html_content="""""")
    t_teste
I have everything configured as needed, and the webserver and scheduler are running. The DAG is also active in the UI. My problem is that the DAG seems to do nothing: it hasn't run for two days, and even once the scheduled time passes, it doesn't run. I have already triggered it manually, and it ran successfully. But if I wait for the scheduled time, nothing happens.
Do you know what I am doing wrong?
Thanks!
Your DAG will never be scheduled. The Airflow scheduler calculates start_date + schedule_interval and schedules the DAG run at the END of the interval.
>>> import airflow
>>> from airflow.utils.dates import days_ago
>>> print(days_ago(0))
2021-06-26 00:00:00+00:00
Taking 2021-06-26 (today) as the start_date and adding the schedule_interval means the first run is due on 2021-06-27 09:00. But when 2021-06-27 arrives, start_date has moved forward with it and the same calculation yields 2021-06-28 09:00, and so on, so the DAG never actually runs.
The conclusion is: never use dynamic values in start_date!
To solve your issue, simply change 'start_date': days_ago(0) to a static value such as 'start_date': datetime(2021, 6, 25).
Note that if you are running an older version of Airflow you might also need to change the dag_id.
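The arithmetic above can be checked with a small simulation. This is a simplified model in plain Python, not Airflow's scheduler: first_run_time, ever_runs, dynamic_start and static_start are names made up for illustration, datetimes are naive rather than UTC-aware, and the model assumes the first run fires one interval after the first 09:00 at or after start_date.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(days=1)  # stands in for the '0 9 * * *' schedule


def first_run_time(start_date):
    """Return when the first run would fire in this simplified model.

    The first interval starts at the first 09:00 at or after start_date,
    and the run fires at the END of that interval.
    """
    tick = start_date.replace(hour=9, minute=0, second=0, microsecond=0)
    if tick < start_date:
        tick += INTERVAL
    return tick + INTERVAL


def ever_runs(start_date_fn, first_check, hours=24 * 5):
    """Re-evaluate start_date once per hour, as happens on every DAG parse."""
    for h in range(hours):
        now = first_check + timedelta(hours=h)
        if first_run_time(start_date_fn(now)) <= now:
            return True
    return False


def dynamic_start(now):
    # Like days_ago(0): midnight of the current day, so it moves every day.
    return now.replace(hour=0, minute=0, second=0, microsecond=0)


def static_start(now):
    # A fixed date, as recommended in the answer; `now` is ignored.
    return datetime(2021, 6, 25)


first_check = datetime(2021, 6, 26, 0, 0)
print(ever_runs(dynamic_start, first_check))  # False
print(ever_runs(static_start, first_check))   # True
```

With the moving start_date, the first due time is always at least nine hours in the future, so the check never succeeds. With the fixed date it succeeds at 2021-06-26 09:00.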

Airflow - GCP - files from DAG folder are not showing up

I'm new to GCP. I have a sample Python script, created in a GCP environment, which runs fine. I want to schedule it in Airflow. I copied the file into the DAG folder of the environment (gs://us-west2-*******-6f9ce4ef-bucket/dags), but it's not showing up in the Airflow DAG list.
This is the location in airflow config.
dags_folder = /home/airflow/gcs/dags
Please let me know how to get my Python code to show up in Airflow. Do I have to set up anything else? I kept all the defaults.
Thanks in advance.
What you did is already correct: you placed your Python script in gs://auto-generated-bucket/dags/. I'm not sure whether your script uses the airflow library, but that library is what lets you configure the behavior of your DAG in Airflow. You can see an example in the Cloud Composer quickstart.
You can check an in-depth tutorial of DAGs here.
Sample DAG (test_dag.py) that prints the dag_run.id:
# test_dag.py #
import datetime
import airflow
from airflow.operators import bash_operator
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}
with airflow.DAG(
        'this_is_the_test_dag',  # <-- This string is displayed in the Airflow web interface as the DAG name
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:
    # Print the dag_run id from the Airflow logs
    print_dag_run_conf = bash_operator.BashOperator(
        task_id='print_dag_run_conf', bash_command='echo {{ dag_run.id }}')

Airflow keeps running my DAG, despite catchup=False, schedule_interval=datetime.timedelta(hours=2)

Similar to previous questions, but none of the answers given worked. I have a DAG:
import datetime
import os
from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow.operators import BashOperator
PROJECT = os.environ['PROJECT']
GCS_BUCKET = os.environ['BUCKET']
API_KEY = os.environ['API_KEY']
default_args = {
    'owner': 'me',
    'start_date': datetime.datetime(2019, 7, 30),
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': datetime.timedelta(hours=1),
    'catchup': False
}
dag = DAG('dag-name',
          schedule_interval=datetime.timedelta(hours=2),
          default_args=default_args,
          max_active_runs=1,
          concurrency=1,
          catchup=False)
DEFAULT_OPTIONS_TEMPLATE = {
    'project': PROJECT,
    'stagingLocation': 'gs://{}/staging'.format(GCS_BUCKET),
    'tempLocation': 'gs://{}/temp'.format(GCS_BUCKET)
}
def backup_bt_to_bq(template_location, name):
    run_time = datetime.datetime.utcnow()
    a_value = run_time.strftime('%Y%m%d%H')
    t1 = DataflowTemplateOperator(
        task_id='{}-task'.format(name),
        template=template_location,
        parameters={'an_argument': a_value},
        dataflow_default_options=DEFAULT_OPTIONS_TEMPLATE,
        poll_sleep=30
    )
    t2 = BashOperator(
        task_id='{}-loader-heartbeat'.format(name),
        bash_command='curl --fail -XGET "[a heartbeat URL]" --header "Authorization: heartbeat_service {1}"'.format(name, API_KEY)
    )
    t1 >> t2
with dag:
    backup_bt_to_bq('gs://[path to gcs]'.format(GCS_BUCKET), 'name')
As you can see, I'm trying very hard to prevent Airflow from trying to backfill. Yet, when I deploy the DAG (late in the day, on 7/30/2019), it just keeps running the DAG one after the other, after the other, after the other.
Since this task is moving a bit of data around, this is not desirable. How do I get airflow to respect the "run this every other hour" schedule_interval??
As you can see, I've set catchup: False in both the DAG args AND the default args (just in case, started with them in the DAG args). The retry delay is also a long period.
Each DAG run is reported as a success.
I'm running with the following version:
composer-1.5.0-airflow-1.10.1
My next step is kubernetes cron...
I suspect you did not have catchup=False when you first created the DAG. I think Airflow may not recognize changes to the catchup parameter after initial DAG creation.
Try renaming it and see what happens, e.g. add a v2 and enable it. After enabling it, it will run once even though catchup is False, because there is a valid completed interval (i.e. current time is >= start_date + schedule_interval), but that is all.
Of course, test with a fake operator that doesn't do anything expensive.
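To illustrate the behaviour described above, here is a simplified model in plain Python of which intervals get a run when a DAG is first enabled. It is a sketch, not Airflow's scheduler code: runs_on_enable is a hypothetical helper, and deployed_at is an assumed deployment time late on 2019-07-30.

```python
from datetime import datetime, timedelta


def runs_on_enable(start_date, interval, now, catchup):
    """Return the interval start times that would get a run at time `now`.

    A run only becomes due once its interval has ended. With catchup=True
    every completed interval is run; with catchup=False only the most
    recent completed interval is.
    """
    completed = []
    begin = start_date
    while begin + interval <= now:
        completed.append(begin)
        begin += interval
    if catchup:
        return completed
    return completed[-1:]


start = datetime(2019, 7, 30)
every_two_hours = timedelta(hours=2)
deployed_at = datetime(2019, 7, 30, 21, 30)  # assumed deployment time

backfilled = runs_on_enable(start, every_two_hours, deployed_at, catchup=True)
latest_only = runs_on_enable(start, every_two_hours, deployed_at, catchup=False)
print(len(backfilled))  # 10
print(latest_only)      # [datetime.datetime(2019, 7, 30, 18, 0)]
```

In this model a DAG created without catchup=False would queue ten back-to-back runs, which matches the symptom in the question, while the renamed DAG would run once for the 18:00 interval and then wait for the next one.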