I'm new to GCP. I have a sample Python script created in a GCP environment, and it runs fine. I want to schedule it in Airflow. I copied the file into the DAG folder of the environment (gs://us-west2-*******-6f9ce4ef-bucket/dags), but it's not showing up in the Airflow DAG list.
This is the dags_folder setting in the Airflow config:
dags_folder = /home/airflow/gcs/dags
Please let me know how to get my Python code to show up in Airflow. Do I have to set up anything else? I kept everything at the defaults.
Thanks in advance.
What you did is already correct: you placed your Python script in gs://auto-generated-bucket/dags/. I'm not sure whether your script uses the airflow library, but it needs to, because that library is what defines the DAG object that Airflow picks up and lets you configure its behavior (schedule, tasks, and so on). You can see an example in the Cloud Composer quickstart.
You can check an in-depth tutorial on DAGs here.
Here is a sample DAG (test_dag.py) that prints the dag_run id:
# test_dag.py #
import datetime
import airflow
from airflow.operators import bash_operator
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}

with airflow.DAG(
        'this_is_the_test_dag',  # <-- this string is displayed in the Airflow web interface as the DAG name
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    # Print the dag_run id from the Airflow logs
    print_dag_run_conf = bash_operator.BashOperator(
        task_id='print_dag_run_conf', bash_command='echo {{ dag_run.id }}')
(Screenshot: the gs://auto-generated-bucket/dags/ GCS location)
(Screenshot: the Airflow web server showing the DAG)
I'm trying to pick up a JSON file from a Cloud Storage bucket and load it into BigQuery using Apache Airflow; however, I'm getting the following error:
Access Denied: BigQuery BigQuery: Missing required OAuth scope. Need BigQuery or Cloud Platform write scope.
This is my code:
from datetime import timedelta, datetime
import json
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    # 'start_date': seven_days_ago,
    'start_date': datetime(2022, 11, 1),
    'email': ['uzair.zafar@gmail.pk'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=5),
}

with DAG('checking_airflow',
         default_args=default_args,
         description='dag to start the logging of data in logging table',
         schedule_interval='@daily',
         start_date=datetime(2022, 11, 1),
         ) as dag:

    dump_csv_to_temp_table = GCSToBigQueryOperator(
        task_id='gcs_to_bq_load',
        google_cloud_storage_conn_id='gcp_connection',
        bucket='airflow-dags',
        # filename='users/users.csv',
        source_objects='users/users0.json',
        # schema_object='schemas/users.json',
        source_format='NEWLINE_DELIMITED_JSON',
        create_disposition='CREATE_IF_NEEDED',
        destination_project_dataset_table='project.supply_chain.temporary_users',
        write_disposition='WRITE_TRUNCATE',
        dag=dag,
    )

    dump_csv_to_temp_table
Please help me solve this issue.
Could you please share more details? Are you using Cloud Composer to perform this task?
If the environment is running, there should be a default Google connection that you can use. Go to the Airflow UI >> Admin >> Connections and you should see google_cloud_default there.
In Composer, if you don't specify a connection, the operator uses that default one to interact with Google Cloud resources.
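As a rough sketch of what that looks like in the DAG from the question (assuming the environment's default service account has BigQuery access; the task name and targets are simply reused from the question):

from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Same task as in the question, but with no explicit connection id: the
# operator then falls back to the default Google Cloud connection
# (google_cloud_default), which Composer creates for the environment's
# service account.
dump_csv_to_temp_table = GCSToBigQueryOperator(
    task_id='gcs_to_bq_load',
    bucket='airflow-dags',
    source_objects=['users/users0.json'],
    source_format='NEWLINE_DELIMITED_JSON',
    create_disposition='CREATE_IF_NEEDED',
    destination_project_dataset_table='project.supply_chain.temporary_users',
    write_disposition='WRITE_TRUNCATE',
    dag=dag,  # the same DAG object defined in the question's `with DAG(...)` block
)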
I'm trying to run a container with Cloud Run as a task of an Airflow DAG.
It seems there is nothing like a CloudRunOperator or similar, and I can't find anything in the documentation (either Cloud Run's or Airflow's).
Has anyone ever dealt with this problem?
If so, how can I run a container with Cloud Run and handle XCom?
Thanks in advance!
AFAIK, when a container is deployed to Cloud Run it automatically listens for requests sent to it. See the documentation for reference.
So, rather than launching it from Airflow, you can send a request to the deployed container. You can do this with the code below.
This DAG has three tasks: print_token, task_get_op and process_data.
print_token prints the identity token needed to authenticate requests to your deployed Cloud Run container. I used xcom_pull to get the output of the BashOperator and assign the authentication token to token, so it can be used to authenticate the HTTP request you will perform.
task_get_op performs a GET on the connection cloud_run (which just contains the Cloud Run endpoint) with the header 'Authorization': 'Bearer ' + token for authentication.
process_data performs xcom_pull on task_get_op to get the output and prints it using a PythonOperator.
import datetime
import airflow
from airflow.operators import bash
from airflow.operators import python
from airflow.providers.http.operators.http import SimpleHttpOperator
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}

with airflow.DAG(
        'composer_http_request',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    print_token = bash.BashOperator(
        task_id='print_token',
        bash_command='gcloud auth print-identity-token "--audiences=https://hello-world-fri824-ab.c.run.app"'  # the endpoint of the deployed Cloud Run container
    )

    token = "{{ task_instance.xcom_pull(task_ids='print_token') }}"  # gets the output of the 'print_token' task

    task_get_op = SimpleHttpOperator(
        task_id='get_op',
        method='GET',
        http_conn_id='cloud_run',
        headers={'Authorization': 'Bearer ' + token},
    )

    def process_data_from_http(**kwargs):
        ti = kwargs['ti']
        http_data = ti.xcom_pull(task_ids='get_op')
        print(http_data)

    process_data = python.PythonOperator(
        task_id='process_data_from_http',
        python_callable=process_data_from_http,
        provide_context=True
    )

    print_token >> task_get_op >> process_data
(Screenshots: the cloud_run connection, the DAG graph, and the print_token, task_get_op and process_data logs, the last showing the output of the GET.)
NOTE: I'm using Cloud Composer 1.17.7 and Airflow 2.0.2 and installed apache-airflow-providers-http to be able to use the SimpleHttpOperator.
We are migrating to Apache Airflow using ECS Fargate.
The problem we are facing is simple: we have a DAG in which one of the tasks communicates with an external AWS service (say, downloading a file from S3). This is the DAG script:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
# default arguments for each task
default_args = {
    'owner': 'thomas',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('test_s3_download',
          default_args=default_args,
          schedule_interval=None)

TEST_BUCKET = 'bucket-dev'
TEST_KEY = 'BlueMetric/dms.json'

# simple download task
def download_file(bucket, key):
    import boto3
    s3 = boto3.resource('s3')
    print(s3.Object(bucket, key).get()['Body'].read())

download_from_s3 = PythonOperator(
    task_id='download_from_s3',
    python_callable=download_file,
    op_kwargs={'bucket': TEST_BUCKET, 'key': TEST_KEY},
    dag=dag)

sleep_task = BashOperator(
    task_id='sleep_for_1',
    bash_command='sleep 1',
    dag=dag)
download_from_s3.set_downstream(sleep_task)
As we have done before when using Docker, we create inside the container, in ~/.aws, a config file that reads:
[default]
region = eu-west-1
and, as long as the container runs inside AWS, boto3 resolves every request without any need to specify credentials.
This is the Dockerfile we are using:
FROM puckel/docker-airflow:1.10.7
USER root
COPY entrypoint.sh /entrypoint.sh
COPY requirements.txt /requirements.txt
RUN apt-get update
RUN ["chmod", "+x", "/entrypoint.sh"]
RUN mkdir -p /home/airflow/.aws \
&& touch /home/airflow/.aws/config \
&& echo '[default]' > /home/airflow/.aws/config \
&& echo 'region = eu-west-1' >> /home/airflow/.aws/config
RUN ["chown", "-R", "airflow", "/home/airflow"]
USER airflow
ENTRYPOINT ["/entrypoint.sh"]
# Expose the web UI and Flower, respectively
EXPOSE 8080
EXPOSE 5555
and the image builds and deploys without issues: the directory is created and the ownership change succeeds. But when the DAG runs, it fails with:
...
...
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/signers.py", line 160, in sign
auth.add_auth(request)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
[2020-08-24 11:15:02,125] {{taskinstance.py:1117}} INFO - All retries failed; marking task as FAILED
So we think the Airflow worker is running as a different user.
Does anyone know what's going on? Thank you for any advice/light you can provide.
Create a proper task_role_arn for the task definition. This role is the one assumed by the processes triggered inside the container. Another note: the error should not read
Unable to locate credentials
which is misleading, but rather
Access Denied: you don't have permission to s3:GetObject.
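For illustration only, a minimal sketch (using boto3; every name, ARN and image below is hypothetical, since the question doesn't show the actual task definition) of registering a Fargate task definition that carries such a task role:

import boto3

ecs = boto3.client('ecs', region_name='eu-west-1')

# Hypothetical names and ARNs. The task role is the one assumed by boto3
# inside the running container, so it needs s3:GetObject on the target bucket;
# the execution role is only used by ECS itself to pull the image and ship logs.
ecs.register_task_definition(
    family='airflow-worker',
    requiresCompatibilities=['FARGATE'],
    networkMode='awsvpc',
    cpu='1024',
    memory='2048',
    executionRoleArn='arn:aws:iam::123456789012:role/ecsTaskExecutionRole',
    taskRoleArn='arn:aws:iam::123456789012:role/airflow-worker-task-role',
    containerDefinitions=[
        {
            'name': 'airflow-worker',
            'image': '123456789012.dkr.ecr.eu-west-1.amazonaws.com/airflow:latest',
            'essential': True,
        }
    ],
)

With the task role in place, the container receives temporary credentials from the task metadata endpoint and boto3 picks them up automatically, so no keys or ~/.aws credentials file are needed.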
I am working with two clouds: my task is to rsync files arriving in an S3 bucket to a GCS bucket. To achieve this I am using the GCP Composer (Airflow) service, where I schedule this rsync operation. I use an Airflow connection (aws_default) to store the AWS access key and secret access key. Everything works fine, but the credentials show up in the logs, which exposes them, and I don't want them displayed even there. Please advise if there is a way to keep the credentials out of the logs.
import airflow
import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.hooks.base_hook import BaseHook
from datetime import timedelta, datetime
START_TIME = datetime.utcnow() - timedelta(hours=1)
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'wait_for_downstream': True,
    'start_date': START_TIME,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=3)
}

aws_connection = BaseHook.get_connection('aws_default')
bash_env = {
    "AWS_ACCESS_KEY_ID": aws_connection.login,
    "AWS_SECRET_ACCESS_KEY": aws_connection.password
}

rsync_command = '''
set -e;
export AWS_ACCESS_KEY_ID="%s";
export AWS_SECRET_ACCESS_KEY="%s";
''' % (bash_env.get('AWS_ACCESS_KEY_ID'), bash_env.get('AWS_SECRET_ACCESS_KEY')) \
+ '''
gsutil -m rsync -r -n s3://aws_bucket/{{ execution_date.strftime('%Y/%m/%d/%H') }}/ gs://gcp_bucket/good/test/
'''

dag = DAG(
    'rsync',
    default_args=default_args,
    description='This dag is for gsutil rsync from s3 bucket to gcs storage',
    schedule_interval=timedelta(minutes=20),
    dagrun_timeout=timedelta(minutes=15)
)

s3_sync = BashOperator(
    task_id='gsutil_s3_gcp_sync',
    bash_command=rsync_command,
    dag=dag,
    depends_on_past=False,
    execution_timeout=timedelta(hours=1),
)
I would suggest putting the credentials in a boto config file, separate from Airflow (a sketch of the resulting DAG task follows after the list below). More info on the config file here.
It has a credentials section:
[Credentials]
aws_access_key_id
aws_secret_access_key
gs_access_key_id
gs_host
gs_host_header
gs_json_host
gs_json_host_header
gs_json_port
gs_oauth2_refresh_token
gs_port
gs_secret_access_key
gs_service_client_id
gs_service_key_file
gs_service_key_file_password
s3_host
s3_host_header
s3_port
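As a rough sketch of what the BashOperator task could then look like (assuming a boto config file containing that [Credentials] section already sits on the Airflow workers; the /home/airflow/gcs/data/boto.cfg path is just a hypothetical location, and dag is the same DAG object as in the question):

from datetime import timedelta

from airflow.operators.bash_operator import BashOperator

# The keys live in the boto config file referenced by BOTO_CONFIG, which
# gsutil reads through boto, so nothing secret is interpolated into the
# command that Airflow renders and logs.
rsync_command = '''
set -e;
export BOTO_CONFIG=/home/airflow/gcs/data/boto.cfg;
gsutil -m rsync -r -n s3://aws_bucket/{{ execution_date.strftime('%Y/%m/%d/%H') }}/ gs://gcp_bucket/good/test/
'''

s3_sync = BashOperator(
    task_id='gsutil_s3_gcp_sync',
    bash_command=rsync_command,
    dag=dag,
    depends_on_past=False,
    execution_timeout=timedelta(hours=1),
)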
I'm using GCP Composer (Airflow), and my DAG to scale up the number of workers keeps returning the error below:
Broken DAG: [/home/airflow/gcs/dags/cluster_scale_workers.py] 'module' object has no attribute 'DataProcClusterScaleOperator'
It seems to be related to the scale operator; however, when I check the Airflow docs (Read the Docs) against my code, nothing looks wrong. What am I missing?
Is it related to the GCP Airflow version?
Code:
import datetime
import os
from airflow import models
from airflow.contrib.operators import dataproc_operator
from airflow.utils import trigger_rule
yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    'start_date': yesterday,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': models.Variable.get('gcp_project'),
    'cluster_name': 'hive-cluster'
}

with models.DAG(
        'scale_workers',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    scale_to_6_workers = dataproc_operator.DataprocClusterScaleOperator(
        task_id='scale_dataproc_cluster_6',
        cluster_name='hive-cluster',
        num_workers=6,
        num_preemptible_workers=3,
        dag=dag
    )
I managed to find the issue and sort it out. The comment provided by Ashish Kumar above is correct.
The problem was that the Airflow version I was using (1.9.0) did not support the DataProcClusterScaleOperator. I created another environment, enabling beta features and choosing version 1.10.0, which fixed my issue.
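For reference, a quick way to double-check which Airflow version an environment is actually running (it is also shown on the environment's page in the Composer console) is to print airflow.__version__, for example from a throwaway DAG file:

import airflow

# Logs the running Airflow version when the file is parsed, e.g. '1.9.0' or '1.10.0'.
print(airflow.__version__)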