Execute a Bash or Python script on DataProc Cluster using DAG - google-cloud-platform

I have created a DAG on Composer (Airflow) that successfully creates a cluster:
with airflow.DAG(dag_id='......',
                 default_args=default_args,
                 catchup=False,
                 description='.......',
                 schedule_interval=None) as dag:

    start_of_the_dag = DummyOperator(task_id='start_of_the_dag')

    create_dataproc_cluster = DataprocClusterCreateOperator(
        task_id='............',
        cluster_name='..........',
        storage_bucket=os.environ['gcs_bucket'],
        project_id=os.environ['gcp_project'],
        service_account=os.environ['service_account'],
        master_machine_type='n1-standard-32',
        worker_machine_type='n1-standard-32',
        num_workers=4,
        num_preemptible_workers=0,
        image_version='1.3-debian10',
        metadata={"enable-oslogin": "true"},
        # custom_image=job['cluster_config']['image'],
        internal_ip_only=True,
        region=os.environ['gcp_region'],
        zone=os.environ['gce_zone'],
        subnetwork_uri=os.environ['subnetwork_uri'],
        tags=os.environ['firewall_rules_tags'].split(','),
        autoscaling_policy=None,
        enable_http_port_access=True,
        enable_optional_components=True,
        init_actions_uris=None,
        auto_delete_ttl=3600,
        dag=dag
    )

    start_of_the_dag >> create_dataproc_cluster
What I want to do is create a Python or shell script that will be executed on the master node of the cluster I just created. With BashOperator or PythonOperator the execution happens on the Composer cluster. (Is that correct?)
To sum up, I want to use an operator after create_dataproc_cluster that will execute some commands on the master node of the Dataproc cluster.
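Yes, that is correct: BashOperator and PythonOperator run on the Composer (Airflow) workers, not on the Dataproc cluster. One common way to run commands on the cluster itself is to submit a Dataproc job after the cluster is created. Below is a minimal sketch (not from the original post), assuming the google provider's DataprocSubmitJobOperator and a hypothetical script at gs://my-bucket/scripts/my_script.sh; it uses a Pig job's fs/sh commands, which are run by the Dataproc job driver on the cluster (on the master node by default):
import os

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Sketch only: cluster name, bucket and script path are placeholders.
run_script_on_cluster = DataprocSubmitJobOperator(
    task_id='run_script_on_cluster',
    project_id=os.environ['gcp_project'],
    region=os.environ['gcp_region'],
    job={
        'placement': {'cluster_name': '..........'},  # same cluster_name as above
        'pig_job': {
            'query_list': {
                'queries': [
                    # stage the (hypothetical) script from GCS and run it on the cluster
                    "fs -cp gs://my-bucket/scripts/my_script.sh file:///tmp/my_script.sh",
                    "sh chmod +x /tmp/my_script.sh",
                    "sh /tmp/my_script.sh",
                ]
            }
        },
    },
    dag=dag,
)

create_dataproc_cluster >> run_script_on_cluster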

Related

GCP ComposerV2 missing log files

We deployed GCP ComposerV2 with the most recent Airflow version. It works perfectly, but from time to time the predefined "airflow_monitoring" DAG crashes.
Here are the logs of the issue:
*** Log file is not found: gs://********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log. The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted)
*** 404 GET https://storage.googleapis.com/download/storage/v1/b/********/o/logs%2Fairflow_monitoring%2Fecho%2F2021-12-14T12%3A36%3A55%2B00%3A00%2F1.log?alt=media: No such object: ********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
We don't change anything, this issue has happened randomly.
Here is the code of "airflow_monitoring" predefined DAG:
"""A liveness prober dag for monitoring composer.googleapis.com/environment/healthy."""
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
default_args = {
'start_date': airflow.utils.dates.days_ago(0),
'retries': 1,
'retry_delay': timedelta(minutes=5)
}
dag = DAG(
'airflow_monitoring',
default_args=default_args,
description='liveness monitoring dag',
schedule_interval=None,
dagrun_timeout=timedelta(minutes=20))
# priority_weight has type int in Airflow DB, uses the maximum.
t1 = BashOperator(
task_id='echo',
bash_command='echo test',
dag=dag,
depends_on_past=False,
priority_weight=2**31-1)
I think the log says everything:
*** Log file is not found: gs://********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log. The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted)
The Kubernetes environment might from time to time evict a running task (for example, when it fails over to another node because a disk crashed or because the machines need to be restarted).
I think you should set retries to 2 so that the task is automatically retried in such a case.
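For example, a minimal tweak to the predefined DAG's default_args (only the retries value changes):
from datetime import timedelta
import airflow

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 2,  # retry automatically if the worker pod gets evicted
    'retry_delay': timedelta(minutes=5)
}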

Apache Airflow S3ListOperator not listing files

I am trying to use the airflow.providers.amazon.aws.operators.s3_list S3ListOperator to list files in an S3 bucket in my AWS account with the DAG operator below:
list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default'
)
I have configured my Extra Connection details in the form of: {"aws_access_key_id": "<MY_ACCESS_KEY>", "aws_secret_access_key": "<MY_SECRET_KEY>"}
When I run my Airflow job, it appears it is executing fine & my task status is Success. Here is the Log output:
[2021-04-27 11:44:50,009] {base_aws.py:368} INFO - Airflow Connection: aws_conn_id=s3_default
[2021-04-27 11:44:50,013] {base_aws.py:170} INFO - Credentials retrieved from extra_config
[2021-04-27 11:44:50,013] {base_aws.py:84} INFO - Creating session with aws_access_key_id=<MY_ACCESS_KEY> region_name=None
[2021-04-27 11:44:50,027] {base_aws.py:157} INFO - role_arn is None
[2021-04-27 11:44:50,661] {taskinstance.py:1185} INFO - Marking task as SUCCESS. dag_id=two_step, task_id=list_files_in_bucket, execution_date=20210427T184422, start_date=20210427T184439, end_date=20210427T184450
[2021-04-27 11:44:50,676] {taskinstance.py:1246} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-04-27 11:44:50,700] {local_task_job.py:146} INFO - Task exited with return code 0
Is there anything I can do to print the files in my bucket to Logs?
TIA
This code is enough; you don't need a print function. Just check the corresponding log, then go to the task's XCom, and the returned list is there.
list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='ob-air-pre',
    prefix='data/',
    delimiter='/',
    aws_conn_id='aws'
)
The result from executing S3ListOperator is an XCom object that is stored in the Airflow database after the task instance has completed.
You need to declare another operator to feed in the results from the S3ListOperator and print them out.
For example, in Airflow 2.0.0 and up you can use TaskFlow:
from airflow.models import DAG
from airflow.providers.amazon.aws.operators.s3_list import S3ListOperator
from airflow.utils import timezone

dag = DAG(
    dag_id='my-workflow',
    start_date=timezone.parse('2021-01-14 21:00')
)

@dag.task(task_id="print_objects")
def print_objects(objects):
    print(objects)

list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default',
    dag=dag
)

print_objects(list_bucket.output)
In older versions,
from airflow.models import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.s3_list import S3ListOperator
from airflow.utils import timezone

dag = DAG(
    dag_id='my-workflow',
    start_date=timezone.parse('2021-01-14 21:00')
)

def print_objects(objects):
    print(objects)

list_bucket = S3ListOperator(
    dag=dag,
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default',
)

print_objects_in_bucket = PythonOperator(
    dag=dag,
    task_id='print_objects_in_bucket',
    python_callable=print_objects,
    op_args=("{{ti.xcom_pull(task_ids='list_files_in_bucket')}}",)
)

list_bucket >> print_objects_in_bucket
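Note that the Jinja-templated op_args value arrives in print_objects as a rendered string. If you are on Airflow 2.1+ and want an actual Python list, one option (a sketch, not part of the original answer) is to enable native object rendering on the DAG:
dag = DAG(
    dag_id='my-workflow',
    start_date=timezone.parse('2021-01-14 21:00'),
    # Airflow 2.1+: render templated arguments as native Python objects,
    # so print_objects receives a real list instead of its string form.
    render_template_as_native_obj=True,
)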

Airflow - GCP - files from DAG folder are not showing up

I'm new to GCP. I have a sample Python script created in a GCP environment which is running fine, and I want to schedule it in Airflow. I copied the file into the DAG folder of the environment (gs://us-west2-*******-6f9ce4ef-bucket/dags), but it's not showing up in the Airflow DAG list.
This is the location in airflow config.
dags_folder = /home/airflow/gcs/dags
Please let me know how to get my Python code to show up in Airflow. Do I have to set up anything else? I kept everything at the defaults.
Thanks in advance.
What you did is already correct: you placed your Python script in gs://auto-generated-bucket/dags/. I'm not sure whether you used the airflow library in your script, but that library is what lets you configure the behavior of your DAG in Airflow. You can see an example in the Cloud Composer quickstart.
You can check an in-depth tutorial of DAGs here.
Sample DAG (test_dag.py) that prints the dag_run.id:
# test_dag.py #
import datetime
import airflow
from airflow.operators import bash_operator
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
default_args = {
'owner': 'Composer Example',
'depends_on_past': False,
'email': [''],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': datetime.timedelta(minutes=5),
'start_date': YESTERDAY,
}
with airflow.DAG(
'this_is_the_test_dag', ## <-- This string will be displayed in the AIRFLOW web interface as the DAG name ##
'catchup=False',
default_args=default_args,
schedule_interval=datetime.timedelta(days=1)) as dag:
# Print the dag_run id from the Airflow logs
print_dag_run_conf = bash_operator.BashOperator(
task_id='print_dag_run_conf', bash_command='echo {{ dag_run.id }}')
(The original answer included screenshots of the gs://auto-generated-bucket/dags/ GCS location and of the Airflow web server showing the DAG.)

Triggering DAG in a loop

I am trying to trigger a DAG task 3 times. How can this be done using a Python script?
Currently my flow job is:
dag = DAG('dag1', default_args=default_args, concurrency=1,
          max_active_runs=1,
          schedule_interval=None, catchup=False)

task = BashOperator(
    task_id='task1',
    bash_command='ssh "runJob.sh"',
    dag=dag)
Is there any better way to trigger job for a given specified number of times ?
Assuming you use Airflow 2, you can use the following command to trigger a DAG run:
airflow dags trigger [-h] [-c CONF] [-e EXEC_DATE] [-r RUN_ID] [-S SUBDIR] dag_id
You can read more on this command here.
Based on that command, we can use subprocess to run it:
import subprocess

command = [
    "airflow",
    "dags",
    "trigger",
    "-e",
    "2021-02-05 00:00:00",
    "-r",
    "manual__2021-02-05T00:00:00+00:00",
    "my_dag",
]

for i in range(3):
    # Note: the execution date (-e) and run id (-r) must be unique per DAG run,
    # so vary them per iteration when triggering in a loop.
    process = subprocess.Popen(command, stdout=subprocess.PIPE)
    output, error = process.communicate()
In Airflow 2.0 you can use the TriggerDagRunOperator.
You should be able to use that in a loop.
https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/trigger_dagrun/index.html
trigger = TriggerDagRunOperator(
    task_id="some-id",
    trigger_dag_id="dag1",
    conf={"some_param": 'some_value'},
    reset_dag_run=True
)
Alternatively, you can put it inside a task and call the execute method, like TriggerDagRunOperator(...).execute(), passing in the current context, which you can get with the get_current_context function from airflow.operators.python.
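A rough sketch of that pattern (task names and the conf payload below are placeholders, not from the original answer), assuming Airflow 2:
from airflow.operators.python import PythonOperator, get_current_context
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

def trigger_three_times():
    # Reuse the current task's context for each execute() call.
    context = get_current_context()
    for i in range(3):
        TriggerDagRunOperator(
            task_id=f"trigger_dag1_{i}",
            trigger_dag_id="dag1",
            conf={"iteration": i},  # placeholder conf
            reset_dag_run=True,
        ).execute(context)

trigger_in_loop = PythonOperator(
    task_id="trigger_in_loop",
    python_callable=trigger_three_times,
    dag=dag,
)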

Apache airflow cannot locate AWS credentials when using boto3 inside a DAG

We are migrating to Apache Airflow using ECS Fargate.
The problem we are facing is simple. We have a simple DAG in which one of the tasks communicates with an external AWS service (let's say, downloading a file from S3). This is the script of the DAG:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

# default arguments for each task
default_args = {
    'owner': 'thomas',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('test_s3_download',
          default_args=default_args,
          schedule_interval=None)

TEST_BUCKET = 'bucket-dev'
TEST_KEY = 'BlueMetric/dms.json'

# simple download task
def download_file(bucket, key):
    import boto3
    s3 = boto3.resource('s3')
    print(s3.Object(bucket, key).get()['Body'].read())

download_from_s3 = PythonOperator(
    task_id='download_from_s3',
    python_callable=download_file,
    op_kwargs={'bucket': TEST_BUCKET, 'key': TEST_KEY},
    dag=dag)

sleep_task = BashOperator(
    task_id='sleep_for_1',
    bash_command='sleep 1',
    dag=dag)

download_from_s3.set_downstream(sleep_task)
As we have done before when using Docker, we create the ~/.aws/config file inside the container, which reads:
[default]
region = eu-west-1
and as long as the container is within the AWS boundaries, it'll resolve every request without any need to specify credentials.
This is the Dockerfile we are using:
FROM puckel/docker-airflow:1.10.7
USER root
COPY entrypoint.sh /entrypoint.sh
COPY requirements.txt /requirements.txt
RUN apt-get update
RUN ["chmod", "+x", "/entrypoint.sh"]
RUN mkdir -p /home/airflow/.aws \
&& touch /home/airflow/.aws/config \
&& echo '[default]' > /home/airflow/.aws/config \
&& echo 'region = eu-west-1' >> /home/airflow/.aws/config
RUN ["chown", "-R", "airflow", "/home/airflow"]
USER airflow
ENTRYPOINT ["/entrypoint.sh"]
# Expose webUI and flower respectively
EXPOSE 8080
EXPOSE 5555
and everything works like a charm. The directory creation and ownership change succeed, but when the DAG runs it fails with:
...
...
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/signers.py", line 160, in sign
auth.add_auth(request)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
[2020-08-24 11:15:02,125] {{taskinstance.py:1117}} INFO - All retries failed; marking task as FAILED
So we are thinking that the Airflow worker node uses another user.
Does any of you know what's going on? Thank you for any advice/light you can provide.
Create a proper task_role_arn for the task definition. This role is the one assumed by the processes running inside the container. One more note: the error message is misleading. It should not read:
Unable to locate credentials
but rather something like:
Access Denied: you don't have permission to s3:GetObject.
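As a quick check (not part of the original answer), you can print which identity boto3 actually resolves inside the container; with a correct task_role_arn this should show the assumed task role:
import boto3

def whoami():
    # Shows the account and the ARN of the identity boto3 resolved
    # (for ECS Fargate this should be the assumed task role).
    sts = boto3.client('sts')
    print(sts.get_caller_identity())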