Triggering DAG in a loop - airflow-scheduler

I am trying to trigger a DAG task 3 times; how can this be done using a Python script?
Currently my DAG is:
dag = DAG('dag1', default_args=default_args, concurrency=1,
          max_active_runs=1,
          schedule_interval=None, catchup=False)

task = BashOperator(
    task_id='task1',
    bash_command='ssh "runJob.sh"',
    dag=dag)
Is there any better way to trigger the job a given number of times?

Assuming you use Airflow 2, you can use the following command to trigger a DAG run:
airflow dags trigger [-h] [-c CONF] [-e EXEC_DATE] [-r RUN_ID] [-S SUBDIR] dag_id
You can read more about this command in the Airflow CLI reference.
Based on that command, we can use subprocess to run it:
import subprocess

for i in range(3):
    # Each trigger needs a unique execution date / run_id, otherwise the
    # second call fails because the DAG run already exists.
    command = [
        "airflow",
        "dags",
        "trigger",
        "-e",
        f"2021-02-05 00:00:0{i}",
        "-r",
        f"manual__2021-02-05T00:00:0{i}+00:00",
        "my_dag",
    ]
    process = subprocess.Popen(command, stdout=subprocess.PIPE)
    output, error = process.communicate()

In Airflow 2.0 you can use the TriggerDagRunOperator.
You should be able to use that in a loop.
https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/trigger_dagrun/index.html
trigger = TriggerDagRunOperator(
    task_id="some-id",
    trigger_dag_id="dag1",
    conf={"some_param": "some_value"},
    reset_dag_run=True,
)
Alternatively, you can put it inside a task and call the execute method, like TriggerDagRunOperator(...).execute(context), passing in the current context, which you can get from the get_current_context function in airflow.operators.python.
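A minimal sketch of that pattern, assuming Airflow 2.x (the task names are placeholders; dag1 and dag refer to the DAG from the question):
from airflow.operators.python import PythonOperator, get_current_context
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

def trigger_dag1_three_times():
    # Grab the context of the currently running task and reuse it for each trigger.
    context = get_current_context()
    for i in range(3):
        TriggerDagRunOperator(
            task_id=f"trigger_dag1_{i}",
            trigger_dag_id="dag1",
            conf={"run_number": i},
            reset_dag_run=True,
        ).execute(context)

trigger_task = PythonOperator(
    task_id="trigger_dag1_in_a_loop",
    python_callable=trigger_dag1_three_times,
    dag=dag,
)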

Related

GCP ComposerV2 missing log files

We deployed GCP Composer V2 with the most recent Airflow version. It works perfectly, but from time to time the predefined "airflow_monitoring" DAG crashes.
Here are the logs of the issue:
*** Log file is not found: gs://********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log. The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted)
*** 404 GET https://storage.googleapis.com/download/storage/v1/b/********/o/logs%2Fairflow_monitoring%2Fecho%2F2021-12-14T12%3A36%3A55%2B00%3A00%2F1.log?alt=media: No such object: ********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
We don't change anything, this issue has happened randomly.
Here is the code of "airflow_monitoring" predefined DAG:
"""A liveness prober dag for monitoring composer.googleapis.com/environment/healthy."""
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
default_args = {
'start_date': airflow.utils.dates.days_ago(0),
'retries': 1,
'retry_delay': timedelta(minutes=5)
}
dag = DAG(
'airflow_monitoring',
default_args=default_args,
description='liveness monitoring dag',
schedule_interval=None,
dagrun_timeout=timedelta(minutes=20))
# priority_weight has type int in Airflow DB, uses the maximum.
t1 = BashOperator(
task_id='echo',
bash_command='echo test',
dag=dag,
depends_on_past=False,
priority_weight=2**31-1)
I think the log says everything:
*** Log file is not found: gs://********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log. The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted)
The Kubernetes environment might from time to time evict a running task (for example, when it fails over to another node because a disk crashed or the machines need to be restarted).
I think you should set retries to 2, and it should automatically retry in such a case.
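A minimal sketch of that change, reusing the default_args from the DAG above and only bumping retries:
default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 2,  # retry once more so an evicted task gets rescheduled
    'retry_delay': timedelta(minutes=5)
}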

Execute a Bash or Python script on DataProc Cluster using DAG

I have created a DAG on Composer (Airflow) that successfully creates a cluster:
with airflow.DAG(dag_id='......',
                 default_args=default_args,
                 catchup=False,
                 description='.......',
                 schedule_interval=None) as dag:

    start_of_the_dag = DummyOperator(task_id='start_of_the_dag')

    create_dataproc_cluster = DataprocClusterCreateOperator(
        task_id='............',
        cluster_name='..........',
        storage_bucket=os.environ['gcs_bucket'],
        project_id=os.environ['gcp_project'],
        service_account=os.environ['service_account'],
        master_machine_type='n1-standard-32',
        worker_machine_type='n1-standard-32',
        num_workers=4,
        num_preemptible_workers=0,
        image_version='1.3-debian10',
        metadata={"enable-oslogin": "true"},
        # custom_image=job['cluster_config']['image'],
        internal_ip_only=True,
        region=os.environ['gcp_region'],
        zone=os.environ['gce_zone'],
        subnetwork_uri=os.environ['subnetwork_uri'],
        tags=os.environ['firewall_rules_tags'].split(','),
        autoscaling_policy=None,
        enable_http_port_access=True,
        enable_optional_components=True,
        init_actions_uris=None,
        auto_delete_ttl=3600,
        dag=dag
    )

    start_of_the_dag >> create_dataproc_cluster
What I want to do is create a Python or shell script that will be executed from the master node of the cluster I just created. Using BashOperator or PythonOperator, the execution happens on the Composer cluster. (Is that correct?)
To sum up, I want to use an operator after create_dataproc_cluster that will execute some commands on the master node of the Dataproc cluster.
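One hedged sketch of that idea, assuming the Google provider's DataprocSubmitJobOperator is available in your Composer environment: a Pig job's driver runs on the Dataproc master node, and Pig's sh command can shell out, so a script already present on the master could be executed there rather than on the Composer workers (the script path and task name below are placeholders):
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    run_script_on_master = DataprocSubmitJobOperator(
        task_id='run_script_on_master',
        project_id=os.environ['gcp_project'],
        region=os.environ['gcp_region'],
        job={
            'reference': {'project_id': os.environ['gcp_project']},
            'placement': {'cluster_name': '..........'},  # same name as in create_dataproc_cluster
            # Pig's `sh` command runs a shell command on the master node.
            'pig_job': {'query_list': {'queries': ['sh bash /path/to/run_job.sh;']}},
        },
        dag=dag,
    )

    create_dataproc_cluster >> run_script_on_master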

Apache airflow cannot locate AWS credentials when using boto3 inside a DAG

We are migrating to Apache Airflow using ECS Fargate.
The problem we are facing is simple. We have a simple DAG where one of the tasks communicates with an external service in AWS (let's say, download a file from S3). This is the script of the DAG:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

# default arguments for each task
default_args = {
    'owner': 'thomas',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('test_s3_download',
          default_args=default_args,
          schedule_interval=None)

TEST_BUCKET = 'bucket-dev'
TEST_KEY = 'BlueMetric/dms.json'

# simple download task
def download_file(bucket, key):
    import boto3
    s3 = boto3.resource('s3')
    print(s3.Object(bucket, key).get()['Body'].read())

download_from_s3 = PythonOperator(
    task_id='download_from_s3',
    python_callable=download_file,
    op_kwargs={'bucket': TEST_BUCKET, 'key': TEST_KEY},
    dag=dag)

sleep_task = BashOperator(
    task_id='sleep_for_1',
    bash_command='sleep 1',
    dag=dag)

download_from_s3.set_downstream(sleep_task)
As we have done other times when using Docker, we create the config file inside the container at ~/.aws, which reads:
[default]
region = eu-west-1
and as long as the container is within the AWS boundaries, it'll resolve every request without any need to specify credentials.
This is the Dockerfile we are using:
FROM puckel/docker-airflow:1.10.7
USER root
COPY entrypoint.sh /entrypoint.sh
COPY requirements.txt /requirements.txt
RUN apt-get update
RUN ["chmod", "+x", "/entrypoint.sh"]
RUN mkdir -p /home/airflow/.aws \
    && touch /home/airflow/.aws/config \
    && echo '[default]' > /home/airflow/.aws/config \
    && echo 'region = eu-west-1' >> /home/airflow/.aws/config
RUN ["chown", "-R", "airflow", "/home/airflow"]
USER airflow
ENTRYPOINT ["/entrypoint.sh"]
# Expose webUI and flower respectively
EXPOSE 8080
EXPOSE 5555
and everything works like a charm. The directory creation and the change of owner are done successfully, but when running the DAG, it fails with:
...
...
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/signers.py", line 160, in sign
auth.add_auth(request)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
[2020-08-24 11:15:02,125] {{taskinstance.py:1117}} INFO - All retries failed; marking task as FAILED
So we are thinking that the Airflow worker node uses a different user.
Does any of you know what's going on? Thank you for any advice/light you can provide.
Create a proper task_role_arn for the task definition. This role is the one assumed by the processes triggered inside the container. One more note: the error should not read
Unable to locate credentials
which is misleading, but rather
Access Denied: you don't have permission to s3:GetObject.
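A hedged sketch of what that could look like when registering the Fargate task definition with boto3 (all names, ARNs, and the image URI below are placeholders, not values from the question):
import boto3

ecs = boto3.client('ecs')

ecs.register_task_definition(
    family='airflow-worker',
    requiresCompatibilities=['FARGATE'],
    networkMode='awsvpc',
    cpu='1024',
    memory='2048',
    # Role assumed by processes running inside the container; boto3 picks up
    # credentials from it, so NoCredentialsError goes away.
    taskRoleArn='arn:aws:iam::123456789012:role/airflow-task-role',
    # Role used by ECS itself to pull the image and write logs.
    executionRoleArn='arn:aws:iam::123456789012:role/ecsTaskExecutionRole',
    containerDefinitions=[
        {
            'name': 'airflow',
            'image': '123456789012.dkr.ecr.eu-west-1.amazonaws.com/airflow:latest',
            'essential': True,
        }
    ],
)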

arguments error while calling an AWS Glue Pythonshell job from boto3

Based on the previous post, I have an AWS Glue Python shell job that needs to retrieve some information from the arguments passed to it through a boto3 call.
My Glue job name is test_metrics.
The Glue Python shell code looks like below:
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['test_metrics',
                           's3_target_path_key',
                           's3_target_path_value'])

print("Target path key is: ", args['s3_target_path_key'])
print("Target Path value is: ", args['s3_target_path_value'])
The boto3 code that calls this job is below:
glue = boto3.client('glue')
response = glue.start_job_run(
    JobName='test_metrics',
    Arguments={
        '--s3_target_path_key': 's3://my_target',
        '--s3_target_path_value': 's3://my_target_value'
    }
)
print(response)
I see a 200 response after I run the boto3 code on my local machine, but the Glue error log tells me:
test_metrics.py: error: the following arguments are required: --test_metrics
What am I missing?
Which job are you trying to launch? A Spark job or a Python shell job?
If it's a Spark job, JOB_NAME is a mandatory parameter. In a Python shell job, it is not needed at all.
So in your Python shell job, replace
args = getResolvedOptions(sys.argv,
                          ['test_metrics',
                           's3_target_path_key',
                           's3_target_path_value'])
with
args = getResolvedOptions(sys.argv,
                          ['s3_target_path_key',
                           's3_target_path_value'])
It seems like the documentation is somewhat broken. I had to update the boto3 code like below to make it work:
glue = boto3.client('glue')
response = glue.start_job_run(
    JobName='test_metrics',
    Arguments={
        '--test_metrics': 'test_metrics',
        '--s3_target_path_key': 's3://my_target',
        '--s3_target_path_value': 's3://my_target_value'})
We can get the Glue job name in a Python shell job from sys.argv.
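A minimal sketch of inspecting that (the exact layout of sys.argv is something to verify in your own job's logs): each entry of the Arguments dict is passed as a --key value pair on the command line, alongside the script path.
import sys

# Print everything Glue passed to the script, including the --key value pairs
# that came from the Arguments dict of start_job_run.
print(sys.argv)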

Looking for a boto3 python example of injecting an AWS Pig step into an already running EMR?

I'm looking for a good boto3 example of injecting a Pig step into an already-running AWS EMR cluster. Previously, I used boto 2.42 with:
from boto.emr.connection import EmrConnection
from boto.emr.step import InstallPigStep, PigStep

# AWS_ACCESS_KEY = ''  # REQUIRED
# AWS_SECRET_KEY = ''  # REQUIRED
# conn = EmrConnection(AWS_ACCESS_KEY, AWS_SECRET_KEY)
# loop next element on bucket_compare list
pig_file = 's3://elasticmapreduce/samples/pig-apache/do-reports2.pig'
INPUT = 's3://elasticmapreduce/samples/pig-apache/input/access_log_1'
OUTPUT = ''  # REQUIRED, S3 bucket for job output

pig_args = ['-p', 'INPUT=%s' % INPUT,
            '-p', 'OUTPUT=%s' % OUTPUT]
pig_step = PigStep('Process Reports', pig_file, pig_args=pig_args)
steps = [InstallPigStep(), pig_step]

conn.run_jobflow(name='prs-dev-test', steps=steps,
                 hadoop_version='2.7.2-amzn-2', ami_version='latest',
                 num_instances=2, keep_alive=False)
The main problem now is that boto3 doesn't provide from boto.emr.connection import EmrConnection or from boto.emr.step import InstallPigStep, PigStep, and I can't find an equivalent set of modules.
After a bit of checking, I've found a very simple way to inject Pig script commands from within Python using the AWS CLI and the subprocess module (the awscli package needs to be installed so the aws command is available). One can then encapsulate and inject the desired Pig steps into an already running EMR cluster with:
import subprocess

cmd = 'aws emr add-steps --cluster-id j-GU07FE0VTHNG --steps Type=PIG,Name="AggPigProgram",ActionOnFailure=CONTINUE,Args=[-f,s3://dev-end2end-test/pig_scripts/AggRuleBag.pig,-p,INPUT=s3://dev-end2end-test/input_location,-p,OUTPUT=s3://end2end-test/output_location]'
push = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
output, _ = push.communicate()  # wait for the CLI call to finish
print(push.returncode)
Of course, you'll have to find your JobFlowID (cluster id) first, using something like:
aws emr list-clusters --active
run through the same subprocess pattern above. And of course you can add monitoring to your heart's delight instead of just a print statement.
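Alternatively, a hedged sketch of looking up the cluster id directly with boto3 instead of shelling out (the cluster name prs-dev-test is reused from the question and may differ in your setup):
import boto3

emr = boto3.client('emr')
# List only clusters that are still alive, then pick the one by name.
clusters = emr.list_clusters(ClusterStates=['STARTING', 'RUNNING', 'WAITING'])
cluster_id = next(c['Id'] for c in clusters['Clusters'] if c['Name'] == 'prs-dev-test')
print(cluster_id)  # e.g. j-XXXXXXXXXXXX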
Here is how to add a new step to an existing EMR cluster job flow for a Pig job using boto3.
Note: your script log file, input and output directories should use the complete path, in the format 's3://<bucket>/<directory>/<file_or_key>'.
import boto3

emrcon = boto3.client("emr")
cluster_id1 = cluster_status_file_content  # retrieved from S3, where it was recorded on creation
step_id = emrcon.add_job_flow_steps(
    JobFlowId=str(cluster_id1),
    Steps=[{
        'Name': str(pig_job_name),
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['pig', '-l', str(pig_log_file_full_path),
                     '-f', str(pig_job_run_script_full_path),
                     '-p', 'INPUT=' + str(pig_input_dir_full_path),
                     '-p', 'OUTPUT=' + str(pig_output_dir_full_path)]
        }
    }]
)
You can then monitor the step in the EMR console.