Apache Airflow cannot locate AWS credentials when using boto3 inside a DAG

We are migrating to Apache Airflow using ECS Fargate.
The problem we are facing is simple. We have a simple DAG in which one of the tasks communicates with an external AWS service (say, downloading a file from S3). This is the script of the DAG:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

# default arguments for each task
default_args = {
    'owner': 'thomas',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('test_s3_download',
          default_args=default_args,
          schedule_interval=None)

TEST_BUCKET = 'bucket-dev'
TEST_KEY = 'BlueMetric/dms.json'


# simple download task
def download_file(bucket, key):
    import boto3
    s3 = boto3.resource('s3')
    print(s3.Object(bucket, key).get()['Body'].read())


download_from_s3 = PythonOperator(
    task_id='download_from_s3',
    python_callable=download_file,
    op_kwargs={'bucket': TEST_BUCKET, 'key': TEST_KEY},
    dag=dag)

sleep_task = BashOperator(
    task_id='sleep_for_1',
    bash_command='sleep 1',
    dag=dag)

download_from_s3.set_downstream(sleep_task)
As we have done on other occasions with Docker, we create the config file inside the container at ~/.aws, which reads:
[default]
region = eu-west-1
and as long as the container runs inside AWS, it will resolve every request without any need to specify credentials explicitly.
This is the Dockerfile we are using:
FROM puckel/docker-airflow:1.10.7
USER root
COPY entrypoint.sh /entrypoint.sh
COPY requirements.txt /requirements.txt
RUN apt-get update
RUN ["chmod", "+x", "/entrypoint.sh"]
RUN mkdir -p /home/airflow/.aws \
    && touch /home/airflow/.aws/config \
    && echo '[default]' > /home/airflow/.aws/config \
    && echo 'region = eu-west-1' >> /home/airflow/.aws/config
RUN ["chown", "-R", "airflow", "/home/airflow"]
USER airflow
ENTRYPOINT ["/entrypoint.sh"]
# Expose the web UI and Flower, respectively
EXPOSE 8080
EXPOSE 5555
Everything works like a charm: the directory is created and the ownership change succeeds, but when the DAG runs it fails with:
...
...
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/signers.py", line 160, in sign
auth.add_auth(request)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
[2020-08-24 11:15:02,125] {{taskinstance.py:1117}} INFO - All retries failed; marking task as FAILED
So we suspect that the Airflow worker process is running as a different user.
Does any of you know what's going on? Thank you for any advice/light you can provide.

Create a proper task_role_arn for the task definition. That role is the one assumed by the processes running inside the container. As a side note, the error should not read
Unable to locate credentials
which is misleading, but rather something like
Access Denied: you don't have permission to s3:GetObject.
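As a hedged illustration (the role names, account ID, image and resource sizes below are placeholders, not taken from the question), attaching the task role when registering the Fargate task definition with boto3 could look roughly like this:
import boto3

ecs = boto3.client('ecs')

response = ecs.register_task_definition(
    family='airflow-worker',
    networkMode='awsvpc',
    requiresCompatibilities=['FARGATE'],
    cpu='512',
    memory='1024',
    # Role assumed by the code running inside the container; boto3 inside the
    # task picks up its credentials automatically from this role.
    taskRoleArn='arn:aws:iam::123456789012:role/airflow-task-role',
    # Role ECS itself uses to pull the image and ship logs.
    executionRoleArn='arn:aws:iam::123456789012:role/ecsTaskExecutionRole',
    containerDefinitions=[{
        'name': 'airflow',
        'image': '123456789012.dkr.ecr.eu-west-1.amazonaws.com/airflow:latest',
        'essential': True,
    }],
)
print(response['taskDefinition']['taskDefinitionArn'])
The role referenced by taskRoleArn then needs an IAM policy that allows s3:GetObject on the bucket the DAG reads from, which is what the "Access Denied" wording above refers to.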

Related

How to execute Cloud Run containers from an Airflow DAG?

I'm trying to run a container with Cloud Run as a task of an Airflow DAG.
It seems that there is nothing like a CloudRunOperator or similar, and I can't find anything in the documentation (either Cloud Run's or Airflow's).
Has anyone dealt with this problem?
If yes, how can I run a container with Cloud Run and handle xcom?
Thanks in advance!
AFAIK, when a container is deployed to Cloud Run it automatically listens for incoming requests; see the documentation for reference.
So instead you can send a request to the deployed container. You can do this with the code below.
This DAG has three tasks: print_token, task_get_op and process_data.
print_token prints the identity token needed to authenticate requests to your deployed Cloud Run container. I used xcom_pull to get the output of the BashOperator and assign the authentication token to token, so it can be used to authenticate the HTTP request you will perform.
task_get_op performs a GET on the connection cloud_run (which just contains the Cloud Run endpoint), with the header 'Authorization': 'Bearer ' + token for authentication.
process_data performs xcom_pull on task_get_op to get its output and prints it using a PythonOperator.
import datetime

import airflow
from airflow.operators import bash
from airflow.operators import python
from airflow.providers.http.operators.http import SimpleHttpOperator

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}

with airflow.DAG(
        'composer_http_request',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    print_token = bash.BashOperator(
        task_id='print_token',
        # The audience is the endpoint of the deployed Cloud Run container
        bash_command='gcloud auth print-identity-token "--audiences=https://hello-world-fri824-ab.c.run.app"'
    )

    # gets the output from the 'print_token' task
    token = "{{ task_instance.xcom_pull(task_ids='print_token') }}"

    task_get_op = SimpleHttpOperator(
        task_id='get_op',
        method='GET',
        http_conn_id='cloud_run',
        headers={'Authorization': 'Bearer ' + token},
    )

    def process_data_from_http(**kwargs):
        ti = kwargs['ti']
        http_data = ti.xcom_pull(task_ids='get_op')
        print(http_data)

    process_data = python.PythonOperator(
        task_id='process_data_from_http',
        python_callable=process_data_from_http,
        provide_context=True
    )

    print_token >> task_get_op >> process_data
(Screenshots omitted: the cloud_run connection, the DAG graph, and the print_token, task_get_op and process_data logs showing the GET output.)
NOTE: I'm using Cloud Composer 1.17.7 and Airflow 2.0.2 and installed apache-airflow-providers-http to be able to use the SimpleHttpOperator.

Unable to download files from an Amazon S3 bucket to a newly created EC2 instance using boto3 (passing a UserData script as a parameter)

I have written a Python program using boto3 to launch a new instance, supplying a startup script via the UserData parameter as shown below ('--' stands for some ID).
launchec2.py :
import boto3

ec2 = boto3.resource('ec2')
instances = ec2.create_instances(ImageId='--', MinCount=1, MaxCount=1,
                                 InstanceType='t2.micro', KeyName='--',
                                 SecurityGroupIds=['--'],
                                 UserData=open('startup_script.sh').read())
print(instances)
instance = instances[0]
instance.wait_until_running()
instance.load()
# printing dns name
print(instance.public_dns_name)
startup_script.sh :
#!/usr/bin/python
import os

os.system('sudo yum install -y python-pip')
os.system('sudo pip install boto3')
os.system('sudo yum -y update')
os.system('sudo yum install -y httpd')
os.system('sudo service httpd start')
# os.system('cd /var/www/html')

import boto3

access_id_key = ''
secret_access_key = ''
session_token_key = ''

s3 = boto3.resource('s3', aws_access_key_id=access_id_key,
                    aws_secret_access_key=secret_access_key,
                    aws_session_token=session_token_key)
my_bucket = s3.Bucket('cs351-lab2')
s3client = boto3.client('s3', aws_access_key_id=access_id_key,
                        aws_secret_access_key=secret_access_key,
                        aws_session_token=session_token_key)

for file_object in my_bucket.objects.all():
    print(file_object.key, type(file_object.key))
    s3client.download_file('cs351-lab2', file_object.key, file_object.key)
I have supplied every value (access ID, secret key, and session token) correctly. My problem is that, when passed as UserData, the script works fine up to os.system('sudo service httpd start') and then fails to download files from the S3 bucket.
But if I run the script manually on that instance with "./startup_script.sh" (after creating startup_script.sh and making it executable), it works perfectly and downloads all the files from the S3 bucket. I am not sure why it cannot download the files when passed as UserData in launchec2.py.
I am using PuTTY.
Can someone please let me know the solution? It would be of great help.
User Data scripts execute as the root user.
Therefore, they should not use the sudo command.
When you run it manually, you are running it as the ec2-user, which is a different environment.
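To make the point concrete, here is a minimal sketch of the install portion of the questioner's Python user-data script with the sudo prefixes dropped, since cloud-init already runs it as root (the rest of the boto3 download logic would stay as in the original script):
#!/usr/bin/python
# Runs as root under user data, so no sudo is needed for these commands.
import os

os.system('yum install -y python-pip')
os.system('pip install boto3')
os.system('yum -y update')
os.system('yum install -y httpd')
os.system('service httpd start')

# ... the boto3 S3 download code from startup_script.sh continues here unchanged.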

Airflow - GCP - files from DAG folder are not showing up

I'm new to GCP. I have a sample Python script created in a GCP environment which runs fine. I want to schedule it in Airflow. I copied the file into the DAG folder of the environment (gs://us-west2-*******-6f9ce4ef-bucket/dags), but it's not showing up in the Airflow DAG list.
This is the location in the Airflow config:
dags_folder = /home/airflow/gcs/dags
Please let me know how to get my Python code to show up in Airflow. Do I have to set up anything else? I kept everything at the defaults.
Thanks in advance.
What you did is already correct: you placed your Python script in gs://auto-generated-bucket/dags/. I'm not sure whether you were able to use the airflow library in your script, but this library lets you configure the behavior of your DAG in Airflow. You can see an example in the Cloud Composer quickstart.
You can also check an in-depth tutorial on DAGs here.
Sample DAG (test_dag.py) that prints the dag_run.id:
# test_dag.py #
import datetime
import airflow
from airflow.operators import bash_operator
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
default_args = {
'owner': 'Composer Example',
'depends_on_past': False,
'email': [''],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': datetime.timedelta(minutes=5),
'start_date': YESTERDAY,
}
with airflow.DAG(
'this_is_the_test_dag', ## <-- This string will be displayed in the AIRFLOW web interface as the DAG name ##
'catchup=False',
default_args=default_args,
schedule_interval=datetime.timedelta(days=1)) as dag:
# Print the dag_run id from the Airflow logs
print_dag_run_conf = bash_operator.BashOperator(
task_id='print_dag_run_conf', bash_command='echo {{ dag_run.id }}')
(Screenshots omitted: the gs://auto-generated-bucket/dags/ GCS location and the Airflow web server showing the DAG.)

How to use Airflow AWS connection credentials with a BashOperator to transfer files from an AWS S3 bucket to GCS

As I am working with two clouds, my task is to rsync files arriving in an S3 bucket to a GCS bucket. To achieve this I am using the GCP Composer (Airflow) service, where I schedule this rsync operation to sync the files. I am using an Airflow connection (aws_default) to store the AWS access key and secret access key. Everything works fine, but I can see the credentials in the logs, which exposes them, and I don't want them displayed even there. Please advise if there is any way to keep the credentials out of the logs.
import airflow
import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.hooks.base_hook import BaseHook
from datetime import timedelta, datetime
START_TIME = datetime.utcnow() - timedelta(hours=1)

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'wait_for_downstream': True,
    'start_date': START_TIME,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=3)
}

aws_connection = BaseHook.get_connection('aws_default')
bash_env = {
    "AWS_ACCESS_KEY_ID": aws_connection.login,
    "AWS_SECRET_ACCESS_KEY": aws_connection.password
}

rsync_command = '''
set -e;
export AWS_ACCESS_KEY_ID="%s";
export AWS_SECRET_ACCESS_KEY="%s";
''' % (bash_env.get('AWS_ACCESS_KEY_ID'), bash_env.get('AWS_SECRET_ACCESS_KEY')) \
    + '''
gsutil -m rsync -r -n s3://aws_bucket/{{ execution_date.strftime('%Y/%m/%d/%H') }}/ gs://gcp_bucket/good/test/
'''

dag = DAG(
    'rsync',
    default_args=default_args,
    description='This dag is for gsutil rsync from s3 bucket to gcs storage',
    schedule_interval=timedelta(minutes=20),
    dagrun_timeout=timedelta(minutes=15)
)

s3_sync = BashOperator(
    task_id='gsutil_s3_gcp_sync',
    bash_command=rsync_command,
    dag=dag,
    depends_on_past=False,
    execution_timeout=timedelta(hours=1),
)
I would suggest putting the credentials in a boto config file, kept separate from Airflow (more info on the config file here); a minimal sketch follows the key list below.
It has a [Credentials] section:
[Credentials]
aws_access_key_id
aws_secret_access_key
gs_access_key_id
gs_host
gs_host_header
gs_json_host
gs_json_host_header
gs_json_port
gs_oauth2_refresh_token
gs_port
gs_secret_access_key
gs_service_client_id
gs_service_key_file
gs_service_key_file_password
s3_host
s3_host_header
s3_port
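A minimal sketch of such a file, using obvious placeholder values (the path is an assumption; gsutil reads ~/.boto by default, and the BOTO_CONFIG environment variable can point it at another location):
# e.g. /home/airflow/.boto  (placeholder path)
[Credentials]
aws_access_key_id = AKIAEXAMPLEKEYID
aws_secret_access_key = exampleSecretKeyValue
With the keys picked up from the config file, the rsync_command no longer needs to export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, so they never appear in the rendered bash command or the task logs.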

Looking for a boto3 Python example of injecting an AWS Pig step into an already running EMR cluster?

I'm looking for a good boto3 example for an already running AWS EMR cluster into which I wish to inject a Pig step. Previously, with boto 2.42, I used:
from boto.emr.connection import EmrConnection
from boto.emr.step import InstallPigStep, PigStep

# AWS_ACCESS_KEY = ''  # REQUIRED
# AWS_SECRET_KEY = ''  # REQUIRED
# conn = EmrConnection(AWS_ACCESS_KEY, AWS_SECRET_KEY)
# loop next element on bucket_compare list
pig_file = 's3://elasticmapreduce/samples/pig-apache/do-reports2.pig'
INPUT = 's3://elasticmapreduce/samples/pig-apache/input/access_log_1'
OUTPUT = ''  # REQUIRED, S3 bucket for job output

pig_args = ['-p', 'INPUT=%s' % INPUT,
            '-p', 'OUTPUT=%s' % OUTPUT]
pig_step = PigStep('Process Reports', pig_file, pig_args=pig_args)
steps = [InstallPigStep(), pig_step]

conn.run_jobflow(name='prs-dev-test', steps=steps,
                 hadoop_version='2.7.2-amzn-2', ami_version='latest',
                 num_instances=2, keep_alive=False)
The main problem now is that boto3 doesn't provide from boto.emr.connection import EmrConnection or from boto.emr.step import InstallPigStep, PigStep, and I can't find an equivalent set of modules.
After a bit of checking, I found a very simple way to inject Pig script commands from within Python using the awscli and subprocess modules. One can import awscli and subprocess, and then encapsulate and inject the desired Pig steps into an already running EMR cluster with:
import awscli
import subprocess

cmd = 'aws emr add-steps --cluster-id j-GU07FE0VTHNG --steps Type=PIG,Name="AggPigProgram",ActionOnFailure=CONTINUE,Args=[-f,s3://dev-end2end-test/pig_scripts/AggRuleBag.pig,-p,INPUT=s3://dev-end2end-test/input_location,-p,OUTPUT=s3://end2end-test/output_location]'
push = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
push.wait()  # wait for the CLI call to finish so returncode is populated
print(push.returncode)
Of course, you'll have to find your JobFlowId (cluster ID) first, using something like:
aws emr list-clusters --active
run through the same subprocess/Popen pattern as above. And of course you can add monitoring to your heart's delight instead of just a print statement.
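If you would rather stay in boto3 for that lookup too, here is a minimal sketch (list_clusters and its ClusterStates filter are standard boto3 EMR calls; how you pick the cluster out of the result is up to you):
import boto3

emr = boto3.client('emr')

# List clusters that are still alive and print their IDs (the j-XXXXXXXXXXXX values).
active = emr.list_clusters(ClusterStates=['STARTING', 'RUNNING', 'WAITING'])
for cluster in active['Clusters']:
    print(cluster['Id'], cluster['Name'], cluster['Status']['State'])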
Here is how to add a new step to an existing EMR cluster job flow for a Pig job using boto3.
Note: your script, log file, and input and output directories should all be given with the complete path in the format
's3://<bucket>/<directory>/<file_or_key>'
import boto3

emrcon = boto3.client("emr")
cluster_id1 = cluster_status_file_content  # Retrieved from S3, where it was recorded on creation
step_id = emrcon.add_job_flow_steps(
    JobFlowId=str(cluster_id1),
    Steps=[{
        'Name': str(pig_job_name),
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['pig', '-l', str(pig_log_file_full_path),
                     '-f', str(pig_job_run_script_full_path),
                     '-p', 'INPUT=' + str(pig_input_dir_full_path),
                     '-p', 'OUTPUT=' + str(pig_output_dir_full_path)]
        }
    }]
)
(Screenshot showing how to monitor the step omitted.)