Apache Airflow S3ListOperator not listing files - amazon-web-services

I am trying to use the airflow.providers.amazon.aws.operators.s3_list S3ListOperator to list files in an S3 bucket in my AWS account with the DAG operator below:
list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default'
)
I have configured my Extra Connection details in the form of: {"aws_access_key_id": "<MY_ACCESS_KEY>", "aws_secret_access_key": "<MY_SECRET_KEY>"}
When I run my Airflow job, it appears to execute fine and my task status is Success. Here is the log output:
[2021-04-27 11:44:50,009] {base_aws.py:368} INFO - Airflow Connection: aws_conn_id=s3_default
[2021-04-27 11:44:50,013] {base_aws.py:170} INFO - Credentials retrieved from extra_config
[2021-04-27 11:44:50,013] {base_aws.py:84} INFO - Creating session with aws_access_key_id=<MY_ACCESS_KEY> region_name=None
[2021-04-27 11:44:50,027] {base_aws.py:157} INFO - role_arn is None
[2021-04-27 11:44:50,661] {taskinstance.py:1185} INFO - Marking task as SUCCESS. dag_id=two_step, task_id=list_files_in_bucket, execution_date=20210427T184422, start_date=20210427T184439, end_date=20210427T184450
[2021-04-27 11:44:50,676] {taskinstance.py:1246} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-04-27 11:44:50,700] {local_task_job.py:146} INFO - Task exited with return code 0
Is there anything I can do to print the files in my bucket to Logs?
TIA

This code is enough; you don't need to use a print function. Just check the corresponding task log, then go to XCom, and the returned list is there.
list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='ob-air-pre',
    prefix='data/',
    delimiter='/',
    aws_conn_id='aws'
)

The result from executing S3ListOperator is an XCom object that is stored in the Airflow database after the task instance has completed.
You need to declare another operator to feed in the results from the S3ListOperator and print them out.
For example, in Airflow 2.0.0 and up you can use TaskFlow:
from airflow.models import DAG
from airflow.providers.amazon.aws.operators.s3_list import S3ListOperator
from airflow.utils import timezone

dag = DAG(
    dag_id='my-workflow',
    start_date=timezone.parse('2021-01-14 21:00')
)

@dag.task(task_id="print_objects")
def print_objects(objects):
    print(objects)

list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default',
    dag=dag
)

print_objects(list_bucket.output)
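Note that list_bucket.output is an XComArg: passing it into the decorated print_objects function both sets the task dependency and resolves to the listed keys at runtime, so no explicit xcom_pull or bitshift dependency is needed here.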
In older versions,
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
# On Airflow 1.10 the S3ListOperator import below requires the Amazon backport
# provider; without it, use airflow.contrib.operators.s3_list_operator instead.
from airflow.providers.amazon.aws.operators.s3_list import S3ListOperator
from airflow.utils import timezone

dag = DAG(
    dag_id='my-workflow',
    start_date=timezone.parse('2021-01-14 21:00')
)

def print_objects(objects):
    print(objects)

list_bucket = S3ListOperator(
    dag=dag,
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default',
)

print_objects_in_bucket = PythonOperator(
    dag=dag,
    task_id='print_objects_in_bucket',
    python_callable=print_objects,
    op_args=("{{ ti.xcom_pull(task_ids='list_files_in_bucket') }}",)
)

list_bucket >> print_objects_in_bucket
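Note that the Jinja-templated xcom_pull value is rendered into op_args as a string; on Airflow 2.1+ you can set render_template_as_native_obj=True on the DAG if the callable needs the actual Python list.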

Related

MWAA to start and stop EC2 Instances

I'm new to using the AWS managed Airflow service. I want to use Airflow to start an EC2 instance, make sure it's running, and then do some further work in the instance.
So far I have the DAG below, which is basically a copy of this.
This, however, fails every time, and I'm not adept enough to know why.
import os
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.providers.amazon.aws.operators.ec2 import EC2StartInstanceOperator, EC2StopInstanceOperator
from airflow.providers.amazon.aws.sensors.ec2 import EC2InstanceStateSensor

INSTANCE_ID = os.getenv("INSTANCE_ID", "instance-id")

with DAG(
    dag_id='example_ec2',
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    tags=['example'],
    catchup=False,
) as dag:
    # [START howto_operator_ec2_start_instance]
    start_instance = EC2StartInstanceOperator(
        task_id="ec2_start_instance",
        instance_id=INSTANCE_ID,
    )
    # [END howto_operator_ec2_start_instance]

    # [START howto_sensor_ec2_instance_state]
    instance_state = EC2InstanceStateSensor(
        task_id="ec2_instance_state",
        instance_id=INSTANCE_ID,
        target_state="running",
    )
    # [END howto_sensor_ec2_instance_state]

    chain(start_instance, instance_state)
Just going to answer my own question in case anyone else finds this.
The line below doesn't work, for whatever reason:
INSTANCE_ID = os.getenv("INSTANCE_ID", "instance-id")
The alternative is to do the following:
from airflow.models import Variable  # import Variable
INSTANCE_ID = Variable.get("INSTANCE_ID")  # get INSTANCE_ID
Make sure you have added INSTANCE_ID as a variable by going to Admin -> Variables.
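If you want a fallback so that DAG parsing does not fail before the variable exists, Variable.get also accepts a default; a minimal sketch (the fallback instance id below is just a placeholder):
from airflow.models import Variable

# Fall back to a placeholder instance id (hypothetical value) if INSTANCE_ID
# has not been created yet under Admin -> Variables.
INSTANCE_ID = Variable.get("INSTANCE_ID", default_var="i-0123456789abcdef0")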

Advice/Guidance - composer/beam/dataflow on gcp

I am trying to learn/try out cloud composer/beam/dataflow on gcp.
I have written functions to do some basic cleaning of data in python, and used a DAG in cloud composer to run this function to download a file from a bucket, process it, and upload it to a bucket at a set frequency.
It was all bespoke, hand-written functionality. I am now trying to figure out how to use a Beam pipeline and Dataflow instead, and have Cloud Composer kick off the Dataflow job.
The cleaning I am trying to do, is take a csv input of col1,col2,col3,col4,col5 and combine the middle 3 columns to output a csv of col1,combinedcol234,col5.
Questions I have are...
How do I pull in my own functions within a beam pipeline to do this merge?
Should I be pulling in my own functions or do beam have built in ways of doing this?
How do I then trigger a pipeline from a dag?
Does anyone have any example code on git hub?
I have been googling and trying to research but can't seem to find anything that helps me get my head around it enough.
Any help would be appreciated. Thank you.
You can use the DataflowCreatePythonJobOperator to run a Dataflow job written in Python.
You have to instantiate your Cloud Composer environment;
Add the Dataflow job file to a bucket;
Add the input file to a bucket;
Add the following DAG to the DAGs directory of the Composer environment:
composer_dataflow_dag.py:
import datetime

import pytz
from airflow import models
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator
from airflow.utils.dates import days_ago

bucket_path = "gs://<bucket name>"
project_id = "<project name>"
gce_zone = "us-central1-a"

tz = pytz.timezone('US/Pacific')
tstmp = datetime.datetime.now(tz).strftime('%Y%m%d%H%M%S')

default_args = {
    # Tell airflow to start one day ago, so that it runs as soon as you upload it
    "start_date": days_ago(1),
    "dataflow_default_options": {
        "project": project_id,
        # Set to your zone
        "zone": gce_zone,
        # This is a subfolder for storing temporary files, like the staged pipeline job.
        "tempLocation": bucket_path + "/tmp/",
    },
}

with models.DAG(
    "composer_dataflow_dag",
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),  # Override to match your needs
) as dag:
    create_mastertable = DataflowCreatePythonJobOperator(
        task_id="create_mastertable",
        py_file='gs://<bucket name>/dataflow-job.py',
        options={
            "runner": "DataflowRunner",
            "project": project_id,
            "region": "us-central1",
            "temp_location": "gs://<bucket name>/",
            "staging_location": "gs://<bucket name>/",
        },
        job_name=f'job{tstmp}',
        location='us-central1',
        wait_until_finished=True,
    )
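Note that in newer versions of the Google provider package, DataflowCreatePythonJobOperator has been deprecated in favor of BeamRunPythonPipelineOperator from the Apache Beam provider, so check which provider version your Composer environment ships with.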
Here is the Dataflow job file, with the modification you wanted in order to concatenate some of the columns:
dataflow-job.py
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from datetime import datetime
import pytz

tz = pytz.timezone('US/Pacific')
tstmp = datetime.now(tz).strftime('%Y-%m-%d %H:%M:%S')

bucket_path = "gs://<bucket>"
input_file = f'{bucket_path}/inputFile.txt'
output = f'{bucket_path}/output_{tstmp}.txt'

p = beam.Pipeline(options=PipelineOptions())

(p
 | 'Read from a File' >> beam.io.ReadFromText(input_file, skip_header_lines=1)
 | beam.Map(lambda x: x.split(","))
 | beam.Map(lambda x: f'{x[0]},{x[1]}{x[2]}{x[3]},{x[4]}')
 | beam.io.WriteToText(output))

p.run().wait_until_finish()
After running, the result will be stored in the GCS bucket.
A Beam program is just an ordinary Python program that builds up a pipeline and runs it. For example:
import apache_beam as beam

def main():
    with beam.Pipeline() as p:
        p | beam.io.ReadFromText(...) | beam.Map(...) | beam.io.WriteToText(...)
Many examples can be found in the repository, and the programming guide is useful too: https://beam.apache.org/documentation/programming-guide/ . The easiest way to read CSV files is with the dataframes API, which returns an object you can manipulate as if it were a Pandas DataFrame, or you can turn it into a PCollection (where each column is an attribute of a named tuple) and process it with Beam's Map, FlatMap, etc., e.g.
pcoll | beam.Map(
    lambda row: (row.col1, func(row.col2, row.col3, row.col4), row.col5))
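For the col1..col5 case from the original question, a minimal sketch (bucket paths and column names are assumptions) using the dataframes API and to_pcollection might look like:
import apache_beam as beam
from apache_beam.dataframe.io import read_csv
from apache_beam.dataframe.convert import to_pcollection

with beam.Pipeline() as p:
    # read_csv infers the schema from the CSV header (col1..col5 assumed here)
    df = p | read_csv('gs://<bucket>/inputFile.csv')
    rows = to_pcollection(df)  # each element is a named tuple with one attribute per column
    (rows
     | beam.Map(lambda r: f'{r.col1},{r.col2}{r.col3}{r.col4},{r.col5}')
     | beam.io.WriteToText('gs://<bucket>/combined'))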

Airflow PostgresToGoogleCloudStorageOperator authentication error

Is there any documentation regarding how to authenticate with google cloud storage through an airflow dag?
I am trying to use airflow's PostgresToGoogleCloudStorageOperator to upload the results from a query to google cloud storage. However I am getting the below error.
{gcp_api_base_hook.py:146} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
{taskinstance.py:1150} ERROR - 403 POST https://storage.googleapis.com/upload/storage/v1/b/my_test_bucket123/o?uploadType=multipart: ('Request failed with status code', 403, 'Expected one of', <HTTPStatus.OK: 200>)
I have configured "gcp_conn_id" in the Airflow UI and provided the JSON from my service account key, but I am still getting this error.
Below is the entire dag file
from datetime import datetime
from datetime import timedelta

from airflow import DAG
from airflow.contrib.operators.postgres_to_gcs_operator import PostgresToGoogleCloudStorageOperator

DESTINATION_BUCKET = 'test'
DESTINATION_DIRECTORY = "uploads"

dag_params = {
    'dag_id': 'upload_test',
    'start_date': datetime(2020, 7, 7),
    'schedule_interval': '@once',
}

with DAG(**dag_params) as dag:
    upload = PostgresToGoogleCloudStorageOperator(
        task_id="upload",
        bucket=DESTINATION_BUCKET,
        filename=DESTINATION_DIRECTORY + "/{{ execution_date }}" + "/{}.json",
        sql='''SELECT * FROM schema.table;''',
        postgres_conn_id="postgres_default",
        gcp_conn_id="gcp_default"
    )
Sure. There is this very nice documentation with examples for Google Connection:
https://airflow.apache.org/docs/apache-airflow-providers-google/stable/connections/gcp.html
Also GCP has this very nice overview of all the authentication methods you can use:
https://cloud.google.com/endpoints/docs/openapi/authentication-method
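In practice, the fix is usually to point the connection at the service account key file. A minimal sketch (the key names follow the Google connection docs linked above; the path, project id, and scope are placeholder assumptions) of the Extra JSON for the gcp_default connection:
import json

# Hypothetical Extra JSON for the "gcp_default" connection (connection type: Google Cloud).
# Paste the printed JSON into the connection's Extra field in the Airflow UI so the hook
# uses this key file instead of falling back to google.auth.default().
gcp_extra = json.dumps({
    "extra__google_cloud_platform__key_path": "/path/to/service-account.json",
    "extra__google_cloud_platform__project": "<my-project>",
    "extra__google_cloud_platform__scope": "https://www.googleapis.com/auth/cloud-platform",
})
print(gcp_extra)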

How to enable glue logging for AWS Glue script only

I am struggling to enable DEBUG logging for a Glue script using PySpark only.
I have tried:
import ...

def quiet_logs(sc):
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel(logger.Level.ERROR)
    logger.LogManager.getLogger("akka").setLevel(logger.Level.ERROR)

def main():
    # Get the Spark Context
    sc = SparkContext.getOrCreate()
    sc.setLogLevel("DEBUG")
    quiet_logs(sc)
    context = GlueContext(sc)
    logger = context.get_logger()
    logger.debug("I only want to see this..., and for all others, only ERRORS")
    ...
I have '--enable-continuous-cloudwatch-log' set to true, but simply cannot get the log trail to only write debug messages for my own script.
I haven't managed to do exactly what you want, but I was able to do something similar by setting up a separate custom log, and this might achieve what you're after.
import logging
import os
import sys

from awsglue.utils import getResolvedOptions  # Glue-provided helper for job arguments
from watchtower import CloudWatchLogHandler

args = getResolvedOptions(sys.argv, ["JOB_RUN_ID"])
job_run_id = args["JOB_RUN_ID"]

os.environ["AWS_DEFAULT_REGION"] = "eu-west-1"

lsn = f"{job_run_id}_custom"
cw = CloudWatchLogHandler(
    log_group="/aws-glue/jobs/logs-v2", stream_name=lsn, send_interval=4
)

slog = logging.getLogger()
slog.setLevel(logging.DEBUG)
slog.handlers = []
slog.addHandler(cw)

slog.info("hello from the custom logger")
Now anything you log to slog will go into a separate log stream, accessible as one of the entries in the 'output' logs.
Note that you need to include watchtower via --additional-python-modules when you run the Glue job.
More info here
This should just be a regular logging setup in Python.
I have tested this in a Glue job and only one debug message was visible. The job was configured with "Continuous logging" and the messages ended up in "Output logs".
import logging
# Warning level on root
logging.basicConfig(level=logging.WARNING, format='%(asctime)s [%(levelname)s] [%(name)s] %(message)s')
logger = logging.getLogger(__name__)
# Debug level only for this logger
logger.setLevel(logging.DEBUG)
logger.debug("DEBUG_LOG test")
You can also mute specific loggers:
logging.getLogger('botocore.vendored.requests.packages.urllib3.connectionpool').setLevel(logging.WARN)

airflow DAG keeps retrying without showing any errors

I use Google Cloud Composer. I have a DAG that uses the pandas.read_csv() function to read a .csv.gz file. The DAG keeps retrying without showing any errors. Here is the Airflow log:
*** Reading remote log from gs://us-central1-data-airflo-dxxxxx-bucket/logs/youtubetv_gcpbucket_to_bq_daily_v2_csv/file_transfer_gcp_to_bq/2018-11-04T20:00:00/1.log.
[2018-11-05 21:03:58,123] {cli.py:374} INFO - Running on host airflow-worker-77846bb966-vgrbz
[2018-11-05 21:03:58,239] {models.py:1196} INFO - Dependencies all met for <TaskInstance: youtubetv_gcpbucket_to_bq_daily_v2_csv.file_transfer_gcp_to_bq 2018-11-04 20:00:00 [queued]>
[2018-11-05 21:03:58,297] {models.py:1196} INFO - Dependencies all met for <TaskInstance: youtubetv_gcpbucket_to_bq_daily_v2_csv.file_transfer_gcp_to_bq 2018-11-04 20:00:00 [queued]>
[2018-11-05 21:03:58,298] {models.py:1406} INFO -
----------------------------------------------------------------------
---------
Starting attempt 1 of
----------------------------------------------------------------------
---------
[2018-11-05 21:03:58,337] {models.py:1427} INFO - Executing <Task(BranchPythonOperator): file_transfer_gcp_to_bq> on 2018-11-04 20:00:00
[2018-11-05 21:03:58,338] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run youtubetv_gcpbucket_to_bq_daily_v2_csv file_transfer_gcp_to_bq 2018-11-04T20:00:00 --job_id 15096 --raw -sd DAGS_FOLDER/dags/testdags/youtubetv_gcp_to_bq_v2.py']
python code in DAG:
from datetime import datetime, timedelta
from airflow import DAG
from airflow import models
import os
import io, logging, sys
import pandas as pd
from io import BytesIO, StringIO
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.bash_operator import BashOperator
# GCP
from google.cloud import storage
import google.cloud
from google.cloud import bigquery
from google.oauth2 import service_account
from airflow.operators.slack_operator import SlackAPIPostOperator
from airflow.models import Connection
from airflow.utils.db import provide_session
from airflow.utils.trigger_rule import TriggerRule

def readCSV(checked_date, file_name, **kwargs):
    subDir = checked_date.replace('-', '/')
    fileobj = get_byte_fileobj(BQ_PROJECT_NAME, YOUTUBETV_BUCKET, subDir + "/" + file_name)
    df_chunks = pd.read_csv(fileobj, compression='gzip', memory_map=True, chunksize=1000000)  # returns a TextFileReader
    print("done readCSV")
    return df_chunks
DAG:
file_transfer_gcp_to_bq = BranchPythonOperator(
    task_id='file_transfer_gcp_to_bq',
    provide_context=True,
    python_callable=readCSV,
    op_kwargs={'checked_date': '2018-11-03', 'file_name': 'daily_events_xxxxx_partner_report.csv.gz'}
)
The DAG runs successfully on my local Airflow installation.
def readCSV(checked_date, file_name, **kwargs):
    subDir = checked_date.replace('-', '/')
    fileobj = get_byte_fileobj(BQ_PROJECT_NAME, YOUTUBETV_BUCKET, subDir + "/" + file_name)
    df = pd.read_csv(fileobj, compression='gzip', memory_map=True)
    return df
I tested get_byte_fileobj and it works as a standalone function.
Based on this discussion in the airflow google composer group, this is a known issue.
One of the reasons can be exhausting all of the Composer resources (in my case, memory).
I had a similar issue recently.
In my case it was because the Kubernetes workers were overloaded.
You can watch the worker performance on the Kubernetes dashboard to see whether your case is a cluster overloading issue.
If so, you can try setting the Airflow configuration value celeryd_concurrency lower to reduce the parallelism per worker and see whether the cluster load goes down.
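If memory is indeed the bottleneck, a minimal sketch (reusing get_byte_fileobj and the bucket/project names from the question) is to process the file chunk by chunk inside the callable instead of returning the whole DataFrame, so it never has to fit in worker memory or go through XCom:
import pandas as pd

def read_csv_in_chunks(checked_date, file_name, **kwargs):
    sub_dir = checked_date.replace('-', '/')
    fileobj = get_byte_fileobj(BQ_PROJECT_NAME, YOUTUBETV_BUCKET,
                               sub_dir + "/" + file_name)
    row_count = 0
    # Iterate over 1M-row chunks; replace the body with your per-chunk load to BigQuery.
    for chunk in pd.read_csv(fileobj, compression='gzip', chunksize=1_000_000):
        row_count += len(chunk)
    print(f"processed {row_count} rows")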