I'm new to using the AWS managed Airflow service. I want to use Airflow to start an EC2 instance, make sure it's running, and then do some further work in the instance.
So far I have this dag below which is basically a copy of this.
This, however, fails every time, and I'm not adept enough to know why.
import os
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.providers.amazon.aws.operators.ec2 import EC2StartInstanceOperator, EC2StopInstanceOperator
from airflow.providers.amazon.aws.sensors.ec2 import EC2InstanceStateSensor

INSTANCE_ID = os.getenv("INSTANCE_ID", "instance-id")

with DAG(
    dag_id='example_ec2',
    schedule_interval=None,
    start_date=datetime(2021, 1, 1),
    tags=['example'],
    catchup=False,
) as dag:
    # [START howto_operator_ec2_start_instance]
    start_instance = EC2StartInstanceOperator(
        task_id="ec2_start_instance",
        instance_id=INSTANCE_ID,
    )
    # [END howto_operator_ec2_start_instance]

    # [START howto_sensor_ec2_instance_state]
    instance_state = EC2InstanceStateSensor(
        task_id="ec2_instance_state",
        instance_id=INSTANCE_ID,
        target_state="running",
    )
    # [END howto_sensor_ec2_instance_state]

    chain(start_instance, instance_state)
Just going to answer my own question in case anyone else finds this.
The line below doesn't work, for whatever reason:
INSTANCE_ID = os.getenv("INSTANCE_ID", "instance-id")
The alternative is to do the following:
from airflow.models import Variable  # import Variable
INSTANCE_ID = Variable.get("INSTANCE_ID")  # get INSTANCE_ID
Make sure you have added INSTANCE_ID as a variable by going to Admin -> Variables.
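As a side note, if you want the DAG file to keep parsing even before the Variable has been created, Variable.get accepts a default (a minimal sketch; the fallback value here is just a placeholder):
from airflow.models import Variable

# Falls back to a placeholder if the Variable hasn't been created yet,
# so the DAG file still parses; set the real value under Admin -> Variables.
INSTANCE_ID = Variable.get("INSTANCE_ID", default_var="instance-id")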
I have a simple requirement: I need to run a SageMaker prediction inside a Spark job.
I am trying to run the code below:
ENDPOINT_NAME = "MY-ENDPOINT_NAME"
from sagemaker_pyspark import SageMakerModel
from sagemaker_pyspark import EndpointCreationPolicy
from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
from sagemaker_pyspark.transformation.deserializers import ProtobufResponseRowDeserializer
attachedModel = SageMakerModel(
    existingEndpointName=ENDPOINT_NAME,
    endpointCreationPolicy=EndpointCreationPolicy.DO_NOT_CREATE,
    endpointInstanceType=None,  # Required
    endpointInitialInstanceCount=None,  # Required
    requestRowSerializer=ProtobufRequestRowSerializer(
        featuresColumnName="featureCol"
    ),  # Optional: already default value
    responseRowDeserializer=ProtobufResponseRowDeserializer(schema=ouput_schema),
)
transformedData2 = attachedModel.transform(df)
transformedData2.show()
I get the following error: TypeError: 'JavaPackage' object is not callable
This was solved by putting the sagemaker_pyspark jars on the Spark classpath:
import sagemaker_pyspark
from pyspark import SparkConf, SparkContext

# Build a classpath containing the SageMaker Spark SDK jars and hand it to Spark
classpath = ":".join(sagemaker_pyspark.classpath_jars())
conf = SparkConf() \
    .set("spark.driver.extraClassPath", classpath)
sc = SparkContext(conf=conf)
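Equivalently, if you are building a SparkSession rather than a raw SparkContext, the same classpath can be passed through the builder (a sketch, assuming no Spark JVM has been started yet in the process):
import sagemaker_pyspark
from pyspark.sql import SparkSession

# Put the SageMaker Spark SDK jars on the driver classpath so the JVM-side
# SageMakerModel class can be resolved, which is what causes the
# "'JavaPackage' object is not callable" error when missing.
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = (
    SparkSession.builder
    .config("spark.driver.extraClassPath", classpath)
    .getOrCreate()
)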
I am trying to learn/try out Cloud Composer/Beam/Dataflow on GCP.
I have written functions to do some basic cleaning of data in Python, and used a DAG in Cloud Composer to run this function to download a file from a bucket, process it, and upload it to a bucket at a set frequency.
It was all bespoke-written functionality. I am now trying to figure out how I use a Beam pipeline and Dataflow instead, and use Cloud Composer to kick off the Dataflow job.
The cleaning I am trying to do is take a CSV input of col1,col2,col3,col4,col5 and combine the middle 3 columns to output a CSV of col1,combinedcol234,col5.
Questions I have are...
How do I pull in my own functions within a beam pipeline to do this merge?
Should I be pulling in my own functions or do beam have built in ways of doing this?
How do I then trigger a pipeline from a dag?
Does anyone have any example code on GitHub?
I have been googling and trying to research but can't seem to find anything that helps me get my head around it enough.
Any help would be appreciated. Thank you.
You can use the DataflowCreatePythonJobOperator to run a Dataflow job written in Python.
You have to instantiate your Cloud Composer environment;
Add the Dataflow job file to a bucket;
Add the input file to a bucket;
Add the following DAG to the DAGs directory of the Composer environment:
composer_dataflow_dag.py:
import datetime
from airflow import models
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator
from airflow.utils.dates import days_ago
bucket_path = "gs://<bucket name>"
project_id = "<project name>"
gce_zone = "us-central1-a"
import pytz
tz = pytz.timezone('US/Pacific')
tstmp = datetime.datetime.now(tz).strftime('%Y%m%d%H%M%S')
default_args = {
    # Tell airflow to start one day ago, so that it runs as soon as you upload it
    "start_date": days_ago(1),
    "dataflow_default_options": {
        "project": project_id,
        # Set to your zone
        "zone": gce_zone,
        # This is a subfolder for storing temporary files, like the staged pipeline job.
        "tempLocation": bucket_path + "/tmp/",
    },
}

with models.DAG(
    "composer_dataflow_dag",
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),  # Override to match your needs
) as dag:
    create_mastertable = DataflowCreatePythonJobOperator(
        task_id="create_mastertable",
        py_file='gs://<bucket name>/dataflow-job.py',
        options={
            "runner": "DataflowRunner",
            "project": project_id,
            "region": "us-central1",
            "temp_location": "gs://<bucket name>/",
            "staging_location": "gs://<bucket name>/",
        },
        job_name=f'job{tstmp}',
        location='us-central1',
        wait_until_finished=True,
    )
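As a side note, newer releases of the Google provider deprecate DataflowCreatePythonJobOperator in favour of BeamRunPythonPipelineOperator from the Apache Beam provider. A rough equivalent of the task above, assuming that provider package is installed, might look like this sketch:
from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration

# Rough equivalent of create_mastertable above; attach it to the DAG explicitly
# (or declare it inside the same `with models.DAG(...)` block).
create_mastertable = BeamRunPythonPipelineOperator(
    task_id="create_mastertable",
    runner="DataflowRunner",
    py_file="gs://<bucket name>/dataflow-job.py",
    pipeline_options={"temp_location": bucket_path + "/tmp/"},
    dataflow_config=DataflowConfiguration(
        job_name="{{ task.task_id }}",
        project_id=project_id,
        location="us-central1",
    ),
    dag=dag,
)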
Here is the Dataflow job file, with the modification you want (concatenating the middle columns):
dataflow-job.py
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import os
from datetime import datetime
import pytz
tz = pytz.timezone('US/Pacific')
tstmp = datetime.now(tz).strftime('%Y-%m-%d %H:%M:%S')
bucket_path = "gs://<bucket>"
input_file = f'{bucket_path}/inputFile.txt'
output = f'{bucket_path}/output_{tstmp}.txt'
p = beam.Pipeline(options=PipelineOptions())
( p | 'Read from a File' >> beam.io.ReadFromText(input_file, skip_header_lines=1)
| beam.Map(lambda x:x.split(","))
| beam.Map(lambda x:f'{x[0]},{x[1]}{x[2]}{x[3]},{x[4]}')
| beam.io.WriteToText(output) )
p.run().wait_until_finish()
After running, the result will be stored in the GCS bucket.
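If you want to sanity-check the column-merge logic locally before uploading the file, the same two Map steps can be exercised with Beam's testing utilities (a minimal sketch with made-up rows):
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

# Runs on the local DirectRunner and asserts the merged output.
with TestPipeline() as p:
    rows = p | beam.Create(["a,b,c,d,e", "1,2,3,4,5"])
    merged = (rows
              | beam.Map(lambda x: x.split(","))
              | beam.Map(lambda x: f'{x[0]},{x[1]}{x[2]}{x[3]},{x[4]}'))
    assert_that(merged, equal_to(["a,bcd,e", "1,234,5"]))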
A Beam program is just an ordinary Python program that builds up a pipeline and runs it. For example:
def main():
    with beam.Pipeline() as p:
        p | beam.io.ReadFromText(...) | beam.Map(...) | beam.io.WriteToText(...)
Many examples can be found in the repository, and the programming guide is useful too: https://beam.apache.org/documentation/programming-guide/ . The easiest way to read CSV files is with the dataframes API, which returns an object you can manipulate as if it were a Pandas DataFrame, or which you can turn into a PCollection (where each column is an attribute of a named tuple) and process with Beam's Map, FlatMap, etc., e.g.
pcoll | beam.Map(
    lambda row: (row.col1, func(row.col2, row.col3, row.col4), row.col5))
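Putting those pieces together for the col1,combinedcol234,col5 case might look roughly like this (a sketch; the bucket paths and column names are assumptions, and it assumes the CSV has a header row):
import apache_beam as beam
from apache_beam.dataframe.io import read_csv
from apache_beam.dataframe.convert import to_pcollection

with beam.Pipeline() as p:
    # read_csv infers columns from the header; to_pcollection yields named tuples
    df = p | read_csv("gs://<bucket>/input.csv")
    rows = to_pcollection(df)
    (rows
     | beam.Map(lambda row: f"{row.col1},{row.col2}{row.col3}{row.col4},{row.col5}")
     | beam.io.WriteToText("gs://<bucket>/output", file_name_suffix=".csv"))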
I am trying to use the airflow.providers.amazon.aws.operators.s3_list S3ListOperator to list files in an S3 bucket in my AWS account with the DAG operator below:
list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default'
)
I have configured my Extra Connection details in the form of: {"aws_access_key_id": "<MY_ACCESS_KEY>", "aws_secret_access_key": "<MY_SECRET_KEY>"}
When I run my Airflow job, it appears it is executing fine & my task status is Success. Here is the Log output:
[2021-04-27 11:44:50,009] {base_aws.py:368} INFO - Airflow Connection: aws_conn_id=s3_default
[2021-04-27 11:44:50,013] {base_aws.py:170} INFO - Credentials retrieved from extra_config
[2021-04-27 11:44:50,013] {base_aws.py:84} INFO - Creating session with aws_access_key_id=<MY_ACCESS_KEY> region_name=None
[2021-04-27 11:44:50,027] {base_aws.py:157} INFO - role_arn is None
[2021-04-27 11:44:50,661] {taskinstance.py:1185} INFO - Marking task as SUCCESS. dag_id=two_step, task_id=list_files_in_bucket, execution_date=20210427T184422, start_date=20210427T184439, end_date=20210427T184450
[2021-04-27 11:44:50,676] {taskinstance.py:1246} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-04-27 11:44:50,700] {local_task_job.py:146} INFO - Task exited with return code 0
Is there anything I can do to print the files in my bucket to Logs?
TIA
This code is enough; you don't need to use a print function. Just check the corresponding task log, then go to XCom, and the returned list is there.
list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='ob-air-pre',
    prefix='data/',
    delimiter='/',
    aws_conn_id='aws'
)
The result from executing S3ListOperator is an XCom object that is stored in the Airflow database after the task instance has completed.
You need to declare another operator to feed in the results from the S3ListOperator and print them out.
For example, in Airflow 2.0.0 and up you can use the TaskFlow API:
from airflow.models import DAG
from airflow.providers.amazon.aws.operators.s3_list import S3ListOperator
from airflow.utils import timezone

dag = DAG(
    dag_id='my-workflow',
    start_date=timezone.parse('2021-01-14 21:00')
)

@dag.task(task_id="print_objects")
def print_objects(objects):
    print(objects)

list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default',
    dag=dag
)

print_objects(list_bucket.output)
In older versions,
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.providers.amazon.aws.operators.s3_list import S3ListOperator
from airflow.utils import timezone

dag = DAG(
    dag_id='my-workflow',
    start_date=timezone.parse('2021-01-14 21:00')
)

def print_objects(objects):
    print(objects)

list_bucket = S3ListOperator(
    dag=dag,
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default',
)

print_objects_in_bucket = PythonOperator(
    dag=dag,
    task_id='print_objects_in_bucket',
    python_callable=print_objects,
    op_args=("{{ ti.xcom_pull(task_ids='list_files_in_bucket') }}",)
)

list_bucket >> print_objects_in_bucket
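One caveat if you need the actual list rather than its string representation: the Jinja expression in op_args renders to a string by default. In Airflow 2.1+ you could (as a hedged suggestion, not part of the original answer) create the DAG with render_template_as_native_obj=True so templates render to native Python objects:
# Hypothetical tweak (Airflow 2.1+): render templates to native Python objects,
# so print_objects receives the real list from XCom instead of its string repr.
dag = DAG(
    dag_id='my-workflow',
    start_date=timezone.parse('2021-01-14 21:00'),
    render_template_as_native_obj=True,
)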
I'm trying to schedule a Dataflow job that ends after a set amount of time, using a template. I'm able to do this successfully from the command line, but when I try to do it with Google Cloud Scheduler I run into an error when I create my template.
The error is
File "pipelin_stream.py", line 37, in <module>
main()
File "pipelin_stream.py", line 34, in main
result.cancel()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1638, in cancel
raise IOError('Failed to get the Dataflow job id.')
IOError: Failed to get the Dataflow job id.
The command I'm using to make the template is
python pipelin_stream.py \
--runner Dataflowrunner \
--project $PROJECT \
--temp_location $BUCKET/tmp \
--staging_location $BUCKET/staging \
--template_location $BUCKET/templates/time_template_test \
--streaming
And the pipeline file I have is this
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
from google.cloud import bigquery
import apache_beam as beam
import logging
import argparse
import sys

PROJECT = 'projectID'
schema = 'ex1:DATE, ex2:STRING'
TOPIC = "projects/topic-name/topics/scraping-test"

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic")
    parser.add_argument("--output")
    known_args = parser.parse_known_args(argv)

    p = beam.Pipeline(options=PipelineOptions(region='us-central1', service_account_email='email'))
    (p
     | 'ReadData' >> beam.io.ReadFromPubSub(topic=TOPIC).with_output_types(bytes)
     | 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('tablename'.format(PROJECT), schema=schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
    result = p.run()
    result.wait_until_finish(duration=3000)
    result.cancel()  # If the pipeline has not finished, you can cancel it

if __name__ == '__main__':
    logger = logging.getLogger().setLevel(logging.INFO)
    main()
Does anyone have an idea why I might be getting this error?
The error is raised by the cancel function after the waiting time and it appears to be harmless.
To prove it, I managed to reproduce your exact issue from my virtual machine with Python 3.5. The template is created at the path given by --template_location and can be used to run jobs. Note that I needed to apply some changes to your code to get it to actually work in Dataflow.
In case it is of any use to you, I ended up using this pipeline code
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
from google.cloud import bigquery
import apache_beam as beam
import logging
import argparse
import datetime

# Fill these values in order to have them by default
# Note that the table in BQ needs to have the column names message_body and publish_time
Table = 'projectid:datasetid.tableid'
schema = 'message_body:STRING, publish_time:TIMESTAMP'
TOPIC = "projects/<projectid>/topics/<topicname>"

class AddTimestamps(beam.DoFn):
    def process(self, element, publish_time=beam.DoFn.TimestampParam):
        """Processes each incoming element by extracting the Pub/Sub
        message and its publish timestamp into a dictionary. `publish_time`
        defaults to the publish timestamp returned by the Pub/Sub server. It
        is bound to each element by Beam at runtime.
        """
        yield {
            "message_body": element.decode("utf-8"),
            "publish_time": datetime.datetime.utcfromtimestamp(
                float(publish_time)
            ).strftime("%Y-%m-%d %H:%M:%S.%f"),
        }

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic", default=TOPIC)
    parser.add_argument("--output_table", default=Table)
    args, beam_args = parser.parse_known_args(argv)

    # save_main_session needs to be set to true due to modules being used among the code (mostly datetime)
    # Uncomment the service account email to specify a custom service account
    p = beam.Pipeline(argv=beam_args, options=PipelineOptions(save_main_session=True,
                      region='us-central1'))  # , service_account_email='email'))
    (p
     | 'ReadData' >> beam.io.ReadFromPubSub(topic=args.input_topic).with_output_types(bytes)
     | "Add timestamps to messages" >> beam.ParDo(AddTimestamps())
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(args.output_table, schema=schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
    result = p.run()
    # Warning: Cancel does not work properly in a template
    result.wait_until_finish(duration=3000)
    result.cancel()  # Cancel the streaming pipeline after a while to avoid consuming more resources

if __name__ == '__main__':
    logger = logging.getLogger().setLevel(logging.INFO)
    main()
Afterwards I ran commands:
# Fill accordingly
PROJECT="MYPROJECT-ID"
BUCKET="MYBUCKET"
TEMPLATE_NAME="TRIAL"
# create the template
python3 -m templates.template-pubsub-bigquery \
--runner DataflowRunner \
--project $PROJECT \
--staging_location gs://$BUCKET/staging \
--temp_location gs://$BUCKET/temp \
--template_location gs://$BUCKET/templates/$TEMPLATE_NAME \
--streaming
to create the pipeline (which yields the error you mentioned but still creates the template).
And
# Fill job-name and gcs location accordingly
# Uncomment and fill the parameters should you want to use your own
gcloud dataflow jobs run <job-name> \
--gcs-location "gs://<MYBUCKET>/dataflow/templates/mytemplate"
# --parameters input_topic="", output_table=""
To run the pipeline.
As I said, the template was properly created and the pipeline worked properly.
Edit
Indeed, the cancel function does not work properly in the template. The issue seems to be that it needs the job id at template creation time, which of course does not exist yet, so the call is effectively skipped.
I found another post that handles extracting the job id from the pipeline. I tried some tweaks to make it work within the template code itself, but I think that is not necessary. Given that you want to schedule the executions, I would go for the easier option: execute the streaming pipeline template at a certain time (e.g. 9:01 GMT) and cancel the pipeline with the script below
import logging, re, os
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

def retrieve_job_id():
    # Fill as needed
    project = '<project-id>'
    job_prefix = "<job-name>"
    location = '<location>'

    logging.info("Looking for jobs with prefix {} in region {}...".format(job_prefix, location))

    try:
        credentials = GoogleCredentials.get_application_default()
        dataflow = build('dataflow', 'v1b3', credentials=credentials)

        result = dataflow.projects().locations().jobs().list(
            projectId=project,
            location=location,
        ).execute()

        job_id = "none"
        for job in result['jobs']:
            if re.findall(r'' + re.escape(job_prefix) + '', job['name']):
                job_id = job['id']
                break

        logging.info("Job ID: {}".format(job_id))
        return job_id

    except Exception as e:
        logging.info("Error retrieving Job ID")
        raise KeyError(e)

os.system('gcloud dataflow jobs cancel {}'.format(retrieve_job_id()))
at another time (e.g. 9:05 GMT). This script assumes you run the job with the same job name each time; it takes the latest appearance of that name and cancels it. I tried it several times and it works fine.
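If you prefer not to shell out to gcloud, the same Dataflow API client can issue the cancel request directly (a sketch under the same assumptions about project, job name prefix and location; cancelling is done by setting requestedState to JOB_STATE_CANCELLED):
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

def cancel_job(project, location, job_id):
    # Cancel the job found by retrieve_job_id() through the Dataflow REST API.
    credentials = GoogleCredentials.get_application_default()
    dataflow = build('dataflow', 'v1b3', credentials=credentials)
    dataflow.projects().locations().jobs().update(
        projectId=project,
        location=location,
        jobId=job_id,
        body={"requestedState": "JOB_STATE_CANCELLED"},
    ).execute()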
I use Google Cloud Composer. I have a DAG that uses the pandas.read_csv() function to read a .csv.gz file. The DAG keeps retrying without showing any errors. Here is the Airflow log:
*** Reading remote log from gs://us-central1-data-airflo-dxxxxx-bucket/logs/youtubetv_gcpbucket_to_bq_daily_v2_csv/file_transfer_gcp_to_bq/2018-11-04T20:00:00/1.log.
[2018-11-05 21:03:58,123] {cli.py:374} INFO - Running on host airflow-worker-77846bb966-vgrbz
[2018-11-05 21:03:58,239] {models.py:1196} INFO - Dependencies all met for <TaskInstance: youtubetv_gcpbucket_to_bq_daily_v2_csv.file_transfer_gcp_to_bq 2018-11-04 20:00:00 [queued]>
[2018-11-05 21:03:58,297] {models.py:1196} INFO - Dependencies all met for <TaskInstance: youtubetv_gcpbucket_to_bq_daily_v2_csv.file_transfer_gcp_to_bq 2018-11-04 20:00:00 [queued]>
[2018-11-05 21:03:58,298] {models.py:1406} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of
--------------------------------------------------------------------------------
[2018-11-05 21:03:58,337] {models.py:1427} INFO - Executing <Task(BranchPythonOperator): file_transfer_gcp_to_bq> on 2018-11-04 20:00:00
[2018-11-05 21:03:58,338] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run youtubetv_gcpbucket_to_bq_daily_v2_csv file_transfer_gcp_to_bq 2018-11-04T20:00:00 --job_id 15096 --raw -sd DAGS_FOLDER/dags/testdags/youtubetv_gcp_to_bq_v2.py']
Python code in the DAG:
from datetime import datetime,timedelta
from airflow import DAG
from airflow import models
import os
import io,logging, sys
import pandas as pd
from io import BytesIO, StringIO
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.bash_operator import BashOperator
#GCP
from google.cloud import storage
import google.cloud
from google.cloud import bigquery
from google.oauth2 import service_account
from airflow.operators.slack_operator import SlackAPIPostOperator
from airflow.models import Connection
from airflow.utils.db import provide_session
from airflow.utils.trigger_rule import TriggerRule
def readCSV(checked_date, file_name, **kwargs):
    subDir = checked_date.replace('-', '/')
    fileobj = get_byte_fileobj(BQ_PROJECT_NAME, YOUTUBETV_BUCKET, subDir + "/" + file_name)
    df_chunks = pd.read_csv(fileobj, compression='gzip', memory_map=True, chunksize=1000000)  # returns a TextFileReader
    print("done readCSV")
    return df_chunks
DAG:
file_transfer_gcp_to_bq = BranchPythonOperator(
    task_id='file_transfer_gcp_to_bq',
    provide_context=True,
    python_callable=readCSV,
    op_kwargs={'checked_date': '2018-11-03', 'file_name': 'daily_events_xxxxx_partner_report.csv.gz'}
)
The DAG runs successfully on my local Airflow installation.
def readCSV(checked_date, file_name, **kwargs):
    subDir = checked_date.replace('-', '/')
    fileobj = get_byte_fileobj(BQ_PROJECT_NAME, YOUTUBETV_BUCKET, subDir + "/" + file_name)
    df = pd.read_csv(fileobj, compression='gzip', memory_map=True)
    return df
I tested get_byte_fileobj and it works as a standalone function.
Based on this discussion in the Airflow/Google Composer group, it is a known issue.
One of the reasons can be exhausting all the Composer resources (in my case, memory).
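If memory is the culprit, one way to keep the worker's footprint small is to consume the TextFileReader chunk by chunk and load each chunk straight to BigQuery, rather than returning the whole thing from the callable (a sketch; get_byte_fileobj, BQ_PROJECT_NAME and YOUTUBETV_BUCKET are the helpers/constants from the question, the table id is a placeholder, and load_table_from_dataframe needs pyarrow installed):
import pandas as pd
from google.cloud import bigquery

def load_csv_in_chunks(checked_date, file_name, table_id, **kwargs):
    sub_dir = checked_date.replace('-', '/')
    fileobj = get_byte_fileobj(BQ_PROJECT_NAME, YOUTUBETV_BUCKET, sub_dir + "/" + file_name)
    client = bigquery.Client(project=BQ_PROJECT_NAME)
    # Load each chunk immediately so only one chunk is in memory at a time.
    for chunk in pd.read_csv(fileobj, compression='gzip', chunksize=100_000):
        client.load_table_from_dataframe(chunk, table_id).result()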
I had a similar issue recently.
In my case it was because the Kubernetes workers were overloaded.
You can watch the workers' performance on the Kubernetes dashboard to see whether your case is a cluster-overloading issue.
If it is, you can try setting the Airflow configuration value celeryd_concurrency lower to reduce the parallelism in a worker, and see whether the cluster load goes down.