AWS StepFunctions Task state gets cancelled when tearing down a Google Cloud cluster

I am using AWS StepFunctions to carry out several tasks on the Google Cloud side: creating a Dataproc cluster, submitting a job to it, and then tearing it down. Each of these has its own Task state, plus a "poller" Task that checks when the job has finished so the state machine can move on to the next Task.
The issue is that, when tearing down the cluster, the Task goes into the "cancelled" (gray) status instead of "in progress", and so does the poller Task that follows it. Once the cluster deletion Lambda function has executed the cluster deletion method, the state machine should move on to the poller Task.
Here is a look at the cluster deletion lambda function:
from pprint import pprint
from google.cloud import storage
import googleapiclient.discovery
from rkstr8.cloud.google import GoogleCloudLambdaAuth
import time
def handler(event, context):
    creds = event['GCP_creds']
    GoogleCloudLambdaAuth(creds).configure_google_creds()
    dataproc = googleapiclient.discovery.build('dataproc', 'v1')
    project_id = event['gcp-administrative']['project']
    zone = event['gcp-administrative']['zone']
    try:
        region_as_list = zone.split('-')[:-1]
        region = '-'.join(region_as_list)
    except (AttributeError, IndexError, ValueError):
        raise ValueError('Invalid zone provided, please check your input.')
    cluster = event['dataproc-administrative']['cluster_name']
    print('Tearing down cluster...')
    request = dataproc.projects().regions().clusters().delete(
        projectId=project_id,
        region=region,
        clusterName=cluster)
    time.sleep(30)
    result = request.execute()
    return result
Here is what the relevant part of the state machine building code looks like:
dproc_submit_state = AsyncPoller(
    stats_path=DPROC_SUBMIT_POLLER_STATUS_PATH,
    async_task=Task(
        name=DPROC_SUBMIT,
        resource=DPROC_SUBMIT_ARN_VAR,
        input_path=DPROC_SUBMIT_INPUT_PATH,
        result_path=DPROC_SUBMIT_RESULT_PATH,
        next=DPROC_SUBMIT_POLLER
    ),
    pollr_task=Task(
        name=DPROC_SUBMIT_POLLER,
        resource=DPROC_SUBMIT_POLLER_ARN_VAR,
        input_path=DPROC_SUBMIT_RESULT_PATH,
        result_path=DPROC_SUBMIT_POLLER_STATUS_PATH
    ),
    faild_task=Fail(
        name='HailScriptFailed'
    ),
    succd_task=DPROC_DELETE,
    pollr_wait_time=self.conf["POLLER_WAIT_TIME"]
).states()
dproc_delete_state = AsyncPoller(
    stats_path=DPROC_DELETE_POLLER_STATUS_PATH,
    async_task=Task(
        name=DPROC_DELETE,
        resource=DPROC_DELETE_ARN_VAR,
        input_path=DPROC_DELETE_INPUT_PATH,
        result_path=DPROC_DELETE_RESULT_PATH,
        next=DPROC_DELETE_POLLER
    ),
    pollr_task=Task(
        name=DPROC_DELETE_POLLER,
        resource=DPROC_DELETE_POLLER_ARN_VAR,
        input_path=DPROC_DELETE_RESULT_PATH,
        result_path=DPROC_DELETE_POLLER_STATUS_PATH
    ),
    faild_task=Fail(
        name='ClusterDeleteFailed'
    ),
    succd_task='PipelineSucceeded',
    pollr_wait_time=self.conf["POLLER_WAIT_TIME"]
).states()
Here is what the state machine looks like:

Why are you sleeping for 30 seconds between creating a request and executing it?
The default timeout for lambda is 3 seconds. My guess is that your lambda is just timing out.
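One way to check the timeout guess is to compare the planned sleep with the time the Lambda runtime still allows. The sketch below is only an illustration: fits_in_remaining_time and FakeContext are hypothetical names that do not appear in the original code, and the only real API relied on is the get_remaining_time_in_millis() method of the Lambda context object.

```python
def fits_in_remaining_time(context, planned_seconds, margin_seconds=1.0):
    """Return True if a blocking step of planned_seconds can finish
    before the Lambda invocation times out (hypothetical helper)."""
    remaining = context.get_remaining_time_in_millis() / 1000.0
    return planned_seconds + margin_seconds <= remaining


class FakeContext:
    """Stand-in for the Lambda context object, for local experiments only."""

    def __init__(self, remaining_ms):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self._remaining_ms


# With the default 3-second timeout, a 30-second sleep cannot complete.
default_timeout = FakeContext(remaining_ms=3000)
print(fits_in_remaining_time(default_timeout, planned_seconds=30))  # False
```

Inside the real handler the same check could be logged before the sleep, which would show in CloudWatch whether the function is being cut off.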

start, monitor and define script of SageMaker processing job from local machine

I am looking at this, which all makes sense. Let us focus on this bit of code:
from sagemaker.processing import ProcessingInput, ProcessingOutput
sklearn_processor.run(
    code="preprocessing.py",
    inputs=[
        ProcessingInput(source="s3://your-bucket/path/to/your/data", destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)
preprocessing_job_description = sklearn_processor.jobs[-1].describe()
Here preprocessing.py obviously has to be in the cloud. I am curious whether one could also put scripts into an S3 bucket and trigger the job remotely. I can easily do this with hyperparameter optimisation, which does not require dedicated scripts, though, as I use an OOTB training image.
In this case I can fire off the job like so:
from time import gmtime, strftime
import boto3
tuning_job_name = "amazing-hpo-job-" + strftime("%d-%H-%M-%S", gmtime())
smclient = boto3.Session().client("sagemaker")
smclient.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name,
    HyperParameterTuningJobConfig=tuning_job_config,
    TrainingJobDefinition=training_job_definition
)
and then monitor the job's progress:
from pprint import pprint
smclient = boto3.Session().client("sagemaker")
tuning_job_result = smclient.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)
status = tuning_job_result["HyperParameterTuningJobStatus"]
if status != "Completed":
    print("Reminder: the tuning job has not been completed.")
job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
print("%d training jobs have completed" % job_count)
objective = tuning_job_result["HyperParameterTuningJobConfig"]["HyperParameterTuningJobObjective"]
is_minimize = objective["Type"] != "Maximize"
objective_name = objective["MetricName"]
if tuning_job_result.get("BestTrainingJob", None):
    print("Best model found so far:")
    pprint(tuning_job_result["BestTrainingJob"])
else:
    print("No training jobs have reported results yet.")
I would think starting and monitoring a SageMaker processing job from a local machine should be possible, as with an HPO job, but what about the script(s)? Ideally I would like to develop and test them locally and then run them remotely. I hope this makes sense.
I'm not sure I understand the comparison to a Tuning Job.
Based on what you have described, in this case preprocessing.py is actually stored locally. The SageMaker SDK will upload it to S3 so that the remote Processing Job can access it. I suggest launching the Job and then taking a look at its inputs in the SageMaker Console.
If you want to test the Processing Job locally, you can do so using Local Mode. This imitates the Job on your machine, which helps in debugging the script before kicking off a remote Processing Job. Note that Docker is required to make use of Local Mode.
Example code for local mode:
from sagemaker.local import LocalSession
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}
# For local training a dummy role will be sufficient
role = 'arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001'
processor = ScriptProcessor(command=['python3'],
                            image_uri='sagemaker-scikit-learn-processing-local',
                            role=role,
                            instance_count=1,
                            instance_type='local')
processor.run(code='processing_script.py',
              inputs=[ProcessingInput(
                  source='./input_data/',
                  destination='/opt/ml/processing/input_data/')],
              outputs=[ProcessingOutput(
                  output_name='word_count_data',
                  source='/opt/ml/processing/processed_data/')],
              arguments=['job-type', 'word-count']
              )
preprocessing_job_description = processor.jobs[-1].describe()
output_config = preprocessing_job_description['ProcessingOutputConfig']
print(output_config)
for output in output_config['Outputs']:
    if output['OutputName'] == 'word_count_data':
        word_count_data_file = output['S3Output']['S3Uri']
print('Output file is located on: {}'.format(word_count_data_file))

AWS Error "Calling the invoke API action failed with this message: Rate Exceeded" when I use s3.get_paginator('list_objects_v2')

A third-party application uploads around 10,000 objects to my bucket and prefix every day. I need to fetch all objects that were uploaded to that bucket and prefix in the last 24 hours.
The bucket and prefix already contain a very large number of files.
So I assume that when I call
response = s3_paginator.paginate(Bucket=bucket,Prefix='inside-bucket-level-1/', PaginationConfig={"PageSize": 1000})
the paginator makes multiple calls to the S3 API, and that may be why I get the Rate Exceeded error.
Below is my Python Lambda Function.
import json
import boto3
import time
from datetime import datetime, timedelta
def lambda_handler(event, context):
    s3 = boto3.client("s3")
    from_date = datetime.today() - timedelta(days=1)
    string_from_date = from_date.strftime("%Y-%m-%d, %H:%M:%S")
    print("Date :", string_from_date)
    s3_paginator = s3.get_paginator('list_objects_v2')
    list_of_buckets = ['kush-dragon-data']
    bucket_wise_list = {}
    for bucket in list_of_buckets:
        response = s3_paginator.paginate(Bucket=bucket,Prefix='inside-bucket-level-1/', PaginationConfig={"PageSize": 1000})
        filtered_iterator = response.search(
            "Contents[?to_string(LastModified)>='\"" + string_from_date + "\"'].Key")
        keylist = []
        for key_data in filtered_iterator:
            if "/" in key_data:
                splitted_array = key_data.split("/")
                if len(splitted_array) > 1:
                    if splitted_array[-1]:
                        keylist.append(splitted_array[-1])
            else:
                keylist.append(key_data)
        bucket_wise_list.update({bucket: keylist})
    print("Total Number Of Object = ", bucket_wise_list)
    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps(bucket_wise_list)
    }
When I execute the Lambda function above, it shows the following error:
"Calling the invoke API action failed with this message: Rate Exceeded."
Can anyone help me resolve this error and meet my requirement?
This is probably due to limits on your account. You should add retries with a few seconds between attempts, or increase the page size.
This is most likely due to you reaching your quota limit for AWS S3 API calls. The "bigger hammer" solution is to request a quota increase, but if you don't want to do that, there is another way: the retries built into botocore.Config, for example:
import json
import time
from datetime import datetime, timedelta
from boto3 import client
from botocore.config import Config
config = Config(
    retries = {
        'max_attempts': 10,
        'mode': 'standard'
    }
)
def lambda_handler(event, context):
    s3 = client('s3', config=config)
    ###ALL OF YOUR CURRENT PYTHON CODE EXACTLY THE WAY IT IS###
This config will use an exponentially increasing sleep timer for a maximum number of retries. From the docs:
Any retry attempt will include an exponential backoff by a base factor of 2 for a maximum backoff time of 20 seconds.
There is also an adaptive mode, which is still experimental. For more info, see the docs on botocore.Config retries.
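To get a feel for the quoted backoff, the sketch below computes an upper bound for each delay, assuming the wait doubles with every attempt and is capped at 20 seconds. The actual botocore implementation also applies random jitter, so real waits can be shorter than these ceilings; backoff_ceiling is a hypothetical name used only for illustration.

```python
def backoff_ceiling(attempt, base=2, max_backoff=20.0):
    """Upper bound in seconds for the wait before retry number `attempt` (1-based)."""
    return min(max_backoff, float(base ** (attempt - 1)))


# Ceilings for the first seven retries: doubling until the 20-second cap.
ceilings = [backoff_ceiling(n) for n in range(1, 8)]
print(ceilings)  # [1.0, 2.0, 4.0, 8.0, 16.0, 20.0, 20.0]
```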
Another (much less robust IMO) option would be to write your own paginator with a sleep programmed in, though you'd probably just want to use the builtin backoff in 99.99% of cases (even if you do have to write your own paginator). Note that the code below is untested and is not asynchronous, so the sleep is added on top of the wait time for each page response. To make the "sleep time" exactly sleep_secs, you'll need to use concurrent.futures or asyncio (AWS built in paginators mostly use concurrent.futures):
from boto3 import client
from typing import Generator
from time import sleep
def get_pages(bucket: str, prefix: str, page_size: int, sleep_secs: float) -> Generator:
    s3 = client('s3')
    # Fetch the first page; later pages need the continuation token.
    page: dict = s3.list_objects_v2(
        Bucket=bucket,
        MaxKeys=page_size,
        Prefix=prefix
    )
    next_token: str = page.get('NextContinuationToken')
    yield page
    while next_token:
        # Pause between requests to stay under the rate limit.
        sleep(sleep_secs)
        page = s3.list_objects_v2(
            Bucket=bucket,
            MaxKeys=page_size,
            Prefix=prefix,
            ContinuationToken=next_token
        )
        next_token = page.get('NextContinuationToken')
        yield page
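For the original requirement (objects modified in the last 24 hours), one option is to compare the LastModified datetimes in Python instead of comparing formatted strings. This is a minimal sketch, assuming each page has the shape returned by list_objects_v2 and that LastModified is a timezone-aware datetime, as boto3 returns it. recent_keys and the sample pages are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def recent_keys(pages, cutoff):
    """Yield keys from list_objects_v2-style pages modified at or after cutoff."""
    for page in pages:
        # A page for an empty listing has no "Contents" key at all.
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= cutoff:
                yield obj["Key"]


# Hypothetical sample data standing in for real paginator output.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
sample_pages = [
    {"Contents": [
        {"Key": "inside-bucket-level-1/new.csv", "LastModified": now - timedelta(hours=3)},
        {"Key": "inside-bucket-level-1/old.csv", "LastModified": now - timedelta(days=2)},
    ]},
    {},
]
print(list(recent_keys(sample_pages, now - timedelta(days=1))))
# ['inside-bucket-level-1/new.csv']
```

The same function should accept either the builtin paginator's pages or the pages yielded by a hand-written generator, since both produce dictionaries of this shape.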

How to run BigQuery after Dataflow job completed successfully

I am trying to run a query in BigQuery right after a Dataflow job completes successfully. I have defined three functions in main.py.
The first one runs the Dataflow job. The second one checks the Dataflow job's status. The last one runs the query in BigQuery.
The trouble is that the second function checks the Dataflow job status repeatedly, and after the Dataflow job completes successfully it does not stop checking the status.
The function deployment then fails with a 'function load attempt timed out' error.
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
import os
import re
import config
from google.cloud import bigquery
import time
global flag
def trigger_job(gcs_path, body):
    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials, cache_discovery=False)
    request = service.projects().templates().launch(projectId=config.project_id, gcsPath=gcs_path, body=body)
    response = request.execute()
def get_job_status(location, flag):
    credentials=GoogleCredentials.get_application_default()
    dataflow=build('dataflow', 'v1b3', credentials=credentials, cache_discovery=False)
    result=dataflow.projects().jobs().list(projectId=config.project_id, location=location).execute()
    for job in result['jobs']:
        if re.findall(r'' + re.escape(config.job_name) + '', job['name']):
            while flag==0:
                if job['currentState'] != "JOB_STATE_DONE":
                    print('NOT DONE')
                else:
                    flag=1
                    print('DONE')
                    break
def bq(sql):
    client = bigquery.Client()
    query_job = client.query(sql, location='US')
gcs_path = config.gcs_path
body=config.body
trigger_job(gcs_path,body)
flag=0
location='us-central1'
get_job_status(location,flag)
sql= """CREATE OR REPLACE TABLE 'table' AS SELECT * FROM 'table'"""
bq(sql)
Cloud Function timeout is set to 540 seconds but deployment fails in 3-4 minutes.
Any help is very appreciated.
It appears from the code snippet provided that your HTTP-triggered Cloud Function is not returning an HTTP response.
All HTTP-based Cloud Functions must return an HTTP response for proper termination. From the Google documentation, "Ensure HTTP functions send an HTTP response" (emphasis mine):
If your function is HTTP-triggered, remember to send an HTTP response,
as shown below. Failing to do so can result in your function executing
until timeout. If this occurs, you will be charged for the entire
timeout time. Timeouts may also cause unpredictable behavior or cold
starts on subsequent invocations, resulting in unpredictable behavior
or additional latency.
Thus, you must have a function in your main.py that returns some sort of value, ideally one that can be coerced into a Flask HTTP response.
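A rough sketch of how this could look follows. It moves the polling into a function that has a deadline, asks for the job state again on every iteration (the loop in the question keeps reading the same job dictionary, so the state it sees never changes), and always returns a (body, status) tuple, which Flask accepts as a response. The names wait_for_state and handle_request, and the fetch_state callable that would wrap the Dataflow jobs API, are assumptions made for illustration.

```python
import time


def wait_for_state(fetch_state, done_state="JOB_STATE_DONE",
                   timeout_s=500.0, interval_s=10.0):
    """Poll fetch_state() until it returns done_state or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # fetch_state must query the service again on every call.
        if fetch_state() == done_state:
            return True
        time.sleep(interval_s)
    return False


def handle_request(request, fetch_state, **poll_options):
    """HTTP entry point sketch: every code path returns a response."""
    if wait_for_state(fetch_state, **poll_options):
        # This is where the BigQuery query would be started.
        return ("Dataflow job finished", 200)
    return ("Dataflow job still running", 504)


# Local demonstration with a fake sequence of job states.
states = iter(["JOB_STATE_RUNNING", "JOB_STATE_RUNNING", "JOB_STATE_DONE"])
print(handle_request(None, lambda: next(states), interval_s=0))
# ('Dataflow job finished', 200)
```

Keeping the work inside the handler, rather than at module level, also means nothing long-running executes while the function is being loaded during deployment.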

How to troubleshoot and solve lambda function issue?

import sys
import botocore
import boto3
from botocore.exceptions import ClientError
def lambda_handler(event, context):
    # TODO implement
    rds = boto3.client('rds')
    lambdaFunc = boto3.client('lambda')
    print('Trying to get Environment variable')
    try:
        funcResponse = lambdaFunc.get_function_configuration(
            FunctionName='RDSInstanceStart'
        )
        #print (funcResponse)
        DBinstance = funcResponse['Environment']['Variables']['DBInstanceName']
        print('Starting RDS service for DBInstance : ' + DBinstance)
    except ClientError as e:
        print(e)
    try:
        response = rds.start_db_instance(
            DBInstanceIdentifier=DBinstance
        )
        print('Success :: ')
        return response
    except ClientError as e:
        print(e)
    # The dict must start on the same line as return, otherwise None is returned.
    return {
        'message' : "Script execution completed. See Cloudwatch logs for complete output"
    }
I have a running RDS instance. Every day I start and stop my AWS RDS instance (db.t2.micro, MSSQL Server) using a Lambda function. It was working fine previously, but today I unexpectedly ran into a problem:
my RDS instance was not started automatically by the Lambda function. I looked at the error log, but it shows no issue and looks like the usual daily log. I am unable to troubleshoot and solve the problem. Can anyone tell me what might be wrong?
FYI, a shortened version would be:
import boto3
import os
def lambda_handler(event, context):
    rds_client = boto3.client('rds')
    # Read the instance name from the function's own environment variables.
    response = rds_client.start_db_instance(DBInstanceIdentifier=os.environ['DBInstanceName'])
    print(response)
You can see the logs of each Lambda call in CloudWatch, or under AWS Lambda -> Monitoring -> View logs in CloudWatch. This opens a page with the logs of each Lambda invocation.
If there are no logs, the Lambda is not being invoked.
Also check that the roles and policies assigned to the Lambda are correct.
You should print the response of the API you use to start the DB (e.g. start-db-instance). The response will be written to CloudWatch Logs.
https://docs.aws.amazon.com/cli/latest/reference/rds/start-db-instance.html
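As a sketch of that suggestion, the helper below prints the whole response as JSON so that it ends up in the function's log. The names start_and_log and FakeRds are hypothetical. In a real handler the first argument would be boto3.client('rds'), and the fake class only mimics the part of the response used here.

```python
import json


def start_and_log(rds_client, db_instance_id):
    """Start the instance and print the API response for CloudWatch Logs."""
    response = rds_client.start_db_instance(DBInstanceIdentifier=db_instance_id)
    # default=str handles datetime values that json cannot serialise natively.
    print(json.dumps(response, default=str))
    return response["DBInstance"]["DBInstanceStatus"]


class FakeRds:
    """Stand-in for boto3.client('rds'), for local experiments only."""

    def start_db_instance(self, DBInstanceIdentifier):
        return {"DBInstance": {"DBInstanceIdentifier": DBInstanceIdentifier,
                               "DBInstanceStatus": "starting"}}


status = start_and_log(FakeRds(), "my-db")
```

Because the response is printed as JSON, the log line contains the status as a quoted key and value, which is convenient for keyword-based log filters.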
For later automation, you might want to create a metric filter on the Lambda's CloudWatch Logs for a certain keyword, such as:
"\"DBInstanceStatus\": \"starting\""
You would also create an Alarm with a threshold of, say, < 1. If the keyword is not found in a log, the metric pushes no value (you can customize this setting under the Advanced option), the Alarm goes into INSUFFICIENT_DATA, and you can set up a notification for INSUFFICIENT_DATA using SNS.
You can also tweak the Alarm to treat missing data as bad, so that it transitions to the ALARM state when the metric filter does not match the incoming log.

Create AWS sagemaker endpoint and delete the same using AWS lambda

Is there a way to create a SageMaker endpoint using AWS Lambda?
The maximum timeout limit for Lambda is 300 seconds, while my existing model takes 5-6 minutes to host.
One way is to combine Lambda and Step Functions with a wait state to create the SageMaker endpoint.
In the step function, have tasks to:
1. Launch AWS Lambda to CreateEndpoint
import time
import boto3
client = boto3.client('sagemaker')
endpoint_name = 'DEMO-imageclassification-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
endpoint_config_name = 'DEMO-imageclassification-epc--2018-06-18-17-02-44'
print(endpoint_name)
def lambda_handler(event, context):
    create_endpoint_response = client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name)
    print(create_endpoint_response['EndpointArn'])
    print('EndpointArn = {}'.format(create_endpoint_response['EndpointArn']))
    # get the status of the endpoint
    response = client.describe_endpoint(EndpointName=endpoint_name)
    status = response['EndpointStatus']
    print('EndpointStatus = {}'.format(status))
    return status
2. Wait task to wait for X minutes
3. Another task with Lambda to check EndpointStatus and, depending on EndpointStatus (OutOfService | Creating | Updating | RollingBack | InService | Deleting | Failed), either stop the job or continue polling
import time
import boto3
client = boto3.client('sagemaker')
endpoint_name = 'DEMO-imageclassification-2018-07-20-18-52-30'
endpoint_config_name = 'DEMO-imageclassification-epc--2018-06-18-17-02-44'
print(endpoint_name)
def lambda_handler(event, context):
    # print the status of the endpoint
    endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
    status = endpoint_response['EndpointStatus']
    print('Endpoint creation ended with EndpointStatus = {}'.format(status))
    if status != 'InService':
        raise Exception('Endpoint creation failed.')
    # wait until the status has changed
    client.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
    # print the status of the endpoint
    endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
    status = endpoint_response['EndpointStatus']
    print('Endpoint creation ended with EndpointStatus = {}'.format(status))
    if status != 'InService':
        raise Exception('Endpoint creation failed.')
    status = endpoint_response['EndpointStatus']
    return
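An alternative shape for the status-check task, offered only as a sketch, is to return the status to the state machine rather than raising, so that a Choice state can decide whether to loop back to the Wait task. check_endpoint, the done and failed flags, and FakeSageMaker are illustrative names. Treating only the Failed status as a failure is an assumption, and the other non-InService statuses are simply left for the next polling round.

```python
def check_endpoint(sagemaker_client, endpoint_name):
    """Return the endpoint status plus flags a Choice state can branch on."""
    status = sagemaker_client.describe_endpoint(
        EndpointName=endpoint_name)["EndpointStatus"]
    return {
        "status": status,
        "done": status == "InService",
        # Assumption: only "Failed" ends the polling loop with an error.
        "failed": status == "Failed",
    }


class FakeSageMaker:
    """Stand-in for boto3.client('sagemaker'), for local experiments only."""

    def __init__(self, status):
        self._status = status

    def describe_endpoint(self, EndpointName):
        return {"EndpointName": EndpointName, "EndpointStatus": self._status}


print(check_endpoint(FakeSageMaker("Creating"), "demo-endpoint"))
# {'status': 'Creating', 'done': False, 'failed': False}
```

With this shape each invocation finishes in well under a second, and the waiting itself is done by the Wait state instead of inside the Lambda.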
Another approach is to use a combination of AWS Lambda functions and CloudWatch rules, which I think would be clumsy.
While rajesh's answer is closer to what the question asks for, I would like to add that SageMaker now has a batch transform job.
Instead of continuously hosting a machine, this job can predict on large batches at once when latency is not a concern. So if the intention behind the question is to deploy the model for a short time to predict on a fixed number of batches, this might be the better approach.