SNS mail notification when a step is not kicked off within a threshold timeframe - amazon-web-services

I have an EMR step that is submitted through a Step Functions state machine. During the run I can see that the task is submitted, but the EMR step is not executed and the EMR console doesn't show any information about it.
How can I debug this?
How can I send an SNS notification when a step doesn't start executing within a threshold timeframe? In my case, Step Functions shows the EMR task as submitted, but there is no information in the EMR console, and the pipeline keeps running for more than half an hour without failing.

You can start debugging from the Step Functions execution log to identify the specific step that failed, and then move on to the EMR console or whichever service failed. When the EMR step doesn't appear in the EMR console, it is usually due to a runtime error caused by an exception raised when calling the EMR step.
For this scenario, you can use the error handling that Step Functions provides through the Catch and TimeoutSeconds fields; you can find more details in the AWS documentation.
Basically, you need to add these fields as shown below:
{
"StartAt": "EmrStep",
"States": {
"EmrStep": {
"Type": "Task",
"Resource": "arn:aws:emr:execute-X-step",
"Comment": "This is your EMR step",
"TimeoutSeconds": 10,
"Catch": [ {
"ErrorEquals": ["States.Timeout"],
"Next": "ShutdownClusterAndSendSNS"
} ],
"End": true
},
"ShutdownClusterAndSendSNS": {
"Type": "Pass",
"Comment": "This step handles the timeout exception raised",
"Result": "You can shutdown the EMR cluster to avoid increased cost here and later send a sns notification!",
"End": true
}
}
}
Note: To catch the timeout, you have to catch the States.Timeout error, but you can define the same Catch field for other types of errors as well.
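As a minimal sketch of the same idea, the definition can be built in Python with the Pass placeholder swapped for a real SNS publish task. The `arn:aws:states:::elasticmapreduce:addStep.sync` and `arn:aws:states:::sns:publish` service integrations are an assumption about how the step is submitted; the cluster ID, topic ARN, step arguments, and the 1800-second threshold are hypothetical values.

```python
import json


def build_definition(cluster_id: str, topic_arn: str, timeout_seconds: int = 1800) -> dict:
    """Build a state machine that publishes to SNS when the EMR task times out or fails."""
    return {
        "StartAt": "EmrStep",
        "States": {
            "EmrStep": {
                "Type": "Task",
                # Assumed integration: submit an EMR step and wait for it to finish.
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId": cluster_id,
                    "Step": {
                        "Name": "example-step",  # hypothetical step
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            "Args": ["spark-submit", "s3://example-bucket/job.py"],
                        },
                    },
                },
                "TimeoutSeconds": timeout_seconds,
                "Catch": [
                    {
                        "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
                        # Keep the error details so the notification can include them.
                        "ResultPath": "$.error",
                        "Next": "NotifyFailure",
                    }
                ],
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": topic_arn,
                    "Subject": "EMR step did not complete",
                    "Message.$": "States.JsonToString($.error)",
                },
                "End": True,
            },
        },
    }


definition = build_definition("j-EXAMPLE", "arn:aws:sns:us-east-1:123456789012:emr-alerts")
print(json.dumps(definition, indent=2))
```

With a `.sync` integration, TimeoutSeconds bounds the whole task rather than only its start, so the threshold has to be longer than the step's expected runtime.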

Related

Specifying Share Identifier in EventBridge rule for an AWS Batch job

I am writing a CloudFormation template for an AWS Batch job triggered by an EventBridge rule. However, I am getting the following error:
shareIdentifier must be specified. (Service: AWSBatch; Status Code: 400; Error Code: ClientException;
I cannot find any documentation on how to pass a shareIdentifier to my Batch job. How can I add it to my EventBridge rule's CloudFormation template?
I have tried passing it as the Input variable:
Input: |
{
"shareIdentifier": "mid"
}
This is not picked up. I have also tried passing shareIdentifier/ShareIdentifier directly in the BatchParameters, but this was an unrecognised key.
In the end, I couldn't crack this, so I had to wrap it in a state machine step and call that from EventBridge instead. This was the logic in the state machine to add the share identifier:
"States": {
"Batch SubmitJob": {
"Type": "Task",
"Resource": "arn:aws:states:::batch:submitJob.sync",
"Parameters": {
"JobName": <name>,
"JobDefinition": <Arn>,
"JobQueue": <QueueName>,
"ShareIdentifier": <Share>
},
If anyone works out how to do it directly from EventBridge, I'd love to hear it.
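One alternative, not taken from the answer above, is to point the EventBridge rule at a small Lambda function that calls SubmitJob itself, since the Batch SubmitJob API accepts a `shareIdentifier`. A minimal Python sketch follows. The job name, queue, job definition, and share value are hypothetical, and the client is passed in so the function can be exercised without AWS credentials (in a real Lambda it would be `boto3.client("batch")`).

```python
def submit_with_share(batch_client, job_name, job_queue, job_definition, share_identifier):
    """Submit a Batch job to a fair-share queue with an explicit share identifier."""
    response = batch_client.submit_job(
        jobName=job_name,
        jobQueue=job_queue,
        jobDefinition=job_definition,
        shareIdentifier=share_identifier,
    )
    return response["jobId"]


class FakeBatchClient:
    """Stand-in for boto3.client("batch") that only records the request."""

    def __init__(self):
        self.calls = []

    def submit_job(self, **kwargs):
        self.calls.append(kwargs)
        return {"jobName": kwargs["jobName"], "jobId": "example-job-id"}


fake = FakeBatchClient()
job_id = submit_with_share(fake, "nightly-report", "fair-share-queue", "report-job:1", "mid")
print(job_id, fake.calls[0]["shareIdentifier"])
```

This trades the extra state machine for an extra function; which is simpler depends on what else the pipeline already uses.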

EC2 Instance Vanished

We have a peculiar situation today: one of our EC2 instances has disappeared from the console and we aren't sure what caused it. CloudTrail doesn't have any termination event for this instance ID.
The last CloudTrail event recorded for the instance ID that went down looks like this:
{
"eventVersion": "1.08",
"userIdentity": {
"type": "AWSService",
"invokedBy": "ec2.amazonaws.com"
},
"eventTime": "2022-03-23T05:46:40Z",
"eventSource": "sts.amazonaws.com",
"eventName": "AssumeRole",
"awsRegion": "ap-south-1",
"sourceIPAddress": "ec2.amazonaws.com",
"userAgent": "ec2.amazonaws.com",
"requestParameters": {
"roleArn": "arn:aws:iam::2************:role/ec2-instance-***********",
"roleSessionName": "i-06135ad01bb90****"
},
"responseElements": {
"credentials": {
"accessKeyId": "<redacted>",
"sessionToken": "<redacted>",
"expiration": "Mar 23, 2022, 12:01:34 PM"
}
},
"requestID": "d9882911-39e7-449b-9701-***********",
"eventID": "0fa1b79b-08aa-48e6-8232-***********",
"readOnly": true,
"resources": [
{
"accountId": "2************",
"type": "AWS::IAM::Role",
"ARN": "arn:aws:iam::2************:role/ec2-instance-***********"
}
],
"eventType": "AwsApiCall",
"managementEvent": true,
"recipientAccountId": "2************",
"sharedEventID": "4b842373-e89d-438b-be3b-*********",
"eventCategory": "Management"
}
The only things I can think of are either a hardware failure on the AWS side or some crude command run within the OS by a user that took the instance down. Unfortunately, we don't have AWS Developer Support, as that's quite costly.
Has anyone faced anything similar? Any leads on how I can go about finding the root cause?
For anyone who is interested or may face this in the future: we had to opt for AWS Developer Support to get an answer. Here is what they had to say:
From the case notes, I understood that the instance
'i-06135ad01bb******' was missing from yesterday. However, you tried
to check cloudtrail and could not find any traces of termination.
Please correct me if I misunderstood your query.
Upon reviewing the case description, I started checking the instance
using our internal tools and observed that the instance got terminated
on '2022-03-23 06:33 UTC' with the reason
'INSTANCE-INITIATED-SHUTDOWN'.
This means that the shutdown got initiated from OS. Please allow me to
inform you that AWS engineers do not have access or visibility to the
customer's instance/OS level due to data privacy[1] and shared
responsibility model[2]. Hence, I will not be in a position to check
how shutdown call got initiated.
To further investigate this, I checked using our internal tools and
could see that you have selected the 'termination' option: on shut
down which means that when an instance gets shutdown it will be
automatically terminated. So, I would request you to change the option
'termination' : on shut down to 'STOP' : on shut down. With this, if
the OS initiates a shutdown by any chance, the instance will be
stopped instead of getting terminated.
As per my analysis, I can confirm that the AWS infrastructure was healthy and there weren't any issues from our end. However, for the future, I would request you to consider the following best practices which you already might be aware of, however I am mentioning them here for the sake of completeness:
Enable termination protection
Regularly back up your data
Preserving root volume after termination:
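Thank you, fannymug. I think this was also the case for me.
Just to add:
In CloudTrail, search for the instance ID and select the RunInstances event name; there you can check the event details.
Double-check the request parameters. deleteOnTermination only controls whether a volume is deleted when the instance terminates; termination protection is the disableApiTermination attribute, and the shutdown behavior is instanceInitiatedShutdownBehavior.
To enable termination protection during the EC2 creation process, look in Advanced details > Termination protection > Enable.
Setting the Shutdown behavior option to Stop also helps: termination protection alone does not prevent an OS-initiated shutdown from terminating the instance when the shutdown behavior is Terminate.
I hope it helps.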
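For an instance that already exists, both settings can also be changed through the API. The following is a minimal Python sketch, assuming a boto3 EC2 client is passed in (`boto3.client("ec2")` in real use); the instance ID is hypothetical and the fake client exists only so the example runs without AWS access.

```python
def harden_instance(ec2_client, instance_id):
    """Make an OS-level shutdown stop the instance and block API-initiated termination."""
    # modify_instance_attribute accepts one attribute per call, hence two calls.
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceInitiatedShutdownBehavior={"Value": "stop"},
    )
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        DisableApiTermination={"Value": True},
    )


class FakeEc2Client:
    """Stand-in for boto3.client("ec2") that only records the requests."""

    def __init__(self):
        self.calls = []

    def modify_instance_attribute(self, **kwargs):
        self.calls.append(kwargs)
        return {}


fake_ec2 = FakeEc2Client()
harden_instance(fake_ec2, "i-0123456789abcdef0")
print(fake_ec2.calls)
```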

Recursive AWS Lambda function calls - Best Practice

I've been tasked with looking at a service built on AWS Lambda that performs a long-running task of turning VMs on and off. Mind you, I come from the Azure team, so I am not familiar with the styling or best practices of AWS services.
The approach the original developer has taken is to send the entire workload to one Lambda function and then have that function take a section of the workload and then recursively call itself with the remaining workload until all items are gone (workload = 0).
Pseudo-ish Code:
// Assume this gets sent to a HTTP Lambda endpoint as a whole
let workload = [1, 2, 3, 4, 5, 6, 7, 8]
// The Lambda HTTP endpoint
function Lambda(workload) {
if (!workload.length) {
return "No more work!"
}
const toDo = workload.splice(0, 2) // get first two items
doWork(toDo)
// Then... except it builds a new HTTP request with aws sdk
Lambda(workload) // 3, 4, 5, 6, 7, 8, etc.
}
This seems highly inefficient and unreliable (correct me if I am wrong). There is a lot of state being stored in this process and in my opinion that creates a lot of failure points.
My plan is to suggest we re-engineer the entire service to use a Queue/Worker type framework instead, where ideally the endpoint would handle one workload at a time, and be stateless.
The queue would be populated by a service (Jenkins? Lambda? Manually?), then a second service would read from the queue (and ideally scale-out as well, as needed).
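A minimal sketch of the worker side of that plan, assuming the queue is Amazon SQS and the worker is a Lambda function triggered by it. The message shape (`vm`, `action`) and the `do_work` helper are hypothetical; the handler keeps no state between messages and reports failures per message.

```python
import json


def do_work(item):
    """Hypothetical placeholder for turning one VM on or off."""
    if item.get("vm") is None:
        raise ValueError("message has no 'vm' field")
    return f"{item['action']} {item['vm']}"


def handler(event, context=None):
    """Process each SQS record independently and report only the failed ones."""
    failures = []
    for record in event.get("Records", []):
        try:
            do_work(json.loads(record["body"]))
        except Exception:
            # Returning the message ID asks Lambda to retry just this message
            # (requires ReportBatchItemFailures on the event source mapping).
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


event = {
    "Records": [
        {"messageId": "m-1", "body": json.dumps({"vm": "vm-a", "action": "start"})},
        {"messageId": "m-2", "body": json.dumps({"action": "stop"})},
    ]
}
result = handler(event)
print(result)
```

Because each message carries one unit of work, a failure affects only that message, and a dead-letter queue on the SQS queue can hold messages that keep failing.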
UPDATE: AWS EventBridge now looks like the preferred solution.
It's "Coupling" that I was thinking of, see here: https://www.jeffersonfrank.com/insights/aws-lambda-design-considerations
Coupling
Coupling goes beyond Lambda design considerations; it’s more about the system as a whole. Lambdas within a microservice are sometimes tightly coupled, but this is nothing to worry about as long as the data passed between Lambdas within their little black box of a microservice is not over pure HTTP and isn’t synchronous.
Lambdas shouldn’t be directly coupled to one another in a Request Response fashion, but asynchronously. Consider the scenario when an S3 Event invokes a Lambda function, then that Lambda also needs to call another Lambda within that same microservice and so on.
You might be tempted to implement direct coupling, like allowing Lambda 1 to use the AWS SDK to call Lambda 2 and so on. This introduces some of the following problems:
If Lambda 1 is invoking Lambda 2 synchronously, it needs to wait for the latter to be done first. Lambda 1 might not know that Lambda 2 also called Lambda 3 synchronously, and Lambda 1 may now need to wait for both Lambda 2 and 3 to finish successfully. Lambda 1 might timeout as it needs to wait for all the Lambdas to complete first, and you’re also paying for each Lambda while they wait.
What if Lambda 3 has a concurrency limit set and is also called by another service? The call between Lambda 2 and 3 will fail until it has concurrency again. The error can be returned all the way back to Lambda 1, but what does Lambda 1 then do with the error? It has to store that the S3 event was unsuccessful and that it needs to replay it.
This process can be redesigned to be event-driven:
Not only is this the solution to all the problems introduced by the direct coupling method, but it also provides a method of replaying the DLQ if an error occurred for each Lambda. No message will be lost or need to be stored externally, and the demand is decoupled from the processing.
AWS Step Functions is one way you can achieve this. Step Functions are used to orchestrate multiple Lambda functions in any manner you want - parallel executions, sequential executions or a mix of both. You can also put wait steps, condition checks, retries in between if you need.
Your overall step function might look something like this (say you want 1, 2 and 3 to execute in parallel; then, when all of these are complete, you want to execute 4, and then 5 and 6 in parallel).
Configuring this is also pretty simple. It accepts JSON like the following:
{
"Comment": "An example of the Amazon States Language using a parallel state to execute two branches at the same time.",
"StartAt": "Parallel",
"States": {
"Parallel": {
"Type": "Parallel",
"Next": "Task4",
"Branches": [
{
"StartAt": "Task1",
"States": {
"Task1": {
"Type": "Task",
"Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
"End": true
}
}
},
{
"StartAt": "Task2",
"States": {
"Task2": {
"Type": "Task",
"Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
"End": true
}
}
},
{
"StartAt": "Task3",
"States": {
"Task3": {
"Type": "Task",
"Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
"End": true
}
}
}
]
},
"Task4": {
"Type": "Task",
"Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
"Next": "Parallel2"
},
"Parallel2": {
"Type": "Parallel",
"Next": "Final State",
"Branches": [
{
"StartAt": "Task5",
"States": {
"Task5": {
"Type": "Task",
"Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
"End": true
}
}
},
{
"StartAt": "Task6",
"States": {
"Task6": {
"Type": "Task",
"Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
"End": true
}
}
}
]
},
"Final State": {
"Type": "Pass",
"End": true
}
}
}

Cloud Tasks Not Triggering HTTPrequest Endpoints

I have one simple cloud task queue and have successfully submitted a task to the queue. It is supposed to deliver a JSON payload to my API to perform a basic database update. The task is created at the end of a process in a .net core 3.1 app running locally on my desktop triggered by postman and the API is a golang app running in cloud run. However, the task never seems to fire and never registers an error.
The "Tasks in queue" count is always 0 and "Tasks running" is always blank. I have hit the "Run Now" button dozens of times, but it never changes anything, and no log entries or failed attempts are ever registered.
The task is created with the OIDCToken with a service account and audience set for the service account that has the authorization to create tokens and execute the cloud run instance.
Screen Shot of Tasks Queue in Google Cloud Console
Task creation log entry shows that it was created OK:
{
"insertId": "efq7sxb14",
"jsonPayload": {
"taskCreationLog": {
"targetAddress": "PUT https://{readacted}",
"targetType": "HTTP",
"scheduleTime": "2020-04-25T01:15:48.434808Z",
"status": "OK"
},
"@type": "type.googleapis.com/google.cloud.tasks.logging.v1.TaskActivityLog",
"task": "projects/{readacted}/locations/us-central1/queues/database-updates/tasks/0998892809207251757"
},
"resource": {
"type": "cloud_tasks_queue",
"labels": {
"target_type": "HTTP",
"project_id": "{readacted}",
"queue_id": "database-updates"
}
},
"timestamp": "2020-04-25T01:15:48.435878120Z",
"severity": "INFO",
"logName": "projects/{readacted}/logs/cloudtasks.googleapis.com%2Ftask_operations_log",
"receiveTimestamp": "2020-04-25T01:15:49.469544393Z"
}
Any ideas as to why the tasks are not running? This is my first time using Cloud Tasks so don't rule out the idiot between the keyboard and the chair.
Thanks!
You might be using a non-default service. See Configuring Cloud Tasks queues.
Try creating a task from the command line and watch the logs, e.g.:
gcloud tasks create-app-engine-task --queue=default \
--method=POST --relative-uri=/update_counter --routing=service:worker \
--body-content=10
In my own case, I used --routing=service:api and it worked straight away. Then I added AppEngineRouting to the AppEngineHttpRequest.

How to send the notification on every task execution in a state machine on AWS Step Functions?

I am working with AWS Step Functions to orchestrate a workflow of multiple Batch jobs. The requirement is that the Batch jobs are executed sequentially, and whenever execution transitions from one job to the next, a notification with the execution status of the task is sent to an SNS topic. I need to send a notification for both SUCCESS and FAILURE of a task.
I have tried execution events using CloudWatch Events rules, but execution events only give information about the state machine's execution, not about the execution of individual tasks.
As you have found, individual state transitions aren't published to CloudWatch Events, so you need to add the notification as a separate step; there is no way around this. Add a notify step that either invokes a Lambda function or publishes directly to SNS.
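A minimal sketch of the separate notify step, generated in Python so that every Batch job gets its own success and failure notification. It assumes the `arn:aws:states:::batch:submitJob.sync` and `arn:aws:states:::sns:publish` service integrations; the job names, queue name, topic ARN, and the convention that each job definition is named after its job are hypothetical.

```python
import json


def sns_state(topic_arn, message, next_state=None):
    """Build a Task state that publishes a fixed message to SNS."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Parameters": {"TopicArn": topic_arn, "Message": message},
        "ResultPath": None,  # discard the SNS response, keep the original input
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def build_chain(job_names, job_queue, topic_arn):
    """Chain Batch jobs sequentially with an SNS notification after each one."""
    states = {"Failed": {"Type": "Fail", "Error": "BatchJobFailed"}}
    for index, name in enumerate(job_names):
        following = job_names[index + 1] if index + 1 < len(job_names) else None
        states[name] = {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            # Hypothetical convention: the job definition shares the job's name.
            "Parameters": {"JobName": name, "JobDefinition": name, "JobQueue": job_queue},
            "ResultPath": None,
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "ResultPath": "$.error",
                    "Next": f"Notify {name} failed",
                }
            ],
            "Next": f"Notify {name} succeeded",
        }
        states[f"Notify {name} succeeded"] = sns_state(topic_arn, f"{name}: SUCCESS", following)
        states[f"Notify {name} failed"] = sns_state(topic_arn, f"{name}: FAILURE", "Failed")
    return {"StartAt": job_names[0], "States": states}


definition = build_chain(
    ["extract", "transform"],
    "example-queue",
    "arn:aws:sns:us-east-1:123456789012:batch-status",
)
print(json.dumps(definition, indent=2))
```

Generating the definition avoids hand-writing two extra states per job, at the cost of a larger state machine.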
There is also another way to do this, since you can compose step functions of step functions: a parent step function and a child step function. The child step function could be the Batch job itself, and then you can make use of CloudWatch Events on the batch-job-step-function step function:
"BatchJob" : {
"Comment": "This snippet is in the parent step function. It will kick off another step function, called: batch-job-step-function",
"Type": "Task",
"Resource": "arn:aws:states:::states:startExecution.sync",
"Parameters": {
"StateMachineArn": "arn:aws:states:us-east-1:TODO:stateMachine:batch-job-step-function",
"Input": {
"batchJobInput.$": "$$.Execution.Input.batchJobInput"
}
},
"End": true | "Next" : "TODO"
}
Now you can put CloudWatch Events rules against: arn:aws:states:us-east-1:TODO:stateMachine:batch-job-step-function