tl;dr: I'm trying to figure out what about the messages below could cause SQS to fail to process them and trigger the redrive policy which sends them to a Dead Letter Queue. The AWS documentation for DLQs says:
Sometimes, messages can’t be processed because of a variety of possible issues, such as erroneous conditions within the producer or consumer application or an unexpected state change that causes an issue with your application code. For example, if a user places a web order with a particular product ID, but the product ID is deleted, the web store's code fails and displays an error, and the message with the order request is sent to a dead-letter queue.
The context here is that my company uses a CloudFormation setup to run a virus scanner against files which users upload to our S3 buckets.
The buckets have bucket events which publish PUT actions to an SQS queue.
An EC2 instance subscribes to that queue and runs files which get uploaded to those buckets through a virus scanner.
The messages which enter the queue are coming from S3 bucket events, so it seems like that rules out "erroneous conditions within the producer." Could an SQS redrive policy get fired if a subscriber to the queue fails to process the message?
This is one of the messages which was sent to the DLQ (I've changed letters and numbers in each of the IDs):
{
  "Records": [
    {
      "eventVersion": "2.1",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventTime": "2019-09-30T20:21:13.762Z",
      "eventName": "ObjectCreated:Put",
      "userIdentity": {
        "principalId": "AWS:AIDAIQ6ZKWSHYT34HC0X2"
      },
      "requestParameters": {
        "sourceIPAddress": "52.161.96.193"
      },
      "responseElements": {
        "x-amz-request-id": "9F500CA65B966D84",
        "x-amz-id-2": "w1R6BLPAI68na+xNssfdscQjfOQk56gmof+Bp4nF/rY90jBWnlqliHLrnwHWx20329clJckCIzhI="
      },
      "s3": {
        "s3SchemaVersion": "1.0",
        "configurationId": "VirusScan",
        "bucket": {
          "name": "uploadcenter",
          "ownerIdentity": {
            "principalId": "A2CSGHOAZOCNTU"
          },
          "arn": "arn:aws:s3:::sharingcenter"
        },
        "object": {
          "key": "Packard/f43edeee-6d58-118f-f8b8-4ec57f9cdb54Transformers/Transformers.mp4",
          "size": 1317070058,
          "eTag": "4a828a976dbdfe6fe1931f8e96437e2",
          "sequencer": "005D20633476B28AE7"
        }
      }
    }
  ]
}
I've been puzzling over this message and similar ones, trying to figure out what may have triggered the redrive policy. Could it have been caused by the EC2 instance failing to process the message? There's nothing in the Ruby script on the instance which would publish a message to the DLQ. Each of these files is uncommonly large. Is it possible that something in the process choked on the file because of its size, and that caused the redrive? If an EC2 failure couldn't have caused the redrive, what is it about the message that would cause SQS to send it to the DLQ?
Amazon SQS is typically used as follows:
Something publishes a message to a queue (in your case, an S3 PUT event)
Worker(s) request a message from the queue and process the message
The message becomes "invisible" so that other workers cannot see it
If the message was processed successfully, the worker tells SQS to delete the message
If the worker does not delete the message within the invisibility timeout period, then SQS makes the message visible on the queue again
If a message fails more than a configured number of times (that is, if the workers do not delete the message), then the message is moved to a nominated Dead Letter Queue
Please note that there are no "subscribers" to SQS queues. Rather, applications call the SQS API and request a message.
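As a rough sketch of that request/process/delete cycle (the queue URL and the scan_file step are placeholders, not your actual code), a worker using boto3 might look like this:

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/virus-scan-queue"  # placeholder

while True:
    # Long-poll for a message; it becomes invisible to other workers once received
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,
    )
    for message in response.get("Messages", []):
        scan_file(message["Body"])  # hypothetical processing step
        # Deleting the message tells SQS it was processed successfully
        sqs.delete_message(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=message["ReceiptHandle"],
        )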
The fact that you are getting messages in the DLQ indicates that the worker (virus checker) is not deleting the message within the invisibility period.
It is possible that the virus checker requires more time to scan large files, in which case you could increase the invisibility timeout on the queue to give it more time.
The workers can also signal back to SQS that they are still working on the message, which refreshes the timeout. This would require modifying the virus checker to send such a signal at regular intervals.
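A minimal sketch of such a heartbeat, with an arbitrary 60-second interval:

import threading
import boto3

sqs = boto3.client("sqs")

def keep_message_invisible(queue_url, receipt_handle, stop_event, interval=60):
    # Periodically extend the visibility timeout while the scan is still running
    while not stop_event.wait(interval):
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=interval * 2,  # keep it invisible past the next heartbeat
        )

The scanner would start this in a background thread when it picks up a message, then set stop_event and delete the message once the scan completes.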
Bottom line: The worker (virus checker) is not completing the task within the timeout period.
Related
I have set up my SageMaker endpoint as an Async Endpoint and provided SNS topic ARNs for the SuccessTopic and ErrorTopic parameters. I am receiving success and failure messages from these SNS topics without error, but I want the failure messages to be more verbose. The failure message looks like this:
{
  "awsRegion": "...",
  "eventTime": "...",
  "receivedTime": "...",
  "invocationStatus": "Failed",
  "failureReason": "ClientError: Received server error (500) from model. See the SageMaker Endpoint logs in your account for more information.",
  "requestParameters": {
    "endpointName": ...,
    "inputLocation": ...
  },
  "inferenceId": "...",
  "eventVersion": "1.0",
  "eventSource": "aws:sagemaker",
  "eventName": "InferenceResult"
}
There might be different reasons for the service to throw an error, such as CUDA OOM errors or an assertion error that I raised myself. I would love to see this information in the SNS message. However, the only way to see any additional information about the error is to check the SageMaker Endpoint logs.
Each time I receive an error from the SageMaker service, the failureReason is the same. Is there a way to customize the failureReason parameter in this message?
I have tried adding exception messages to all exceptions that I raise in the code, but the message never changed. I have no access to the SNS topic during execution. I have created an SNS client using boto3 and sent a notification before raising any assertion error, but I don't know of any way to stop the SageMaker execution without throwing an error, which sends another failure message automatically.
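This is a minimal sketch of that workaround (the topic ARN and run_inference are placeholders for my real setup): I publish my own verbose notification and then re-raise, which still produces the generic failureReason from SageMaker:

import boto3

sns = boto3.client("sns")
ERROR_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:my-verbose-errors"  # placeholder

def handle_request(payload):
    try:
        return run_inference(payload)  # hypothetical model call
    except Exception as exc:
        # Publish the real reason, then re-raise so the invocation is still marked as failed
        sns.publish(
            TopicArn=ERROR_TOPIC_ARN,
            Subject="Async inference failure",
            Message=f"{type(exc).__name__}: {exc}",
        )
        raise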
After setting up EventBridge, an S3 PutObject event still cannot trigger the Step Function.
However, when I changed the event rule to EC2 instance status, it worked.
I also tried changing the rule to all S3 events, but it still does not work.
Amazon EventBridge:
Event pattern:
{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["PutObject"],
    "requestParameters": {
      "bucketName": ["MY_BUCKETNAME"]
    }
  }
}
Target(s):
Type: Step Functions state machine
ARN: arn:aws:states:us-east-1:xxxxxxx:stateMachine:MY_FUNCTION_NAME
Reference: https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-cloudwatch-events-s3.html
Your Step Function isn't being triggered because the PutObject events aren't being published to CloudTrail. S3 object-level operations are classified as data events, so you must enable data events when creating your trail. The tutorial says "next, next and create", which seems to suggest no additional options need to be selected. By default, Data events on the next step (step 2, "Choose log events", as of this writing) is not checked. You have to check it and fill in the bottom part to specify whether all buckets/events are to be logged.
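If you would rather enable this programmatically than through the console, something along these lines should attach an S3 data-event selector to an existing trail (the trail name is a placeholder; the bucket ARN should match the one in your rule):

import boto3

cloudtrail = boto3.client("cloudtrail")

# Log S3 object-level (data) events for the bucket referenced in the EventBridge rule
cloudtrail.put_event_selectors(
    TrailName="my-trail",  # placeholder
    EventSelectors=[
        {
            "ReadWriteType": "WriteOnly",      # PutObject is a write
            "IncludeManagementEvents": True,
            "DataResources": [
                {
                    "Type": "AWS::S3::Object",
                    "Values": ["arn:aws:s3:::MY_BUCKETNAME/"],
                }
            ],
        }
    ],
)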
We can set up event rules to trigger an ECS task, but I don't see whether the triggering event is passed to the running ECS task, or how to fetch the content of that event inside the task. If a Lambda is triggered, we can get it from the event variable, for example, in Python:
def lambda_handler(event, context):
    ...
But in ECS I don't see how to do anything similar. Going to the CloudTrail log bucket doesn't seem like a good approach, because there is around a 5-minute delay before a new log/event shows up, which requires the ECS task to wait and adds logic to talk to S3 and find and read the log. And when the triggering events are frequent, this sounds hard to handle.
One way to handle this is to set two targets in the CloudWatch Events rule:
One target will launch the ECS task
One target will push the same event to SQS
So the SQS queue will contain info like:
{
  "version": "0",
  "id": "89d1a02d-5ec7-412e-82f5-13505f849b41",
  "detail-type": "Scheduled Event",
  "source": "aws.events",
  "account": "123456789012",
  "time": "2016-12-30T18:44:49Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:events:us-east-1:123456789012:rule/SampleRule"
  ],
  "detail": {}
}
So when the ECS task comes up, it will be able to read the event from the SQS queue.
For example, in the Docker entrypoint:
#!/bin/sh
echo "Starting container"
echo "Process SQS event"
node process_schedule_event.js
# or, if you need to process it at run time
schedule_event=$(aws sqs receive-message --queue-url https://sqs.us-west-2.amazonaws.com/123456789/demo --attribute-names All --message-attribute-names All --max-number-of-messages 1)
echo "Schedule Event: ${schedule_event}"
# once the event is processed, start the main process of the container
exec "$@"
After further investigation, I finally worked out another solution, which is to use S3 to invoke a Lambda, and then in that Lambda use the ECS SDK (boto3, since I use Python) to run my ECS task. This way I can easily pass the event content to ECS, and it is nearly real-time.
But I still give credit to @Adiii because his solution also works.
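For anyone interested, the Lambda side is just a short boto3 call; the cluster, task definition, subnet and container name below are placeholders for my real values:

import json
import boto3

ecs = boto3.client("ecs")

def lambda_handler(event, context):
    # Pass the S3 event to the task as an environment variable override
    ecs.run_task(
        cluster="my-cluster",                 # placeholder
        taskDefinition="my-task-def",         # placeholder
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],  # placeholder
                "assignPublicIp": "ENABLED",
            }
        },
        overrides={
            "containerOverrides": [
                {
                    "name": "my-container",   # placeholder
                    "environment": [
                        {"name": "S3_EVENT", "value": json.dumps(event)}
                    ],
                }
            ]
        },
    )

The task then reads the S3_EVENT environment variable instead of polling SQS.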
I have a Lambda which is triggered by a CloudWatch event when VPN tunnels go down or come up. I searched online but can't find a way to trigger this CloudWatch event.
I see an option for a test event, but what can I enter in there to simulate an event that the tunnel is up or down?
You can look into CloudWatch Events and Event Patterns.
Events in Amazon CloudWatch Events are represented as JSON objects. For more information about JSON objects, see RFC 7159. The following is an example event:
{
  "version": "0",
  "id": "6a7e8feb-b491-4cf7-a9f1-bf3703467718",
  "detail-type": "EC2 Instance State-change Notification",
  "source": "aws.ec2",
  "account": "111122223333",
  "time": "2017-12-22T18:43:48Z",
  "region": "us-west-1",
  "resources": [
    "arn:aws:ec2:us-west-1:123456789012:instance/i-1234567890abcdef0"
  ],
  "detail": {
    "instance-id": "i-1234567890abcdef0",
    "state": "terminated"
  }
}
Also, you can pick the required event type for your use case from the AWS CloudWatch Events event types.
I believe that in your scenario you don't need to pass any input data, as you must have built the logic to test the VPN tunnel connectivity within the Lambda. You can remove that JSON from the test event and then run the test.
If you need to pass in some information as part of the input event, then follow the approach mentioned by @Adiii.
EDIT
The question is made clearer by the comment, which says:
"But question is how will I trigger the lambda? Lets say I want to trigger it when tunnel is down? How will let lambda know tunnel is in down state?" – NoviceMe
This can be achieved by setting up a rule in CloudWatch Events to schedule the Lambda trigger at a periodic interval. More details here:
Tutorial: Schedule AWS Lambda Functions Using CloudWatch Events
Lambda does not have an invocation trigger right now that can monitor a VPN tunnel, so the only workaround is to poll the status from within the Lambda.
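A rough sketch of such polling logic inside the Lambda (the VPN connection ID is a placeholder, and notify_tunnel_down stands in for whatever alerting you build):

import boto3

ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    # Check the telemetry for each tunnel of a given VPN connection
    response = ec2.describe_vpn_connections(
        VpnConnectionIds=["vpn-0123456789abcdef0"]  # placeholder
    )
    for connection in response["VpnConnections"]:
        for tunnel in connection.get("VgwTelemetry", []):
            if tunnel["Status"] != "UP":
                notify_tunnel_down(connection["VpnConnectionId"], tunnel)  # hypothetical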
I set up an SNS notification to send me an email whenever there is a change to IAM policies. When a change occurs, CloudTrail sends a log to CloudWatch, which triggers an alarm attached to an SNS topic. More details in this link.
Here is an example of what I get by mail:
Alarm Details:
- Name: PolicyAlarm
- Description: This alarm is to monitor IAM Changes
- State Change: INSUFFICIENT_DATA -> ALARM
- Reason for State Change: Threshold Crossed: 1 datapoint [1.0 (31/08/17 09:15:00)] was greater than or equal to the threshold (1.0).
- Timestamp: Thursday 31 August, 2017 09:20:39 UTC
- AWS Account: 00011100000
Threshold:
- The alarm is in the ALARM state when the metric is GreaterThanOrEqualToThreshold 1.0 for 300 seconds.
The only relevant information here is the AWS account ID. Is there a way to also include the change itself? Who made it, when, and where? Or maybe send a little information from the CloudWatch log, like the "eventName"?
There are two ways to trigger notifications from AWS CloudTrail activity:
Configure Amazon CloudWatch Logs to look for specific strings. When found, it increments a metric. Then, create an alarm that triggers when the metric exceeds a particular value over a particular period of time. When the notification is sent, only information about the alarm is sent. OR...
Create a rule in Amazon CloudWatch Events to look for the event. Set an Amazon SNS topic as the target. When the notification is sent, full details of the event are passed through.
You should use #2, since it provides full details of the event.
Here's what I did to test:
Created an Amazon SQS queue in us-east-1 (where all IAM events take place)
Created an Amazon CloudWatch Events rule in us-east-1 with:
Service Name: IAM
Event Type: AWS API Call via CloudTrail
Specific Operations: PutUserPolicy
Edited an IAM policy
Within a short time, the event appeared in SQS. Here are the relevant bits of the event that came through:
{
  "detail-type": "AWS API Call via CloudTrail",
  "source": "aws.iam",
  "region": "us-east-1",
  "detail": {
    "eventSource": "iam.amazonaws.com",
    "eventName": "PutUserPolicy",
    "awsRegion": "us-east-1",
    "requestParameters": {
      "policyDocument": "{\n \"Version\": \"2012-10-17\",\n ... }",
      "policyName": "my-policy",
      "userName": "my-user"
    },
    "eventType": "AwsApiCall"
  }
}
I sent the message to SQS, but you could also send it to SNS to then forward via email.
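If you want to set this up programmatically rather than through the console, a sketch with boto3 could look like this (the rule name and topic ARN are placeholders, and the SNS topic's access policy must also allow events.amazonaws.com to publish to it):

import json
import boto3

events = boto3.client("events", region_name="us-east-1")

# Match IAM policy changes recorded by CloudTrail
events.put_rule(
    Name="iam-policy-changes",  # placeholder
    EventPattern=json.dumps({
        "source": ["aws.iam"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["iam.amazonaws.com"],
            "eventName": ["PutUserPolicy"],
        },
    }),
)

# Send the full event to an SNS topic that emails you
events.put_targets(
    Rule="iam-policy-changes",
    Targets=[{
        "Id": "sns-email",
        "Arn": "arn:aws:sns:us-east-1:123456789012:iam-changes",  # placeholder
    }],
)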