I have set up my SageMaker endpoint as an Async Endpoint and provided SNS topic ARNs for the SuccessTopic and ErrorTopic parameters. I am receiving success and failure messages from these SNS topics without error, but I want the failure messages to be more verbose. The failure message looks like this:
{
"awsRegion": "...",
"eventTime": "...",
"receivedTime": "...",
"invocationStatus": "Failed",
"failureReason": "ClientError: Received server error (500) from model. See the SageMaker Endpoint logs in your account for more information.",
"requestParameters": {
"endpointName": ...,
"inputLocation": ...
},
"inferenceId": "...",
"eventVersion": "1.0",
"eventSource": "aws:sagemaker",
"eventName": "InferenceResult"
}
There might be different reasons for the service to throw an error, such as a CUDA OOM error or an assertion error that I raised myself. I would love to see this information in the SNS message. However, the only way to see any additional information about the error is to check the SageMaker Endpoint logs.
Each time I receive an error from the SageMaker service, the failureReason is the same. Is there a way to customize the failureReason parameter in this message?
I have tried adding exception messages to all the exceptions I raise in the code, but the message never changed. I have no access to the SNS topic during execution. I have created an SNS client using boto3 and sent a notification before raising an assertion error, but I don't know of any way to stop the SageMaker execution without throwing an error, which automatically sends another failure message.
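The workaround described above looks roughly like this (a sketch; the topic ARN is a placeholder, and build_failure_message is a helper I made up for illustration, not any SageMaker API):

```python
import json

def build_failure_message(endpoint_name, inference_id, reason):
    """Assemble a verbose failure payload (a made-up format for illustration)."""
    return {
        "endpointName": endpoint_name,
        "inferenceId": inference_id,
        "failureReason": reason,
    }

def report_failure_and_raise(reason, endpoint_name, inference_id, topic_arn):
    # boto3 is imported lazily so this module loads without AWS installed.
    import boto3

    payload = build_failure_message(endpoint_name, inference_id, reason)
    boto3.client("sns").publish(TopicArn=topic_arn, Message=json.dumps(payload))
    # Raising still aborts the invocation, so SageMaker will additionally
    # publish its own generic failure notification to the ErrorTopic.
    raise RuntimeError(reason)
```

The drawback, as noted in the question, is that this custom notification arrives alongside SageMaker's generic one rather than replacing it.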
I am trying to create a CloudFormation stack using AWS Events to trigger an API call on a schedule. Most of the stack is working; however, the AWS::Events::Connection is failing to create and I am not sure why.
This is the CF snippet that is failing (note: the API doesn't have any authentication yet, but CloudFormation requires the AuthParameters property):
"CronServerApiConnection": {
"Type": "AWS::Events::Connection",
"Properties": {
"Name": "api-connection",
"AuthorizationType": "API_KEY",
"AuthParameters": {
"ApiKeyAuthParameters": {
"ApiKeyName": "foo",
"ApiKeyValue": "bar"
}
}
}
},
In the CloudFormation console this fails to create with the following error:
Resource handler returned message: "Error occurred during operation 'AWS::Events::Connection'." (RequestToken: xxxxxxxxxxxxxxxxx, HandlerErrorCode: GeneralServiceException)
I can't for the life of me figure this one out. From what I can see, my CF snippet matches exactly what AWS specifies in their docs here: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-events-connection.html
I ran into this issue myself a few weeks ago, and while looking for an answer I found this question unresolved, so I thought I would share the solution. The Events API is not descriptive at all with its errors; in my case the issue was permissions-related. While it is not clear in the documentation, AWS::Events::Connection needs permissions not only for the events API but also for the secretsmanager API, since it creates some secrets for you under the hood. I solved this by adding the relevant API permissions to the role creating the stack, scoped by resource to avoid security issues, something like:
{
"Effect": "Allow",
"Action": [
"events:*",
"secretsmanager:*"
],
"Resource": [
"arn:aws:secretsmanager:<your region>:<your-account-id>:secret:events!connection/<yoursecretnameprefix>-*"
]
}
I will leave the addition of the events resource to you, but it is essentially the same: just scope it by the ARN of your resource. The above is just an example; please replace the placeholders with the correct values.
I am trying to publish an SMS from the AWS SNS console. It shows a success result, but the message is never received; every request is marked as a failure in the console.
The response when I publish a text message:
SMS message published to phone number +91XXXXXXXXXX successfully.
Message ID: e3d2bc39-2792-5b2e-adcc-e4733a800795
I was facing the same issue and found that I needed to raise a support ticket to use SNS SMS.
Below is the link for raising a support ticket; explain your use case for SNS SMS there:
SNS support ticket link
You can activate Delivery status logging.
From Viewing Amazon CloudWatch metrics and logs for SMS deliveries - Amazon Simple Notification Service:
On the Text messaging (SMS) page, in the Text messaging preferences section, choose Edit.
On the Edit text messaging preferences page, in the Delivery status logging section, do the following:
Sample rate: 100%
Service role: Create a new service role (or choose an existing one if available)
You can then send an SMS directly from the Text messaging (SMS) page. It will show a Delivery statistics graph to indicate success/failure.
Also, for each message, there will be a log entry in Amazon CloudWatch Logs (go to CloudWatch / Logs / then choose the SNS log). It will look similar to this:
{
"notification": {
"messageId": "xxx",
"timestamp": "2020-12-09 08:40:19.536"
},
"delivery": {
"phoneCarrier": "Optus Mobile Pty Ltd",
"mnc": 2,
"numberOfMessageParts": 1,
"destination": "+61455555555",
"priceInUSD": 0.03809,
"smsType": "Promotional",
"mcc": 505,
"providerResponse": "Message has been accepted by phone carrier",
"dwellTimeMs": 524,
"dwellTimeMsUntilDeviceAck": 2453
},
"status": "SUCCESS"
}
This log will give you the most detail of whether an SMS was sent to the phone carrier, so that you can determine where it might be failing.
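If you want to consume these log entries programmatically, here is a small sketch (the summarize_delivery helper is my own; the field names follow the sample entry above):

```python
import json

def summarize_delivery(log_event_message):
    """Pull the delivery outcome out of an SNS SMS delivery status log entry."""
    entry = json.loads(log_event_message)
    delivery = entry.get("delivery", {})
    return {
        "status": entry.get("status"),
        "destination": delivery.get("destination"),
        "providerResponse": delivery.get("providerResponse"),
    }
```

You could run each CloudWatch Logs event message through this and alert whenever status is not "SUCCESS".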
tl;dr: I'm trying to figure out what about the messages below could cause SQS to fail to process them and trigger the redrive policy which sends them to a Dead Letter Queue. The AWS documentation for DLQs says:
Sometimes, messages can’t be processed because of a variety of possible issues, such as erroneous conditions within the producer or consumer application or an unexpected state change that causes an issue with your application code. For example, if a user places a web order with a particular product ID, but the product ID is deleted, the web store's code fails and displays an error, and the message with the order request is sent to a dead-letter queue.
The context here is that my company uses a Cloud Formation setup to run a virus scanner against files which users upload to our S3 buckets.
The buckets have bucket events which publish PUT actions to an SQS queue.
An EC2 instance subscribes to that queue and runs files which get uploaded to those buckets through a virus scanner.
The messages which enter the queue come from S3 bucket events, so that seems to rule out "erroneous conditions within the producer." Could an SQS redrive policy be triggered if a subscriber to the queue fails to process the message?
This is one of the messages which was sent to the DLQ (I've changed letters and numbers in each of the IDs):
{
"Records": [
{
"eventVersion": "2.1",
"eventSource": "aws:s3",
"awsRegion": "us-east-1",
"eventTime": "2019-09-30T20:21:13.762Z",
"eventName": "ObjectCreated:Put",
"userIdentity": {
"principalId": "AWS:AIDAIQ6ZKWSHYT34HC0X2"
},
"requestParameters": {
"sourceIPAddress": "52.161.96.193"
},
"responseElements": {
"x-amz-request-id": "9F500CA65B966D84",
"x-amz-id-2": "w1R6BLPAI68na+xNssfdscQjfOQk56gmof+Bp4nF/rY90jBWnlqliHLrnwHWx20329clJckCIzhI="
},
"s3": {
"s3SchemaVersion": "1.0",
"configurationId": "VirusScan",
"bucket": {
"name": "uploadcenter",
"ownerIdentity": {
"principalId": "A2CSGHOAZOCNTU"
},
"arn": "arn:aws:s3:::sharingcenter"
},
"object": {
"key": "Packard/f43edeee-6d58-118f-f8b8-4ec57f9cdb54Transformers/Transformers.mp4",
"size": 1317070058,
"eTag": "4a828a976dbdfe6fe1931f8e96437e2",
"sequencer": "005D20633476B28AE7"
}
}
}
]
}
I've been puzzling over this message and similar ones, trying to figure out what may have triggered the redrive policy. Could it have been caused by the EC2 instance failing to process the message? There's nothing in the Ruby script on the instance which would publish a message to the DLQ. Each of these files is uncommonly large. Is it possible that something in the process choked on a file because of its size, and that caused the redrive? If an EC2 failure couldn't have caused the redrive, what is it about the message that would cause SQS to send it to the DLQ?
Amazon SQS is typically used as follows:
Something publishes a message to a queue (in your case, an S3 PUT event)
Worker(s) request a message from the queue and process the message
The message becomes "invisible" so that other workers cannot see it
If the message was processed successfully, the worker tells SQS to delete the message
If the worker does not respond within the invisibility timeout period, then SQS puts the message back on the queue
If a message fails more than a configured number of times (that is, if the workers do not delete the message), then the message is moved to a nominated Dead Letter Queue
Please note that there are no "subscribers" to SQS queues. Rather, applications call the SQS API and request a message.
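With boto3, that request/process/delete cycle looks roughly like this (a sketch; the queue URL and the message handler are placeholders):

```python
def poll_queue(sqs, queue_url, handle_message):
    """Receive one batch of messages, process each, and delete on success."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling
    )
    for message in response.get("Messages", []):
        handle_message(message["Body"])
        # Deleting tells SQS the message was processed; otherwise it becomes
        # visible again after the timeout and can eventually reach the DLQ.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
```

A worker would call this in a loop, with `sqs = boto3.client("sqs")` passed in.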
The fact that you are getting messages in the DLQ indicates that the worker (virus checker) is not deleting the message within the invisibility period.
It is possible that the virus checker requires more time to scan large files, in which case you could increase the invisibility timeout on the queue to give it more time.
The workers can also signal back to SQS that they are still working on the message, which will refresh the timeout. This will need some modification to the virus checker to send such a signal at regular intervals.
Bottom line: The worker (virus checker) is not completing the task within the timeout period.
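The "still working" signal mentioned above is the ChangeMessageVisibility API call. A minimal heartbeat sketch (the interval and timeout values are illustrative, not recommendations):

```python
import threading

def extend_visibility(sqs, queue_url, receipt_handle, timeout):
    # Resets the invisibility clock so SQS does not redeliver mid-scan.
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=timeout,
    )

def start_heartbeat(sqs, queue_url, receipt_handle, interval=60, timeout=120):
    """Extend the message's visibility timeout on a background thread."""
    stop = threading.Event()

    def beat():
        while not stop.wait(interval):
            extend_visibility(sqs, queue_url, receipt_handle, timeout)

    threading.Thread(target=beat, daemon=True).start()
    return stop  # call stop.set() once the scan finishes
```

In the virus-scanner case, the worker would start the heartbeat when it begins scanning a file and stop it before deleting the message.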
I have created an application where I add users to Cognito from an AWS Lambda function and also map the users to a group.
I didn't get any errors while creating users in AWS Cognito.
I have configured AWS Cognito to send an SMS when a new user is created.
The SMS is not received by some numbers, but checking the logs it is marked as delivered.
Please have a look at the log below, which reports the message as delivered even though the user never actually received it.
Cognito Region: US WEST(Oregon)
{
"notification": {
"messageId": "8e7158eb-64dd-53f6-82aa-xxxxxxxxxxxx", // I have replaced original id characters by x
"timestamp": "2019-06-04 16:18:29.681"
},
"delivery": {
"phoneCarrier": "AT&T",
"mnc": 180,
"destination": "+1310600xxxx", // I have replaced last 4 digit with x here to show code.
"priceInUSD": 0.00645,
"smsType": "Transactional",
"mcc": 311,
"providerResponse": "Message has been accepted by phone",
"dwellTimeMs": 381,
"dwellTimeMsUntilDeviceAck": 890698
},
"status": "SUCCESS"
}
(Screenshots of the AWS Cognito "MFA and verifications" and "Message customizations" settings pages omitted.)
Several things led me to believe that this "issue" only appears to be an issue because of AWS's poor logging and response mechanism; the failure and its reason should be indicated in the response.
After trying to isolate the problem, I concluded that submitting a request for an SNS spending limit increase should solve it.
You are right that there is no indication that exceeding the limit is the true cause, though multiple posts on the subject point to that solution.
I created an API with AWS API Gateway and Lambda, the same as https://github.com/aws-samples/simple-websockets-chat-app. But the API is not working. I get an error when I try to connect. The message is: "WebSocket connection to 'wss://b91xftxta9.execute-api.eu-west-1.amazonaws.com/dev' failed: Error during WebSocket handshake: Unexpected response code: 500"
My connection code:
var ws = new WebSocket("wss://b91xftxta9.execute-api.eu-west-1.amazonaws.com/dev");
ws.onopen = function (d) {
    console.log(d);
};
Try adding $context.error.validationErrorString and $context.integrationErrorMessage to the logs for the stage.
I added a bunch of stuff to the Log Format section, like this:
{ "requestId":"$context.requestId", "ip": "$context.identity.sourceIp",
"requestTime":"$context.requestTime", "httpMethod":"$context.httpMethod",
"routeKey":"$context.routeKey", "status":"$context.status",
"protocol":"$context.protocol", "errorMessage":"$context.error.message",
"path":"$context.path",
"authorizerPrincipalId":"$context.authorizer.principalId",
"user":"$context.identity.user", "caller":"$context.identity.caller",
"validationErrorString":"$context.error.validationErrorString",
"errorResponseType":"$context.error.responseType",
"integrationErrorMessage":"$context.integrationErrorMessage",
"responseLength":"$context.responseLength" }
In early development this allowed me to see this type of error:
{
"requestId": "QDu0QiP3oANFPZv=",
"ip": "76.54.32.210",
"requestTime": "21/Jul/2020:21:37:31 +0000",
"httpMethod": "POST",
"routeKey": "$default",
"status": "500",
"protocol": "HTTP/1.1",
"integrationErrorMessage": "The IAM role configured on the integration
or API Gateway doesn't have permissions to call the integration.
Check the permissions and try again.",
"responseLength": "35"
}
Try using wscat -c wss://b91xftxta9.execute-api.eu-west-1.amazonaws.com/dev in a terminal. This should allow you to connect to it. If you don't have wscat installed, just do npm install -g wscat.
To get more details, enable logging for your API: Stages -> Logs/Tracing -> CloudWatch Settings -> Enable CloudWatch Logs. Then send a connection request again and monitor your API's logs in CloudWatch. In my case, I had the following error:
Execution failed due to configuration error: API Gateway does not have permission to assume the provided role {arn_of_my_role}
So I added API Gateway to my role's trust relationships, as mentioned here, and it fixed the problem.
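For reference, the trust policy that allows API Gateway to assume a role is the standard trust relationship document with the apigateway.amazonaws.com service principal (a sketch; adapt it to your role):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "apigateway.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```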