Mysterious 500 error with AWS Lambda; unable to debug - amazon-web-services

I have an API that I host using Lambda (Node.js) with API Gateway. I'm using Serverless to deploy.
Generally things have been fine, but while I was working on a specific function today, I started to receive HTTP 500 errors when hitting the endpoint. However, while there were still API Gateway access logs for the endpoint, there were no CloudWatch logs for the Lambda functions being invoked. I was able to verify that the authorizer was being hit successfully and not returning any error (if it had been, the response would have been a 401). After I used CLI tools to invoke the function from the command line, the 500 error went away and I was able to hit the endpoints successfully again.
Has anyone ever run into this before? If I'm missing a debug step, I would really like to know. It was really concerning that my API could be generating 500 errors with no paper trail to help me understand what was happening.

You can check your role and permissions; this link could help: https://aws.amazon.com/premiumsupport/knowledge-center/api-gateway-lambda-stage-variable-500/
You can also debug further with X-Ray: https://docs.aws.amazon.com/lambda/latest/dg/services-xray.html
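One permissions problem worth ruling out is a Lambda resource policy that does not allow API Gateway to invoke the function. In that case the request would fail before the function runs, which would be consistent with access logs but no function logs. A minimal Python sketch, assuming you have the policy document as a JSON string (for example, the `Policy` field returned by `aws lambda get-policy`); the helper name `allows_apigateway_invoke` is made up for illustration:

```python
import json


def allows_apigateway_invoke(policy_json: str) -> bool:
    """Return True if some Allow statement grants lambda:InvokeFunction
    to the apigateway.amazonaws.com service principal."""
    policy = json.loads(policy_json)
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        service = principal.get("Service") if isinstance(principal, dict) else None
        # "Service" and "Action" may each be a single string or a list.
        services = [service] if isinstance(service, str) else (service or [])
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        invoke_allowed = any(a in ("lambda:InvokeFunction", "lambda:*") for a in actions)
        if "apigateway.amazonaws.com" in services and invoke_allowed:
            return True
    return False
```

This only checks the principal and action. It does not evaluate any `Condition` block (such as a source ARN restriction), so a True result is a necessary check, not proof that a specific stage or route is permitted.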

Related

Troubleshooting error 503 on Google Cloud Run

I am running a container on Google Cloud Run. For each request a new instance is started. The requests need around 15 minutes to be processed. I modified the default timeout and everything is working fine. But sometimes, for around 10% of the requests, I get this error:
The request failed because either the HTTP response was malformed or
connection to the instance had an error. Additional troubleshooting
documentation can be found at:
https://cloud.google.com/run/docs/troubleshooting#timeout-503
When I re-run the exact same request, I get no errors. I tried putting try/catch blocks everywhere, but I am not able to figure out what is happening. I checked the CPU and memory usage, and everything looks fine; the maximum reached is 50%. Any advice on how I can get more information about the problem?

Media Tailor ad returning 504 error in AWS

I'm using AWS Media Tailor to test an ad inserting demo. The demo page is this one: https://github.com/aws-samples/aws-media-services-simple-vod-workflow/tree/master/12-AdMarkerInsertion.
When I load my manifest into a THEOplayer, I always get a 504 error. My manifest URL is: https://ebf348c58b834d189af82777f4f742a6.mediatailor.us-west-2.amazonaws.com/v1/master/3c879a81c14534e13d0b39aac4479d6d57e7c462/MyTestCampaign/llama.m3u8.
I have also tried with: https://ebf348c58b834d189af82777f4f742a6.mediatailor.us-west-2.amazonaws.com/v1/master/3c879a81c14534e13d0b39aac4479d6d57e7c462/MyTestCampaign/llama_with_slates.m3u8.
The specific error is:
{"message":"failed to generate manifest: Unable to obtain template playlist. sessionId:[c915d529-3527-4e37-89e0-087e393e75de]"}
I have read about this error: https://docs.aws.amazon.com/mediatailor/latest/ug/playback-errors-examples.html
But I don't know how to fix it.
Maybe I did something wrong or do I need a quote in AWS?
Any idea?
Thanks for the inquiry!
The following example shows the result when a timeout occurs between AWS Elemental MediaTailor and either the ad decision server (ADS) or the origin server.
An HTTP 504 error is a Gateway Timeout: an upstream resource was unresponsive and prevented the request from completing successfully. Since MediaTailor is the one returning the HTTP 504, either the ADS or the origin failed to respond within the timeout period.
To troubleshoot this, you need to determine which dependency is failing to respond to MediaTailor and correct it. Typically the issue is the ADS failing to respond to a VAST request made by MediaTailor, which you can confirm by reviewing your CloudWatch logs.
https://docs.aws.amazon.com/mediatailor/latest/ug/monitor-cloudwatch-ads-logs.html
Make sure that your ADS follows the guidelines listed below for integrating with MediaTailor.
https://docs.aws.amazon.com/mediatailor/latest/ug/vast-integration.html
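To get a rough idea of which dependency is slow, you could time a direct request to the origin playlist and to the ADS from your own machine. The following is a minimal Python sketch under that assumption; `time_request` is a hypothetical helper, the 5-second default timeout is arbitrary (it is not MediaTailor's own timeout), and a fast response from your network does not guarantee a fast response to MediaTailor:

```python
import time
import urllib.error
import urllib.request


def time_request(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (status, seconds) for a GET to url.

    status is the HTTP status code, or None if the request timed out
    or the connection failed. opener is injectable for testing.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        status = exc.code
    except OSError:
        # Covers URLError, connection failures, and socket timeouts.
        status = None
    return status, time.monotonic() - start
```

Calling it once with your origin playlist URL and once with your ADS URL and comparing the elapsed times can show which one is unresponsive; the CloudWatch logs remain the authoritative record of what MediaTailor itself saw.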

Webhook call failed: URL_REJECTED error in DialogFlow v2 Fulfillments

Error description
Upon calling DialogFlow v2 detectIntent API, we randomly get an internal error with status code 13:
Webhook call failed. Fetch failure with no HTTP status code. Status: State: URL_REJECTED Reason: 67
This error seems to happen randomly. The same request can succeed or fail.
Interestingly, the service has been deteriorating since Friday 23rd August 2019, to the point of failing on almost every call today.
Our investigation
We didn't find anything at all about URL_REJECTED with DialogFlow or Google on the internet.
But we found the meaning of the status code 13 on this page:
Internal errors. This means that some invariants expected by the underlying system have been broken. This error code is reserved for serious errors.
We also checked that we aren't banning Google IPs, and that our load balancing is not messed up (we thought of that since it would be consistent with random failures).
The webhook is up and running, and we can call it ourselves. The problem seems to happen in Google's infra, as the error code 13 seems to show.
(I am answering immediately because we fixed it before posting the question. I posted it anyway because it may be useful to others.)
The problem was that the webhook was called using http.
Setting https solved the problem.
It seems that Google activated a policy on their servers of rejecting insecure webhook calls.
It may have been deployed gradually on their cluster, which would explain the gradual degradation.
We know that we should have migrated to https a long time ago, but still we didn't find any mention of the application of this policy on the net.
Thank you for posting this. I came across the same issue. Changing my webhook to HTTPS seems to fix the problem.
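Since the fix was purely a matter of URL scheme, a configuration check can catch it before deployment. A minimal Python sketch, assuming the webhook URL is available as a string in your own config; `is_https_webhook` is a hypothetical helper, not part of any DialogFlow client library:

```python
from urllib.parse import urlparse


def is_https_webhook(url: str) -> bool:
    """Return True only for absolute URLs that use the https scheme."""
    parsed = urlparse(url)
    # urlparse lowercases the scheme; netloc is empty for relative URLs.
    return parsed.scheme == "https" and bool(parsed.netloc)
```

For example, `is_https_webhook("http://example.com/webhook")` returns False, while the same URL with `https://` returns True.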

Authentication with Cognito - where to find logs

We have 2 React Native apps that use AWS Cognito for authentication. We use the react-native-aws-cognito-js library in our code. The apps were working fine until the last 2 days. Now they are experiencing an intermittent "Internal Server Error".
How can I find more information about this error? Is there any tool that can help us pinpoint the cause?
Update
From CloudTrail, each API call has an event "CreateNetworkInterface". Many of such API calls have error code "Client.NetworkInterfaceLimitExceeded". What is the cause and solution to this?
According to this AWS Doc (in Chinese), CloudWatch will not write to log when error is due to insufficient IP/ENI. That explains the increase in error number but no logs in CloudWatch.
Update 2
We found a scheduled Lambda job which may have exhausted the IP addresses. We stopped the batch job, but we still can't have too many users log in to the server because of the "Client.NetworkInterfaceLimitExceeded" error. I noticed that there are many "CreateNetworkInterface" events and few "DeleteNetworkInterface" events. How can I "clean up / reset" all network interfaces in the VPC?
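The create/delete imbalance described above can be quantified from exported CloudTrail data. A minimal Python sketch, assuming `records` is a list of dicts parsed from the `Records` array of CloudTrail log files, each with an `eventName` field and, for failed calls, an `errorCode` field; `summarize_eni_events` is a hypothetical helper name:

```python
from collections import Counter

ENI_EVENTS = ("CreateNetworkInterface", "DeleteNetworkInterface")


def summarize_eni_events(records):
    """Count successful ENI create/delete calls and tally error codes
    for failed ones. Events other than the two ENI calls are ignored."""
    created = 0
    deleted = 0
    errors = Counter()
    for record in records:
        name = record.get("eventName")
        if name not in ENI_EVENTS:
            continue
        error_code = record.get("errorCode")
        if error_code:
            # A failed call did not create or delete anything.
            errors[error_code] += 1
        elif name == "CreateNetworkInterface":
            created += 1
        else:
            deleted += 1
    return {"created": created, "deleted": deleted, "errors": dict(errors)}
```

A large gap between `created` and `deleted` over the same time window would support the suspicion that interfaces are being allocated faster than they are released.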
Short answer: CloudTrail.
Long answer with a suggestion
Assuming your application code is fine, the most likely cause of your 500 error is Cognito's default limits (e.g., number of calls per user): https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html.
AWS suggests using CloudTrail for logging API calls.
However, to confirm that the limits are the cause, I would first add some logging around the API call yourself. In development you could then hit your app/API with a high number of calls, and most likely you will see the 500 error caused by the limits.
You could do the following in the terminal:
for i in `seq 1 1000`; do curl --cookie SecureCookie=TokenValueFromAWS http://localhost:desirablePort/SecuredPath; done
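The loop above prints response bodies but does not summarize how many calls failed. A minimal Python sketch of the same idea that tallies status codes instead; `fetch_status` and `tally_statuses` are hypothetical helper names, and the URL and cookie value are the same placeholders as in the curl command:

```python
from collections import Counter
import urllib.error
import urllib.request


def fetch_status(url, cookie, timeout=10.0):
    """Send one GET with a Cookie header and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"Cookie": cookie})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as HTTPError; keep their code.
        return exc.code


def tally_statuses(fetch, attempts):
    """Call fetch() the given number of times and count each result."""
    counts = Counter()
    for _ in range(attempts):
        counts[fetch()] += 1
    return dict(counts)
```

Usage would look like `tally_statuses(lambda: fetch_status(url, "SecureCookie=TokenValueFromAWS"), 1000)`, with `url` set to your own local endpoint. A result dominated by 200 with a cluster of 500 entries appearing after some number of calls would be consistent with a rate limit.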

AWS Lambda cloudwatch logs throw "Failed to load events" error

I have a lambda function with the following permissions
Whenever the Lambda function is triggered, I see a log file being added to the log stream in CloudWatch.
When I try to open any of the logs, it shows "Failed to load events" with an
"Unexpected error loading events" message. Please help with fixing the issue.
I had the same error. The issue was that my 'Authorization' header was being modified. In my case, I had added it with a Chrome extension for testing purposes.
For what it's worth, I saw the same error using Firefox but got the expected results using Chrome. So use Chrome to view CloudWatch.
This just happened to me. I suspect it had something to do with AWS's recent DNS issues. I had to go into my router's settings and change from dynamic DNS to static DNS. I used servers 1.0.0.1 and 1.1.1.1. Hope this helps someone!