Django's infinite streaming response logs 500 in Apache logs

I have a Django+Apache server, and there is a view with an infinite streaming response:

from json import dumps
from traceback import print_exc

from django.http import HttpResponse, StreamingHttpResponse

def my_view(request):
    try:
        return StreamingHttpResponse(map(
            lambda x: f"{dumps(x)}\n",
            data_stream(...)  # yields dicts forever, every couple of seconds
        ))
    except Exception as e:
        print_exc()
        return HttpResponse(dumps({
            "success": False,
            "reason": ERROR_WITH_CLASSNAME.format(e.__class__.__name__)
        }), status=500, content_type="application/json")
When the client closes the connection to the server, there is no cleanup to be done: data_stream will just yield one more message, which won't get delivered. No harm done if that message is yielded and not received, as there are no side effects, and the overhead from processing that extra message is negligible on our end.
However, after that last message fails to deliver, Apache logs a 500 response code (for 100% of requests). It's not getting caught by the except block, because print_exc doesn't get called (there are no entries in the error log), so I'm guessing this is Apache failing to deliver the response from Django and switching to 500 itself.
These 500 errors are triggering false-positive alerts in our monitoring system, and it's difficult to differentiate an error caused by a client disconnect from an error in the data_stream logic.
Can I override this behavior to log a different status code in the case of a client disconnect?

From what I understand about StreamingHttpResponse, any exceptions raised while the response is being streamed are not propagated further. This has to do with how the WSGI server works: if you started handling the exception and stole control, the server would not be able to finish the HTTP response. So the error is handled by the server itself and printed to the terminal. If you attach a debugger and watch how the exception is handled, you will find the line in wsgiref/handlers.py where your exception is absorbed and taken care of.
I think it's in this file: https://github.com/python/cpython/blob/main/Lib/wsgiref/handlers.py
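One workaround to consider (a sketch, not something this answer prescribes): the status code Apache records is likely out of Django's hands once the headers have been sent, but you can make your own logs tell a client disconnect apart from a genuine failure by wrapping the stream in a generator. This assumes the server closes the response iterator when the client goes away, which surfaces as GeneratorExit inside the generator; data_stream is the same callable as in the question.

import logging
from json import dumps

from django.http import StreamingHttpResponse

logger = logging.getLogger(__name__)

def stream_json(source):
    try:
        for item in source:
            yield f"{dumps(item)}\n"
    except GeneratorExit:
        # The response was closed because the client went away: expected, not an error.
        logger.info("client disconnected from stream")
        raise
    except Exception:
        # A genuine failure inside data_stream: this is the case that should alert.
        logger.exception("data_stream failed")
        raise

def my_view(request):
    return StreamingHttpResponse(stream_json(data_stream(...)))

Any write error from Apache failing to send a chunk still happens outside the generator, so the 500 in the access log will remain; this only gives your monitoring a signal it can filter on.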

Related

Troubleshooting error 503 on Google Cloud Run

I am running a container on Google Cloud Run. For each request a new instance is started. The requests need around 15 minutes to be processed. I modified the default timeout and everything is working fine. But sometimes, for around 10% of the requests, I get an error:
The request failed because either the HTTP response was malformed or
connection to the instance had an error. Additional troubleshooting
documentation can be found at:
https://cloud.google.com/run/docs/troubleshooting#timeout-503
When I re-run the exact same request, I get no errors. I tried to put try/catch everywhere, but I am not able to figure out what is happening. I checked the CPU and memory usage... everything looks fine, the maximum reached is 50%. Any advice on how I can get more information about the problem?

"LAMBDA_RUNTIME" Error on high-volume Lambda Function

I'm currently using a Lambda function written in JavaScript that is set up with an SQS event source to automatically pull messages from an SQS queue and do some basic processing on the message contents. I cannot show the code, but the summary of the Lambda function's execution is basically:
For each message in the batch it receives as part of the event:
It parses the body, which is a JSON string, into a Javascript object.
It reads an object from S3 that is referenced in that parsed object, using getObject.
It puts a record into a DynamoDB table using put.
If there were no errors, it deletes the individual SQS message that was processed from the Queue using deleteMessage.
This SQS queue is high-volume and receives messages in bulk, regularly building up a backlog of millions of messages. The Lambda is normally able to scale to process hundreds of thousands of messages concurrently. This solution has worked well for me with other applications in the past, but I'm now encountering the following intermittent error that reliably begins to appear as the Lambda scales up:
[ERROR] [#############] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 400.
I've been unable to find any information anywhere about what this error means and what causes it. There appears to be no discernible pattern as to which executions encounter it. The function is usually able to run for a brief period without encountering the error and scale to expected levels. But then the error starts to appear quite suddenly and completely destroys the Lambda throughput by forcing it to auto-scale down.
Does anyone know what this "LAMBDA_RUNTIME" error means and what might cause it? My Lambda Function runtime is Node v12.
Your function is being invoked asynchronously, so when it finishes it signals the caller whether it was successful.
You should have an error some milliseconds earlier, probably an unhandled exception that is not being logged. If that's the case, your function ends without knowing about the exception and tries to post a success response.
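As an illustration of that idea (sketched in Python for consistency with the rest of this post, although the question's runtime is Node; process_record is a hypothetical stand-in for the actual per-message work): catch and log whatever the handler body raises, so the real failure shows up in the logs instead of only as the runtime's failed attempt to post a response.

import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def process_record(record):
    # Hypothetical placeholder for the parse / S3 get / DynamoDB put / delete steps.
    return json.loads(record["body"])

def handler(event, context):
    try:
        for record in event.get("Records", []):
            process_record(record)
    except Exception:
        logger.exception("unhandled error while processing SQS batch")
        raise  # re-raise so the failed messages become visible again for retry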
I have this error too, only the one I get is:
[ERROR] [1638918279694] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 413.
I went to the Lambda function in the AWS console and ran the test with a custom event I built, and the error I got there was:
{
    "errorMessage": "Response payload size exceeded maximum allowed payload size (6291556 bytes).",
    "errorType": "Function.ResponseSizeTooLarge"
}
So this is the actual error, which CloudWatch doesn't return but the testing section of the Lambda function console does.
I think I'll have to write the info to an S3 file or something instead, but that's another matter.
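A rough sketch of that idea, assuming Python and boto3 (the bucket name and the do_the_work helper are made up): write the oversized result to S3 and return only a small reference, so the response stays under the ~6 MB payload limit.

import json
import uuid

import boto3

s3 = boto3.client("s3")
RESULTS_BUCKET = "my-results-bucket"  # hypothetical bucket name

def do_the_work(event):
    # Placeholder so the sketch is self-contained; this stands in for whatever
    # produces the large payload.
    return {"echo": event}

def handler(event, context):
    result = do_the_work(event)

    key = f"results/{uuid.uuid4()}.json"
    s3.put_object(
        Bucket=RESULTS_BUCKET,
        Key=key,
        Body=json.dumps(result).encode("utf-8"),
        ContentType="application/json",
    )

    # Return only a pointer; the payload itself lives in S3.
    return {"result_bucket": RESULTS_BUCKET, "result_key": key}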

API Gateway occasionally spikes 5XX errors in Production

Our API Gateway and Lambdas are regularly used and work just fine most of the time, however we see spikes in 5XX errors now and then which causes a spike in customer complaints and other issues. When I look at the logs during this time I see a flood of the following error:
Execution failed due to configuration error: Malformed Lambda proxy response
There are no other details beyond this. After 10 or 15 minutes it will go away, along with the customer complaints. I've read that it may happen if you exceed your concurrency limit, but looking at the dashboard, it doesn't look like we ever go above 150 concurrent executions.
The endpoints being hit work consistently as well, aside from these random spikes in 5XXs.
What else might be causing this inconsistency?
I've been looking through logs to try and get this figured out. I have made the logs as verbose as possible and there is nothing there. We'll have a normal call with a success response, then a few minutes later this error comes up with no other logging, just the error alone. Then a few minutes after that we have logs starting for the next successful call.
10:25:42 Successfully completed execution
10:25:42 Method completed with status: 200
10:42:01 Execution failed due to configuration error: Malformed Lambda proxy response
12:21:21 Successfully completed execution
12:21:21 Method completed with status: 200
Logging can't go further because the lambdas are never even executed. So we have no details on the payload sent to it, or any internal logging for the call, etc. It just immediately fails at the API Gateway level.
Edit: We still get these spikes, but we are working on splitting the lambdas out more. We have an ExpressJS app that handles the lion's share of all requests, so we are breaking more off, especially high-traffic requests, into their own lambdas to see if this helps, in case there is an issue where a container gets too backlogged or times out because it was handling long-running requests (that take upwards of 20s) while also being hammered by requests that finish in under 500 ms.
The other theory is that maybe an error gets triggered somewhere that kills the process, or something else leaves that container in a bad state until it gets destroyed and respawned, since these spikes appear and then go away within a few minutes. Breaking the lambdas up more should reduce the odds of an error from one request cascading and impacting all other requests.
We also increased the resources of the lambda to see if that would help it handle so many requests.
This usually happens when there is a timeout with your call, i.e. when there is a delay in your Lambda execution.
If you are accessing an external resource such as RDS, or making an external network call, wrap it in a promise and apply a timeout. This way you can identify which resource has a bottleneck or is taking a long time to execute.
exports.handler = function(event, context, callback) {
    var response = {}; // set the response object
    var err = "An error occurred";
    setTimeout(function () {
        callback(err, response);
    }, 3000); // 3000 ms is the timeout

    // Actual code here
};
Also, check for any missing callbacks. That will also cause this issue.
Hope this helps.

Google pubsub 88% of requests come back as 503

Why do Pub/Sub requests seem to trigger such a high number of 503 errors? Is this something common? It seems other people see something similar, but the majority of my requests end up that way.
Similar to
Google Pubsub: UNAVAILABLE: The service was unable to fulfill your request
Catch error code from GCP pub/sub
This is expected behavior. Streaming pull, which is used by the client libraries, creates a bidirectional stream for receiving messages and sending back acknowledgements. These streams stay open for long periods of time and don't close with a successful response code when messages are received; they terminate with an error condition when the stream disconnects, perhaps due to a restart on the part of the server receiving the request or because of a brief network blip. Therefore, even if you are receiving messages successfully, you'll still see error response codes for the streams themselves. The new streaming pull docs address this question directly.
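For context, a minimal streaming-pull subscriber looks roughly like the sketch below (using the Python client library; the project and subscription names are made up). The point is that the long-lived stream opened by subscribe() is what terminates with those error codes, independently of the individual messages it delivers.

from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical project and subscription names.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def callback(message):
    print(f"Received {message.data!r}")
    message.ack()

# subscribe() opens the long-lived bidirectional stream described above.
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)

with subscriber:
    try:
        # Block while messages are handled in the background; the underlying
        # stream may be torn down and re-opened with an error status even
        # while messages are flowing normally.
        streaming_pull_future.result(timeout=60)
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()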

What to do when Celery broker is down?

I have a Celery server with a RabbitMQ broker. I use it to run background tasks in my Django project.
In one of my views a signal is triggered which then calls a celery task like this:
create_something.delay(pk)
The task is defined like this:
@task
def create_something(donation_pk):
    # do something
Everything works great, but:
If RabbitMQ is down when I am calling the task, no error is thrown during the create_something.delay(pk) call, but the view throws this error:
[Errno 111] Connection refused
(The stack trace is kind of useless, I think this is because of the signals used)
The question now is: How can I prevent this error? Is there a possibility to perform retries of the create_something.delay(pk) when the broker is down?
Thanks in advance for any hints!
Celery tasks have a .run() method, which will execute the task as if it were part of the normal code flow.
create_something.run(pk)
You could catch the exception and execute .run() if needed.
Is there a possibility to perform retries of the create_something.delay(pk) when the broker is down?
The exception thrown when you call the .delay() method and you cannot connect can be caught just like any other exception:
try:
    foo.delay()
except <whatever exception is actually thrown>:
    # recover
You could build a loop around this to retry but you should take care not to keep the request up for very long. For instance, if it takes a whole second for your connectivity problem to get resolved, you don't want to hold up the request for a whole second. An option here may be to abort quickly but use the logging infrastructure so that an email is sent to the site administrators. A retry loop would be the last thing I'd do once I've identified what causes the connectivity issue and I have determined it cannot be helped. In most cases, it can be helped, and the retry loop is really a bandaid solution.
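As a sketch of that trade-off (not something this answer prescribes as the right fix): a small, bounded retry that gives up quickly and falls back to logging. It assumes the broker-connection failure surfaces as kombu.exceptions.OperationalError, which is what recent Celery versions raise from .delay(); substitute whatever exception you actually observe.

import logging
import time

from kombu.exceptions import OperationalError

logger = logging.getLogger(__name__)

def delay_with_retry(task, *args, attempts=3, pause=0.1, **kwargs):
    # Try to enqueue the task a few times, then give up quickly and just log.
    for attempt in range(1, attempts + 1):
        try:
            return task.delay(*args, **kwargs)
        except OperationalError:
            if attempt == attempts:
                # Don't hold the request hostage; record the failure and move on.
                logger.exception("broker unreachable, dropping %s%r", task.name, args)
                return None
            time.sleep(pause)

# Usage with the task from the question:
# delay_with_retry(create_something, pk)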
How can I prevent this error?
By making sure your broker does not go down. To get a more precise answer, you'd have to give more diagnostic information in your question.
By the way, Celery has a notion of retrying tasks but that's for when the task is already known to the broker. It does not apply to the case where you cannot contact the broker.