We have a fairly standard API server written in Go. The HTTP handler unmarshals the JSON request body into a protobuf struct and sends it off to be processed.
The service is deployed as ECS containers on AWS, fronted by an ALB. Request volume is fairly high, and we observed that about 0.2% of requests fail with messages like this:
read tcp $ECS_CONTAINER_IP:$PORT->$REMOTE_IP:$REMOTE_RANDOM_PORT: i/o timeout
We tracked the error down to the jsonpb.Unmarshal method. Our tracing tells us that all of these i/o timeout requests take 5s.
Within the container we ran ss -t | wc -l to count in-flight connections, and the number is quite reasonable (about 200-300, far below the nofile ulimit).
Some quick awk/sort/uniq shows that the in-flight requests are roughly balanced across the ALB nodes.
Any idea how we should proceed from here?
If I understood it well, Google Cloud Run makes an API publicly available. Once a request is received, an instance is started and the job is processed. Once the job is done, the instance is terminated. Is this right?
If so, I presume that Google determines the instance should be shut down once the HTTP response is sent back to the client. Is that also right?
In my case the process will run for 10 to 20 minutes. Can I still send the HTTP response after that much time? Any advice on how to implement that?
Frankly, all of this is well documented in the Cloud Run docs:
Somewhat, but it depends on how you configured your scaling: https://cloud.google.com/run/docs/about-instance-autoscaling
Also see above, but a request is considered "done" when the HTTP connection is closed (either by you or the client), yes
60 mins is the limit, see:
https://cloud.google.com/run/docs/configuring/request-timeout
Any advice on how to implement that?
You just keep the connection open for 20 minutes, but do note the remark on long-lived connections in the link above.
I have developed an application that does simple publish-subscribe messaging with the AWS IoT Core service.
As required by the AWS IoT SDK, I need to call aws_iot_mqtt_yield() frequently.
The following is the documented description of aws_iot_mqtt_yield:
Called to yield the current thread to the underlying MQTT client. This time is used by the MQTT client to manage PING requests to monitor the health of the TCP connection as well as periodically check the socket receive buffer for subscribe messages. Yield() must be called at a rate faster than the keepalive interval. It must also be called at a rate faster than the incoming message rate as this is the only way the client receives processing time to manage incoming messages. This is the outer function which does the validations and calls the internal yield above to perform the actual operation. It is also responsible for client state changes.
I am calling this function at a period of 1 second.
Because it sends PINGs over the TCP connection, this generates significant data usage in the long run, even though the system is idle most of the time.
The system also runs over LTE, and paying extra for idle-time traffic is not acceptable for us.
I tried extending the period from 1 second to 30 seconds to limit our data usage, but that adds up to 30 seconds of latency to messages received from the cloud.
My requirement is fast message delivery with low additional data usage for maintaining the connection to AWS.
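To put rough numbers on the trade-off, here is a back-of-envelope sketch. The 60-bytes-per-direction figure is an assumption (a 2-byte MQTT PINGREQ/PINGRESP plus TCP/IP and link-layer overhead), not a measured value:

```go
package main

import "fmt"

// idleBytesPerDay estimates keepalive traffic for a given ping interval.
// Assumed cost per exchange: ~60 bytes in each direction, headers included.
func idleBytesPerDay(keepaliveSeconds int) int {
	const bytesPerExchange = 2 * 60 // PINGREQ + PINGRESP, with protocol overhead
	exchangesPerDay := 86400 / keepaliveSeconds
	return exchangesPerDay * bytesPerExchange
}

func main() {
	for _, k := range []int{1, 30, 1200} {
		fmt.Printf("keepalive %4ds -> ~%d KB/day idle\n", k, idleBytesPerDay(k)/1024)
	}
}
```

Under these assumptions a 1-second period costs on the order of 10 MB/day while a 20-minute keepalive costs under 10 KB/day, which is why raising the MQTT keepalive interval (rather than just the yield period) is usually where the savings are.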
I have an HTTP API Gateway with an HTTP integration backend server on EC2. The API handles a lot of queries during the day, and looking at the logs I realized that it sometimes returns a 503 HTTP code with the body:
{ "message": "Service Unavailable" }
When I found this out, I tried running the HTTP requests many times in Postman; out of every twenty attempts I get at least one 503.
I then suspected the integration server was busy, but it is not under load, and when I go directly to the integration server I get 200 responses every time.
The timeout parameter is set to 30000 ms and the endpoint's average response time is 200 ms, so timeouts are not the problem. Also, the 503 arrives instantly, not after 30 seconds.
Can anyone help me?
Thanks
I solved this issue by editing the keep-alive connection parameters of my internal integration server. AWS API Gateway expects standard keep-alive behaviour from the backend, so I tweaked my NGINX server parameters until the issue went away.
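For reference, a sketch of the kind of NGINX directives involved; the values here are illustrative, not necessarily the ones you will need:

```nginx
http {
    # How long an idle keep-alive connection stays open. Keeping this
    # comfortably above the gateway's connection-reuse window avoids the
    # backend closing a connection the gateway still considers live.
    keepalive_timeout  65s;

    # How many requests a single keep-alive connection may serve.
    keepalive_requests 1000;
}
```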
I had the same issue with a self-made Node microservice integrated into AWS API Gateway. After some reconfiguration of the CloudWatch logs I got a further indicator of what was wrong: INTEGRATION_NETWORK_FAILURE.
Verify that your problem is the same, i.e. through more detailed log output:
In API Gateway logging, add more output to "Log format".
Use this or similar content for "Log format":
{"httpMethod":"$context.httpMethod","integrationErrorMessage":"$context.integrationErrorMessage","protocol":"$context.protocol","requestId":"$context.requestId","requestTime":"$context.requestTime","resourcePath":"$context.resourcePath","responseLength":"$context.responseLength","routeKey":"$context.routeKey","sourceIp":"$context.identity.sourceIp","status":"$context.status","errMsg":"$context.error.message","errType":"$context.error.responseType","intError":"$context.integration.error","intIntStatus":"$context.integration.integrationStatus","intLat":"$context.integration.latency","intReqID":"$context.integration.requestId","intStatus":"$context.integration.status"}
After using the API Gateway endpoint and hitting the failure, consult the logs again and look for INTEGRATION_NETWORK_FAILURE in the integrationErrorMessage field.
Solve in the Node.js microservice (using Express)
Add timeouts for headers and keep-alive on the Express server's underlying socket when listening.
const app = require('express')();

// If you also need to advertise the keep-alive in the HTTP response,
// you might want something like this:
/*
app.use((req, res, next) => {
  res.setHeader('Connection', 'keep-alive');
  res.setHeader('Keep-Alive', 'timeout=30');
  next();
});
*/

/* ...your main logic... */

const server = app.listen(8080, 'localhost', () => {
  console.warn(`⚡️[server]: Server is running at http://localhost:8080`);
});

// <- the important lines: keep idle sockets open longer than the gateway's
// reuse window, and allow headers slightly more time than the keep-alive.
server.keepAliveTimeout = 30 * 1000;
server.headersTimeout = 35 * 1000;
Reason
Some AWS components seem to demand that a connection be kept alive, even if the server indicates otherwise (Connection: close). When API Gateway (and possibly AWS ELBs) tries to reuse such a connection, the recycling fails because the other side has most likely already closed it, hence the assumed "NETWORK FAILURE".
The error is intermittent, since API Gateway at least seems to close unused connections after a while, allowing a clean execution the next time. I can only assume they reuse connections for performance and will not settle for anything less.
We are setting up an API Management portal for one of our Web APIs. We use Event Hubs for logging the events and transfer the event messages to Azure Blob Storage using Azure Functions.
We would like to know how to find the time taken by the API Management layer to provide a response (we capture the time taken at the backend API layer, but not at the API Management layer).
The simplest solution is to enable Azure Monitor diagnostic logs for the API Management service. You will get raw logs for each request, including:
durationMs - interval between receiving request line and headers from a client and writing last chunk of response body to a client. All writes and reads include network latency.
BackendTime - time spent waiting on backend response
ClientTime - time spent with client for request and response
CacheTime - time spent on fetching from cache
You can also refer to this video.
Not the cleanest way of doing this, but it still gives an idea of how much time each request takes: use a context variable to record the start time in the inbound policy section, then compute the elapsed time in the outbound section.
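A sketch of that policy; the variable and header names here are made up, while set-variable, set-header, and the context.Variables expression syntax are standard policy constructs:

```xml
<policies>
    <inbound>
        <!-- record when API Management received the request -->
        <set-variable name="startTime" value="@(DateTime.UtcNow)" />
    </inbound>
    <backend>
        <forward-request />
    </backend>
    <outbound>
        <!-- elapsed = now - startTime, exposed as a response header -->
        <set-header name="X-Apim-Elapsed-Ms" exists-action="override">
            <value>@(((int)(DateTime.UtcNow - (DateTime)context.Variables["startTime"]).TotalMilliseconds).ToString())</value>
        </set-header>
    </outbound>
</policies>
```

Subtracting the backend time you already capture from this header then approximates the overhead added by the API Management layer itself.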
I've just deployed a simple Java/Tomcat based application into Elastic Beanstalk (using the java8/tomcat8 config). Mostly the application works fine.
However, all HEAD requests seem to take 60 seconds. Feels like a timeout of some kind. I can't seem to find any settings regarding filtering or delaying particular types of requests. These requests work fine when I run locally. GET requests to the same URL work fine.
I've confirmed that both the Tomcat and the Apache instance on the server log the HEAD request instantly (which indicates they are done with it, right?).
I've confirmed (using telnet) that the client is not receiving any response header bytes until very late. This isn't a problem of the client waiting for a payload or something like that.
Furthermore, the delay is clearly tied to the load balancer's "Idle Timeout" setting. If I push that down to 5 seconds, then the HEAD requests take about 5 seconds, if I set the idle-timeout to 20 seconds then the HEAD requests take just about 20 seconds (always a few ms over). The default is 60s.
What could be causing all HEAD requests (even those returning a 401 unauthorized error, no processing) to clog up the works like that?
Turns out the problem was a firewall issue at the local site. AWS Elastic Beanstalk was returning the responses in a timely manner, but they were getting clogged up in a local firewall. Grr..