AWS HTTP API Gateway 503 Service Unavailable - amazon-web-services

I have an HTTP API Gateway with an HTTP integration backend server on EC2. The API handles a lot of requests during the day, and looking at the logs I realized that the API sometimes returns an HTTP 503 code with the body:
{ "message": "Service Unavailable" }
When I noticed this, I tried the API myself, running the HTTP requests many times in Postman; out of roughly twenty attempts I get at least one 503.
I then thought the HTTP integration server was busy, but the server is not under load, and when I go directly to the HTTP integration server I get 200 responses every time.
The timeout parameter is set to 30000ms and the endpoint's average response time is 200ms, so the timeout is not the problem. Also, the 503 does not show up after 30 seconds; it comes back instantly.
Can anyone help me?
Thanks

I solved this issue by editing the keep-alive connection parameters of my internal integration server. AWS API Gateway expects the backend to keep connections alive with fairly standard settings, so I tweaked my NGINX server parameters until the 503s stopped.
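For reference, the NGINX directives involved look roughly like this. Treat it as a sketch under the assumption of a fairly standard NGINX setup; the values are illustrative, not the exact ones I ended up with:
# inside the http { } (or server { }) block of nginx.conf
# keep idle connections open longer than API Gateway is likely to hold and reuse them
keepalive_timeout  65s;
# allow many requests per connection so NGINX doesn't recycle a connection the gateway still considers open
keepalive_requests 10000;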

Had the same issue with a self-made Node microservice integrated into AWS API Gateway. After some reconfiguration of the CloudWatch logs I got a further indicator of what was wrong: INTEGRATION_NETWORK_FAILURE
Verify your problem is the same, i.e. through more detailed log output
In API Gateway - Logging, add more output in "Log format"
Use this or similar content for "Log format":
{"httpMethod":"$context.httpMethod","integrationErrorMessage":"$context.integrationErrorMessage","protocol":"$context.protocol","requestId":"$context.requestId","requestTime":"$context.requestTime","resourcePath":"$context.resourcePath","responseLength":"$context.responseLength","routeKey":"$context.routeKey","sourceIp":"$context.identity.sourceIp","status":"$context.status","errMsg":"$context.error.message","errType":"$context.error.responseType","intError":"$context.integration.error","intIntStatus":"$context.integration.integrationStatus","intLat":"$context.integration.latency","intReqID":"$context.integration.requestId","intStatus":"$context.integration.status"}
After calling the API Gateway endpoint and hitting the failure again, consult the logs - the integration error fields should now show something like INTEGRATION_NETWORK_FAILURE.
Solve it in the Node.js microservice (using Express)
Add timeouts for headers and keep-alive on the Express server's socket configuration once it is listening.
const app = require('express')();

// if not already set, and you want to advertise keep-alive in the HTTP response,
// you might use something like this:
/*
app.use((req, res, next) => {
  res.setHeader('Connection', 'keep-alive');
  res.setHeader('Keep-Alive', 'timeout=30');
  next();
});
*/

/* ...your main logic... */

const server = app.listen(8080, 'localhost', () => {
  console.warn(`⚡️[server]: Server is running at http://localhost:8080`);
});

server.keepAliveTimeout = 30 * 1000; // <- important: keep sockets open at least as long as API Gateway may reuse them
server.headersTimeout = 35 * 1000;   // <- important: must be greater than keepAliveTimeout
Reason
Some AWS components seem to expect the connection to be kept alive, even if the server indicates otherwise (Connection: close). When API Gateway (and possibly AWS ELBs) tries to reuse such a connection, the reuse fails because the other side has most likely already closed it, hence the reported "NETWORK_FAILURE".
The error is intermittent because API Gateway, at least, seems to close unused connections after a while, so the next request runs cleanly on a fresh connection. I can only assume they reuse connections for performance and don't want to settle for anything less.

Related

Go: i/o timeout reading a request proxied by AWS ALB

We have a pretty standard API server written in Go. The HTTP handler unmarshals the request body (JSON) into a protobuf struct and sends it off to be processed.
The service is deployed as ECS containers on AWS, fronted by an ALB. The service has a pretty high request volume, and we observed that about 0.2% of requests fail with messages like this:
read tcp $ECS_CONTAINER_IP:$PORT->$REMOTE_IP:$REMOTE_RANDOM_PORT: i/o timeout
We tracked it down: the error is returned from the jsonpb.Unmarshal method. Our tracing tells us that all of these i/o timeout requests take 5s.
Within the container, we ran ss -t | wc -l to see the in-flight requests, and the number is quite reasonable (about 200-300ish, which is far lower than the nofile ulimit).
Some quick awk/sort/uniq tells us that the in-flight requests coming from the ALBs are roughly balanced.
Any idea how we should proceed from here?

Client Returns Network Error, but Successful Server POST Request

Why am I receiving a network error? Does anyone have a clue at what layer this is occurring / how I can resolve this issue?
What I've Tried
(1) Checked CORS... everything seems to be ok.
(2) Tried to add timeouts in the YAML file as annotations on my LB.
(Note) The request seems to be timing out after 60 seconds
Process:
(1) Axios POST request triggered from front via button click.
(2) Flask server (back) receives POST request and begins to process.
[ERROR OCCURS HERE] (3) The Flask server is still processing the request on the back end; however, the client receives a 504 timeout, and there is also some CORS origin mention (I don't think this is the issue though, as I've set my CORS settings properly, and it doesn't pop up for any other requests...).
(4) Server responds with a 200 and successfully sets data.
Current stack:
(1) AWS EKS / Kubernetes for deployment (relevant config shown).
(2) Flask backend.
(3) React frontend.
My initial thought is that this has to do with the deployment... it works perfectly fine in a local context, but I think there is some timeout setting; however, I'm unsure where it is / how I can increase the timeout. For additional context, this doesn't seem to happen with short-lived requests... just this one particular request that takes more time.
If it's failing specifically for long-running calls then you may have to adjust your ELB idle timeout. It's 60 seconds by default. Check out the following resource for reference:
https://aws.amazon.com/blogs/aws/elb-idle-timeout-control/
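If the LB in front of the EKS service is a classic ELB created by a Kubernetes Service of type LoadBalancer, the idle timeout can be raised with a Service annotation. A minimal sketch, assuming that setup; the names, ports, and the 300-second value are placeholders, not taken from the question:
apiVersion: v1
kind: Service
metadata:
  name: flask-backend            # hypothetical name for the backend Service
  annotations:
    # raise the ELB idle timeout above the 60-second default (value is in seconds)
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "300"
spec:
  type: LoadBalancer
  selector:
    app: flask-backend           # hypothetical selector
  ports:
    - port: 80
      targetPort: 5000           # Flask's default port, assumed here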
Some troubleshooting tips here.

AWS Elastic Beanstalk: Looooooooong HEAD requests

I've just deployed a simple Java/Tomcat based application into Elastic Beanstalk (using the java8/tomcat8 config). Mostly the application works fine.
However, all HEAD requests seem to take 60 seconds. Feels like a timeout of some kind. I can't seem to find any settings regarding filtering or delaying particular types of requests. These requests work fine when I run locally. GET requests to the same URL work fine.
I've confirmed that both the Tomcat and the Apache instance on the server log the HEAD request instantly (which indicates they are done with it, right?).
I've confirmed (using telnet) that the client is not receiving any response header bytes until very late. This isn't a problem of the client waiting for a payload or something like that.
Furthermore, the delay is clearly tied to the load balancer's "Idle Timeout" setting. If I push that down to 5 seconds, then the HEAD requests take about 5 seconds, if I set the idle-timeout to 20 seconds then the HEAD requests take just about 20 seconds (always a few ms over). The default is 60s.
What could be causing all HEAD requests (even those returning a 401 unauthorized error, no processing) to clog up the works like that?
Turns out the problem was a firewall issue at the local site. AWS Elastic Beanstalk was returning the responses in a timely manner, but they were getting clogged up in a local firewall. Grr..

Inconsistent AWS "Signature not current" errors from Cloudformation API

I have a ruby client (fog) that makes a call to the AWS CloudFormation API. The client runs on an AWS EC2 instance. For months, the client has been running without issue, but in the last 2 weeks, I've been getting random authorization failures because of "Signature not current".
Here's some cherry-picked debug details from excon (the underlying library used by fog to make http calls).
request:
:headers => {
"User-Agent" => "fog/1.24.0"
"x-amz-date" => "20150326T152500Z"
}
excon.error.response
:headers => {
"Date" => "Thu, 26 Mar 2015 15:19:28 GMT"
}
ERROR: Fog::AWS::CloudFormation::Error: SignatureDoesNotMatch => Signature not yet current: 20150326T152500Z is still later than 20150326T152429Z (20150326T151929Z + 5 min.)
Looks to me like a time sync error: the CFN API is responding with a 15:19:28 timestamp while the request on the client side (EC2 instance) carries an x-amz-date of 15:25:00 - about five and a half minutes ahead, just past the 5-minute skew that AWS allows...
Assuming this is something that needs to be addressed by AWS... any suggestions for a workaround?
Your server has some clock drift that is causing the request signature to be invalid, or at least, not valid yet.
If Linux, please check whether your NTP daemon is running on the system:
service ntp start
service ntp status

web service best practice - server timeout longer than http client timeout

I am trying to build a web service on top of HBase, so the code looks roughly like:
@GET
@Path("/blabla")
@Override
public List<String> getEvents($$$params$$$) {
    ......
    // calling HBase to query the events
    ......
}
When the HBase service is down, the HBase Java API keeps retrying to connect to the HBase region server until eventually it times out and throws a runtime exception:
NoServerForRegionException: Unable to find region for event,,99999999999999 after 10 tries.
The logic has no problem; my issue is that the HTTP client times out way before HBase finishes retrying, so my web service consumer gets no response at all, which is ugly.
Question -
What's the best practice here if the server's timeout is potentially longer than the HTTP connection's own timeout? How can the web service respond to the client gracefully in this case?
Set the caching for your Scan object to some reasonable value. Another thing: since you are using a web service to show the results to your users, I am assuming that you must be showing only a few rows (or records) at a time. You can use the HBase PageFilter so that you get only a specified number of rows each time and don't have to wait for all the rows in one shot. A sketch of both is below.
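To make that concrete, here is roughly what the scan caching plus PageFilter combination looks like with the HBase Java client. A minimal sketch, assuming a plain client connection; the caching value, page size, and row-key printing are illustrative, and only the table name "event" is taken from the error message above:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class EventScanExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("event"))) {
            Scan scan = new Scan();
            scan.setCaching(50);                 // rows fetched per RPC instead of the default
            scan.setFilter(new PageFilter(25));  // cap the rows returned per scan, page-style
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(Bytes.toString(result.getRow()));
                }
            }
        }
    }
}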