PCF - cf push - 502 Bad Gateway - cloud-foundry

We are trying to deploy an EAR file to PCF; the deployment bundle is around 200 MB and the buildpack we use is the WAS Liberty buildpack. Whenever we run cf push we consistently get the error below:
HTTP/1.1 502 Bad Gateway
X-Cf-Routererror: endpoint_failure (context deadline exceeded)
io: read/write on closed pipe
Is there a specific reason for this behavior apart from network bandwidth, latency, etc.?
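One thing worth trying, assuming the failure happens while the CLI is uploading or staging the large package, is raising the client-side timeouts. This is only a sketch (app name, file name, and values are illustrative), and it will not help if the timeout is enforced on the Gorouter side, which a platform operator would have to raise:

# CF CLI staging/startup timeouts are given in minutes
export CF_STAGING_TIMEOUT=60
export CF_STARTUP_TIMEOUT=15
# -t is the application start timeout in seconds (180 is the maximum)
cf push my-app -p my-bundle.ear -t 180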

Related

Resolving intermittent 502 Bad Gateway with Cloudfront pulling from S3

We have an AWS production setup with pieces including EC2, S3, and CloudFront (among others). The website on EC2 generates XML feeds that include a number of images for each item (in total, over 300k images). The XML feed is consumed by a third party, which processes the feed and downloads any new images. All image links point to CloudFront with an S3 bucket as its origin.
Reviewing the third-party logs, many of those images are downloaded successfully, but many others still fail with 502 Bad Gateway responses. Looking at the CloudFront logs, all I'm seeing is OriginError with no indication of what's causing it. Most discussions I've found about CloudFront 502 errors point to SSL issues, and they seem to involve people getting a 502 on every request. SSL isn't a factor here, and most requests succeed, so it's an intermittent issue, and I haven't been able to replicate it manually.
I suspect something like S3 rate limiting, but even with that many images I don't think the third party is grabbing them anywhere near fast enough to trigger throttling. I could be wrong. Either way, I can't figure out what's causing the issue, so I can't figure out how to fix it, since I'm not getting a more specific error from S3/CloudFront. Below is one row from the CloudFront log, broken down field by field (field names per the CloudFront standard access log format).
(log file ABC.2021-10-21-21.ABC)
date: 2021-10-21
time: 21:09:47
x-edge-location: DFW53-C1
sc-bytes: 508
c-ip: ABCIP
cs-method: GET
cs(Host): ABC.cloudfront.net
cs-uri-stem: /ABC.jpg
sc-status: 502
cs(Referer): -
cs(User-Agent): ABCUA
cs-uri-query: -
cs(Cookie): -
x-edge-result-type: Error
x-edge-request-id: ABCID
x-host-header: ABC.cloudfront.net
cs-protocol: https
cs-bytes: 294
time-taken: 4.045
x-forwarded-for: -
ssl-protocol: TLSv1.2
ssl-cipher: ECDHE-RSA-AES128-GCM-SHA256
x-edge-response-result-type: Error
cs-protocol-version: HTTP/1.1
fle-status: -
fle-encrypted-fields: -
c-port: 11009
time-to-first-byte: 4.045
x-edge-detailed-result-type: OriginError
sc-content-type: application/json
sc-content-len: 36
sc-range-start: -
sc-range-end: -
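Since the OriginError rows above are intermittent and most requests succeed, one mitigation is simply to retry failed downloads with a short backoff on the consumer side. This is a minimal sketch, assuming the third-party consumer can be modified and runs on Node 18+ (global fetch); the URL is a placeholder:

// retry a download a few times, backing off on 5xx responses
async function downloadWithRetry(url, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    const res = await fetch(url);
    if (res.ok) return Buffer.from(await res.arrayBuffer());
    if (res.status < 500) throw new Error(`permanent failure: ${res.status}`);
    await new Promise(r => setTimeout(r, 500 * 2 ** i)); // 0.5s, 1s, 2s, ...
  }
  throw new Error(`still failing after ${attempts} attempts: ${url}`);
}

// usage (placeholder URL)
// downloadWithRetry('https://ABC.cloudfront.net/ABC.jpg').then(buf => console.log(buf.length));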

AWS HTTP API Gateway 503 Service Unavailable

I have an HTTP API Gateway with an HTTP integration backend server on EC2. The API gets a lot of queries during the day, and looking at the logs I realized that the API sometimes returns a 503 HTTP code with the body:
{ "message": "Service Unavailable" }
When I found this out, I tried the API and ran the HTTP request many times in Postman; out of every twenty attempts I get at least one 503.
I then thought that the HTTP integration server was busy, but the server is not under load, and when I go directly to the HTTP integration server I get 200 responses every time.
The timeout parameter is set to 30000 ms and the endpoint's average response time is 200 ms, so a timeout is not the problem. Also, the HTTP 503 does not arrive 30 seconds after the request but instantly.
Can anyone help me?
Thanks
I solved this issue by editing the keep-alive connection parameters of my internal integration server. AWS API Gateway expects standard keep-alive behavior from the integration, so I tweaked my NGINX server parameters until the errors stopped; a sketch of the relevant parameters follows.
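The exact values aren't given in that answer, but a minimal sketch of the kind of NGINX tuning it describes (assuming NGINX terminates the integration's HTTP connections; the numbers are illustrative) would be:

http {
    # keep idle connections open longer than API Gateway's connection-reuse window
    keepalive_timeout  65s;
    # allow many requests per kept-alive connection before NGINX closes it
    keepalive_requests 1000;
}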
I had the same issue with a self-made Node microservice integrated into AWS API Gateway. After reconfiguring the CloudWatch logs I got a further indicator of what was wrong: INTEGRATION_NETWORK_FAILURE
First verify that your problem is the same, i.e. through more detailed log output:
In API Gateway → Logging, add more output under "Log format".
Use this or similar content for "Log format":
{"httpMethod":"$context.httpMethod","integrationErrorMessage":"$context.integrationErrorMessage","protocol":"$context.protocol","requestId":"$context.requestId","requestTime":"$context.requestTime","resourcePath":"$context.resourcePath","responseLength":"$context.responseLength","routeKey":"$context.routeKey","sourceIp":"$context.identity.sourceIp","status":"$context.status","errMsg":"$context.error.message","errType":"$context.error.responseType","intError":"$context.integration.error","intIntStatus":"$context.integration.integrationStatus","intLat":"$context.integration.latency","intReqID":"$context.integration.requestId","intStatus":"$context.integration.status"}
After calling the API Gateway endpoint and hitting the failure, consult the logs again; the integration error fields should now be populated.
Fix in the Node.js microservice (using Express)
Add timeouts for headers and keep-alive on the Express server's socket configuration when listening.
const app = require('express')();

// if not already set, and you need to advertise keep-alive in the HTTP response,
// you might want to use this
/*
app.use((req, res, next) => {
  res.setHeader('Connection', 'keep-alive');
  res.setHeader('Keep-Alive', 'timeout=30');
  next();
});
*/

/* ...your main logic... */

const server = app.listen(8080, 'localhost', () => {
  console.warn(`⚡️[server]: Server is running at http://localhost:8080`);
});

server.keepAliveTimeout = 30 * 1000; // <- important lines
server.headersTimeout = 35 * 1000;   // <- important lines
Reason
Some AWS components seem to demand that a connection be kept alive, even if the server indicates otherwise (Connection: close). When API Gateway (and possibly AWS ELBs) tries to reuse such a connection, the reuse fails because the other side has most likely already closed it, hence the assumed "NETWORK FAILURE".
The error is intermittent because API Gateway appears to close unused connections after a while, so the next execution gets a clean connection. I can only assume they hold connections open for performance reasons rather than settle for anything slower.

Google Cloud Tasks enforcing rate limit on forwarding to Cloud Functions?

Cloud Tasks is saying:
App Engine is enforcing a processing rate lower than the maximum rate for this queue either because your application is returning HTTP 503 codes or because currently there is no instance available to execute a request.
However, I am forwarding the tasks to a Cloud Function using an HTTP POST request, similar to the one outlined in this tutorial. I don't see any 503s in the logs of the Cloud Function it forwards to.
My queue.yaml is:
queue:
- name: task-queue-1
  rate: 3/s
  bucket_size: 500
  max_concurrent_requests: 100
  retry_parameters:
    task_retry_limit: 1
    min_backoff_seconds: 120
    task_age_limit: 7d
The problem seems to be triggered by any error response, even though only 503 is listed in the message. If the Cloud Function responds with any error, the task queue slows down its dispatch rate, and you have no control over that.
My solution was to swallow any errors in the function (sketched below) so they never propagate up to Google's automatic check.
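A minimal sketch of that workaround, assuming an HTTP-triggered Node.js Cloud Function (the handleTask and processTask names are hypothetical):

// always acknowledge the task with 200 so Cloud Tasks does not throttle the queue
exports.handleTask = async (req, res) => {
  try {
    await processTask(req.body); // your real task logic (placeholder)
    res.status(200).send('ok');
  } catch (err) {
    console.error('task failed, swallowing error:', err); // keep the failure visible in logs
    res.status(200).send('error swallowed');
  }
};

The trade-off is that failed tasks are never retried by Cloud Tasks, so any retry logic has to live inside the function itself.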

Scrapy + Splash returns a lot of 504 Time Out errors

I have followed Splash's FAQ for production setups, and my system currently looks like this:
1 Scrapy container with 6 concurrent requests.
1 HAProxy container that load-balances across the Splash containers.
2 Splash containers with 3 slots each.
I use docker stats to monitor my setup, and I never see more than 7% CPU usage or more than 55% memory usage.
I still get a lot of
DEBUG: Retrying <GET https://the/url/ via http://haproxy:8050/execute> (failed 1 times): 504 Gateway Time-out
For every successful request I get 6-7 of these timeouts.
I have experimented with changing the number of slots on the Splash containers and the number of concurrent requests. I've also tried running a single Splash container behind HAProxy. I keep getting these errors.
I'm running on an AWS EC2 t2.micro instance, which has 1 GB of memory.
I suspect that the issue is still the Splash instances getting flooded. Is there any advice you can give me to reduce the load on the Splash instances? Is there a good ratio between slots and concurrent requests? Should I throttle requests, for example as in the sketch below?
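One way to test the flooding theory, as a sketch rather than a definitive fix (the spider name and the values are illustrative), is to throttle Scrapy and give Splash a longer per-render budget:

# slow Scrapy down and enable auto-throttling for this crawl
scrapy crawl myspider -s CONCURRENT_REQUESTS=3 -s DOWNLOAD_DELAY=1 -s AUTOTHROTTLE_ENABLED=True

# give each Splash container a higher render timeout while keeping 3 slots
docker run -p 8050:8050 scrapinghub/splash --max-timeout 300 --slots 3

On a t2.micro, two Splash instances plus HAProxy and Scrapy is already tight on 1 GB of memory, so throttling on the Scrapy side is usually the first knob to turn.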

502 Server Error sometime on Google Compute Engine

I set up a server on Google Compute Engine with Apache on Ubuntu 16.04.4 LTS. It's protected with IAP.
It was fine all along for about 6 months, but now some users encounter a 502 Server Error.
I already checked the following links:
Some 502 errors in GCP HTTP Load Balancing [changed the Apache KeepAliveTimeout to 620]
502 response coming from errors in Google Cloud LoadBalancer [removed AJAX requests]
But the problem is still there.
Here is the error message from one of the logs:
{
  httpRequest: {…}
  insertId: "170sg34g5fmld90"
  jsonPayload: {
    @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
    statusDetails: "failed_to_pick_backend"
  }
  logName: "projects/sggc-web01/logs/requests"
  receiveTimestamp: "2018-03-14T07:21:55.807802906Z"
  resource: {…}
  severity: "WARNING"
  spanId: "44a49bf1b3893412"
  timestamp: "2018-03-14T07:21:53.048717425Z"
  trace: "projects/sggc-web01/traces/f35119d8571f20df670b0d53ab6b3210"
}
Please help me to trace and fix the issue. Thank you!
The error is not being caused by the server but by the load balancer.
From the statusDetails value "failed_to_pick_backend" we can see it is caused by all the instances being unhealthy (or still unhealthy) when the load balancer tries to establish the connection.
This can be because:
1 - The CPU usage of the instances was too high and they weren't able to answer the health check requests from the load balancer, so they appeared unhealthy to it.
2 - The health checks aren't being allowed through the firewall (I doubt this is the reason if it worked before). You can check both possibilities with the commands sketched below.
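A quick way to check both, as a sketch with illustrative resource names (130.211.0.0/22 and 35.191.0.0/16 are the health-check source ranges Google documents for HTTP(S) load balancers):

# show the current health of each backend behind the load balancer
gcloud compute backend-services get-health my-backend-service --global

# make sure the health-check probes are allowed through the firewall (HTTP backend on port 80 assumed)
gcloud compute firewall-rules create allow-lb-health-checks \
    --network=default \
    --source-ranges=130.211.0.0/22,35.191.0.0/16 \
    --allow=tcp:80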