ELB Keeps Inconsistently Failing Health Checks, but EC2 Status Checks OK

The servers are accessible normally, and the health check hits / (the default page).
Under some load, the instances respond a little more slowly than the ELB likes, and the ELB then takes them out of the load balancer.
Because the application itself doesn't fail, I can't just "reboot" the instances via EC2, and I can often reach the page / IP directly myself while the instance is "out of service".
This isn't a general failure or a misconfiguration: an instance can be up for 12-2400 hours and then randomly fail three times in three hours, under medium-to-low load.
The health check is set to a 10 s response timeout, a 30 s interval, 5 failures to mark an instance unhealthy, and 2 successes to mark it healthy again.
Any ideas?
The health checks look normal in the access logs and there is nothing in the error log. Here's a sample from the access log:
10.0.100.30 - - [25/Nov/2016:06:49:22 +0000] "GET /index.html HTTP/1.1" 200 11415 "-" "ELB-HealthChecker/1.0"
::1 - - [25/Nov/2016:06:49:26 +0000] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.4.20 (Ubuntu) (internal dummy connection)"
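
For reference, the health-check settings described above map onto a classic ELB roughly as follows via boto3; the load balancer name and target path below are placeholders, not values taken from this setup:

import boto3

elb = boto3.client("elb")  # classic ELB API (the checker above is ELB-HealthChecker/1.0)

# Placeholder load balancer name; thresholds mirror the settings described above.
elb.configure_health_check(
    LoadBalancerName="my-classic-elb",
    HealthCheck={
        "Target": "HTTP:80/index.html",  # health-check target (the default page)
        "Interval": 30,                  # seconds between checks
        "Timeout": 10,                   # seconds to wait for a response
        "UnhealthyThreshold": 5,         # consecutive failures before OutOfService
        "HealthyThreshold": 2,           # consecutive successes before InService
    },
)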

Suspicious network activity on amazon EC2 instance

I have created an Amazon EC2 instance and I am hosting a Flask server on it (the public IP of the server is known only to another server; it is not meant to be used by clients, only by that other machine).
For some reason, I am seeing some weird network activity:
From the logs:
162.142.125.10 - - [18/Apr/2022 19:45:39] "GET / HTTP/1.1" 200 -
118.123.105.85 - - [18/Apr/2022 20:06:30] "GET / HTTP/1.0" 200 -
198.235.24.20 - - [18/Apr/2022 22:37:16] "GET / HTTP/1.1" 200 -
128.14.209.250 - - [19/Apr/2022 01:24:07] "GET / HTTP/1.1" 200 -
128.14.209.250 - - [19/Apr/2022 01:24:15] code 400, message Bad request version ('À\x14À')
128.14.209.250 - - [19/Apr/2022 07:05:32] "▬♥☺ ±☺ ♥♥Ýfé$0±6nu♀¤♫ëe éSV∟É#☼ß↨♠\ VÀ◄ÀÀ‼À À¶À" HTTPStatus.BAD_REQUEST -
I have looked up all these IPs and they are from all across the globe.
Why am I getting these kinds of requests? What are they probably trying to achieve?
[EDIT]
162.142.125.10 -> https://about.censys.io/
118.123.105.85 -> ChinaNet Sichuan Province Network
198.235.24.20 -> Palo Alto Networks Inc
128.14.209.250 -> zl-dal-us-gp1-wk123.internet-census.org
As others have said, it's common for bots and (ethical?) hackers around the world to scan your machine if it's on a public network.
Your assumption that "the public ip of the server is known only to another server" simply isn't true.
If you want to achieve that, you should place your server inside a private VPC subnet and/or allow the traffic only from the specific server via Security Group configuration.
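For example, a Security Group rule that admits only the one client server could be added like this with boto3; the group ID, port, and CIDR below are placeholders, not values from the question:

import boto3

ec2 = boto3.client("ec2")

# Placeholder IDs: the Flask server's security group and the single client allowed to reach it.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5000,  # whatever port the Flask server actually listens on
            "ToPort": 5000,
            "IpRanges": [{"CidrIp": "203.0.113.10/32", "Description": "only allowed client"}],
        }
    ],
)

With no 0.0.0.0/0 ingress rule left on that port, the scanners shown in the logs would no longer get a response at all.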

Slack Bot deployed in Cloud Foundry returns 502 Bad Gateway errors

In Slack, I have set up an app with a slash command. The app works well when I use a local ngrok server.
However, when I deploy the app server to PCF, it is returning 502 errors:
[CELL/0] [OUT] Downloading droplet...
[CELL/SSHD/0] [OUT] Exit status 0
[APP/PROC/WEB/0] [OUT] Exit status 143
[CELL/0] [OUT] Cell e6cf018d-0bdd-41ca-8b70-bdc57f3080f1 destroying container for instance 28d594ba-c681-40dd-4514-99b6
[PROXY/0] [OUT] Exit status 137
[CELL/0] [OUT] Downloaded droplet (81.1M)
[CELL/0] [OUT] Cell e6cf018d-0bdd-41ca-8b70-bdc57f3080f1 successfully destroyed container for instance 28d594ba-c681-40dd-4514-99b6
[APP/PROC/WEB/0] [OUT] ⚡️ Bolt app is running! (development server)
[OUT] [APP ROUTE] - [2021-12-23T20:35:11.460507625Z] "POST /slack/events HTTP/1.1" 502 464 67 "-" "Slackbot 1.0 (+https://api.slack.com/robots)" "10.0.1.28:56002" "10.0.6.79:61006" x_forwarded_for:"3.91.15.163, 10.0.1.28" x_forwarded_proto:"https" vcap_request_id:"7fe6cea6-180a-4405-5e5e-6ba9d7b58a8f" response_time:0.003282 gorouter_time:0.000111 app_id:"f1ea0480-9c6c-42ac-a4b8-a5a4e8efe5f3" app_index:"0" instance_id:"f46918db-0b45-417c-7aac-bbf2" x_cf_routererror:"endpoint_failure (use of closed network connection)" x_b3_traceid:"31bf5c74ec6f92a20f0ecfca00e59007" x_b3_spanid:"31bf5c74ec6f92a20f0ecfca00e59007" x_b3_parentspanid:"-" b3:"31bf5c74ec6f92a20f0ecfca00e59007-31bf5c74ec6f92a20f0ecfca00e59007"
Besides endpoint_failure (use of closed network connection), I also see:
x_cf_routererror:"endpoint_failure (EOF (via idempotent request))"
x_cf_routererror:"endpoint_failure (EOF)"
In PCF, I created an https:// route for the app. This is the URL I put into my Slack App's "Redirect URLs" section as well as my Slash command URL.
In Slack, the URLs end with /slack/events
This configuration all works well locally, so I guess I missed a configuration point in PCF.
Manifest.yml:
applications:
- name: kafbot
  buildpacks:
    - https://github.com/starkandwayne/librdkafka-buildpack/releases/download/v1.8.2/librdkafka_buildpack-cached-cflinuxfs3-v1.8.2.zip
    - https://github.com/cloudfoundry/python-buildpack/releases/download/v1.7.48/python-buildpack-cflinuxfs3-v1.7.48.zip
  instances: 1
  disk_quota: 2G
  # health-check-type: process
  memory: 4G
  routes:
    - route: "kafbot.apps.prod.fake_org.cloud"
  env:
    KAFKA_BROKER: 10.32.17.182:9092,10.32.17.183:9092,10.32.17.184:9092,10.32.17.185:9092
    SLACK_BOT_TOKEN: ((slack_bot_token))
    SLACK_SIGNING_SECRET: ((slack_signing_key))
  command: python app.py
When x_cf_routererror says endpoint_failure it means that the application has not handled the request sent to it by Gorouter for some reason.
From there, you want to look at response_time. If the response time is high (typically the same value as the timeout, like 60s almost exactly), it means your application is not responding quickly enough. If the value is low, it could mean that there is a connection problem, like Gorouter tries to make a TCP connection and cannot.
Normally this shouldn't happen: the platform has a health check in place that makes sure the application is up and listening for requests, and if it isn't, the application will not be marked as successfully started.
In this particular case, the manifest has health-check-type: process which is disabling the standard port-based health check and using a process-based health check. This allows the application to start up even if it's not on the right port. Thus when Gorouter sends a request to the application on the expected port, it cannot connect to the application's port. Side note: typically, you'd only use process-based health checks if your application is not listening for incoming requests.
The platform passes in a $PORT environment variable (currently it is always 8080, but that could change in the future). You need to make sure your app is listening on that port. Also, you want to listen on 0.0.0.0, not localhost or 127.0.0.1.
This should ensure that Gorouter can deliver requests to your application on the agreed-upon port.
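As a minimal sketch (assuming Bolt for Python with the Flask adapter; the /slack/events path matches the question, everything else is illustrative rather than the asker's actual app.py), that looks like:

import os
from flask import Flask, request
from slack_bolt import App
from slack_bolt.adapter.flask import SlackRequestHandler

bolt_app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

flask_app = Flask(__name__)
handler = SlackRequestHandler(bolt_app)

@flask_app.route("/slack/events", methods=["POST"])
def slack_events():
    return handler.handle(request)

if __name__ == "__main__":
    # Listen on all interfaces and on the platform-assigned $PORT (not Bolt's default 3000).
    flask_app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))

With the app bound this way, the standard port-based health check can also be re-enabled (i.e. drop health-check-type: process), so a broken deploy fails at staging time instead of surfacing later as 502s from Gorouter.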

How to identify the object that consumes the most bandwidth in an AWS S3 bucket?

What is the best way to identify the object that consumes the most bandwidth in an S3 bucket containing thousands of other objects?
By "bandwidth" I will assume that you mean the bandwidth consumed by delivering files from S3 to some place on the Internet (as when you use S3 to serve static assets).
To track this, you'll need to enable S3 access logs, which creates logfiles in a different bucket that show all of the operations against your primary bucket (or a path in it).
Here are two examples of logged GET operations. The first is from anonymous Internet access using a public S3 URL, while the second uses the AWS CLI to download the file. I've redacted or modified any identifying fields, but you should be able to figure out the format from what remains.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx com-example-mybucket [04/Feb/2020:15:50:00 +0000] 3.4.5.6 - XXXXXXXXXXXXXXXX REST.GET.OBJECT index.html "GET /index.html HTTP/1.1" 200 - 90 90 9 8 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0" - xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx - ECDHE-RSA-AES128-GCM-SHA256 - com-example-mybucket.s3.amazonaws.com TLSv1.2
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx com-example-mybucket [05/Feb/2020:14:51:44 +0000] 3.4.5.6 arn:aws:iam::123456789012:user/me XXXXXXXXXXXXXXXX REST.GET.OBJECT index.html "GET /index.html HTTP/1.1" 200 - 90 90 29 29 "-" "aws-cli/1.17.7 Python/3.6.9 Linux/4.15.0-76-generic botocore/1.14.7" - xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader com-example-mybucket.s3.amazonaws.com TLSv1.2
So, to get what you want:
Enable logging
Wait for a representative amount of data to be logged. At least 24 hours unless you're a high-volume website (and note that it can take up to an hour for log records to appear).
Extract all the lines that contain REST.GET.OBJECT
From these, extract the filename and the number of bytes (in this case, the file is 90 bytes).
For each file, multiply the number of bytes by the number of times that it appears in a given period.
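As a rough sketch of the extraction steps above in Python, this sums the bytes-sent field per key across the downloaded log files (the field positions are assumptions based on the sample lines shown earlier):

import re
import sys
from collections import Counter

# Pull the object key and bytes-sent out of REST.GET.OBJECT lines, i.e.
# ... REST.GET.OBJECT <key> "<request-uri>" <status> <error-code> <bytes-sent> ...
LINE_RE = re.compile(
    r'REST\.GET\.OBJECT\s+(?P<key>\S+)\s+"[^"]*"\s+\d{3}\s+\S+\s+(?P<bytes>\d+|-)'
)

totals = Counter()
for line in sys.stdin:
    m = LINE_RE.search(line)
    if m and m.group("bytes") != "-":
        totals[m.group("key")] += int(m.group("bytes"))

# Largest bandwidth consumers first.
for key, total_bytes in totals.most_common(20):
    print(f"{total_bytes:>15}  {key}")

You could run it as, for example, cat logs/* | python top_objects.py after pulling the log bucket down locally with aws s3 sync.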
Beware: because every access is logged, the logfiles can grow quite large, quite fast, and you will pay storage charges for them. You should create a lifecycle rule on the destination bucket to delete old logs.
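Such a lifecycle rule can be created once and forgotten; for instance, with boto3 (the bucket name, prefix, and retention period below are placeholders):

import boto3

s3 = boto3.client("s3")

# Placeholder log bucket/prefix; deletes access-log objects 30 days after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="com-example-mybucket-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-access-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)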
Update: you could also use Athena to query this data. Here's an AWS blog post that describes the process.

ELB Intermittently Return 504 GATEWAY_TIMEOUT

I've seen this asked here, here, and here, but without any good answers, and I was hoping to get some closure on the issue.
I have an ELB connected to 6 instances, all running Tomcat 7. Up until Friday there were seemingly no issues at all. However, starting about five days ago we began getting around two 504 GATEWAY_TIMEOUT responses from the ELB per day. That's roughly 2 out of 2,000 requests, about 0.1%. I turned on logging and see:
2018-06-27T12:56:08.110331Z momt-default-elb-prod 10.196.162.218:60132 - -1 -1 -1 504 0 140 0 "POST https://prod-elb.us-east-1.backend.net:443/mobile/user/v1.0/ HTTP/1.1" "BackendClass" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
But my Tomcat 7 logs don't show any 504s at all, which implies that the ELB is rejecting these requests without even communicating with Tomcat.
I've seen people mention setting Tomcat's timeout to be greater than the ELB's timeout, but if that were what was happening (i.e. Tomcat times out and then the ELB gives up), shouldn't I see a 504 in the Tomcat logs?
Similarly, nothing has changed in the code in a few months, so this all just started seemingly out of nowhere, and it is too infrequent to look like a bigger issue. I checked whether there was some pattern in the timeouts (e.g. Tomcat restarting, or always the same instance) but couldn't find anything.
I know other people have run into this issue, but any and all help would be greatly appreciated.

Why doesn't CloudFront return the cached version of these identical URL's?

I have a server on EB (running a Tomcat application), and I also have a CloudFront cache set up to cache duplicate requests so that they don't go to the server.
I have two behaviours set up
/artist/search
/Default(*)
and Default(*) is set to:
Allowed HTTP Methods: GET, PUT
Forward Headers: None
Headers: Customize
Timeout: 84,0000
Forward Cookies: None
Forward Query Strings: Yes
Smooth Streaming: No
Restricted View Access: No
so the cache timeout is long enough not to matter, and the only things forwarded are query strings.
Yet I can see from looking at the localhost_access_log file that my server is receiving duplicate requests:
127.0.0.1 - - [22/Apr/2015:10:58:28 +0000] "GET /artist/cee3e39e-fb10-414d-9f11-b50fa7d6fb7a HTTP/1.1" 200 1351114
127.0.0.1 - - [22/Apr/2015:10:58:29 +0000] "GET /artist/cee3e39e-fb10-414d-9f11-b50fa7d6fb7a HTTP/1.1" 200 1351114
127.0.0.1 - - [22/Apr/2015:10:58:38 +0000] "GET /artist/cee3e39e-fb10-414d-9f11-b50fa7d6fb7a HTTP/1.1" 200 1351114
I can also see from my CloudFront Popular Objects page that many objects, including these artist URLs, sometimes hit and sometimes miss; I was expecting only one miss and all the rest to be hits.
Why would this be?
Update
Looking more carefully, it seems (although I'm not sure about this) that a page is less likely to be cached as the size of the artist page increases. Even more strangely, when the main artist page is larger, everything referenced in that page, such as icons (PNGs), also seems to be re-fetched, but not when the artist page is small. This is the worst outcome for me, because it is the large artist pages that need the most processing to create on the server; avoiding recreating these pages is exactly why I am using CloudFront in the first place.
What you are seeing is a combination of two reasons:
Each individual CloudFront POP requests objects separately, so if your viewers are in different locations you can expect multiple requests to your origin server (and they will be misses).
I'm not sure about the report date range you are looking at, but CloudFront eventually evicts less popular objects to make room in its cache for new objects.
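
One way to see both effects from the outside is to request the same object a few times and look at the X-Cache and Age response headers CloudFront adds (the distribution hostname below is a placeholder):

import requests

# Placeholder CloudFront hostname; the path is one of the artist URLs from the question.
url = "https://d1234example.cloudfront.net/artist/cee3e39e-fb10-414d-9f11-b50fa7d6fb7a"

for i in range(3):
    r = requests.get(url)
    # "Hit from cloudfront"  -> served from that POP's cache
    # "Miss from cloudfront" -> that POP had to fetch the object from the origin
    print(i, r.status_code, r.headers.get("X-Cache"), r.headers.get("Age"))

Repeated misses when testing from a single location point at eviction (or at a cache key that varies, e.g. differing forwarded query strings), whereas misses seen only from new locations are just the per-POP behaviour described above.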