How to resolve AWS load balancer DNS address more quickly?

How to resolve AWS load balancer DNS address more quickly? - amazon-web-services

Given I setup an AWS load balancer (via kubectl expose deployment),
when I run nslookup *.us-east-1.elb.amazonaws.com, then the first few minutes it returns server can't find *.us-east-1.elb.amazonaws.com: NXDOMAIN. Similarly, when running curl *.us-east-1.elb.amazonaws.com, it returns Could not resolve host.
Simply waiting 5-10 minutes resolves both issues, but this wait time is quite a pain in e.g. CI pipelines.
Is there a way to speed up this resolving? For example, would pointing to an AWS DNS server help?

Related

SSH beanstalk from terminal using DNS

I am running an app in AWS Beanstalk, I use jenkins to do automatic deploys, manage crons, ecc, jenkins connects to the EC2 behind Beanstalk using the public ip.
The problem arises when the instance scales, since the IP of the EC2 will be different, I have to manually update Jenkins every time.
One of the simplest options would be to open the port 22 in the loadbalancer, but since I am using the recommended application loadbalancer, it only allows me to open the port 80/443. I was wondering if there is a way to create a dns record in route 53, that will automatically point to the right IP every time it scales?
I would like to avoid changing load balancer, because there are at least 20 environments that will need to be reconfigured.
I tried to look but no-one seems to have this issue, so either I have the wrong architecture, or it is too easy to fix.

Service not responding to ping command

My service (myservice.com) which is hosted in EC2 is up and running. I could see java process running within the machine but not able to reach the service from external machines. Tried the following option,
dns +short myservice.com
ping myservice.com
(1) is resolving and giving me ip address. ping is causing 100% packet loss. Not able to reach the service.
Not sure where to look at. Some help to debug would be helpful.
EDIT:
I had an issue with previous deployment due to which service was not starting - which I've fixed and tried to update - but the deployment was blocked due to ongoing deployment (which might take ~3hrs to stabilize). So I tried enabling Force deployment option from the console
Also tried minimising the "Number of Tasks" count to 0 and reverted it back to 1 (Reference: How do I deploy updated Docker images to Amazon ECS tasks?) to stop the ongoing deployment.
Can that be an issue?

You probably need to allow ICMP protocol in the security group.
See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules-reference.html#sg-rules-ping

AWS EC2 instance fails consistently at 30 seconds on long page load

I am running an ECS instance on EC2 with an application load balancer, a route53 domain, and a RDS db. This is an internal business application that I have restricted IP access to.
I have ran this app for 3 weeks with no issues. However, today the data that the web app ingests is an abnormally large size. This is not a mistake. Due to this though, a webpage is taking approximately 4 minutes to complete which I verified on my local machine it completes. However, running the same operation on AWS fails at precisely 30 seconds every time.
I have connected the app running on my local machine to my production RDS db and am able to download and upload the data with no issue. So there is no issue with the RDS db. In addition, this same functionality has worked previously and only failed today due to the large amount of data.
I spent hours with Amazon support to solve this issue but we couldn't figure it out. I am assuming it is a setting for one the AWS services I am using that has a TTL or timeout set to 30 seconds, but I couldn't find it in any of the services I am using:
route53
RDS
ECS
ECR
EC2
Load Balancer
Target Group

You have a backend instance timeout, likely in the web server config.
Right now your ELB has a timeout of 60 seconds, but your assets are failing at 30.
There are only a couple assets on AWS with hardcoded timeouts like that. I'm thinking (because this is the first time it's happened), you have one of the following:
Size limits in the upstream, or
Time limits on connection keep-alive
Look at your website server software (httpd/nginx). Nginx has something called "upstream.conf" where you can set upstream timeouts. I'm not sure of httpd does as well.
Resources:
https://serverfault.com/questions/414987/nginx-proxy-timeout-while-uploading-big-files

From the NLB documentation, maybe relevant
EC2 instances must respond to a new request within 30 seconds in order to establish a return path.
I don't actually know what a return path is, nor what a 'response' is in this context since NLB has no concept of requests or responses.
- https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout
EDIT: Disregard, this must have to do with UDP NATing. 'Response' here is probably a packet going back from the EC2 instance to the client

Single EC2 behind ELB creates massive latency

I have a Flask API running on a single EC2 instance. I need to add SSL (https) to my app and the docs seem to indicate that the best way is to use ELB.
The problem is when I set up the EC2 behind the ELB and with a Route 53 pointing to the load balancer, I get latency issues like never before. API calls that should take 150ms, take 31s. If I ping the EC2 directly, there is no issue, but pinging the Route53/ELB takes too long.
I've looked at past responses such as these:
The reason for the delay is because you have the ELB setup for
multi-az without any application instances in the other 2 AZ's
configured. Without instances in those AZ's requests will tend to fail
because the ELb still returns IP addresses for those AZ's even if
there are no active application instances. Please disable the other
AZ's for now and continue your tests.
So I deleted the subnets and AZs that are not related to my EC2 instance, but now I have:
503 errors: Backend server is at capacity all over the place.
About to pull my hair out. All I am trying to do it setup SSL for my app...

For anyone else struggling with this, using the Application Load Balancer instead of the Classic Load Balancer has resolved this latency issue.

Seems like some connectivity issue. Follow this step to drill down the issue. As its happening from load balancer only. So first do the NSLOOKUP like nslookup name. You will get minimum two IP.
Try to individual curl LB IP's n check the response time. You may use below command or perhaps any of your fav tool.
curl -w "dns_resolution: %{time_namelookup}, \
tcp_established: %{time_connect}, \
ssl_handshake_done: %{time_appconnect}, \
TTFB: %{time_starttransfer}\n" \
-o /dev/null -s "http://<ELB IP>/<path>"
This way you be figure out whether is the issue. It be due to routing issue on one specific AZ subnet, or some other issue.

How to prevent Google Cloud Load balancer to forward the traffic to newly created auto scaled Instance without being ready?

I will need to host a PHP Laravel application on Google Cloud Compute Engine with auto scaling and load balancing. I tried to setup and configure following:
I Created instance template, where I have added startup script to install apache2, PHP, cloning the git repository of my project, Configuring the Cloud SQL proxy, and configure all settings required to run this Laravel project.
Created Instance group, Where I have configured a rule when CPU reaches certain percent it start creating other instances for auto scale.
Created Cloud SQL instance.
Created Storage bucket, in my application all of the public contents like images will be uploaded into storage bucket and it will be served from there.
Created Load Balancer and assigned the Public IP to load balancer, configured the fronted and backed correctly for load balancer.
As per my above configuration, everything working fine, When a instance reaches a defined CPU percentage, Auto scaling start creating another instances and load balancer start routing the traffic to new instance.
The issue I'm getting, to configure and setup my environment(the startup script of instance template) takes about 20-30 minutes to configure and start ready to serve the content from the newly created instance. But when the load balancer detects if the newly created machine is UP and running it start routing the traffic to new VM instance which is not being ready to serve the content from it.
As a result, when load balancer routes the traffic to not ready machine, it obviously send me 404 error, and some other errors.
How to prevent to happen it, is there any way that the instance that created through auto scaling service send some information to load balancer after this machine is ready to serve the content and then only the load balancer route the traffic to the newly created instance?

How to prevent Google Cloud Load balancer to forward the traffic to
newly created auto scaled Instance without being ready?
Google Load Balancers use the parameter Cool Down to determine how long to wait for a new instance to come online and be 100% available. However, this means that if your instance is not available at that time, errors will be returned.
The above answers your question. However, taking 20 or 30 minutes for a new instance to come online defeats a lot of the benefits of autoscaling. You want instances to come online immediately.
Best practices mean that you should create an instance. Configure the instance with all the required software applications, etc. Then create an image of this instance. Then in your template specify this image as your baseline image. Now your instances will not have to wait for software downloads and installs, configuration, etc. All you need to do is run a script that does the final configuration, if needed, to bring an instance online. Your goal should be 30 - 180 seconds from launch to being online and running for a new instance. Rethink / redesign anything that takes longer than 180 seconds. This will also save you money.

John Hanley answer is pretty good, I'm just completing it a bit.
You should take a look at packer to create your preconfigured google images, this will help you when you need to add a new configuration or do updates.
The cooldown is a great way, but in your case you can't really be sure that your installation won't take a bit more time sometimes due to updates as you should do an apt-get update && apt-get upgrade at instance startup to be up to date it will only take more and more time...
Load balancers normally should have a health check configured and should not route traffic unless the instance is detected as healthy. In your case as you have apache2 installed I suppose you have a HC on the port 80 or 443 depending on your configuration on a /healthz path.
A way to use the health check correctly would be to create a specific vhost for the health check and you add a fake domain in the HC, let's say health.test, that would give a vhost listening for health.test and returning a 200 response on /healthz path.
This way if you don't change you conf, just activate the health vhost last so the loadbalancer don't start routing traffic before the server is really up...

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js