EC2 server can't resolve hostnames - amazon-web-services

When trying to resolve a hostname (i.e. using dig), the server almost always fails, saying ;; connection timed out; no servers could be reached. Around one in ten attempts works, usually after a long waiting time.
Strange thing is that the same behavior happens also if I'm querying a different DNS server (Google's).
My default nameserver is Amazon's, # 172.31.0.2 . I get this one automatically when the server connects using DHCP.
Pinging the IPs (8.8.8.8 & 172.31.0.2) also usually fails.
I've tried checking the VPC settings and security group settings, but found nothing. Also the fact it works every once in a while makes me even more confused.

The problem disappeared by itself after around 48 hours. I don't know how to further analyize the issue so I'm closing this question. I can't think of anything about the server or AWS configuration that could have caused this, so I assume it was something with AWS's infrastructure.
Thanks

Related

GCP IPSEC Issues

Since HA VPNs were introduced it's a nightmare to get tunnels up and working again, the errors do not make sense.
Adding the tunnel to the gateway I get:
"Invalid value for field 'resource.destRange': ''. is not a valid IPv4 address range"
since when is 192.168.1.0/24 not CIDR? Looking at the logs it ends of with:
"establishing IKE_SA failed, peer not responding"
yet I am looking at the traffic flowing, what did they do?
Created 10 tunnels, 5 work 5 don't.
Can someone explain what's going on?
Seems that an internal issue is going on when trying to configure IKEv1 + Route-based VPN tunnels. This will be fixed in the next few days.
If this is your case, as workaround, you can try using either IKEv2 or policy-based VPN.

Webpage resource request stalled for nearly a minute in Chrome

A resource on my webapp takes nearly a minute to load after a long stall. This happens consistently. As shown below, only 3 requests on this page actually hit the server itself, the rest hit the memory or disk cache. This problem only seems to occur on Chrome, both Safari and Firefox do not exhibit this behavior.
I have implemented the Cache-Control: no-store suggestion in this SO question but the problem persists. request stalled for a long time occasionally in chrome
Also included below is an example of what the response looks like once it finally does come in.
My app is hosted in AWS behind a Network Load Balancer which proxies to an EC2 instance running nginx and the app itself.
Any ideas what is causing this?
I encountered the exact same problem. We are using Elastic Beanstalk with Network Load Balancer (NLB) with TLS termination at NLB.
The feedback I got from AWS support is:
This problem can occur when a client connects to a TLS listener on a Network Load Balancer and does not send data immediately after completing the TLS handshake. The root cause is an edge case in the handling of new connections. Note that this only occurs if the Target Group for the TLS listener is configured to use the TCP protocol without Proxy Protocol v2 enabled
They are working on a fix for this issue now.
Somehow this problem can only be noticed when you are using Chrome browser.
In the meantime, you have these 2 options as workaround:
enable Proxy Protocol v2 on the Target Group OR
configure the Target Group to use TLS protocol for routing traffic to the targets
I know it's a late answer but I write it for someone seeking a solution.
TL;DR: In my case, enabling cross-zone load balancing attribute of NLB solved the problem.
With investigation using WireShark I figured out there were two different IPv4 addresses Chrome communicated with.
Sending packets to one of them always succeeded and to the other always failed.
Actually the two addresses delegated two Availability Zones.
By default, cross-zone load balancing is disabled if you choose NLB (on the contrary the same attribute of ALB is enabled by default).
Let's say there are two AZs; AZ-1 / AZ-2.
When you attach both AZs to a NLB, it has a node for each AZ.
The node belongs to AZ-1 just routes traffic to instances which also belong to AZ-1. AZ-2 instances are ignored.
My modest app (hosted on Fargate) has just one app server (ECS task) in AZ-2 so that the NLB node in AZ-1 cannot route traffic to anywhere.
I'm not familiar with TCP/IP or Browser implementation but in my understanding, your browser somehow selects the actual ip address after DNS lookup.
If the AZ-2 node is selected in the above case then everything goes fine, but if the AZ-1 is selected your browser starts stalling.
Maybe Chrome has a random strategy to select ip while Safari or Firefox has a sticky one, so that the problem only appears on Chrome.
After enabling cross-zone load balancing the ECS task on AZ-2 is visible from the AZ-1 NLB node, and it works fine with Chrome browser too.
(Please feel free to correct my poor English. Thank you!)
I see two things that could be responsible for delays:
1) Usage of CDNs
If the resources that load slow are loaded from CDNs (Content Delivery Networks) you should try to download them to the server and deliver directly.
Especially if you use http2 this can be a remarkable gain in speed, but also with http1. I've no experience with AWS, so I don't know how things are served there by default.
It's not shown clearly in your screenshot if the resources are loaded from CDN but as it's about scripts I think that's a reasonable assumption.
2) Chrome’s resource scheduler
General description: https://blog.chromium.org/2013/04/chrome-27-beta-speedier-web-and-new.html
It's possible or even probable that this scheduler has changed since the article was published but it's at least shown in your screenshot.
I think if you optimize the page with help of the https://www.webpagetest.org and the chrome web tools you can solve any problems with the scheduler but also other problems concerning speed and perhaps other issues too. Here is the link: https://developers.google.com/web/tools/
EDIT
3) Proxy-Issue
In general it's possible that chrome has either problems or reasons to delay because of the proxy-server. Details can't be known before locking at the log-files, perhaps you've to adjust that log-files are even produced and that the log-level is enough to tell you about any problems (Level Warning or even Info).
After monitoring the chrome net-export logs, it seems as though I was running into this issue: https://bugs.chromium.org/p/chromium/issues/detail?id=447463.
I still don't have a solution for how to fix the problem though.

My google cloud instance is no longer able to resolve external hostnames

Yesterday I had to revert to a recent snapshot of my vm. This vm was working flawlessly at the time I took it.
But now I can no longer resolve any url from this host. All git pull commands, all curl requests, host lookups, etc.. are failing. For instance:
# host www.google.com
; connection timed out; no servers could be reached
Yet this host is reachable from the outside world, as I can ssh to it, and http requests coming in are being serviced.
What am I forgetting?
Turns out that the file /etc/resolv.conf has been automagically populated roughly 18 hours after spinning up the instance.
Not super convenient, but glad it is resolved.
Had I known at the time, I think I would have been able to resolve the issue by adding this to /etc/resolve.conf:
domain c.[Project ID].internal
search c.[Project ID].internal.google.internal.
nameserver 169.254.169.254
This is the expected behavior, the hostname of an instance in the GCP is provided by the metadata server. Every time an instance boots up, it will get the hostname from the metadata server, therefore resetting any changes made on the instance level, please see 1 and 2.

AWS, Load Balancer 504 error after a few requests

I am repeating a question that I posted at https://forums.aws.amazon.com/thread.jspa?threadID=275855&tstart=0
to reach out more people.
Hi,
I am trying to deploy a REST service in AWS. The current architecture is:
Domain name (Route 53) -> Load Balancer -> Single EC2 instance (bound to an Elastic IP). And I use TLS/SSL certificate issued by a Certificate Manager.
The instance is Ubuntu 16.04 machine, and the service is implemented with (bare) Vert.X (==no proxy server).
However, 504 Error (gateway timeout) occurs after a few different requests (each of which takes <1s) in a series, and then it does not respond. The requests do not reach the server instance after a few requests. I checked that it happens in the same way when I access both the domain name and the load balancer directly. I have confirmed that the exact same scenario is working with direct URL.
I run up a dummy server returning "hello world" and it's working okay with the load balancer. The problem should be caused by something no coherent between the load balancer and the server code, but I can't get where to start.
I have checked several threads complaining the 504 errors, and followed some of the instructions, but they do not work. Especially I set keep-alive option in Vert.x and set the idle time longer than the balancer's. As the delays are not longer than the idel time with the direct communication, I believe it is not the problem anyway. I have checked the Security Groups also and confirmed the right ports are open. (The first few requests are working, so it must not be the problem also.)
Does any of you have a sense where I should start looking at? Even better, know the source of the problem?
Thanks in advance.
EDIT: I just found the issue in some of the code. I've answered myself below. Thanks for reading!
Found the issue in my code. Some of the APIs (implemented by my colleague...) was not flushing the buffer of HTTP responses in the server.
In Vert.X Java, it was resp.end().
It was somehow working with direct access probably the buffer was flushed at some point, but that flush seems not caught by the load balancer.
Hope nobody experiences this, but in case...

EC2 Database through Laravel Forge has stopped being accessable

I've been running an instance EC2 through Laravel forge for about 2000 hours and this morning got this error while trying to reach it:
SQLSTATE[08006] [7] could not connect to server: Connection refused Is
the server running on host "172...***" and accepting TCP/IP
connections on port 5432?
After SSHing into the server I've getting a similar error when trying to run a command. I've dug through AWS but don't see any errors being throw. I double checked the ip address for the instance to make sure the IP hadn't changed for any reason. Of course I'm a little behind on my backups for the application so I'm hoping someone might have some ideas why else I can do to try and access this data. I haven't made any changes to the app in about 10 days, but found the error while I was pushing an update. I have six other instances of the same app that weren't affected (thankfully) but makes me even more confused with the cause of the issue.
In case anyone comes across a similar issue, here's what had happened. I had an error running in the background which had filled up the EC2 harddrive's log. Since the default Larvel/Forge image has a DB running within in the EC2 instance, once it ran out of room everything stopped working. I was able to SSH in and delete the log though, and everything started working again.
To prevent the issue from happening again I then created an amazon RDS and used that rather than the EC2 instance. It's about three or four times the price of just an EC2 instance, but still not that much and the confidence I now have in the system is well worth it.