My google cloud instance is no longer able to resolve external hostnames - google-cloud-platform

Yesterday I had to revert to a recent snapshot of my VM. The VM was working flawlessly at the time I took the snapshot.
But now I can no longer resolve any hostname from this host. All git pull commands, curl requests, host lookups, etc. are failing. For instance:
# host www.google.com
; connection timed out; no servers could be reached
Yet this host is reachable from the outside world, as I can ssh to it, and http requests coming in are being serviced.
What am I forgetting?

Turns out that the file /etc/resolv.conf has been automagically populated roughly 18 hours after spinning up the instance.
Not super convenient, but glad it is resolved.
Had I known at the time, I think I would have been able to resolve the issue by adding this to /etc/resolv.conf:
domain c.[Project ID].internal
search c.[Project ID].internal. google.internal.
nameserver 169.254.169.254
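A quick way to confirm whether a resolv.conf actually names a DNS server is to parse it. The following is a minimal sketch, not a full parser: the function name parse_resolv_conf and the sample contents (with "example-project" standing in for the project ID) are assumptions made for illustration.

```python
from typing import Dict, List


def parse_resolv_conf(text: str) -> Dict[str, List[str]]:
    """Collect nameserver, search, and domain entries from resolv.conf text."""
    entries: Dict[str, List[str]] = {"nameserver": [], "search": [], "domain": []}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines (which start with '#' or ';').
        if not line or line[0] in "#;":
            continue
        keyword, *values = line.split()
        if keyword in entries:
            entries[keyword].extend(values)
    return entries


# Hypothetical file contents; "example-project" is a placeholder project ID.
SAMPLE = """\
domain c.example-project.internal
search c.example-project.internal. google.internal.
nameserver 169.254.169.254
"""

config = parse_resolv_conf(SAMPLE)
if config["nameserver"]:
    print("Nameservers:", ", ".join(config["nameserver"]))
else:
    print("No nameserver entry found")
```

On a real host, the same function could be fed the contents of /etc/resolv.conf; a result with no nameserver entries would be consistent with the "no servers could be reached" error shown in the question.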

This is the expected behavior. The hostname of an instance in GCP is provided by the metadata server. Every time an instance boots up, it gets its hostname from the metadata server, which resets any changes made at the instance level. Please see 1 and 2.
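To see what the metadata server reports as the hostname, the instance can query it directly. This is a sketch under assumptions: it uses the 169.254.169.254 address from this thread rather than a DNS name (since DNS is the thing that is broken), the function names are made up for illustration, and fetch_hostname only works from inside a GCP instance.

```python
import urllib.request

# Metadata endpoint for the instance hostname, addressed by IP so that it
# does not depend on working DNS.
METADATA_HOSTNAME_URL = (
    "http://169.254.169.254/computeMetadata/v1/instance/hostname"
)


def build_hostname_request() -> urllib.request.Request:
    """Build (but do not send) the metadata request for the hostname."""
    # The metadata server expects this header on every request.
    return urllib.request.Request(
        METADATA_HOSTNAME_URL, headers={"Metadata-Flavor": "Google"}
    )


def fetch_hostname(timeout: float = 2.0) -> str:
    """Send the request; this only succeeds from inside a GCP instance."""
    with urllib.request.urlopen(build_hostname_request(), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()
```

Calling fetch_hostname() on the instance and comparing the result with the output of the hostname command would show whether the two agree.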

Related

Cannot connect vSphere ESXi 7 with Web client

I am installing VMware vSphere ESXi 7.0.2, but I cannot use the web client (http://<ip_address>/ui).
After the first installation, I could connect to https://<IP_address> (which redirects to https://<IP_address>/ui) and create a VM. But I found that I could not use some of the SSDs/HDDs, so I reinstalled ESXi after creating the RAID partitions.
The reinstall looked OK, and I could see the DCUI and set the IP, DNS, etc. After everything was set, I tried https://<IP_address>, but it timed out. (I checked several things and found that ping does not work.)
I restarted the server, and then ping was OK. But when I try to connect to https://<IP_address>, ping starts returning "Destination net unreachable". (I confirmed this with the "-t" option.)
I thought it was the firewall settings, so I changed "--default-action" and "--enabled", but it is still not working. Just in case, I stopped using the RAID disks and reinstalled again (the same as the first installation), but got the same result.
There's likely still a networking-related misconfiguration. Use DCUI to verify IP/subnet mask/gateway/VLAN tag (if necessary) and that the appropriate NIC has been configured.
If those are set correctly, the DCUI also has some built-in testing options that allow you to do some outbound ping testing. By default it will check 3 hosts, including the gateway and usually two DNS names, but those can be changed to other options.
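One common misconfiguration that the IP/subnet mask/gateway check can reveal is a gateway that lies outside the host's own subnet. The sketch below shows that single check with Python's ipaddress module; the function name and all addresses are hypothetical and not taken from the original post.

```python
import ipaddress


def gateway_in_host_subnet(host_ip: str, netmask: str, gateway: str) -> bool:
    """Return True if the gateway lies inside the host's own subnet."""
    # strict=False accepts a host address instead of a network address.
    network = ipaddress.ip_network(f"{host_ip}/{netmask}", strict=False)
    return ipaddress.ip_address(gateway) in network


# Hypothetical values for illustration only.
print(gateway_in_host_subnet("192.168.10.25", "255.255.255.0", "192.168.10.1"))  # True
print(gateway_in_host_subnet("192.168.10.25", "255.255.255.0", "192.168.1.1"))   # False
```

A False result would mean the host has no on-link route to its gateway. That is one possible cause of unreachable-network errors, though not necessarily the cause here.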

What keeps accessing Google Cloud metadata on my instance

I have a Google Cloud compute instance running Ubuntu 18. We had Wireshark running to track down another problem, and we noticed that every minute something accesses the metadata server. Three requests every minute:
GET /computeMetadata/v1/instance/virtual-clock/drift-token?alt=json&last_etag=XXXXXXXXXXXXXXXX&recursive=False&timeout_sec=60&wait_for_change=True
GET /computeMetadata/v1/instance/network-interfaces/?alt=json&last_etag=XXXXXXXXXXXXXXXX&recursive=True&timeout_sec=60&wait_for_change=True
GET /computeMetadata/v1/?alt=json&last_etag=XXXXXXXXXXXXXXXX&recursive=True&timeout_sec=77&wait_for_change=True
In all cases, Wireshark says the source is the IP of my instance and the destination is 169.254.169.254, which is the Google metadata server.
None of the code we have written accesses the server. The first request makes me think that some Google-specific software is accessing the metadata, but I haven't been able to prove that. What is worrisome is that the response to the third one contains SSH keys. Also, every minute seems excessive.
I see another post talking about scripts in /usr/share/google, but I don't have that directory. I do see that google-fluent is installed. I also see an installed snap for google-cloud-sdk. Could one of those be it? I don't recall installing them and, AFAIK, I am not using them, so if that is it, what is the harm in uninstalling it?
You do not have a problem to worry about. The metadata server is private to your instance. The Google VM guest environment software and Stackdriver (fluentd) are making requests to the metadata server to get credentials, detect changes (new SSH keys), set the clock, etc.
The IP address 169.254.169.254 is an IPv4 Link Local Address. Only your VM has a route to that network.
Compute Engine Guest Environment
Do not attempt to uninstall the Guest Environment. You can remove Stackdriver, but I do not recommend that. Stackdriver provides logging and monitoring features that are very useful.
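The query parameters in the captured requests offer a plausible explanation for the one-minute cadence. With wait_for_change=True, the metadata server holds the request open until something changes or timeout_sec expires, so a client that re-issues the request after each timeout appears roughly once a minute. The sketch below decodes the third captured request and checks that the address is link-local. The function name is made up for illustration, and the reading of the cadence is an inference from the parameters, not something confirmed in the thread.

```python
import ipaddress
from urllib.parse import parse_qs, urlsplit

# Third request from the capture above (etag left masked as in the post).
REQUEST_TARGET = (
    "/computeMetadata/v1/?alt=json&last_etag=XXXXXXXXXXXXXXXX"
    "&recursive=True&timeout_sec=77&wait_for_change=True"
)


def describe_metadata_request(target: str) -> dict:
    """Summarize the path and long-poll parameters of a metadata request."""
    parts = urlsplit(target)
    # parse_qs returns a list per key; each key appears once here.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "path": parts.path,
        "long_poll": query.get("wait_for_change") == "True",
        "timeout_sec": int(query.get("timeout_sec", "0")),
    }


info = describe_metadata_request(REQUEST_TARGET)
print(info)
print(ipaddress.ip_address("169.254.169.254").is_link_local)  # True
```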

AWS, Load Balancer 504 error after a few requests

I am repeating a question that I posted at https://forums.aws.amazon.com/thread.jspa?threadID=275855&tstart=0
to reach more people.
Hi,
I am trying to deploy a REST service in AWS. The current architecture is:
Domain name (Route 53) -> Load Balancer -> Single EC2 instance (bound to an Elastic IP). I use a TLS/SSL certificate issued by Certificate Manager.
The instance is an Ubuntu 16.04 machine, and the service is implemented with (bare) Vert.x (i.e., no proxy server).
However, a 504 error (gateway timeout) occurs after a few different requests in a series (each of which takes <1s), and then the service stops responding. After those first few requests, further requests do not reach the server instance. It happens the same way whether I access the domain name or the load balancer directly. I have confirmed that the exact same scenario works when I use the instance's direct URL.
I ran a dummy server returning "hello world", and it works fine with the load balancer. So the problem must be caused by some incompatibility between the load balancer and the server code, but I don't know where to start.
I have checked several threads about 504 errors and followed some of their instructions, but they did not help. In particular, I set the keep-alive option in Vert.x and set the idle timeout longer than the balancer's. Since the delays are no longer than the idle timeout with direct communication, I believe that is not the problem anyway. I have also checked the security groups and confirmed that the right ports are open. (The first few requests work, so that cannot be the problem either.)
Does anyone have a sense of where I should start looking? Or, even better, know the source of the problem?
Thanks in advance.
EDIT: I just found the issue in some of the code. I've answered myself below. Thanks for reading!
Found the issue in my code. Some of the APIs (implemented by my colleague...) were not flushing the HTTP response buffer on the server.
In Vert.x (Java), the missing call was resp.end().
It somehow worked with direct access, probably because the buffer was flushed at some point, but that flush does not seem to have been picked up by the load balancer.
Hope nobody experiences this, but just in case...

EC2 Database through Laravel Forge has stopped being accessible

I've been running an EC2 instance through Laravel Forge for about 2000 hours, and this morning I got this error while trying to reach it:
SQLSTATE[08006] [7] could not connect to server: Connection refused Is
the server running on host "172...***" and accepting TCP/IP
connections on port 5432?
After SSHing into the server, I get a similar error when trying to run a command. I've dug through AWS but don't see any errors being thrown. I double-checked the IP address of the instance to make sure it hadn't changed for any reason. Of course I'm a little behind on my backups for the application, so I'm hoping someone might have some ideas about what else I can do to try to access this data. I haven't made any changes to the app in about 10 days, but I found the error while I was pushing an update. I have six other instances of the same app that weren't affected (thankfully), but that makes me even more confused about the cause of the issue.
In case anyone comes across a similar issue, here's what had happened. An error recurring in the background had filled the EC2 instance's hard drive with log output. Since the default Laravel/Forge image has a DB running within the EC2 instance, once it ran out of room everything stopped working. I was able to SSH in and delete the log, though, and everything started working again.
To prevent the issue from happening again, I then created an Amazon RDS instance and used that rather than the database on the EC2 instance. It's about three or four times the price of just an EC2 instance, but still not that much, and the confidence I now have in the system is well worth it.
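A full disk of this kind can be caught earlier with a periodic free-space check. The following is a minimal sketch using Python's standard library. The function names and the 90% threshold are arbitrary assumptions, and a real setup would more likely rely on a monitoring service and log rotation.

```python
import shutil


def used_fraction(path: str = "/") -> float:
    """Return the fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def is_nearly_full(fraction: float, threshold: float = 0.9) -> bool:
    """Return True once usage reaches the (arbitrary) warning threshold."""
    return fraction >= threshold


fraction = used_fraction("/")
if is_nearly_full(fraction):
    print(f"Warning: disk is {fraction:.0%} full")
else:
    print(f"Disk usage is {fraction:.0%}")
```

Run from cron or a similar scheduler, a check like this would have flagged the growing log before the database ran out of space.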

EC2 server can't resolve hostnames

When trying to resolve a hostname (e.g., using dig), the server almost always fails, saying ;; connection timed out; no servers could be reached. Around one in ten attempts works, usually after a long wait.
Strange thing is that the same behavior happens also if I'm querying a different DNS server (Google's).
My default nameserver is Amazon's, 172.31.0.2. I get this one automatically when the server connects using DHCP.
Pinging the IPs (8.8.8.8 & 172.31.0.2) also usually fails.
I've tried checking the VPC settings and security group settings, but found nothing. Also the fact it works every once in a while makes me even more confused.
The problem disappeared by itself after around 48 hours. I don't know how to further analyze the issue, so I'm closing this question. I can't think of anything about the server or AWS configuration that could have caused this, so I assume it was something with AWS's infrastructure.
Thanks
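For an intermittent failure like this one, it can help to put numbers on it by resolving the same name repeatedly and recording how often and how quickly it succeeds. This is a rough sketch: the function names are made up for illustration, it uses the system resolver through socket.gethostbyname by default, and the resolver argument exists only so the logic can be exercised without a network.

```python
import socket
import time
from typing import Callable, List, Tuple


def probe_resolution(
    hostname: str,
    attempts: int = 10,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> List[Tuple[bool, float]]:
    """Resolve `hostname` repeatedly; return (succeeded, seconds) per attempt."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            resolver(hostname)
            ok = True
        except OSError:
            # socket.gaierror and socket timeouts are subclasses of OSError.
            ok = False
        results.append((ok, time.monotonic() - start))
    return results


def success_rate(results: List[Tuple[bool, float]]) -> float:
    """Return the fraction of attempts that succeeded."""
    if not results:
        return 0.0
    return sum(1 for ok, _ in results if ok) / len(results)


# Example usage on the affected server (performs real DNS lookups):
#   results = probe_resolution("www.google.com", attempts=20)
#   print(f"{success_rate(results):.0%} of lookups succeeded")
```

Logging these numbers over time, together with ping results for the nameserver IP, would give more to go on if the problem returns.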