What logs can I use to diagnose connectivity issues on a GCE network? - google-cloud-platform

On Google Cloud we are using the following:
A Cloud Function that connects to
A service running on a GCE VM
via a VPC Access Connector
It's been running fine for months, then all of a sudden it stopped working, and every attempt to connect to the service now causes the following error in our Cloud Function logs:
Connection to 10.X.X.X timed out. (connect timeout=10)
That IP address is the IP address of the VM.
At this point I'm not sure how to go about diagnosing the problem as GCE networking is unfamiliar to me. What should I be searching for in Cloud Logging to try and determine the root cause of the problem?

You can use Cloud Logging to check the logs for your GCE network. You can find them at:
Navigation Menu > Logging > Logs Explorer
In the upper part of the console, click Resource.
Scroll down and choose GCE Network.
Once you click GCE Network, it will show you the Network ID of your VPC network.
Then choose the network where your VM instance is located and click "Apply".
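If you prefer typing a query, the equivalent filter in the Logs Explorer query box looks roughly like this (a sketch; NETWORK_ID is the ID shown for your VPC network):
resource.type="gce_network"
resource.labels.network_id="NETWORK_ID"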
For more information about Cloud Logging you can explore this link.

I've discovered that much more detailed logs are available by filtering on
resource.type="gce_subnetwork"
These logs give much richer detail about the actual network traffic.
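For example, if VPC Flow Logs are enabled on the subnet, a query along these lines narrows things down to traffic destined for the VM (a sketch; PROJECT_ID and the IP are placeholders):
resource.type="gce_subnetwork"
logName="projects/PROJECT_ID/logs/compute.googleapis.com%2Fvpc_flows"
jsonPayload.connection.dest_ip="10.X.X.X"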

Related

dataproc hadoop/spark job can not connect to cloudSQL via Private IP

I am facing an issue setting up private IP access between Dataproc and Cloud SQL with a VPC network and peering, and would really appreciate help since I have not been able to figure this out after the last 2 days of debugging and following pretty much all the docs.
The setup I have tried so far (with internal IP only):
Enabled "Private Google Access" on the default subnet and used the default subnetwork for both Dataproc and Cloud SQL.
Created a new VPC network/subnetwork, used it to create the Dataproc cluster, and updated Cloud SQL to use that network.
Created an IP range and a "private service connection" to the "Google Cloud Platform" service provider and enabled it, along with VPC network peering to "servicenetworking" (sketched below).
Explicitly added the Cloud SQL Client role to the default Dataproc compute service account (even though I didn't need this for other VMs connecting to Cloud SQL with the same service account, because it is an admin ("Editor") role anyway).
All according to the doc https://cloud.google.com/sql/docs/mysql/private-ip and the other links there.
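For reference, the private service connection step above usually boils down to commands like these (a sketch; the range and network names are placeholders):
gcloud compute addresses create google-managed-services-range \
    --global --purpose=VPC_PEERING --prefix-length=16 --network=my-vpc
gcloud services vpc-peerings connect \
    --service=servicenetworking.googleapis.com \
    --ranges=google-managed-services-range --network=my-vpc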
Problem:
When I submit a Spark job on Dataproc that connects to this Cloud SQL instance, it fails with the following error: Communications link failure....
Caused by: java.net.ConnectException: Connection refused (Connection refused)
Test & debug:
Connectivity tests all pass between the exact internal IP addresses on both sides (Dataproc node and Cloud SQL instance).
The mysql command-line client can connect fine from the Dataproc master node.
Cloud Logging does not show any deny or other issue when connecting to MySQL.
Screenshot of the connectivity test on both the default and the new VPC network.
Other Stack Overflow questions I referred to about using private IP:
Cannot connect to Cloud SQL from Cloud Run after enabling private IP and turning off public iP
How to access Cloud SQL from dataproc?
PS: I want to avoid the Cloud SQL proxy route to connect to Cloud SQL from Dataproc, so I don't want to install the cloud_proxy service via an initialization action.
A "Connection refused" normally means that nothing is listening on the other end. The logs also contain hints that the database connection is attempted to localhost, port 3307. This is the right port for the CloudSQL proxy, one higher than the usual MySQL port.
Check whether the metadata configuration for your cluster is correct:
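For instance, to see the metadata the cluster was created with (a sketch; the cluster name and region are placeholders):
gcloud dataproc clusters describe my-cluster --region=us-central1 \
    --format="value(config.gceClusterConfig.metadata)"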
Workaround 1:
Check whether the cluster that is having issues runs a different Cloud SQL proxy version (1.xx). A difference in the SQL proxy version seems to be behind this issue, so you can pin the cluster to the suitable Cloud SQL proxy version 1.xx.
Workaround 2:
Run the command: journalctl -r -u cloud-sql-proxy.service | grep -i err
Based on the logs, check which SQL proxy instance is causing issues.
Check whether the root cause is that the project was hitting the "queries per 100 seconds per user" quota of the Cloud SQL Admin API.
Actions:
Increase the quota and restart the affected Cloud SQL proxy services (by monitoring the jobs running on the master nodes that failed).
This is similar to the linked issue, but with the quota error preventing startup instead of the network errors described there. With the updated quota, the Cloud SQL proxy should not have this reoccur.
Here's a recommended set of next steps:
Reboot any nodes that appear to have a defunct/broken Cloud SQL proxy. systemd won't report the truth, but running "mysql --host ... --port ..." to connect to the Cloud SQL proxy on the bad nodes will detect this.
Bump up the API quota immediately: in the Cloud Console go to "IAM and Admin" > "Quotas", search for the "Cloud SQL Admin API" and click through to it, then click the pencil to "edit"; you should be able to bump it to 300 as self-service without approval. If you want more than 300 per 100s, you might need to file an approval request.
If the quota usage is approaching 100 per 100s from time to time, update the quota to 300.
It's possible that the extra Cloud SQL proxy instances on the worker nodes are causing more load than necessary compared with running the proxy only on the master node. If the cluster only uses a driver that runs on the master node, the worker nodes don't need to run the proxy.
To find the broken nodes, you can check which ones are responding on the Cloud SQL proxy port.
You can loop over each hostname, SSH to it, and run this command:
nc -zv localhost 3307 || sudo systemctl restart cloud-sql-proxy
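A rough sketch of that loop, assuming the cluster nodes share a name prefix such as mycluster- and live in zone us-central1-a (both placeholders):
for host in $(gcloud compute instances list --filter="name ~ ^mycluster-" --format="value(name)"); do
  # restart the proxy only on nodes where port 3307 is not answering
  gcloud compute ssh "$host" --zone=us-central1-a \
    --command='nc -zv localhost 3307 || sudo systemctl restart cloud-sql-proxy'
done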
or you could check the logs on each to see which ones have logged a quota message like this:
grep cloud_sql_proxy /var/log/syslog | tail
and see if the very last message logged says "Error 429: Quota exceeded for quota group 'default' and limit 'USER-100s' of service
'sqladmin.googleapis.com' for consumer ..."
The nodes which aren't running the Cloud SQL proxy can be rebooted to start from scratch, or you can restart the proxy with this command on each:
sudo systemctl restart cloud-sql-proxy

Why are outbound SSH connections from Google CloudRun to EC2 instances unspeakably slow?

I have a Node API deployed to Google CloudRun and it is responsible for managing external servers (clean, new Amazon EC2 Linux VMs), including through SSH and SFTP. SSH and SFTP actually work eventually, but the connections take 2-5 MINUTES to initiate. Sometimes they time out with handshake timeout errors.
The same service running on my laptop, connecting to the same external servers, has no issues and the connections are as fast as any normal SSH connection.
The deployment on CloudRun is pretty standard. I'm running it with a service account that permits access to secrets, etc. Plenty of memory allocated.
I have a VPC Connector set up, and have routed all traffic through the VPC connector, as per the instructions here: https://cloud.google.com/run/docs/configuring/static-outbound-ip
I also tried setting UseDNS no in the /etc/ssh/sshd_config file on the EC2 instance, as per some suggestions online about slow SSH logins, but that has not made a difference.
I have rebuilt and redeployed the project a few dozen times and all tests are on brand new EC2 instances.
I am attempting these connections using open source wrappers on the Node ssh2 library, node-ssh and ssh2-sftp-client.
Ideas?
Cloud Run only works properly while you have an active HTTP request.
You probably don't have an active request while this is running, and outside of an active request the CPU is throttled on Cloud Run.
The best fit for this pipeline is Cloud Workflows together with regular Compute Engine instances.
You can set up a Workflow to start a Compute Engine instance for this task and stop it once it has finished the steps.
I am the author of the article Run shell commands and orchestrate Compute Engine VMs with Cloud Workflows; it will guide you through the setup.
Execution of the Workflow can be triggered by Cloud Scheduler or by an HTTP ping.
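The moving parts of that approach (start the VM, run the task, stop the VM), expressed as plain gcloud commands for illustration; the instance name, zone, and task script are placeholders, and the linked article shows how to drive the same steps from a Workflow definition:
# run-sftp-task.sh is a hypothetical script standing in for your SSH/SFTP work
gcloud compute instances start ssh-worker --zone=us-central1-a
gcloud compute ssh ssh-worker --zone=us-central1-a --command='./run-sftp-task.sh'
gcloud compute instances stop ssh-worker --zone=us-central1-a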

Can't SSH into Google Cloud VM

I was able to successfully SSH into the Google Cloud VM I had set up yesterday, but today for some reason I can't, and I didn't mess with any of the settings, especially not the Firewall settings. It keeps giving me these errors now:
Connection via Cloud Identity-Aware Proxy Failed
Code: 4003
Reason: failed to connect to backend
You may be able to connect without using the Cloud Identity-Aware Proxy.
Then when I click on "Connect without Identity-Aware Proxy" I get the following error:
Connection Failed
We are unable to connect to the VM on port 22. Learn more about possible causes of this issue.
I don't know what happened. Yesterday it was working fine and now it's not.
First, try to disable Cloud Identity-Aware Proxy and connect to the VM instance via the web console.
After that, check the logs:
Go to Compute Engine -> VM instances -> click on NAME_OF_YOUR_VM -> on the VM instance details page, find the Logs section and click on Serial port 1 (console)
Reboot your VM instance.
Check the full boot log for any errors and/or warnings.
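The same serial port output can also be fetched from the command line, for example (the zone is a placeholder):
gcloud compute instances get-serial-port-output NAME_OF_YOUR_VM_INSTANCE --zone=us-central1-a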
If your VM instance doesn't start up, verify that your disk has a valid file system and a valid master boot record (MBR) by following the General troubleshooting documentation.
If you found errors/warnings related to disk space, you can try to resize the disk according to the documentation Resizing a zonal persistent disk; also, according to the article Recovering an inaccessible instance or a full boot disk:
If an instance is completely out of disk space or if it is not running
a Linux guest environment, then automatically resizing your root
filesystem isn't possible, even after you've increased the size of the
persistent disk that backs it. If you can't connect to your instance,
or your boot disk is full and you can't resize it, you must create a
new instance and recreate the boot disk from a snapshot to resize it.
Otherwise, try to get access to your VM instance via the serial console:
Enable the serial console connection with a gcloud command:
gcloud compute instances add-metadata NAME_OF_YOUR_VM_INSTANCE \
--metadata serial-port-enable=TRUE
or go to Compute Engine -> VM instances -> click on NAME_OF_YOUR_VM_INSTANCE -> click on EDIT -> go to section Remote access and check Enable connecting to serial ports
Create a temporary user and password to log in: shut down your VM and set a startup script by adding, in the Custom metadata section, the key startup-script with the value:
useradd --groups google_sudoers tempuser
echo "tempuser:password" | chpasswd
and then start your VM.
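If you prefer the command line for this step, one sketch: put the two lines above in a local file (say startup.sh, a placeholder name), attach it as the startup script, and start the VM:
gcloud compute instances add-metadata NAME_OF_YOUR_VM_INSTANCE \
--metadata-from-file startup-script=startup.sh
gcloud compute instances start NAME_OF_YOUR_VM_INSTANCE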
Connect to your VM via the serial port with a gcloud command:
gcloud compute connect-to-serial-port NAME_OF_YOUR_VM_INSTANCE
or go to Compute Engine -> VM instances -> click on NAME_OF_YOUR_VM_INSTANCE -> and click on Connect to serial console
Check what went wrong.
Disable access via the serial port with a gcloud command:
gcloud compute instances add-metadata NAME_OF_YOUR_VM_INSTANCE \
--metadata serial-port-enable=FALSE
or go to Compute Engine -> VM instances -> click on NAME_OF_YOUR_VM_INSTANCE -> click on EDIT -> go to the Remote access section and uncheck Enable connecting to serial ports. Keep in mind that, according to the documentation Interacting with the serial console:
Caution: The interactive serial console does not support IP-based access
restrictions such as IP whitelists. If you enable the interactive
serial console on an instance, clients can attempt to connect to that
instance from any IP address. Anybody can connect to that instance if
they know the correct SSH key, username, project ID, zone, and
instance name. Use firewall rules to control access to your network
and specific ports.
If you weren't able to connect via the serial console, try following the Troubleshooting SSH documentation, section Inspect the VM instance without shutting it down, and inspect the disk of your VM from another VM. The same way, you can transfer your data to another working VM instance.
I had the same issue while running composer update.
In my case, rebooting the VM instance solved it.
Based on these error messages, I guess that your project has Identity-Aware Proxy (IAP) enabled, which may sometimes affect the ability to SSH into an instance, depending on the configuration.
In order to rule out this, you may try the following:
Create the firewall rules to allow IAP to connect to your instances (see the example commands after this list)
Grant the necessary permissions to use IAP
Tunnel the SSH connection through IAP
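A sketch of those steps with gcloud (the rule, instance, and zone names are placeholders; 35.235.240.0/20 is the address range IAP uses for TCP forwarding):
gcloud compute firewall-rules create allow-ssh-from-iap \
    --direction=INGRESS --action=ALLOW --rules=tcp:22 \
    --source-ranges=35.235.240.0/20
gcloud compute ssh NAME_OF_YOUR_VM_INSTANCE --zone=us-central1-a --tunnel-through-iap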

How can I configure Google Cloud Platform with Cloudflare only?

I recently started using GCP, but there is one thing I can't solve.
I have: 1 VM + 1 DB instance + 1 LB. The DB instance allows only connections from the VM IP, but the VM allows traffic from all IPs (if I configure the firewall to only allow the Cloudflare and LB IPs, the website crashes and refuses connections).
Recently I was under attack. I activated the Cloudflare DDoS mode and restarted everything, and in about 6 hours the attack came back with Cloudflare still active. I saw MySQL connections jump from 20-30 to 254, and all connections came from the IP of the VM, so I think the problem is the public accessibility of the VM, but I don't know how to solve it...
If I activate my firewall rules to only allow traffic from the LB and Cloudflare, the web refuses all connections.
Any idea what I can do?
Thanks.
Cloud Support here. Unfortunately, we do not have visibility into what is installed on your instance or what software caused the issue.
Generally speaking you're responsible for investigating the source of the vulnerability and taking steps to mitigate it.
I'm writing here some hints that will help you:
Keep your firewall rules sensible; e.g., it is not good practice to have a firewall rule that allows all ingress connections on port 22 from all source IPs, for obvious reasons (see the example rule after this list).
Since you've already been rooted, change all your passwords: within the Cloud SQL instance, within the GCE instance, even within the GCP project.
It's also a good idea to check who has access to your service accounts, just in case people that aren't currently working for you or your company still have access to them.
If you're using certificates, revoke them, generate new ones, and share them securely with the minimum required number of users.
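As an illustration of the first hint, an ingress rule restricted to explicit source ranges (the rule name, target tag, and the example range 203.0.113.0/24 are placeholders; you would list the Cloudflare and load balancer ranges you actually use):
gcloud compute firewall-rules create allow-web-from-cloudflare \
    --direction=INGRESS --action=ALLOW --rules=tcp:80,tcp:443 \
    --target-tags=web --source-ranges=203.0.113.0/24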
Securing GCE instances is a shared responsibility; in general, the OWASP hardening guides are really good.
I'm quoting some info here from another StackOverflow thread that might be useful in your case:
General security advice for Google Cloud Platform instances:
Set user permissions at project level.
Connect securely to your instance.
Ensure the project firewall is not open to everyone on the internet.
Use a strong password and store passwords securely.
Ensure that all software is up to date.
Monitor project usage closely via the monitoring API to identify abnormal project usage.
To diagnose trouble with GCE instances, serial port output from the instance can be useful.
You can check the serial port output by clicking on the instance name
and then on "Serial port 1 (console)". Note that these logs are wiped
when instances are shut down and rebooted, and the log is not visible
when the instance is not started.
Stackdriver monitoring is also helpful to provide an audit trail to
diagnose problems.
You can use the Stackdriver Monitoring Console to set up alerting policies matching given conditions (under which a service is considered unhealthy) that can be set up to trigger email/SMS notifications.
This quickstart for Google Compute Engine instances can be completed in ~10 minutes and shows the convenience of monitoring instances.
Here are some hints you can check on keeping GCP projects secure.

Denial of service attack in Google Compute Engine running Ubuntu

I noticed that my VM on Google Cloud Platform is generating DoS traffic and I am wondering where that may be coming from. On further searching, I noticed a file that wasn't created by me and deleted the file.
So far, I have changed the SSH port, but I'm still getting "This project appears to be committing denial of service attacks".
I would like suggestions on what else I can do to prevent this in the future.
I'm leaving here some interesting resources you can check to secure your Google Compute Engine instance:
Ubuntu SSH Guard manpage (a minimal install sketch follows this list)
ArchLinux SSH guard guide (guides you through installation and setup)
Apache hardening guide from geekflare
PHP security cheatsheet from OWASP
MySQL security guidelines
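For instance, on Ubuntu, SSHGuard can be installed and enabled roughly like this (a sketch; package and service names can vary between releases):
sudo apt-get update
sudo apt-get install -y sshguard
sudo systemctl enable --now sshguard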
General security advice for Google Cloud Platform instances:
Set user permissions at project level.
Connect securely to your instance.
Ensure the project firewall is not open to everyone on the internet.
Use a strong password and store passwords securely.
Ensure that all software is up to date.
Monitor project usage closely via the monitoring API to identify abnormal project usage.
To diagnose trouble with GCE instances, serial port output from the instance can be useful.
You can check the serial port output by clicking on the instance name
and then on "Serial port 1 (console)". Note that these logs are wiped
when instances are shut down and rebooted, and the log is not visible
when the instance is not started.
Stackdriver monitoring is also helpful to provide an audit trail to
diagnose problems.
Here are some hints you can check on keeping GCP projects secure.