VM connection problems - fluctuating/high ping - google-cloud-platform

We are having connection problems on a VM running OpenVPN. When pinging the IPs that do manage to connect, the latency varies from, for example, 100 ms to 6000 ms. When there are no problems the ping is normal.
This problem occurred on 04/13/2021 at approximately 15:40h (Spain time) and lasted about 15-20 minutes. The same problem also occurred on 1/4/2021 in the morning and lasted several hours.
Has anyone else had this same problem or a similar problem? Is it normal that Google does not give information about these incidents?

You can check the status of Google Cloud services on the Google Cloud Status Dashboard.
To check your current latency to GCP regions, use this tool - link.
From what I can see, there were no disruptions on the 13th.
I would recommend setting up monitoring or using tools like traceroute to locate the issue.
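For example, a small probe that logs connection latency at a fixed interval can show whether the spikes line up with particular times. Below is a rough Python sketch, assuming you substitute a host and TCP port you normally reach over the VPN (both are placeholders); it measures how long a TCP handshake takes, which is usually enough to spot swings like the 100 ms to 6000 ms described above.

    import socket
    import time
    from datetime import datetime

    HOST = "10.8.0.1"   # placeholder: an IP you reach over the VPN
    PORT = 443          # placeholder: any TCP port that answers
    INTERVAL = 5        # seconds between probes

    while True:
        start = time.monotonic()
        try:
            # Measure the TCP handshake time as a rough latency proxy.
            with socket.create_connection((HOST, PORT), timeout=10):
                elapsed_ms = (time.monotonic() - start) * 1000
                print(f"{datetime.now().isoformat()} connect {elapsed_ms:.0f} ms")
        except OSError as exc:
            print(f"{datetime.now().isoformat()} FAILED: {exc}")
        time.sleep(INTERVAL)

Running traceroute (or mtr) during one of the spikes and comparing it with a run taken while latency is normal usually shows which hop introduces the delay.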

Related

GCP VM shut itself down, won't restart

Bit panicky here because I can't troubleshoot the error on a production site and it appears to be completely down.
GCP - Compute Engine VM - N1-standard on the US-West-3C zone running a Bitnami Multisite Wordpress deployment
About 2 hours ago my VM stopped responding (as far as I could tell with monitoring tools) and I was unable to SSH into it or connect in any way. I've experienced this occasionally in the past, so my process was to grab a snapshot and restart the VM. I did manage to get the snapshot; however, the VM then stopped by itself and I'm now stuck, unable to restart it.
The error I'm getting is:
Failed to start name-of-vm: A n1-standard-1 VM instance is currently unavailable in the us-west3-c zone. Alternatively, you can try your request again with a different VM hardware configuration or at a later time. For more information, see the troubleshooting documentation.
I tried changing my configuration (it used to be a custom VM) but that didn't do anything.
Searching for similar errors, I've found threads about certain zones running out of resources, but as far as I can tell this error doesn't specifically say 'out of resources', and the status of the us-west3-c zone is fine. I can't imagine it would run out in a way where it can't even start a measly n1 VM.
Unfortunately, due to some mismanagement, this project isn't under our Google Workspace/Organization, so I can't request technical support for it.
Any assistance or help pointing to some resources would be greatly appreciated.
"Currently unavailable in a specific zone" generally also means that the zone has run out of resources for that particular machine type.
You can try restoring the snapshot you created to a new instance with a different machine type, such as an e2-standard or n2-standard configuration.
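If you prefer to script the restore, here is a rough sketch using the google-cloud-compute Python client; the project, zone, snapshot and instance names are placeholders, and the machine type is just one e2-standard example.

    from google.cloud import compute_v1

    # Placeholders: substitute your own project, zone, snapshot and name.
    PROJECT = "my-project"
    ZONE = "us-west3-c"
    SNAPSHOT = "my-boot-snapshot"
    NEW_NAME = "wordpress-restored"
    MACHINE_TYPE = f"zones/{ZONE}/machineTypes/e2-standard-2"

    # Boot disk created from the snapshot taken before the VM went down.
    boot_disk = compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            source_snapshot=f"projects/{PROJECT}/global/snapshots/{SNAPSHOT}"
        ),
    )

    instance = compute_v1.Instance(
        name=NEW_NAME,
        machine_type=MACHINE_TYPE,
        disks=[boot_disk],
        network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
    )

    client = compute_v1.InstancesClient()
    operation = client.insert(project=PROJECT, zone=ZONE, instance_resource=instance)
    operation.result()  # wait for the create operation to finish
    print(f"Created {NEW_NAME} from snapshot {SNAPSHOT}")

You can also try a different zone in the same region if the new machine type turns out to be unavailable too.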

How long does it take for AWS Amazon Inspector to complete a full EC2 Scan?

I enabled AWS Amazon Inspector (2) for a single EC2 instance that I have. It's an Ubuntu box with PHP and Apache, nothing special, and the status has shown Scanning for the last 3 hours.
Looking at htop on this machine, I can see that /snap/amazon-ssm-agent/####/amazon-ssm-agent is running and that several /snap/amazon-ssm-agent/####/ssm-agent-worker processes are running. Still... 3 hours have passed, and I have no results.
Is it working? Isn't it working? Is there a more verbose status?
Also, if someone has experience with this, can you share the average time you waited for results?
I've been in a similar situation - I run Inspector scans on EC2 as well as ECR. ECR scans were pretty quick, but for EC2 it took about 4.5 hours to reach the INITIAL_SCAN_COMPLETE state. It's concerning that it takes this long, but I noticed it was doing about 470 vulnerability checks.
Here is the document that contains the status information:
https://docs.aws.amazon.com/inspector/latest/user/assessing-coverage.html
Scanning – Amazon Inspector is continuously monitoring and scanning the instance.
It won't just scan once and stop; it continuously monitors the instance for future vulnerabilities too. Hence the status shows Scanning.
You need to go to the Findings tab to see what's going on with the vulnerabilities: Findings -> By instance -> select your instance to see the findings related to it. Hope that helps.
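If you would rather check from code than click through the console, here is a rough boto3 sketch (the instance ID is a placeholder) that prints the scan coverage status and whatever findings exist so far:

    import boto3

    # Placeholder instance ID; substitute your own.
    INSTANCE_ID = "i-0123456789abcdef0"

    inspector = boto3.client("inspector2")

    # Scan coverage status for this instance (ACTIVE once scanning is working).
    coverage = inspector.list_coverage(
        filterCriteria={
            "resourceId": [{"comparison": "EQUALS", "value": INSTANCE_ID}]
        }
    )
    for resource in coverage.get("coveredResources", []):
        print(resource["resourceId"], resource["scanStatus"]["statusCode"])

    # Findings recorded for the instance so far.
    findings = inspector.list_findings(
        filterCriteria={
            "resourceId": [{"comparison": "EQUALS", "value": INSTANCE_ID}]
        }
    )
    for finding in findings.get("findings", []):
        print(finding["severity"], finding["title"])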

EC2 Instance Status Check Failed

I am currently running a process on an ec2 server that needs to run consistently in the background. I tried to login to the server and I continue to get a Network Error: Connection timed out prompt. When I check the instance, I get the following message:
Instance reachability check failed at February 22, 2020 at 11:15:00 PM UTC-5 (1 days, 13 hours and 34 minutes ago)
To troubleshoot, I have tried rebooting the server but that did not correct the problem. How do I correct this and also prevent it from happening again?
An instance status check failure indicates a problem with the instance, such as:
Failure to boot the operating system
Failure to mount volumes correctly
File system issues
Incompatible drivers
Kernel panic
Severe memory pressures
You can check the following for troubleshooting:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstancesStopping.html
For future reporting and auto recovery you can create a CloudWatch Alarm (a rough sketch follows at the end of this answer).
For the second part: there is nothing you can do to completely stop it from occurring, but for uptime and availability, yes, you can create another EC2 instance and add an ALB on top of both instances, which checks instance health, so that your users/customers/service stay available during the recovery time (served from the second instance). You can add as many instances as you want for high availability (obviously it involves cost).
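Here is a rough boto3 sketch of the CloudWatch alarm mentioned above; the region and instance ID are placeholders, and the period/threshold values are just examples. The recover action applies to system status check failures; for instance status check failures (like the reachability check in the question), the same pattern with the StatusCheckFailed_Instance metric and the arn:aws:automate:REGION:ec2:reboot action is the usual equivalent.

    import boto3

    # Placeholders: substitute your region and instance ID.
    REGION = "us-east-1"
    INSTANCE_ID = "i-0123456789abcdef0"

    cloudwatch = boto3.client("cloudwatch", region_name=REGION)

    # Automatically recover the instance when the system status check fails.
    cloudwatch.put_metric_alarm(
        AlarmName=f"recover-{INSTANCE_ID}",
        Namespace="AWS/EC2",
        MetricName="StatusCheckFailed_System",
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        Statistic="Maximum",
        Period=60,                # example values; tune to taste
        EvaluationPeriods=2,
        Threshold=1.0,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=[f"arn:aws:automate:{REGION}:ec2:recover"],
    )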
I've gone through the same problem. Looking at the EC2 dashboard I could see that something wasn't right with the instance, but for me rebooting and waiting 2-3 minutes solved it, and I was then able to SSH into the instance just fine.
If it becomes a recurring problem, I'll follow through with Jeremy Thompson's advice:
... put the EC2's in an Auto Scaling Group. The ALB does a health check, and if it fails it will no longer route traffic to that EC2; the ASG will then send a status check and take the unresponsive server out of rotation.
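If you do end up behind an ALB, a minimal boto3 sketch like this switches an existing Auto Scaling Group (the group name is a placeholder, and the group is assumed to already be registered with an ALB target group) over to ELB health checks, so instances that fail the ALB check are replaced automatically:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Placeholder group name; the ASG is assumed to exist already and to be
    # attached to an ALB target group.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        HealthCheckType="ELB",        # replace instances that fail the ALB health check
        HealthCheckGracePeriod=300,   # seconds to wait after launch before checking
    )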

Amazon EC2 Servers getting frozen sporadically

I've been working with Amazon EC2 servers for 3+ years and I've noticed a recurring behaviour: some servers freeze sporadically (between 1 and 5 times per year).
When this occurs, I can't connect to the server (tried HTTP, MySQL and SSH connections) until the server is restarted.
The server goes back to work after a restart.
Sometimes the server stays online for 6+ months, sometimes it freezes about a month after a restart.
All the servers where I noticed this behavior were micro instances (North Virginia and Sao Paulo).
The servers have an ordinary Apache 2, MySQL 5, PHP 7 environment, with Ubuntu 16 or 18. The PHP/MySQL web application is not CPU intensive and is not accessed by more than 30 users/hour.
The same environment and application on DigitalOcean servers does NOT reproduce the behaviour (I have two DigitalOcean servers running uninterrupted for 2+ years).
I like Amazon EC2 servers, mainly because Amazon has a lot of useful additional services (like SES), but this behaviour is really frustrating. Sometimes I get customer calls complaining about systems being down, and I just need an instance restart to solve the problem.
Does anybody have a tip about solving this problem?
UPDATE 1
They are t2.micro instances (1 GB RAM, 1 vCPU).
MySQL SHOW GLOBAL VARIABLES: pastebin.com/m65ieAAb
UPDATE 2
There is a CPU utilization peak in the logs near the time the server went down, at 3 AM. At that time a daily crontab task makes a database backup. But considering this task runs every day, why would it only sometimes freeze the server?
I have not seen this exact issue, but on any cloud platform I assume any instance can fail at any time, so we design for failure. For example, we have autoscaling on all customer-facing instances; any time an instance fails, it is automatically replaced.
If a customer is calling to tell you a server is down, you may need to consider more automated methods of monitoring instance health and taking automated action to recover the instance.
CloudWatch also has server recovery actions available that can be triggered if certain metric thresholds are reached.
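As a starting point for that kind of monitoring, here is a rough boto3 sketch (the instance ID and time window are placeholders) that pulls CPU utilization around the 3 AM backup window, so you can see whether the freezes correlate with the cron job:

    import boto3
    from datetime import datetime, timedelta, timezone

    # Placeholders: substitute your instance ID and the night you want to inspect.
    INSTANCE_ID = "i-0123456789abcdef0"
    end = datetime(2021, 1, 15, 4, 0, tzinfo=timezone.utc)
    start = end - timedelta(hours=2)

    cloudwatch = boto3.client("cloudwatch")

    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=start,
        EndTime=end,
        Period=300,                       # 5-minute datapoints
        Statistics=["Average", "Maximum"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Average"], point["Maximum"])

On a t2.micro it can also be worth plotting the CPUCreditBalance metric over the same window, since burstable instances are throttled when their CPU credits run out.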

GCP instance: high ping latency

Since 2019-12-12 18:00 GMT+8, ping latency to our instance in asia-east1-b has been higher than to instances in the other asia-east1 zones:
ping asia-east1-c instance: latency 7 ms
ping asia-east1-b instance: latency 283 ms
How do I contact GCP technical support to resolve this problem?
If you notice a temporary problem related to Cloud networking, first check the Google Cloud Status Dashboard.
They might have already acknowledged the problem and be working on fixing it; however, you can always report problems in the services/infrastructure by contacting support. In your GCP project, go to Help (the question mark icon at the top) > Contact Support.
You can find more information inside that page.