Google Cloud Compute Engine randomly becomes inaccessible - google-cloud-platform

I have a problem with a custom (1 vCPU, 2 GB memory) Compute Engine instance running Apache and 3 Python scripts, which essentially wait for messages, run some SQL queries, and create reports. Once in a while, the entire instance becomes unresponsive: Apache, SSH, and even the serial console stop responding. It looks like the entire instance is frozen. The only fix is to log in to my Google Cloud account and restart the instance.
I have checked the disk space, because one of Google's pages suggests that a full disk can freeze an instance, but I still have 6 GB of disk space available, so that shouldn't be the issue.
I have added logs from "Serial port 1 (console)" in case it might help with diagnosing the issue.
Could someone please assist me with finding out why this is happening? Thank you in advance.
Serial console logs output:
https://pastebin.com/raw/Z9gADmCn
Nov 18 19:14:24 web-server systemd[1]: Stopping System Logging Service...
Nov 18 19:14:24 web-server systemd[1]: Stopped System Logging Service.
Nov 18 19:14:24 web-server systemd[1]: Starting System Logging Service...
Nov 18 19:14:24 web-server systemd[1]: Started System Logging Service.
Nov 18 19:14:25 web-server dhclient[558]: bound to 10.166.0.10 -- renewal in 1434 seconds.
Nov 18 19:14:25 web-server ifup[516]: bound to 10.166.0.10 -- renewal in 1434 seconds.

This question would be better asked on Server Fault, where it will get attention from sysadmins rather than developers.
Before you use the suggestion in Kolban's comment, I'd recommend checking some simple things.
1- Check whether the instance was under maintenance (the instance details page shows your maintenance window).
2- Also on the instance details page, check CPU and memory utilization and see whether there was a spike at the time of the freeze. That should point you in the right direction.
3- Check system and application logs: I'd recommend checking /var/log/syslog and, if applicable, /var/log/nginx/error.log for example.
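If the console graphs are too coarse to show what happened just before a freeze, a small logger running on the instance can help with step 2. The following is a minimal sketch, assuming a Linux guest with /proc/meminfo; the log path, the 30-second interval, and all function names are hypothetical choices, not anything from the question. Each line is synced to disk so that the last samples survive a hard reset.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def format_line(stamp, load1, mem):
    """Build one log line from a timestamp, 1-minute load, and parsed meminfo."""
    return "%s load1=%.2f mem_available_kb=%s swap_free_kb=%s" % (
        stamp, load1, mem.get("MemAvailable"), mem.get("SwapFree"))


def record_once(log_path):
    """Append one sample and force it to disk before returning."""
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    load1 = os.getloadavg()[0]
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write(format_line(stamp, load1, mem) + "\n")
        log.flush()
        os.fsync(log.fileno())


def watch(log_path="/var/tmp/resource-watch.log", interval_seconds=30):
    """Sample forever; the path and interval are hypothetical defaults."""
    while True:
        record_once(log_path)
        time.sleep(interval_seconds)
```

Calling watch() from a background service and reading the tail of the log after the next restart should show whether available memory or load was trending badly before the instance stopped responding.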

I faced the same issue on one of my Google Compute Engine instances: it would freeze some time after starting.
When I reset the instance, it started working fine again.
The cause I found was that the instance had too little CPU/RAM for the processes running on it. When I changed it from 1 CPU/3.75 GB RAM to 4 CPU/16 GB RAM, it started working fine permanently.
At the core of the issue, the machine was created from a disk snapshot in which applications such as Tomcat and Postgres were configured for high CPU/memory. So it looks like once the machine became fully operational, it ran short of memory for the required processes, which led to slowness and then to the freeze.

Related

Problems connecting ssh to GCP's compute engine

I paused the instance and changed the CPU to improve the performance of my Compute Engine VM (Ubuntu 18.04).
However, after starting it again with the new setting, SSH connections are not possible at all, from either the console or VS Code.
When an SSH connection is attempted, the GCP serial port log shows the following.
May 25 02:07:52 nt-ddp-jpc GCEGuestAgent[1244]: 2021-05-25T02:07:52.4696Z GCEGuestAgent Info: Adding existing user root to google-sudoers group.
May 25 02:07:52 nt-ddp-jpc GCEGuestAgent[1244]: 2021-05-25T02:07:52.4730Z GCEGuestAgent Error non_windows_accounts.go:152: gpasswd: /etc/group.1540: No space left on device# 012gpasswd: cannot lock /etc/group; try again later.#012.
Also, when I try SSH in VS Code I get a permission denied error.
What is the exact cause and resolution of the problem?
Thanks, as always, for your help.
The cause is the "No space left on device" error in the log: the boot disk is full.
To solve this, as John commented, you can follow the official GCP guide for increasing the space on a full boot disk. You will be able to log in through SSH again after increasing the size of the boot disk.
As a best practice, create a snapshot first, and keep in mind that increasing the boot disk size and/or saving a snapshot could slightly increase the cost of your project.
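To catch a filling boot disk before it locks out SSH, a periodic check along these lines may help. This is only a sketch for a Unix-like guest: the 90% threshold and the function names are assumptions. It reports inode usage as well as byte usage, because either one running out produces "No space left on device".

```python
import os
import shutil


def percent_used(used, total):
    """Return used/total as a percentage, treating an empty total as 0%."""
    return 0.0 if total == 0 else 100.0 * used / total


def check_filesystem(path="/", warn_percent=90.0):
    """Report byte and inode usage for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    stats = os.statvfs(path)  # Unix only
    inodes_used = stats.f_files - stats.f_ffree
    bytes_pct = percent_used(usage.used, usage.total)
    inodes_pct = percent_used(inodes_used, stats.f_files)
    return {
        "bytes_percent": bytes_pct,
        "inodes_percent": inodes_pct,
        # warn_percent is a hypothetical threshold, not a GCP recommendation
        "nearly_full": bytes_pct >= warn_percent or inodes_pct >= warn_percent,
    }


if __name__ == "__main__":
    print(check_filesystem("/"))
```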

Increase shutdown time of windows instance with gcloud cli

I would like to run a shutdown script that may wait up to 5 minutes before actually shutting down a Windows instance.
I know how to run a shutdown script, but not how to prevent GCP from killing the instance after a certain time. The documentation mentions a (not guaranteed) limit of 90 seconds before GCP shuts the instance down completely.
Is it possible to increase that limit?
Unfortunately, the 90-second limit cannot be changed. There are several feature requests for this, but it is unlikely that they will be implemented soon.
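Since the limit cannot be raised, one workaround is to make the shutdown work fit inside it. The sketch below is not a GCP API; it is a hypothetical helper that runs cleanup steps in priority order and skips any step whose estimated duration no longer fits in the remaining budget. The task names, the estimates, and the 60-second budget (chosen to leave a margin under 90 seconds) are all assumptions.

```python
import time


def run_within_budget(tasks, budget_seconds, clock=time.monotonic):
    """Run (name, estimated_seconds, action) tasks in order, skipping any
    task whose estimate does not fit in the time left before the deadline."""
    deadline = clock() + budget_seconds
    completed, skipped = [], []
    for name, estimated_seconds, action in tasks:
        if clock() + estimated_seconds > deadline:
            skipped.append(name)
            continue
        action()
        completed.append(name)
    return completed, skipped


if __name__ == "__main__":
    # Hypothetical cleanup steps, most important first.
    steps = [
        ("flush-state", 10, lambda: time.sleep(0.1)),
        ("upload-logs", 100, lambda: time.sleep(0.1)),
        ("notify", 5, lambda: time.sleep(0.1)),
    ]
    print(run_within_budget(steps, budget_seconds=60))
```

Work that genuinely needs 5 minutes would have to happen before the shutdown is requested, for example by having the workload finish and then stop the instance itself.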

Google Cloud Engine -- kernel:NMI watchdog: BUG: soft lockup

I am using GCE with hierarchical (nested) VMs (installing OpenStack, so it's KVM). Inside the GCE instance I have 4 VMs, and in one of those VMs I have 3 more VMs.
Whenever the GCE VM's CPU load goes above, say, 60%, the system gets flooded with these nasty messages. Then there is a 50:50 chance of whether the VM recovers or gets corrupted.
Any ideas what's going on, and why the VM gets corrupted?
Thanks,
Message from syslogd#xxxx at Mar 23 21:00:51 ...
kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [CPU 6/KVM:10791]
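To see whether the lockups cluster on particular CPUs or grow longer under load, the messages can be tallied from a saved log. The following is a small sketch that assumes the message format shown above; the function name and the sample input are illustrative only.

```python
import re
from collections import Counter

# Matches the part of the kernel message that names the CPU and the stall time.
LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s")


def summarize_lockups(lines):
    """Return (count per CPU, longest stall in seconds per CPU)."""
    counts = Counter()
    longest = {}
    for line in lines:
        match = LOCKUP.search(line)
        if match:
            cpu, seconds = int(match.group(1)), int(match.group(2))
            counts[cpu] += 1
            longest[cpu] = max(longest.get(cpu, 0), seconds)
    return counts, longest


if __name__ == "__main__":
    sample = [
        "kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [CPU 6/KVM:10791]",
    ]
    print(summarize_lockups(sample))
```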

Can't rerun meteor leaderboard on AWS EC2 micro T1 instance after failing keepalive

I'm unable to run the Meteor leaderboard demo after a failed keepalive error on an AWS EC2 t1.micro instance. If I start from a freshly booted Amazon Machine Image (AMI), I'm able to run the leaderboard demo at localhost:3000 from Firefox when I'm connected with a VNC client (TightVNC Viewer). It runs very, very slowly, but it runs.
If I fail to interact with it soon enough, however, I get these messages:
I2051-00:03:03.173(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Meteor server restarted
From that point forward, everything on that instance runs at a glacial pace. Switching back to the Firefox window takes 3 minutes. When I try to connect to localhost:3000 in Firefox, I usually get a message about a script no longer running, and eventually the terminal window adds this to the output above:
I2051-00:06:02.443(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Meteor server restarted
I2051-00:08:17.227(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Your application is crashing. Waiting for file change.
Can anyone translate for me what is happening?
I'm wondering whether the t1.micro instance I'm running is just too underpowered, or whether Meteor is not shutting down properly, leaving one instance of MongoDB running while it tries to launch another.
I'm using Amazon Machine Image ubuntu-precise-12.04-amd64-server-20130411.1 (ami-70f96e40), which says this about its configuration:
Size: t1.micro
ECUs: up to 2
vCPUs: 1
Memory (GiB): 0.613
Instance Storage (GiB): EBS only
EBS-Optimized Available: -
Netw. Performance: -Very Low
Micro instances
Micro instances are a low-cost instance option, providing a small amount of CPU resources. They are suited for lower throughput applications, and websites that require additional compute cycles periodically, but are not appropriate for applications that require sustained CPU performance. Popular uses for micro instances include low traffic websites or blogs, small administrative applications, bastion hosts, and free trials to explore EC2 functionality.
If my guess is right, can anyone suggest an AMI suitable for Meteor development?
Thanks
check this answer
Try removing autopublish: meteor remove autopublish
How are you running the app on EC2? I have been able to run apps on a micro instance, so I don't see why this should be an issue.
If you are running it with 'meteor' as you would locally, that's probably the issue. You get far better performance running it as a Node app. This typically isn't a concern when developing locally, but plain 'meteor' may be too much for an EC2 micro.
What you want to do is run 'meteor bundle example.tgz', upload that to the server, and run it as a Node app.
Here is a guide that I remember using a while ago to get it done on ec2:
http://julien-c.fr/2012/10/meteor-amazon-ec2/
You shouldn't need to use VNC either; you can access the app from a browser on your own computer using the public address assigned to your instance.
If you get a node fibers error message, which is pretty common, cd into bundle/program/server and run 'npm uninstall fibers' followed by 'npm install fibers'.

Jrun ColdFusion service intermittently fails to start

We occasionally have a problem where we attempt to start the JRun service and it fails with the following errors:
error JRun Naming Service unable to start on port 2902
java.net.BindException: Port in use by another service or process: 2902
info No JDBC data sources have been configured for this server (see jrun-resources.xml)
error java.net.BindException: Port in use by another service or process: 8300
We then have to reboot the machine, and JRun comes up with no problem. This is very intermittent: it happens perhaps one out of every 10 times we restart the JRun services.
I saw another reference on Stack Overflow saying that if a Windows service takes longer than 30 seconds to start, Windows shuts down the startup process. Perhaps that is the issue here? The logs do indicate that these errors are thrown about 37+ seconds after the restart command is issued.
We are on a 64-bit platform on Windows Server 2008.
Thanks!
We've been experiencing a similar problem on some of our servers. Unfortunately, netstat never indicated any sort of actual port conflict for us. My suspicion is that it's related to our recent deployment of a ColdFusion "cumulative hotfix" to our servers. We use the multi-server edition of CF 8.0.1 enterprise with a large number of instances on each machine -- each with its own JVM and its own distinct set of ports. Each CF instance is attached to its own IIS website and runs as its own Windows Service.
Within the past few weeks, we started getting similar "port in use" exceptions on startup, on our 32-bit machines as well as our 64-bit machines, all of which are running Windows Server 2003. I found several possible culprits and tried the following:
In jrun-jms.xml for each CF instance, there's an entry for the RMI transport layer that reads <port>0</port> -- which, according to the JRun documentation, means "choose a random port." I made that non-random and distinct per instance (in the 2600-2650 range) and restarted each instance. Things improved temporarily, perhaps coincidentally.
In the same file, under the entry for the TCPIP transport layer, every instance defaulted to <port>2522</port> -- so I changed those to distinct ports per instance in the 2500-2550 range and restarted each instance. That didn't seem to help at all.
I tried researching whether ports in the 2500-3000 range might be used for any other purpose, and I couldn't find anything obvious, and besides, netstat wasn't telling me that any of my choices were in use.
I found something online about Windows designating ports from 1024 to 5000 as the "dynamic port" range, so I added 10000 to the port numbers I had set in jrun-jms.xml and restarted each instance again. Still didn't help.
I tried changing the port in jndi.properties, also by adding 10000 to the port numbers. Unfortunately this meant wiping out all my wsconfig connections to IIS and creating them again from scratch. I had to edit wsconfig_jvm.config as well, adding -DWSConfig.PortScanStartPort=12900 to java.args, so it could detect my CF instances. (By default it only scans ports 2900-3000. See bpurcell.org for details. It's an old post but still relevant.) So far so good!
My best guess is that Adobe (or MS Windows) changed the way some of its code grabs "random" ports. But all I know for sure so far is that the steps outlined above appear to have fixed the problem.
Have you verified that the services are in fact stopping? Task manager should show no instances of jrun.exe. You can also check to see what is bound to that port by opening a command window and running
netstat -a -b
This will list all your open ports, plus what program is using them. You can also use
netstat -a -o
This does the same thing as the above, but lists the process ID instead of the program name. You can then cross-reference those with Task Manager. You'll need to enable showing the PIDs in Task Manager by going to View->Select Columns and making sure PID is checked. My guess would be that the jrun processes are not shutting down in a timely fashion.
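If the old processes are slow to release their ports, a restart script could wait for the ports to become free before starting the service again. The following is a rough sketch: the port numbers 2902 and 8300 come from the error messages in the question, while the function names, the timeout, and the polling interval are assumptions. It simply tries to bind each port and treats a bind failure as "still in use".

```python
import socket
import time


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def busy_ports(ports, host="0.0.0.0"):
    """Return the subset of ports that cannot currently be bound."""
    return [port for port in ports if not port_is_free(port, host)]


def wait_until_free(ports, timeout_seconds=60.0, poll_seconds=1.0, host="0.0.0.0"):
    """Poll until every port can be bound or the timeout expires.
    Returns the ports that are still busy (empty list on success)."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        busy = busy_ports(ports, host)
        if not busy or time.monotonic() >= deadline:
            return busy
        time.sleep(poll_seconds)


if __name__ == "__main__":
    still_busy = wait_until_free([2902, 8300])
    print("ports still in use:", still_busy)
```

An empty result means it should be safe to start the service; a non-empty result names the ports to look up in the netstat output.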