Google Compute Engine -- kernel: NMI watchdog: BUG: soft lockup

I am using GCE with nested VMs (I am installing OpenStack, so the hypervisor is KVM). Inside the GCE instance I have 4 VMs, and inside one of those VMs I have 3 more VMs.
Whenever the CPU load of the GCE VM goes above roughly 60%, the system is flooded with the messages shown below. After that, there is about a 50:50 chance that the VM either recovers or ends up corrupted.
Any ideas what is going on, and why the VM gets corrupted?
Thanks,
Message from syslogd#xxxx at Mar 23 21:00:51 ...
kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [CPU 6/KVM:10791]

Related

rsyslogd using 100% CPU Utilization on all RHEL EC2 Instances

For the past two days, rsyslogd has been using 100% CPU on all RHEL EC2 instances in my environment. I stopped and started the rsyslog service, but the issue persists.
This is the first time we have seen this kind of behaviour across multiple servers.
There is sufficient disk space and memory on all servers.
I checked the kernel logs (/var/log/kernel) and the system messages (/var/log/messages), but did not find any useful information.
The OS and kernel versions on all servers are as follows.
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
Kernel: Linux 3.10.0-1160.42.2.el7.x86_64
Can someone please suggest what to check?
The issue was caused by an expired certificate for rsyslog. I observed a lot of connection retry errors in the system messages for all clients. After renewing the certificates in rsyslog, CPU usage went back down.
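For anyone hitting the same symptom, one quick check is how many days remain before a certificate's notAfter date. The sketch below is Python and only does the date arithmetic. It assumes you have already read the notAfter string from the certificate yourself (for example the `notAfter=...` line printed by `openssl x509 -enddate -noout -in <cert>`); the function name `days_until_expiry` is made up for illustration and is not part of rsyslog.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return the number of days left before a certificate expires.

    not_after: a timestamp such as "Jan  5 09:34:43 2018 GMT", optionally
               prefixed with "notAfter=" as printed by openssl.
    now:       current time in seconds since the epoch (defaults to time.time()).
    A negative result means the certificate has already expired.
    """
    prefix = "notAfter="
    if not_after.startswith(prefix):
        not_after = not_after[len(prefix):]
    if now is None:
        now = time.time()
    # ssl.cert_time_to_seconds parses the certificate date format into epoch seconds.
    expires_at = ssl.cert_time_to_seconds(not_after.strip())
    return (expires_at - now) / 86400.0
```

A negative or very small value would be a hint to look at the certificates before digging into rsyslog itself.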

Google Cloud Compute Engine instance randomly becomes inaccessible

I have a problem with a custom (1 vCPU, 2 GB memory) compute instance running Apache and three Python scripts, which essentially wait for messages, run some SQL queries, and create reports. Once in a while, the entire instance becomes unresponsive: Apache, SSH, and even the serial console stop responding. It looks like the entire instance is frozen. The only fix is to log in to my Google Cloud account and restart the instance.
I have checked the disk space, because one of Google's pages suggests that a full disk can cause an instance to freeze, but I still have 6 GB of free disk space, so that shouldn't be the issue.
I have added logs from "Serial port 1 (console)" in case it might help with diagnosing the issue.
Could someone please assist me with finding out why this is happening? Thank you in advance.
Serial console logs output:
https://pastebin.com/raw/Z9gADmCn
Nov 18 19:14:24 web-server systemd[1]: Stopping System Logging Service...
Nov 18 19:14:24 web-server systemd[1]: Stopped System Logging Service.
Nov 18 19:14:24 web-server systemd[1]: Starting System Logging Service...
Nov 18 19:14:24 web-server systemd[1]: Started System Logging Service.
Nov 18 19:14:25 web-server dhclient[558]: bound to 10.166.0.10 -- renewal in 1434 seconds.
Nov 18 19:14:25 web-server ifup[516]: bound to 10.166.0.10 -- renewal in 1434 seconds.
This question would be better asked on Server Fault, where it will get attention from sysadmins rather than the developers here.
Before you use the suggestion above in Kolban's comment, I'd recommend checking some simple things.
1- Check whether the instance was under maintenance (the instance details page shows your maintenance window).
2- Also on the instance details page, check CPU and memory utilization to see whether there was a spike at the time of the freeze. That should point you in the right direction.
3- Check the system and application logs: I'd recommend /var/log/syslog and, if applicable, a web server log such as /var/log/nginx/error.log.
I faced the same issue on one of my Google Compute Engine instances: it would freeze some time after starting.
When I reset the instance, it started working fine again.
The cause I found was that the instance had too little CPU/RAM for the processes running on it. When I changed the machine from 1 CPU/3.75 GB RAM to 4 CPUs/16 GB RAM, it worked fine permanently.
At the core of this issue, the machine had been created from a disk snapshot in which applications such as Tomcat and Postgres were configured for high CPU and memory usage. So it looks like once the machine became fully operational, it ran short of memory for the required processes, which led to slowness and then to the instance freezing.
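If you suspect the same cause, it may help to record how much memory is still available in the period before a freeze. Here is a minimal Python sketch, assuming a Linux guest whose /proc/meminfo exposes the MemTotal and MemAvailable fields. The function names are my own, and the 10% threshold is an arbitrary example value, not a recommendation.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def available_fraction(meminfo):
    """Fraction of total memory the kernel reports as available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def is_low_on_memory(meminfo, threshold=0.10):
    """True when available memory drops below the (assumed) threshold."""
    return available_fraction(meminfo) < threshold
```

On the instance itself you would feed it the contents of /proc/meminfo, for example `parse_meminfo(open("/proc/meminfo").read())`, and log the result periodically so there is something to look at after the next freeze.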

Increase shutdown time of Windows instance with gcloud CLI

I would like to run a shutdown script that may wait up to 5 minutes before actually shutting down a Windows instance.
I know how to run a shutdown script, but not how to prevent GCP from killing the instance after a certain time. In this documentation, it is mentioned that there is a limit of 90 seconds (not guaranteed) before the instance is completely shut down by GCP.
Is it possible to increase that limit?
Unfortunately, the 90-second limit is not going to change. There are several feature requests for this, but they are unlikely to be implemented soon.

Google Compute Engine - Low on Resource Utilisation

I use a VM Instance provided by Google Compute Engine.
Machine Type: n1-standard-8 (8 vCPUs, 30 GB memory).
When I check the CPU utilisation, it never goes above 12%. I use my VM for running Jupyter Notebook. I have tried loading dataframes that took up 7.5 GiB (and simple operations on the data take a long time), but the utilisation stays the same.
How can I get CPU utilisation close to 100%?
Or does my program use only 1 of the 8 CPUs, i.e. (1/8)*100 = 12.5%?
You can run the stress command to impose a configurable amount of CPU, memory, I/O, and disk stress on the system.
Example to stress 4 cores for 90 seconds:
stress --cpu 4 --timeout 90
While it runs, open the Google Cloud Console in your browser to check the CPU usage of your VM, or open a new SSH connection to the VM and run the top command to see the CPU status.
If CPU usage reaches over 99% while the stress command runs, your instance is working fine, and you need to look at your application to find out why it is restricted to about 12% CPU.
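On the 12.5% figure: a plain Python process, which is what a Jupyter kernel is, normally runs its Python code on a single core, so one busy vCPU out of 8 is consistent with what you see. If the work can be split into independent chunks, one option is to spread it across several processes. Below is a rough sketch using the standard library's multiprocessing.Pool. The prime-counting workload and all function names are placeholders I made up for illustration, not your actual dataframe code.

```python
import os
from multiprocessing import Pool


def count_primes(bounds):
    """CPU-bound placeholder task: count primes in the half-open range [lo, hi)."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        d = 2
        is_prime = True
        while d * d <= n:
            if n % d == 0:
                is_prime = False
                break
            d += 1
        if is_prime:
            count += 1
    return count


def split_range(limit, parts):
    """Split [0, limit) into at most `parts` contiguous (lo, hi) chunks."""
    if limit <= 0 or parts <= 0:
        return []
    step = -(-limit // parts)  # ceiling division
    return [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]


def parallel_count_primes(limit, workers=None):
    """Run count_primes over [0, limit) with one chunk per worker process."""
    workers = workers or os.cpu_count() or 1
    with Pool(processes=workers) as pool:
        return sum(pool.map(count_primes, split_range(limit, workers)))
```

When running this as a script, call `parallel_count_primes(...)` from under an `if __name__ == "__main__":` guard. Whether the same approach helps real pandas code depends on whether the work can actually be divided into independent pieces.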

Can't rerun meteor leaderboard on AWS EC2 micro T1 instance after failing keepalive

I'm unable to run the Meteor leaderboard demo after a failed keepalive error on an AWS EC2 t1.micro instance. If I start from a freshly booted Amazon Machine Image (AMI), I'm able to run the leaderboard demo at localhost:3000 from Firefox when I'm connected with a VNC client (TightVNC Viewer). It runs very, very slowly, but it runs.
If I fail to interact with it soon enough, however, I get these messages:
I2051-00:03:03.173(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Meteor server restarted
From that point forward, everything on that instance runs at a glacial pace. Switching back to the Firefox window takes 3 minutes. When I try to connect to localhost:3000 in Firefox, I usually get a message about a script no longer running, and eventually the terminal window adds this to the output above:
I2051-00:06:02.443(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Meteor server restarted
I2051-00:08:17.227(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Your application is crashing. Waiting for file change.
Can anyone translate for me what is happening?
I'm wondering whether the t1.micro instance I'm running is simply too underpowered, or whether Meteor is not shutting down properly, leaving an instance of MongoDB running while it tries to launch another.
I'm using the Amazon Machine Image ubuntu-precise-12.04-amd64-server-20130411.1 (ami-70f96e40), which says this about its configuration:
Size: t1.micro
ECUs: up to 2
vCPUs: 1
Memory (GiB): 0.613
Instance Storage (GiB): EBS only
EBS-Optimized Available: -
Network Performance: Very Low
Micro instances
Micro instances are a low-cost instance option, providing a small amount of CPU resources. They are suited for lower throughput applications, and websites that require additional compute cycles periodically, but are not appropriate for applications that require sustained CPU performance. Popular uses for micro instances include low traffic websites or blogs, small administrative applications, bastion hosts, and free trials to explore EC2 functionality.
If my guess is right, can anyone suggest an AMI suitable for Meteor development?
Thanks
check this answer
Try removing autopublish: meteor remove autopublish
How are you running the app on EC2? I have been able to run apps on a micro instance, so I don't see why this should be an issue.
If you are running it with 'meteor' as you would locally, that's probably the issue. You get much better performance when running it as a Node app; this typically isn't a problem when developing locally, but it may be too much for an EC2 micro.
What you want to do is run 'meteor bundle example.tgz', upload that to the server, and run it as a Node app.
Here is a guide that I remember using a while ago to get it done on ec2:
http://julien-c.fr/2012/10/meteor-amazon-ec2/
You shouldn't need to use VNC either; you can access the app from a browser on your own computer using the public address assigned to your instance.
If you get a node fibers error message, which is pretty common, cd into bundle/program/server, then run 'npm uninstall fibers' followed by 'npm install fibers'.