Unexplainable changes to Ignite cluster membership - VMware

I am running a 12-node Ignite cluster, with each JVM on its own VMware node. I am using ZooKeeper to keep these Ignite nodes in sync, with TCP discovery. I have been seeing a lot of node failures in the ZooKeeper logs.
Although the Java processes are still running, I don't know why some Ignite nodes leave the cluster with "node failed" kinds of errors. VMware uses vMotion to do something they call "migration"; I am assuming that is some kind of filesystem sync process between VMware nodes.
I am also seeing fairly frequent "dumping pending object" and "Failed to wait for partition map exchange" kinds of messages in the Ignite JVM logs.
My env setup is as follows:
Apache Ignite 1.9.0
RHEL 7.2 (Maipo) runs on each of the 12 nodes
Oracle JDK 1.8
ZooKeeper 3.4.9
Please let me know your thoughts.
TIA

There are generally two possible reasons:
1. Memory issues. For example, if a node goes into a long GC pause, it can become unresponsive and therefore be removed from the topology (a GC-logging sketch follows this list). For more details read here: https://apacheignite.readme.io/docs/jvm-and-system-tuning
2. Network connectivity issues. Check whether the network between your VMs is stable. You may also want to try increasing the failure detection timeout: https://apacheignite.readme.io/docs/cluster-config#failure-detection-timeout
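One way to confirm or rule out the GC theory is to enable GC logging on every Ignite node and correlate long stop-the-world pauses with the node-failure events. A minimal sketch, assuming the nodes are started via ignite.sh, which picks up extra options from the JVM_OPTS environment variable; the heap sizes and log path are illustrative:
export JVM_OPTS="$JVM_OPTS -Xms4g -Xmx4g \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/ignite/gc.log"
# Long "Total time for which application threads were stopped" entries in gc.log that
# line up with the "node failed" timestamps point to GC pauses as the cause.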

VM migrations sometimes involve suspending the VM. While the VM is suspended, it has no clean way to communicate with the rest of the cluster and will appear to be down.

Related

How does Kubernetes kubelet resource reservation work

I recently tried to bring up a Kubernetes cluster in AWS using kops. But when the worker node (Ubuntu 20.04) started, a docker load process on it kept getting OOM-killed even though the node has enough memory (~14 GiB). I tracked the issue down to having set the kubelet's memory reservation too small (--kube-reserved=memory=100Mi...).
So now I have two questions related to the following paragraph in the documentation:
kube-reserved is meant to capture resource reservation for kubernetes system daemons like the kubelet, container runtime, node problem detector, etc.
https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#kube-reserved
First, I interpreted the "reservation" as "the amount of memory guaranteed", similar to the concept of a pod's .spec.resources.requests.memory. However, it seems like the flag acts as a limit as well. Does this mean Kubernetes intends to manage the Kubernetes system daemons with the "Guaranteed" QoS class concept?
Also, my container runtime, docker, does not seem to be in the /kube-reserved cgroup; instead, it is in /system.slice:
$ systemctl status $(pgrep dockerd) | grep CGroup
CGroup: /system.slice/docker.service
So why is it getting limited by /kube-reserved? It is not even kubelet talking to docker through CRI, but just my manual docker load command.
kube-reserved is a way to protect the Kubernetes system daemons (which include the kubelet) from running out of memory should the pods consume too much. How is this achieved? The pods are limited by default to an "allocatable" value, equal to the memory capacity of the node minus several of the flag values defined in the URL you posted, one of which is kube-reserved. On a 7-GiB DS2_v2 node in AKS, for example, roughly 1.6 GiB of that capacity is set aside as kube-reserved, so allocatable ends up well below the full 7 GiB.
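You can see the resulting split on any node with kubectl (the node name below is illustrative):
kubectl describe node aks-nodepool1-12345678-0 | grep -A 8 -E "Capacity|Allocatable"
# Capacity reports the full memory of the VM, while Allocatable reports what is left for
# pods once kube-reserved, system-reserved, and the eviction threshold are subtracted.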
But it's not always the Kubernetes system daemons that have to be protected from pods or OS components consuming too much memory. It can very well be the Kubernetes system daemons themselves that consume too much memory and start affecting the pods or other OS components. To protect against this scenario, there's an additional flag defined:
To optionally enforce kube-reserved on kubernetes system daemons, specify the parent control group for kube daemons as the value for --kube-reserved-cgroup kubelet flag.
With this flag in place, should the aggregated memory use of the Kubernetes system daemons exceed the cgroup limit, the OOM killer will step in and terminate one of their processes. Applied to the example above, with the --kube-reserved-cgroup flag specified, the Kubernetes system daemons are prevented from going over the 1,638 MiB reserved for them on that node.
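For reference, a rough sketch of the kubelet flags involved; the values here are illustrative, not the AKS defaults:
kubelet \
  --kube-reserved=cpu=200m,memory=1638Mi \
  --kube-reserved-cgroup=/kube-reserved \
  --enforce-node-allocatable=pods,kube-reserved \
  ...
# Without "kube-reserved" in --enforce-node-allocatable, the reservation only shrinks the
# pods' allocatable value; with it, the /kube-reserved cgroup itself gets a memory limit,
# so daemons placed in that cgroup can be OOM-killed for exceeding it.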

Kubernetes liveness probes stop responding

I'm using kube-aws to set up a production Kubernetes cluster on AWS. Now I'm running into an issue I am unable to recreate in my dev environment.
When a pod is running heavy computation, which in my case happens in bursts, the liveness and readiness probes stop responding. In my local environment, where I run docker compose, they work as expected.
The probes are simple HTTP handlers that return 204 No Content, implemented with native Go functionality.
Has anyone seen this before?
EDIT:
I'd be happy to provide more information, but I am uncertain what to provide, as there is a tremendous amount of configuration and code involved. Primarily, I'm looking for help on how to troubleshoot this and where to look to locate the actual issue.
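One place to start (the pod name below is illustrative) is checking whether the probes are timing out rather than being refused:
kubectl describe pod my-api-pod | grep -A 3 Liveness
kubectl get events --sort-by=.lastTimestamp | grep -i probe
# If the probe failures only appear during the computation bursts, the handler is likely
# not answering within timeoutSeconds while the CPU is saturated; raising timeoutSeconds
# and failureThreshold on the probes, or giving the pod more CPU, is a common mitigation.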

How to debug unexpected instance termination on Google Compute Engine

I have a mongo database running on a Google Compute Engine instance. For the second time now (in a few months), the server unexpectedly shut down into the "TERMINATED" state. How do I find the cause of the shutdown?
The serial console just says, "The resource 'projects/my-project/zones/europe-west1-b/instances/mongo-db' is not ready".
I looked into the database logs, seems it received an external signal to shut down ("got signal 15 (Terminated)").
Nothing suspicious in the syslogs or messages logs after spinning up a new instance on the same disk. Also, there was no planned maintenance as far as I'm aware.
Any idea where to look?
Since your mongo database actually received a terminate signal, your instance was probably shut down gracefully somehow. It sounds like something related to automatic migrations, but there are a couple of things to look at to help narrow this down.
In the Google Developers Console, go to Compute -> Compute Engine -> VM instances -> mongo-db. There should be a section called "Availability policies." Check "On host maintenance" to make sure "Migrate VM instance" is selected. Otherwise, the VM will shut down instead of migrating during maintenance.
You can also look at the operations for an instance at Compute -> Compute Engine -> Operations. This lists all of the operations that you and the system performed on your instances, so you may see something around the time the process was terminated. The same information is available from the gcloud CLI via gcloud compute operations list.
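For example, using the zone and instance name from your error message (a sketch; double-check the flags against your gcloud version):
# list what you and the system did around the time of the shutdown
gcloud compute operations list --zones europe-west1-b --filter="targetLink:mongo-db"
# check the current maintenance behaviour, and switch to live migration if needed
gcloud compute instances describe mongo-db --zone europe-west1-b --format="value(scheduling.onHostMaintenance)"
gcloud compute instances set-scheduling mongo-db --zone europe-west1-b --maintenance-policy MIGRATE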

What could be causing a seemingly random AWS EC2 server crash? (Error: couldn't establish database connection)

To begin, I am running a WordPress site on an AWS EC2 Ubuntu micro instance. I have already confirmed that this is NOT an error with WordPress/MySQL.
Seemingly at random, the site will go down and I'll get the "Error establishing database connection" message. The server says that it is running just fine, and rebooting usually fixes the issue; however, I'd like to figure out the cause and resolve it so this stops happening (for the past two weeks it has gone down almost every other day).
It's not a spike in traffic, or at least Google Analytics hasn't shown the site having any spikes in traffic (it averages about 300 visits per day).
What's the cause, and how can this be fixed?
Sounds like you might be running into the throttling that is a limitation of t1.micro instances. If you use too many CPU cycles, you will be throttled.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html#available-cpu-resources-during-spikes
The next time this happens, I would check some general stats on the health of the instance. You can get a feel for the high-level health of the instance using the top command (http://linuxaria.com/howto/understanding-the-top-command-on-linux?lang=en). Be sure to look at CPU and memory usage. You may find a process (pid) that is consuming a lot of resources and starving your app.
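For example, the kind of quick checks meant here (a sketch; service names may differ on your setup):
top                                # overall CPU and memory usage, sorted by consumption
free -m                            # how much of the micro instance's RAM is actually left
dmesg | grep -i "out of memory"    # whether the kernel OOM-killer has terminated anything (mysqld is a common victim on small instances)
sudo service mysql status          # whether MySQL is still up when the error appears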
More likely, something within your application (how did you come to the conclusion that this is not a Wordpress/MySQL issue?) is going out of control. Possibly there is a database connection not being released? To see what your app is doing, find the process id (pid) for your app:
ps aux | grep "php"
and get a thread dump for that process. (kill -3 <pid> produces a thread dump for a Java process; for a PHP app you will need the equivalent tool for your stack.) This will help you see where your application's threads are stuck (if they are).
Typically it's good practice to take two thread dumps a few seconds apart and compare them. If there is an issue in the application, you should see a lot of threads stuck at the same point.
You might also want to checkout what MySQL is seeing (https://dev.mysql.com/doc/refman/5.1/en/show-processlist.html).
mysql> SHOW FULL PROCESSLIST;
Hope this helps, let us know what you find!

Can't rerun the Meteor leaderboard on an AWS EC2 t1.micro instance after a failed keepalive

I'm unable to run the Meteor leaderboard demo after a failed keepalive error on an AWS EC2 t1.micro instance. If I start from a freshly booted Amazon Machine Image (AMI), I'm able to run the leaderboard demo at localhost:3000 from Firefox when I'm connected with a VNC client (TightVNC Viewer). It runs very, very slowly, but it runs.
If I fail to interact with it soon enough, however, I get these messages:
I2051-00:03:03.173(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Meteor server restarted
From that point forward everything on that instance runs at a glacial pace. Switching back to the Firefox window takes 3 minutes. When I try to connect to //localhost:3000 in Firefox, I usually get a message about a script no longer running, and eventually the terminal window adds this to what I wrote above:
I2051-00:06:02.443(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Meteor server restarted
I2051-00:08:17.227(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Your application is crashing. Waiting for file change.
Can anyone translate for me what is happening?
I'm wondering whether the t1.micro instance I'm running is just too under-powered, or whether Meteor is not being shut down properly, leaving an instance of MongoDB running while trying to launch another.
I'm using Amazon Machine Image ubuntu-precise-12.04-amd64-server-20130411.1 (ami-70f96e40), which says this about its configuration:
Size: t1.micro
ECUs: up to 2
vCPUs: 1
Memory (GiB): 0.613
Instance Storage (GiB): EBS only
EBS-Optimized Available: -
Netw. Performance: Very Low
Micro instances
Micro instances are a low-cost instance option, providing a small amount of CPU resources. They are suited for lower throughput applications, and websites that require additional compute cycles periodically, but are not appropriate for applications that require sustained CPU performance. Popular uses for micro instances include low traffic websites or blogs, small administrative applications, bastion hosts, and free trials to explore EC2 functionality.
If my guess is right, can anyone suggest an AMI suitable for Meteor development?
Thanks
Check this answer.
Also try removing the autopublish package: meteor remove autopublish
How are you running the app on EC2? I have been able to run apps on a micro instance, so I don't see why this should be an issue.
If you are running it with 'meteor' as you would locally, that's probably the issue. You get much better performance when running it as a Node app; this typically isn't a problem when developing locally, but it may be too much for an EC2 micro.
What you want to do is run 'meteor bundle example.tgz', upload that to the server, and run it as a Node app.
Here is a guide that I remember using a while ago to get it done on ec2:
http://julien-c.fr/2012/10/meteor-amazon-ec2/
You shouldn't need to use VNC either; you can access the app from your own computer in a browser using the public address your instance gets assigned.
If you get a node fibers error message, which is pretty common, cd into bundle/program/server, run 'npm uninstall fibers', and then 'npm install fibers'.
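Pulling the steps above together, here is a rough sketch of the bundle-and-run approach; the app name, paths, and addresses are illustrative:
# on your development machine
meteor bundle example.tgz
scp example.tgz ubuntu@your-ec2-public-dns:~/
# on the EC2 instance
tar -xzf example.tgz
# rebuild fibers here if you hit the error described above
PORT=3000 MONGO_URL=mongodb://localhost:27017/example ROOT_URL=http://your-ec2-public-dns node bundle/main.js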