Kubernetes node fails (CoreOS/AWS/Kubernetes stack)

We have a small testing Kubernetes cluster running on AWS, using CoreOS, as per the instructions here. Currently it consists of only a master and a worker node. In the couple of weeks we've been running this cluster, we've noticed that the worker instance occasionally fails. The first time this happened, the instance was killed and restarted by the auto-scaling group it is in. Today the same thing happened, and this time we were able to log in to the instance before it was shut down and retrieve some information, but it remains unclear to me exactly what caused the problem.
The node failure seems to happen on an irregular basis, and there is no evidence that there is anything abnormal happening which would precipitate this (external load etc).
Subsequent to the failure (Kubernetes node status Not Ready), the instance was still running but had inactive kubelet and docker services (start failed with result 'dependency'). The flanneld service was running, but its last restart time was after the time the node failure was seen.
Logs from around the time of the node failure don't show anything clearly pointing to a cause. There are a couple of kubelet-wrapper errors at about the time the failure was seen:
`Jul 22 07:25:33 ip-10-0-0-92.ec2.internal kubelet-wrapper[1204]: E0722 07:25:33.121506 1204 kubelet.go:2745] Error updating node status, will retry: nodes "ip-10-0-0-92.ec2.internal" cannot be updated: the object has been modified; please apply your changes to the latest version and try again`
`Jul 22 07:25:34 ip-10-0-0-92.ec2.internal kubelet-wrapper[1204]: E0722 07:25:34.557047 1204 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"ip-10-0-0-92.ec2.internal.1462693ef85b56d8", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"4687622", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-0-92.ec2.internal", UID:"ip-10-0-0-92.ec2.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeHasSufficientDisk", Message:"Node ip-10-0-0-92.ec2.internal status is now: NodeHasSufficientDisk", Source:api.EventSource{Component:"kubelet", Host:"ip-10-0-0-92.ec2.internal"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63604448947, nsec:0, loc:(*time.Location)(0x3b1a5c0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63604769134, nsec:388015022, loc:(*time.Location)(0x3b1a5c0)}}, Count:2, Type:"Normal"}': 'events "ip-10-0-0-92.ec2.internal.1462693ef85b56d8" not found' (will not retry!)
Jul 22 07:25:34 ip-10-0-0-92.ec2.internal kubelet-wrapper[1204]: E0722 07:25:34.560636 1204 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"ip-10-0-0-92.ec2.internal.14626941554cc358", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"4687645", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-0-92.ec2.internal", UID:"ip-10-0-0-92.ec2.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeReady", Message:"Node ip-10-0-0-92.ec2.internal status is now: NodeReady", Source:api.EventSource{Component:"kubelet", Host:"ip-10-0-0-92.ec2.internal"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63604448957, nsec:0, loc:(*time.Location)(0x3b1a5c0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63604769134, nsec:388022975, loc:(*time.Location)(0x3b1a5c0)}}, Count:2, Type:"Normal"}': 'events "ip-10-0-0-92.ec2.internal.14626941554cc358" not found' (will not retry!)`
followed by what looks like some etcd errors:
`Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [WARNING][1305/140149086452400] calico.etcddriver.driver 810: etcd watch returned bad HTTP status topoll on index 5237916: 400
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [ERROR][1305/140149086452400] calico.etcddriver.driver 852: Error from etcd for index 5237916: {u'errorCode': 401, u'index': 5239005, u'message': u'The event in requested index is outdated and cleared', u'cause': u'the requested history has been cleared [5238006/5237916]'}; triggering a resync.
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [INFO][1305/140149086452400] calico.etcddriver.driver 916: STAT: Final watcher etcd response time: 0 in 630.6s (0.000/s) min=0.000ms mean=0.000ms max=0.000ms
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [INFO][1305/140149086452400] calico.etcddriver.driver 916: STAT: Final watcher processing time: 7 in 630.6s (0.011/s) min=90066.312ms mean=90078.569ms max=90092.505ms
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [INFO][1305/140149086452400] calico.etcddriver.driver 919: Watcher thread finished. Signalled to resync thread. Was at index 5237916. Queue length is 1.
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,743 [WARNING][1305/140149192694448] calico.etcddriver.driver 291: Watcher died; resyncing.`
and a few minutes later a large number of failed connections to the master (10.0.0.50):
`Jul 22 07:36:41 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:36:37,641 [WARNING][1305/140149086452400] urllib3.connectionpool 647: Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7700b85b90>: Failed to establish a new connection: [Errno 113] Host is unreachable',)': http://10.0.0.50:2379/v2/keys/calico/v1?waitIndex=5239006&recursive=true&wait=true
Jul 22 07:36:41 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:36:37,641 [INFO][1305/140149086452400] urllib3.connectionpool 213: Starting new HTTP connection (2): 10.0.0.50`
Although these errors are presumably related to the node/instance failure, they don't mean a lot to me and certainly don't seem to suggest the underlying cause. If anyone can see anything here that suggests a possible cause of the node/instance failure (and how we can go about rectifying it), that would be greatly appreciated!

Something in your description and logs confuses me: you say you use the Docker runtime, yet rkt appears in your log; you say you use flannel in your cluster, yet Calico appears in your log.
Anyway, from the log you provided, it looks more like your etcd is down, which means the kubelet and Calico can't update their state, so the apiserver regards them as down. There is not enough information here; I can only suggest that you back up etcd's log the next time you see this.
Another suggestion: it is better not to use the same etcd for both the Kubernetes cluster and Calico.

Related

Can't upload mikrotik-chr image to Google Cloud

I am trying to create a mikrotik-chr image from a file in my bucket, but it always fails with an error. I don't know how to fix it.
[inflate.import-virtual-disk]: 2021-08-16T05:39:39Z CreateInstances: Creating instance "inst-importer-inflate-6t2qt".
[inflate]: 2021-08-16T05:39:46Z Error running workflow: step "import-virtual-disk" run error: operation failed &{ClientOperationId: CreationTimestamp: Description: EndTime:2021-08-15T22:39:46.802-07:00 Error:0xc00007b770 HttpErrorMessage:SERVICE UNAVAILABLE HttpErrorStatusCode:503 Id:1873370325760361715 InsertTime:2021-08-15T22:39:40.692-07:00 Kind:compute#operation Name:operation-1629092379433-5c9a6a095186f-620afe4b-ba26ba50 OperationGroupId: OperationType:insert Progress:100 Region: SelfLink:https://www.googleapis.com/compute/v1/projects/circular-jet-322614/zones/asia-southeast2-a/operations/operation-1629092379433-5c9a6a095186f-620afe4b-ba26ba50 StartTime:2021-08-15T22:39:40.692-07:00 Status:DONE StatusMessage: TargetId:6947401086746772724 TargetLink:https://www.googleapis.com/compute/v1/projects/circular-jet-322614/zones/asia-southeast2-a/instances/inst-importer-inflate-6t2qt User:606260965808#cloudbuild.gserviceaccount.com Warnings:[] Zone:https://www.googleapis.com/compute/v1/projects/circular-jet-322614/zones/asia-southeast2-a ServerResponse:{HTTPStatusCode:200 Header:map[Cache-Control:[private] Content-Type:[application/json; charset=UTF-8] Date:[Mon, 16 Aug 2021 05:39:46 GMT] Server:[ESF] Vary:[Origin X-Origin Referer] X-Content-Type-Options:[nosniff] X-Frame-Options:[SAMEORIGIN] X-Xss-Protection:[0]]} ForceSendFields:[] NullFields:[]}:
Code: ZONE_RESOURCE_POOL_EXHAUSTED
Message: The zone 'projects/circular-jet-322614/zones/asia-southeast2-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.
[inflate]: 2021-08-16T05:39:46Z Workflow "inflate" cleaning up (this may take up to 2 minutes).
[inflate]: 2021-08-16T05:39:48Z Workflow "inflate" finished cleanup.
[import-image]: 2021-08-16T05:39:48Z Finished creating Google Compute Engine disk
[import-image]: 2021-08-16T05:39:49Z step "import-virtual-disk" run error: operation failed &{ClientOperationId: CreationTimestamp: Description: EndTime:2021-08-15T22:39:46.802-07:00 Error:0xc00007b770 HttpErrorMessage:SERVICE UNAVAILABLE HttpErrorStatusCode:503 Id:1873370325760361715 InsertTime:2021-08-15T22:39:40.692-07:00 Kind:compute#operation Name:operation-1629092379433-5c9a6a095186f-620afe4b-ba26ba50 OperationGroupId: OperationType:insert Progress:100 Region: SelfLink:https://www.googleapis.com/compute/v1/projects/circular-jet-322614/zones/asia-southeast2-a/operations/operation-1629092379433-5c9a6a095186f-620afe4b-ba26ba50 StartTime:2021-08-15T22:39:40.692-07:00 Status:DONE StatusMessage: TargetId:6947401086746772724 TargetLink:https://www.googleapis.com/compute/v1/projects/circular-jet-322614/zones/asia-southeast2-a/instances/inst-importer-inflate-6t2qt User:606260965808#cloudbuild.gserviceaccount.com Warnings:[] Zone:https://www.googleapis.com/compute/v1/projects/circular-jet-322614/zones/asia-southeast2-a ServerResponse:{HTTPStatusCode:200 Header:map[Cache-Control:[private] Content-Type:[application/json; charset=UTF-8] Date:[Mon, 16 Aug 2021 05:39:46 GMT] Server:[ESF] Vary:[Origin X-Origin Referer] X-Content-Type-Options:[nosniff] X-Frame-Options:[SAMEORIGIN] X-Xss-Protection:[0]]} ForceSendFields:[] NullFields:[]}: Code: ZONE_RESOURCE_POOL_EXHAUSTED; Message: The zone 'projects/circular-jet-322614/zones/asia-southeast2-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.
ERROR
ERROR: build step 0 "gcr.io/compute-image-tools/gce_vm_image_import:release" failed: step exited with non-zero status: 1
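You will need to check whether you have enough CPU and other resource quota in 'projects/circular-jet-322614/zones/asia-southeast2-a'. The resource requirements can be found by looking at the deployment specs of the workload. Note also that the ZONE_RESOURCE_POOL_EXHAUSTED message in your log itself suggests trying a different zone or retrying later.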

OpenShift HAProxy issues

I am using openshift-django17 to bootstrap my application on OpenShift. Before I moved to Django 1.7, I was using the author's previous repository, openshift-django16, and did not have the problem I describe next. After running successfully for approximately 6 hours, I get the following error:
Service Temporarily Unavailable The server is temporarily unable to
service your request due to maintenance downtime or capacity problems.
Please try again later.
After I restart the application, it works without any problem for some hours, and then I get this error again. The gears should never enter idle mode, as I post some data every 5 minutes through a RESTful POST API from outside the app. I ran the rhc tail command, and I think the error lies in HAProxy:
==> app-root/logs/haproxy.log <==
[WARNING] 081/155915 (497777) : config : log format ignored for proxy 'express' since it has no log address.
[WARNING] 081/155915 (497777) : Server express/local-gear is DOWN, reason: Layer 4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 081/155915 (497777) : proxy 'express' has no server available!
[WARNING] 081/155948 (497777) : Server express/local-gear is UP, reason: Layer7 check passed, code: 200, info: "HTTP status check returned code 200", check duration: 11ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 081/170359 (127633) : config : log format ignored for proxy 'stats' since it has no log address.
[WARNING] 081/170359 (127633) : config : log format ignored for proxy 'express' since it has no log address.
[WARNING] 081/170359 (497777) : Stopping proxy stats in 0 ms.
[WARNING] 081/170359 (497777) : Stopping proxy express in 0 ms.
[WARNING] 081/170359 (497777) : Proxy stats stopped (FE: 1 conns, BE: 0 conns).
[WARNING] 081/170359 (497777) : Proxy express stopped (FE: 206 conns, BE: 312 co
I also run a cron job once a day, but I am 99% sure it has nothing to do with this. It looks like a problem on OpenShift's side, right? I posted this issue on the GitHub page of the author's repository, where he suggested I try Stack Overflow.
It turned out this was due to a bug in openshift-django17 that set DEBUG in settings.py to True even though the environment variable specified False (pull request for fix here). The 503 Service Temporarily Unavailable error appeared because of OpenShift memory limit violations caused by DEBUG being turned on, as stated in the Django settings documentation for DEBUG:
It is also important to remember that when running with DEBUG turned on, Django will remember every SQL query it executes. This is useful when you’re debugging, but it’ll rapidly consume memory on a production server.
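For illustration, here is a minimal sketch of one way to read a boolean flag from the environment without this kind of surprise. It assumes (the thread does not say so) that the variable is called DEBUG and that the pitfall is the usual one: any non-empty string, including "False", is truthy in Python. The helper name env_flag and the accepted spellings are hypothetical choices, not taken from openshift-django17.

```python
import os

# Spellings treated as "on"; everything else (including "False") is "off".
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Return the boolean meaning of an environment variable.

    `environ` defaults to os.environ; it is a parameter only so the
    function can be exercised without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_VALUES


# In settings.py one might then write (variable name is an assumption):
DEBUG = env_flag("DEBUG")
```

With this helper, an environment value of "False" yields False, whereas a plain truthiness check on the raw string would yield True.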

Spark - Remote Akka Client Disassociated

I am setting up Spark 0.9 on AWS and am finding that when launching the interactive PySpark shell, my executors / remote workers are first being registered:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@ip-xx-xx-xxx-xxx.ec2.internal:54110/user/Executor#-862786598] with ID 0
and then disassociated almost immediately, before I have the chance to run anything:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Executor 0 disconnected, so removing it
14/07/08 22:48:05 ERROR scheduler.TaskSchedulerImpl: Lost an executor 0 (already removed): remote Akka client disassociated
Any idea what might be wrong? I've tried adjusting the JVM options spark.akka.frameSize and spark.akka.timeout, but I'm pretty sure this is not the issue since (1) I'm not running anything to begin with, and (2) my executors are disconnecting a few seconds after startup, which is well within the default 100s timeout.
Thanks!
Jack
I had a very similar problem, if not the same one.
It started to work for me once the workers connected to the master using exactly the same name the master thought it had.
My log messages were something like:
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@idc1-hrm1.heylinux.com:7078] -> [akka.tcp://sparkMaster@vagrant-centos64.vagrantup.com:7077]: Error [Association failed with [akka.tcp://sparkMaster@vagrant-centos64.vagrantup.com:7077]].
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@192.168.121.127:7078] -> [akka.tcp://sparkMaster@idc1-hrm1.heylinux.com:7077]: Error [Association failed with [akka.tcp://sparkMaster@idc1-hrm1.heylinux.com:7077]]
WARN util.Utils: Your hostname, idc1-hrm1 resolves to a loopback address: 127.0.0.1; using 192.168.121.187 instead (on interface eth0)
So check the log of the master and see what name it thinks it has.
Then use that very same name on the workers.
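The WARN line in the log above shows a hostname resolving to a loopback address, which is one way a node can end up advertising a name that other nodes cannot use. As a rough aid when comparing names, the following sketch checks what a hostname resolves to on a given machine. The function name resolves_to_loopback and its resolver parameter are hypothetical, added only so the check can be tried without real DNS; this is not part of Spark.

```python
import ipaddress
import socket


def resolves_to_loopback(hostname, resolver=socket.gethostbyname):
    """Return True if `hostname` resolves to a loopback address (e.g. 127.0.0.1)."""
    address = resolver(hostname)
    return ipaddress.ip_address(address).is_loopback


if __name__ == "__main__":
    name = socket.gethostname()
    try:
        print(name, "resolves to loopback:", resolves_to_loopback(name))
    except socket.gaierror as exc:
        # The local hostname may not be resolvable at all.
        print("could not resolve", name, exc)
```

Running it on the master and on each worker gives a quick view of whether the name each machine reports for itself is reachable from the others.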

server1 instance in websphere shuts down regularly

I have a WSDL web service in the server1 instance of WebSphere.
This server1 instance shuts down regularly. No error logs are generated when the shutdown occurs.
However, whenever the server1 instance of WebSphere is started, these errors and exceptions are generated:
The certificate (Owner: "CN=SOAPRequester, OU=TRL, O=IBM, ST=Kanagawa, C=JP") with alias "soaprequester" from keystore "D:\IBM\WEBSPH~1\APPSER~1\etc\ws-security\samples\dsig-sender.ks" has expired: java.security.cert.CertificateExpiredException: NotAfter: Sat Oct 01 19:24:06 CST 2011
The certificate (Owner: "CN=SOAPProvider, OU=TRL, O=IBM, ST=Kanagawa, C=JP") with alias "soapprovider" from keystore "D:\IBM\WEBSPH~1\APPSER~1\etc\ws-security\samples\dsig-receiver.ks" has expired: java.security.cert.CertificateExpiredException: NotAfter: Sat Oct 01 19:30:39 CST 2011
Method createManagedConnctionWithMCWrapper caught an exception during creation of the ManagedConnection for resource jms/BPECF, throwing ResourceAllocationException. Original exception: javax.resource.spi.ResourceAdapterInternalException: createQueueConnection failed
com.ibm.mqservices.MQInternalException: MQJE001: An MQException occurred: Completion Code 2, Reason 2063
MQJE027: Queue manager security exit rejected connection with error code 23
javax.jms.JMSSecurityException: MQJMS2013: invalid security authentication supplied for MQQueueManager
My questions are:
1. Is MQ required by the WSDL service?
2. Could any of these 5 errors be causing the frequent downtime?
As far as I understand, you have WebSphere Process Server configured with WebSphere MQ as the message bus.
An MQ queue might be represented as a JMS binding in a SOAP over JMS configuration (see the IBM article).
Regarding the errors:
The first 2 errors are simple: the certificates have expired. You should update them.
I assume exceptions 3-5 are one error; there is an answer to this question on Stack Overflow.
Reason 2063 is a security-related problem.

Amazon SWF: at least one worker has to be running, why?

I've just started out using the AWS Ruby SDK to manage a simple workflow. One behavior I noticed right away is that at least one relevant worker and one relevant decider must be running prior to submitting a new workflow execution.
If I submit a new workflow execution before starting my worker and decider, then the tasks are never picked up, even when I'm still well within time-out limits. Why is this? Based on the description of how the HTTP long polling works, I would expect either app to receive the relevant tasks when the call to poll() is reached.
I encounter other deadlocking situations after a job fails (e.g. due to a worker or decider bug, or due to being terminated). Sometimes, re-running or even just starting an entirely new workflow execution will result in a deadlocked workflow execution. The initial decision tasks are shown in the workflow execution history in the AWS console, but the decider never receives them. Admittedly, I'm having trouble confirming/reducing this issue to a test case, but I suspect it is related to the above issue. This happens roughly 10 to 20% of the time; the rest of the time, everything works.
Some other things to mention: I'm using a single task list for two separate activity tasks that run in sequence. Both the worker and the decider are polling the same task list.
Here is my worker:
require 'yaml'
require 'aws'

# Load the AWS settings from config.yaml next to this script.
config_file_path = File.join(File.dirname(File.expand_path(__FILE__)), 'config.yaml')
config = YAML::load_file(config_file_path)
swf = AWS::SimpleWorkflow.new(config)
domain = swf.domains['test-domain']

puts("waiting for an activity")
# Long-poll the task list; the block runs once per activity task received.
domain.activity_tasks.poll('hello-tasklist') do |activity_task|
  name = activity_task.activity_type.name
  puts name
  # Report the activity type name as the task result.
  activity_task.complete! :result => name
  puts("waiting for an activity")
end
EDIT
Another user on the AWS forums commented:
I think the cause is SWF not immediately recognizing a long poll connection shutdown. When you kill a worker, its connection can be considered open by the service for some time, so the service can still dispatch a task to it. To you it looks like the new worker never gets it. The way to verify this is to check the workflow history. You'll see an activity task started event with an identity field that contains the host and pid of the dead worker. Eventually such a task is going to time out and can be retried by the decider.
Note that such a condition is common during unit tests that frequently terminate connections and is not really a problem for production applications. The common workaround is to use a different task list for each unit test.
This seems to be a pretty reasonable explanation. I'm going to try to confirm this.
You've raised two issues: one regarding start of an execution with no active deciders and the other regarding actors crashing in the middle of a task. Let me address them in order.
I have carried out an experiment based on your observations and, indeed, when a new workflow execution starts and no deciders are polling, SWF still thinks that a new decision task gets started. The following is my event log from the AWS console. Note what happens:
Fri Feb 22 22:15:38 GMT+000 2013 1 WorkflowExecutionStarted
Fri Feb 22 22:15:38 GMT+000 2013 2 DecisionTaskScheduled
Fri Feb 22 22:15:38 GMT+000 2013 3 DecisionTaskStarted
Fri Feb 22 22:20:39 GMT+000 2013 4 DecisionTaskTimedOut
Fri Feb 22 22:20:39 GMT+000 2013 5 DecisionTaskScheduled
Fri Feb 22 22:22:26 GMT+000 2013 6 DecisionTaskStarted
Fri Feb 22 22:22:27 GMT+000 2013 7 DecisionTaskCompleted
Fri Feb 22 22:22:27 GMT+000 2013 8 ActivityTaskScheduled
Fri Feb 22 22:22:29 GMT+000 2013 9 ActivityTaskStarted
Fri Feb 22 22:22:30 GMT+000 2013 10 ActivityTaskCompleted
...
The first decision task was immediately scheduled (which is expected) and started right away (i.e. allegedly dispatched to a decider, even though no decider was running). I started a decider in the meantime, but the workflow didn't move until the timeout of the original decision task, 5 minutes later. I can't think of a scenario where this would be the desired behavior. Two possible defenses against that: have deciders running before starting a new execution or set an acceptably low timeout on a decision task (these tasks should be immediate anyway).
The issue of crashing actor (either decider or worker) is one that I'm familiar with. A short background note first:
Both activity and decision tasks are recorded by the service in 3 stages:
Scheduled = ready to be picked up by an actor.
Started = already picked up by an actor.
Completed/Failed or Timed out = the actor either completed or failed the task, or did not finish it within the deadline.
Once an actor has picked up a task and crashed, it is obviously not going to report anything back to the service (unless it is able to recover and still remembers the task token of the dispatched task, but most crashing actors wouldn't be that smart). The next decision task will be scheduled only upon timeout of the recently dispatched task, which is why all actors seem to be blocked for the duration of a task timeout. This is actually the desired behavior: the service can't know whether the task is being worked on or not as long as the worker is still within its deadline. There is a simple way to deal with this: fit your actors with a try-catch block and fail the task when an unexpected crash happens. I would discourage using separate task lists for each integration test. Instead, I'd recommend failing the task in the teardown() block. SWF lets you specify a reason for failing a task, which is one way of logging failures and viewing them later through the AWS console.
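The try-catch advice is language-neutral, so here is a rough sketch of the pattern in Python rather than in the Ruby SDK used in the question. FakeTask is a hypothetical stand-in for a real task handle, and its complete and fail methods are illustrative names, not an SDK API. The 256-character cut on the reason is an assumed limit; check the service documentation before relying on it.

```python
import traceback


class FakeTask:
    """Hypothetical stand-in for an SWF task handle; records how it ended."""

    def __init__(self):
        self.outcome = None

    def complete(self, result):
        self.outcome = ("completed", result)

    def fail(self, reason, details=""):
        self.outcome = ("failed", reason)


def run_task(task, handler):
    """Run `handler`; on an unexpected error, fail the task explicitly
    instead of leaving it in the Started state until its timeout expires."""
    try:
        result = handler()
    except Exception as exc:
        # Keep the reason short (256 characters is an assumed limit);
        # the full traceback goes into the details field.
        task.fail(reason=str(exc)[:256], details=traceback.format_exc())
    else:
        task.complete(result)
```

The same wrapper can be called from a test's teardown step, so that a task picked up by a test that aborted is failed right away and the decider can react without waiting for the timeout.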