Is the Kafka strategy compatible with testing environments? - unit-testing

Following https://stackoverflow.com/a/36009859/9911256, it would appear that Kafka commits/autocommits can fail if the consumer suddenly dies. In fact, my Kafka application works fine in production, but during tests, SOMETIMES I get this recurrent issue (until I reboot Kafka): the offset stays the same.
My unit test (one Java producer sending 10 messages to one Java consumer, on one broker, one topic, one partition, one group) sends the 10 messages, and I check them starting from the first:
SENT: (0) NAME:person-001; UUID:352c1f8e-c141-4446-8ac7-18eb044a6b92
SENT: (1) NAME:person-001; UUID:81681a30-83e1-4f85-b07f-da140cfdb874
SENT: (2) NAME:person-001; UUID:3b9db497-460a-4a1c-86b9-f724af1a0449
SENT: (3) NAME:person-001; UUID:63c0edf9-ec00-4ef7-b81a-4b1b8919a42d
SENT: (4) NAME:person-001; UUID:346f265c-1964-4460-97de-1a7b43285c06
SENT: (5) NAME:person-001; UUID:2d1bb49c-03ce-4762-abb3-2bbb963e87d1
SENT: (6) NAME:person-001; UUID:3c8ddda0-6cb8-45b4-b1d2-3a99ba57a48a
SENT: (7) NAME:person-001; UUID:3f819408-41d5-4cad-ad39-322616a86b99
SENT: (8) NAME:person-001; UUID:1db09bc1-4c90-4a0d-8efc-d6ea8a791985
SENT: (9) NAME:person-001; UUID:705a3a3c-fd15-45a9-a96c-556350f1f79a
Exception in thread "Thread-2" org.opentest4j.AssertionFailedError: expected: <352c1f8e-c141-4446-8ac7-18eb044a6b92> but was: <6785fa5d-ef63-4fe6-85c5-c525bfc4ee12>
And if I run the test again:
SENT: (0) NAME:person-001; UUID:d171e7ee-fa73-4cb4-826e-f7bffdef9e92
SENT: (1) NAME:person-001; UUID:25da6b6e-57e9-4f8a-a3ff-1099f94fcaf5
SENT: (2) NAME:person-001; UUID:d05b4693-ba60-4db2-a5ae-30dcd44ce5b7
SENT: (3) NAME:person-001; UUID:fbd75ee7-6f34-4ab1-abda-d31ee91d0ff8
SENT: (4) NAME:person-001; UUID:798fe246-f10e-4fc3-90c9-df3e181bb641
SENT: (5) NAME:person-001; UUID:26b33a19-7e65-49ec-b54d-3379ef76b797
SENT: (6) NAME:person-001; UUID:45ecef46-69f5-4bff-99b5-c7c2dce67ec8
SENT: (7) NAME:person-001; UUID:464df926-cd66-4cfa-b282-36047522dfe8
SENT: (8) NAME:person-001; UUID:982c82c0-c669-400c-a70f-62c57e3552a4
SENT: (9) NAME:person-001; UUID:ecdbfce6-d378-496d-9e0b-30f16b7cf484
Exception in thread "Thread-2" org.opentest4j.AssertionFailedError: expected: <d171e7ee-fa73-4cb4-826e-f7bffdef9e92> but was: <6785fa5d-ef63-4fe6-85c5-c525bfc4ee12>
Notice that the message <6785fa5d-ef63-4fe6-85c5-c525bfc4ee12> keeps coming back on every attempt, even though these are different runs that I launch manually.
I use:
properties.put("auto.offset.reset", "latest");
I've already tried the autocommit option, with equivalent results.
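For reference, the test consumer is configured roughly as follows. This is only a minimal sketch: the bootstrap address is a placeholder, the group id 0 and topic t0001 are taken from the logs further down, and the remaining settings are my assumptions about a typical setup.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TestConsumer {
    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.put("bootstrap.servers", "localhost:9092"); // placeholder
        properties.put("group.id", "0");                       // the single test group (from the broker log)
        properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("auto.offset.reset", "latest");         // as in the question
        properties.put("enable.auto.commit", "true");          // also tried with "false"

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties)) {
            consumer.subscribe(Collections.singletonList("t0001")); // topic from the update below
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println("RECEIVED: " + record.value());
            }
        }
    }
}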
Worst of all, whether this happens or not seems to be decided every time I restart the Kafka server:
sometimes I restart it and the tests go fine, even if I repeat them any number of times;
sometimes it enters this failure state and will ALWAYS fail.
As I said, when consumers don't die and massive flows of messages are handled, this issue does not appear.
I've also noticed that if the server has been recently rebooted and all of Kafka's log and data directories have been deleted, the first tests can be delayed and fail.
My logs show this:
2019-01-25T12:20:02.874119+01:00 TLS dockcompose: kafka_1 | [2019-01-25 11:20:02,850] INFO [GroupCoordinator 1001]: Preparing to rebalance group 0 in state PreparingRebalance with old generation 26 (__consumer_offsets-48) (reason: Adding new member consumer-1-a0b94a2a-0cae-4ba8-85f0-9a84030f4beb) (kafka.coordinator.group.GroupCoordinator)
2019-01-25T12:20:02.874566+01:00 TLS dockcompose: kafka_1 | [2019-01-25 11:20:02,851] INFO [GroupCoordinator 1001]: Stabilized group 0 generation 27 (__consumer_offsets-48) (kafka.coordinator.group.GroupCoordinator)
2019-01-25T12:20:02.874810+01:00 TLS dockcompose: kafka_1 | [2019-01-25 11:20:02,858] INFO [GroupCoordinator 1001]: Assignment received from leader for group 0 for generation 27 (kafka.coordinator.group.GroupCoordinator)
13 seconds after the test:
2019-01-25T12:20:15.894185+01:00 TLS dockcompose: kafka_1 | [2019-01-25 11:20:15,871] INFO [GroupCoordinator 1001]: Member consumer-2-79f97c80-294c-438b-8a8a-3745f4a57010 in group 0 has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
2019-01-25T12:20:15.894522+01:00 TLS dockcompose: kafka_1 | [2019-01-25 11:20:15,871] INFO [GroupCoordinator 1001]: Preparing to rebalance group 0 in state PreparingRebalance with old generation 27 (__consumer_offsets-48) (reason: removing member consumer-2-79f97c80-294c-438b-8a8a-3745f4a57010 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
2019-01-25T12:20:17.897272+01:00 TLS dockcompose: kafka_1 | [2019-01-25 11:20:17,865] INFO [GroupCoordinator 1001]: Member consumer-1-a0b94a2a-0cae-4ba8-85f0-9a84030f4beb in group 0 has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
2019-01-25T12:20:17.897579+01:00 TLS dockcompose: kafka_1 | [2019-01-25 11:20:17,866] INFO [GroupCoordinator 1001]: Group 0 with generation 28 is now empty (__consumer_offsets-48) (kafka.coordinator.group.GroupCoordinator)
What I conclude from reading the topic linked above is that Kafka works fine as long as the consumer stays permanently connected (production). But that is not possible during tests! What is the problem here?
UPDATE: I've found that, effectively, the CURRENT-OFFSET does not change in some cases (which is weird, because I keep receiving the message <6785fa5d-ef63-4fe6-85c5-c525bfc4ee12>), while the LOG-END-OFFSET of course keeps growing, but the behaviour is completely erratic...
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
t0001 0 160 1085 925 consumer-2-1a2cce59-c449-471e-bad0-3c3335f44e26 /10.42.0.105 consumer-2
After launching the test again:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
t0001 0 160 1153 993 consumer-2-dff28a54-b4e8-464a-a5e7-67c8cbad749f /10.42.0.105 consumer-2
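The same two numbers can also be read programmatically from the Java consumer; a minimal sketch, assuming topic t0001, partition 0, and a consumer belonging to the same group:
import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class OffsetInspector {
    // Prints the committed offset (CURRENT-OFFSET) and the log end offset (LOG-END-OFFSET)
    // for t0001 / partition 0, i.e. the columns shown by kafka-consumer-groups above.
    static void printOffsets(KafkaConsumer<String, String> consumer) {
        TopicPartition tp = new TopicPartition("t0001", 0);
        OffsetAndMetadata committed = consumer.committed(tp);
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(Collections.singletonList(tp));
        System.out.println("CURRENT-OFFSET = " + (committed == null ? "none" : committed.offset()));
        System.out.println("LOG-END-OFFSET = " + endOffsets.get(tp));
    }
}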

Related

Find strings in a log file for zabbix monitoring

I need to find strings in a log file with a regex and later send the output to a Zabbix monitoring server to fire triggers if needed.
For example here is a part of the log file:
===== Backup Failures =====
Description: Checks number of studies that their backup failed
Status: OK , Check Time: Sun Oct 30 07:31:13 2022
Details: [OK] 0 total backup commands failed during the last day.
===== Oracle queues =====
Description: Count Oracle queues sizes. The queues are used to pass information between the applications
Status: OK , Check Time: Sun Oct 30 07:31:04 2022
Details: [OK] All queues have less than 15 elements.
===== Zombie Services =====
Description: Checks for zombie services
Status: Error , Check Time: Sun Oct 30 07:31:30 2022, Script: <check_mvs_services.pl>
Details: [CRITICAL] 1 missing process(es) found. Failed killing 1 process(es)
===== IIS Application Pools Memory Usage =====
Description: Checks the memory usage of the application pools that run under IIS (w3wp.exe)
Status: OK , Check Time: Sun Oct 30 07:32:30 2022
Details: [OK] All processes of type w3wp.exe don't exceed memory limits
===== IIS Web Response =====
Description: Checks that the web site responds properly
Status: OK , Check Time: Sun Oct 30 07:32:34 2022
Details: [OK] All addresses returned 200
I need to find all the monitored items and their results.
If a result is not OK, the Zabbix trigger should raise an alarm.
I found that Zabbix can handle log file monitoring with an item similar to the one here, but first I need to find the strings in the log file:
log[/path/to/the/file,"regex expression",,,,]
In this example, I believe these are the items Zabbix should find:
===== Backup Failures =====
Details: [OK] 0 total backup commands failed during the last day.
===== Oracle queues =====
Details: [OK] All queues have less than 15 elements.
===== Zombie Services =====
Details: [CRITICAL] 1 missing process(es) found. Failed killing 1 process(es)
===== IIS Application Pools Memory Usage =====
Details: [OK] All processes of type w3wp.exe don't exceed memory limits
===== IIS Web Response =====
Details: [OK] All addresses returned 200
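If it helps, here is a minimal sketch of a regular expression that separates the OK results from the non-OK ones, assuming the log format shown above. It is written in Java only to illustrate the pattern; the idea is what would go into the regexp parameter of the Zabbix log[] item.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogCheckParser {
    // Matches a "Details:" line whose bracketed status is anything other than OK,
    // e.g. "Details: [CRITICAL] 1 missing process(es) found."
    private static final Pattern NOT_OK = Pattern.compile("^Details: \\[(?!OK\\])(\\w+)\\] (.*)$");

    public static void main(String[] args) {
        String[] lines = {
            "Details: [OK] All queues have less than 15 elements.",
            "Details: [CRITICAL] 1 missing process(es) found. Failed killing 1 process(es)"
        };
        for (String line : lines) {
            Matcher m = NOT_OK.matcher(line);
            if (m.matches()) {
                System.out.println("ALARM -> status=" + m.group(1) + ", details=" + m.group(2));
            }
        }
    }
}
In the Zabbix log[] item itself it may be safer to match the bad statuses positively (for example "Status: Error" or "Details: \[CRITICAL\]"), since support for lookaheads depends on the regexp flavour the agent uses.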
Can you advise how to achieve this?
Any help would be greatly appreciated.
Thanks in advance.

Kafka: Understanding Broker failure

I have a Kafka cluster with:
2 brokers, b-1 and b-2.
2 topics, both with PartitionCount: 1, ReplicationFactor: 2, min.insync.replicas=1.
Here is what happened:
%6|1613807298.974|FAIL|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: Disconnected (after 3829996ms in state UP)
%3|1613807299.011|FAIL|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: Connect to ipv4#172.31.18.172:9096 failed: Connection refused (after 36ms in state CONNECT)
%3|1613807299.128|FAIL|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: Connect to ipv4#172.31.18.172:9096 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
%4|1613807907.225|REQTMOUT|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 1 partially-sent requests
%3|1613807907.225|FAIL|rdkafka#producer-2| [thrd:sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-2.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/2: 1 request(s) timed out: disconnect (after 343439ms in state UP)
%5|1613807938.942|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60767ms, timeout #0)
%5|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60459ms, timeout #1)
%5|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60342ms, timeout #2)
%5|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60305ms, timeout #3)
%5|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out ProduceRequest in flight (after 60293ms, timeout #4)
%4|1613807938.943|REQTMOUT|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: Timed out 6 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
%3|1613807938.943|FAIL|rdkafka#producer-1| [thrd:sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.am]: sasl_ssl://b-1.xxxx.kafka.ap-northeast-1.amazonaws.com:9096/1: 6 request(s) timed out: disconnect (after 4468987ms in state UP)
Within the code, I got this error when my producer performed a poll around that time:
2021-02-20 07:59:08,174 - ERROR - Failed to deliver message due to error: KafkaError{code=REQUEST_TIMED_OUT,val=7,str="Broker: Request timed out"}
Broker b-2 logs have this:
[2021-02-20 07:57:24,781] WARN Client session timed out, have not heard from server in 15103ms for sessionid 0x2000190b5d40001 (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:24,782] WARN Client session timed out, have not heard from server in 12701ms for sessionid 0x2000190b5d40000 (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:24,931] INFO Client session timed out, have not heard from server in 12701ms for sessionid 0x2000190b5d40000, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:24,932] INFO Client session timed out, have not heard from server in 15103ms for sessionid 0x2000190b5d40001, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:32,884] INFO Opening socket connection to server INTERNAL_ZK_DNS/INTERNAL_IP. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:32,910] INFO Opening socket connection to server INTERNAL_ZK_DNS/INTERNAL_IP. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:33,032] INFO Socket connection established to INTERNAL_ZK_DNS/INTERNAL_IP, initiating session (org.apache.zookeeper.ClientCnxn)
[2021-02-20 07:57:33,032] INFO Socket connection established to INTERNAL_ZK_DNS/INTERNAL_IP, initiating session (org.apache.zookeeper.ClientCnxn)
My understanding here is that (1) b-2 went down, i.e. it was unable to connect to ZooKeeper, (2) messages were produced to b-1 successfully during this time, (3) b-1 was also trying to forward messages to b-2 during this downtime because the replication factor is set to 2, and (4) all these forwarded messages (ProduceRequests) timed out after 600 s.
My question:
Is my understanding correct, and how can I prevent this from happening again?
If I had 3 brokers here, would b-1 have tried to connect to b-3 right away rather than waiting for b-2? Is that a good workaround? (Assuming topic replication factor = 2 everywhere)
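For what it's worth, the relevant timeouts are configurable on the producer. The logs in the question come from librdkafka, where the options are request.timeout.ms and message.timeout.ms; below is a minimal sketch of the equivalent Java producer configuration. The broker addresses, topic name and values are illustrative only, and the SASL_SSL settings are omitted.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ResilientProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "b-1.example:9096,b-2.example:9096"); // illustrative
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");                   // wait for the in-sync replicas (respects min.insync.replicas)
        props.put("request.timeout.ms", "30000");   // how long a single request may stay in flight
        props.put("delivery.timeout.ms", "120000"); // overall bound incl. retries (librdkafka: message.timeout.ms)
        props.put("retries", Integer.MAX_VALUE);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    System.err.println("Failed to deliver message: " + exception.getMessage());
                }
            });
            producer.flush();
        }
    }
}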

GCP IoT stops processing messages

We have multiple devices (rpi2) connected to IoT Core with our self-engineered "firmware" that uses Python and the paho-mqtt client. Our steps to reproduce the error:
Log into Google Cloud
Set device to DEBUG logging level.
Attempt to send a command using Google's IoT website (e.g., a remote firmware update or anything else)
Observe immediate error responses (two cases below) from the Google control panel.
Observe the logs on the device; no record of contact from GCP.
And the two cases are:
Message couldn’t be sent because the device is not connected
This is true and not true, because we are seeing that our device gets disconnected from time to time without any particular reason. We do have a JWT token refreshing mechanism and it's working fine, but sometimes GCP just... disconnects the device!
The command couldn't be sent because the device is not subscribed to the MQTT wildcard topic.
This is not true, all devices are subscribed (but if they are not connected then maybe they are not subscribed...)
Here are the things that we already checked:
1. The JWTs are refreshed 10 minutes before the one-hour expiry (the expiry is set to 3600 seconds, but we refresh every 3000 seconds); see the sketch after this list.
2. Explicitly set MQTTv311 as the protocol to speak in the Python paho-mqtt client.
3. Implemented an on_log() handler.
4. Implemented PINGREQ as the IoT Core documentation states.
5. Checked the devices' internet connection and it's just fine.
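A minimal Java Paho sketch of what items 1, 2 and 4 above amount to. The real firmware is Python paho-mqtt, so this is only an illustration; createJwt() is a hypothetical placeholder for the device's RS256 token signing, and the intervals are the ones stated above.
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttException;

public class IotCoreReconnector {
    // Hypothetical helper: would sign {aud: projectId, iat: now, exp: now + 3600 s} with the device key.
    static String createJwt(String projectId) {
        throw new UnsupportedOperationException("replace with real RS256 JWT creation");
    }

    public static void main(String[] args) throws MqttException, InterruptedException {
        String clientId = "projects/XXX/locations/us-central1/registries/iot-registry/devices/XXX-000003";
        String commandsTopic = "/devices/XXX-000003/commands/#";

        while (true) {
            MqttClient client = new MqttClient("ssl://mqtt.googleapis.com:8883", clientId);
            MqttConnectOptions options = new MqttConnectOptions();
            options.setMqttVersion(MqttConnectOptions.MQTT_VERSION_3_1_1); // item 2
            options.setKeepAliveInterval(60);                              // regular PINGREQs (item 4)
            options.setUserName("unused");                                 // ignored by the MQTT bridge
            options.setPassword(createJwt("XXX").toCharArray());           // fresh JWT per connection (item 1)

            client.connect(options);
            client.subscribe(commandsTopic, 1);                            // wildcard commands subscription

            Thread.sleep(3000 * 1000L);  // reconnect after 3000 s, well before the 3600 s expiry
            client.disconnect();
            client.close();
        }
    }
}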
Here are some logs:
From device:
[2020-04-05 21:20:58,624] root - INFO - Trying to set date and time to: 2020-04-05 21:20:58.624278
[2020-04-05 21:20:59,239] root - INFO - Starting collecting device state
[2020-04-05 21:20:59,256] root - INFO - Connecting to the cloud
[2020-04-05 21:20:59,262] root - INFO - Device client_id is 'projects/XXX/locations/us-central1/registries/iot-registry/devices/XXX-000003'
[2020-04-05 21:20:59,549] root - INFO - Starting pusher service
[2020-04-05 21:21:00,050] root - INFO - on_connect: Connection Accepted.
[2020-04-05 21:21:00,051] root - INFO - Subscribing to /devices/XXX-000003/commands/#
[2020-04-05 21:22:04,827] root - INFO - Since 2020-04-05T21:20:59.549934, pusher sent: 469 messages, received confirmation for 469 messages, recorded 0 errors
[2020-04-05 21:23:07,426] root - INFO - Since 2020-04-05T21:22:04.828064, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
[2020-04-05 21:24:09,815] root - INFO - Since 2020-04-05T21:23:07.427720, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
[2020-04-05 21:25:12,036] root - INFO - Since 2020-04-05T21:24:09.816221, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
[2020-04-05 21:26:14,099] root - INFO - Since 2020-04-05T21:25:12.037507, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
[2020-04-05 21:27:16,052] root - INFO - Since 2020-04-05T21:26:14.100430, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
[2020-04-05 21:28:17,807] root - INFO - Since 2020-04-05T21:27:16.053253, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
[2020-04-05 21:29:19,416] root - INFO - Since 2020-04-05T21:28:17.808075, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
From GCP:
[GCP logs screenshot]

Kubernetes node fails (CoreOS/AWS/Kubernetes stack)

We have a small testing Kubernetes cluster running on AWS, using CoreOS, as per the instructions here. Currently this consists of only a master and a worker node. In the couple of weeks we've been running this cluster, we've noticed that the worker instance occasionally fails. The first time this happened, the instance was subsequently killed and restarted by the auto-scaling group it is in. Today the same thing happened, but we were able to log in to the instance before it was shut down and retrieve some information; however, it remains unclear to me exactly what caused this problem.
The node failure seems to happen on an irregular basis, and there is no evidence that there is anything abnormal happening which would precipitate this (external load etc).
Subsequent to the failure (Kubernetes node status NotReady), the instance was still running, but had inactive kubelet and docker services (start failed with result 'dependency'). The flanneld service was running, but with a restart time after the time the node failure was seen.
Logs from around the time of the node failure don't seem to show anything clearly pointing to a cause of the failure. There's a couple of kubelet-wrapper errors at about the time the failure was seen:
`Jul 22 07:25:33 ip-10-0-0-92.ec2.internal kubelet-wrapper[1204]: E0722 07:25:33.121506 1204 kubelet.go:2745] Error updating node status, will retry: nodes "ip-10-0-0-92.ec2.internal" cannot be updated: the object has been modified; please apply your changes to the latest version and try again`
`Jul 22 07:25:34 ip-10-0-0-92.ec2.internal kubelet-wrapper[1204]: E0722 07:25:34.557047 1204 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"ip-10-0-0-92.ec2.internal.1462693ef85b56d8", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"4687622", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-0-92.ec2.internal", UID:"ip-10-0-0-92.ec2.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeHasSufficientDisk", Message:"Node ip-10-0-0-92.ec2.internal status is now: NodeHasSufficientDisk", Source:api.EventSource{Component:"kubelet", Host:"ip-10-0-0-92.ec2.internal"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63604448947, nsec:0, loc:(*time.Location)(0x3b1a5c0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63604769134, nsec:388015022, loc:(*time.Location)(0x3b1a5c0)}}, Count:2, Type:"Normal"}': 'events "ip-10-0-0-92.ec2.internal.1462693ef85b56d8" not found' (will not retry!)
Jul 22 07:25:34 ip-10-0-0-92.ec2.internal kubelet-wrapper[1204]: E0722 07:25:34.560636 1204 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"ip-10-0-0-92.ec2.internal.14626941554cc358", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"4687645", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-0-92.ec2.internal", UID:"ip-10-0-0-92.ec2.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeReady", Message:"Node ip-10-0-0-92.ec2.internal status is now: NodeReady", Source:api.EventSource{Component:"kubelet", Host:"ip-10-0-0-92.ec2.internal"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63604448957, nsec:0, loc:(*time.Location)(0x3b1a5c0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63604769134, nsec:388022975, loc:(*time.Location)(0x3b1a5c0)}}, Count:2, Type:"Normal"}': 'events "ip-10-0-0-92.ec2.internal.14626941554cc358" not found' (will not retry!)`
followed by what looks like some etcd errors:
`Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [WARNING][1305/140149086452400] calico.etcddriver.driver 810: etcd watch returned bad HTTP status topoll on index 5237916: 400
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [ERROR][1305/140149086452400] calico.etcddriver.driver 852: Error from etcd for index 5237916: {u'errorCode': 401, u'index': 5239005, u'message': u'The event in requested index is outdated and cleared', u'cause': u'the requested history has been cleared [5238006/5237916]'}; triggering a resync.
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [INFO][1305/140149086452400] calico.etcddriver.driver 916: STAT: Final watcher etcd response time: 0 in 630.6s (0.000/s) min=0.000ms mean=0.000ms max=0.000ms
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [INFO][1305/140149086452400] calico.etcddriver.driver 916: STAT: Final watcher processing time: 7 in 630.6s (0.011/s) min=90066.312ms mean=90078.569ms max=90092.505ms
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [INFO][1305/140149086452400] calico.etcddriver.driver 919: Watcher thread finished. Signalled to resync thread. Was at index 5237916. Queue length is 1.
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,743 [WARNING][1305/140149192694448] calico.etcddriver.driver 291: Watcher died; resyncing.`
and a few minutes later a large number of failed connections to the master (10.0.0.50):
`Jul 22 07:36:41 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:36:37,641 [WARNING][1305/140149086452400] urllib3.connectionpool 647: Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7700b85b90>: Failed to establish a new connection: [Errno 113] Host is unreachable',)': http://10.0.0.50:2379/v2/keys/calico/v1?waitIndex=5239006&recursive=true&wait=true
Jul 22 07:36:41 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:36:37,641 [INFO][1305/140149086452400] urllib3.connectionpool 213: Starting new HTTP connection (2): 10.0.0.50`
Although these errors are presumably related to the node/instance failure, they don't really mean a lot to me and certainly don't seem to point to the underlying cause - but if anyone can see anything here that suggests a possible cause of the node/instance failure (and how we can go about rectifying it), that would be greatly appreciated!
Something in your description and logs confuses me: you said you use the Docker runtime, yet there is rkt in your log; you said you use flannel in your cluster, yet there is Calico in your log...
Anyway, from the logs you provide, it looks more like your etcd is down, which means kubelet and Calico can't update their state, and the apiserver will regard them as down. There is not enough information here; I can only suggest that you back up etcd's logs the next time you see this...
Another suggestion: it is better not to use the same etcd for both the Kubernetes cluster and Calico...

OpenStack HAproxy issues

I am using openshift-django17 to bootstrap my application on OpenShift. Before I moved to Django 1.7, I was using the author's previous repository, openshift-django16, and I did not have the problem which I will describe next. After running successfully for approximately 6 hours I get the following error:
Service Temporarily Unavailable
The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
After I restart the application it works without any problem for some hours, then I get this error again. Now the gears should never enter idle mode, as I am posting some data every 5 minutes through a RESTful POST API from outside the app. I have run the rhc tail command and I think the error lies in HAProxy:
==> app-root/logs/haproxy.log <==
[WARNING] 081/155915 (497777) : config : log format ignored for proxy 'express' since it has no log address.
[WARNING] 081/155915 (497777) : Server express/local-gear is DOWN, reason: Layer 4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 081/155915 (497777) : proxy 'express' has no server available!
[WARNING] 081/155948 (497777) : Server express/local-gear is UP, reason: Layer7 check passed, code: 200, info: "HTTP status check returned code 200", check duration: 11ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 081/170359 (127633) : config : log format ignored for proxy 'stats' since it has no log address.
[WARNING] 081/170359 (127633) : config : log format ignored for proxy 'express' since it has no log address.
[WARNING] 081/170359 (497777) : Stopping proxy stats in 0 ms.
[WARNING] 081/170359 (497777) : Stopping proxy express in 0 ms.
[WARNING] 081/170359 (497777) : Proxy stats stopped (FE: 1 conns, BE: 0 conns).
[WARNING] 081/170359 (497777) : Proxy express stopped (FE: 206 conns, BE: 312 conns).
I also run a CRON job once a day, but I am 99% sure it has nothing to do with this. It looks like a problem on the OpenShift side, right? I have posted this issue on the GitHub repository of the author, where he suggested I try Stack Overflow.
It turned out this was due to a bug in openshift-django17 that set DEBUG in settings.py to True even though it was specified in the environment variables as False (pull request for the fix here). The reason 503 Service Temporarily Unavailable appeared was OpenShift memory limit violations caused by DEBUG being turned ON, as stated in the Django settings documentation for DEBUG:
It is also important to remember that when running with DEBUG turned on, Django will remember every SQL query it executes. This is useful when you’re debugging, but it’ll rapidly consume memory on a production server.